Global Labor Database User Manual
A guide to understanding, using, and interacting with the Global Labor Database
May 15, 2024

Table of Contents

Acknowledgements
1. Introduction to the GLD
  1.1. What is GLD?
  1.2. What is the objective of GLD?
  1.3. Who is the intended audience?
  1.4. What are the principles guiding GLD?
    GLD coverage and expansion
    Transparency and data access
    Data quality and validation
  1.5. Complementarities with similar data efforts
  1.6. Sustainability
2. GLD content, storage, and access
  2.1. Data and information collected in GLD
  2.2 Data storage platforms and access rules
    The World Bank GLD Server
    Datalibweb
    Microdata Library
    GitHub
    Summary of data and information storage
3. The GLD harmonization methodology
  3.1. Defining the boundaries of GLD harmonization
  3.2. The structure of the harmonization code
  3.3. The GLD data dictionary
    Survey & ID module
    Geography
    Demography
    Migration
    Education
    Training
    Labour
4. Validation and quality checks
  4.1. How is the harmonization validated?
  4.2. Guide to the GLD single survey quality checks
    4.2.3. Block 2 - GLD external checks
  4.3. Guide to the GLD survey series quality checks
5. Using the GLD
  5.1. How to use the harmonized dta file
  5.2. How to use the harmonization code?
  5.3. How to cite the GLD
6. Contributing to GLD’s quality and expansion
  6.1. General rules to collaborate with GLD
  6.2. How to share new raw data with GLD
  6.3. How to share a harmonization for a survey not covered by GLD
  6.4. How to collaborate with GLD on a new harmonization
  6.5. How to correct/ expand an existing harmonization
  6.6. How to correct and expand the Country Survey Details
  6.7. How to correct and expand the GLD tools
  6.8. How to communicate on any other issue

Acknowledgements

This user manual was prepared by Mario Gronert (Economist/Statistician) under the guidance of Maddalena Honorati (Senior Economist, TTL). The manual benefitted from enriching comments by peer reviewers Jose Montes (Data Scientist) and Michael Weber (Senior Economist). The team would like to acknowledge the Jobs Group's leadership, Ian Walker (previous manager) and Federica Saliola (current manager) for the overall guidance of the GLD project throughout its inception and implementation phase. Special thanks to Dino Merotto (Lead Economist) for encouraging and supporting the project, Jose Manuel Romero, Andreas Eberhard, Yann Bourgeois and the entire Jobs Group Team for continuous feedback and valuable comments. Finally, we are grateful to the dedicated team of consultants who have built the GLD through continuous data mining, revisions and quality improvements: Angelo Gabrielle Fernandez Santos, Alexandra Quiñones Nunura, Junying Tong, Tom Mosher, Elena Casanovas Navarro, and David Alejadro Bermudez Vasquez.

1. Introduction to the GLD

1.1. What is GLD?

The Global Labor Database (GLD) is part of the World Bank initiatives to harmonize labor force surveys and household surveys with a relevant labor module.
Its mission is to create an open and transparent harmonization with sufficient background information to allow data analysts to use, alter, and expand the harmonization. In this sense, background information goes beyond code, questionnaires, and reports, and includes documenting survey details learned during harmonization which are not recorded elsewhere. An example of this is documenting changes to the currency or the administrative divisions.

The GLD aims to be an open-source database, meaning that as much information as possible should be accessible to as many people as possible. It also strives to be transparent, making all steps that create the harmonization traceable, from raw data acquisition to harmonized variable coding. Hence, all steps of the harmonization process are documented and made available, including the survey documentation, code and notes that allow users to fully comprehend the survey design and the choices made in the harmonization. The availability of the code and documentation enables users to customize the harmonization and add variables that are not part of it. Most harmonization efforts provide users with a "take it or leave it" option, but the GLD's open and transparent approach allows users to trace and deviate from the standard harmonization at any point, giving them a head start regardless of where they wish to jump in.

Finally, the GLD follows up and expands on the previous initiative to harmonize household surveys, the International Income Distribution Database (I2D2). The I2D2 was superseded by the Global Monitoring Database (GMD), which however focused on household budget surveys and did not harmonize labor force surveys. The GLD was created to close this gap in survey type coverage and to complement the GMD, with a stronger focus on labor market information through an expanded dictionary and more rigorous validation of labor indicators.

1.2. What is the objective of GLD?
Labor force surveys represent a critical data source to generate key labor market indicators disaggregated by individual characteristics that policy makers monitor, target, and evaluate. Appended across time and space, they are used for comparison and benchmarking. The objective of GLD is to make the process of producing these estimates easier, traceable, and reproducible for World Bank staff and researchers worldwide.

One major issue when generating survey-based indicators is cross-country comparability and the time-consuming process of harmonization. Harmonizing requires reading both data files and survey materials in detail to understand what to code and how, a structured and consistent harmonization methodology, as well as many steps of validation. The first objective of GLD is thus to create a database of harmonized surveys with comprehensive and reliable labor market information that can be used in analytical work for comparisons across countries and over time. Because the output is harmonized, the database can be fed into other products that automate analytical processes, like country level jobs diagnostics.

The second objective of GLD is to allow users to go beyond the standard dataset, supporting them in taking their analyses and comparisons further. GLD enables such a customized approach by providing all code and technical reports, as well as documenting all intricacies of the survey discovered during harmonization. Users can thus focus on answering the questions they need answered, not on figuring out in what year an administrative boundary was changed and how the sample size was thus affected.

1.3. Who is the intended audience?
Target users of GLD include researchers, data analysts and practitioners in the international development community, statistical offices, ministries of labor, of economy and planning, and other relevant government agencies that use labor market data to monitor and analyze labor market outcomes and to inform the design of labor policies.

These users can make use of the GLD in two ways. The first use is the “as-is” harmonization. This refers to the user taking the harmonized data files as prepared by the data team and using those variables (or combinations thereof) for their analysis.

The second use is the “amended” or “hacked” harmonization. This refers to the user wanting to go beyond the prepared harmonization. This may be, for example, because they are interested in another specific variable from the survey, present in the questionnaire but not harmonized because it is not common in most surveys. In this case, the user can still use the harmonization do file to standardize most variables (as concepts like education level or labor status are likely still going to be relevant) and add further ones. This use entails editing the harmonization code and/or adding to it at specific points to serve the user's purpose without them needing to process the survey entirely.

1.4. What are the principles guiding GLD?

GLD follows a set of principles to guide its development and maintenance. In this introduction we focus on (a) GLD coverage and expansion, (b) transparency and data access, and (c) data quality and validation.

GLD coverage and expansion

As of April 2024, the GLD holds 345 surveys from 24 countries (1 high-income, 9 upper middle-income, 11 lower middle-income, and 9 low-income countries). Table 1 gives an overview of the countries (by three-letter ISO code), the number of surveys, and the years covered in GLD. Figure 1, below the table, shows the location of the countries in GLD.
Table 1 - GLD coverage by country, number of surveys, and time range

Country   Number of surveys   Range of years
ARM       9                   2014 - 2022
BGD       5                   2005 - 2016
BOL       6                   2015 - 2021
BRA       37                  1981 - 2022
CHL       13                  1990 - 2017
COL       23                  1996 - 2021
EGY       14                  2006 - 2019
ETH       4                   1999 - 2021
GEO       6                   2017 - 2022
IDN       30                  1989 - 2019
IND       15                  1983 - 2022
LKA       23                  1992 - 2021
MEX       16                  2005 - 2020
MNG       17                  2002 - 2022
NPL       3                   1998 - 2017
PAK       15                  1992 - 2020
PHL       23                  1997 - 2019
RWA       5                   2017 - 2021
SLE       1                   2014 - 2014
THA       36                  1985 - 2021
TUR       20                  2000 - 2019
TUN       15                  1997 - 2017
TZA       6                   2000 - 2020
ZAF       13                  2008 - 2020
ZMB       9                   2008 - 2022
ZWE       5                   2011 - 2022

Figure 1 - Map of the world with the GLD countries highlighted
Note: Boundaries shown on this map are not authoritative and should not be considered as an endorsement by The World Bank.

The initial choice of countries was driven by the availability of multiple LFS over time for the same countries. Since then, the GLD team has established selection guidelines to try to balance the GLD country coverage across income groups and regions, and to keep the GLD updated with the latest surveys. Keeping the GLD updated means harmonizing the latest surveys once they become available for each country in the GLD. In general, we view a survey as up to date if it is from the previous four years (e.g., from at least 2020 in 2024). Thus, both between regions and within regions, the choice of surveys to add to GLD should reflect an effort not just to be present at all income levels but to have up-to-date surveys for all.

However, acquiring new surveys is mostly determined by data availability, that is, by whether the National Statistical Office (NSO) of a country permits the sharing of survey data. Hence, if the NSOs of a region share very little of their data, the resulting imbalance in the GLD could not be corrected.
Similarly, if NSOs do not run a labor force survey or run one only seldom, their countries’ GLD entry would be missing or small despite our best efforts.

Transparency and data access

GLD is designed to be as accessible and transparent as possible. Every step of our harmonization should be transparent and traceable. In addition, all the outputs produced by the GLD team (harmonization code and documentation of survey details and of choices made during harmonization) are shared freely with all on GitHub, a web platform for collaborative software development and version control.

Access to the raw and harmonized microdata is restricted on a survey-by-survey basis in accordance with data license regulations. These limitations stem primarily from data privacy mandates issued by the National Statistical Offices, alongside other pertinent considerations. Such restrictions ensure compliance with the legal frameworks governing the confidentiality and usage of sensitive survey data.

GLD data is stored on a server managed by the GLD team. Whenever possible, the team aims to use data sources that can be shared at least with World Bank colleagues. Data is also accessible via datalibweb and the Microdata Library. Currently all GLD surveys are accessible to all World Bank staff except for data from Egypt, where the publisher of the data has requested that raw data (and thus harmonized data) be accessed only via their portal. More details can be found in section 2.2 Data storage platforms and access rules.

Data quality and validation

Central to the GLD objective of being a reliable source for cross-country comparisons and benchmarking is ensuring data are of the highest quality. Only then is it possible to leverage large datasets and use GLD as an input into automated analytical workflows.
To validate the harmonization, the GLD team has three main tools. The first is the validation done with country office colleagues and NSO staff when harmonizing. GLD harmonizers are in touch with relevant colleagues with domain knowledge to understand the survey (knowledge they can then document and share) and ensure their mapping of variables from the raw data to the harmonized variables is sensible.

Once the harmonization is finalized, there are two automated quality check procedures. The first checks the survey for integrity and coherence with external sources (e.g., is the calculated labor force participation in line with what the ILO and the WDI report?). The second checks a series of surveys in a country over time to detect any unexpected jumps in the series.

Finally, via direct exchange with the GLD team or on the online GitHub platform, users can alert the team to issues in the harmonization that have made it through nonetheless. A process of updating the harmonization then kicks in to correct any issues as quickly as possible. For a more detailed description of all quality checks see section 4. Validation and quality checks.

1.5. Complementarities with similar data efforts

Within the World Bank, there are two other harmonization initiatives: (1) the I2D2 harmonization of both LFS and household surveys, which was active for more than a decade and has recently been discontinued; and (2) the Global Monitoring Database (GMD), which harmonizes only household budget surveys used primarily for poverty and inequality analysis. The GMD thus includes variables on household consumption and calculates certain income and consumption aggregates that are not present in GLD. On the other hand, GLD has more detailed labor variables, in particular providing (wherever possible) industry and occupation information using ISIC and ISCO codes to the fullest depth possible. As both use a common set of variables, both can be used as inputs to automated analytical tools.
This is the case, for example, for the Jobs Indicators database (JOIN), which reads in data from GMD, GLD, and I2D2 to create country indicators. Moreover, the GLD offers a larger set of migration variables in its data dictionary than the GMD.

Finally, in trying to strike a balance between harmonization and the information in the raw data, the GLD also stores more “original” variables from the raw data than GMD (or I2D2). That is, there are more variables that directly provide users with the information as it is found in the survey. For example, in addition to `occup` and `occup_isco` (occupation information by 10 main categories or by ISCO code), `occup_orig` contains the occupation information found in the survey. Thus, users can deviate from our codes without needing to redo the harmonization, and can directly evaluate how the mapping from raw to harmonized variable was done.

Similarly, the ILO harmonizes LFSs to generate indicators published on the ILOSTAT data platform, though, unlike in GLD, the underlying harmonized microdata and the related code are not made public. The ILOSTAT platform provides users with an extensive set of indicators, not just on labor markets but also on other socioeconomic and sociodemographic information, for more than 180 countries accessible in the ILO Survey Catalogue. However, the microdata are only accessible to ILO staff. Moreover, no harmonization code is made available, so what the user can access are indicators at the national level. Hence, while ILOSTAT is a great resource to obtain country level indicators for the most common topics, GLD can complement it by allowing users to go deeper and leverage the full microdata, running regressions or calculating indicators at sub-national level.

1.6. Sustainability

Maintaining the level of detail provided by the GLD is a significant undertaking that requires a serious investment of resources.
The World Bank is the right organization to house such an effort due to the externalities generated by a public effort to create accessible data. To keep the costs of the effort low and ensure the sustainability of the project as it scales and requires more management, two things are necessary: 1) a strong collaboration with regional data teams and 2) the creation of a community of users on GitHub, initially among World Bank staff, with the objective of expanding to all users through a collaborative yet still curated approach.

2. GLD content, storage, and access

This section discusses the kinds of information that can be found on GLD, the rules that govern each type, and the formats they are in.

2.1. Data and information collected in GLD

Raw microdata

The raw microdata are the individual level data as received or downloaded by the GLD Team. They may come directly from the National Statistical Offices (NSO), from an aggregator (like the African Development Bank Microdata Catalog or the World Bank Microdata Catalog), or from colleagues. These sources are collectively referred to as the data sharers in this document.

The raw microdata reflect the original state of the data, and their variables should mirror the questionnaire. In cases where no better alternative is found, the raw data the GLD team starts with may have already been processed by either the NSO or an intermediary. If this is the case, this should be described and noted in the background information (described under Survey and other documentation in this section).

The data sharer determines the data privacy rights of the data and thus the access. Data access rights can be broadly categorized into three buckets. The first bucket is public domain data. This data is freely accessible to all users. This would be the case for data that is directly downloadable from the NSO website. The second bucket is World Bank official use data.
This refers to data that can be used by and shared with World Bank colleagues without any restriction but cannot be shared outside the organization. The third bucket is limited release data. This describes data that are part of GLD yet cannot be shared freely. Access needs to be decided on a case-by-case basis and often requires requesting permission from the data sharer.

Raw microdata is taken in by the GLD Team in whatever format the data sharer chooses (e.g., CSV, TXT, Stata, SPSS, …). If the raw microdata is not in Stata .dta format, the GLD team will convert it to that format for further use. The code to do so is an example of the “other code” GLD produces (that is, GLD code that is not harmonization code; see the Other code section for more information).

Harmonized microdata

The harmonized microdata is the output of the harmonization process. It is individual level information where every row represents an individual. The GLD data dictionary variables are contained in the columns. The harmonized microdata will only contain those variables for which there are answers. That is, if a variable cannot be coded and thus would be missing for all individuals in the data, it is dropped to keep the file size as small as possible.

The access rights are inherited from the raw microdata. Whatever the rights are for the raw microdata will be applied to the harmonized data.

The harmonized microdata is produced as a Stata .dta file.

Harmonization code

The harmonization code is the set of instructions used to convert the raw microdata into the harmonized output, together with the comments explaining, wherever necessary, the rationale for the specific coding instructions given. For a description of how to read the code please see section The structure of the harmonization code; for the definition of individual variables please see section The GLD data dictionary.
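The rule for harmonized microdata described above (a variable that is missing for every individual is dropped from the output file) can be illustrated with a minimal Python sketch. This is not the GLD team's implementation, which is written in Stata; the function name `drop_all_missing` and the variable `migrated_example` are hypothetical, and `None` stands in for a missing value.

```python
from typing import Any


def drop_all_missing(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop every variable (column) that is missing for all individuals (rows)."""
    # Collect every variable name that appears in any row.
    variables = {name for row in rows for name in row}
    # Keep a variable only if at least one individual has a non-missing value.
    keep = {
        name
        for name in variables
        if any(row.get(name) is not None for row in rows)
    }
    return [
        {name: value for name, value in row.items() if name in keep}
        for row in rows
    ]


# Illustrative data: one row per individual, None marks a missing answer.
rows = [
    {"age": 31, "occup": 2, "migrated_example": None},
    {"age": 45, "occup": None, "migrated_example": None},
]

# "migrated_example" is missing for everyone, so it is dropped;
# "occup" stays because at least one individual has an answer.
harmonized = drop_all_missing(rows)
print(harmonized)
```

Note that a variable with answers for only some individuals is kept in full, including its missing entries; only variables with no answers at all are removed.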
The harmonization code is created by the GLD team and shared openly and free of charge under the provisions of the MIT License (see our License details here).

The harmonization code is written as a Stata .do file.

Other code

In addition to the harmonization code, which makes up the bulk of the GLD code, various other code is written and made available. This code can be categorized into four types:

• Raw Microdata Conversion Code: Code written to convert raw microdata that is not in Stata .dta format into this format.
• GLD Quality Check Code: Code responsible for conducting quality checks on GLD data. For more details on these quality checks, refer to the Validation and quality checks section.
• Code Templates: Predefined code structures or templates to facilitate coding tasks within the GLD framework.
• GLD Ecosystem Tools Code: Code that builds the "GLD Ecosystem Tools," which are small software programs designed to address tasks commonly performed by GLD users on the GLD data. An example is a tool supporting the conversion of ISIC and ISCO codes (classifications for industries and occupations) across different revisions of the classifications. All GLD tools can be found here.

All code in each category, like the harmonization code, is shared under an MIT License free of charge. The code may be written in any programming language. Please refer to the individual tools to determine the software used. Additionally, requests for modifications to existing tools or suggestions for new tools can be made (see the How to correct and expand the GLD tools section).

Survey and other documentation

The survey documentation refers to all documents published by the NSO (or other institution running the survey) that serve to understand the survey. This includes but is not limited to the questionnaire, intermediate and final reports, as well as enumerator manuals. The survey documentation is treated as public information and thus shared freely.
It is stored and distributed in whatever format the GLD team received it (most commonly a PDF or a spreadsheet format).

Background or contextual information refers to the information the GLD team has acquired in the process of harmonizing that is not (or not directly) in the survey documentation and cannot be embedded into the comments in the harmonization code itself. A common case is the change of the employment definition. In its 19th session the International Conference of Labour Statisticians (ICLS), under the auspices of the International Labour Organization (ILO), changed the definition (more information here). As a result, work for own consumption (for example subsistence agriculture) was no longer considered employment.

This change has been implemented differently at different times by different countries. Per the rules of the GLD harmonization, each survey is harmonized using the definition of employment that was applied in that survey, and definitions are not unified (see the Defining the boundaries of GLD harmonization section for more details). To help users understand these changes and their implications, and, if possible, align definitions with earlier ones, the GLD team prepares documentation to detail the changes, show the relevant questionnaire passages, and propose alternative coding.

Information like this is collectively referred to as Country Survey Details. The Country Survey Details are text files that are accessible to all and may be freely shared. You may see them here online.

2.2 Data storage platforms and access rules

This section details which GLD information type is stored where and how it is organized. To see what types of information are stored on GLD please see the Data and information collected in GLD section.

The World Bank GLD Server

The raw microdata, harmonization code, harmonized output, and documentation are stored on a dedicated server.
It is organized following the file and folder naming convention of the World Bank Microdata Library (accessible here). The image below exemplifies the structure with the case of Armenia. Within GLD the first level is the country, identified by the country’s three-letter code (number 1 in Figure 2, below). The surveys in a country follow the structure CCC_YYYY_[SurveyName], where CCC is the country code, YYYY is the year the survey started, and SurveyName is the name or acronym of the survey. In this case the Armenian Labor Force Survey from 2014 is ARM_2014_LFS (number 2 in the image).

Figure 2 - Example of GLD Server folder structure

Inside the survey folder are the master data and the harmonized data folders. Both begin with the same name as the folder above in the nested logic, but the master folder only adds the vintage of the master data (here V01_M, number 3 in the image, as this is the first and so far only version of the raw data we have). Master data may be updated if the NSO publishes, for example, a revision to the data. The harmonized folder (number 4 in the image) starts like the master data folder it harmonizes from (see green arrow in the image above) but adds the vintage of the harmonization (red arrow). It also adds the initials of the collection we are harmonizing towards (GLD).

Inside the master data folder there are three folders: Data, Doc, and Programs. Data itself is divided into Original and Stata. The former contains all files as downloaded if the original download is not in Stata format (i.e., not a .dta file); the latter contains the Stata survey microdata. Any code changing the raw data is stored in the Programs folder. For example, code converting raw data from other formats to .dta (that is, reading from Data/Original, converting, and storing the result in Data/Stata) would be stored under Programs. The last folder, Doc, contains all further documentation that is needed to work on the survey.
It should be divided into two further folders: Questionnaires, containing the questionnaires and all other documents necessary to understand the questionnaire and its flow; and Technical, containing all other technical information (e.g., reports, national occupation classifications, etc.).

As a best practice, it is advised to leave in the Doc folder a small Readme file (commonly titled “Where is this data from – ReadMe.txt”) to give information about where the source material is from. This is important for future colleagues, so they can trace information and establish the access policy.

The structure for each of the harmonization folders is roughly the same, only with added version numbering and collection name (GLD). The Data/Harmonized folder shall contain the harmonized output in .dta form (in this case ARM_2014_LFS_V01_M_V01_A_GLD.dta). The Data/Additional Data folder is optional and contains data that is not in the raw data but is needed to create the harmonization. For example, if the conversion of the national industry classification to the international version is done by merging in an extra file, this file would be placed under Data/Additional Data. If no such files were used the folder need not exist.

The harmonization code (i.e., the code that takes the Data/Stata input from the master folder system and saves output in Data/Harmonized) is stored in the Programs folder (and in this case would be ARM_2014_LFS_V01_M_V01_A_GLD_ALL.do).

The Doc folder contains any other documentation necessary to describe and understand the survey. Note this is the same content as in ARM_2014_LFS_V01_M/Doc in the example above. The content should be in both places at the same time; we believe this duplication is a small price to pay for making files easier for the user to find.

The Work folder contains any output created during the harmonization that is not the final harmonization.
For example, if you needed to create a subfile of the survey containing only households from a certain region for inspection or any other process you may need during your work, these outputs should be stored here. Data/Harmonized should only contain finalized files; here you may store any intermediate results.

The GLD server is restricted to members of the GLD team, and access cannot be granted to others except in exceptional circumstances. To allow World Bank staff members access to the server structure, the GLD team has created a copy of the GLD server, called the GLD WB Staff Server, containing only the subset of the GLD surveys that can be shared.

The GLD WB Staff Server is a subset in two ways. Firstly, it only contains surveys whose raw microdata can be freely shared with World Bank colleagues. Secondly, it only contains the latest harmonized version. For example, while for the 2020 Indian Periodic Labour Force Survey (PLFS), GLD contains the raw data folder (IND_2020_PLFS_V01_M) and four vintages (IND_2020_PLFS_V01_M_V01_A_GLD to IND_2020_PLFS_V01_M_V04_A_GLD), the GLD WB Staff Server only contains the master and IND_2020_PLFS_V01_M_V04_A_GLD. This reduces space on the server and ensures users are using the latest files. If a user ran code calling the IND_2020_PLFS_V01_M_V0*3*_A_GLD files from the GLD WB Staff Server, the code would stop with an error, forcing them to update to the latest vintage. Other than these two differences, the GLD WB Staff Server is organized like the full GLD server.

Access to the GLD WB Staff Server can be requested from, and is granted by, the GLD Focal Point (please reach out to gld@worldbank.org). Access is available to any staff member with an active World Bank email address and access to a World Bank laptop or Virtual Desktop. Once a user has mapped the server (see instructions here on mapping) they do not need to take further steps as the GLD team updates the GLD WB Staff Server.
What is present on the server should always represent the latest vintage of any harmonization available.

Datalibweb

datalibweb is an application programming interface (API) with two components: a website (internal to the World Bank) for data exploration and requests, and a set of API endpoints to securely access granted microdata. Currently, the API endpoints are integrated with Stata through the datalibweb Stata package. For more details and information, please visit the datalibweb GitHub repository or, from the World Bank intranet, type “datalibweb/” into your browser.

To integrate with datalibweb, GLD has granted its API access to the full GLD server. It thus contains the same information types (raw and harmonized microdata, harmonization codes, and documentation) and can monitor GLD for updates. The datalibweb interface allows users to see what surveys are available either via a click-and-select menu system that appears upon entering “datalibweb” into the Stata command line or programmatically (using the Stata package’s syntax). For details on how to use the Stata package please see the GitHub page; for questions on datalibweb kindly visit the intranet site (datalibweb/) or reach out to the team (datalibweb@worldbank.org). For details on the syntax and default behavior of datalibweb (e.g., unless requested otherwise it will load the latest vintage of a harmonization) see the help file by executing “help datalibweb” in the command line.

Access to datalibweb is controlled via the datalibweb intranet site. Uploaders to datalibweb (people who provide data to datalibweb, like the GLD Focal Point) can set data to public (accessible to all with access to datalibweb) or private. If the survey is set to private, the survey information will appear as “Not Subscribed” (see Figure 3 below for Egypt).
Figure 3 - Example of EGY GLD surveys accessed through datalibweb

In such cases users can navigate to the datalibweb intranet site (accessible by entering “datalibweb/” into their browser) and request access to the survey. This triggers an email to the data owner, who can approve or deny the request.

Note that datalibweb may automatically set surveys that are public to expire. This is the case for the 1983 EUS shown below (Figure 4), for example.

Figure 4 - Example of IND GLD surveys accessed through datalibweb

This is because datalibweb automatically expires data access after a certain time. However, since the data is still declared as public, it can be accessed if requested directly. In the case of the latest vintage of the 1983 survey this would be:

    datalibweb, country(IND) year(1983) type(GLD) vermast(01) veralt(07) survey(EUS) module(ALL) clear

This command grants access to the data via datalibweb despite the expiration.

Microdata Library

The Microdata Library (MDL) is the World Bank internal microdata catalog. Note that there is an external-facing website of the same name (its internet presence is microdata.worldbank.org as opposed to microdatalib.worldbank.org). In this manual we refer only to the internal site.

The Microdata Library contains both raw and harmonized surveys. Each type of data is treated as a separate entry in the MDL. Each survey entry in turn contains additional resources, which most often consist of the survey documentation, but may include other documents, like a harmonization code file.

Users do not need to log in to browse the MDL, but logging in is necessary to download or request data (more details on how to access data on MDL at the end of this section). The image below shows the homepage of the MDL and where to log in (red box in Figure 5 below).
Figure 5 - Header of the Microdata Library

The Microdata Library Catalog itself can be navigated either by searching for keywords (green box in Figure 6 below) or by using the filters on the right-hand side, like the year (red box) or the country (yellow box).

Figure 6 - Navigating the Microdata Library catalog

GLD data can be found directly via the purpose-built GLD collection. To look for different harmonization collections users need to click on the option “Collections” at the top of the page (red box in Figure 7 below).

Figure 7 - Accessing collections on the Microdata Library

Once on the collections site, users need to scroll down to the “Specialized Collections” header, where GLD is present (see green box in Figure 8 below).

Figure 8 - Screenshot of collections available on the Microdata Library

Finding surveys within the collection works the same way as finding surveys on the MDL in general, by using the search box or the filter panes explained above.

Access to surveys on the Microdata Library (MDL) is regulated by the data licenses. Figure 9, below, shows the license options users can filter for when searching for surveys.

Figure 9 - License options on MDL

Depending on the license type, users will be able to download the data directly, be asked to fill in a form, or be barred from access. All processes are internal to the MDL. Forms are sent to the MDL team and reviewed by them.

GitHub

Before delving into what information is stored on GitHub we first introduce the reader to GitHub. If you are already familiar with GitHub please feel free to skip the next four paragraphs.

GitHub is a web-based platform that allows developers and programmers to store, manage, and collaborate on software projects. It is a central hub where people can upload, download, and work on code files, as well as track changes, report issues, and discuss ideas. GitHub is built on the Git version control system, which lets multiple people work on the same project simultaneously without overwriting each other's work.
The core feature of GitHub is the "repository," which is essentially an online folder that contains all the files and revision history for a particular software project. Users can create their own repositories to store their code, or contribute to existing repositories created by others. GitHub provides a user-friendly web interface, as well as desktop and mobile apps, to make it easy for people to access and manage their repositories.

When you interact with a GitHub repository, you may "clone" it, which creates a local copy on your own computer. From there, you can make changes, add new files, and commit those changes back to the central repository. GitHub also supports "branching," which allows multiple versions of a project to be developed in parallel. Users can then submit "pull requests" to propose incorporating their changes into the main branch of the project.

Beyond just code storage and version control, GitHub enables collaboration by allowing users to report bugs, suggest improvements, and discuss the direction of a project through the repository's online tools. This collaborative nature has made GitHub an essential platform for open-source software development, as well as a valuable resource for teams working on a project concurrently.

The GLD team does not store any microdata on GitHub. What it does hold there is the full set of harmonization codes for all GLD surveys, the templates, and the Country Survey Details, as well as the validation’s quality check codes and the GLD ecosystem tools. The code to convert raw microdata received in non-Stata formats to .dta is not kept on GitHub either.

On the landing page (shown below in Figure 10) there are two relevant folders that lead to the information: GLD (red box) and Support folder (green box).
The former contains the harmonization codes for the GLD surveys while the latter contains all additional information to understand and leverage both the GLD surveys and the GLD ecosystem (quality checks, Country Survey Details, Tools and GLD documentation like the data dictionary). The other folders contain information to make the repository work and are not further discussed here.

Figure 10 - GLD GitHub repository landing page

The GLD folder follows the logic described above for the GLD Server, with a `CCC/CCC_YYYY_SurveyName/CCC …` structure. The only difference is that it only contains the harmonized folder (i.e., `_V##_M_V##_A_GLD`) and inside this folder there is only the Programs folder.

The Support folder has six subfolders as shown below (Figure 11). Subfolder A – Guides and Documentation contains individual documents that explain the functioning and rules of GLD. In part they are constituents of this manual (i.e., the information in them is also here) but are kept separate for easier sharing. For example, the next section will explain all the variables in GLD, but the data dictionary is available there as a standalone document for ease of review and sharing. In part these are documents referenced in this manual (e.g., the World Bank Micro Data Library folder and file naming convention).

Figure 11 - Structure of the GLD GitHub repository's Support folder

The subfolder B – Country Survey Details contains the metainformation on surveys that cannot be coded into the harmonization or the harmonization code, or that needs more explaining to be understood. The structure of the folder starts with the three-letter code of the country, within which are the surveys, listed by their name or initials (e.g., LFS, ENE, SAKERNAS). Inside each survey folder there should always be at least one element: the introduction file. All other files and folders are optional and depend on the situation.
The example below (Figure 12) shows that the structure starts, like the GLD folder, with the three-letter code (green box). Inside of it are all the surveys (LFS and QLFS in this case, red boxes). Each survey will have a Utilities folder (yellow box) where images or documents referenced in the CSD are stored. All surveys have an introductory text (purple box) followed by additional texts that expand on the issues from the introduction (if necessary).

Figure 12 - Structure of the B - Country Survey Details folder

The introduction file starts with “1.” as GitHub orders files alphanumerically and this file should always be at the top. It should always be titled Introduction to CCC (or country name) Survey. It contains the basic information every user should read before starting work on a survey. The first part is standardized and is the same for all surveys; the template for this document is introduced further down. The standardized part informs the user of what the survey is, where the information is from (and if public, where to get it), what the sampling procedure was, and to what geographic level the information is statistically significant. The latter part, called “Other noteworthy aspects”, allows the harmonizers to expand on non-standard issues that are of relevance to that survey in that specific country.

If the introduction contains images or references some document, these are stored in the *utilities* folder, a sort of catch-all for the survey. Similarly, if a topic requires more in-depth discussion, it will be referenced in the introduction, but then discussed in detail in a separate file (e.g., how to deal with a break in the definition of employment over a series). Any images or documents referenced there should also be stored in the utilities folder.

The Support subfolder “C – Templates” contains the GLD templates. Currently, there are three.
These are the harmonization code template, the code to create new GLD structured folders, and the template for the CSD introduction. These should allow new harmonizers to create new harmonizations using the same structures.

The subfolder “D – Q Checks” contains the GLD quality checks. There are two types of checks: those run on each individual survey (e.g., the 2019 Indian PLFS) and those run, once all individual surveys are ready, on a series of years of a survey (e.g., years 2017 to 2022 of the Indian PLFS). The former are in the *Single survey checks* folder, while the latter are in the *Survey series checks* folder. The details of how the checks work are contained in the [Validation section](../GLD Manual/GLD_Manual_Validation.md).

The subfolder “E – Community Guidelines” contains the guidelines to interact with the GLD team on GitHub. This refers to the conduct we expect from users as well as checklists to ensure any action is as impactful as possible. Further details of how to interact with the team are contained in the Contributing to GLD’s quality and expansion section.

The GLD GitHub repository is an open access repository that can be used by any person without restrictions. Users can also contribute to the GLD. Contributions are vetted and reviewed before being included in the GLD.

Summary of data and information storage

Table 2 below provides an overview of what information is stored where and how accessible the information is. The rows refer to the GLD server, the GLD public server, datalibweb, the World Bank Microdata Library, and the GLD GitHub repository. The columns refer to the different information types. A dark-green colored cell means the information is directly accessible and comprehensive of the surveys on GLD. A light-green colored cell means the information is directly accessible but not comprehensive of GLD (i.e., it represents a subset).
A yellow-colored cell refers to information that may require requesting access but should (if properly updated in the case of the MDL) be comprehensive of the GLD. Lastly, grey-colored cells mean that the information type is not available from that source (e.g., there are no microdata on GitHub).

Table 2 - Overview of data storage platforms
Columns: Raw Microdata, Harm. Microdata, Documentation, Harm. Code, Other Code, CSD
Rows: Server, Public S, DLW, MDL, GitHub

3. The GLD harmonization methodology

For the purposes of this section, harmonization refers only to the act of converting the raw microdata into the variables of the GLD data dictionary.

3.1. Defining the boundaries of GLD harmonization

The scope of the GLD harmonization is the data dictionary. However, the information in the survey does not always fit the concept in the data dictionary. For example, if the questionnaire asks respondents whether the respondent or their employer “contribute to social or private security”, is that sufficient to code the socialsec variable? Similarly, with a view to the objective of comparison and benchmarking, how should the harmonization handle changes to concepts over time, like changes to the definition of employment or administrative areas?

To define the boundaries of the harmonization, the GLD team follows two principles: (i) each survey is harmonized independently; and (ii) in unclear situations, users are empowered to take informed decisions, rather than having the GLD team make choices for users.

The principle that surveys are harmonized independently means that each survey is harmonized based on the standards and realities present at the time of its collection and processing. Changes introduced in subsequent surveys are not applied retrospectively. For example, if new geographical regions or occupation classifications are introduced, previous surveys will reflect the older versions without any retroactive adjustments.
The principle to allow users to make informed decisions means to provide comprehensive information about changes and document their impact on the data (for example, the updated occupation classifications in Pakistan and changes in the definition of employment in Tanzania). However, the GLD harmonization takes a conservative approach, making minimal assumptions and deferring significant decisions to users. This ensures data accuracy and transparency, allowing users to make well-informed choices based on their research objectives. Unifying elements that have changed over time across multiple surveys or taking leaps on questions that fit the data dictionary only partially is outside the scope of the harmonization. Users are provided with the necessary context and information to bridge these gaps according to their requirements. 3.2. The structure of the harmonization code The GLD harmonization code template begins with a header or preamble summarising key survey aspects (see Box 1, next page). It contains four blocks: 1. Information on the code, the author, and the creation date. 2. Details on the survey context 3. Details on the versions of the standard classifications used 4. 
Version control history, detailing the date and the contents of any changes performed

Box 1 - GLD Harmonization Template Preamble

/*%%=================================================================
    0: GLD Harmonization Preamble
=================================================================%%*/

/* -----------------------------------------------------------------
<_Program name_>          [Name of your do file] </_Program name_>
<_Application_>           [Name of your software (STATA) and version] </_Application_>
<_Author(s)_>             [Name(s) of author(s)] </_Author(s)_>
<_Date created_>          YYYY-MM-DD </_Date created_>
------------------------------------------------------------------
<_Country_>               [Country_Name (CCC)] </_Country_>
<_Survey Title_>          [SurveyName] </_Survey Title_>
<_Survey Year_>           [Year of start of the survey] </_Survey Year_>
<_Study ID_>              [Microdata Library ID if present] </_Study ID_>
<_Data collection from_>  [MM/YYYY] </_Data collection from_>
<_Data collection to_>    [MM/YYYY] </_Data collection to_>
<_Source of dataset_>     [Source of data, e.g. NSO] </_Source of dataset_>
<_Sample size (HH)_>      [#] </_Sample size (HH)_>
<_Sample size (IND)_>     [#] </_Sample size (IND)_>
<_Sampling method_>       [Brief description] </_Sampling method_>
<_Geographic coverage_>   [To what level is data significant] </_Geographic coverage_>
<_Currency_>              [Currency used for wages] </_Currency_>
------------------------------------------------------------------
<_ICLS Version_>          [Version of ICLS for Labor Questions] </_ICLS Version_>
<_ISCED Version_>         [Version of ISCED used to code] </_ISCED Version_>
<_ISCO Version_>          [Version of ISCO used to code] </_ISCO Version_>
<_OCCUP National_>        [Version of national occupation code] </_OCCUP National_>
<_ISIC Version_>          [Version of ISIC used to code] </_ISIC Version_>
<_INDUS National_>        [Version of national industry code] </_INDUS National_>
------------------------------------------------------------------
<_Version Control_>
* Date: [YYYY-MM-DD] - [Description of changes]
* Date: [YYYY-MM-DD] - [Description of changes]
</_Version Control_>
-------------------------------------------------------------------*/

After the box, the harmonization code is divided into 9 sections. Section 1 contains the code to set up file and folder paths and assemble the dataset to harmonize from the raw data.
It is at this step that users would need to update folder and file paths and names if they are using a different storage system (e.g., their server is labelled “E” instead of “Y”). Sections 2 to 8 cover variables of the different blocks of the data dictionary (e.g., geography, socio-demographic, education, …). Section 9 does the final clean-up to keep only variables in the dictionary that have values (i.e., we do not keep variables that have missing values for all respondents), in the correct order. Data is also compressed and unused labels are discarded so the final output is as size-efficient as possible.

Each section is tagged according to the following rules (see example in Box 2):
- The section marker starts with /*%% followed by equal signs to pad out the line
- The section title is indented, starting with a number, colon, and section title
- The section marker closes with a line of equal signs ended by %%*/ (inverse of start)

Box 2 - Section header example

/*%%=================================================================
    1: Setting up of program environment, dataset
=================================================================%%*/

Within the “variable” sections 2 to 8, all harmonized variables in each section are tagged according to the following convention (see example in Box 3 - Variable tagging example):
- The beginning of the code relating to a harmonized variable should be preceded by *<_var_> where “var” is the harmonized variable being created.
- The end of the code relating to the variable creation should read *</_var_>.
- Variables that are already named (e.g. if "hhid" exists in raw data file) should be noted. Between the "open" and "close" tags, a starred line should read: "*'var' present in 'source'".
- If a variable requires more extensive or explicit comments, these should be written between note tags. The note tag is the same as the variable tag only followed by “_note” (e.g., “var_note”).
For example, if the variable “lstatus” is created in an uncommon or unexpected way, harmonizers can add the variable-specific note as follows: *<_lstatus_note_> Text explaining issues with variable, why which choices made *</_lstatus_note_> (see Box 3 below).

Box 3 - Variable tagging example

*<_wage_no_compen_>
/*<_wage_no_compen_note_>
The wage questions in the questionnaire are organized into two parts: the
first part asks for the specific number of income if the given respondent
could recall and was willing to answer; the second part provides different
categories of income range if the given respondent could not recall or was
not willing to answer the first part.
The general logic here is to impute wage values for people who only answered
an income range. We used industry, occupation, income categories and gender
to estimate their specific income values.
21.86% of total non-missing wage values were imputed using this method.
*</_wage_no_compen_note_>*/

* Overall --> wage info
* Set values of 0 to missing
gen wage14=E14_1+E14_2
replace wage14=. if wage14==0

* First replace outliers by
winsor2 wage14, suffix(_w) cuts(1 99)

* Create salary categories based on winsor values
gen salary_cat=.
replace salary_cat=1 if inrange(wage14_w, 1, 55000)
[…]

* Keep only for employed employees, label
replace wage_no_compen=. if lstatus!=1|empstat==2
label var wage_no_compen "Last wage payment primary job 7 day recall"
*</_wage_no_compen_>

Tagging the harmonization variables serves two purposes: (1) it helps when cross-checking the definitions of harmonized variables over time, and when assessing the comparability of such variables across countries; (2) it improves the automated updating of the DDI by adding the block of code used for generating each harmonized variable to the variable description of the DDI. Tagging will also improve the transparency of the metadata DDI for basic users in the Microdata Library.
Tagging should be done for one variable at a time, not a group of variables.

3.3. The GLD data dictionary

This section defines, one by one, each variable in the data dictionary and how it should be harmonized. It is divided into blocks as is done in the harmonization code. Each block section then also contains some lessons learned, if any, and a tabular overview of the variables.

Survey & ID module

countrycode
countrycode is a string variable that specifies the 3-character country code used by the World Bank to identify each country. Although there are different naming conventions, it is necessary to use those specified to ensure that the data for each country is appropriately labeled.

survname
survname codes the acronym of the survey.

survey
survey codes the type of survey (e.g., LFS for Labor Force Survey).

icls_v
Underlying version of the International Conference of Labour Statisticians that is being used in the survey to code concepts of work and employment. Most commonly, surveys harmonized to GLD will either follow ICLS-13 or ICLS-19, that is, the directives set out during the 13th or the 19th conference, especially those pertaining to employment.

In ICLS-13 all work, other than household work, is seen as employment. Thus, subsistence farmers are as employed as the CEO of an international conglomerate. The below screenshot (Figure 13) is from the questionnaire of the Zimbabwean 2014 LFS, where any yes answer will skip to questions on main employment (Q25). As highlighted, work for a wage is treated in the same way as work on any agricultural holding. This survey follows ICLS-13.

Figure 13 - Example of 2014 ZWE LFS questionnaire

Five years later, the Zimbabwean statistics office, ZimStat, switched to ICLS-19. In ICLS-19, only work for market exchange is considered employment (treating subsistence farming in the same way as household labour). Thus, an additional question is added to differentiate what kind of farming is taking place.
The below screenshot (Figure 14) is part of the set of agriculture questions. If the agricultural work on the own agricultural holding is only or mostly for market exchange (codes 1 and 2) the individual should be asked about their first main job (MJ1). If agricultural production is only or mainly for own consumption, then the questionnaire continues, here asking whether they work for others for hire.

Figure 14 - Example of 2019 ZWE LFS questionnaire

If the survey asks questions to understand what kind of farming takes place (subsistence or market exchange) and defines a skip pattern to lead to employment questions based on that, the survey questionnaire follows ICLS-19, otherwise it follows ICLS-13.

isced_version
Underlying version of the International Standard Classification of Education (ISCED) used in the survey. Acceptable values are either isced_1997 or isced_2011.

isco_version
Underlying version of the International Standard Classification of Occupations (ISCO) used in the survey. Acceptable values are either isco_1988 or isco_2008.

isic_version
Underlying version of the International Standard Industrial Classification of All Economic Activities (ISIC) used in the survey. Acceptable values are either isic_2, isic_3, isic_3.1, or isic_4.

year
year is a numeric variable that denotes the year in which the implementation of the household survey began. For example, if a survey was implemented between October 2018 and September 2019, the year would be 2018.

vermast
vermast codes the version of the master file (original data) being used in the harmonization.

veralt
veralt codes the version of the harmonization.

harmonization
harmonization codes the kind of harmonization (GLD or GMD). For GLD surveys this will always be GLD.

int_year
int_year is a numeric variable that specifies the year when the survey questionnaire was administered to the household.
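The difference between year (survey start) and int_year (interview date) can be illustrated with a minimal Stata sketch. It reuses the example of a survey fielded from October 2018 to September 2019; the raw variable interview_date, its format, and the two toy observations are assumptions made purely for illustration and do not come from any GLD survey.

```stata
clear
* Hypothetical raw data: one interview date per household, stored as text
input str10 interview_date
"2018-10-15"
"2019-03-02"
end

* Convert the text date into a Stata daily date
gen int_date = date(interview_date, "YMD")
format int_date %td

* int_year follows the interview; year is fixed at the survey start year
gen int_year = year(int_date)
gen year     = 2018

list interview_date year int_year
```

In this sketch both households carry year 2018, while the household interviewed in March carries int_year 2019.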
int_month
int_month is a numeric variable that specifies the month when the survey questionnaire was administered to the household.

hhid
hhid specifies the unique household identification number in the data file. The original format of the data, string or numeric, should be kept. If there is a household ID in the original data, hhid and hhid_orig should be the same. If hhid_orig is missing, hhid is constructed from variables in the raw data, which should be listed by their raw variable names.

pid
This variable allows identification of individuals. The variable will vary in length depending on how the identification code was constructed in each country. Depending on the country, this variable may be a concatenation of several variables in the raw data file. Keep the format (string or numeric) of the original data. If there is a personal ID in the original data, pid and pid_orig should be the same. If pid_orig is missing, pid is constructed from variables in the raw data, which should be listed by their raw variable names.

weight
weight contains household weights, typically inversely proportional to the probability of the household being selected for the sample, that should be applied to all analysis to make the results representative of the population.

weight_m
weight_m contains household weights, typically inversely proportional to the probability of the household being selected for the sample, that should be applied to all analysis to make the results representative of the population for each month. To be added only if present in the raw data and if the survey reports estimate results per month.

weight_q
weight_q contains household weights, typically inversely proportional to the probability of the household being selected for the sample, that should be applied to all analysis to make the results representative of the population for each quarter. To be added only if present in the raw data and if the survey reports estimate results per quarter.
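As a rough illustration of when each weight applies, the Stata sketch below computes an annual estimate with weight and quarterly estimates with weight_q. The toy data, the variables quarter and employed, and the weight values are hypothetical; the sketch also assumes that weight_q is scaled so each quarter on its own adds up to the population, which may not hold for a given survey.

```stata
clear
* Hypothetical toy data: two quarters, two respondents each
input byte quarter byte employed double weight double weight_q
1 1 100 200
1 0 150 300
2 1 120 240
2 0 130 260
end

* Annual estimate: pool all observations and apply the annual weight
mean employed [pweight = weight]

* Quarterly estimates: apply the quarterly weight within each quarter
mean employed [pweight = weight_q], over(quarter)
```

Monthly estimates with weight_m would follow the same pattern, using a month identifier in over().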
psu Primary sampling unit (psu) refers to sampling units that are selected in the first (primary) stage of multi- stage sample design. These sampling units typically correspond to a number of large aggregate units (clusters), each of which contains sub-units. For example, a primary sampling unit can represent the set of all housing units contained in a well-defined geographic area, such as a municipality or a group of contiguous municipalities. Primary sampling units are numeric and country-specific. A unique identifier is created for each primary sampling unit. In Stata, users are advised to specify the primary sampling unit through the svyset command. ssu Secondary sampling unit code (if present). strata Unit defining the first stage stratification strategy. wave In case of the survey being rolled out over several waves (e.g. quarterly), codes the information of the iteration of the survey. panel A string variable denoting which panel the individual belongs to. A panel is defined as all individuals who entered a survey at the same time (e.g., Q3 of 2020) and are scheduled to exit at the same time after a fixed number of survey waves (e.g., after four quarters). Note that due to attrition not all intakes may exit at the same time. This variable is only to be coded if the concept is present in the raw data already. visit_no A numeric variable denoting the visit number (e.g., first visit coded as 1, second visit as 2, …) within a panel. This variable is only to be coded if the concept is present in the raw data already. Lessons Learned and common challenges Coding IDs correctly is integral to allow for analytical tools to leverage the information. The coding of the identifications should follow from the survey structure and should not be built via a sequential index (e.g., gen hhid = _n). Since observations may be ordered differently, possibly even across different vintages of the same file, the coding may lead to different outcomes. 
    * Create hhid like this:
    gen hhid = psu * 100 + hh
    * Not like this:
    gen hhid = _n

When creating hhid and pid, especially from string variables or from the group(varlist) or concat(varlist) functions, users should try to create them first from the roster data files, where all information or observations are available. In addition, the order of the variables in the varlist, and the sort on those variables, must be the same across all data files.

When hhid and pid are numeric but held at too low a precision, it is recommended to bring them to the accurate precision level so they can be used correctly in merging. For example, if the value of hhid for an observation is 100021210121 (a long number), users should format the variable with “format %15.0g hhid”.

In case a household survey is conducted more than once per year (e.g., quarterly HH surveys) you may want to use this as panel data, in which case the household ID can remain as is. However, if you want to use the data as cross-sectional, then new HHIDs can be constructed for each HH for each quarter.

| Quarter | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4 |
|---|---|---|---|---|
| hhid_orig | hhid=1 | hhid=1 | hhid=1 | hhid=1 |
| hhid | hhid=1Q1 | hhid=1Q2 | hhid=1Q3 | hhid=1Q4 |

hhid should never be missing, and if any value is missing the variable should be checked.

    assert !missing(hhid)

It is recommended to check the uniqueness level of the data files with identifier variables at the corresponding level of the data (i.e. household vs individual level data). hhid and pid need to be unique in the database.

    * Stop with an error unless hhid and pid jointly identify each observation
    isid hhid pid
    cap destring pid, replace
    * Compare the number of observations with the number of unique combinations
    duplicates report hhid pid
    local n = r(unique_value)
    local N = r(N)
    if `N' != `n' display as error "hhid and pid do not uniquely identify observations"

Ensure that countrycode is a three-letter country code.

    cap confirm str3 var countrycode
    if _rc != 0 display as error "countrycode is not a three-character string variable"

Harmonizers should also ensure that country codes are updated according to the ISO country codes.
Some common adjustments include the following:

replace countrycode="XKX" if countrycode=="KSV"
replace countrycode="TLS" if countrycode=="TMP"
replace countrycode="PSE" if countrycode=="WBG"
replace countrycode="COD" if countrycode=="ZAR"

Furthermore, harmonizers should check that the years used are in an appropriate range. The year needs to be a four-digit number in the range of 1980 to the current year (assumed here to be 2020).

assert inrange(year, 1980, 2020)

Overview of Variables

Module Code | Variable label | Variable name | Notes
Survey & ID | ISO 3 Letter country code | countrycode |
Survey & ID | Survey acronym | survname | No spaces, no underscores, split sections by "-" (e.g. "ETC-II")
Survey & ID | Survey long name | survey | Possible names are: LFS, LSMS, … [I am unsure of this difference, some surveys contain either this or the previous variable, have yet to see one with both]
Survey & ID | Version of the ICLS followed | icls_v | Defines the labor force definitions used according to the rules set out by the nth International Conference of Labour Statisticians.
Survey & ID | Version of ISCED used | isced_version |
Survey & ID | Version of ISCO used | isco_version |
Survey & ID | Version of ISIC used | isic_version |
Survey & ID | Year of survey start | year |
Survey & ID | Master (Source) data version | vermast |
Survey & ID | Alternate (Harmonized) data version | veralt |
Survey & ID | Kind of harmonization | harmonization |
Survey & ID | Year of interview start | int_year | For HH and Individual interviews in that HH earliest possible date
Survey & ID | Month of interview start | int_month | For HH and Individual interviews in that HH earliest possible date
Survey & ID | Household ID | hhid |
Survey & ID | Personal ID | pid |
Survey & ID | Survey weights | weight |
Survey & ID | Primary sampling unit | psu |
Survey & ID | Secondary sampling unit | ssu |
Survey & ID | Stratification (of PSU) | strata |
Survey & ID | Wave of the survey | wave | E.g., Q1 for wave quarter 1
Survey & ID | Panel the individual belongs to | panel | Only code if concept already in survey
Survey & ID | Visit number in panel order | visit_no | Only code if concept already in survey

Geography

urban

urban is a dummy variable that specifies the location type (urban or rural) of the household. This variable is country specific as each country uses its own criterion to distinguish urban from rural areas. In many cases there is no clear division between urban and rural areas, and areas are classified as “semi-urban” or “mixed”. Harmonizers are advised to classify such categories as “urban.”

Urban categories:
1 = Urban
0 = Rural

subnatid1

subnatid1 refers to a subnational identifier at the highest level within the country’s administrative structure. This is typically a province or state. The variable is string and country-specific categorical. Numeric entries are coded in string format using the following naming convention: “1 – Hatay”. That is, the variable itself is to be string, not a labelled numeric vector.
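The naming convention for subnatid1 can be sketched in Stata as follows. This is only an illustration under stated assumptions: the raw variable region, its value label lblregion, and the province names are hypothetical, and a Stata version with Unicode support (14 or later) is assumed for the dash used in the convention.

```stata
* Sketch: turn a labelled numeric region variable into a string subnatid1.
* Assumption: "region" and its labels are hypothetical stand-ins for the raw data.
clear
input byte region
1
2
.
end
label define lblregion 1 "Province A" 2 "Province B"
label values region lblregion

* Recover the label text, then prefix it with the numeric code
decode region, generate(region_name)
generate subnatid1 = string(region) + " – " + region_name

* string(.) returns ".", so missing regions are reset to an empty string
replace subnatid1 = "" if missing(region)

* The result must be a string variable, not a labelled numeric one
confirm string variable subnatid1
assert subnatid1 == "1 – Province A" in 1
drop region_name
```

The same pattern would apply to subnatid2 through subnatid4, each built from the raw variable at the corresponding administrative level.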
subnatid2

subnatid2 refers to a subnational identifier at which the survey is representative at the second highest level within the country’s administrative structure. This is typically a district. The variable is string and country-specific categorical. Numeric entries are coded in string format using the following naming convention: “1 – Hatay”. That is, the variable itself is to be string, not a labelled numeric vector.

subnatid3

subnatid3 refers to a subnational identifier at which the survey is representative at the third level within the country’s administrative structure. This is typically a sub-district. The variable is string and country-specific categorical. Numeric entries are coded in string format using the following naming convention: “1 – Hatay”. That is, the variable itself is to be string, not a labelled numeric vector.

subnatid4

subnatid4 refers to a subnational identifier at which the survey is representative at the lowest level within the country’s administrative structure. In some countries, this is effectively a village. The variable is string and country-specific categorical. Numeric entries are coded in string format using the following naming convention: “1 – Hatay”. That is, the variable itself is to be string, not a labelled numeric vector.

subnatidsurvey

subnatidsurvey is a string variable that refers to the lowest administrative level at which the survey is representative. In most cases this will be equal to variable subnatid1 or subnatid2. However, in some cases the lowest level is defined in terms of urban/rural (i.e., variable urban) or another regional categorization that cannot be mapped to subnatid#. The below example (Table 3) shows how to code subnatidsurvey for a survey that is representative at the rural/urban level of the province (subnatid1).
Table 3 - Example of a survey representative at subnatid1 and urban/rural level

subnatid1           urban   subnatidsurvey
“1 – Province A”    1       “1 – Province A urban”
“1 – Province A”    1       “1 – Province A urban”
“1 – Province A”    1       “1 – Province A urban”
“1 – Province A”    1       “1 – Province A urban”
“1 – Province A”    0       “1 – Province A rural”
…                   …       …
“2 – Province B”    0       “2 – Province B rural”
“2 – Province B”    0       “2 – Province B rural”
“2 – Province B”    0       “2 – Province B rural”
“2 – Province B”    0       “2 – Province B rural”
“2 – Province B”    1       “2 – Province B urban”
“2 – Province B”    1       “2 – Province B urban”

Table 4, in turn, is the example of a survey that is representative nationally, nationally at the urban/rural level, and at the province level.

Table 4 - Example of a survey representative at subnatid1 level

subnatid1           urban   subnatidsurvey
“1 – Province A”    1       “1 – Province A”
“1 – Province A”    1       “1 – Province A”
“1 – Province A”    1       “1 – Province A”
“1 – Province A”    1       “1 – Province A”
“1 – Province A”    0       “1 – Province A”
…                   …       …
“2 – Province B”    0       “2 – Province B”
“2 – Province B”    0       “2 – Province B”
“2 – Province B”    0       “2 – Province B”
“2 – Province B”    0       “2 – Province B”
“2 – Province B”    1       “2 – Province B”
“2 – Province B”    1       “2 – Province B”

The variable contains the lowest level at which the survey is representative, irrespective of its mapping to the subnatids.

subnatid1_prev

subnatid1_prev is coded as missing unless the classification used for subnatid1 has changed since the previous survey. In that case, it refers to the subnatid1 code used in the previous survey. This provides a way of tracking splits. For example, if province “32 – West Java” split into provinces “32 – West Java” and “36 – Banten” since the most recent survey, subnatid1_prev would be “32 – West Java” for cases when subnatid1 is either “32 – West Java” or “36 – Banten”.

subnatid2_prev

subnatid2_prev is coded as missing unless the classification used for subnatid2 has changed since the previous survey.
In that case, it refers to the subnatid2 code used in the previous survey.

subnatid3_prev

subnatid3_prev is coded as missing unless the classification used for subnatid3 has changed since the previous survey. In that case, it refers to the subnatid3 code used in the previous survey.

subnatid4_prev

subnatid4_prev is coded as missing unless the classification used for subnatid4 has changed since the previous survey. In that case, it refers to the subnatid4 code used in the previous survey.

strata

strata refers to the division of the target population (typically the census sample frame) into subpopulations based on auxiliary information that is known about the full population. Sampling is conducted separately for each stratum. The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded. Sampling strata need to be considered when constructing the variance (or confidence intervals) of population estimates. The strata variable is needed for the correct calculation of standard errors for each sample design. Strata is numeric and country specific. A unique identifier is created for each stratum. In Stata, users are advised to specify strata through the svyset command. The variable is in string format with the following naming convention “code of stratum – stratum name”, for example: “1 – Dar-es-salaam”.

gaul_adm1_code

gaul_adm1_code is numeric and country-specific based on the GAUL database. It should be taken from the same data in the GAUL database where the geographical area can be identified in the survey based on the name of the location/area. The number of unique values of subnatid1 and of gaul_adm1_code may be the same or may differ. For example, in the case of a fictional country, if the highest level of representation is the state level (53 states) and GAUL also has 53 states, the number is the same.
In a different example, the survey is representative at the level of statistical regions (7) while the identifiable GAUL code is at state level (53 states); with this setup, one can know how the seven statistical regions are constructed.

gaul_adm2_code

gaul_adm2_code is numeric and country-specific based on the GAUL database. It should be taken from the same data in the GAUL database where the geographical area can be identified in the survey based on the name of the location/area.

Lessons Learned and Challenges

• subnatid codes should reflect the most recent codes that pertain to that survey. subnatid_prev codes can be used to track splits and new administrative units that have been introduced since the previous survey. It is important to ensure there is consistency in geographic variables across time. Subnationally representative units may be added in later editions of surveys, so names of subnational units must be consistent across time. This will allow analysts to make the current administrative units “backwards-compatible” with little additional effort.
• Harmonizers should ensure that subnatid1 through subnatid4 are string variables, NOT labelled numeric (categorical) variables.
• The urban variable cannot be different from zero or one.

urban != 1 & urban != 0

Overview of Variables

Module Code | Variable name | Variable label | Notes
Geography | urban | Binary - Individual in urban area |
Geography | subnatid[i] | Subnational ID - [ith] level | Subnational ID at the ith level, listing as many as available
Geography | subnatidsurvey | Lowest level of Subnational ID | subnatidsurvey is a string variable that refers to the lowest administrative level at which the survey is representative. In most cases this will be equal to “subnatid1” or “subnatid2”. However, in some cases the lowest level is defined in terms of urban/rural or another regional categorization that cannot be mapped to the subnatids. The variable contains the lowest level of survey representation irrespective of its mapping to the subnatids.
Geography | subnatid[i]_prev | Subnatid previous - [ith] level | Previous subnatid if changed since last survey
Geography | strata | Strata |
Geography | gaul_adm[i]_code | GAUL ADM[i] code | See en.wikipedia.org/wiki/Global_Administrative_Unit_Layers

Demography

hsize

hsize codes the size of the household. This should not include individuals who may be living in the household but do not form an economic unit with the other household members (e.g., a live-in maid is not part of the household).

age

age refers to the interval of time between the date of birth and the date of the survey. Every effort should be made to determine the precise and accurate age of each person, particularly of children and older persons. Information on age may be secured either by obtaining the date (year, month, and day) of birth or by asking directly for age at the person’s last birthday. In addition, in the case of children aged less than or equal to 60 months, variable age should be expressed as the number of completed years and months in decimals. For example, if a 4-year-old was interviewed in December and was born in June, the age should be recorded as 4.5. Lastly, if the information on age is not available, it should be coded as missing rather than some other value such as “99” or “999”.

male

male is a dummy variable that specifies the sex (male or female) of an individual within a household. While constructing this variable, it is important to make sure that all relevant values are included. Variable values coded as ‘98’ or other numeric characters should be excluded from the values of the male variable.

Sex of household member, two categories after harmonization:
1 = male
0 = female

relationharm

relationharm is a categorical variable that indicates the relationship to the reference person of the household (usually the head of household). Variable values coded as ‘98’ or other numeric characters should be excluded from the values of the relationharm variable.
Relationship to head of household, six categories after harmonization:
1 = head
2 = spouse
3 = children
4 = parents
5 = other relatives
6 = non-relatives

Note: In cases where the head is missing or a migrant, we assign the spouse as the head of the household. If the spouse is also not available, then we use the oldest member of the household as the head and recode all the relations to head accordingly.

relationcs

relationcs is a country-specific categorical variable that indicates the relationship to the head of the household. The categories for relationship to the head of the household are defined according to the region or country requirements.

marital

marital is a categorical variable that refers to the personal status of each individual in relation to the marriage laws or customs of the country. The categories of marital status to be identified should include at least the following: (a) single (in other words, never married); (b) married; (c) married but separated; (d) widowed and not remarried; (e) divorced and not remarried. In some countries, category (b) may require a subcategory of persons who are contractually married but not yet living as man and wife. In all countries, category (c) should comprise both the legally and the de facto separated, who may be shown as separate subcategories if desired.

The marital variable should not be imputed but rather calculated only for those to whom the question was asked (in other words, the youngest age at which information is collected may differ depending on the survey). The consistency between age and marital needs to be cross-checked. In most countries, there are also likely to be persons who were permitted to marry below the legal minimum age because of special circumstances. To permit international comparisons of data on marital status, however, any tabulations of marital status not cross-classified by exact age should at least distinguish between persons under 15 years of age and those aged 15 and over.
If it is not possible to distinguish between married and living together, then it should be assumed that the individual is married. Variable values coded as ‘98’ or other numeric characters should be excluded from the values of the marital variable.

Marital status, five categories after harmonization:
1 = married
2 = never married
3 = living together
4 = divorced/separated
5 = widowed

eye_dsablty

eye_dsablty is a numerical variable that indicates whether an individual has any difficulty in seeing, even when wearing glasses.

Categories after harmonization: 1 = No – no difficulty; 2 = Yes – some difficulty; 3 = Yes – a lot of difficulty; 4 = Cannot do at all

hear_dsablty

hear_dsablty is a numerical variable that indicates whether an individual has any difficulty in hearing even when using a hearing aid.

Categories after harmonization: 1 = No – no difficulty; 2 = Yes – some difficulty; 3 = Yes – a lot of difficulty; 4 = Cannot do at all

walk_dsablty

walk_dsablty is a numerical variable that indicates whether an individual has any difficulty in walking or climbing steps.

Categories after harmonization: 1 = No – no difficulty; 2 = Yes – some difficulty; 3 = Yes – a lot of difficulty; 4 = Cannot do at all

conc_dsord

conc_dsord is a numerical variable that indicates whether an individual has any difficulty concentrating or remembering.

Categories after harmonization: 1 = No – no difficulty; 2 = Yes – some difficulty; 3 = Yes – a lot of difficulty; 4 = Cannot do at all

slfcre_dsablty

slfcre_dsablty is a numerical variable that indicates whether an individual has any difficulty with self-care such as washing all over or dressing.

Categories after harmonization: 1 = No – no difficulty; 2 = Yes – some difficulty; 3 = Yes – a lot of difficulty; 4 = Cannot do at all

comm_dsablty

comm_dsablty is a numerical variable that indicates whether an individual has any difficulty communicating or understanding usual (customary) language.
Categories after harmonization: 1 = No – no difficulty; 2 = Yes – some difficulty; 3 = Yes – a lot of difficulty; 4 = Cannot do at all

Lessons Learned and Challenges

Data sets that are harmonized incorrectly can lead to skewed and/or incorrect data analysis. Harmonizers should run a series of checks to ensure data is harmonized properly. Each expression below flags the observations that fail the check.

Check that age is an integer above 5 years of age.
age != int(age) & age != . & age > 5

age cannot have negative or extreme values (>120).
(age < 0 | age > 120) & age < .

age cannot be missing.
age == .

male can only take one of two values, 0 or 1 (or missing).
male != . & male != 1 & male != 0

Check if male is missing.
male == .

Check that there is variation in male.
egen sdmale = sd(male) // sdmale should not be 0

relationharm must be an integer in the range [1,6] (missing values are not flagged by this check).
(relationharm < 1 | relationharm > 6 | mod(relationharm, 1) != 0) & relationharm < .

marital must be an integer in the range [1,5] (missing values are not flagged by this check).
(marital < 1 | marital > 5 | mod(marital, 1) != 0) & marital < .

Children are “Never married” and should be coded as such even though it may be perceived as obvious. The marital status of individuals should be harmonized for all individuals. Harmonizers should check that children are not systematically left with missing values for marital.
tab age marital, missing

weight cannot be missing.
weight == .

Additionally, harmonizers should ensure that the household size variable is calculated correctly. Not all the individuals reported in a household in the raw data are current household members. For example, in the EU-SILC survey, a household contains the current members, but also the members from the previous survey who have left the household for reasons such as death or migration.
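The household size point can be illustrated with a minimal Stata sketch. It assumes the raw data carry a flag for current membership; the variable current_member and the toy data are hypothetical, and real surveys will encode membership differently.

```stata
* Sketch: count only current household members when building hsize.
* Assumption: current_member (hypothetical) equals 1 for current members
* and 0 for people listed in the roster who have left the household.
clear
input str4 hhid pid current_member
"H001" 1 1
"H001" 2 1
"H001" 3 0
"H002" 1 1
end

* total() sums the logical expression within each household
bysort hhid: egen hsize = total(current_member == 1)

* H001 lists three people, but only two are current members
assert hsize == 2 if hhid == "H001"
assert hsize == 1 if hhid == "H002"
```

Whether former members are then dropped from the harmonized file is a separate decision that depends on the survey and is not covered by this sketch.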
Overview of Variables

Module Code | Variable name | Variable label | Notes
Demography | hsize | Household size |
Demography | age | Age in years |
Demography | male | Binary - Individual is male |
Demography | relationharm | Relationship to head of household harmonized across all regions | GMD - Harmonized categories across all regions. Same as I2D2 categories.
Demography | relationcs | Relationship to head of household country/region specific | Country or regionally specific categories
Demography | marital | Marital status |
Demography | eye_dsablty | Difficulty seeing |
Demography | hear_dsablty | Difficulty hearing |
Demography | walk_dsablty | Difficulty walking / steps |
Demography | conc_dsord | Difficulty concentrating |
Demography | slfcre_dsablty | Difficulty w/ selfcare |
Demography | comm_dsablty | Difficulty communicating |

Note for the six disability variables: See "Recommended Short Set of Questions" on https://www.cdc.gov/nchs/washington_group/wg_questions.htm

Migration

migrated_mod_age

Codes the minimum age the migration module questions of the survey apply to (e.g., if the migration questions are for all 5 years and above this would be 5).

migrated_ref_time

Codes the reference period the migration questions cover. If the migration questions only apply after an introductory time window question (e.g., have you moved in the past five years) and then questions are only asked for those who fall within the time window, code the length of that window (e.g., 5 in the example). If migration questions are posed regardless of time (i.e., no time window) code 99.

migrated_binary

Binary question coding whether the individual has ever migrated (within the reference time set out above).

migrated_years

Number of full years since the last migration. Often surveys ask instead how long a person has lived at their current domicile. Both questions cover the same information.

migrated_from_urban

Codes whether the individual migrated to their current domicile from an urban area.
1 = Yes (i.e., came from urban area)
0 = No (i.e., came from rural area)

migrated_from_cat

If the survey contains information on the area from where the person migrated, use the concept of administrative division to inform the migration pattern. The codes are:
1 = From same admin3 area
2 = From same admin2 area
3 = From same admin1 area
4 = From other admin1 area
5 = From other country

To exemplify the use, Spain is divided into Communities (admin1 level), Provinces (admin2 level) and municipalities (names change within provinces, but the rough concept holds; admin3 level). A person moving within the municipality, for example from one village to the next, codes 1. A person moving within the province, say from a rural municipality to the province capital, codes 2. A person moving within the same community yet leaving their province codes 3. A person moving from one community to another, say from Andalusia to Galicia, codes 4. A person who moved from outside the country (regardless of their nationality) codes 5.

migrated_from_code

Based on the logic set out in the migrated_from_cat variable, codify the areas of migration using the survey subnational ID classification. For example, if a person migrated from one admin1 area to another, use the subnatid1 codes to inform from which admin1 area they migrated to their current residence (which is codified in subnatid1). This only codifies information within the country. Set to missing if migrated_from_cat is 5. Note that most surveys will only provide information on the last residence at a higher administrative level (e.g., admin1 level). Codify the information down to the most detailed level available. See the example under Lessons Learned and Challenges below.

migrated_from_country

Codes the country (if migrated_from_cat is 5) from where the person migrated as a three-letter ISO country code or a clear string for regions (“Other Europe”, “Other World”, …).

migrated_reason

Codifies the reason why a person migrated.
The codes are:
1 = Family reasons
2 = Educational reasons
3 = Employment
4 = Forced (political reasons, natural disaster, …)
5 = Other reasons

Lessons Learned and Challenges

Codifying migrated_from_* questions

In the Indian LFS from 1999 (NSS Schedule 10) there are two questions that allow us to codify the four migrated_from_[text] variables (migrated_from_urban, migrated_from_cat, migrated_from_code, and migrated_from_country). Question 15 of Block 4 asks the interviewer to enter the “location code” for the kind of migration the interviewees claim to have made. The codes are:

location of last usual residence: same district: rural-1, urban-2; same state but another district: rural-3, urban-4; another state: rural-5, urban-6; another country-7

Question 17 of Block 4 then asks for the state code of migration (as codified in subnatid) and adds additional codes for countries from where people commonly migrated into India. With these two questions we can harmonize the four variables in the following way:

*<_migrated_from_urban_>
gen migrated_from_urban = .
replace migrated_from_urban = 1 if inlist(B4_q15,"2","4","6") & migrated_binary == 1
replace migrated_from_urban = 0 if inlist(B4_q15,"1","3","5") & migrated_binary == 1
label de lblmigrated_from_urban 0 "Rural" 1 "Urban"
label values migrated_from_urban lblmigrated_from_urban
label var migrated_from_urban "Migrated from area"
*</_migrated_from_urban_>

*<_migrated_from_cat_>
gen migrated_from_cat = .
replace migrated_from_cat = 2 if inlist(B4_q15,"1","2") & migrated_binary == 1
replace migrated_from_cat = 3 if inlist(B4_q15,"3","4") & migrated_binary == 1
replace migrated_from_cat = 4 if inlist(B4_q15,"5","6") & migrated_binary == 1
replace migrated_from_cat = 5 if inlist(B4_q15,"7") & migrated_binary == 1
label de lblmigrated_from_cat 1 "From same admin3 area" 2 "From same admin2 area" 3 "From same admin1 area" 4 "From other admin1 area" 5 "From other country"
label values migrated_from_cat lblmigrated_from_cat
label var migrated_from_cat "Category of migration area"
*</_migrated_from_cat_>

*<_migrated_from_code_>
destring B4_q17, gen(helper_var)
gen migrated_from_code = .
replace migrated_from_code = subnatid1 if inrange(migrated_from_cat,1,3)
replace migrated_from_code = helper_var if migrated_from_cat == 4
label var migrated_from_code "Code of migration area as subnatid level of migrated_from_cat"
drop helper_var
*</_migrated_from_code_>

*<_migrated_from_country_>
gen migrated_from_country = ""
replace migrated_from_country = "BGD" if migrated_from_cat == 5 & B4_q17 == "51"
replace migrated_from_country = "NPL" if migrated_from_cat == 5 & B4_q17 == "52"
replace migrated_from_country = "PAK" if migrated_from_cat == 5 & B4_q17 == "53"
replace migrated_from_country = "LKA" if migrated_from_cat == 5 & B4_q17 == "54"
replace migrated_from_country = "BTN" if migrated_from_cat == 5 & B4_q17 == "55"
replace migrated_from_country = "Gulf countries" if migrated_from_cat == 5 & B4_q17 == "56"
replace migrated_from_country = "Other Asian" if migrated_from_cat == 5 & B4_q17 == "57"
replace migrated_from_country = "USA" if migrated_from_cat == 5 & B4_q17 == "58"
replace migrated_from_country = "CAN" if migrated_from_cat == 5 & B4_q17 == "59"
replace migrated_from_country = "Other Americas" if migrated_from_cat == 5 & B4_q17 == "60"
replace migrated_from_country = "GBR" if migrated_from_cat == 5 & B4_q17 == "61"
replace migrated_from_country = "Other Europe" if migrated_from_cat == 5 & B4_q17 == "62"
replace migrated_from_country = "African countries" if migrated_from_cat == 5 & B4_q17 == "63"
replace migrated_from_country = "Other World" if migrated_from_cat == 5 & B4_q17 == "64"
label var migrated_from_country "Code of migration country (ISO 3 Letter Code)"
*</_migrated_from_country_>

Overview of Variables

Module Code | Variable name | Variable label | Notes
Migration | migrated_mod_age | Migration module application age |
Migration | migrated_ref_time | Reference time applied to migration questions | If migrated_ref_time = 5, the questions about migration refer to any migration in the last 5 years
Migration | migrated_binary | Individual has migrated |
Migration | migrated_years | Years since latest migration | Years since last migration is the same as how long lived at current location
Migration | migrated_from_urban | Migrated from area | No means migrated from rural area
Migration | migrated_from_cat | Category of migration area |
Migration | migrated_from_code | Code of migration area |
Migration | migrated_from_country | Code of migration country |
Migration | migrated_reason | Reason for migrating |

Education

ed_mod_age

Codifies the minimum age for which education questions are asked. For example, if education information is only requested from those 4 years and older the variable should be set to 4.

school

Codifies whether the person is currently (i.e., at the time of the survey) attending formal education. The codes are:
0 = No
1 = Yes

literacy

Codifies whether the person can read and write in at least one language. The codes are:
0 = No
1 = Yes

educy

Codifies the number of years spent in education.

educat7

Classifies the highest level of education attained by the respondent to seven levels.
The codes are:
1 = No education
2 = Primary incomplete
3 = Primary complete
4 = Secondary incomplete
5 = Secondary complete
6 = Higher than secondary but not university
7 = University incomplete or complete

The concept of secondary complete includes all students who have attended at least one year (complete or incomplete) of upper secondary education (as defined in the ISCED Mappings of UNESCO). That is, attendance and completion of “junior high” shall be coded as secondary incomplete, while attendance of “senior high school”, even if only for one year, will be coded as secondary complete.

educat5

Classifies the highest level of education attained by the respondent to five levels. The codes are:
1 = No education
2 = Primary incomplete
3 = Primary complete but secondary incomplete
4 = Secondary complete
5 = Some tertiary/post-secondary

educat4

Classifies the highest level of education attained by the respondent to four levels. The codes are:
1 = No education
2 = Primary (complete or incomplete)
3 = Secondary (complete)
4 = Tertiary (complete or incomplete)

Note: Code as primary education anyone who has undergone some schooling but has not finished secondary education.

educat_orig

Original education code as in the raw survey data. If the original survey has a single variable coding the education information, simply copy it (gen educat_orig = survey_education_var). If the survey splits the respondents into different groups (commonly a question for people attending school and a different one for those no longer in education), then educat_orig should be made up of both variables. Figure 15 below shows the example from the 2022 Bangladesh LFS. The skip pattern from variable EDU_02 is not clear, but in the data people who are attending (EDU_02 code 01) answer question EDU_03 and skip EDU_04, while those who state that they attended in the past but are not currently enrolled (EDU_02 code 02) skip EDU_03 and answer EDU_04.
People who never attended school (EDU_02 code 03) skip both EDU_03 and EDU_04. In this case educat_orig should be coded as the union of the information of either group (i.e., educat_orig ought to contain the information of both EDU_03 and EDU_04).

Figure 15 - Education variables in the 2022 Bangladeshi LFS

In other cases, the information may be split into two sub-questions. Figure 16, below, shows the question on education from the 2021 Zimbabwean LFS. Here the information is split into two variables: a level (which we will refer to as ED3a) and a grade within that level (ED3b).

Figure 16 - Education question in the 2021 ZWE LFS

If it is possible to combine the information (e.g., by concatenating both variables) this should be tried and explicitly commented in the code. If it is not possible to confidently convey the information split over two (or more) variables, the harmonizer shall leave the educat_orig variable missing.

educat_isced

Code of the highest educational level attained as per the International Standard Classification of Education (ISCED). Note that the preamble to the harmonization code should record what version of ISCED is being used. Moreover, the code should always be as long as the longest depth available for the ISCED version. For example, the latest version at the time of writing (ISCED 2011) has up to three digits, where the first digit is the level, the first two digits are the category, and all three digits codify the sub-category. As an example, level 2 codifies “Lower secondary education”, 24 “Lower secondary general education”, and 242 “Sufficient for partial level completion, without direct access to upper secondary education”. Every code should be three digits long. If we only know the level (here 2) add two zeroes after it (here: 200). If we only have the category information (here 24) add a zero to reach three digits (here 240).
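The padding rule for educat_isced can be sketched in Stata as below. The raw variable isced_raw is hypothetical, and the sketch assumes the codes are held as numbers for ISCED levels 1 and above; level 0 codes would lose their leading zero in a numeric variable and are not handled here.

```stata
* Sketch: pad ISCED codes of one or two digits to the full three digits.
* Assumption: isced_raw (hypothetical) holds the ISCED 2011 code as a number.
clear
input int isced_raw
2
24
242
.
end

gen int educat_isced = isced_raw
* Only the level is known: add two zeroes (2 becomes 200)
replace educat_isced = isced_raw * 100 if inrange(isced_raw, 1, 9)
* Only the category is known: add one zero (24 becomes 240)
replace educat_isced = isced_raw * 10 if inrange(isced_raw, 10, 99)

* Every non-missing code is now three digits long
assert inrange(educat_isced, 100, 999) | missing(educat_isced)
assert educat_isced == 200 in 1
assert educat_isced == 240 in 2
assert educat_isced == 242 in 3
```

Missing values are left untouched because inrange() evaluates to false when its first argument is missing.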
Overview of Variables Module Code Variable name Variable label Notes Education ed_mod_age Education module minumum age Education school Currently in school Education literacy Individual can read and write Education educy Years of education No option for "Other", as opposed to I2D2, anything Education educat7 Level of education 7 categories not in these categories is to be set to missing Education educat5 Level of education 5 categories Education educat4 Level of education 4 categories Code if there is a single original education variable (as Education educat_orig Original education code is in most cases). If there are two or more variables, leave missing, make a note of it. Codes are for example: 2 Lower secondary education International Standard 24 General Education educat_isced Classification of Education (ISCED 242 Partial level completion, without direct A) access to upper secondary education Should be coded as 200, 240, and 242 respectively. Training vocational Codifies whether the person ever attended vocational training. The codes are: 0 = No 1 = Yes vocational_type Codifies whether the vocational training took place within the enterprise or was administered by an external party. The codes are: 1 = Inside Enterprise 2 = External vocational_length_l Codifies how long the training was in months. Divided into lower and upper in case the information is coded as a range (e.g., 0-3 months, 6-12 months, …). If it is an exact number, lower and upper length are equal. vocational_length_u see vocational_length_u vocational_field_orig Information on the field of training as stored originally in the survey. This variable is to be a string variable. If numeric convert to string while preserving its original structure. vocational_financed Text. 
The codes are:
1 = Employer
2 = Government
3 = Mixed Employer-Government
4 = Own funds
5 = Other

Overview of Variables

Module Code | Variable name | Variable label | Notes
Education | vocational | Ever received vocational training |
Education | vocational_type | Type of vocational training |
Education | vocational_length_l | Length of training, lower limit |
Education | vocational_length_u | Length of training, upper limit |
Education | vocational_field_orig | Original field of training information |
Education | vocational_financed | How training was financed | If funded with different sources, choose main source

Labour

minlaborage
This is the lowest age for which the labor module is implemented in the survey or the minimum working age in the country. For this reason, the lower age cutoff at which information is collected will vary from country to country.

Labour status, 7-day reference period

lstatus
lstatus is an individual’s labor status in the last 7 days. The value must be missing for individuals below the required age (minlaborage). Three categories are used after harmonization:
1 = Employed
2 = Unemployed
3 = Not-in-labor force

All persons are considered active in the labor force if they presently have a job (formal or informal, i.e., employed) or do not have a job but are actively seeking work (i.e., unemployed).

1 = Employed
Employed is defined as anyone who worked during the last 7 days or reference week, regardless of whether the employment was formal or informal, paid or unpaid, for a minimum of 1 hour. Individuals who had a job but for any reason did not work in the last 7 days are considered employed.

2 = Unemployed
A person is defined as unemployed if he or she is presently not working but is actively seeking a job. The formal definition of unemployed usually includes being ‘able to accept a job’ (i.e., passively seeking a job). This last question is not asked in all surveys but should be included if present.
A person presently not working but waiting for the start of a new job is considered unemployed.

3 = Not-in-labor force
A person is defined as not-in-labor force if he or she is presently not working and is not actively seeking a job during the last 7 days or reference week.

Over the past decade, survey methodologies have changed significantly in how they define employment, which directly affects labor status. This change stems from the deliberations of the International Conference of Labour Statisticians (ICLS), a forum for standardization in labor statistics convened by the International Labour Organization (ILO) every five years. During the 19th session of the ICLS in 2013, delegates passed a resolution on the categorization of work, employment, and labor underutilization. In essence, the ICLS-19 resolution redefines employment as work performed for others in exchange for pay or profit, excluding activities such as subsistence agriculture or self-housing construction, which were previously classified as employment.

This revision requires careful adjustment in how variables associated with labor status, such as lstatus, are coded. Harmonizers should identify the ICLS version underlying each survey by examining the questionnaire text, skip patterns, and survey reports. They must then code lstatus in accordance with that ICLS version (and record the version used in the icls_v variable). Each survey is harmonized independently, as explained in section Defining the boundaries of GLD harmonization. Therefore, differences arising from the use of other employment definitions in previous surveys should not impede the harmonization of the survey under consideration.
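The three lstatus categories described above can be sketched in Stata as follows. This is an illustration under stated assumptions, not a template: the raw variables worked_7d, has_job_absent, and seeking are hypothetical names, the cutoff of 15 years is made up for the example, and the sketch handles neither an availability question nor the differences between ICLS versions.

```stata
* Sketch only: hypothetical raw variables, not taken from any real survey.
* worked_7d      = 1 if the person worked at least one hour in the last 7 days
* has_job_absent = 1 if the person has a job but did not work in the last 7 days
* seeking        = 1 if the person actively sought work
clear
input age worked_7d has_job_absent seeking
30 1 0 0
25 0 1 0
40 0 0 1
50 0 0 0
10 0 0 0
end

gen byte minlaborage = 15   // assumed cutoff for this illustration

gen byte lstatus = .
replace lstatus = 1 if worked_7d == 1 | has_job_absent == 1
replace lstatus = 2 if missing(lstatus) & seeking == 1
replace lstatus = 3 if missing(lstatus) & seeking == 0
* The value must be missing below the minimum age of the labor module
replace lstatus = . if age < minlaborage

label define lbllstatus 1 "Employed" 2 "Unemployed" 3 "Not-in-labor force"
label values lstatus lbllstatus
list
```

Respondents with no information on job search remain missing rather than being assigned to category 3, which avoids classifying non-response as being out of the labor force.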
potential_lf
The ILO defines unemployment (as stated above) as seeking *and* being available for a job. The potential labour force is formed by the “available potential job seekers”, who are available but not looking for a job, and the “unavailable job seekers”, that is, those looking but not available. Figure 17 below (source here) shows the different definitions.

Figure 17 - Definition of different labor force status

The variable potential_lf thus codifies whether the person is not in the labour force over the past 7 days (lstatus=3, missing otherwise) but could potentially be, because they are i) available but not searching or ii) searching but not immediately available to work. The codes are:
0 = No (not potentially in the labour force)
1 = Yes

underemployment
Codifies whether the person is in the labour force and working over the past 7 days (lstatus=1, missing otherwise) but would take on more jobs or more hours at their job if possible/available. The codes are:
0 = No (not underemployed)
1 = Yes

nlfreason
nlfreason is the reason an individual was not in the labor force in the last 7 days. This variable is constructed for all those who are not presently employed and are not looking for work (lstatus=3) and is missing otherwise. Five categories after harmonization:
1 = Student (a person currently studying)
2 = Housekeeper (a person who takes care of the house, older people, or children)
3 = Retired
4 = Disabled (a person who cannot work due to physical conditions)
5 = Other (a person who does not work for any other reason)
Fill this information for all people interviewed in the labor section of the questionnaire regardless of their age.

unempldur_l
unempldur_l is a continuous variable specifying the duration of unemployment in months (lower bracket). The variable is constructed for all unemployed persons (lstatus=2, otherwise missing). If it is specified as continuous in the survey, it records the number of months in unemployment.
If the variable is categorical, it records the lower boundary of the bracket. Missing values are allowed for everyone who is not unemployed. Other missing values are also allowed.

unempldur_u
unempldur_u is a continuous variable specifying the duration of unemployment in months (upper bracket). The variable is constructed for all unemployed persons (lstatus=2, otherwise missing). If it is specified as continuous in the survey, it records the number of months in unemployment. If the variable is categorical, it records the upper boundary of the bracket. If the right bracket is open, the value should be set to missing. Missing values are allowed for everyone who is not unemployed. Other missing values are also allowed.

If the duration of unemployment is not reported as a range but as a continuous variable, the unempldur_l and unempldur_u variables will have the same value.

Primary Employment, 7-day reference period

empstat
empstat is a categorical variable that specifies the main employment status in the last 7 days of any individual with a job (lstatus=1) and is missing otherwise. The variable is constructed for all individuals. For this reason, the lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country. The definitions are taken from the International Labor Organization’s Classification of Status in Employment with some revisions to take into account the data available. Five categories after harmonization:
1 = Paid Employee
2 = Non-Paid Employee
3 = Employer
4 = Self-employed
5 = Other, workers not classifiable by status

1 = Paid Employee
Paid employee includes anyone whose basic remuneration is not directly dependent on the revenue of the unit they work for, typically remunerated by wages and salaries but may be paid for piece work or in-kind.
The ‘continuous’ criterion used in the ILO definition is not applied here, as the data are often absent and due to country specificity.

2 = Non-Paid Employee
Non-paid employee includes contributing family workers who hold a self-employment job in a market-oriented establishment operated by a related person living in the same household, and who cannot be regarded as a partner because their degree of commitment to the operation of the establishment, in terms of working time or other factors, is not at a level comparable to that of the head of the establishment. All apprentices should be mapped as ‘non-paid employee’.

3 = Employer
An employer is a business owner (whether alone or in partnership) with employees. If the only people working in the business are the owner and contributing family workers, the person is not considered an employer (as they have no employees) and is instead classified as self-employed.

4 = Self-employed
Own account or self-employment includes jobs where remuneration is directly dependent on the goods and services produced (where home consumption is considered to be part of the profits) and where one has not engaged any permanent employees to work for them on a continuous basis during the reference period. Members of producers’ cooperatives are workers who hold a self-employment job in a cooperative producing goods and services, in which each member takes part on an equal footing with other members in determining the organization of production, sales and/or other work of the establishment, the investments and the distribution of the proceeds of the establishment amongst the members.

5 = Other, workers not classifiable by status
Other, workers not classifiable by status include those for whom insufficient relevant information is available and/or who cannot be included in any of the above categories.

ocusec
ocusec is a categorical variable that specifies the sector of activity in the last 7 days.
It classifies the main job's sector of activity of any individual with a job (lstatus=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire. Four categories after harmonization:
1 = Public sector, Central Government, Army (including armed forces)
2 = Private, NGO
3 = State-owned
4 = Public or State-owned, but cannot distinguish

1 = Public Sector, Central Government, Army (including armed forces)
Public sector is the part of the economy run by the government.

2 = Private, NGO
Private sector is that part of the economy which is both run for private profit and is not controlled by the state. It also includes non-governmental organizations.

3 = State-owned enterprises
State-owned includes para-state firms and all others in which the government has control (participation over 50%).

4 = Public or State-owned, but cannot distinguish
Select this option if the questionnaire does not ask for State-owned enterprises, and only for Public sector.

Additionally, recall that, in common English usage, a public company (often denoted as public limited company or PLC) is a private company in the private sector whose ownership is organized via stocks tradeable in a public market, i.e., accessible to all. It is not run by the public sector. Figure 18 below is an example of the relevant question in the 2010 Pakistani LFS:

Figure 18 - Excerpt of PAK LFS questionnaire

Here code 4 represents a public enterprise (explicitly mentioned as a corporation created by a legislative body) and code 6 a public limited company (a private sector company whose shares can be bought by the general public). Code 6 should not be considered part of the public sector. Coding would then be (assuming the original question is called var_sector):

    gen ocusec = .
    replace ocusec = 1 if inrange(var_sector,1,3)
    replace ocusec = 2 if inrange(var_sector,5,9)
    replace ocusec = 3 if var_sector == 4
    replace ocusec = 4 if var_sector == 10

Notes: Do not code on the basis of occupation (ISCO) or industry (ISIC) codes.

industry_orig
industry_orig is a string variable that specifies the original industry codes in the last 7 days for the main job provided in the survey (the actual question) and should correspond to whatever is in the original file with no recoding. It will contain missing values for people below the working age. Other missing values are allowed. It classifies the main job of any individual with a job (lstatus=1) and is missing otherwise.

industrycat_isic
Code (string variable) of the industry according to the International Standard Industrial Classification (ISIC) in the last 7 days for the main job of any individual with a job (lstatus=1) and is missing otherwise. Note that the preamble to the harmonization code should record what version of ISIC is being used. The code should always be as long as the longest depth available for the ISIC version. For example, the latest version at the time of writing (ISIC Rev 4, available here) codes industries by sections, divisions, groups, and classes, in decreasing order of hierarchy. Figure 19 shows the classification structure for the manufacture of machinery and equipment. The letter C codes the Manufacturing section, while the code 28 represents the “Manufacture of machinery and equipment n.e.c.” division. This division has two groups (281 and 282), containing one and three classes respectively.

Figure 19 - Example of ISIC classification

A single section will often cover several divisions. While D has only one division (35, electricity, gas, steam and air conditioning supply) and could be potentially shortened to “3”, section C covers divisions 10 to 33.
If the information in the survey is only present at section level (or can only be translated from the national industry classification to section level), this variable should be a string with the letter coding the correct section.

In most cases, information will be coded as a set of digits. In this case, the codification should be a string of four digits, with a leading zero for divisions 1 through 9 (i.e., 01 to 09). As an example, Figure 20 puts together a few excerpts from ISIC Rev.4. Note that, if we do not codify correctly, group 14 (Animal production) may be mistaken for division 14 (Manufacture of wearing apparel).

If we only have information up to the group level, fill out the remainder of the digits with zeros. Hence the purple box in Figure 20 would be coded as “0140”. The red box, as we have all digits, including the zero padding at the start, is coded as “0142”. Adding zeroes to the end is standard if the lower level of the hierarchy has no further distinctions, as can be seen in the yellow box, where group 142 has no classes (or just one class) and thus is coded as “1420”. Some groups do in fact have several classes, as can be seen for group 151. Again, if we only had information up to group level, we ought to code “1510”. If we have more detailed information, for example identifying the industry as “Tanning and dressing of leather; dressing and dyeing of fur” (green box in Figure 20 below), we would code “1511”.

Figure 20 - Examples of different ISIC codes

industrycat10
industrycat10 is a categorical variable that specifies the 1-digit industry classification in the last 7 days for the main job of any individual with a job (lstatus=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire. The codes for the main job are given here based on the UN International Standard Industrial Classification.
Ten categories after harmonization:
1 = Agriculture, Hunting, Fishing, etc.
2 = Mining
3 = Manufacturing
4 = Public Utility Services
5 = Construction
6 = Commerce
7 = Transport and Communications
8 = Financial and Business Services
9 = Public Administration
10 = Other Services, Unspecified

Notes:
• In the case of different classifications (former Soviet Union republics, for example), recoding has been done to best match the ISIC codes.
• Category 10 is also assigned for unspecified categories or items.
• If all 10 categories cannot be identified in the questionnaire, create this variable as missing and proceed to create industrycat4.
• The ISIC versions have changed over the years. The original industrycat10 categories are largely based on ISIC Revision 2. Figure 21 below shows how to classify the different ISIC revision codes into industrycat10.

Figure 21 - Overview of coding of industrycat based on different ISIC revisions

industrycat4
industrycat4 is a categorical variable that specifies the 1-digit industry classification in the last 7 days for the main job for Broad Economic Activities. This variable is either created directly from the data (if industry classification does not exist for ten categories) or created from industrycat10. Four categories after harmonization:
1 = Agriculture
2 = Industry
3 = Services
4 = Other

occup_orig
occup_orig is a string variable that specifies the original occupation code in the last 7 days for the main job. This variable corresponds to whatever is in the original file with no recoding.
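As a hedged illustration of creating industrycat4 from industrycat10 (see the industrycat4 entry above), the sketch below collapses the ten categories into four. The grouping used here is an assumption made for illustration, since the text does not spell out the mapping: agriculture stays on its own, mining through construction form Industry, commerce through public administration form Services, and category 10 becomes Other.

```stata
* Sketch only: the grouping of categories is an assumption, not stated in the text.
clear
input industrycat10
1
3
6
10
.
end

gen byte industrycat4 = industrycat10
* 1 stays Agriculture; 2 to 5 become Industry; 6 to 9 become Services; 10 becomes Other
recode industrycat4 (1 = 1) (2/5 = 2) (6/9 = 3) (10 = 4)

label define lblindustrycat4 1 "Agriculture" 2 "Industry" 3 "Services" 4 "Other"
label values industrycat4 lblindustrycat4
list
```

Missing values in industrycat10 are carried over unchanged, so industrycat4 is missing wherever industrycat10 is.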
occup_isco
Code (string variable) of the occupation according to the International Standard Classification of Occupations (ISCO) in the last 7 days for the main job of any individual with a job (lstatus=1) and is missing otherwise. Note that the preamble to the harmonization code should record what version of ISCO is being used. The code should always be as long as the longest depth available for the ISCO version. For example, the latest version at the time of writing (ISCO-08, available here) codes occupations by Major, Sub-major, Minor, and Unit groups, in decreasing order of hierarchy. ISCO Major groups are coded with a single digit, running from 1 (Managers) to 9 (Elementary Occupations), with the additional category 0 (Armed Forces Occupations). Hence zero-padding on the left side is needed only for Armed Forces Occupations. Figure 22 shows an example of the possible values that can be taken on.

Figure 22 - Example of values for ISCO-08

If we only had information at Major Group level, a person working as a Services and Sales Worker ought to be coded as 5000. If the information is at Sub-major Group level, it should be codified as 5100 for a Personal Services Worker, while a Travel Attendant, a Conductor, or a Guide (if information is at Minor Group level) should be coded as 5110. Information at the Unit Group level can be coded as is, since it is already at the maximum possible depth.

occup_skill
Categorical code for the broad skill level of workers at the main job in the last 7 days of any individual with a job (lstatus=1) and is missing otherwise. It follows from the ISCO classification as shown in Figure 23.

Figure 23 - ISCO broad skill level classification

Thus, the codes are:
3 = High
2 = Medium
1 = Low
. = Armed Forces and not elsewhere classified

occup
occup is a categorical variable that specifies the 1-digit occupational classification for the main job in the last 7 days of any individual with a job (lstatus=1) and is missing otherwise. This variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country. Most surveys collect detailed information and then code it, without keeping the original data. No attempt has been made to correct or check the original coding. The classification is based on the International Standard Classification of Occupations (ISCO). Eleven categories after harmonization:
1 = Managers
2 = Professionals
3 = Technicians and associate professionals
4 = Clerical support workers
5 = Service and sales workers
6 = Skilled agricultural, forestry and fishery workers
7 = Craft and related trades workers
8 = Plant and machine operators, and assemblers
9 = Elementary occupations
10 = Armed forces occupations
99 = Other/unspecified

wage_no_compen
wage_no_compen is a continuous variable that specifies the last wage payment in local currency of any employed individual (lstatus=1) in their primary occupation at the reference period reported in the survey and is missing otherwise. The wage should come from the main job, in other words, the job to which the person dedicated most time in the week preceding the survey. This excludes tips, bonuses, other compensation such as dwellings or clothes, and other payments. The variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) will vary from country to country.
Notes:
• For all those with self-employment or owners of own businesses, this should be net revenues (net of all costs EXCEPT for tax payments) or the amount of salary taken from the business. Due to the almost complete lack of information on taxes, the wage from main job is NOT net of taxes.
• By definition, non-paid employees (empstat=2) should have wage_no_compen=0.
• The reference period of wage_no_compen will be recorded in the unitwage variable.

unitwage
unitwage is a categorical variable that specifies the time reference for the wage_no_compen variable. It specifies the time unit measurement for the wages of any individual (lstatus=1 & empstat=1) and is missing otherwise. Acceptable values include:
1 = Daily
2 = Weekly
3 = Every two weeks
4 = Every two months
5 = Monthly
6 = Quarterly
7 = Every six months
8 = Annually
9 = Hourly
10 = Other

whours
whours is a continuous variable that specifies the hours of work last week for the main job of any individual with a job (lstatus=1) and is missing otherwise. The main job is defined as the occupation that the person dedicated the most time to over the past week. The variable is constructed for all persons administered this module in each questionnaire.

Notes:
• If the respondent was absent from the job in the week preceding the survey due to holidays, vacation, or sick leave, then record the time worked in the most recent 7 days in which the person worked.
• Sometimes the questions are phrased as “on average, how many hours a week do you work?”.
• For individuals who only give information on how many hours they work per day and no information on the number of days worked per week, multiply the hours by 5 days.
• In the case of a question that asks for hours worked per month, divide by 4.3 to get weekly hours.

wmonths
wmonths is a continuous variable that specifies the number of months worked in the last 12 months for the main job of any individual with a job (lstatus=1) and is missing otherwise.
The main job is defined as the occupation that the person dedicated the most time to over the past week. The variable is constructed for all persons administered this module in each questionnaire.

wage_total
wage_total is a continuous variable that specifies the annualized wage payment (regular wage plus bonuses, in-kind, compensation, etc.) for the primary occupation in local currency of any individual (lstatus=1 & empstat=1) and is missing otherwise. The wage should come from the main job, in other words, the job to which the person dedicated most time in the week preceding the survey. This wage includes tips, compensations such as bonuses, dwellings or clothes, and other payments. wage_total should be equal to wage_no_compen in case there are no bonuses, tips etc. offered as part of the job. The variable is constructed for all persons administered this module in each questionnaire.

The annualization of wage_total should consider the number of months/weeks the person has been working and receiving this income. Harmonizers should not assume the person has been working the whole year. Box 4 shows the creation of wage_total when there are no bonuses or other compensations.

Box 4 - Example of wage_total creation

    gen double wage_total=.
    replace wage_total=(wage_no_compen*5*4.3)*wmonths if unitwage==1 //Wage in daily unit
    replace wage_total=(wage_no_compen*4.3)*wmonths if unitwage==2 //Wage in weekly unit
    replace wage_total=(wage_no_compen*2.15)*wmonths if unitwage==3 //Wage in every two weeks unit
    replace wage_total=(wage_no_compen)/2*wmonths if unitwage==4 //Wage in every two months unit
    replace wage_total=(wage_no_compen)*wmonths if unitwage==5 //Wage in monthly unit
    replace wage_total=(wage_no_compen)/3*wmonths if unitwage==6 //Wage in quarterly unit
    replace wage_total=(wage_no_compen)/6*wmonths if unitwage==7 //Wage in every six months unit
    replace wage_total=(wage_no_compen)/12*wmonths if unitwage==8 //Wage in annual unit
    replace wage_total=(wage_no_compen*whours*4.3)*wmonths if unitwage==9 //Wage in hourly unit

Note: Use gross wages when available and net wages only when gross wages are not available. This is done to make it easy to compare earnings in formal and informal sectors.

contract
contract is a dummy variable that classifies the contract status (yes/no) of any individual with a job (lstatus=1) and is missing otherwise. It indicates whether a person has a signed (formal and written, not verbal) contract, regardless of duration. The variable is constructed for all persons administered this module in each questionnaire. Two categories after harmonization:
0 = No
1 = Yes

healthins
healthins is a dummy variable that classifies the health insurance status (yes/no) of any individual with a job (lstatus=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire. However, this variable is only constructed if there is an explicit question about health insurance provided by the job. Two categories after harmonization:
0 = No
1 = Yes

socialsec
socialsec is a dummy variable that classifies the social security status (yes/no) of any individual with a job (lstatus=1) and is missing otherwise.
The variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country. However, this variable is only constructed if there is an explicit question about pension plans or social security. Two categories after harmonization:
0 = No
1 = Yes

union
union is a dummy variable that classifies the union membership status (yes/no) of any individual with a job (lstatus=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country. However, this variable is only constructed if there is an explicit question about trade unions. Two categories after harmonization:
0 = No
1 = Yes

firmsize_l
firmsize_l specifies the lower bracket of the firm size. The variable is constructed for all persons who are employed in the last 7 days for the main job. If it is continuous, it records the number of people working for the same employer. If the variable is categorical, it records the lower boundary of the bracket.

firmsize_u
firmsize_u specifies the upper bracket of the firm size. The variable is constructed for all persons who are employed in the last 7 days for the main job. If it is continuous, it records the number of people working for the same employer. If the variable is categorical, it records the upper boundary of the bracket. If the right bracket is open, this variable should be missing.

Secondary Employment, 7-day reference period

empstat_2
empstat_2 is a categorical variable that specifies the employment status of the secondary job, with reference period of the last 7 days, of any individual with a job (lstatus=1) and is missing otherwise. The variable is constructed for all individuals.
For this reason, the lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country. The definitions are taken from the International Labor Organization’s Classification of Status in Employment with some revisions to take into account the data available. Five categories after harmonization:
1 = Paid Employee
2 = Non-Paid Employee
3 = Employer
4 = Self-employed
5 = Other, workers not classifiable by status

1 = Paid Employee
Paid employee includes anyone whose basic remuneration is not directly dependent on the revenue of the unit they work for, typically remunerated by wages and salaries but may be paid for piece work or in-kind. The ‘continuous’ criterion used in the ILO definition is not applied here, as the data are often absent and due to country specificity.

2 = Non-Paid Employee
Non-paid employee includes contributing family workers who hold a self-employment job in a market-oriented establishment operated by a related person living in the same household, and who cannot be regarded as a partner because their degree of commitment to the operation of the establishment, in terms of working time or other factors, is not at a level comparable to that of the head of the establishment. All apprentices should be mapped as non-paid employee.

3 = Employer
An employer is a business owner (whether alone or in partnership) with employees. If the only people working in the business are the owner and contributing family workers, the person is not considered an employer (as they have no employees) and is instead classified as self-employed (own account).

4 = Self-employed
Own account or self-employment includes jobs where remuneration is directly dependent on the goods and services produced (where home consumption is considered to be part of the profits) and where one has not engaged any permanent employees to work for them on a continuous basis during the reference period.
Members of producers’ cooperatives are workers who hold a self-employment job in a cooperative producing goods and services, in which each member takes part on an equal footing with other members in determining the organization of production, sales and/or other work of the establishment, the investments and the distribution of the proceeds of the establishment amongst the members.

5 = Other, workers not classifiable by status
Other, workers not classifiable by status include those for whom insufficient relevant information is available and/or who cannot be included in any of the above categories.

ocusec_2
ocusec_2 is a categorical variable that specifies the sector of activity in the last 7 days. It classifies the secondary job's sector of activity of any individual with a job (lstatus=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire. Four categories after harmonization:
1 = Public sector, Central Government, Army (including armed forces)
2 = Private, NGO
3 = State-owned
4 = Public or State-owned, but cannot distinguish

1 = Public Sector, Central Government, Army (including armed forces)
Public sector is the part of the economy run by the government.

2 = Private, NGO
Private sector is that part of the economy which is both run for private profit and is not controlled by the state. It also includes non-governmental organizations.

3 = State-owned enterprises
State-owned includes para-state firms and all others in which the government has control (participation over 50%).

4 = Public or State-owned, but cannot distinguish
Select this option if the questionnaire does not ask for State-owned enterprises, and only for Public sector.

Notes: Do not code on the basis of occupation (ISCO) or industry (ISIC) codes.
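Most variables in this section are defined only for individuals with a job (lstatus=1) and are missing otherwise. One possible way to verify this after harmonization is sketched below for ocusec_2; the toy data are invented for the example, and the same pattern could be repeated for other employment variables.

```stata
* Sketch only: toy data invented to illustrate the "missing otherwise" rule.
clear
input lstatus ocusec_2
1 2
1 .
2 .
3 .
end

* Stops with an error if anyone who is not employed has ocusec_2 filled in.
* Observations with missing lstatus are also covered, since missing != 1 is true.
assert missing(ocusec_2) if lstatus != 1
```

Employed persons may still have a missing value (for example, when they hold no second job), so the check is deliberately one-directional.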
industry_orig_2
industry_orig_2 is a string variable that specifies the original industry codes for the second job with reference period of the last 7 days and should correspond to whatever is in the original file with no recoding. It will contain missing values for people below the working age. Other missing values are allowed. It classifies the second job of any individual with a job (lstatus=1) and is missing otherwise.

industrycat_isic_2
Code (string variable) of the industry according to the International Standard Industrial Classification (ISIC) in the last 7 days for the second job of any individual with a job (lstatus=1) and is missing otherwise. See industrycat_isic for the details.

industrycat10_2
industrycat10_2 is a categorical variable that specifies the 1-digit industry classification of the second job, with reference period of the last 7 days, of any individual with a job (lstatus=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire. The codes for the second job are given here based on the UN International Standard Industrial Classification. Ten categories after harmonization:
1 = Agriculture, Hunting, Fishing, etc.
2 = Mining
3 = Manufacturing
4 = Public Utility Services
5 = Construction
6 = Commerce
7 = Transport and Communications
8 = Financial and Business Services
9 = Public Administration
10 = Other Services, Unspecified

Notes:
• In the case of different classifications (former Soviet Union republics, for example), recoding has been done to best match the ISIC codes.
• Category 10 is also assigned for unspecified categories or items.
• For details on how to code different ISIC versions to industrycat10_2 see the industrycat10 entry.

industrycat4_2
industrycat4_2 is a categorical variable that specifies the 1-digit industry classification for Broad Economic Activities for the second job with reference period of the last 7 days.
This variable is either created directly from the data (if industry classification does not exist for 10 categories) or created from industrycat10_2.

Four categories after harmonization:
1 = Agriculture
2 = Industry
3 = Services
4 = Other

occup_orig_2
occup_orig_2 is a string variable that specifies the original occupation code in the last 7 days for the secondary job. This variable corresponds to whatever is in the original file with no recoding.

occup_isco_2
Code (string variable) of the occupation according to the International Standard Classification of Occupations (ISCO) in the last 7 days for the second job of any individual with a job (lstatus=1) and is missing otherwise. See occup_isco for the details.

occup_skill_2
Categorical code for the broad skill level of workers at the second job in the last 7 days of any individual with a job (lstatus=1) and is missing otherwise. See occup_skill for details.

occup_2
occup_2 is a categorical variable that specifies the 1-digit occupation classification. It classifies the second job of any individual with a job (lstatus=1) and is missing otherwise. This variable is constructed for all persons administered this module in each questionnaire. Most surveys collect detailed information and then code it, without keeping the original data. No attempt has been made to correct or check the original coding. The classification is based on the International Standard Classification of Occupations (ISCO). In the case of different classifications, recoding has been done to best match the ISCO.
Eleven categories after harmonization:
1 = Managers
2 = Professionals
3 = Technicians and associate professionals
4 = Clerical support workers
5 = Service and sales workers
6 = Skilled agricultural, forestry and fishery workers
7 = Craft and related trades workers
8 = Plant and machine operators, and assemblers
9 = Elementary occupations
10 = Armed forces occupations
99 = Other/unspecified

wage_no_compen_2
wage_no_compen_2 is a continuous variable that specifies the last wage payment in local currency of any individual (lstatus=1 & empstat=1) in their secondary occupation and is missing otherwise. The wage should come from the second job, in other words, the job to which the person dedicated the second-most time in the week preceding the survey. This excludes tips, bonuses, other compensation such as dwellings or clothes, and other payments. The variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) will vary from country to country.

Notes:
• For all those with self-employment or owners of own businesses, this should be net revenues (net of all costs EXCEPT for tax payments) or the amount of salary taken from the business. Due to the almost complete lack of information on taxes, the wage from the second job is NOT net of taxes.
• By definition, non-paid employees (empstat_2=2) should have wage=0.
• The reference period of wage_no_compen_2 will be recorded in the unitwage_2 variable.
• Use gross wages when available and net wages only when gross wages are not available. This is done to make it easy to compare earnings in formal and informal sectors.

unitwage_2
unitwage_2 is a categorical variable that specifies the time reference for the wage_no_compen_2 variable. It specifies the time unit measurement for the wages for the secondary job of any individual (lstatus=1 & empstat=1) and is missing otherwise.
Ten categories after harmonization:
1 = Daily
2 = Weekly
3 = Every two weeks
4 = Every two months
5 = Monthly
6 = Quarterly
7 = Every six months
8 = Annually
9 = Hourly
10 = Other

whours_2
whours_2 is a continuous variable that specifies the hours of work in the last week for the second job, with a reference period of the last 7 days, of any individual with a job (lstatus=1) and is missing otherwise. The second job is defined as the occupation to which the person dedicated the second-most time over the past week. The variable is constructed for all persons administered this module in each questionnaire. The lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country.

Notes:
• If the respondent was absent from the job in the week preceding the survey due to holidays, vacation, or sick leave, then record the hours worked in the most recent 7 days in which the person worked.
• Sometimes the questions are phrased as, “on average, how many hours a week do you work?”.
• For individuals who only give information on how many hours they work per day and no information on the number of days worked a week, multiply the hours by 5 days.
• In the case of a question that has hours worked per month, divide by 4.3 to get weekly hours.

wmonths_2
wmonths_2 is a continuous variable that specifies the number of months worked in the last 12 months for the secondary job of any individual with a job (lstatus=1) and is missing otherwise. The secondary job is defined as the occupation to which the person dedicated less time than the primary job over the past week. The variable is constructed for all persons administered this module in each questionnaire.

wage_total_2
wage_total_2 is a continuous variable that specifies the annualized wage payment (regular wage plus bonuses, in-kind, compensation, etc.) in local currency of any individual (lstatus=1 & empstat=1) in their secondary occupation and is missing otherwise.
The wage should come from the secondary job, in other words, the job to which the person dedicated the second-most time in the week preceding the survey. This wage includes tips, compensations such as bonuses, dwellings or clothes, and other payments. wage_total_2 should be equal to wage_no_compen_2 in case there are no bonuses, tips etc. offered as part of the job. The variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) will vary from country to country.

Notes:
• The annualization of wage_total_2 should consider the number of months/weeks the person has been working and receiving this income. The harmonizer should not assume the person has been working the whole year.
• For an example of how to annualize the wage see 5.2.2 Primary Employment last 7-days variable wage_total (Example: Creation of wage_total when there are no bonuses nor other compensations).

firmsize_l_2
firmsize_l_2 specifies the lower bracket of the firm size. The variable is constructed for all persons who are employed. If it is continuous, it records the number of people working for the same employer. If the variable is categorical, it records the lower boundary of the bracket.

firmsize_u_2
firmsize_u_2 specifies the upper bracket of the firm size. The variable is constructed for all persons who are employed. If it is continuous, it records the number of people working for the same employer. If the variable is categorical, it records the upper boundary of the bracket. If the right bracket is open, a missing value should be inputted.

Other Employment, 7-day reference period

t_hours_others
t_hours_others is a continuous variable that specifies the hours of work in the last 12 months in all jobs excluding the primary and secondary ones.

t_wage_nocompen_others
t_wage_nocompen_others is a continuous variable that specifies the annualized wage in all jobs excluding the primary and secondary ones.
This excludes tips, bonuses, other compensation such as dwellings or clothes, and other payments.

Note: Use gross wages when available and net wages only when gross wages are not available. This is done to make it easy to compare earnings in formal and informal sectors.

t_wage_others
t_wage_others is a continuous variable that specifies the annualized wage in all jobs excluding the primary and secondary ones. This wage includes tips, compensations such as bonuses, dwellings or clothes, and other payments. t_wage_others should be equal to t_wage_nocompen_others in case there are no bonuses, tips etc. offered as part of any of the jobs.

Total Employment Earnings, 7-day reference period

t_hours_total
t_hours_total is a continuous variable that specifies the hours of work in the last 12 months in all jobs including primary, secondary and others.

t_wage_nocompen_total
t_wage_nocompen_total is a continuous variable that specifies the total annualized wage income in all jobs including primary, secondary and others. This excludes tips, bonuses, other compensation such as dwellings or clothes, and other payments.

Note: Use gross wages when available and net wages only when gross wages are not available. This is done to make it easy to compare earnings in formal and informal sectors.

t_wage_total
t_wage_total is a continuous variable that specifies the total annualized wage income in all jobs including primary, secondary and others. This income includes tips, compensations such as bonuses, dwellings or clothes, and other payments. t_wage_total should be equal to t_wage_nocompen_total in case there are no bonuses, tips etc. offered as part of any of the jobs. If the number of months worked in a job is missing, the harmonizer could assume that the person worked the whole year in that job.

Labor status, 12-month reference period

lstatus_year
lstatus_year is an individual’s labor status in the last 12 months.
The value must be missing for individuals below the required age (minlaborage).

Three categories are used after harmonization:
1 = Employed
2 = Unemployed
3 = Not-in-labor force

All persons are considered active in the labor force if they presently have a job (formal or informal, i.e., employed) or do not have a job but are actively seeking work (i.e., unemployed).

1 = Employed
Employed is defined as anyone who worked during the last 12 months or reference week, regardless of whether the employment was formal or informal, paid or unpaid, for a minimum of 1 hour. Individuals who had a job, but for any reason did not work in the last 7 days are considered employed.

2 = Unemployed
A person is defined as unemployed if he or she is presently not working but is actively seeking a job. The formal definition of unemployed usually includes being ‘able to accept a job.’ This last question was asked in a minority of surveys and is, thus, not incorporated in the present definition. A person presently not working but waiting for the start of a new job is considered unemployed.

3 = Not-in-labor force
A person is defined as not-in-labor force if he or she is presently not working and was not actively seeking a job during the last 12 months or reference week.

potential_lf_year
Codifies whether the person is not in the labour force over the past year (lstatus_year=3, missing otherwise) but could potentially be, because they are (i) available but not searching or (ii) searching but not immediately available to work. The codes are:
0 = No (not potentially in the labour force)
1 = Yes

underemployment_year
Codifies whether the person was in the labour force and working in the past 12 months (lstatus_year=1, missing otherwise) but would take on more jobs or more hours at their job if possible/available. The codes are:
0 = No (not underemployed)
1 = Yes

nlfreason_year
nlfreason_year is the reason an individual was not in the labor force in the last 12 months.
This variable is constructed for all those who are not presently employed and are not looking for work (lstatus_year=3) and is missing otherwise.

Five categories after harmonization:
1 = Student (a person currently studying)
2 = Housewife (a person who takes care of the house, older people, or children)
3 = Retired
4 = Disabled (a person who cannot work due to physical conditions)
5 = Other (a person who does not work for any other reason)

Do not put missing values for people below the working age, employed, and unemployed. Other missing values are allowed.

unempldur_l_year
unempldur_l_year is a continuous variable specifying the duration of unemployment in months (lower bracket). The variable is constructed for all unemployed persons (lstatus_year=2, otherwise missing). If it is specified as continuous in the survey, it records the number of months in unemployment. If the variable is categorical, it records the lower boundary of the bracket. Missing values are allowed for everyone who is not unemployed. Other missing values are also allowed.

unempldur_u_year
unempldur_u_year is a continuous variable specifying the duration of unemployment in months (upper bracket). The variable is constructed for all unemployed persons (lstatus_year=2, otherwise missing). If it is specified as continuous in the survey, it records the number of months in unemployment. If the variable is categorical, it records the upper boundary of the bracket. If the right bracket is open, a missing value should be inputted. Missing values are allowed for everyone who is not unemployed. Other missing values are also allowed.

If the duration of unemployment is not reported as a range but as a continuous variable, the unempldur_l_year and unempldur_u_year variables will have the same value. If the high range is open-ended, the unempldur_u_year variable will be missing.
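The bracket rules for unempldur_l_year and unempldur_u_year can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, its arguments, and the use of None to represent a missing value are hypothetical and are not taken from the GLD harmonization code.

```python
from typing import Optional, Tuple

Bounds = Tuple[Optional[float], Optional[float]]


def unempldur_bounds(lstatus_year: Optional[int],
                     months: Optional[float] = None,
                     bracket: Optional[Bounds] = None) -> Bounds:
    """Return (unempldur_l_year, unempldur_u_year) for one person.

    months  -- duration reported as a continuous number of months, if any
    bracket -- (lower, upper) in months for a categorical answer;
               upper is None when the top bracket is open-ended
    None stands for a missing value (an assumption of this sketch).
    """
    # Both variables are only constructed for the unemployed.
    if lstatus_year != 2:
        return None, None
    # Continuous answer: lower and upper bound take the same value.
    if months is not None:
        return months, months
    # Categorical answer: record the bracket boundaries; an open upper
    # boundary stays missing.
    if bracket is not None:
        lower, upper = bracket
        return lower, upper
    return None, None
```

For example, a continuous answer of 7 months gives (7, 7), a bracket of 6 to 12 months gives (6, 12), and an open-ended bracket such as "12 months or more" gives (12, None).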
Primary Employment, 12-month reference period

empstat_year
empstat_year is a categorical variable that specifies the main employment status in the last 12 months of any individual with a job (lstatus_year=1) and is missing otherwise. The variable is constructed for all individuals. For this reason, the lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country. The definitions are taken from the International Labor Organization’s Classification of Status in Employment with some revisions to take into account the data available.

Five categories after harmonization:
1 = Paid Employee
2 = Non-Paid Employee
3 = Employer
4 = Self-employed
5 = Other, workers not classifiable by status

1 = Paid Employee
Paid employee includes anyone whose basic remuneration is not directly dependent on the revenue of the unit they work for, typically remunerated by wages and salaries but may be paid for piece work or in-kind. The ‘continuous’ criterion used in the ILO definition is not used here as data are often absent and due to country specificity.

2 = Non-Paid Employee
Non-paid employee includes contributing family workers who hold a self-employment job in a market-oriented establishment operated by a related person living in the same household, and who cannot be regarded as a partner because their degree of commitment to the operation of the establishment, in terms of working time or other factors, is not at a level comparable to that of the head of the establishment. All apprentices should be mapped as ‘non-paid employee’.

3 = Employer
An employer is a business owner (whether alone or in partnership) with employees. If the only people working in the business are the owner and contributing family workers, the person is not considered an employer (as they have no employees) and is instead classified as self-employed.
4 = Self-employed
Own account or self-employment includes jobs where remuneration is directly dependent on the goods and services produced (where home consumption is considered to be part of the profits) and where one has not engaged any permanent employees to work for them on a continuous basis during the reference period.

Members of producers’ cooperatives are workers who hold a self-employment job in a cooperative producing goods and services, in which each member takes part on an equal footing with other members in determining the organization of production, sales and/or other work of the establishment, the investments and the distribution of the proceeds of the establishment amongst the members.

5 = Other, workers not classifiable by status
Other, workers not classifiable by status include those for whom insufficient relevant information is available and/or who cannot be included in any of the above categories.

ocusec_year
ocusec_year is a categorical variable that specifies the sector of activity in the last 12 months. It classifies the main job's sector of activity of any individual with a job (lstatus_year=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire.

Four categories after harmonization:
1 = Public sector, Central Government, Army (including armed forces)
2 = Private, NGO
3 = State-owned
4 = Public or State-owned, but cannot distinguish

1 = Public Sector, Central Government, Army (including armed forces)
The public sector is the part of the economy run by the government.

2 = Private, NGO
The private sector is the part of the economy that is run for private profit and is not controlled by the state. It also includes non-governmental organizations.

3 = State-owned enterprises
State-owned includes para-state firms and all others in which the government has control (participation over 50%).
4 = Public or State-owned, but cannot distinguish
Select this option if the questionnaire does not ask about State-owned enterprises, and only about the Public sector.

Notes: Do not code on the basis of occupation (ISCO) or industry (ISIC) codes.

industry_orig_year
industry_orig_year is a string variable that specifies the original industry codes in the last 12 months for the main job provided in the survey (the actual question) and should correspond to whatever is in the original file with no recoding. It will contain missing values for people below the working age. Other missing values are allowed. It classifies the main job of any individual with a job (lstatus_year=1) and is missing otherwise.

industrycat_isic_year
Code (string variable) of the industry according to the International Standard Industrial Classification (ISIC) in the last 12 months for the main job of any individual with a job (lstatus_year=1) and is missing otherwise. See industrycat_isic for the details.

industrycat10_year
industrycat10_year is a categorical variable that specifies the 1-digit industry classification in the last 12 months for the main job of any individual with a job (lstatus_year=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire. The codes for the main job are given here based on the UN International Standard Industrial Classification.

Ten categories after harmonization:
1 = Agriculture, Hunting, Fishing, etc.
2 = Mining
3 = Manufacturing
4 = Public Utility Services
5 = Construction
6 = Commerce
7 = Transport and Communications
8 = Financial and Business Services
9 = Public Administration
10 = Other Services, Unspecified

Notes:
• In the case of different classifications (former Soviet Union republics, for example), recoding has been done to best match the ISIC codes.
• Category 10 is also assigned for unspecified categories or items.
• If all 10 categories cannot be identified in the questionnaire, create this variable as missing and proceed to create industrycat4_year.
• For details on how to code different ISIC versions to industrycat10_year see the industrycat10 entry.

industrycat4_year
industrycat4_year is a categorical variable that specifies the 1-digit industry classification in the last 12 months for the main job for Broad Economic Activities. This variable is either created directly from the data (if industry classification does not exist for ten categories) or created from industrycat10_year.

Four categories after harmonization:
1 = Agriculture
2 = Industry
3 = Services
4 = Other

occup_orig_year
occup_orig_year is a string variable that specifies the original occupation code in the last 12 months for the main job. This variable corresponds to whatever is in the original file with no recoding.

occup_isco_year
Code (string variable) of the occupation according to the International Standard Classification of Occupations (ISCO) in the last 12 months for the main job of any individual with a job (lstatus_year=1) and is missing otherwise. See occup_isco for the details.

occup_skill_year
Categorical code for the broad skill level of workers at the main job in the last 12 months of any individual with a job (lstatus_year=1) and is missing otherwise. See occup_skill for details.

occup_year
occup_year is a categorical variable that specifies the 1-digit occupational classification for the main job in the last 12 months of any individual with a job (lstatus_year=1) and is missing otherwise. This variable is constructed for all persons administered this module in each questionnaire.
For this reason, the lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country. Most surveys collect detailed information and then code it, without keeping the original data. No attempt has been made to correct or check the original coding. The classification is based on the International Standard Classification of Occupations (ISCO).

Eleven categories after harmonization:
1 = Managers
2 = Professionals
3 = Technicians and associate professionals
4 = Clerical support workers
5 = Service and sales workers
6 = Skilled agricultural, forestry and fishery workers
7 = Craft and related trades workers
8 = Plant and machine operators, and assemblers
9 = Elementary occupations
10 = Armed forces occupations
99 = Other/unspecified

wage_no_compen_year
wage_no_compen_year is a continuous variable that specifies the last wage payment in local currency of any individual (lstatus_year=1 & empstat_year=1) in their primary occupation at the reference period reported in the survey and is missing otherwise. The wage should come from the main job, in other words, the job to which the person dedicated the most time in the 12 months preceding the survey. This excludes tips, bonuses, other compensation such as dwellings or clothes, and other payments. The variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) will vary from country to country.

Notes:
• For all those with self-employment or owners of own businesses, this should be net revenues (net of all costs EXCEPT for tax payments) or the amount of salary taken from the business. Due to the almost complete lack of information on taxes, the wage from the main job is NOT net of taxes.
• By definition, non-paid employees (empstat_year=2) should have wage=0.
• The reference period of wage_no_compen_year will be recorded in the unitwage_year variable.
• Use gross wages when available and net wages only when gross wages are not available. This is done to make it easy to compare earnings in formal and informal sectors.

unitwage_year
unitwage_year is a categorical variable that specifies the time reference for the wage_no_compen_year variable. It specifies the time unit measurement for the wages of any individual (lstatus_year=1 & empstat_year=1) and is missing otherwise.

Acceptable values include:
1 = Daily
2 = Weekly
3 = Every two weeks
4 = Every two months
5 = Monthly
6 = Quarterly
7 = Every six months
8 = Annually
9 = Hourly
10 = Other

whours_year
whours_year is a continuous variable that specifies the hours of work last week for the main job of any individual with a job (lstatus_year=1) and is missing otherwise. The main job is defined as the occupation to which the person dedicated the most time over the past 12 months. The variable is constructed for all persons administered this module in each questionnaire.

Notes:
• Sometimes the questions are phrased as, “on average, how many hours a week do you work?”.
• For individuals who only give information on how many hours they work per day and no information on the number of days worked a week, multiply the hours by 5 days.
• In the case of a question that has hours worked per month, divide by 4.3 to get weekly hours.

wmonths_year
wmonths_year is a continuous variable that specifies the number of months worked in the last 12 months for the main job of any individual with a job (lstatus_year=1) and is missing otherwise. The main job is defined as the occupation to which the person dedicated the most time over the past 12 months. The variable is constructed for all persons administered this module in each questionnaire.

wage_total_year
wage_total_year is a continuous variable that specifies the annualized wage payment (regular wage plus bonuses, in-kind, compensation, etc.)
for the primary occupation in local currency of any individual (lstatus_year=1 & empstat_year=1) and is missing otherwise. The wage should come from the main job, in other words, the job to which the person dedicated the most time in the year preceding the survey. This wage includes tips, compensations such as bonuses, dwellings or clothes, and other payments. wage_total_year should be equal to wage_no_compen_year in case there are no bonuses, tips etc. offered as part of the job. The variable is constructed for all persons administered this module in each questionnaire. The annualization of wage_total_year should consider the number of months/weeks the person has been working and receiving this income. The harmonizer should not assume the person has been working the whole year. For an example of how to annualize the wage see 5.2.2 Primary Employment last 7-days variable wage_total (Example: Creation of wage_total when there are no bonuses nor other compensations).

contract_year
contract_year is a dummy variable that classifies the contract status (yes/no) of any individual with a job (lstatus_year=1) and is missing otherwise. It indicates whether a person has a signed (formal) contract, regardless of duration. The variable is constructed for all persons administered this module in each questionnaire.

Two categories after harmonization:
0 = No
1 = Yes

healthins_year
healthins_year is a dummy variable that classifies the health insurance status (yes/no) of any individual with a job (lstatus_year=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire. However, this variable is only constructed if there is an explicit question about health insurance provided by the job.

Two categories after harmonization:
0 = No
1 = Yes

socialsec_year
socialsec_year is a dummy variable that classifies the social security status (yes/no) of any individual with a job (lstatus_year=1) and is missing otherwise.
The variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country. However, this variable is only constructed if there is an explicit question about pension plans or social security.

Two categories after harmonization:
0 = No
1 = Yes

union_year
union_year is a dummy variable that classifies the union membership status (yes/no) of any individual with a job (lstatus_year=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country. However, this variable is only constructed if there is an explicit question about trade unions.

Two categories after harmonization:
0 = No
1 = Yes

firmsize_l_year
firmsize_l_year specifies the lower bracket of the firm size. The variable is constructed for all persons who were employed in the last 12 months, for the main job. If it is continuous, it records the number of people working for the same employer. If the variable is categorical, it records the lower boundary of the bracket.

firmsize_u_year
firmsize_u_year specifies the upper bracket of the firm size. The variable is constructed for all persons who were employed in the last 12 months, for the main job. If it is continuous, it records the number of people working for the same employer. If the variable is categorical, it records the upper boundary of the bracket. If the right bracket is open, this variable should be missing.

Secondary Employment, 12-month reference period

empstat_2_year
empstat_2_year is a categorical variable that specifies the employment status of the secondary job, with a reference period of the last 12 months, of any individual with a job (lstatus_year=1) and is missing otherwise. The variable is constructed for all individuals.
For this reason, the lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country. The definitions are taken from the International Labor Organization’s Classification of Status in Employment with some revisions to take into account the data available.

Five categories after harmonization:
1 = Paid Employee
2 = Non-Paid Employee
3 = Employer
4 = Self-employed
5 = Other, workers not classifiable by status

1 = Paid Employee
Paid employee includes anyone whose basic remuneration is not directly dependent on the revenue of the unit they work for, typically remunerated by wages and salaries but may be paid for piece work or in-kind. The ‘continuous’ criterion used in the ILO definition is not used here as data are often absent and due to country specificity.

2 = Non-Paid Employee
Non-paid employee includes contributing family workers who hold a self-employment job in a market-oriented establishment operated by a related person living in the same household, and who cannot be regarded as a partner because their degree of commitment to the operation of the establishment, in terms of working time or other factors, is not at a level comparable to that of the head of the establishment. All apprentices should be mapped as non-paid employee.

3 = Employer
An employer is a business owner (whether alone or in partnership) with employees. If the only people working in the business are the owner and contributing family workers, the person is not considered an employer (as they have no employees) and is instead classified as own account.

4 = Self-employed
Own account or self-employment includes jobs where remuneration is directly dependent on the goods and services produced (where home consumption is considered to be part of the profits) and where one has not engaged any permanent employees to work for them on a continuous basis during the reference period.
Members of producers’ cooperatives are workers who hold a self-employment job in a cooperative producing goods and services in which each member takes part on an equal footing with other members in determining the organization of production, sales and/or other work of the establishment, the investments and the distribution of the proceeds of the establishment amongst the members.

5 = Other, workers not classifiable by status
Other, workers not classifiable by status include those for whom insufficient relevant information is available and/or who cannot be included in any of the above categories.

ocusec_2_year
ocusec_2_year is a categorical variable that specifies the sector of activity in the last 12 months. It classifies the secondary job's sector of activity of any individual with a job (lstatus_year=1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire.

Four categories after harmonization:
1 = Public sector, Central Government, Army (including armed forces)
2 = Private, NGO
3 = State-owned
4 = Public or State-owned, but cannot distinguish

1 = Public Sector, Central Government, Army (including armed forces)
The public sector is the part of the economy run by the government.

2 = Private, NGO
The private sector is the part of the economy that is run for private profit and is not controlled by the state. It also includes non-governmental organizations.

3 = State-owned enterprises
State-owned includes para-state firms and all others in which the government has control (participation over 50%).

4 = Public or State-owned, but cannot distinguish
Select this option if the questionnaire does not ask about State-owned enterprises, and only about the Public sector.

Notes: Do not code on the basis of occupation (ISCO) or industry (ISIC) codes.
industry_orig_2_year
industry_orig_2_year is a string variable that specifies the original industry codes for the second job with reference period of the last 12 months and should correspond to whatever is in the original file with no recoding. Do not put missing values for people below the working age. Other missing values are allowed. It classifies the second job of any individual with a job (lstatus_year = 1) and is missing otherwise.

industry_isic_2_year
Code (string variable) of the industry according to the International Standard Industry Classification (ISIC) in the last 12 months for the second job of any individual with a job (lstatus_year = 1) and is missing otherwise. See industrycat_isic for the details.

industrycat10_2_year
industrycat10_2_year is a categorical variable that specifies the 1-digit industry classification of the second job, with reference period of the last 12 months, of any individual with a job (lstatus_year = 1) and is missing otherwise. The variable is constructed for all persons administered this module in each questionnaire. The codes for the second job are given here based on the UN International Standard Industrial Classification.

Ten categories after harmonization:
1 = Agriculture, Hunting, Fishing, etc.
2 = Mining
3 = Manufacturing
4 = Public Utility Services
5 = Construction
6 = Commerce
7 = Transport and Communications
8 = Financial and Business Services
9 = Public Administration
10 = Other Services, Unspecified

Notes:
• In the case of different classifications (former Soviet Union republics, for example), recoding has been done to best match the ISIC codes.
• Category 10 is also assigned for unspecified categories or items.
• For details on how to code different ISIC versions to industrycat10_2_year see the industrycat10 entry.
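The ten categories above collapse into the four broad groups used by industrycat4_2_year. The following is a minimal sketch of that collapse in Python, not part of the GLD code itself. It assumes the grouping implied by the industry consistency check under Lessons Learned and Challenges (category 1 to Agriculture, 2 to 5 to Industry, 6 to 9 to Services); mapping category 10 to Other is an assumption, and the function name `industrycat4_from_10` is hypothetical.

```python
def industrycat4_from_10(cat10):
    """Collapse a 10-category industry code into the 4 broad groups.

    Assumed mapping (illustrative, see lead-in):
      1      -> 1 Agriculture
      2..5   -> 2 Industry
      6..9   -> 3 Services
      10     -> 4 Other
    Missing input (None) stays missing.
    """
    if cat10 is None:
        return None
    if cat10 == 1:
        return 1
    if 2 <= cat10 <= 5:
        return 2
    if 6 <= cat10 <= 9:
        return 3
    if cat10 == 10:
        return 4
    # Anything else is outside the harmonized range and should be fixed upstream.
    raise ValueError(f"industrycat10 value out of range: {cat10!r}")


# Collapse every valid category once.
collapsed = [industrycat4_from_10(code) for code in range(1, 11)]
```

Raising on out-of-range input, rather than silently returning missing, keeps coding mistakes visible before the four-category variable is built.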
industrycat4_2_year
industrycat4_2_year is a categorical variable that specifies the 1-digit industry classification for Broad Economic Activities for the second job with reference period of the last 12 months.

Four categories after harmonization:
1 = Agriculture
2 = Industry
3 = Services
4 = Other

This variable is either created directly from the data (if industry classification does not exist for 10 categories) or created from industrycat10_2_year.

occup_orig_2_year
occup_orig_2_year is a string variable that specifies the original occupation code in the last 12 months for the secondary job. This variable corresponds to whatever is in the original file with no recoding.

occup_isco_2_year
Code (string variable) of the occupation according to the International Standard Classification of Occupations (ISCO) for the second job of any individual with a job in the last 12 months (lstatus_year = 1) and is missing otherwise. See occup_isco for the details.

occup_skill_2_year
Categorical code for the broad skill level of workers at the second job in the last 12 months of any individual with a job (lstatus_year = 1) and is missing otherwise. See occup_skill for details.

occup_2_year
occup_2_year is a categorical variable that specifies the 1-digit occupation classification. It classifies the second job of any individual with a job (lstatus_year = 1) and is missing otherwise. This variable is constructed for all persons administered this module in each questionnaire. Most surveys collect detailed information and then code it, without keeping the original data. No attempt has been made to correct or check the original coding. The classification is based on the International Standard Classification of Occupations (ISCO). In the case of different classifications, re-coding has been done to best match the ISCO.
Eleven categories after harmonization:
1 = Managers
2 = Professionals
3 = Technicians and associate professionals
4 = Clerical support workers
5 = Service and sales workers
6 = Skilled agricultural, forestry and fishery workers
7 = Craft and related trades workers
8 = Plant and machine operators, and assemblers
9 = Elementary occupations
10 = Armed forces occupations
99 = Other/unspecified

wage_no_compen_2_year
wage_no_compen_2_year is a continuous variable that specifies the last wage payment in local currency of any individual (lstatus_year = 1 & empstat_year = 1) in their secondary occupation and is missing otherwise. The wage should come from the second job, in other words, the job to which the person dedicated the second most amount of time in the week preceding the survey. This excludes tips, bonuses, other compensation such as dwellings or clothes, and other payments. The variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) will vary from country to country.

Notes:
• For all those with self-employment or owners of own businesses, this should be net revenues (net of all costs EXCEPT for tax payments) or the amount of salary taken from the business. Due to the almost complete lack of information on taxes, the wage from main job is NOT net of taxes.
• By definition, non-paid employees (empstat_2_year = 2) should have wage = 0.
• The reference period of wage_no_compen_2_year will be recorded in the unitwage_2_year variable.
• Use gross wages when available and net wages only when gross wages are not available. This is done to make it easy to compare earnings in formal and informal sectors.

unitwage_2_year
unitwage_2_year is a categorical variable that specifies the time reference for the wage_no_compen_2_year variable.
It specifies the time unit of measurement for the wages for the secondary job of any individual (lstatus_year = 1 & empstat_year = 1) and is missing otherwise.

Ten categories after harmonization:
1 = Daily
2 = Weekly
3 = Every two weeks
4 = Every two months
5 = Monthly
6 = Quarterly
7 = Every six months
8 = Annually
9 = Hourly
10 = Other

whours_2_year
whours_2_year is a continuous variable that specifies the hours of work in last week for the second job with reference period of the last 12 months of any individual with a job (lstatus_year = 1) and is missing otherwise. The second job is defined as the occupation to which the person dedicated the second most amount of time over the past year. The variable is constructed for all persons administered this module in each questionnaire. The lower age cutoff (and perhaps upper age cutoff) at which information is collected will vary from country to country.

Notes:
• Sometimes the questions are phrased as, “on average, how many hours a week do you work?”.
• For individuals who only give information on how many hours they work per day and no information on number of days worked a week, multiply the hours by 5 days.
• In the case of a question that has hours worked per month, divide by 4.3 to get weekly hours.

wmonths_2_year
wmonths_2_year is a continuous variable that specifies the number of months worked in the last 12 months for the secondary job of any individual with a job (lstatus_year = 1) and is missing otherwise. The secondary job is defined as the occupation to which the person dedicated less time than the primary job over the past year. The variable is constructed for all persons administered this module in each questionnaire.

wage_total_2_year
wage_total_2_year is a continuous variable that specifies the annualized wage payment (regular wage plus bonuses, in-kind, compensation, etc.) in local currency of any individual (lstatus_year = 1 & empstat_year = 1) in their secondary occupation and is missing otherwise.
The wage should come from the secondary job, in other words, the job to which the person dedicated the second most amount of time in the year preceding the survey. This wage includes tips, compensations such as bonuses, dwellings or clothes, and other payments. wage_total_2_year should be equal to wage_no_compen_2_year in case there are no bonuses, tips etc. offered as part of the job. The variable is constructed for all persons administered this module in each questionnaire. For this reason, the lower age cutoff (and perhaps upper age cutoff) will vary from country to country.

Notes:
• The annualization of wage_total_2_year should consider the number of months/weeks the person has been working and receiving this income. Harmonizers should not assume the person has been working the whole year.
• For an example on how to annualize wage see 5.2.2 Primary Employment last 7-days variable wage_total (Example: Creation of wage_total when there are no bonuses nor other compensations).

firmsize_l_2_year
firmsize_l_2_year specifies the lower bracket of the firm size. The variable is constructed for all persons who are employed. If it is continuous, it records the number of people working for the same employer. If the variable is categorical, it records the lower boundary of the bracket.

firmsize_u_2_year
firmsize_u_2_year specifies the upper bracket of the firm size. The variable is constructed for all persons who are employed. If it is continuous, it records the number of people working for the same employer. If the variable is categorical, it records the upper boundary of the bracket. If the right bracket is open, a missing value should be entered.

Other Employment, 12-month reference period

t_hours_others_year
t_hours_others_year is a continuous variable that specifies the hours of work in last 12 months in all jobs excluding the primary and secondary ones.
t_wage_nocompen_others_year
t_wage_nocompen_others_year is a continuous variable that specifies annual wage in all jobs excluding the primary and secondary ones. This excludes tips, bonuses, other compensation such as dwellings or clothes, and other payments.

Note: Use gross wages when available and net wages only when gross wages are not available. This is done to make it easy to compare earnings in formal and informal sectors.

t_wage_others_year
t_wage_others_year is a continuous variable that specifies the annual wage in all jobs excluding the primary and secondary ones. This wage includes tips, compensations such as bonuses, dwellings or clothes, and other payments. t_wage_others_year should be equal to t_wage_nocompen_others_year in case there are no bonuses, tips etc. offered as part of any of the jobs.

Total Employment Earnings, 12-month reference period

t_hours_total_year
t_hours_total_year is a continuous variable that specifies the hours of work in last 12 months in all jobs including primary, secondary and others.

t_wage_nocompen_total_year
t_wage_nocompen_total_year is a continuous variable that specifies the total annualized wage income in all jobs including primary, secondary and others. This excludes tips, bonuses, other compensation such as dwellings or clothes, and other payments.

Note: Use gross wages when available and net wages only when gross wages are not available. This is done to make it easy to compare earnings in formal and informal sectors.

t_wage_total_year
t_wage_total_year is a continuous variable that specifies the total annualized wage income in all jobs including primary, secondary and others. This income includes tips, compensations such as bonuses, dwellings or clothes, and other payments. t_wage_total_year should be equal to t_wage_nocompen_total_year in case there are no bonuses, tips etc. offered as part of any of the jobs.

Total Labor Income

njobs
njobs is a numeric variable that specifies the total number of jobs.
t_hours_annual
t_hours_annual is a continuous variable that specifies the annual number of hours worked in all jobs including primary, secondary and others regardless of their period of reference.

linc_nc
linc_nc is a continuous variable that specifies the total annualized wage income in all jobs including primary, secondary and others regardless of their period of reference. This excludes tips, bonuses, other compensation such as dwellings or clothes, and other payments.

Note: Use gross wages when available and net wages only when gross wages are not available. This is done to make it easy to compare earnings in formal and informal sectors.

laborincome
laborincome is a continuous variable that specifies the total annualized individual labor income in all jobs including primary, secondary and others regardless of their period of reference. This income includes tips, compensations such as bonuses, dwellings or clothes, and other payments. This variable should be used as the total annual labor income of an individual.

Lessons Learned and Challenges

• For any variable not collected in a country, the variable should be created and left as missing (.) in the final harmonized file.
• Variables in the data files must follow the sequence in which they appear in the manual.
• Labor status during last 12 months only reflects whether the person has worked during the last year or not, as the vast majority of the surveys do not provide enough information to distinguish those individuals that are unemployed from those that are out of the labor force.
• Individuals working in cooperatives are considered as “paid employee” in the employment status variable.

Several checks should be conducted to ensure that the data is harmonized correctly. Each expression below identifies the observations that violate the stated rule.

lstatus should be an integer in the range [1,3].
(lstatus<1 | lstatus>3 | mod(lstatus, 1) != 0) & lstatus<.

If lstatus is classified as employed then the employment type needs to be defined.
lstatus==1 & empstat>5

minlaborage should be an integer.
mod(minlaborage, 1) != 0 & minlaborage<.

The minimum age for employment should not be higher than 20.
minlaborage>20 & minlaborage<.

empstat should be an integer in the range [1,5].
(empstat<1 | empstat>5 | mod(empstat, 1) != 0) & empstat<.

If employment type is defined then labor force status should be employed.
empstat<=5 & lstatus!=1

industrycat10 should be an integer in the range [1,10].
(industrycat10<1 | industrycat10>10 | mod(industrycat10, 1) != 0) & industrycat10<.

industrycat4 should be an integer in the range [1,4].
(industrycat4<1 | industrycat4>4 | mod(industrycat4, 1) != 0) & industrycat4<.

There should not be a mismatch between industrycat10 and industrycat4.
((industrycat4==1 & industrycat10!=1) | (industrycat4==2 & (industrycat10<2 | industrycat10>5)) | (industrycat4==3 & (industrycat10<6 | industrycat10>9)) | (industrycat4==4 & industrycat10!=10)) & industrycat10!=.

firmsize_u should not be lower than firmsize_l.
firmsize_u < firmsize_l

Overview of Variables

| Module | Variable name | Variable label | Notes |
|---|---|---|---|
| Labor | minlaborage | Labor module application age | |
| Labor | lstatus | Labor status (7-day ref period) | In some GMD harmonizations (not in their dictionaries) this is given as lstatus_7. In general, the 7 day reference ones are then for example ocusec_7 and ocusec_2. This is then not neat with ocusec_year, ocusec_year_2. Either ocusec_week, ocusec_year or ocusec_7, ocusec_365. |
| Labor | potential_lf | Potential labour force (7-day ref period) | A binary indicator taking a value only if the person is not in the labour force (missing if in LF or unemployed). Codes 1 if i) available but not searching or ii) searching but not immediately available to work. Codes 0 otherwise. |
| Labor | underemployment | Underemployment (7-day ref period) | A binary indicator taking value only if the person is in the labour force and working (missing if not or unemployed). Codes 1 if person would take on more jobs or more hours at their job if possible/available, 0 otherwise. |
| Labor | nlfreason | Reason not in the labor force (7-day ref period) | |
| Labor | unempldur_l | Unemployment duration (months) lower bracket (7-day ref period) | |
| Labor | unempldur_u | Unemployment duration (months) upper bracket (7-day ref period) | |
| Labor | empstat | Employment status, primary job (7-day ref period) | |
| Labor | ocusec | Sector of activity, primary job (7-day ref period) | NGOs were classified in I2D2 as public sector, switched to private. |
| Labor | industry_orig | Original industry code, primary job (7-day ref period) | |
| Labor | industrycat_isic | ISIC code of the classification, primary job (7-day ref period) | Codes of ISIC-4 are, for example: Q Human health and social work activities; 86 Human health activities; 861; 8610 Hospital activities; 8620 Medical and dental practice activities. The four options would be coded as "Q", 8600, 8610, 8620 respectively. |
| Labor | industrycat10 | 1-digit industry classification, primary job (7-day ref period) | |
| Labor | industrycat4 | 4-category industry classification, primary job (7-day ref period) | |
| Labor | occup_orig | Original occupational classification, primary job (7-day ref period) | |
| Labor | occup_isco | ISCO code of the classification, primary job (7-day ref period) | ISCO-08 codes are, for example: 8 Plant and machine operators, and assemblers; 81 Stationary plant machine operators; 811 Mining and mineral processing plant operators; 8111 Miners and quarriers. Given the level of detail available, these are to be coded as 8000, 8100, 8110, and 8111 respectively. |
| Labor | occup_skill | Skill level based on ISCO standard | https://ilostat.ilo.org/resources/concepts-and-definitions/classification-occupation/ |
| Labor | occup | 1-digit occupational classification, primary job (7-day ref period) | |
| Labor | wage_no_compen | Last wage payment, primary job, excl. bonuses, etc. (7-day ref period) | |
| Labor | unitwage | Time unit of last wages payment, primary job (7-day ref period) | |
| Labor | whours | Hours of work in last week, primary job (7-day ref period) | |
| Labor | wmonths | Months worked in the last 12 months, primary job (7-day ref period) | This definition may appear confusing since it is months out of the past 12 months of work for the 7 day recall and there is a wmonths_year variable for the 12 month recall. It is not clearly defined in the guidelines, yet I would read it as the number of months in the main job for the 7 day recall job, which would be fewer than the wmonths_year number if the person switched jobs, say, 2 months ago, having held the previous one for over a year. |
| Labor | wage_total | Annualized total wage, primary job (7-day ref period) | |
| Labor | contract | Contract (7-day ref period) | |
| Labor | healthins | Health insurance (7-day ref period) | |
| Labor | socialsec | Social security (7-day ref period) | |
| Labor | union | Union membership (7-day ref period) | |
| Labor | firmsize_l | Firm size (lower bracket), primary job (7-day ref period) | |
| Labor | firmsize_u | Firm size (upper bracket), primary job (7-day ref period) | |
| Labor | empstat_2 | Employment status, secondary job (7-day ref period) | |
| Labor | ocusec_2 | Sector of activity, secondary job (7-day ref period) | |
| Labor | industry_orig_2 | Original industry code, secondary job (7-day ref period) | |
| Labor | industrycat_isic_2 | ISIC code of the classification, secondary job (7-day ref period) | See industrycat_isic |
| Labor | industrycat10_2 | 1-digit industry classification, secondary job (7-day ref period) | |
| Labor | industrycat4_2 | 4-category industry classification, secondary job (7-day ref period) | |
| Labor | occup_orig_2 | Original occupational classification, secondary job (7-day ref period) | |
| Labor | occup_isco_2 | ISCO code of the classification, secondary job (7-day ref period) | See occup_isco |
| Labor | occup_skill_2 | Skill level based on ISCO standard | https://ilostat.ilo.org/resources/concepts-and-definitions/classification-occupation/ |
| Labor | occup_2 | 1-digit occupational classification, secondary job (7-day ref period) | |
| Labor | wage_no_compen_2 | Last wage payment, secondary job, excl. bonuses, etc. (7-day ref period) | |
| Labor | unitwage_2 | Time unit of last wages payment, secondary job (7-day ref period) | |
| Labor | whours_2 | Hours of work in last week, secondary job (7-day ref period) | |
| Labor | wmonths_2 | Months worked in the last 12 months, secondary job (7-day ref period) | See note on wmonths |
| Labor | wage_total_2 | Annualized total wage, secondary job (7-day ref period) | |
| Labor | firmsize_l_2 | Firm size (lower bracket), secondary job (7-day ref period) | |
| Labor | firmsize_u_2 | Firm size (upper bracket), secondary job (7-day ref period) | |
| Labor | t_hours_others | Total hours of work in the last 12 months in other jobs excluding the primary and secondary ones | |
| Labor | t_wage_nocompen_others | Annualized wage in all jobs excluding the primary and secondary ones (excluding tips, bonuses, etc.) | |
| Labor | t_wage_others | Annualized wage (including tips, bonuses, etc.) in all other jobs excluding the primary and secondary ones (7-day ref period) | |
| Labor | t_hours_total | Annualized hours worked in all jobs (7-day ref period) | |
| Labor | t_wage_nocompen_total | Annualized wage in all jobs excl. bonuses, etc. (7-day ref period) | |
| Labor | t_wage_total | Annualized total wage for all jobs (7-day ref period) | |
| Labor | lstatus_year | Labor status (12-mon ref period) | |
| Labor | potential_lf_year | Potential labour force (12-mon ref period) | A binary indicator taking a value only if the person is not in the labour force (missing if in LF or unemployed). Codes 1 if i) available but not searching or ii) searching but not immediately available to work. Codes 0 otherwise. |
| Labor | underemployment_year | Underemployment (12-mon ref period) | A binary indicator taking value only if the person is in the labour force and working (missing if not or unemployed). Codes 1 if person would take on more jobs or more hours at their job if possible/available, 0 otherwise. |
| Labor | nlfreason_year | Reason not in the labor force (12-mon ref period) | |
| Labor | unempldur_l_year | Unemployment duration (months) lower bracket (12-mon ref period) | |
| Labor | unempldur_u_year | Unemployment duration (months) upper bracket (12-mon ref period) | |
| Labor | empstat_year | Employment status, primary job (12-mon ref period) | |
| Labor | ocusec_year | Sector of activity, primary job (12-mon ref period) | |
| Labor | industry_orig_year | Original industry code, primary job (12-mon ref period) | |
| Labor | industrycat_isic_year | ISIC code of the classification, primary job (12-mon ref period) | See industrycat_isic |
| Labor | industrycat10_year | 1-digit industry classification, primary job (12-mon ref period) | |
| Labor | industrycat4_year | 4-category industry classification, primary job (12-mon ref period) | |
| Labor | occup_orig_year | Original occupational classification, primary job (12-mon ref period) | |
| Labor | occup_isco_year | ISCO code of the classification, primary job (12-mon ref period) | See occup_isco |
| Labor | occup_skill_year | Skill level based on ISCO standard | https://ilostat.ilo.org/resources/concepts-and-definitions/classification-occupation/ |
| Labor | occup_year | 1-digit occupational classification, primary job (12-mon ref period) | |
| Labor | wage_no_compen_year | Last wage payment, primary job, excl. bonuses, etc. (12-mon ref period) | |
| Labor | unitwage_year | Time unit of last wages payment, primary job (12-mon ref period) | |
| Labor | whours_year | Hours of work in last week, primary job (12-mon ref period) | |
| Labor | wmonths_year | Months worked in the last 12 months, primary job (12-mon ref period) | |
| Labor | wage_total_year | Annualized total wage, primary job (12-mon ref period) | |
| Labor | contract_year | Contract (12-mon ref period) | |
| Labor | healthins_year | Health insurance (12-mon ref period) | |
| Labor | socialsec_year | Social security (12-mon ref period) | |
| Labor | union_year | Union membership (12-mon ref period) | |
| Labor | firmsize_l_year | Firm size (lower bracket), primary job (12-mon ref period) | |
| Labor | firmsize_u_year | Firm size (upper bracket), primary job (12-mon ref period) | |
| Labor | empstat_2_year | Employment status, secondary job (12-mon ref period) | |
| Labor | ocusec_2_year | Sector of activity, secondary job (12-mon ref period) | |
| Labor | industry_orig_2_year | Original industry code, secondary job (12-mon ref period) | |
| Labor | industrycat_isic_2_year | ISIC code of the classification, secondary job (12-mon ref period) | See industrycat_isic |
| Labor | industrycat10_2_year | 1-digit industry classification, secondary job (12-mon ref period) | |
| Labor | industrycat4_2_year | 4-category industry classification, secondary job (12-mon ref period) | |
| Labor | occup_orig_2_year | Original occupational classification, secondary job (12-mon ref period) | |
| Labor | occup_isco_2_year | ISCO code of the classification, secondary job (12-mon ref period) | See occup_isco |
| Labor | occup_skill_2_year | Skill level based on ISCO standard | https://ilostat.ilo.org/resources/concepts-and-definitions/classification-occupation/ |
| Labor | occup_2_year | 1-digit occupational classification, secondary job (12-mon ref period) | |
| Labor | wage_no_compen_2_year | Last wage payment, secondary job, excl. bonuses, etc. (12-mon ref period) | |
| Labor | unitwage_2_year | Time unit of last wages payment, secondary job (12-mon ref period) | |
| Labor | whours_2_year | Hours of work in last week, secondary job (12-mon ref period) | |
| Labor | wmonths_2_year | Months worked in the last 12 months, secondary job (12-mon ref period) | |
| Labor | wage_total_2_year | Annualized total wage, secondary job (12-mon ref period) | |
| Labor | firmsize_l_2_year | Firm size (lower bracket), secondary job (12-mon ref period) | |
| Labor | firmsize_u_2_year | Firm size (upper bracket), secondary job (12-mon ref period) | |
| Labor | t_hours_others_year | Annualized hours worked in all but primary and secondary jobs (12-mon ref period) | |
| Labor | t_wage_nocompen_others_year | Annualized wage in all but primary and secondary jobs excl. bonuses, etc. (12-mon ref period) | |
| Labor | t_wage_others_year | Annualized wage in all but primary and secondary jobs (12-mon ref period) | |
| Labor | t_hours_total_year | Annualized hours worked in all jobs (12-mon ref period) | |
| Labor | t_wage_nocompen_total_year | Annualized wage in all jobs excl. bonuses, etc. (12-mon ref period) | |
| Labor | t_wage_total_year | Annualized total wage for all jobs (12-mon ref period) | |
| Labor | njobs | Total number of jobs | Total Labor income will be created based on either the 7 days or 12 months reference period variables or a combination of both. Harmonizers should make sure that all jobs are included and none of them are double counted. |
| Labor | t_hours_annual | Total hours worked in all jobs in the previous 12 months | |
| Labor | linc_nc | Total annual wage income in all jobs, excl. bonuses, etc. | Difference to t_wage_nocompen_total_year? |
| Labor | laborincome | Total annual individual labor income in all jobs, incl. bonuses, etc. | Difference to t_wage_total_year? |

4. Validation and quality checks

4.1. How is the harmonization validated?
The harmonization is checked thoroughly and continuously to ensure the output is as accurate and reliable as possible. There are three main processes.

Chronologically, the first are the checks performed during the harmonization. As the code is developed, the harmonizers evaluate the results they are getting on an ongoing basis. Moreover, they try to confirm at early stages some of the more difficult mappings, like the conversion of survey education codes to the categories of the harmonized education variables. This process includes reaching out to World Bank colleagues in country offices and partners at the statistical offices to validate choices.

The second process is the automated quality checks: standard code that takes the harmonization as input and checks it for internal and external consistency, as well as checking a series of surveys over time. They are the main source of validation and are discussed in further detail in the remainder of the section.

The third process starts after the harmonization is published, by collecting users’ feedback. GLD encourages users to review the harmonization and the codes and alert the team of any issues that may still have slipped through the previous processes. Users can raise issues on GitHub (see the How to communicate on any other issue section) or reach out to the GLD Focal Point. The GLD Team has set up code pipelines to try to update harmonizations to respond to user feedback as quickly as possible. Your feedback is an integral part of GLD!

The automated quality checks are two sets of functions, one in `Stata` and one in `R`, to evaluate the output of the harmonization.
The `Stata` code evaluates each survey individually, that is, we check the particular “IND_2018_PLFS” harmonization for coherence and consistency, both internally (e.g., does the urban/rural variable only code urban or rural; do people classified as professionals have, on average, the education level we expect them to have) and externally (is the labor force participation from the survey in line with ILO, WDI?).

To interpret the results, users should differentiate between errors and flags. By errors we mean things in the harmonization that can be evaluated and are seen as wrong. An example would be an observation with value `3` in the `urban` variable, when the only possible values are `0`, `1`, or `NA` if the information is not available. Flags are features of the data that hint at an issue, for example, a high share of missing (i.e., `NA`) answers in the variable `urban`. This is uncommon and thus points to an issue, but it need not be incorrect: the underlying raw data may have an issue and the data are truly not available. For such cases, it is advisable to include the issue in the Country Survey Details, but no further correction of the harmonization code would be warranted.

The quality checks do not specifically name issues as errors or flags. This is for the user to evaluate, based on what the checks do, which is covered in the next two subsections.

4.2. Guide to the GLD single survey quality checks

This is a guide to the single survey quality checks for the Global Labor Database (GLD), one half of the automated quality control process after harmonization (yet before publication). After reading this subsection, users should be able to:
- understand what the quality checks evaluate and how they work,
- run the checks by themselves on a newly harmonized file, and
- read and interpret the output.

The guide itself is divided into three major sections, one for each learning objective.
Understanding the quality checks

The GLD quality checks are intended to make sure that the GLD harmonized output is of sufficient quality to include the evaluated harmonized survey in the database. This is done in five blocks:
- Block 1 checks adherence of the survey data to the GLD formatting requirements (e.g., binary 0/1 variables have no responses outside 0 and 1).
- Block 2 compares key indicators from the harmonization to the same indicators from external sources (e.g., ILO, UN), to ensure external consistency.
- Block 3 assesses whether major variables are missing.
- Block 4 evaluates whether key relationships between pairs of variables align with ex-ante expectations (e.g., individuals occupied as professionals - `occup == 2` - chiefly have post-secondary education).
- Block 5 analyses central aspects of wages to ensure wage/earnings information agrees with stylized facts on wage behaviour (e.g., lower skilled workers' wages are, on average, lower than those of high skilled workers).

The next subsections discuss each block in more detail. We start, however, with the overall quality checks template.

4.2.1. Overall quality checks template

The overall quality checks template is the only do-file the user needs to interact with regularly. It defines all relevant arguments and calls the do-files that run blocks 1 through 5. The code is accessible here and shown below in Figure 24:

Figure 24 - GLD single survey quality checks template

The quality checks template proceeds in three steps.

Step 1 readies Stata by cleaning up any data that may still be stored in memory.

Step 2 defines the arguments. It is the only section requiring user input. Users need to define three `globals`.
- `helper`: Define the path to the folder that contains all the files that run the quality checks (here, folder Helper_programs_1.5). It is recommended to have this be a central place so it applies to all surveys and can easily be updated if the quality checks are amended.
- `mydata`: Define the path to the harmonized data the user wishes to check.
- `output`: Define the path to the folder the user wants the output to be stored in. It is recommended to make this the `CCC_YYYY_SURV_V0#_M_V0#_A_GLD/Work/Output` folder for consistency, so other users always know where to find the checks output.

Step 3 simply calls the do-files from the helper path that run the checks. This step requires no user involvement. It does, however, run through all the files that make up the checks. What these files cover is the subject of the next section.

4.2.2. Block 1 - GLD format checks

This block (with all files starting with `B1`) is concerned with ensuring that the harmonized output conforms to the GLD data dictionary. The code for the format checks is divided into sections (overall checks, demography checks, education checks, ...) and should be commented enough for users to understand what each step is doing. If you feel more explanations are warranted, please raise an issue detailing the part not understood and we will endeavour to expand on it.

Broadly, the checks first evaluate whether the survey is set up correctly: filenames follow the naming convention and variables are from the data dictionary. Once this is established, the checks go through the variables and, if they are in the data, investigate whether they are in line with formatting rules. Some examples are:
- Check if vintages of survey (`vermast`/`veralt`) agree with the filename vintages (e.g., if file is `CCC_YYYY_SURV_V01_M_V02_A_GLD` `veralt` needs to be `"02"`).
- Check there is a single household head per household.
- Check that categorical variables have answers exclusively within the realm of feasible answers.
- Check that ISCO/ISIC codes are in the universe of possible codes for the pertinent classification (e.g., in ISIC 3 codes starting with `45` may only have either `0` or `1` through `5` as third digit; anything else would be outside the possible universe and flagged).
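Two of the example checks above can be sketched in a few lines. The snippet below is an illustrative Python sketch, not the GLD `B1` Stata code. The variable names `hhid` and `relationharm` (with `1` assumed to denote the household head), the helper function names, and the sample records are all assumptions made for the example; `urban` and `empstat` follow the value ranges stated in this manual.

```python
from collections import Counter

# Hypothetical harmonized records; values are invented for illustration.
records = [
    {"hhid": "A", "relationharm": 1, "urban": 1, "empstat": 1},
    {"hhid": "A", "relationharm": 2, "urban": 1, "empstat": 4},
    {"hhid": "B", "relationharm": 1, "urban": 0, "empstat": 7},     # empstat out of range
    {"hhid": "B", "relationharm": 1, "urban": 0, "empstat": None},  # second head in household B
    {"hhid": "C", "relationharm": 3, "urban": 3, "empstat": 2},     # no head; urban out of range
]

# Feasible answers per categorical variable (missing values are always allowed).
ALLOWED = {"urban": {0, 1}, "empstat": {1, 2, 3, 4, 5}}


def check_categorical(rows, allowed):
    """Count non-missing values outside the feasible set, per variable."""
    issues = {}
    for var, valid in allowed.items():
        bad = sum(1 for r in rows if r.get(var) is not None and r[var] not in valid)
        if bad:
            issues[var] = bad
    return issues


def check_single_head(rows, head_code=1):
    """Return the households that do not have exactly one head."""
    heads = Counter()
    households = set()
    for r in rows:
        households.add(r["hhid"])
        if r["relationharm"] == head_code:
            heads[r["hhid"]] += 1
    return sorted(h for h in households if heads[h] != 1)


categorical_issues = check_categorical(records, ALLOWED)
head_issues = check_single_head(records)
```

Here `categorical_issues` counts one out-of-range value each for `urban` and `empstat`, and `head_issues` lists households `B` (two heads) and `C` (no head).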
The format checks create a list of flagged issues. It is stored as a `.dta` file in the `Block1_Format` folder (file `CCC_YYYY_Other Household Survey_Q_Format_Checks.dta`), and as an Excel spreadsheet in the `01_summary` folder (file `B1_format_results.xlsx`). Figure 25 below shows a screenshot of the spreadsheet output:

Figure 25 - Example of the output from the GLD format checks

As shown, the output has five columns:

- Column Block describes the data dictionary block the variable causing the flag belongs to (demography, education, ...).
- Column Var_Name denotes the name of the variable flagged.
- Column Test_Type describes the kind of test the variable has undergone.
- Column Result denotes the number or share of cases with the issue (where this can be quantified).
- Column Flag indicates whether the issue is at odds with the data dictionary (`Flag = 1`) or technically possible, yet odd (`Flag = 99`).

4.2.3. Block 2 - GLD external checks

Overview

The GLD external comparison checks are a series of checks that compare GLD data (aggregate moments) with comparable statistics from external sources. It is the most extensive block, currently covering 10 code files, all starting with `B2`. In particular, we check the following data from GLD surveys:

Sub-Block 2.1. Demographics (file `B2.01`)

1. Total population (number)
2. Gender split (% of female population)
3. Urban population (% of total)
4. Children (0-14 years old, % of total population)
5. Working Age Population (15-64 years old, % of total population)
6. Seniors (65+ years old, % of total population)

Sub-Block 2.2. Labor Force Variables (file `B2.02`)

7. Labor force size (number)
8. Labor force participation rate (labor force / working age population)
9. Employment (number)
10. Employment to population ratio (employed / working age population)
11. Unemployment rate (unemployed / labor force)
12. Agriculture (% of employment)
13. Industry (% of employment)
14. Services (% of employment)
15. Industry category (% of employment)
16. Status in Employment (% of employment)

Sub-Block 2.3. Wage variables (file `B2.03`)

17. Average hourly wages

Variable aggregation in GLD data

To construct aggregate indicators from the GLD surveys, we weight sums and means by the GLD variable `weight`. The demographic variables are fully described in the previous section and computed in the do file “01_checks_demo_GLD.do”. The “Labor Force Variables” from Block 2 are defined as follows:

• Labor force size is the number of persons aged 15+ that are either employed or unemployed.
• Labor force participation rate is the labor force size divided by the population aged 15+.
• Employment is the number of persons aged 15+ that are employed.
• Employment to population ratio is employment (as defined above) divided by the population aged 15+.
• Unemployment rate is the number of unemployed people over the number of employed plus unemployed people. Note: no age requirement.
• Agriculture is the number of people employed that work in the agricultural or primary sector, over total people employed, times 100. Note: no age requirement.
• Industry is the number of people employed that work in the industrial or secondary sector, over total people employed, times 100. Note: no age requirement.
• Services is the number of people employed that work in the service or tertiary sector, over total people employed, times 100. Note: no age requirement.

Notice that agriculture, industry, and services are three economic activity categories that cover all possible employment sectors.

• Industry category consists of 10 economic activity categories that cover all possible employment sectors, at a more granular level than the categories in the previous point. This variable is actually a set of 10 variables, each defined as the number of people in that industry category divided by the number of people with information in the `industrycat10` variable in GLD.
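The weighted definitions above can be sketched as follows. This is an illustrative Python version rather than the GLD do file: the function name is hypothetical, and the coding `lstatus == 2` for the unemployed is an assumption (the manual only states that `lstatus == 1` means employed).

```python
def labor_force_indicators(records, min_age=15):
    """Compute weighted labor force aggregates from person-level records."""
    def wsum(condition):
        # Weighted count of persons satisfying the condition
        return sum(r["weight"] for r in records if condition(r))

    pop = wsum(lambda r: r["age"] >= min_age)
    employed = wsum(lambda r: r["age"] >= min_age and r["lstatus"] == 1)
    # Assumption: lstatus == 2 marks the unemployed
    unemployed = wsum(lambda r: r["age"] >= min_age and r["lstatus"] == 2)
    labor_force = employed + unemployed

    # The unemployment rate is defined without an age requirement
    employed_all = wsum(lambda r: r["lstatus"] == 1)
    unemployed_all = wsum(lambda r: r["lstatus"] == 2)

    return {
        "labor_force": labor_force,
        "lfpr": labor_force / pop,
        "employment": employed,
        "emp_to_pop": employed / pop,
        "unemployment_rate": unemployed_all / (employed_all + unemployed_all),
    }


people = [
    {"age": 30, "lstatus": 1, "weight": 100.0},
    {"age": 45, "lstatus": 2, "weight": 50.0},
    {"age": 60, "lstatus": 3, "weight": 50.0},
    {"age": 10, "lstatus": None, "weight": 80.0},
    {"age": 14, "lstatus": 1, "weight": 20.0},
]
indicators = labor_force_indicators(people)
print(indicators)
```

Note how the employed 14-year-old is excluded from employment and the labor force (both restricted to ages 15+) but enters the unemployment rate, which carries no age requirement.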
External comparison sources

Each variable listed in Overview is matched with one or more comparable variables from external sources. The sources used are:

• World Development Indicators (WDI), from the World Bank. This is a collection of development indicators compiled from international sources. The data is downloaded using the Stata module wbopendata.
• ILOSTAT, from the International Labor Organization, accessed through the Stata module dbnomics, developed by the CEPREMAP.
• UNdata, from the United Nations, also accessed through dbnomics.

Note that the external data is updated periodically. The GLD checks will automatically use the latest data available in each of these sources. The Annex contains a complete list of the variables selected from these sources, along with the GLD variables they are matched with.

Output: content overview

The survey checks process will create a folder called `Block2_External`, containing two folders:

• 01_data
• 02_figures

Each file in the 01_data folder can be matched to a figure in 02_figures and contains the data needed to generate that figure. Each .dta file consists of the following variables:

• year: the year for each value in the data set.
• value: the relevant point estimate for each source, represented as a dot in the figures.
• ub: the upper bound of the value, set at 1.1*value (10% above) for most variables.
• lb: the lower bound of the value, set at 0.9*value (10% below) for most variables.
• countrycode: the 2 or 3-digit code corresponding to the country of the survey.
• source: string name of the source behind each value. One of the entries in this column will be “GLD”; the rest will be the external sources used for comparison.

The figures folder contains 25 pdf files named Fig1.pdf to Fig16.pdf, with 10 versions of Fig15, called Fig15_1.pdf to Fig15_10.pdf. All are graphical representations of the quality check comparisons (more details in next section).
Additionally, there are four group figures: 2A collects Figures 1 through 6, 2B1 Figures 7 through 14, 2B2 Figures 15_1 through 15_10, and 2C Figure 16.

Each figure in the folder “02_figures” is the graphical representation of an individual check run on the data from the input GLD survey, consisting of the following elements:

• A title, indicating the variable being checked,
• A y-axis, with the values the variable may take,
• An x-axis, which represents the different data sources: GLD (as the left-most x-axis value) and the one or multiple comparison sources (to its right),
• The data, in red or blue segments:
  o A red segment representing the GLD harmonized survey estimate.
  o A blue segment representing the estimate drawn from an external source, the name of which is denoted in the x-axis tick label.
• A solid line means that the year of the data is the same as the year of the survey, while a dashed line means that we are using data from a different year (as data for the given estimate was not available for the survey year).
• Each segment consists of two distinct parts. The first is a central point (dot or triangle) showing the precise estimate for each variable and source. The second is the pair of lines stretching out above and below the central point, representing an upper and lower bound of the estimate (commonly 10% above and below the point estimate).

These figures allow us to see whether the values in GLD are close to the values from external sources. In the Flags Methodology section we describe precisely the circumstances under which we consider the values to be “too far off”.

Additionally, the external check process places in the summary folder a PDF file ("B2_external_flags") with all figures that have been flagged. The role of this file is to present a snapshot of the variables whose estimates are clearly different from the external sources we use as comparison, which may indicate a problem in the harmonization or other issues that require our attention.
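The non-overlap rule behind these flags, set out in full in the Flags Methodology section, can be sketched as follows. This is an illustrative Python version, not the GLD Stata code: the function name `external_flag` and its `band` argument are hypothetical, with 0.05 standing for the 5% band the manual describes for most variables (0.10 for wages).

```python
def external_flag(gld_value, external_values, band=0.05):
    """Flag a GLD estimate whose band does not overlap the external benchmark band."""
    # Benchmark: simple average of all external sources
    com_pe = sum(external_values) / len(external_values)

    # Bounds around the GLD estimate and around the benchmark
    val_lb, val_ub = gld_value * (1 - band), gld_value * (1 + band)
    com_lb, com_ub = com_pe * (1 - band), com_pe * (1 + band)

    dist1 = val_lb - com_ub  # positive: GLD is well above the benchmark
    dist2 = com_lb - val_ub  # positive: GLD is well below the benchmark
    f1 = int(dist1 > 0)
    f2 = int(dist2 > 0)
    return {"com_pe": com_pe, "dist1": dist1, "dist2": dist2,
            "f1": f1, "f2": f2, "flag": int(f1 or f2)}


print(external_flag(120.0, [100.0, 100.0]))  # GLD band lies above the benchmark band
print(external_flag(110.0, [100.0]))         # bands overlap, so no flag
```

A flag is raised only when one of the two distances is strictly positive, which corresponds to the GLD segment and the benchmark segment not intersecting in the figures.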
In addition, an overview Excel file (called "B2_external_results.xlsx") is created, detailing the information behind each check. The file has a header row with the column names, followed by up to 25 rows, one for each check run on the GLD data. If there are fewer than 25 check rows, then some checks have not been performed. This happens when some variables are not available in the original GLD dataset (e.g., no urban/rural data recorded). The file consists of the following 18 columns:

1. Year: Year of the GLD survey, as described in “section 3. Output”.
2. Country: ISO Alpha-3 country code of the country in the GLD survey.
3. Survey: Type of GLD survey (for instance: LFS, QLFS, etc.).
4. Varchecked: Stands for “Variable checked” and is the variable we are testing in the GLD data.
5. Varorder: Stands for “Variable order” and is a numeric variable ranging from 1 to the number of checks for that survey (up to 25). It allows us to sort the checks in the default order (starting with “Total population” and ending with “Average hourly wages”).
6. Val_pe: Stands for “Value point-estimate”. This is the value for the variable in that row computed from GLD.
7. Val_lb: Stands for “Value lower-bound”. It is the result of multiplying val_pe by 0.95 (0.90 for the variable “average hourly wages”). By construction, this results in a value 5% (or 10%) lower than val_pe.
8. Val_ub: Stands for “Value upper-bound”. It is the result of multiplying val_pe by 1.05 (1.10 for the variable “average hourly wages”). By construction, this results in a value 5% (or 10%) higher than val_pe.
9. Com_pe: Stands for “Comparison point estimate”. It is the result of averaging all the values from external sources of the variable we are testing. That is, if we are using two external sources to test “total population” (e.g., ILO and UN), com_pe will be the average of “total population” in these two sources. Whatever the number of sources used, com_pe is the average of their values.
10. Com_lb: Stands for “Comparison lower bound”. It is the result of multiplying com_pe by 0.95 (0.90 for the variable “average hourly wages”). By construction, this results in a value 5% (or 10%) lower than com_pe.
11. Com_ub: Stands for “Comparison upper-bound”. It is the result of multiplying com_pe by 1.05 (1.10 for the variable “average hourly wages”). By construction, this results in a value 5% (or 10%) higher than com_pe.
12. Com_cases: Indicates how many distinct values we are using to establish the comparison. Often, two different external sources will report the exact same value, because they take the data from one another. For instance, WDI takes data from sources such as ILO. Therefore, if we look at the labor force size in WDI and in ILO, we will obtain the exact same point. Com_cases keeps track of this, and indicates how many different values we are dealing with, rounded to the 0.01 value.
13. Diff: Reports the absolute difference between val_pe and com_pe. In the figures, this represents the distance between the dot of the GLD value and the average of the external sources’ dots.
14. Dist1: The distance between val_lb and com_ub. That is, the distance from the lower bound of the GLD value to the upper bound of the (average of) external sources. When this distance is positive, the lower bound of the GLD value is above the upper limit we give to the external sources, i.e., our value is well above the external sources.
15. Dist2: The distance between com_lb and val_ub. That is, the distance between the lower bound of the external sources and the upper bound of the GLD value. When this distance is positive, the lower bound of the external sources is above the upper limit we give to the GLD value. That is, the external sources are (on average) well above the GLD value, so our value is well below the external sources.
16. F1: An indicator that takes the value 1 when dist1 is positive (we omit the 0s). That is, it indicates when the GLD value is well above the external sources.
17. F2: An indicator that takes the value 1 when dist2 is positive (again, we omit the 0s). It indicates when the GLD value is well below the external sources.
18. Flag: An indicator that takes the value 1 when either F1 or F2 is 1, and 0 (or missing/blank) otherwise. That is, it indicates when a GLD variable is either well above or well below the external sources’ estimate. Graphically, it means that the segment of the GLD value does not intersect the imaginary segment generated by the average of the external sources.

Flags Methodology

This section details the flags methodology. Some parts have already been addressed in previous sections. Nonetheless, this section explains the methodology in its entirety.

One of the main goals of the checks process is to detect GLD variables that differ substantially from that same variable as reported by external sources. To conduct this analysis, we need the following elements:

1. A GLD variable,
2. A benchmark to compare the GLD variable with,
3. A measure of distance between (1) and (2),
4. A threshold above which the distance in (3) is considered too large.

For point (1), we use the GLD variables described in [Overview](#overview).

For point (2), we average the external variables listed in the Annex. Note that for each GLD variable we have one or more variables from external sources we wish to compare it to. By averaging all the variables from external sources, we guarantee we have one single benchmark per variable. Two further clarifications on the benchmark values:

• In the future, we may want to change this to a weighted average, to account for the fact that some sources matter more than others. In particular, we know not all external variables are from the same year.
Hence, we may want to give a higher weight to the external sources that are from the same year as the variable.
• We know that some external sources report the same value, because they ultimately come from the same source (e.g., WDI takes data from ILO). Note that at this point, we are not correcting for this: we are giving equal weights to all sources regardless of how “original” their values are.

Finally, it is worth noting that neither of the points mentioned above is guaranteed to affect the results. In fact, the results have proven to be fairly robust, and it is likely that a more elaborate weighting process will not translate into any noticeable differences in the output.

For point (3), to measure the distance between (1) GLD and (2) the average of the external sources, we consider two cases of particular interest:

• For the cases in which the GLD data is above the external data, we calculate the distance between a lower bound of the GLD data and an upper bound of the external data. If this distance is positive, then the GLD value is considerably above the external sources. If the distance is 0, the lower bound of the GLD data is exactly the same as the upper bound of the external data. If the distance is negative, the lower bound of the GLD data is below the upper bound of the external data. Thinking in terms of the segments represented in the figures, a negative distance represents an overlap between the GLD segment and the average of the external segments, and a positive distance means no intersection of the segments.
• For the cases in which the GLD data is below the external data, we proceed symmetrically. That is, we compute the difference between the lower bound of the average of external values and the upper bound of the GLD data. If this distance is positive, then the lower bound of the external variables is above the upper bound of the GLD variable. That is, the GLD value is very far away from the comparison values. If the distance is 0, then the lower bound of the external variables coincides exactly with the upper bound of the GLD data. If it is negative, then the upper bound of the GLD data is higher than the lower bound of the external data, i.e., the segments intersect.

As explained above, a positive distance between the two points implies a large difference between the GLD values and the external values. Hence, these are the cases that we flag. That is, we will flag a GLD variable from the list in [Overview](#overview) if either of the two following conditions holds:

• The distance between the lower bound of the GLD data and the upper bound of the average of the external data is positive.
• The distance between the lower bound of the average of the external data and the upper bound of the GLD data is positive.

It remains to explain how we chose these upper and lower bounds, which effectively determine the thresholds we use to flag the data (point (4) in the list of required elements). Here, we use a threshold of 5% to compute the upper/lower bounds for all the variables, and 10% for the wage related variables. That is, for most variables, we will flag a GLD variable if the difference between the value it takes and the average of the external sources is higher than 5% of the value of the GLD variable plus 5% of the value of the external sources. Graphically, we will flag the variables where the GLD segment and the imaginary segment defined by the average of the external sources do not intersect.

4.2.4. Block 3 - GLD missing checks

Block 3 records the share of responses that are missing for a series of variables. For variables `age`, `male`, and `urban`, the code evaluates if any answers are missing. For variable `lstatus`, it evaluates whether there are any responses missing for those above the minimum age to respond to the labour module (variable `minlaborage`).
Finally, variables `empstat`, `industrycat4`, `wage_no_compen`, and `unitwage` are checked for missing answers only for those employed (i.e., with `lstatus == 1`). The code outputs the data in "Block3_Missing/01_data" and places the figure directly in the "01_summary" folder in the user designated output folder.

4.2.5. Block 4 - GLD bivariate checks

The bivariate data checks look at combinations of two or more variables and raise a flag when the expected behavior of the data under these combinations is violated. There are two groups of bivariate checks.

The first group consists of three checks and looks at the conversions of certain variables. The idea behind these three checks is that we have different versions of a particular variable (say “education” is a categorical variable that can present 4 levels, 5 levels and 7 levels), and we want to make sure that the correspondences are correct. This should be true mechanically, and these checks confirm that it is the case.

The second group of bivariate checks looks at relationships between two variables and raises a flag when the relationship is different from what we expect. Consider, for instance, the variables education and age. Common sense tells us that we should expect the number of children under 12 years old with postgraduate education to be very low. Hence, there is a check that computes this share, and flags it if it is above a certain threshold. If many children appeared in our data with post-secondary education, we would need to make sure that the variables are correctly harmonized. Therefore, bivariate checks can be useful to identify potential errors in the harmonization process.

In the next section, we list 8 groups of variables. The first three sections explain the *cross-categories* consistency checks while the latter five sections explain the *bivariate* checks conducted on pairs of variables.
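As an illustration of the second group, the education and age example above can be sketched in Python as follows. This is not the GLD Stata code: the function name, the 1% threshold, and the coding `educat4 == 4` for post-secondary education are all assumptions made for the sake of the example.

```python
def children_with_higher_education(records, max_age=12, post_code=4, threshold=0.01):
    """Return the weighted share of children with post-secondary education and a flag."""
    children = [
        r for r in records
        if r["age"] is not None and r["age"] < max_age and r["educat4"] is not None
    ]
    total = sum(r["weight"] for r in children)
    if total == 0:
        return None, False  # no children with education data, nothing to check
    # Assumption: educat4 == 4 denotes post-secondary education
    share = sum(r["weight"] for r in children if r["educat4"] == post_code) / total
    return share, share > threshold  # hypothetical 1% threshold


survey = [
    {"age": 8, "educat4": 1, "weight": 50.0},
    {"age": 10, "educat4": 2, "weight": 45.0},
    {"age": 11, "educat4": 4, "weight": 5.0},
    {"age": 30, "educat4": 4, "weight": 200.0},
]
share, flagged = children_with_higher_education(survey)
print(share, flagged)
```

A flagged result does not prove an error by itself; it points to the age and education variables as candidates for a closer look at the harmonization.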
Education: cross-categories

The GLD data includes three different education variables: `educat7`, `educat5` and `educat4`. These are categorical variables with different numbers of levels: `educat4` classifies education in 4 groups, `educat5` does so in 5 groups, and `educat7` in 7 groups. We expect these three variables to be coherent. That is, if an individual has completed primary education, all three variables should reflect this information. Hence, we construct a table with the equivalences between these variables and raise a flag when the data deviates from them.

Industry: cross-categories

This check follows the same structure as the “Education: cross-categories” check, with the variables `industrycat_isic`, `industrycat10` and `industrycat4`. The appendix contains a table with the correspondences that the code checks for. Notice that these follow Version 4 of ISIC. Whenever the GLD data follows a different version, or the version is not indicated in the GLD database, this check is skipped.

Occupation: cross-categories

This check looks at the conversion of the `occup_isco`, `occup_skill` and `occup` variables. The correspondence between these variables follows the ISCO version 1988, listed in the appendix. Whenever the data presents a different conversion, a flag is raised. If the data follows a different ISCO version, or if the version is not known, the check is skipped.

Labor force versus employment

This check looks at the joint distribution of the variables `lstatus` and `empstat`. By definition, when `lstatus` equals “Employed”, `empstat` should contain information. If `lstatus` is either “Unemployed” or “Not in LF”, then `empstat` should be missing. Whenever this rule is not followed in the data, a flag is raised.

Labor force status versus age

This check looks at the joint distribution of labor force status (`lstatus`) and age.
To make the check easier, we recode the variable age into a categorical variable with 5 groups, defined as follows:

• Child, if an individual is between 0 and 14 years old;
• Youth, if an individual is between 15 and 24 years old;
• Adult, if an individual is between 25 and 54 years old;
• Senior, if an individual is between 55 and 70 years old;
• Retiree, if an individual is between 71 and 120 years old.

We compute the joint distribution of age and lstatus, and raise a flag whenever the data violates one of the following 6 conditions:

1. 90%+ of children should be out of the labor force or have lstatus missing;
2. 50%+ of youth should be out of the labor force;
3. 50%+ of adults should be in the labor force;
4. Seniors must be in the labor force at a lower rate than adults;
5. 80%+ of retirees should be out of the labor force;
6. All lstatus groups should be positive (>0) for working-age people (youth, adult, and senior).

Education versus age

This check looks at the joint distribution of education (as defined by educat4) and age (recoded into the five groups defined in the previous check), and raises a flag whenever one of the following 3 conditions is not met in the GLD data:

1. 90%+ of children should have no education or primary only;
2. None of the primary, secondary and post-secondary categories should be 0 for people 15+;
3. For all age groups, no single education level should be 100%.

Industry versus occupation

This check considers the joint distribution of industrycat4 and occup, and raises a flag when any of the following 15 conditions is violated:

1. Managers should not be prevalent in agriculture (share under 20%);
2. Professionals should be low in agriculture (under 10%);
3. Professionals should be mainly in services (over 60%);
4. Technicians should be low in agriculture (under 10%);
5. Clerks should be low in agriculture (under 10%);
6. Clerks should be mainly in services (over 60%);
7. Service and market should be low in agriculture (under 10%);
8. Service and market should be almost only in services (over 80%);
9. Skilled agricultural should be almost only in agriculture (over 80%);
10. Craft workers should be low in agriculture (under 10%);
11. Craft workers should be mainly in industry (over 60%);
12. Machine operators should be low in agriculture (under 10%);
13. In all occupations, “other” should be low (under 25%);
14. In any occupation, no industry should be exclusive (none is 100%);
15. In any occupation, industries should not all be missing.

Occupation versus education

This check looks at the joint distribution of occupation and education, as measured by occup and educat4. A flag is raised whenever one of the following 6 conditions is not met:

1. Professionals with no education should be low (under 10%);
2. Professionals should mainly have post-secondary education (over 60%);
3. Technicians with no education should be low (under 10%);
4. Clerks with no education should be low (under 10%);
5. Machine operators with post-secondary education should not be prevalent (under 20%);
6. Elementary occupations with post-secondary education should be low (under 10%).

Output produced

Each of the 8 checks described above produces a file with the key data used to conduct the check (a list of correspondences or a joint bivariate distribution), as well as a sequence of 0s and 1s indicating if the conditions are met. If any of the required conditions is not met, a string variable explaining the reason behind the flag is created. Additionally, a file is created listing all flagged issues. All files are stored under "Block4_Bivariate/01_data".

The code also stores in the "01_summary" folder an extra file called "B4_bivariave_results.xlsx". This file contains a list of cases in which a flag has been raised and will be empty when the data does not raise any flags.

4.2.6. Block 5 - GLD wage checks

The goal of this section is to look at the GLD wage data (wage_no_compen variable for those with empstat == 1), and to use this information to detect potential harmonization errors. We compute 4 series of wages: by education, by occupation, by industry and by age, and flag a series whenever the wage behaves differently than expected. Note that in all cases, we convert the wages to hourly wages.

Wage series & checks

Wage by education

To compute wage by education, we use the education variable educat4, a categorical variable that takes 4 values. We expect wages to be increasing in education, and we flag the series when this is not the case.

Wage by occupation

To compute wage by occupation, we use the occupation variable occup_skill, a categorical variable that takes 3 values. We expect wages to be increasing in skill level, and we flag the series when this is not the case.

Wage by industry

To compute wage by industry, we use the industry variable industrycat4, a categorical variable that takes 4 values, of which we consider 3: agriculture, industry and services. We expect wages to be highest in services, followed by industry and finally agriculture. We flag the series when this is not the case.

Wage by age

To compute wage by age, we use a re-coded version of the age variable, which classifies the population in 5 groups: child (0 to 14 years old), youth (15 to 24 years old), adult (25 to 54 years old), senior (55 to 70 years old) and retiree (71 to 120 years old). We expect wages to increase until adulthood, and we expect wages of retirees to be lower than wages of seniors. We remain agnostic about how wages for adults and for seniors compare. We flag the series when these rules are violated.

Output produced

Regardless of the results of the checks, we produce the following two pieces of output:

* A small database for each wage series, with a binary variable indicating if any condition is violated. These are stored in "Block5_Wage/01_data".
* A pdf image consisting of four figures, one for each series. This file can be found in "Block5_Wage/02_figures" and "01_summary". The graphs are colour coded. When the series is depicted in blue, it conforms with our ex-ante expectations. If the line is drawn in red, the series raises at least one flag.

An example of the output produced, for the case of the 2016 Bangladesh LFS, is shown below in Figure 26.

Figure 26 - Example of the wage analysis checks.

4.2.7. Annex to Block 2 - GLD external checks

The list below gives the exact indicators read through WDI or dbnomics and used to create external comparators.

1. Total population
1.1. SP.POP.TOTL, Total population, WDI
1.2. A.N."CC".W0.S1.S1._Z.POP._Z._Z._Z.PS._Z.N, Total population, NA Main Aggregates, UNData
• Unit of measure: "PS" (persons)
1.3. POP_2POP_GEO_NB, Population by rural / urban areas -- UN estimates and projections, Nov. 2020 (thousands), ILO
• Classif1: "GEO_COV_NAT" (national geographic coverage)
1.4. POP_2POP_SEX_AGE_NB, Population by sex and age -- UN estimates and projections, Nov. 2021 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_10YRBANDS_TOTAL" (total age, classified in 10 year bands)
2. Gender split (% of female population)
2.1. SP.POP.TOTL.FE.ZS, Population, female (% of total population), WDI
2.2. POP_2POP_SEX_AGE_NB, Population by sex and age -- UN estimates and projections, Nov. 2021 (thousands), ILO
• Sex: "SEX_F" & "SEX_T" (Female & Total. Female for the numerator, total for the denominator)
• Classif1: AGE_AGGREGATE_TOTAL (total age, classified in aggregates)
2.3. POP_2POP_GEO_NB, Population by rural / urban areas -- UN estimates and projections, Nov. 2020 (thousands), ILO
• Sex: "SEX_F" & "SEX_T" (Female & Total. Female for the numerator, total for the denominator)
• Classif1: "AGE_AGGREGATE_TOTAL" (total age, classified in aggregates)
• Classif2: "GEO_COV_NAT" (national geographic coverage)
3. Urban population (% of total)
3.1.
SP.URB.TOTL.IN.ZS, Urban population (% of total population), WDI
3.2. POP_2POP_GEO_NB, Population by rural / urban areas -- UN estimates and projections, Nov. 2020 (thousands), ILO
• Classif1: "GEO_COV_NAT" & "GEO_COV_URB" (national & urban geographic coverage)
4. Children (0-14 years old, % of total population)
4.1. SP.POP.0014.TO.ZS, Population ages 0-14 (% of total population), WDI
4.2. POP_2POP_SEX_AGE_NB, Population by sex and age -- UN estimates and projections, Nov. 2021 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_5YRBANDS_TOTAL". In particular, we select age groups "AGE_5YRBANDS_Y00-04", "AGE_5YRBANDS_Y05-09" and "AGE_5YRBANDS_Y10" to compute the numerator.
4.3. POP_2POP_GEO_NB, Population by rural / urban areas -- UN estimates and projections, Nov. 2020 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_5YRBANDS_TOTAL". In particular, we select age groups "AGE_5YRBANDS_Y00-04", "AGE_5YRBANDS_Y05-09" and "AGE_5YRBANDS_Y10" to compute the numerator.
• Classif2: "GEO_COV_NAT"
5. Adults (15-64 years old, % of total population)
5.1. SP.POP.1564.TO.ZS, Population ages 15-64 (% of total population), WDI
5.2. POP_2POP_SEX_AGE_NB, Population by sex and age -- UN estimates and projections, Nov. 2021 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_5YRBANDS_TOTAL". In particular, we select age groups "AGE_5YRBANDS_Y15-19", "AGE_5YRBANDS_Y20-24", "AGE_5YRBANDS_Y25-29", "AGE_5YRBANDS_Y30-34", "AGE_5YRBANDS_Y35-39", "AGE_5YRBANDS_Y40-44", "AGE_5YRBANDS_Y45-49", "AGE_5YRBANDS_Y50-54", "AGE_5YRBANDS_Y55-59" and "AGE_5YRBANDS_Y60-64" to compute the numerator.
5.3. POP_2POP_GEO_NB, Population by rural / urban areas -- UN estimates and projections, Nov. 2020 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_5YRBANDS_TOTAL". In particular, we select age groups "AGE_5YRBANDS_Y15-19", "AGE_5YRBANDS_Y20-24", "AGE_5YRBANDS_Y25-29", "AGE_5YRBANDS_Y30-34", "AGE_5YRBANDS_Y35-39", "AGE_5YRBANDS_Y40-44", "AGE_5YRBANDS_Y45-49", "AGE_5YRBANDS_Y50-54", "AGE_5YRBANDS_Y55-59" and "AGE_5YRBANDS_Y60-64" to compute the numerator.
• Classif2: "GEO_COV_NAT" (national geographic coverage)
6. Seniors (65+ years old, % of total population)
6.1. SP.POP.65UP.TO.ZS, Population ages 65 and above (% of total population), WDI
6.2. POP_2POP_SEX_AGE_NB, Population by sex and age -- UN estimates and projections, Nov. 2021 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_5YRBANDS_TOTAL". In particular, we select age group "AGE_5YRBANDS_YGE65" to compute the numerator.
6.3. POP_2POP_GEO_NB, Population by rural / urban areas -- UN estimates and projections, Nov. 2020 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_5YRBANDS_TOTAL". In particular, we select age group "AGE_5YRBANDS_YGE65" to compute the numerator.
• Classif2: "GEO_COV_NAT" (national geographic coverage)
7. Labor force size (number)
7.1. SL.TLF.TOTL.IN, Labor force, total, WDI
7.2. EAP_TEAP_SEX_AGE_NB, Labour force by sex and age (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_10YRBANDS_TOTAL"
7.3. EAP_2EAP_SEX_AGE_NB, Labour force by sex and age -- ILO modelled estimates, Nov. 2021 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_YTHADULT_YGE15" (Age +15)
8. Labor force participation rate (labor force / total population)
8.1. SL.TLF.CACT.ZS, Labor force participation rate, total (% of total population ages 15+) (modeled ILO estimate), WDI
8.2. SL.TLF.CACT.NE.ZS, Labor force participation rate, total (% of total population ages 15+) (national estimate), WDI
8.3. SL.TLF.CACT.NE.ZS, Labor force participation rate, total (% of total population ages 15+) (national estimate), WDI
8.4. EAP_DWAP_SEX_AGE_RT, Labour force participation rate by sex and age (%), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_YTHADULT_YGE15" (Age +15)
8.5. EAP_2WAP_SEX_AGE_RT, Labour force participation rate by sex and age -- ILO modelled estimates, Nov. 2021 (%), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_YTHADULT_YGE15" (Age +15)
9. Employment (number)
9.1. EMP_TEMP_SEX_AGE_NB, Employment by sex and age (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_YTHADULT_YGE15" (Age +15)
9.2. EMP_2EMP_SEX_AGE_NB, Employment by sex and age -- ILO modelled estimates, Nov. 2021 (thousands)
• Sex: SEX_T (total)
• Classif1: "AGE_YTHADULT_YGE15" (Age +15)
10. Employment to population ratio (employed / total population)
10.1. SL.EMP.TOTL.SP.ZS, Employment to population ratio, 15+, total (%) (modeled ILO estimate), WDI
10.2. SL.EMP.TOTL.SP.NE.ZS, Employment to population ratio, 15+, total (%) (national estimate), WDI
10.3. EMP_DWAP_SEX_AGE_RT, Employment-to-population ratio by sex and age (%), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_YTHADULT_YGE15" (Age +15)
10.4. EMP_2WAP_SEX_AGE_RT, Employment-to-population ratio by sex and age -- ILO modelled estimates, Nov. 2021 (%), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_YTHADULT_YGE15" (Age +15)
11. Unemployment rate (unemployed / labor force)
11.1. SL.UEM.TOTL.ZS, Unemployment, total (% of total labor force) (modeled ILO estimate), WDI
11.2. SL.UEM.TOTL.NE.ZS, Unemployment, total (% of total labor force) (national estimate), WDI
11.3. UNE_DEAP_SEX_AGE_RT, Unemployment rate by sex and age (%), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_AGGREGATE_TOTAL" (Total age)
11.4. UNE_2EAP_SEX_AGE_RT, Unemployment rate by sex and age -- ILO modelled estimates, Nov. 2021 (%), ILO
• Sex: SEX_T (total)
• Classif1: "AGE_YTHADULT_YGE15" (Age +15)
12. Agriculture (% of employment)
12.1. SL.AGR.EMPL.ZS, Employment in agriculture (% of total employment) (modeled ILO estimate), WDI
12.2. EMP_TEMP_SEX_ECO_NB, Employment by sex and economic activity (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "ECO_SECTOR_AGR" and "ECO_SECTOR_TOTAL" (agriculture and total)
12.3. EMP_2EMP_SEX_ECO_NB, Employment by sex and economic activity -- ILO modelled estimates, Nov. 2020 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "ECO_SECTOR_AGR" and "ECO_SECTOR_TOTAL" (agriculture and total)
13. Industry (% of employment)
13.1. SL.IND.EMPL.ZS, Employment in industry (% of total employment) (modeled ILO estimate), WDI
13.2. EMP_TEMP_SEX_ECO_NB, Employment by sex and economic activity (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "ECO_SECTOR_IND" and "ECO_SECTOR_TOTAL" (industry and total)
13.3. EMP_2EMP_SEX_ECO_NB, Employment by sex and economic activity -- ILO modelled estimates, Nov. 2020 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "ECO_SECTOR_IND" and "ECO_SECTOR_TOTAL" (industry and total)
14. Services (% of employment)
14.1. SL.SRV.EMPL.ZS, Employment in services (% of total employment) (modeled ILO estimate), WDI
14.2. EMP_TEMP_SEX_ECO_NB, Employment by sex and economic activity (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "ECO_SECTOR_SER" and "ECO_SECTOR_TOTAL" (services and total)
14.3. EMP_2EMP_SEX_ECO_NB, Employment by sex and economic activity -- ILO modelled estimates, Nov. 2020 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: "ECO_SECTOR_SER" and "ECO_SECTOR_TOTAL" (services and total)
15. Industry category (% of employment)
15.1. EMP_TEMP_SEX_ECO_NB, Employment by sex and economic activity (thousands), ILO
• Sex: SEX_T (total)
• Classif1: ECO_ISIC4_TOTAL & ECO_ISIC4_A - ECO_ISIC4_U or ECO_ISIC3_TOTAL and ECO_ISIC3_A - ECO_ISIC3_X
15.2. EMP_2EMP_SEX_ECO_NB, Employment by sex and economic activity -- ILO modelled estimates, Nov. 2020 (thousands), ILO
• Sex: SEX_T (total)
• Classif1: ECO_DETAILS_TOTAL & ECO_DETAILS_A - ECO_DETAILS_RSTU
16. Status in Employment (% of employment)
16.1. EMP_2EMP_AGE_STE_NB, Employment by age and status in employment (thousands), ILO
• Classif1: AGE_YTHADULT_YGE15
• frequency: A
17. Hourly wages
17.1.
EAR_HEES_SEX_OCU_NB, Mean nominal hourly earnings of employees by sex and occupation (local currency), ILO • Sex: SEX_T (total) • Classif1 == "OCU_ISCO88_TOTAL" (Total occupations) 17.2. NY.GDP.PCAP.CN, GDP per capita (current LCU), ILO • Times 2/3 (labor share), divided by 2080 (number of full time working-hours in a year, 40h/week*52weeks = 2080 h) 4.3. Guide to the GLD survey series quality checks 4.3.1. Overview The series checks aim to help users inspect graphically a set of surveys for a country over time to ensure they are coherent and consistent. The only file the user needs to use is the process_time_series.R file, consisting of seven steps, that are detailed here in the following. 4.3.2. Step 1 - Define user variables Step 1 is the only section requiring user input. The user ought to input the country ISO alpha-3 code as well as the variables that they want to inspect, provided they are categorical (as are the variables listed below). Note that the wage information is automatically included and thus needs not to be listed. Subsequently the user defines three paths, the first (`path_in`) points to the folder that contains all the surveys for that country. That is, the path to folder that contains the top level survey folders (i.e., folders of the "CCC_YYYY_[Survey_Name]" level). The second is to the folder where the user wants to store the graphs created for their inspection. The last path points to where the functions are stored (i.e., where all the files listed in this GLD directory starting with "function" are stored on the user's system). 
    # Enter country ISO Alpha 3 code
    country <- "[CCC]"

    # Enter variables that ought to be analysed (or leave standard variables)
    vars_to_study <- c("empstat", "educat7", "educat4", "industrycat10",
                       "industrycat4", "occup", "lstatus")
    # Note that wage will be included by default, no need to include here

    # Define the path to the folder holding the series
    path_in <- "[For example: Z:/GLD-Harmonization/123456_AZ/ZAF]"

    # Define the path to the folder where the graphs ought to be stored in
    path_out <- "[Path to output folder]"

    # Define the path to the folder where this code and the other functions
    # of the survey series checks are stored
    dir_w_functions <- "[Path to folder with functions]"

4.3.3. Step 2 - Call libraries, functions

Step 2 defines the packages that are needed, ensures they are installed, and loads them. It then runs the GLD functions for the survey series checks.

    # List packages that we need
    pkgs <- c("Hmisc", "tidyverse")

    # Check they are installed
    pkgs_2_install <- pkgs[!(pkgs %in% installed.packages())]

    # Install if not on system
    for (pkg in pkgs_2_install) {
      install.packages(pkg)
    }

    # Load packages
    for (pkg in pkgs) {
      library(pkg, character.only = TRUE)
    }

    # Source every file in dir_w_functions whose name starts with "function"
    purrr::walk(dir(dir_w_functions, pattern = "^function", full.names = TRUE), ~source(.x))

4.3.4. Step 3 - Load DF

Step 3 uses the function defined in function_load_df.R. It takes as input the path defined at the start. The function enters the path and extracts the latest harmonized dataset for each folder. That is, if the BRA_2018_PNADC folder has four harmonized versions (from V01_A to V04_A), the function selects the latest (here V04_A). It then extracts the necessary variables (e.g., age, weight, unitwage, ...) plus the variables defined at the start for inspection, and appends all data files into a single dataset. If the option wap_only is set to TRUE (as shown below), the data will be restricted to the working age population.
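Before the call itself, the version-selection rule described above can be illustrated with a short, self-contained sketch. This is not the code in function_load_df.R: the helper pick_latest_version() and the folder names are hypothetical, and the sketch assumes the harmonized version appears as a "Vxx_A" tag in each folder name.

```r
# Hypothetical helper (not part of the GLD functions): given version folder
# names, return the one with the highest harmonized version ("Vxx_A" tag).
pick_latest_version <- function(folder_names) {
  # Pull out the digits of the "Vxx_A" tag, e.g. "..._V04_A_GLD" -> 4
  versions <- as.integer(sub(".*V([0-9]+)_A.*", "\\1", folder_names))
  folder_names[which.max(versions)]
}

# Assumed folder names, for illustration only
folders <- c("BRA_2018_PNADC_V01_M_V01_A_GLD",
             "BRA_2018_PNADC_V01_M_V04_A_GLD",
             "BRA_2018_PNADC_V01_M_V02_A_GLD",
             "BRA_2018_PNADC_V01_M_V03_A_GLD")

latest <- pick_latest_version(folders)
print(latest)
```

The sketch compares the version tag as an integer rather than as text, so a tag such as V10_A would rank above V09_A regardless of how the folder names sort alphabetically. The actual call to load_df follows.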
    df <- load_df(path_in = path_in, wap_only = TRUE, vars_to_study = vars_to_study)

4.3.5. Step 4 - Make time series data frames

Step 4 makes a list of aggregated data for each variable in the vector vars_to_study provided by the user in Step 1, using the function defined in function_make_cat_ts.R. For each variable, the function calculates the shares per year and variable category (excluding NAs), so that the shares sum to 1 for each year. If the option employed is set to TRUE (as shown below), the calculations apply only to those who are employed (i.e., those with lstatus == 1). Note that the function automatically does not apply this restriction to the employed sub-sample if the variable passed is lstatus (as otherwise the result would carry no information).

    summary_df_list <- purrr::map(
      vars_to_study,
      ~make_cat_ts(df = df, var = .x, employed = TRUE))

4.3.6. Step 5 - Make wage time series data frame

Step 5 uses the function defined in function_make_wage_ts.R to calculate, for the wage employed (empstat == 1), the mean and median hourly wages as well as the 10th, 25th, 75th, and 90th percentiles for each year. It also stores the sample size of respondents for which there are answers for all variables used to calculate the hourly wage (unitwage, whours, and wage_no_compen).

    wage_df <- make_wage_ts(
      df = df,
      country = country)

4.3.7. Step 6 - Make time series plots

Step 6 creates a list of plots. First, the list of shares over the categorical variables from Step 4 is run to create a graph for each, using the function defined in function_plot_cat_ts.R. Then, an additional plot for the wage information created in Step 5 is added using the function defined in function_plot_wage_ts.R.

    summary_plots <- purrr::map(summary_df_list, ~plot_cat_ts(.x))

    # Add wage plot as the n_th plot
    n_th <- length(vars_to_study) + 1
    summary_plots[[n_th]] <- plot_wage_ts(wage_df = wage_df)

4.3.8. Step 7 - Save all summary plots

Step 7, the last step, takes the list of plots from Step 6 and stores them in the folder defined by path_out (given by the user in Step 1), using the function defined in function_save_ts_plots.R.

    purrr::walk(summary_plots, ~save_ts_plots(.x, out_folder = path_out, country = country))

5. Using the GLD

5.1. How to use the harmonized dta file

How to use a single harmonized dta file

A single file can be accessed, as discussed above, via the GLD server, the Microdata Library, or datalibweb. Each file represents the answers of individuals to the standard variables contained in the GLD dictionary. It can be used directly to calculate summary statistics (over the appropriate subgroups) and to run regressions. The variables of the first block contain all relevant variables (psu, ssu, strata, …) to correctly set the survey setting, with the exception of the finite population correction factor, e.g., the total number of primary sampling units from which the PSUs were selected. This information is not available to us, and users should either try to approximate the number or use caution when reading results that implicitly assume an infinite population.

How to access several harmonized dta files

Users may access several surveys at a time and append them, but should note that there are occasions when the format of a variable (namely the "_orig" variables) differs between files. The GLD has prepared a Stata tool for users with access to the server to load the latest files from any given country. All explanations, from how to install it to how to use it, are available here on the GLD GitHub repository.

5.2. How to use the harmonization code?

The harmonization code is designed for users to exploit it by amending and adding variables. To amend or edit, there are three sections to consider: first, amending the filenames and paths; second, amending or adding the variable(s) of interest; and third, amending the "Final steps Section" that cleans the dataset.
For the first part, users need to evaluate the paths and overwrite the ones laid out in subsection `1.2` (see the example in Figure 27 below, showing Step 1 of the harmonization code for the Brazilian 2020 PNADC) to ensure the files are read from and stored in the folders they wish to use.

Figure 27 - Step 1 of the harmonization code

When adding or amending variables, users can add the variable at any point after assembly (i.e., after section 1) and before the final steps. The GLD team recommends adding variables at the end of the section or subsection to which the variable belongs. That is, a user adding an education variable should do so after the last education variable, and a variable concerning the secondary job over the 12-month recall should be added after the last variable for that section. For example, the Pakistani 2020 LFS includes question 5.14 (Figure 28 below), which records the location where an activity is carried out.

Figure 28 - Excerpt of the 2020 PAK LFS questionnaire

As this question pertains to the main job over the 7-day recall period, it should be placed at the end of the relevant section, namely section 8.2 (see Figure 29 below).

Figure 29 - Indication of the recommended location for including an additional 7 day labor recall variable

The next step in this process is to add the variable. Users are free to do this as they wish, but the GLD team recommends keeping with the GLD format of wrapping variable code in tags, as shown below (Figure 30).

Figure 30 - Example of the code for an additional, non-standard variable

Once the variables have been added, they need to be included in the final section. This is because the final section not only drops variables that have missing answers for all observations but also reduces the data to only the variables in the standard dictionary and orders them accordingly.
Since, by definition, any additional variable is not part of the standard dictionary, it will not be part of the keeping and ordering sections (Figure 31 below). Users need to include it manually.

Figure 31 - Step 9 of the harmonization code

Once these three steps have been completed, the user will have created a customized output that expands on the GLD harmonization without having to go through all the previous steps needed to create a usable microdata file from the raw data. The harmonization is no longer a "take it or leave it" product but becomes a sandbox with a sandcastle you can easily transform.

5.3. How to cite the GLD

Citing the GLD is not as straightforward as citing an academic journal article. Imagine a GLD user who has analyzed, for example, the Brazilian PNADC, and used the harmonization code to build their data file but also added other elements. Moreover, the user read the Country Survey Details (CSD) and learned something about the survey they may have learned anyway, yet much more quickly because of the CSD. Under a table of summary statistics, should this user reference the GLD or the IBGE, the Brazilian statistics office, which runs and publishes the PNADC?

The GLD team thinks that in this case the IBGE has done the bulk of the work and ought to be named first. We propose to include in the citation that the data was mediated by the GLD, that is, that the data came via the GLD. Table 5 below is a mock example of a table with such a citation.

Table 5 - Example mock summary statistics table

Year   Country   # of concept A   % of concept B
2014   BRA       29,453           23.4
2015   BRA       29,209           24.5
…      …         …                …
2014   TZA       18,463           9.34
2015   TZA       18,796           8.57

Source: BRA PNADC, IBGE; …; TZA LFS, NBS via World Bank Global Labor Database

In the bibliography we propose the following citation:

World Bank Jobs Group. (n.d.). Global Labor Database. World Bank. Retrieved (Date of retrieval), from https://github.com/worldbank/gld.
The reader should be advised that the Date of retrieval should correspond not to the last day the website was viewed, but to the day the data was last downloaded. Suppose a user downloaded the data on the 10th of Month of Year and, on the 11th of that Month, data for the CCCC_YYYY_LFS was updated from CCCC_YYYY_LFS_V01_M_V02_GLD to CCCC_YYYY_LFS_V01_M_V03_GLD. The user's data still contains V02. Then, even though the website was last accessed on the 13th of Month, the effective date for the purposes of the reference should remain the 10th of Month.

6. Contributing to GLD's quality and expansion

This section describes how to collaborate with the GLD team, either by improving the current offering or by expanding it.

6.1. General rules to collaborate with GLD

The GLD team is open to any contribution or collaboration and appreciates your help, from pointing out a typo in a harmonization (Rio de Janeiro, not Rio de Janero) to providing us with a fully worked out harmonization your team had worked on anyway. The only rules are, first, to be mindful of each other's timelines and workstreams so that we set the right expectations about when a product can be delivered, and, second, to adhere to our Community Code of Conduct if you interact on our GitHub repository.

6.2. How to share new raw data with GLD

If you have a new survey not included in GLD that you would like us to collaborate on, please reach out to the GLD Focal Point. We kindly ask you to make us aware of all data access restrictions on the data you would like to share.

For data not in the public domain, we highly recommend the data be made available to all staff as development data for official use. The former concept means the data does not contain personal information like names or telephone numbers (age and sex are acceptable), the latter that the data can be shared with all World Bank colleagues without restriction.
If the data is provided by the producer (commonly the NSO) free of charge, you may use the World Bank's Data Acquisition Template (here under the heading Templates for Data Acquisition). If the data is purchased from the provider, the process depends on the amount. Below a certain threshold it can be acquired by your business unit; above that threshold the acquisition needs to go through Central Procurement. For more details and general support, please visit the (intranet) site of Central Procurement.

For restricted surveys, that is, ones that cannot be shared even within the World Bank, we can keep the survey stored on the GLD server, which is only accessible to the core team, and share it on a case-by-case basis. Moreover, via the datalibweb distribution system we can not only share it, likewise case-by-case, but also set you as the Task Leader for the data so that requests and the power of approval remain with you.

6.3. How to share a harmonization for a survey not covered by GLD

If you wish to share a harmonization for a survey not covered by GLD, first of all: thank you! That is the data equivalent of finding a banknote on the street. As such, we will try to accommodate you as best we can.

The easiest way to reach out to us is via an email to the GLD Focal Point. We can grant you temporary access to the core team GLD server where data can be exchanged. For users familiar with GitHub (or those who want to be), you may also "clone" or copy our repository, add your harmonization (following the GLD folder structure) and make a pull request. For more information, see details here on cloning as well as here on making a pull request.

6.4. How to collaborate with GLD on a new harmonization

The GLD team is happy to collaborate with you on a new harmonization, divide tasks and reduce duplication. You can reach us via an email to the GLD Focal Point or by raising a blank issue on the GitHub repository.
For more details on how to do this, please see the last sub-section of this section.

6.5. How to correct/ expand an existing harmonization

The GLD team is happy to rectify errors in the harmonization or add a variable we had not included previously. Despite our best efforts and extensive validation, some things fall through the cracks. Only extensive use of the data and your collaboration can add value at this stage.

For either issue you can email the GLD Focal Point directly. Additionally, you can raise an issue about this on our GitHub repository. This provides a standardized form to give feedback that we may be able to incorporate more quickly. To do so, you may navigate to github.com/worldbank/gld/issues or click on the issues tab on the GLD repository (see red square in Figure 32 below).

Figure 32 - Finding the Issues tab on the GLD GitHub repository

Users will then see the list of open issues and can, by clicking on the button on the top right, open a new issue (see red square in Figure 33 below).

Figure 33 - Example of how to raise a new issue on the GLD GitHub repository

Users now have five options at their disposal: the four templates listed, each with a green button stating "Get started" to initiate the issue, plus the option at the bottom (see red square in Figure 34 below) to open a blank issue, that is, one that has no prior formatting.

Figure 34 - Types of issues to raise on the GLD GitHub repository

For reporting an error in the code, we recommend choosing the "Code Correction Report" option (the first). This leads to the page shown below, where the user is provided with boxes that ensure the information can be used directly in our data updating process: which surveys the report applies to, a description of the error, the erroneous code, and (not shown in Figure 35 below) a box for a proposed corrected code.

Figure 35 - GLD code correction issue template

Users can also alert us of a bug in the code via issues.
There are two prepared formats for this. The first is the text-based alert (shown below in Figure 36). The user is requested to give information about the survey, describe the issue and, if possible, add the code that is causing the issue.

Figure 36 - GLD text-based bug report template

The alternative is the image/screenshot-based issue report (third option on the issues options list). The text box is pre-filled with indications of what kind of information would be most useful to the GLD team (see Figure 37 below).

Figure 37 - GLD image-based bug report template

6.6. How to correct and expand the Country Survey Details

There is no specific method to expand or correct the Country Survey Details (CSD). Users are asked to either reach out to the GLD Focal Point or raise a blank issue on GitHub detailing the problems with the CSD. How to do this is described in the last sub-section of this section.

Alternatively, users can make a pull request from a clone of the GLD repository where they have corrected or amended the text, so that we may review their request and integrate it. You can find details here on cloning as well as here on making a pull request.

6.7. How to correct and expand the GLD tools

The GLD team is happy to receive updates on the tools we have created as well as requests for new tools.

For corrections, users can reach out to the GLD Focal Point or raise a blank issue on GitHub (see next sub-section on blank issues). They may also, as detailed above, update information on a copy of the repository and propose to merge it into the GLD via a pull request.

For requesting new tools, in addition to emails and blank issues, there is also the "Feature Request" issue (Figure 38 below). The Description box guides users on what kind of information would be most helpful for us to create a tool that solves issues they may be encountering when using the GLD.

Figure 38 - GLD feature request template

6.8. How to communicate on any other issue

The GLD team encourages you to reach out on any issue. To inform us, you can either reach out via email or create a blank issue on GitHub. A blank issue can be created, after clicking on the "Issues" tab in the repository, by selecting the "Open a blank issue" option at the end of the list of issue templates (red box in Figure 39 below).

Figure 39 - Example of how to raise a blank issue on the GLD GitHub repository

Here the user is only requested to enter a title for their issue and has a blank box at their disposal to detail any issue they wish to communicate to us (see Figure 40 below).

Figure 40 - GLD blank issue template