Policy Research Working Paper 10673

Missing Evidence: Tracking Academic Data Use around the World

Brian Stacy, Lucas Kitzmüller, Xiaoyu Wang, Daniel Gerszon Mahler, Umar Serajuddin

Development Economics, Development Data Group
January 2024

A verified reproducibility package for this paper is available at http://reproducibility.worldbank.org.

Abstract

Data-driven research on a country is key to producing evidence-based public policies. Yet little is known about where data-driven research is lacking and how it could be expanded. This paper proposes a method for tracking academic data use by country of subject, applying natural language processing to open-access research papers. The model's predictions produce country estimates of the number of articles using data that are highly correlated with a human-coded approach, with a correlation of 0.99. Analyzing more than 1 million academic articles, the paper finds that the number of articles on a country is strongly correlated with its gross domestic product per capita, population, and the quality of its national statistical system. The paper identifies data sources that are strongly associated with data-driven research and finds that availability of subnational data appears to be particularly important. Finally, the paper classifies countries into groups based on whether they could most benefit from increasing their supply of or demand for data. The findings show that the former applies to many low- and lower-middle-income countries, while the latter applies to many upper-middle- and high-income countries.

This paper is a product of the Development Data Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp.
The authors may be contacted at bstacy@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Missing Evidence: Tracking Academic Data Use around the World

Brian Stacy, Lucas Kitzmüller, Xiaoyu Wang, Daniel Gerszon Mahler, and Umar Serajuddin1

Keywords: Data, academia, research, natural language processing
JEL codes: C45, C52, O30

1 Stacy, Wang, Mahler, and Serajuddin are with the World Bank's Development Data Group. Lucas Kitzmüller completed the work while at the European Bank for Reconstruction and Development (EBRD). Corresponding author: Brian Stacy (bstacy@worldbank.org). We are grateful for comments from Dean Jolliffe, Jishnu Das, Olivier Dupriez, and Patrick Brock. We acknowledge financial support from a World Bank Research Support Grant (P178728). The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Views presented are those of the authors and not necessarily of the EBRD.
1 Introduction

In recent decades, the amount of data produced has exploded, generating boundless opportunities for policies to improve people's lives (World Bank 2021). Though data can be valuable in their raw form, the full value of data is only realized when they are analyzed to create insights, and these insights are converted into public policies or increased accountability. Researchers have a vital role to play in this regard. Many researchers spend countless hours digesting data, using data to create new knowledge, and communicating this knowledge with the intent of shaping public discourse and public policies. There are numerous examples of data-driven analyses having real and important impacts on people's lives (Jolliffe et al. 2023). One example from Brazil explicitly looks at researchers' ability to influence policy outcomes. There, evidence from 2,150 municipalities found that informing municipal mayors of research findings on the effectiveness of a simple policy change increased the probability that their municipality implemented the policy by 10 percentage points (Hjort et al. 2021). Without research, there is a risk that the return of data to society will be reduced and that policies to improve lives will go unrealized.

Yet very little is known about where data-driven evidence is missing and how governments can best stimulate an evidence base for local decision makers. This paper attempts to fill these gaps by addressing two questions: (1) Which countries are the subject of research papers using data? (2) How can countries increase their national evidence base? We focus on data-driven research due to the increasing importance of data for policy making and the specific policies that are needed to increase the supply of and demand for data, such as boosting statistical capacity and improving data literacy.
To answer the first question, we introduce a new method for measuring data use in research articles based on 1 million English-language articles spanning 216 countries and various academic fields. These articles are made available by the Semantic Scholar Open Research Corpus (S2ORC), which has digitized millions of research papers worldwide and made their raw text accessible via APIs (Lo et al. 2020). With the aid of Amazon Mechanical Turk (MTurk) workers, we manually code 900 of these articles as using data or not, and on this labeled set we train a natural language model to predict the coding of the MTurk workers (Devlin et al. 2018). The model achieved an 87% out-of-sample accuracy rate, and when the articles are aggregated to the country level, the model had a correlation of 0.99 with the number of articles classified by MTurk workers. The model was then applied to 1 million academic articles from the S2ORC database from 2000 to 2020. The model estimates the amount of data-driven research on a country regardless of where the researchers may be located, not the amount of data-driven research by the citizens of a country. We argue that the former is the relevant quantity for understanding the evidence base available to national decision makers.

We find that data-driven research is strongly correlated with GDP per capita and population, which together account for around 75% of the variation across countries. High-income countries are the subject of nearly 50% of all papers using data despite representing only around 15% of the world's population, while low-income countries, comprising approximately 10% of the world population, account for only about 5% of articles using data.

To answer the second question – how countries can increase their national evidence base – we first establish that a country's statistical capacity is predictive of data research even after controlling for population and GDP, and for articles not using data (which we use as a proxy for the general research interest in the country).
To understand which part of a country's statistical capacity is most important for increasing data-driven research, we explore the data sources most related to academic data use. We find that the availability of geospatial data at the first administrative level is associated with a 1.1% increase in data use, that a population census in the past decade is associated with a 0.3% increase in data use, and that two or more labor force (agriculture) surveys over the past 10 years are associated with a 0.4% (0.2%) increase in data use. Though we are unable to establish these links causally, these are concrete data products that governments could supply to likely increase the evidence base at their disposal.

Boosting the supply of data is one way for a country to increase the data-driven research it is subject to; another is to increase the demand for its data. This is particularly relevant for countries that have already invested in relevant data products but are nonetheless the subject of relatively little data-driven research. These are cases where existing data are underutilized and where it may be relevant to make existing data more accessible to researchers and possibly boost data literacy in the country. To explore this distinction between boosting data supply and data demand, and building upon Porteous (2020), we classify countries into four groups: deserts have little data demand and little data supply, swamps have high data supply but little data demand, oases have high data demand but little data supply, and lakes have high data demand and high data supply. Nearly two-thirds of low-income countries and countries in Sub-Saharan Africa are oases, suggesting that these countries extract a relatively large amount of evidence from their data supply and have relatively little issue with lack of demand, but that they could benefit from increasing the data available to researchers.
By contrast, nearly half of countries in Europe are data swamps, suggesting a priority on increasing the use of existing data.

Previous research has highlighted gaps between countries in economic research output and noted that richer countries are the subject of more economic research. For instance, Robinson, Hartley, and Schneider (2006), Das et al. (2013), and Porteous (2020) examine which countries are studied most by economists using the EconLit database. Cameron, Mishra, and Brown (2016) and Sabet and Brown (2018) extend this to note that impact evaluations are highly uneven across countries as well. Phillips and Greene (2022) show that conflict research is biased towards Western countries, while Courtioux et al. (2022) show that academic research is highly related to public investments in scientific research. We contribute to the literature by applying NLP to improve our understanding of which countries are under-researched. Using NLP allows us to go beyond the existing literature in three ways: (1) scale up the sample size and look at all fields of interest rather than only economics, (2) identify papers that use data, which is key to understanding whether data demand or data supply could be explaining a country's lack of research, and (3) indicate measures countries can take to increase data research.

The remainder of the paper is structured as follows. Section 2 discusses our data sources, section 3 details our methodology, section 4 introduces a theoretical framework, section 5 presents our empirical results, section 6 conducts robustness checks, and section 7 concludes.

2 Data

Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English-language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF or LaTeX file is needed to extract key information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus: around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the years 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries' national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. The fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles. Due to the intensive computing resources required, a set of 1,037,748 articles was randomly selected from the 10 million articles in our restricted corpus. Summary statistics of the final sample of 1 million articles are available in Table 1.

Table 1. Summary Statistics of Article Corpus
Field               Published in      Data Use   Country Identified   Articles      Share of
                    Journal (1=yes)   (1=yes)    (1=yes)              (2000-2020)   Articles (%)
Business            0.56              0.64       0.30                 28,571        2.8
Economics           0.79              0.68       0.28                 62,241        6.0
Medicine            0.96              0.85       0.10                 840,920       81.0
Political Science   0.42              0.33       0.34                 26,185        2.5
Psychology          0.75              0.70       0.14                 44,191        4.3
Sociology           0.90              0.33       0.25                 35,640        3.4

3 Empirical Strategy

The empirical approach employed in this project utilizes text mining with natural language processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We discuss each of these in turn.

To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO 3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country's name is spelled in a non-standard way. The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify named entities in text, implemented with the spaCy Python library. The NER algorithm tags spans of text as named entities, and NER is used in this project to identify countries of study in the academic articles. spaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.

The second task is to classify whether the paper uses data.
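As a minimal sketch of the regular-expression country-matching step described above: the country list here is a small hypothetical subset; the actual pipeline uses the full set of ISO 3166 names and complements it with spaCy NER to catch variant spellings.

```python
import re

# Hypothetical subset of the ISO 3166 country-name list used in the paper.
COUNTRIES = ["Brazil", "India", "Kenya", "United States"]

# Word boundaries (\b) prevent partial matches: "India" is not matched
# inside "Indiana", because "a" is followed by another word character.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRIES)) + r")\b")

def countries_of_study(text: str) -> set:
    """Return the set of listed country names found in a title or abstract."""
    return set(PATTERN.findall(text))

abstract = "We study maize yields in Kenya and Brazil using household survey data."
print(sorted(countries_of_study(abstract)))  # ['Brazil', 'Kenya']
```

Returning a set naturally allows one article to be linked to several countries, matching the many-to-many mapping described above.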
A supervised machine learning approach is employed, where 3,500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service.2

2 A stratified sample of articles was taken for the training set: 15% of the articles were medical, 15% psychology, 25% economics, 25% political science, 10% business, and 10% sociology.

To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:

Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came. There are two classification tasks in this exercise: 1. Identifying whether an academic article is using data from any country. 2. Identifying from which country that data came.

For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that indicate a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported. After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.3

For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application.
In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper. We expect between 10 and 35 percent of all articles to use data.

An image of the screen facing the MTurk workers when classifying an article is presented in Figure A.1 in the Appendix. The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming the cost of $3 per article paid to the MTurk workers).

A model is next trained on the 3,500 labeled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh et al. 2019). We use PyTorch (Paszke et al. 2019) to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand-coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.

The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth.
This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People's Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.

3 We ended up not using this in the paper.

The model was able to predict whether an article made use of data with 87% accuracy, evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in Figure 1. The number of articles represents an aggregate total of research output using data for each country, summed over the corpus of papers that were not used to train the model. The Pearson correlation between the human raters and the NLP predictions is 0.996.

Figure 1. Comparison of Human Classifications of Data Use to NLP Predictions
Note: The horizontal axis shows the number of articles on a country using data as predicted by the natural language processing model. The vertical axis shows the number of articles on a country using data as classified by the human raters (MTurk workers).

To make the performance of the model more concrete, we consider the output returned for three example articles. These articles were not in the training set, so the model has not previously seen them. The first article, Perlman (2009), titled "The Legal Ethics of Metadata Mining," is a law essay examining the ethics of metadata mining. This article has data (or metadata) as its subject but does not actually use data to perform analysis, making it potentially difficult to classify automatically. In the authors' judgement, this article should not be classified as using data. The NLP model likewise concluded that this article did not use data.
Figure 2 shows words that the model viewed as indicating data use (in red) and as indicating that an article likely does not use data (in blue), using the SHAP (SHapley Additive exPlanations) package in Python (Lundberg and Lee 2017). While the model picked up keywords like "data" and "examined" (highlighted in red), indicating data use, the NLP algorithm also picked up other keywords, such as "legal" and "review", which reduced the predicted likelihood of using data. On balance, the predictions of the NLP model indicated the article did not use data. An alternative approach, such as simply flagging keywords like "data" in the article abstract, would have flagged this article as using data and given the incorrect classification.

The second article, De Pasquale, Sciacca, and Hichy (2017), titled "Italian validation of smartphone addiction scale short version for adolescents and young adults (SAS-SV)," was classified as using data. In the authors' judgement, this was a correct classification. The model picked up on keywords such as "sample", "showed", "scale", and numeric values, indicating data use (highlighted in red).

The third article, Smirnov and Stukova (2015), titled "Determinants of integration approach in the agrarian sphere development in contexts of transformation," was given a probability of using data of 0.67, which was below the threshold of 0.9 used to assign a label of data use. However, the model thought the article more likely than not to use data. In fact, the article does not use data, indicating that the model can be ambiguous about the status of an article.

Figure 2. Article Examples
(a) No Data Use (b) Data Use (c) Model Unsure
Note: The left panel is a sample from Perlman, Andrew M. "The Legal Ethics of Metadata Mining." Akron Law Review 43 (2009). The right panel is a sample from De Pasquale, Concetta, Federica Sciacca, and Zira Hichy.
"Italian validation of smartphone addiction scale short version for adolescents and young adults (SAS-SV)." Psychology 8, no. 10 (2017). The bottom panel is a sample from Smirnov, Anatoly A., and Irina V. Stukova. "Determinants of integration approach in the agrarian sphere development in contexts of transformation." Review of European Studies 7, no. 8 (2015): 8.

After applying the natural language processing model to around 1 million articles from the S2ORC corpus, we compare our estimated number of articles produced using data to previous estimates in the literature. Das et al. (2013) use a corpus of more than 76,000 empirical economics papers published between 1985 and 2005 to rank the academic output of countries using the EconLit database. While the estimates from our approach using the S2ORC database do not exactly overlap because of differences in the years covered and the subjects, the correlation between country output is still 0.62 (Figure A.2a). Porteous (2020) examines the production of economics journal articles from 54 African countries between 2000 and 2019 using the EconLit database. The correlation between country rankings using the approach in this paper and that of Porteous (2020) is 0.87 (Figure A.2b). A third comparison can be made to the counts of scientific and technical journal articles produced by the National Science Foundation (NSF) (National Science Board and National Science Foundation 2019). The NSF counts the scientific and engineering articles published in physics, biology, chemistry, mathematics, clinical medicine, biomedical research, engineering and technology, and earth and space sciences, drawing on a set of journals covered by the Science Citation Index (SCI) and Social Sciences Citation Index (SSCI). Comparing the NSF data with estimates from our NLP model for 2018 gives a correlation of around 0.9 (Figure A.2c).
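The classification threshold and country-level aggregation described in this section can be sketched end to end. All probabilities and counts below are hypothetical stand-ins for the fine-tuned model's actual output:

```python
import math
from collections import Counter

THRESHOLD = 0.9  # label "uses data" only when the model is at least 90% confident

def count_data_articles(predictions):
    """Aggregate (country, P(uses data)) pairs into per-country counts of
    articles classified as using data."""
    counts = Counter()
    for country, prob in predictions:
        if prob >= THRESHOLD:
            counts[country] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical model output for five articles: (country of study, P(uses data)).
preds = [("Kenya", 0.97), ("Kenya", 0.67), ("Brazil", 0.93),
         ("Brazil", 0.91), ("India", 0.99)]
counts = count_data_articles(preds)
print(counts["Kenya"], counts["Brazil"], counts["India"])  # 1 2 1

# Hypothetical per-country totals, human-coded vs. model-predicted:
human = [120, 45, 80, 10, 300]
model = [118, 50, 75, 12, 310]
print(round(pearson(human, model), 3))
```

The paper's reported agreement between human raters and the model (a Pearson correlation of 0.996 on per-country totals) is computed in the same way as the toy `pearson` helper above, just over the real aggregates.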
4 Theory

Building on a model introduced in Porteous (2022), we motivate the analysis of the relationship between the total amount of academic research using data and other development outcomes as follows. In a simple model with competitive members of academia, academics produce research using data, where the benefits of that research scale with the size of the population of the country. That is, research about countries with greater populations produces higher benefits. This can be motivated if researchers choose to produce output that affects as many individuals as possible, or if the prestige of academic work grows with the size of the country. As Porteous (2022) discusses, academics may also be interested in studying larger and more complex economies, as the set of topics available to study (sectors and the policy environment) scales with the income level of a country. Academic research shows diminishing returns, where the payoffs decline with the quantity of existing research. The marginal cost of additional research using data depends on a variety of factors, such as the stock of existing data available for research in the country and differences in the cost of field work across countries. Researchers then compete so that the net benefit of additional research is identical across countries, so that for any two countries $i$, $j$:

$$B(P_i, Y_i, R_i) - C(R_i, \theta_i) = B(P_j, Y_j, R_j) - C(R_j, \theta_j) \quad \forall \, i, j \qquad (1)$$

where $B(\cdot)$ is a benefits function, $C(\cdot)$ represents costs, $P$ is the population size, $Y$ is a measure of economic size, $R$ is the stock of economic research, and $\theta$ represents other factors affecting the cost of producing the research. Equation (1) motivates a model where the total amount of research in a country follows a Cobb-Douglas production function that depends on the factors listed above:

$$R = \frac{P^{\beta_1} Y^{\beta_2}}{\theta^{\beta_3}} \qquad (2)$$

Assuming that $\beta_1, \beta_2 \in (0,1)$, Equation (2) implies diminishing marginal returns to the size of the country and the size of the economy.
Additionally, as the factors affecting cost increase, the model predicts total research output using data to decrease. The model in log form is:

$$\log(R) = \alpha + \beta_1 \log(P) + \beta_2 \log(Y) - \beta_3 \log(\theta) + \varepsilon \qquad (3)$$

Equation (3) can be estimated using OLS, and the coefficients $\beta_1$, $\beta_2$, $\beta_3$ can be interpreted as the elasticities of academic output using data with respect to population size, economic output, and factors affecting the costs of gathering data. As a proxy for the cost of gathering data for research, we use the World Bank's Statistical Performance Indicators (SPI), a measure of the performance of statistical systems for 174 countries (Dang et al. 2023; Cameron et al. 2021). Scores on the SPI are on a scale of 0-100, where countries scoring near 100 have the best performing systems and countries scoring closer to zero have the lowest performing systems. Because the term $\beta_3$ in equation (3) is structured as the elasticity with respect to costs of data collection, and because SPI scores are formulated so that better performing systems score higher, the coefficient on the SPI score in a regression will have the opposite sign of $\beta_3$.

5 Results

Using the NLP model, the number of articles using data produced for each country is shown in Figure 3. In total, 140,025 articles could be identified with a particular country. The two countries with the largest number of papers using data are the United States (12,273 papers) and China (12,063). India, Australia, and Japan are third, fourth, and fifth with 6,481, 5,463, and 5,300 papers, respectively.

Figure 3. Number of Articles Using Data by Country of Subject (2000-2020)

Five countries are the subject of more than 25% of all academic output using data. However, these countries also make up more than 40% of the world's population. The top 25 countries make up more than 80% of output, while the bottom 50 countries are the subject of less than 1% of output. These shares match roughly the population size of each of these groups.

Figure 4.
Number of Articles Using Data by Income Group and Region
a) By income group b) By region
Note: Panel (a) shows the total number of articles classified as using data by income group. Panel (b) shows the number of articles classified as using data by region, per million people.

There is much more imbalance when looking at income groups (Figure 4a). High-income countries are the subject of nearly 50% of all papers using data from 2000-2020, despite making up only around 17% of the world's population. Despite making up around one-tenth of the world population, low-income countries account for only around 5% of articles using data. When looking by region (Figure 4b), Europe and Central Asia had the largest number of articles using data on a per capita basis, with around 430 quantitative articles per million people. North America is second with around 410 per million. South Asia is the subject of the fewest papers per million, with only around 59 articles per million persons, less than half that of Sub-Saharan Africa, the region with the second-lowest number of quantitative articles per million. Table A.1 in the Appendix contains the total number of articles using data, as well as the total per million, at the country level.

5.1 Relationships to Development Outcomes

We estimate equation (3) using a dataset containing 167 countries with data available on all key variables. The number of articles is the three-year average count of all articles produced using data for each country between 2017 and 2019. A three-year average is used rather than a single-year value for 2019 because the production of articles can be volatile in a single year. Population, GDP per capita (PPP), and SPI scores are all for the year 2019. There is a strong relationship between the number of articles produced using data and GDP per capita, population, and the overall SPI score of a country.
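Because equation (3) is a log-log specification, its coefficients can be read as elasticities. A small helper (illustrative only, not part of the paper's code) makes that reading concrete:

```python
def implied_pct_change(elasticity: float, pct_change_covariate: float) -> float:
    """Exact percent change in predicted article counts implied by a log-log
    elasticity when a covariate changes by pct_change_covariate percent."""
    return ((1 + pct_change_covariate / 100) ** elasticity - 1) * 100

# With an elasticity of 0.5, a 10% rise in the covariate implies roughly a
# 5% rise in articles (the familiar linear approximation 0.5 * 10% = 5%):
print(round(implied_pct_change(0.5, 10), 1))   # 4.9
print(round(implied_pct_change(0.67, 10), 1))  # 6.6
```

The exact figure (4.9%) differs slightly from the linear approximation (5%) because the elasticity applies to log changes; for small percentage changes the two are nearly identical.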
Bivariate regressions of the log number of papers using data on log GDP, log population, and the SPI overall score are shown in Figure 5.

Figure 5. Relationship between Papers Using Data and Development Outcomes
(a) GDP per capita (b) Population (c) SPI Overall Score
Note: Data on GDP, population, and SPI are retrieved from the World Bank's World Development Indicators.

For GDP per capita, the figure indicates an elasticity of 0.5, meaning that a 10% increase in GDP is associated with a 5% increase in academic articles using data. The elasticity with respect to population is 0.67, and population alone explains 59% of the variation in academic articles using data. The overall SPI score is also predictive of academic output using data. A 10-point increase in SPI scores, which is approximately the same as moving from the median SPI score to the 65th percentile, translates into around a 0.6% increase in the number of articles using data.

Table 2 shows coefficients from a regression of the log number of academic papers using data on log GDP per capita, log population, and the SPI scores. The first column shows a specification including just log GDP per capita and log population. These two indicators alone explain roughly 75% of the variation in the production of articles using data. The second column includes the SPI overall score as an additional control. The performance of a country's statistical system is associated with greater academic output. Conditional on GDP per capita and population, a ten-point increase in SPI overall scores (on a scale of 0-100) translates into a 0.2% increase in academic output. It is possible that omitted variables are behind this result. For example, it could be the case that countries that are of general interest to the research community also happen to have high statistical performance. Das et al.
(2013) and Porteous (2022) control for a range of variables that may influence this relationship, such as whether English is a first language and international tourist arrivals. We leverage our estimate of the number of articles not using data (referred to as qualitative articles) to control for all factors likely to influence a country's overall research output that are unrelated to statistical performance. The number of qualitative articles is calculated as the total number of articles minus the articles classified as using data by our NLP model. Countries that tend to be over-represented in quantitative articles also tend to be over-represented in qualitative articles, conditional on GDP per capita and population (Figure A.3). The elasticity is around 0.6, suggesting that a 1% increase in the number of qualitative articles is associated with a 0.6% increase in the number of articles using data. We take this to imply that quantitative articles do not crowd out qualitative articles and vice versa. Returning to equation (2), one prediction from the model is that countries with particularly high costs of using data (such as those with a lower statistical performance score) should see reductions in the amount of quantitative research, but not necessarily lower levels of qualitative research. This would predict that the share of quantitative articles in total articles increases with the SPI score. Contrary to this, there is almost no relationship between SPI overall scores and the share of qualitative papers in total papers (Figure A.4). This suggests to us that some omitted variable may in part be driving the impact of statistical performance on academic articles using data, and that the number of qualitative articles can control for such potentially omitted factors. The third column of Table 2 additionally controls for the log number of qualitative articles.
The SPI overall scores are still statistically significant at the 5% level, as is country population. Column 4 examines the relationship with one of the pillars of the SPI, data sources, which covers the availability of recent censuses, surveys, academic data, and geospatial data. The data sources index from the SPI is also correlated with data use in academia, conditional on log GDP, population, and the number of qualitative papers. In similar regressions, the other pillars of the SPI, such as data services and data infrastructure, are not associated with the number of articles using data, suggesting that it is the availability of particular data sources that helps boost academic data use.

Table 2. Relationships between Number of Papers Using Data and Statistical Performance

                             (1)        (2)        (3)        (4)
(Intercept)              -12.41***  -10.66***   -3.96***   -3.57**
                          (0.82)     (0.95)     (1.14)     (1.17)
Log GDP per capita         0.58***    0.36***    0.12+      0.10
                          (0.05)     (0.07)     (0.07)     (0.08)
Log Population             0.70***    0.63***    0.29***    0.29***
                          (0.03)     (0.04)     (0.05)     (0.05)
SPI Overall Score                     0.02***    0.01*
                                     (0.01)     (0.01)
Log Qualitative Papers                           0.64***    0.65***
                                                (0.08)     (0.08)
SPI Data Sources Score                                      0.01*
                                                           (0.00)
Observations                168        168        168        168
R2                          0.751      0.774      0.852      0.852
R2 Adj.                     0.748      0.770      0.849      0.849
AIC                         414.8      400.1      331.1      330.7
BIC                         424.2      412.6      346.8      346.3
RMSE                        0.82       0.78       0.63       0.63

Note: GDP, population, and SPI data are retrieved from the World Bank's World Development Indicators. Regressions include all papers using data from 2017-2019. Standard errors are robust to heteroskedasticity. ***=0.001 level, **=0.01 level, *=0.05 level, +=0.1 level.

Using more detailed data from the SPI makes it possible to assess how the availability of specific data sources relates to academic output. Figure 6 shows the linear regression coefficient for the availability of each of the ten data sources considered by the SPI, conditional on log GDP per capita, log population, and the log of qualitative papers.
The ten data sources include: population census, agriculture census, business/establishment census, household consumption/income survey, agriculture survey, labor force survey, health survey, business/establishment survey, civil registration and vital statistics system (CRVS), and geospatial data at the Admin 1 (usually province/state) level.

Figure 6. Relationship between Number of Papers Using Data and Data Sources. Note: Results from cross-sectional regressions controlling for log population, log GDP per capita, and log number of qualitative papers. GDP and population data are retrieved from the World Bank's World Development Indicators, while the data source variables are from the Statistical Performance Indicators. The sample includes all papers using data between 2017 and 2019. Confidence intervals are at the 95% level. Full regression results are presented in Table A.2.

The regression estimates indicate that the most important source is data availability at the Admin 1 level, which is associated with a 1.3% increase in academic output using data. The availability of a population census in the past 10 years is associated with a 0.3% increase in papers using data, which is significant at the 10% level. The availability of two or more labor force (agriculture) surveys over the past 10 years is associated with a 0.4% (0.2%) increase in data use, both significant at the 5% level. Though the magnitude of these effects may appear small, they likely do not reflect the full potential of investing in data sources, as some of the countries that have the relevant data sources see low use of their products for a variety of reasons. To explore this further, it is instructive to consider both the supply of and demand for data.
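The per-source regressions behind Figure 6 can be sketched as follows. The availability dummies, effect sizes, and sample below are synthetic placeholders rather than SPI data; each loop iteration mirrors one column of Table A.2 (one data-source indicator plus the three controls).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 168  # approximate sample size of the cross-sectional regressions

# Illustrative controls (synthetic values, not the paper's data).
log_gdp = rng.normal(9.0, 1.0, n)
log_pop = rng.normal(16.0, 1.5, n)
log_qual = rng.normal(4.0, 1.0, n)

# Hypothetical availability dummies for a few of the ten SPI data sources.
sources = {
    "population_census": (rng.random(n) < 0.7).astype(float),
    "labor_force_survey": (rng.random(n) < 0.5).astype(float),
    "admin1_geospatial": (rng.random(n) < 0.6).astype(float),
}

# Synthetic outcome with an assumed 0.3 log-point association per source.
log_quant = (0.2 * log_gdp + 0.3 * log_pop + 0.65 * log_qual
             + sum(0.3 * d for d in sources.values())
             + rng.normal(0, 0.5, n))

# One regression per data source, conditional on the controls (as in Figure 6).
coefs = {}
for name, dummy in sources.items():
    X = np.column_stack([np.ones(n), log_gdp, log_pop, log_qual, dummy])
    beta, *_ = np.linalg.lstsq(X, log_quant, rcond=None)
    coefs[name] = beta[4]  # conditional association of this source
```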
5.2 Data Supply versus Data Demand: A Story of Data Deserts, Oases, Swamps, and Lakes

The prior analysis suggested that countries can increase the number of data-driven academic articles of which they are the subject by increasing their supply of relevant data sources. Yet some countries may already be performing quite well in terms of the data supply they produce. For such countries, the primary obstacle preventing more data research may not be the supply of data but the use of existing data. There are many reasons why existing data may not be used, including data being inaccessible to researchers and data being difficult to understand and process, for example due to a lack of metadata (World Bank 2021). Other reasons may be related to a lack of data literacy among researchers, low trust in data from the national statistical system, and a lack of infrastructure to access and use the data, which may be a particular issue in low- and middle-income countries. The policies needed to boost data supply differ from those needed to boost data demand. Boosting the data supply of the national statistical system requires more financing for national statistical offices, better technical capacity among their staff, and at times better statistical laws to ensure their independence from other government bodies. Policies to boost data demand, on the other hand, involve financially cheap (but not necessarily politically cheap) wins, such as making data more accessible to researchers. They also involve increasing the data literacy of academics. Given the non-overlapping policy responses, it is useful for countries to know whether data demand or data supply is their weakest link in the data value chain. Below, we classify countries by whether they are primarily constrained by data supply, data demand, neither, or both.
Building on the terminology introduced by Porteous (2020), who classified countries into research deserts and research oases, we classify countries into four groups: data deserts have little data demand and little data supply; data swamps have high data supply but little data demand; data oases have high data demand but little data supply; and data lakes have high data demand and high data supply. For assessing data supply, we again turn to the availability of various data sources, such as recent population and business censuses, household and health surveys, Civil Registration and Vital Statistics (CRVS), and sub-national geospatial data (for details see Dang et al., 2023). We measure data use as the residual from a regression of a country's number of articles using data on log population, log GDP per capita, and the average number of qualitative papers. The control for the number of qualitative papers once again serves as a proxy for general research interest and a research environment unrelated to data. We classify countries by whether they are below or above the median of our indicators of data supply and data use (Figure 7). Since we use median values in the two dimensions to create the categories, roughly equal numbers of countries belong to each group. Yet there are large discrepancies by region and income group (Table 3). The poorest countries and poorest regions are more likely to be data deserts or data oases. These categories apply to more than 80% of low- and lower-middle-income countries, 93% of countries in Sub-Saharan Africa, and only 7% of high-income countries. Thus, poorer countries and poorer regions tend to have lower data supply. Low-income countries and countries in Sub-Saharan Africa are twice as likely to be oases as deserts. In fact, when looking at the map, it is evident that a large part of the world's data oases are to be found in Sub-Saharan Africa.
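A minimal sketch of this median-split classification, with synthetic stand-ins for the two dimensions (the residualized data-use measure and the SPI data sources score):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative inputs (synthetic): `data_use` stands in for the residual from
# regressing log data-article counts on log GDP per capita, log population,
# and log qualitative articles; `data_supply` stands in for the SPI data
# sources (Pillar 4) score.
n = 12
data_use = rng.normal(0.0, 1.0, n)
data_supply = rng.uniform(0.0, 100.0, n)

use_high = data_use >= np.median(data_use)
supply_high = data_supply >= np.median(data_supply)

# Median splits in both dimensions give the four groups.
labels = np.select(
    [~use_high & ~supply_high,   # low demand, low supply
     ~use_high & supply_high,    # low demand, high supply
     use_high & ~supply_high,    # high demand, low supply
     use_high & supply_high],    # high demand, high supply
    ["data desert", "data swamp", "data oasis", "data lake"],
    default="unclassified",
)
```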
This suggests that, despite data scarcity, a significant volume of quantitative articles is already produced on these countries once their population size, income, and attention in non-quantitative scholarship are factored in. These countries may benefit most from bolstering their data supply. This applies, for example, to Uganda, Ghana, Nepal, and Malawi. By contrast, nearly all upper-middle-income countries and countries in Latin America that have low data supply are deserts rather than oases. That is, they face both data supply and data use constraints. This concerns countries such as Libya, Turkmenistan, and El Salvador. High-income countries and countries in East Asia & the Pacific are more likely to be a lake than a swamp. By contrast, upper-middle-income countries and countries in Europe and North America are more likely to be a swamp than a lake. Hence, the latter countries may have successfully improved data supply but are lacking in data use. This concerns countries such as the Russian Federation, the UK, France, Germany, and the U.S. The Scandinavian countries are all classified as data lakes, meaning they have high supply and high use of their data.

Figure 7. Relationship between Data Use and Data Supply. a) Country Scatterplot; b) Map. Note: Panel (a) plots on the vertical axis the residuals from a regression of the average number of data papers between 2017 and 2019 on log GDP, log population, and the log number of papers not using data (qualitative papers). The horizontal axis plots SPI Pillar 4, which tracks the availability of recent censuses, surveys, Civil Registration and Vital Statistics, and subnational data (for details see Dang et al., 2023). Red dashed lines represent median values.

Table 3.
Data Deserts, Oases, Lakes, and Swamps by Region and Income Group

                               Deserts  Oases  Swamps  Lakes  Number of countries
Region
  East Asia & Pacific            29%     12%     6%     53%          17
  Europe & Central Asia          11%      0%    58%     31%          45
  Latin America & Caribbean      37%     11%    21%     32%          19
  Middle East & North Africa     33%     22%    11%     33%          18
  North America                   0%      0%   100%      0%           2
  South Asia                     17%     50%    17%     17%           6
  Sub-Saharan Africa             31%     62%     5%      3%          39
Income group
  Low-income                     35%     65%     0%      0%          20
  Lower-middle income            31%     49%    11%      9%          45
  Upper-middle income            33%      0%    44%     22%          36
  High-income                     7%      0%    38%     56%          45
Total                            25%     24%    26%     25%         146

Note: Share of countries within a region or income group that are classified as data deserts, oases, swamps, or lakes. For example, 29% of the 17 countries in East Asia & Pacific are classified as data deserts.

In Figure A.5 we show the classification when using the number of quantitative articles per capita (unadjusted for GDP or the number of qualitative articles) as the measure of data use. With this classification, high-income, often OECD, countries are doing well in generating articles using data. Many of the data deserts are low-income or conflict-affected countries. While the per capita measure is useful for an overview of how countries are performing in producing articles using data, it is not always useful as a policy guide, given that it is partially driven by GDP per capita and research interests, which are often outside the control of policy makers in the statistical system.

6 Robustness Checks

In this section we offer some robustness checks of our main results. First, we estimate our main regressions using country fixed effects and year dummy variables rather than controlling for qualitative articles. Second, we investigate whether our results hold if we exclude medical papers.

6.1 Panel Analysis

Our analysis may be subject to omitted variable bias if countries with high statistical capacity share commonalities unrelated to their number of qualitative articles.
Below, rather than controlling for the number of qualitative articles, we add country fixed effects and year dummy variables to our OLS regressions, so that only within-country variation over time is used to identify our main relationships. Standard errors are clustered at the country level. Note that this specification may remove relevant variation if some countries have a constantly high output of papers using data due to constantly high statistical capacity, and vice versa. Using this specification, the relationship between academic output using data and log GDP per capita is similar to that previously estimated, with an elasticity between around 0.4 and 0.6 depending on the specification (Table 4). The relationship with population is not statistically significant under this specification in most cases. The SPI overall scores are not statistically significant either under this specification, but because the SPI overall scores are only available starting in 2016, the time series is relatively short. To overcome this shorter time series, an "extended SPI" index is created that incorporates data from the older Statistical Capacity Indicator (SCI), which the World Bank has produced since around 2004.⁴ Using the longer time series in the extended SPI, a 10-unit increase in the extended SPI overall score is associated with an increase in academic output using data of around 0.1%. The estimated coefficient is statistically significant at the 5% level and similar in magnitude to the cross-sectional results that control for the number of qualitative articles. The extended SPI data sources score is also statistically significant at the 5% level. Table 4.
Longitudinal Relationships between Number of Papers Using Data and Statistical Performance

                                           (1)      (2)      (3)      (4)      (5)      (6)
Log GDP per capita                       0.58**                      0.51    0.58**   0.62**
                                         (0.20)                     (0.33)   (0.22)   (0.22)
Log Population                                    0.18              1.09     0.40     0.38
                                                 (0.35)            (0.69)   (0.37)   (0.37)
SPI Overall Score                                          0.00     0.00
                                                          (0.00)   (0.00)
SPI Overall Score (Extended Series)                                          0.01+
                                                                            (0.00)
SPI Data Sources Score (Extended Series)                                              0.00+
                                                                                     (0.00)
Observations                              1974     1974     661      661     1974     1974
R2                                       0.957    0.956    0.991    0.991    0.957    0.958
R2 Adj.                                  0.953    0.952    0.988    0.988    0.954    0.954
R2 Within                                0.025    0.001    0.001    0.013    0.036    0.039
R2 Within Adj.                           0.025    0.000   -0.001    0.007    0.035    0.037
AIC                                     1571.9   1621.2   -280.9   -285.1   1554.1   1548.5
BIC                                     2421.3   2470.6    492.0    496.8   2414.6   2409.0
RMSE                                      0.33     0.34     0.15     0.15     0.33     0.33

Note: Data are from the World Bank's World Development Indicators (WDI) and the SPI. Papers include all papers using data for the years 2004-2019. The SPI Extended Series supplements SPI data with data from the Statistical Capacity Indicator (SCI) to extend the series back to 2004. All regressions include country and year fixed effects, with standard errors clustered at the country level. ***=0.001 level, **=0.01 level, *=0.05 level, +=0.1 level.

⁴ The SCI does not contain as many indicators as the SPI (25 indicators versus 51 in the SPI), and not all the indicators are similar. Eleven indicators in the SCI overlap with the SPI: national accounts base year, consumer price index, balance of payments manual in use, government finance accounting, population census, agriculture census, health surveys, household consumption/income survey, CRVS availability, and the IMF SDDS status of the country. Using these indicators, the extended SPI series contains 35 indicators with data back to 2004. These include 5 indicators covering data use, 1 covering data services, 16 covering data products, 8 covering data sources, and 4 covering data infrastructure. The correlation between the extended SPI and the official SPI overall score is around 0.90.
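The fixed-effects specification above can be sketched as follows, using a synthetic panel and plain least squares with explicit country and year dummies (the clustered standard errors used in Table 4 are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
n_countries, n_years = 30, 16  # illustrative panel, roughly 2004-2019

country = np.repeat(np.arange(n_countries), n_years)
year = np.tile(np.arange(n_years), n_countries)
N = country.size

# Synthetic log GDP per capita: country level + common trend + within noise.
log_gdp = (rng.normal(9.0, 1.0, n_countries)[country]
           + 0.02 * year
           + rng.normal(0, 0.3, N))
# Synthetic outcome with an assumed within-country elasticity of 0.5.
log_articles = 0.5 * log_gdp + rng.normal(0, 0.3, N)

# Two-way fixed effects via dummies: country effects absorb level differences
# and year effects absorb common shocks, so only within-country variation
# over time identifies the GDP coefficient.
D_country = np.eye(n_countries)[country]
D_year = np.eye(n_years)[year][:, 1:]  # drop one year to avoid collinearity
X = np.column_stack([log_gdp, D_country, D_year])
beta, *_ = np.linalg.lstsq(X, log_articles, rcond=None)
gdp_within = beta[0]
```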
Using a methodology similar to that above, which combines SPI data with indicators available from the older SCI to create an extended data series back to 2004, Figure 8 shows partial regression coefficients from the panel regression. Using the panel regressions with country and year fixed effects, the coefficient for labor force surveys remains statistically significant at the 10% level. The coefficient, however, shrinks from around 0.35 in the cross-sectional regression to close to 0.15.

Figure 8. Panel Relationships between Number of Papers Using Data and Data Sources. Note: Results from panel regressions controlling for year and country fixed effects, log population, and log GDP. Data are from the World Bank's World Development Indicators (WDI) and the Statistical Performance Indicators. The sample includes all papers using data between 2000 and 2020. Confidence intervals are at the 95% level.

6.2 Excluding Medical Articles

As noted previously, medical articles make up around 81% of all articles in our corpus. A concern could be that medical articles receive too much weight in our calculation of country scores. As a check, we compare our main results when dropping all medical articles. Except for countries with very few articles classified as using data, there is a very high correlation between the measures based on all fields and the measures excluding medicine (Figure 9). This makes us comfortable that the high share of medical articles is not driving our results.

Figure 9. Comparison between All Papers Using Data and Non-Medical Papers Using Data

Figure 10 shows the relationship between the number of articles per country using data for each pair of subjects between 2000 and 2020. In all cases, academic output using data is strongly correlated across subjects. The correlation between the number of articles using data in medicine and in economics is around 0.94.
The subject pair with the greatest correlation in articles using data is Political Science and Sociology, with a correlation close to 0.97. Economics and Psychology have the lowest correlation (0.74).

Figure 10. Correlation in Papers Using Data across Subjects, 2000-2020.

7 Conclusion

This paper suggests a method for measuring data use by researchers by classifying academic articles using natural language processing (NLP). Our NLP model can classify whether articles use data with 87% accuracy and, when aggregated to the country level, matches human-rated country-level classifications of academic papers using data with a correlation of around 0.99. The obtained classification correlates strongly with alternative measures in the literature. We apply the NLP model to around 1 million academic articles. We find that high-income countries are the subject of a disproportionate share of academic output using data relative to their population: high-income countries account for around 50% of all articles while making up only 17% of the world population, whereas low-income countries are the subject of only 5% of academic output using data while making up 10% of the global population. GDP per capita and population are strong predictors of academic output, explaining around 75% of the variation between countries. We also find that statistical performance, even after conditioning on population and GDP per capita, is a strong predictor of academic output using data, particularly the availability of data sources. We explore the data sources that are associated with data use and find population censuses, geospatial data, agricultural surveys, and labor force surveys to be the most relevant. Finally, we classify countries into four clusters, indicating whether they could most benefit from extending the availability of core data products or from making existing data more accessible to researchers, for example through data-sharing agreements or by boosting data literacy.
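The country-level aggregation behind the 87% accuracy and 0.99 correlation figures can be illustrated with simulated labels; all numbers below (country shares, the 40% base rate, the 13% error rate) are synthetic assumptions, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(4)
n_articles, n_countries = 20000, 50

# Countries of very different sizes, so expected counts vary across countries.
w = np.exp(rng.normal(0.0, 1.5, n_countries))
countries = rng.choice(n_countries, size=n_articles, p=w / w.sum())

truth = rng.random(n_articles) < 0.40   # article truly uses data (assumed rate)
flip = rng.random(n_articles) < 0.13    # classifier errs on ~13% of articles
pred = np.where(flip, ~truth, truth)

true_counts = np.bincount(countries, weights=truth.astype(float), minlength=n_countries)
pred_counts = np.bincount(countries, weights=pred.astype(float), minlength=n_countries)

accuracy = (pred == truth).mean()
corr = np.corrcoef(true_counts, pred_counts)[0, 1]
# Article-level errors largely average out within countries, so country-level
# counts correlate far more strongly than the per-article accuracy suggests.
```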
Our results can help countries better understand whether their statistical system is supporting the needs of researchers. Given that research is at the heart of evidence-based policy making, a statistical system oriented around making data available and easily accessible for these purposes is important. Countries could use the results to assess the maturity of their statistical systems in supporting researchers, to identify other countries they could learn from as they seek to improve, and to track progress or assess the return on investment of funding for statistical capacity building. Our results face some limitations. First, the S2ORC database does not currently contain non-English-language articles. While the dominance of the English language in academic publishing has been well documented (Altbach 2007), the database omits articles published in other languages. Second, because of computational limitations, only 1 million articles were analyzed out of a sampling frame of around 10 million articles. Future work could bring in these additional articles to improve precision. Third, while the natural language processing model was able to classify whether or not articles made use of data with 87% accuracy, the model could be extended to extract more information from the articles, including more details about the subject of study and which particular data sources were used (surveys, censuses, administrative data, geospatial information, etc.).

References

Altbach, Philip G. 2007. "The Imperial Tongue: English as the Dominating Academic Language." Economic and Political Weekly: 3608-3611.
Cameron, Drew B., Anjini Mishra, and Annette N. Brown. 2016. "The Growth of Impact Evaluation for International Development: How Much Have We Learned?" Journal of Development Effectiveness 8 (1): 1-21.
Cameron, Grant J., Hai-Anh H. Dang, Mustafa Dinc, James Foster, and Michael M. Lokshin. 2021.
"Measuring the Statistical Capacity of Nations." Oxford Bulletin of Economics and Statistics 83 (4): 870-896.
Courtioux, Pierre, François Métivier, and Antoine Rebérioux. 2022. "Nations Ranking in Scientific Competition: Countries Get What They Paid For." Economic Modelling 116: 105976.
Dang, Hai-Anh H., John Pullinger, Umar Serajuddin, and Brian Stacy. 2023. "Statistical Performance Indicators and Index: A New Tool to Measure Country Statistical Capacity." Scientific Data 10 (1): 146.
Das, Jishnu, Quy-Toan Do, Karen Shaines, and Sowmya Srikant. 2013. "US and Them: The Geography of Academic Research." Journal of Development Economics 105: 112-30.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." arXiv. https://doi.org/10.48550/ARXIV.1810.04805.
Hansen, Stephen, Michael McMahon, and Andrea Prat. 2018. "Transparency and Deliberation within the FOMC: A Computational Linguistics Approach." The Quarterly Journal of Economics 133 (2): 801-70.
Hjort, Jonas, Diana Moreira, Gautam Rao, and Juan Francisco Santini. 2021. "How Research Affects Policy: Experimental Evidence from 2,150 Brazilian Municipalities." American Economic Review 111 (5): 1442-80.
Jolliffe, Dean, Daniel Gerszon Mahler, Malarvizhi Veerappan, Talip Kilic, and Philip Wollburg. 2023. "What Makes Public Sector Data Relevant for Development?" World Bank Research Observer 38 (2): 325-346.
Kleinberg, Bennett, Maximilian Mozes, Arnoud Arntz, and Bruno Verschuere. 2018. "Using Named Entities for Computer-Automated Verbal Deception Detection." Journal of Forensic Sciences 63 (3): 714-23.
Lo, Kyle, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. "S2ORC: The Semantic Scholar Open Research Corpus." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4969-83. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.447.
National Science Board, National Science Foundation. 2019. "Publications Output: US Trends and International Comparisons. Science & Engineering Indicators 2018." National Science Foundation.
Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. "PyTorch: An Imperative Style, High-Performance Deep Learning Library." arXiv. https://doi.org/10.48550/ARXIV.1912.01703.
Phillips, Brian J., and Kevin T. Greene. 2022. "Where Is Conflict Research? Western Bias in the Literature on Armed Violence." International Studies Review 24 (3): viac038.
Porteous, Obie. 2020. "Research Deserts and Oases: Evidence from 27 Thousand Economics Journal Articles on Africa." Oxford Bulletin of Economics and Statistics.
Robinson, Michael D., James E. Hartley, and Patricia Higino Schneider. 2006. "Which Countries Are Studied Most by Economists? An Examination of the Regional Distribution of Economic Research." Kyklos 59 (4): 611-26.
Sabet, Shayda Mae, and Annette N. Brown. 2018. "Is Impact Evaluation Still on the Rise? The New Trends in 2010-2015." Journal of Development Effectiveness 10 (3): 291-304.
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. "DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter." arXiv preprint arXiv:1910.01108.
Shelar, Hemlata, Gagandeep Kaur, Neha Heda, and Poorva Agrawal. 2020. "Named Entity Recognition Approaches and Their Comparison for Custom NER Model." Science & Technology Libraries 39 (3): 324-37.
Smirnov, Anatoly A., and Irina V. Stukova. 2015. "Determinants of Integration Approach in the Agrarian Sphere Development in Contexts of Transformation." Review of European Studies 7 (8): 8.
World Bank. 2021. "World Development Report 2021: Data for Better Lives." Washington, DC: World Bank. https://wdr2021.worldbank.org/
Tian, Yu. 2020. "The Partner Report on Support to Statistics: PRESS 2020." PARIS21.
https://paris21.org/sites/default/files/inline-files/PRESS2020%20Final.pdf.

Appendix

Countries shaded in dark orange have the lowest numbers of data use articles; countries in dark green have the highest. Countries are grouped into five groups:
• Top Quintile: Countries above the 80th percentile. Shading is in dark green.
• 4th Quintile: Countries above the 60th percentile but at or below the 80th percentile. Shading is in light green.
• 3rd Quintile: Countries between the 40th and 60th percentiles. Shading is in yellow.
• 2nd Quintile: Countries above the 20th percentile but at or below the 40th percentile. Shading is in light orange.
• Bottom Quintile: Countries in the bottom 20%. Shading is in dark orange.

Table A.1. Country Production of Articles Using Data, Per Capita and Total, 2017-19.

Country  |  Articles using data per million persons  |  Total number of articles using data
New Zealand  35.55  133
Norway  22.56  98
Ireland  19.19  68
Denmark  17.89  89
Sweden  17.35  129
Australia  16.55  337
Finland  16.00  68
Singapore  15.55  86
Estonia  14.82  17
Switzerland  11.78  88
Netherlands  11.45  183
Hong Kong SAR, China  10.52  78
Slovenia  10.22  19
Cyprus  9.22  8
Canada  8.83  289
Georgia  7.88  27
Latvia  7.84  12
Croatia  7.63  26
Lithuania  7.52  15
Israel  7.14  52
Malaysia  7.10  200
Greece  7.03  67
Portugal  6.68  60
Qatar  6.65  20
Botswana  6.40  14
Puerto Rico  6.16  19
Austria  5.97  43
Kosovo  5.96  9
North Macedonia  5.62  7
Mongolia  5.16  14
Eswatini  5.13  4
Belgium  5.08  60
Czechia  4.87  43
Oman  4.78  17
Spain  4.69  211
South Africa  4.35  214
Jordan  4.14  49
Italy  4.10  238
Slovak Republic  4.03  10
Bahrain  4.02  4
Germany  3.80  273
Jamaica  3.79  5
West Bank and Gaza  3.77  18
United Kingdom  3.75  219
Lebanon  3.75  21
Iran, Islamic Rep.  3.71  266
Kuwait  3.68  12
Serbia  3.65  18
Ghana  3.60  110
Hungary  3.58  30
Bosnia and Herzegovina  3.57  10
Saudi Arabia  3.56  133
Uruguay  3.50  12
Costa Rica  3.41  16
Korea, Rep.  3.41  169
Chile  3.27  55
Congo, Rep.  3.23  15
Romania  3.15  47
Poland  3.14  134
Bulgaria  3.11  21
Japan  3.10  341
Albania  2.92  6
Nepal  2.91  66
Kenya  2.89  132
Trinidad and Tobago  2.85  2
Panama  2.84  10
France  2.72  165
Timor-Leste  2.60  3
United States  2.58  724
Gambia, The  2.52  4
Sri Lanka  2.42  42
Armenia  2.36  1
Lao PDR  2.22  19
Turkiye  2.18  157
Uganda  2.18  86
Zimbabwe  2.15  29
Mauritius  2.11  3
Rwanda  2.00  28
Thailand  1.96  134
Tunisia  1.94  21
United Arab Emirates  1.85  20
Liberia  1.74  9
Brazil  1.72  323
Zambia  1.72  33
Ecuador  1.63  19
Moldova  1.63  4
Peru  1.53  45
Cambodia  1.48  23
Tanzania  1.45  76
Togo  1.42  9
Ethiopia  1.41  177
Nicaragua  1.35  10
Senegal  1.33  22
Colombia  1.30  70
Malawi  1.27  21
Papua New Guinea  1.26  15
Djibouti  1.24  1
Nigeria  1.24  232
Argentina  1.23  57
Indonesia  1.23  339
Gabon  1.19  4
Eritrea  1.14  5
Cuba  1.12  12
Libya  1.12  2
Benin  1.11  11
Mexico  1.10  124
Equatorial Guinea  1.07  3
Viet Nam  1.07  101
Haiti  1.05  11
Central African Republic  1.02  3
Niger  0.98  26
Bolivia  0.93  12
Guinea  0.93  11
China  0.89  1,191
Sierra Leone  0.87  6
Dominican Republic  0.86  9
Honduras  0.84  10
Ukraine  0.84  33
Azerbaijan  0.83  9
Bangladesh  0.83  116
Iraq  0.83  32
Kazakhstan  0.81  13
Cameroon  0.79  24
Pakistan  0.77  177
Lesotho  0.75  2
El Salvador  0.74  4
Egypt, Arab Rep.  0.72  83
Morocco  0.72  27
Russian Federation  0.71  81
Mauritania  0.68  3
Guatemala  0.66  7
Syrian Arab Republic  0.66  16
Myanmar  0.63  40
Kyrgyz Republic  0.62  4
Sudan  0.61  26
Belarus  0.57  4
Madagascar  0.57  14
Mozambique  0.53  14
Philippines  0.52  52
Afghanistan  0.50  15
Burkina Faso  0.46  8
South Sudan  0.41  3
Yemen, Rep.  0.41  17
Somalia  0.40  6
India  0.36  466
Venezuela, RB  0.35  6
Burundi  0.34  4
Mali  0.34  8
Algeria  0.30  11
Tajikistan  0.25  2
Chad  0.23  3
Congo, Dem. Rep.  0.22  15
Angola  0.19  4
Guinea-Bissau  0.17  0
Turkmenistan  0.16  0
Côte d'Ivoire  0.14  3
Korea, Dem. People's Rep.  0.14  0
Uzbekistan  0.12  1

Note: Papers include all papers using data in the years 2017-2019. Population data come from the World Bank. Only countries with a total population of more than 1 million persons are shown.

Figure A.1. Amazon MTurk Prompt

Figure A.2. Comparison to Number of Academic Articles from Other Papers
a) Comparison with Das et al. (2013). Note: Estimates from Das et al. (2013) are taken from Table A3 of their paper.
b) Comparison with Porteous (2020). Note: Estimates from Porteous (2020) are taken from Table 3 in that paper and include all journal publications from 2000 to 2019 (column 2 of Table 3).
c) Comparison with NSF database of scientific and technical articles (2018). Note: Data from the NSF are papers produced for 2018.

Figure A.3. Comparison of the Number of Qualitative and Quantitative Papers. Note: The figure compares the residuals from regressing the number of qualitative or quantitative papers on GDP and population. The figure uses data from 2017-2019.

Figure A.4. Share of Qualitative Papers vs. SPI Overall Score, 2017-2019.

Figure A.5. Relationship between Data Research and Data Supply
a) Country Scatterplot. Note: Red lines indicate median values.
b) Map. Note: On the y-axis, this figure plots the average number of data papers per capita between 2017 and 2019.
SPI Pillar 4 on the x-axis tracks the availability of recent censuses, surveys, Civil Registration and Vital Statistics, and subnational data (for details see Dang et al., 2023). Red dashed lines represent median values.

Table A.2. Relationship between Data Use in Academia and Data Sources from the SPI.

Each column (1)-(10) is a separate regression of the log number of papers using data on one data-source indicator plus log GDP per capita, log population, and log qualitative articles.

Data-source coefficient (standard error) by column:
(1) Population census in past 10 years: 0.30+ (0.16)
(2) Agriculture census in past 10 years: 0.16 (0.12)
(3) Business/establishment census in past 10 years: 0.12 (0.11)
(4) Household surveys, 2 or more in 10 years: 0.33 (0.21)
(5) Agriculture surveys, 2 or more in 10 years: 0.22* (0.10)
(6) Labor force surveys, 2 or more in 10 years: 0.36* (0.15)
(7) Health surveys, 2 or more in 10 years: 0.07 (0.12)
(8) Business/establishment surveys, 2 or more in 10 years: 0.20* (0.10)
(9) Complete civil registration and vital statistics system: -0.21 (0.18)
(10) Availability of data at 1st admin level (ODIN) score: 1.27** (0.44)

Controls, coefficient (standard error) by column (1)-(10):
Intercept: -4.45*** (1.16); -4.15*** (1.18); -4.14*** (1.21); -4.71*** (1.25); -4.37*** (1.16); -4.04*** (1.16); -4.26*** (1.19); -4.06*** (1.19); -4.73*** (1.26); -4.15*** (1.14)
Log GDP per capita: 0.18** (0.07); 0.18** (0.07); 0.20** (0.06); 0.22*** (0.07); 0.20** (0.06); 0.16* (0.06); 0.21** (0.06); 0.18** (0.07); 0.28** (0.09); 0.19** (0.06)
Log Population: 0.30*** (0.05); 0.30*** (0.05); 0.29*** (0.06); 0.30*** (0.05); 0.31*** (0.05); 0.29*** (0.05); 0.29*** (0.06); 0.30*** (0.05); 0.29*** (0.06); 0.29*** (0.05)
Log Qualitative Articles: 0.68*** (0.08); 0.67*** (0.08); 0.69*** (0.08); 0.65*** (0.08); 0.64*** (0.08); 0.66*** (0.08); 0.69*** (0.08); 0.66*** (0.08); 0.69*** (0.08); 0.64*** (0.08)

Observations: 168 in each column. R2: 0.85 in each column.

Note: GDP and population data are retrieved from the World Bank's World Development Indicators, while the data-source variables are from the SPI. Papers include all papers using data in the years 2017-2019. ***=0.001 level, **=0.01 level, *=0.05 level, +=0.1 level.