Policy Research Working Paper 10822

Estimating Extinction Threats with Species Occurrence Data from the Global Biodiversity Information Facility

Susmita Dasgupta
Brian Blankespoor
David Wheeler

Development Research Group
Development Data Group
&
Environment, Natural Resources and Blue Economy Global Practice
June 2024

Abstract

The world is experiencing a severe loss of biodiversity, highlighting the need for a strong global conservation strategy. Effective conservation depends on accurate information about where endangered species live and the local threats they face. Using data from the Global Biodiversity Information Facility, this paper creates threat and protection indicators for more than 600,000 species, including animals, plants, and fungi. The indicators include habitat size, level of protection, nearby population density, and specific threats like population encroachment for land species and fishing activity for marine species. The paper then uses an ordered logit model to analyze the relationship between these indicators and the extinction risk categories assigned by the International Union for Conservation of Nature. The model is based on 87,731 species in the Global Biodiversity Information Facility database that have been assessed by the International Union for Conservation of Nature. The results are used to predict threat levels for 512,675 species without International Union for Conservation of Nature ratings, revealing many more potentially threatened species and changing the maps of “conservation hotspots.” The paper concludes by noting that its methods can support rapid updates of species maps and threat indicators as the Global Biodiversity Information Facility database continues to grow.

This paper is a product of the Development Research Group, the Development Data Group, Development Economics and the Environment, Natural Resources and Blue Economy Global Practice.
It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at bblankespoor@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

JEL classification: Q57

Keywords: Conservation planning, global biodiversity, species’ occurrence region, species’ extinction threats, Kunming-Montreal Global Biodiversity Framework.

Acknowledgments: This research was funded by a grant from the Global Environment Facility to a World Bank program managed by the authors, with Dr. Nagaraja Rao Harshadeep, Global Lead for Disruptive Technology. We are also thankful to the participants of the webinar on this subject conducted during the Biodiversity Week, including our colleagues from the World Bank's Sustainable Development Global Practice and the Environment, Natural Resources, and Blue Economy Global Practice.

1. Introduction

The world is rapidly losing biodiversity. Pimm et al. (2014) found that the current rate of species extinction is at least 1,000 times the background rate. Supporting evidence from the Living Planet Index (LPI), which tracks population trends for vertebrate species in terrestrial, freshwater, and marine habitats, shows a 69% decline since 1970. The LPI informs the Convention on Biological Diversity (CBD) and its Conference of the Parties (COP). In response to indicators of rapid decline, 188 governments adopted the Kunming-Montreal Global Biodiversity Framework (GBF) at COP 15 in December 2022. Among other measures, the Framework commits participants to protecting 30% of the planet by 2030.

Effective implementation of the Framework requires addressing two key questions: (1) What is the spatial distribution of global biodiversity that should be protected? (2) How can protecting 30% of the planet best conserve this biodiversity?

Our previous paper (Dasgupta et al. 2023) addressed the first question using the Global Biodiversity Information Facility (GBIF), which has expanded to include occurrences for over 2 million species. In the past two years, the GBIF has added about 1.3 million occurrence records daily. Most records include locational coordinates, enabling new estimates of spatial distributions for previously unmapped species and improved estimates for species with existing maps. Using machine-based pattern recognition, we estimated spatial occurrence maps for over 600,000 species. Our algorithm allows rapid updates and new maps as GBIF data increases.

This paper addresses the second question by developing extinction risk indicators for all mapped species in our GBIF database.
The number of mapped species far exceeds the number whose extinction risks have been assessed by the International Union for Conservation of Nature (IUCN) and other organizations. We use econometric analysis to identify significant predictors of threats identified by IUCN. Our database has three components: (1) risk indicators for use by stakeholders; (2) a composite risk index with weights determined by our analysis; (3) alternative indices with different weighting criteria.

The remainder of the paper is organized as follows. We develop the extinction risk indicators in Section 2, while Section 3 explores their combination into composite indicators and the implications. Section 4 summarizes and concludes the paper.

2. Extinction Risk Indicators

Among our extinction risk indicators, some apply to all species while others are confined to terrestrial or marine habitats. This section begins with two indicators for all species: habitat size and degree of formal protection. For each terrestrial species, we add human population density in its habitat, along with a species-specific measure of “human settlement sensitivity” computed from GBIF species occurrence data. For marine species, we add two measures of fishing intensity, a measure of pressure from proximate coastal populations, and the degree of coverage by Exclusive Economic Zones (EEZs). Among the 8 indicators, 6 involve overlays of species occurrence maps on the spatial distributions of risk determinants.

2.1 General Risk Indicators

2.1.1 Species Occurrence Region Size

Jenkins et al. (2015) note that “small species occurrence region size is the best predictor of extinction risk and, thus, the first metric for conservation priority”. This factor has been studied extensively in the empirical literature (Kraus et al. 2023; Veach et al. 2017; Purvis et al. 2000; Jenkins et al. 2015; Manne and Pimm 2001; Manne, Brooks and Pimm 1999).
It has particular significance because it is a widely recognized indicator of extinction risk that can be computed for any mapped species.

2.1.2 Formal Protection

The World Database of Protected Areas (UNEP-WCMC 2019) provides a global shapefile that identifies all areas defined as protected under UNEP-WCMC standards. To compute the formal protection index, we have transformed the shapefile (which includes 283,568 polygons) into a global raster with a resolution of 0.05 decimal degrees (about 5 km). Each raster cell has value 1 if it includes a protected area and 0 otherwise. For each species, we overlay its occurrence map on the protected area raster. The mean value of the within-boundary raster cells is equivalent to percent coverage by formal protection.

2.2 Terrestrial Risk Indicators

2.2.1 Population Density

Other things equal, we would expect species’ extinction risk to increase with population density. We measure density with a spatial raster at 2.5 min resolution (0.042 decimal degrees) from the Gridded Population of the World (GPW), v4 (SEDAC/CIESIN, 2023). We overlay the population raster with species occurrence maps and compute the mean cell populations of occurrence areas for the 230,616 terrestrial species in our database.1

1 Raster cell populations are equivalent to population densities in this case because all cells have the same area.

2.2.2 Human Settlement Sensitivity

Species differ greatly in their sensitivity to human incursion. In North America, for example, bears and mountain lions are generally sighted in wilderness areas while raccoons and squirrels are commonly observed in settled areas. If species occurrence reports provided unbiased information about species populations, a “settlement sensitivity” metric could be derived from statistical analysis of the relationship between the spatial distributions of human and species populations. The metric would capture the degree to which a species’ population density decreases, increases or remains constant as human spatial density increases. Other things equal, a species whose population declines sharply with human population density would be expected to face greater extinction risk if the human population surged in its habitat.

Unfortunately, computing settlement sensitivity is complicated by well-known sampling biases in GBIF occurrence reporting, which is often a voluntary exercise that does not employ scientific sampling. In practice, reporting is more prevalent in accessible areas with higher incomes and larger populations (Garcia-Rosello et al. 2023; Borgelt et al. 2022; Isaac and Pocock 2015; Reddy and Dávalos 2003).

To illustrate the problem, consider a species that is moderately settlement-sensitive, so that unbiased reporting would reveal a modestly sloped negative relationship between its spatial density and human population density. However, population-biased reporting entails a countervailing effect which may produce a statistical relationship that is positively sloped.

While this complication is unavoidable, our methodology can still derive meaningful inferences from the data. Using the previously described population raster, we assign each terrestrial raster cell to one of 15 scaled population groups.2 We overlay the population raster with species occurrence maps and compute total occurrences by population group for all terrestrial species.

Total reported occurrences for species with extreme settlement sensitivity will decline continuously across population groups because the sensitivity effect dominates population-related reporting bias. Conversely, reporting bias will ensure that total occurrences increase continuously across population groups for species with modest or zero settlement sensitivity, as well as species that thrive in human settlements (e.g., brown rats). Intermediate cases are defined by settlement sensitivities that increase with human population density.
Species with increasing sensitivity may exhibit trend reversal across population groups, with initial dominance for reporting bias ultimately superseded by dominance for settlement sensitivity.

For each species, we incorporate these factors by identifying the population group with the most reported occurrences. Relative settlement sensitivity is highest for group 6. For 432,278 terrestrial species, Figure 1 displays the distribution across population groups. The highest-sensitivity group (6) accounts for 123,875 species; counts increase in the first few groups and are relatively lower in population groups 9-15. This distribution illustrates a varied response in which some species tend to avoid areas of human settlement while the occurrence areas of other species overlap areas of dense human settlement (population groups 9-15).

2 We assign cell populations to 15 groups as follows (by interval upper limit): [0, 1, 10, 50, 100, 500, 1000, 5000, 10000, 20000, 40000, 60000, 80000, 100000, 10000000].

Figure 1: Terrestrial species with maximum occurrence reports by population group

2.3 Marine Risk Indicators

2.3.1 EEZ Coverage

Exclusive Economic Zones assign exclusive commercial rights to countries, which have powerful economic incentives to maintain sustainable fishing stocks and prevent unlicensed incursions. While EEZs should have the most direct impact on commercially exploitable fish species, they may also provide secondary protection from uncontrolled bottom trawling and other practices that damage non-commercial species and their habitats. For this exercise, we employ the EEZ boundary shapefile from the Maritime Regions Geodatabase maintained by the Flanders Marine Institute (2019). We intersect this shapefile with occurrence maps for the 145,092 marine species in our database. For each species, we construct an EEZ coverage index as the percent of its total occurrence area that lies within an EEZ.
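The coverage indices described above (formal protection in Section 2.1.2 and EEZ coverage here) share one overlay step: the share of a species' occurrence cells that fall inside a zone. The following is a minimal sketch of that step, not the authors' code. It assumes that both the occurrence map and the zone have already been rasterized to the same grid, whereas the paper intersects the EEZ shapefile directly. The function name `coverage_percent` is hypothetical, and all cells are weighted equally, ignoring the variation of cell area with latitude.

```python
import numpy as np

def coverage_percent(occurrence_mask, zone_raster):
    """Percent of a species' occurrence cells that lie inside a zone.

    occurrence_mask: 2-D boolean array, True inside the occurrence map.
    zone_raster:     2-D array of 0/1 values on the same grid
                     (1 = cell inside an EEZ or a protected area).
    Assumption: every cell carries equal weight.
    """
    occurrence_mask = np.asarray(occurrence_mask, dtype=bool)
    zone_raster = np.asarray(zone_raster, dtype=float)
    if occurrence_mask.shape != zone_raster.shape:
        raise ValueError("mask and raster must share the same grid")
    if not occurrence_mask.any():
        return float("nan")  # species with no mapped cells
    # The mean of a 0/1 raster over the masked cells is the covered share.
    return 100.0 * zone_raster[occurrence_mask].mean()

# Toy example on a 2 x 3 grid: 3 of the 4 occurrence cells are covered.
mask = np.array([[True, True, False], [True, True, False]])
zone = np.array([[1, 0, 1], [1, 1, 0]])
coverage = coverage_percent(mask, zone)
```

With real data, an area-weighted mean would be a natural refinement for species whose occurrence regions span a wide range of latitudes.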
2.3.2 Fishing Intensity

Other things equal, the intensity of fishing activity within a species’ occurrence area may have a significant effect on its survival probability. To assess this factor, we draw on two databases maintained by Global Fishing Watch (GFW) (Kroodsma et al. 2018).3

3 https://globalfishingwatch.org/

AIS-Based Global Fishing Effort

GFW’s most comprehensive activity is near-real-time recording of fishing effort by vessels which broadcast their locations using the global automatic identification system (AIS). GFW stores daily estimates of vessel-specific fishing hours for the period 2012-2020 in georeferenced daily files at a resolution of 0.1 decimal degrees.4 We have downloaded these files in csv format; computed georeferenced total fishing hours for 2012-2020; rasterized the result; and resampled to a raster with 0.25 degree resolution for faster computation. We have constructed a fishing intensity index by overlaying this raster with the occurrence map for each marine species and computing mean fishing hours for all raster cells within the map boundary.

Estimates from Satellite Imagery

AIS-based estimates are incomplete because many fishing vessels do not use AIS consistently. In response, GFW has recently collaborated with several research institutions to derive more accurate estimates from high-resolution satellite imagery (Paolo et al. 2024). The authors note that the exercise covers somewhat more than 15% of the ocean, in which more than 75% of industrial fishing activity is concentrated. We have downloaded the summary raster at 0.10 degree resolution5 and resampled to 0.25 degrees for faster computation. Then we have constructed our second fishing intensity index by overlaying the result with the occurrence map for each marine species and computing the mean value for raster cells within the map boundary.
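The two steps behind the AIS-based index, accumulating georeferenced fishing hours into a grid and averaging over a species' occurrence cells, can be sketched as follows. This is an illustrative simplification rather than the authors' pipeline: it bins point records directly into 0.25 degree cells instead of rasterizing at 0.1 degrees and resampling, and the function names and input arrays are hypothetical.

```python
import numpy as np

def total_fishing_hours_grid(lats, lons, hours, res=0.25):
    """Sum georeferenced fishing-hour records into a global lat/lon grid.

    Row 0 starts at latitude -90 and column 0 at longitude -180.
    Assumption: records are binned straight to the target resolution.
    """
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    hours = np.asarray(hours, dtype=float)
    n_rows = int(round(180 / res))
    n_cols = int(round(360 / res))
    rows = np.clip(np.floor((lats + 90.0) / res).astype(int), 0, n_rows - 1)
    cols = np.clip(np.floor((lons + 180.0) / res).astype(int), 0, n_cols - 1)
    grid = np.zeros((n_rows, n_cols))
    # np.add.at accumulates correctly when several records share a cell.
    np.add.at(grid, (rows, cols), hours)
    return grid

def mean_within(grid, occurrence_mask):
    """Mean grid value over the cells of one species' occurrence map."""
    occurrence_mask = np.asarray(occurrence_mask, dtype=bool)
    if not occurrence_mask.any():
        return float("nan")
    return float(grid[occurrence_mask].mean())

# Toy records: two fall in the same cell, one in a different cell.
grid = total_fishing_hours_grid([0.05, 0.15, 10.0], [0.05, 0.2, 10.0], [1.0, 2.0, 4.0])
mask = np.zeros(grid.shape, dtype=bool)
mask[360, 720] = True
mask[400, 760] = True
intensity = mean_within(grid, mask)
```

The same `mean_within` step would apply to the satellite-based raster once it is on the common grid.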
2.3.3 Coastal Population Influence

Human settlements in coastal areas influence offshore species via recreation, tourism, small-scale fishing, and exploitation of marine resources in other ways. Although quantification of specific effects is not feasible at global scale, the aggregate effect may well be significant. Accordingly, we have constructed a general index of population influence with a spatial interpolation algorithm that replaces offshore missing values in our global terrestrial population raster (Section 2.2.1) with projected values from proximate onshore populations.6

4 These files are available for download at https://globalfishingwatch.org/data-download/datasets/public-fishing-effort. The relevant csv files are in zip format with the prefix mmsi-daily-csvs-10-v2-2012 … 2020.

5 The file is raster_5th_degree.feather, available at https://figshare.com/articles/journal_contribution/Satellite_mapping_reveals_extensive_industrial_activity_at_sea_-_analysis_data/24309475.

6 We use the inverse distance weighting (idw) function in R, with a maximum projection distance of 5 decimal degrees. We spread the effects of highly concentrated urban populations by including 100 neighboring points in the projections and setting the distance power parameter (the exponent of distance in idw) at 0.

3. Composite Risk Indicators

The indicators developed in the previous section can all be associated with extinction risk. However, their relative importance remains a subject for debate because rigorous empirical analysis is difficult. In practice, the conservation community has relied on risk assessments by experts who draw on unpublished species-level information as well as publicly available data. These assessments typically group species into categories based on their estimated likelihood of future extinction, given current conditions.
The resource intensity of species assessments has created several problems for conservation planning, including relatively long intervals between updates; an often-sizable gap between currently available information on species sightings and the habitat maps that are used for assessments; and a growing gap between the number of assessed species and the number of species for which georeferenced occurrence data are sufficient for mapping.

The latter gap has motivated the development of methods that can translate generally observable risk factors into extinction risk estimates for non-assessed species. These methods frequently use machine-learning models fitted to data for species with risk category assignments by IUCN and other organizations. Such models attempt to capture the collective judgment of thousands of independent experts. They are fitted to data on estimated extinction risks, not observed extinctions, and they unavoidably incorporate assessors’ assumptions about risk determinants that may be contestable.

These caveats notwithstanding, we will devote the first part of this section to estimating and applying a predictive model for non-assessed GBIF species that is based on IUCN assessments. In later sections we will also develop and illustrate some alternative approaches to risk index construction.

3.1 Estimating Extinction Risks from IUCN Data

Numerous publications cite IUCN as the authoritative source of species risk assessments, while noting the large gap between species documented by organizations like the GBIF and species that have been assessed using IUCN’s time- and resource-intensive evaluation processes. Researchers have responded to this gap with machine-based learning algorithms that predict extinction risks from readily observable characteristics of species and their environments. These exercises have been undertaken for vertebrates (Strona 2014 [fish], Tagliacollo et al. 2021 [fish], Caetano et al. 2022 [reptiles], Wieringa 2022 [bats], Cazalis et al. 2023 [mammals, reptiles, amphibians, fish], Lucas et al. 2023 [amphibians]); invertebrates (Cazalis et al. 2023 [order Odonata (dragonflies and damselflies)]); plants (Zizka et al. 2020 [orchids], Walker et al. 2023 [genus Myrcia, orchids, legumes], Zizka et al. 2022 [orchids], Silva et al. [trees], Ribeiro et al. 2022 [Brazilian terrestrial plants], Levin et al. 2022 [North American plants]); and species across multiple taxa classed as Data Deficient by IUCN (Borgelt et al. 2022).

Although this work has developed rapidly, several problems persist. First, species coverage is far from comprehensive. Most work has focused on vertebrate animals and vascular plants, with scant attention to other animals (most notably invertebrates) and other major phyla. Second, these pilot modeling exercises are idiosyncratic, with no convergence to a standard approach as yet. Third, machine learning exercises have an unavoidable “black box” character, making it difficult to explain their predictions to non-technical people who are stakeholders in conservation policy.

This paper aims to address all three problems. First, we develop a standard methodology that can produce extinction risk ratings for GBIF species as soon as their georeferenced occurrence data are sufficient for mapping. Second, the methodology is reasonably simple and easy to replicate. Third, it provides a transparent view of the relationship between risk ratings and the predictive indicators that we have developed in Section 2.

3.1.1 Model Estimation

We focus on species in five IUCN Red List categories: Least Concern (LC), Near Threatened (NT), Vulnerable (VU), Endangered (EN) and Critically Endangered (CR) (IUCN 2022). We incorporate the IUCN categories into a model that predicts each species’ extinction threat status from the values of its risk indicators.
We use a vulnerability model that treats a species’ probability of assignment to each of the five IUCN categories as an ordered hierarchy from LC (least vulnerable) to CR (most vulnerable). Ordered probability models can employ probit or logit estimators, depending upon the distribution of model errors (normal for probit; logistic for logit). We chose ordered logit estimation because it provides a somewhat better fit to the data.

Our model estimation dataset comprises the 87,731 mapped GBIF species that have been assigned IUCN Red List categories. We fit the ordered logit model to data for species in three terrestrial categories (vertebrates, plants, other species) and two marine categories (fish, other species). We depart from machine learning convention by using an explicit model, but we allow for highly nonlinear relations among model variables with a translog specification.7 We also allow for differences within species groups by including dummy variables for taxonomic orders that are represented by more than 30 species.

3.1.2 Results

For each species group, standard χ² tests validate the inclusion of both order dummy variables and translog estimation terms (the squared logs and log-interactions of risk indicators). Table 1 presents pairwise correlations for the variables in the five estimating equations. The correlations are almost all very small, indicating that collinearity of model variables is not a problem in this case.8

Our full results are quite lengthy and the signs and magnitudes of individual translog terms are difficult to interpret. Table 2 provides a useful partial view by reporting results for a basic model that is restricted to the logs of the relevant risk indicator variables. In this model, the t-statistic for the estimated coefficient of each variable provides an approximate measure of its relative importance.
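The translog terms referred to above (the logs, squared logs and pairwise log-interactions of the risk indicators) can be generated mechanically from the indicator columns. The sketch below is illustrative only: the function name `translog_terms` and the column labels are hypothetical, the squared terms are not scaled by one half as in some translog conventions, and the requirement of strictly positive inputs is an assumption, since the paper does not state how zero-valued indicators are handled before taking logs.

```python
import numpy as np
from itertools import combinations

def translog_terms(X, names):
    """Build translog regressors from strictly positive indicator columns.

    Returns (matrix, labels) with the logs, the squared logs and all
    pairwise log-interactions, in that order.
    """
    X = np.asarray(X, dtype=float)
    if np.any(X <= 0):
        raise ValueError("indicators must be strictly positive before taking logs")
    logs = np.log(X)
    cols = [logs[:, j] for j in range(logs.shape[1])]
    labels = [f"log_{n}" for n in names]
    for j, n in enumerate(names):
        cols.append(logs[:, j] ** 2)
        labels.append(f"log_{n}^2")
    for i, j in combinations(range(len(names)), 2):
        cols.append(logs[:, i] * logs[:, j])
        labels.append(f"log_{names[i]}*log_{names[j]}")
    return np.column_stack(cols), labels

# Two toy indicators for two species.
M, labels = translog_terms([[1.0, np.e], [np.e, np.e]], ["a", "b"])
```

With four indicators per species group, this yields 4 logs, 4 squared logs and 6 interactions, or 14 regressors before the taxonomic order dummies are added.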
Table 2 shows that species occurrence region size is by far the most important variable across the five estimates, with the appropriate sign and very high significance in all cases. Formal protection has the appropriate sign and consistently high significance in all three terrestrial cases. The marine cases offer a different view of protection, with weak results for formal protection but very strong results for species’ habitats that lie within Exclusive Economic Zones. Population pressure has the appropriate sign and high significance for terrestrial vertebrates, terrestrial plants and other marine species, although the results for other terrestrial species are anomalous. The effect of commercial pressure is evident for marine fish, with positive and highly significant results for one of our two fishing intensity measures. The other measure has a perverse sign but is insignificant.

For terrestrial species, the settlement sensitivity measure yields mixed results. Settlement sensitivity is measured from 1 to 15, indicating the population group that includes the greatest number of species observations. A negative result indicates that, other things equal, the risk is greatest for species with the most observations in the smallest population groups and least for species with the most observations in the largest population groups. However, these basic model results suggest that a different relationship holds for plants.

7 The translog specification incorporates the logs and squared logs of all model variables, as well as all pairwise interactions among them. It is a second-order approximation to an unknown set of nonlinear relationships.

8 Independent effects can be hard to determine in regression models with collinear (highly correlated) explanatory variables.

3.1.3 Implications for Prediction

Our database includes risk indicator values for 512,675 species that lack IUCN ratings, and we can use our model estimation results for the 87,731 Red List species to predict extinction threats for the others.
We test our model by developing a composite index that measures the likelihood that a species would be assigned some threat status above LC in the IUCN assessment process. For each species, we construct the index from the ordered logit model’s predicted assignment probabilities for each IUCN category. We form composite probabilities by adding the assignment probabilities for NT, VU, EN and CR. We express the probabilities as percents, group them into 10-point intervals and, for each interval, compute the share of species that IUCN assigns above-LC status.

Table 3 presents the results for each of our five species groups, which look very similar. At this level of aggregation, model predictions track IUCN assignments very closely. In the case of land vertebrates, for example, the first table row presents results for species with model-predicted (percent) probabilities between 0 and 10% of assignment to above-LC status. For this group, IUCN actually assigns above-LC status to 5.22% of species. The assignment percentage increases steadily down the rows; in the final row (model-predicted probabilities between 90% and 100%), IUCN assigns above-LC status to 95.1% of species. The same pattern is evident in the other four cases, with last-row IUCN above-LC assignment probabilities of 92.7% (land plants), 90.3% (other land species), 100% (marine fish) and 91.8% (other marine species).
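The index construction and the comparison with IUCN assignments can be sketched as follows. This is a minimal illustration under stated assumptions, not the estimated model: the linear index `eta` and the four cutpoints are hypothetical inputs standing in for fitted values, the probabilities use the standard ordered logit form P(category <= j) = logistic(cutpoint_j - eta), and the rule of assigning each species to the 10-point band containing its predicted percent is inferred from the row labels of Table 3.

```python
import numpy as np

CATEGORIES = ["LC", "NT", "VU", "EN", "CR"]

def ordered_logit_probs(eta, cutpoints):
    """Category probabilities for an ordered logit, columns ordered LC..CR.

    eta:       (n,) linear index for each species (hypothetical input).
    cutpoints: (4,) increasing thresholds between adjacent categories.
    """
    eta = np.asarray(eta, dtype=float)[:, None]
    cut = np.asarray(cutpoints, dtype=float)[None, :]
    cdf = 1.0 / (1.0 + np.exp(-(cut - eta)))  # P(category <= j)
    cdf = np.hstack([np.zeros_like(eta), cdf, np.ones_like(eta)])
    return np.diff(cdf, axis=1)

def above_lc_percent(probs):
    """Composite index: P(NT) + P(VU) + P(EN) + P(CR), in percent."""
    return 100.0 * probs[:, 1:].sum(axis=1)

def calibration_table(pred_percent, iucn_above_lc):
    """Share (%) of species rated above LC by IUCN in each 10-point band."""
    pred_percent = np.asarray(pred_percent, dtype=float)
    actual = np.asarray(iucn_above_lc, dtype=bool)
    # Band 0 covers [0, 10), ..., band 90 covers [90, 100].
    band = np.minimum((pred_percent // 10).astype(int), 9) * 10
    table = {}
    for b in range(0, 100, 10):
        sel = band == b
        if sel.any():
            table[b] = 100.0 * actual[sel].mean()
    return table

probs = ordered_logit_probs([0.0, 2.0], [0.0, 1.0, 2.0, 3.0])
index = above_lc_percent(probs)
table = calibration_table([5, 7, 95, 100], [False, True, True, True])
```

A well-calibrated model produces band shares that rise roughly in step with the band labels, which is the pattern reported in Table 3.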
Table 1: Correlations between predictive model variables

Location  Species        Factor   Definition                      Factor1   Factor2   Factor3
Land      Vertebrates    Factor1  Species Occurrence Region Size
                         Factor2  Settlement Sensitivity           0.038
                         Factor3  Population Density              -0.068    -0.69
                         Factor4  Formal Protection                0.099    -0.20      0.25
          Plants         Factor1  Species Occurrence Region Size
                         Factor2  Settlement Sensitivity           0.065
                         Factor3  Population Density              -0.10     -0.67
                         Factor4  Formal Protection               -0.031    -0.15      0.17
          Other          Factor1  Species Occurrence Region Size
                         Factor2  Settlement Sensitivity           0.13
                         Factor3  Population Density              -0.037    -0.59
                         Factor4  Formal Protection               -0.062    -0.048    -0.016
Ocean     Fish           Factor1  Species Occurrence Region Size
                         Factor2  EEZ Coverage                     0.0082
                         Factor3  Fishing Intensity (1)           -0.30      0.0018
                         Factor4  Fishing Intensity (2)           -0.021     0.14      0.47
          Other species  Factor1  Species Occurrence Region Size
                         Factor2  EEZ Coverage                     0.38
                         Factor3  Coastal Population              -0.019     0.10
                         Factor4  Formal Protection               -0.059    -0.16      0.11

Table 2: Ordered logit results for IUCN Red List categories

(1a: Terrestrial)
                                 Vertebrates                Plants                     Other
Log Variable                     Coefficient  t-Statistic   Coefficient  t-Statistic   Coefficient  t-Statistic
Species Occurrence Region Size   -0.183       55.25**       -0.319       93.16**       -0.138       23.17**
Settlement Sensitivity           -0.050        2.77**        0.079        4.24**       -0.126        4.06**
Population Density                0.054       11.02**        0.022        4.50**       -0.031        3.43**
Formal Protection                -0.038        5.93**       -0.035        5.53**       -0.076        6.84**
N                                24,943                     32,667                     6,681
R2                               0.12                       0.22                       0.11
R2 (Full)a                       0.22                       0.29                       0.26

(1b: Marine)
                                 Fish                       Other Species
Log Variable                     Coefficient  t-Statistic   Coefficient  t-Statistic
Species Occurrence Region Size   -0.034        7.80**       -0.227       34.79**
Exclusive Economic Zone          -0.256       24.22**       -0.075       10.34**
Fishing Intensity (1)            -0.01         1.75
Fishing Intensity (2)             0.047        6.11**
Coastal Population Influence                                 0.034        5.67**
Formal Protection                                           -0.013        1.69
N                                8,530                      14,897
R2                               0.09                       0.14
R2 (Full)a                       0.27                       0.22

Note: Levels of significance are denoted as follows: * significant at 5%; ** significant at 1%
a Including translog terms.
Table 3: Model-predicted risk assignments vs. IUCN assignments

(1a: Terrestrial)
Model Predictions:       Vertebrates          Plants               Other Species
Probability Group (%)    LC       Above LC    LC       Above LC    LC       Above LC
0                        94.8      5.22       94.8      5.25       94.9      5.09
10                       86.5     13.5        85.9     14.1        85.5     14.5
20                       74.6     25.4        76.7     23.3        75.7     24.3
30                       65.7     34.3        65.9     34.1        65.1     34.9
40                       53.5     46.5        56.6     43.4        54.7     45.3
50                       45.9     54.1        45.9     54.1        42.0     58.0
60                       35.7     64.3        30.4     69.6        33.5     66.5
70                       23.6     76.4        21.7     78.3        21.1     78.9
80                       13.2     86.8        14.5     85.5        17.7     82.3
90                        4.9     95.1         7.3     92.7         9.68    90.3

(1b: Marine)
Model Predictions:       Fish                 Other Species
Probability Group (%)    LC       Above LC    LC       Above LC
0                        95.9      4.05       91.4      8.59
10                       87.5     12.5        80.4     19.6
20                       73.7     26.3        74.5     25.5
30                       60.7     39.3        67.8     32.2
40                       50.4     49.6        58.1     41.9
50                       48.6     51.4        46.3     53.7
60                       32.6     67.4        32.9     67.1
70                       23.4     76.6        23.4     76.6
80                       22.6     77.4        14.2     85.8
90                       33.3     66.7        10.8     89.2

3.1.4 Spatial Implications

In this subsection, we assess the model by comparing global maps for its predictions and IUCN threat indicators. We confine the discussion to vertebrates and plants, since the IUCN data are too sparse to support robust comparisons for other major species groups (e.g., arthropods). We overlay species maps with a 0.25 degree global grid, compute each cell total by summing across all species whose occurrence regions overlap it, and normalize results to the range [0-100] to facilitate comparison.

Figure 2 displays the global distribution of threat status counts for over 77,000 GBIF plants and vertebrates (nearly 40,000 plants; 37,000 vertebrates) that have IUCN risk category assignments. Each cell contains the number of overlaid species (normalized to [0-100]) that are rated NT, VU, EN or CR by IUCN. This approach assigns a weight of 1 to any species in the four categories and 0 to species rated LC. In contrast, our ordered logit model assigns every species a probability for each of the four threat categories.
For comparison with Figure 2, each cell value in Figure 3 is the sum of total predicted probabilities for all overlaid species (normalized to [0-100]). Spatial patterns in the two figures are strikingly similar, with significant threat clusters in the Eastern United States, Mexico, the northern part of South America, Central Europe, Eastern Africa, coastal Western Africa, Southeast Asia and Eastern Australia.

Figure 3 displays total model-predicted cell probabilities for GBIF plants and vertebrates that are rated in the IUCN database. We extend the exercise by using the estimated ordered logit model and threat indicators to predict results for the 203,000 GBIF species (nearly 190,000 plants, 13,000 vertebrates) that are not rated by IUCN. Addition of these species more than triples the size of the application database (from 77,000 to 281,000). Figure 4 replicates the methodology of Figure 3, but for all plants and vertebrates in the GBIF database. Figure 5 summarizes the results by displaying the cellwise changes from Figure 3 to Figure 4 (in normalized values). Incorporating many more GBIF species produces heightened threat status in coastal North America, the northern part of South America and Atlantic Brazil, coastal West Africa, South Africa, Madagascar, Western Europe, Southeast Asia, Japan and Australia.

We should emphasize that while normalization to the range [0-100] is useful for comparisons, it should not mask the fact that tripling the species population has the effect of raising absolute values across the board. A balanced interpretation of Figures 2-5 should therefore note that many more potentially threatened species have been identified across the globe by this exercise, while the spatial distribution of threatened species has also been altered.
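The cell totals behind Figures 2 to 4 can be sketched as a weighted sum of species masks on a common grid, followed by rescaling. The sketch below is an assumption-laden illustration: `threat_surface` is a hypothetical name, each species is represented by a 0/1 mask already aligned to the 0.25 degree grid, and normalization is done by dividing by the maximum cell total, which is one plausible reading of “normalized to [0-100]” but is not stated in the text.

```python
import numpy as np

def threat_surface(species_masks, weights):
    """Weighted sum of species masks per grid cell, rescaled to [0, 100].

    species_masks: non-empty sequence of 2-D 0/1 arrays on one grid.
    weights:       one weight per species, e.g. 1 or 0 for IUCN status
                   (as in Figure 2) or a predicted above-LC probability
                   (as in Figures 3 and 4).
    Assumption: rescaling divides by the maximum cell total.
    """
    total = np.zeros(np.asarray(species_masks[0]).shape, dtype=float)
    for mask, weight in zip(species_masks, weights):
        total += weight * np.asarray(mask, dtype=float)
    peak = total.max()
    return 100.0 * total / peak if peak > 0 else total

# Two toy species on a 2 x 2 grid with predicted probabilities 1.0 and 0.5.
species_a = np.array([[1, 1], [0, 0]])
species_b = np.array([[0, 1], [0, 1]])
surface = threat_surface([species_a, species_b], [1.0, 0.5])
```

Because each surface is rescaled by its own maximum, a difference map such as Figure 5 compares relative patterns; absolute totals rise when more species are added, as the text emphasizes.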
Figure 2: Plants and vertebrates: global cell counts for species in IUCN categories NT, VU, EN, CR

Figure 3: Plants and vertebrates: global cell totals for model-predicted threat probabilities, IUCN species match

Figure 4: Plants and vertebrates: global cell totals for model-predicted threat probabilities, all GBIF species

Figure 5: Plants and vertebrates: difference in model-predicted threat probabilities, IUCN match vs all GBIF species

3.2 Alternative Composite Indicators

3.2.1 Small Occurrence Region Species Priority

While species occurrence region size is commonly cited as the best predictor of extinction risk (e.g., Jenkins et al. 2015), this is not always reflected in risk assessments. To illustrate, our comparative exercise includes 87,718 IUCN-assessed terrestrial and marine species that we have mapped with GBIF data. Among these, many species with tiny occurrence regions (grid scales of 5 km or less)9 are rated Least Concern (LC) by IUCN. However, it would be reasonable to assert that species with such tiny occurrence regions are in constant jeopardy, because spatial economics evolve unpredictably and modern systems can clear large areas in time periods far shorter than typical update intervals for IUCN assessments. At the same time, the econometric results in Table 2 show that habitat size is the most significant determining factor for IUCN threat ratings in all five estimation groups.

Figure 6 reveals the source of this apparent contradiction. To create the figure, we have computed the percent of LC species in 5-km species occurrence region scale groups up to 200 km. The scatter plot shows the strong overall effect of scale on LC assignment, which reaches 83.5% at 200 km scale. However, more than 27.5% of species are rated LC, even when their habitat scales are 5 km or less.
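The tabulation behind Figure 6 can be sketched as follows, taking grid scale to be the side length of a square with the same area as the occurrence region. The function name `lc_share_by_scale` and its inputs are hypothetical, and the choice to close each 5-km band at its upper limit (so that the first band holds scales of 5 km or less) is an assumption.

```python
import numpy as np

def lc_share_by_scale(area_km2, is_lc, width_km=5.0, max_km=200.0):
    """Percent of species rated LC in each grid-scale band.

    area_km2: occurrence region areas in square km.
    is_lc:    True where IUCN rates the species Least Concern.
    Returns {band upper limit in km: percent LC}; bands are closed at
    their upper limit, and species above max_km are left out.
    """
    scale = np.sqrt(np.asarray(area_km2, dtype=float))  # grid scale in km
    is_lc = np.asarray(is_lc, dtype=bool)
    band = np.maximum(np.ceil(scale / width_km), 1).astype(int)
    keep = scale <= max_km
    shares = {}
    for b in np.unique(band[keep]):
        sel = keep & (band == b)
        shares[float(b * width_km)] = 100.0 * is_lc[sel].mean()
    return shares

# Toy data: grid scales of 4, 5, 6, 200 and 1000 km.
shares = lc_share_by_scale([16, 25, 36, 40000, 1e6],
                           [True, False, True, True, True])
```

Plotting the band shares against the band limits would give a scatter of the kind the text describes for Figure 6.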
Figure 6: LC assignment vs species occurrence region scale

In this subsection, we explore the implications of assigning greater priority to species with small occurrence regions. We divide our more than 600,000 GBIF species into five groups by grid scale (species occurrence region size) in km: 1 [0-20]; 2 [21-50]; 3 [51-100]; 4 [101-200]; 5 [201+]. For example, the database includes 77,290 species in scale group 1. We sort each scale group by predicted threat probability from the ordered logit model and stack the groups successively, with group 1 in the top 77,290 positions. In the model-predicted threat probability index, the scale composition of the top 77,290 positions is quite different. Among these, for example, only 41.26% are in group 1 (1-20 km). Once we stack the data by habitat scale group, the 21.05% of these species that belong to scale group 2 are displaced to the group 2 segment of the stack, the 16.27% in group 3 to the group 3 segment, etc. (Table 4).

Footnote 9: The term “grid scale” refers to the side length of a square grid cell with equivalent area.

Table 4: Species occurrence region sizes of top-ranking species in the model-based composite indicator

Species occurrence region area (min. km)    Count       %
1 x 1                                       31,893    41.26
21 x 21                                     16,267    21.05
51 x 51                                     12,578    16.27
101 x 101                                    8,514    11.02
201 x 201                                    8,038    10.40
Total                                       77,290   100.00

3.2.2 Spatial Implications

To compare spatial distributions before and after the change, we divide the original (Figure 4) species ordering and the stacked ordering into 20 equal-sized groups. For the cases before and after stacking, we construct variables in which species are assigned values equal to their maximum group percentiles [5, 10, …, 95, 100]. Then we overlay all GBIF species maps for vertebrates and plants on our 0.25 degree grid, compute grid cell sums for the before and after variables, and normalize to the range [0-100]. Figures 7 and 8 display the results, which seem quite similar at global scale. For a regional comparison, Figure 9 focuses on northwest South America.
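The stacking and percentile-group steps of Sections 3.2.1 and 3.2.2 can be illustrated with a small sketch. The dictionary layout, function names and toy species below are hypothetical. The sketch also assumes that the highest-ranked group receives the value 100 and the lowest receives 5, which the text implies but does not state outright.

```python
SCALE_GROUP_UPPER_KM = [20, 50, 100, 200]  # upper bounds of groups 1-4; larger is group 5


def scale_group(grid_scale_km):
    """Map a grid scale in km to the paper's scale groups 1-5."""
    for group, upper in enumerate(SCALE_GROUP_UPPER_KM, start=1):
        if grid_scale_km <= upper:
            return group
    return 5


def model_ordering(species):
    """Rank by predicted threat probability alone, highest first."""
    return sorted(species, key=lambda s: -s["prob"])


def stacked_ordering(species):
    """Rank by scale group first (smallest regions on top), then by probability."""
    return sorted(species, key=lambda s: (scale_group(s["scale_km"]), -s["prob"]))


def percentile_values(ordering, n_groups=20):
    """Split a ranking into equal-sized groups and give each species its group's
    maximum percentile (100 for the top group; an assumed direction)."""
    n = len(ordering)
    step = 100 // n_groups
    values = {}
    for rank, s in enumerate(ordering):
        group_from_top = rank * n_groups // n  # 0 is the top group
        values[s["name"]] = 100 - step * group_from_top
    return values


# Hypothetical toy species: name, grid scale (km), model-predicted probability.
species = [
    {"name": "a", "scale_km": 150, "prob": 0.9},
    {"name": "b", "scale_km": 10, "prob": 0.3},
    {"name": "c", "scale_km": 30, "prob": 0.8},
    {"name": "d", "scale_km": 5, "prob": 0.6},
]
before = percentile_values(model_ordering(species), n_groups=4)
after = percentile_values(stacked_ordering(species), n_groups=4)
```

Here species "a" has the highest predicted probability but a large occurrence region, so stacking moves it from the top group to the bottom one, while the small-region species "d" and "b" rise to the top.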
At this scale, small-occurrence-region prioritization via stacking clearly makes a difference; areas with relatively high threat ratings expand eastward and grow markedly in size. By implication, species with very small habitats are more widely distributed in this area than other threatened species.

Figure 10 deepens the global analysis by displaying cellwise changes. Madagascar, the Western Cape in Africa and marine areas in the southern hemisphere show decreases in threat status. In contrast, modest relative increases are observable in many of the areas with highest threat status in Figure 7 (Mexico, the western Andean region, Atlantic Brazil, coastal West and Central Africa, South Africa, Madagascar, Southeast China, and coastal southeast Australia). These global patterns suggest that species with very small habitats are more widely dispersed than threatened species as a group, so that assigning higher priority to small-habitat species tends to broaden the areas with high threat status.

Figure 7: Plants and vertebrates: global cell totals for model predictions by threat group

Figure 8: Plants and vertebrates: global cell totals for model predictions after priority reassignment for species with small occurrence regions

Figure 9: Plants and vertebrates: Western South America: effect of priority reassignment for species with small occurrence regions

Figure 10: Plants and vertebrates: differences in global cell totals for model predictions after priority reassignment for species with small occurrence regions

3.2.3 Other Summary Indicators

In our econometric analysis, estimation of the ordered logit model for each of five groups (see footnote 10) has identified four risk indicators that play significant roles in determining extinction threat probabilities. Our database identifies these as Factors 1 to 4, with accompanying definitions that are included in Table 1.
We have devoted considerable attention to predicting species-level threat probabilities from econometric estimation of the relationships between these factors and IUCN threat indicators. However, it is also possible to construct aggregate threat indicators from the factors themselves. In this section, we develop and explore the implications of three such indicators: the maximum, median and mean values of the four factors. The maximum value can be interpreted as the most conservative available indicator, because each species’ threat measure is its largest risk factor. The median value is a robust measure of central tendency that is not dominated by one outlier among the four factors. The standard alternative is the mean value, which incorporates all available information at the cost of some outlier risk.

An important task for this exercise is to determine whether the spatial distributions produced by these alternative indicators are similar or quite different. Our core comparison includes four indicators: the values predicted by our ordered logit model and the maximum, median and mean values of our four factor variables. In addition, we acknowledge the econometric dominance of species occurrence region size (Factor 1 in all cases) by including this factor as a stand-alone indicator. For each of the five indicators, we assign its species values to species maps for plants and vertebrates; overlay those maps on the global 0.25 degree grid; compute cell totals across overlaid species; and normalize the results to the range [0-100].

Figure 11 displays all five indicators with identical color scales for ease of comparison. Visual inspection indicates that the Max, Median and Mean indexes have only second-order differences. Some differences are apparent on close scrutiny, but the overall global pattern is basically the same.
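As a small worked illustration of the three factor-based indicators, the sketch below computes the maximum, median and mean for one species. The factor values are invented, the function name is hypothetical, and the sketch assumes that the four factors are already expressed on a common scale; the factor definitions themselves are in Table 1.

```python
from statistics import mean, median


def composite_indicators(factors):
    """Return the max, median and mean of a species' four risk factors."""
    if len(factors) != 4:
        raise ValueError("expected Factors 1 to 4")
    return {
        "max": max(factors),        # most conservative: the largest risk factor
        "median": median(factors),  # robust to a single outlying factor
        "mean": mean(factors),      # uses all information, but outlier-sensitive
    }


# Hypothetical species with one outlying factor (0.9) among four.
indicators = composite_indicators([0.9, 0.2, 0.3, 0.4])
```

With these invented values the maximum is 0.9, the median is 0.35 and the mean is 0.45. The single high factor sets the maximum, pulls the mean upward and leaves the median unchanged.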
When Factor 1 (Species Occurrence Region Scale) alone is employed, the identified threat regions remain broadly similar to the previous three maps, but there are smaller clusters of highest-threat (red) cells in southern Africa, Europe, northern coastal Australia and some areas in East Asia. The Model-Predicted indicator resembles the Species Occurrence Region Scale indicator, but with smaller red cluster areas in all regions.

Footnote 10: Terrestrial [vertebrates, plants, other]; marine [fish, other].

Figure 11: Plants and vertebrates: five threat indicators

4. Summary and Conclusions

In a previous paper (Dasgupta et al. 2023), we developed occurrence maps for more than 600,000 species from georeferenced data provided by the Global Biodiversity Information Facility (GBIF). In this paper, we have combined those maps with extinction threat factors to produce composite global indicators of extinction threat that provide new insights into the spatial domain for conservation planning. The paper develops threat factors for terrestrial species that include their species occurrence region sizes, their sensitivity to human encroachment, the density of neighboring populations, and the degree of formal protection. We also develop threat factors for marine species that include species occurrence region size, degree of formal protection, degree of coverage by EEZs (Exclusive Economic Zones), the impact of coastal population, and two measures of commercial fishing activity.

The paper explores three approaches to composite threat indicator construction. The first comes from econometric estimates of the relationship between our threat indicators and expert-based Red List ratings from IUCN. Our approach extends numerous pilot exercises in the published literature to an econometric model that can be applied to all mapped species in the IUCN database. The model uses our threat indicators to estimate the probability that a species is assigned a threatened status (NT, VU, EN, CR) by IUCN.
We find a close statistical association between the threat probabilities predicted by our model and IUCN threat status assignments. We also find a close spatial association between the two variables. We use our econometric results to predict threat probabilities for mapped GBIF species that have not been rated by IUCN. We add these to the comparative mapping exercise and find that radical expansion of the species database produces a geographic broadening of high-risk areas.

We construct a second composite indicator that assigns higher conservation priority to species whose small habitats may put them in permanent jeopardy. This indicator employs our model-based threat indicator, but modifies it to accommodate grouping by habitat scale. A spatial comparison with our model-based indicator at regional scales reveals that small-habitat species are more widely distributed than threatened species generally. By implication, assignment of higher priority to small-habitat species entails expansion of areas whose threat status may warrant formal protection.

Finally, we construct a set of indicators that combine our risk factors directly rather than relying on an econometric model. We develop composite indicators based on the maximum values, medians and means of the four factors that have significant effects on IUCN threat ratings for different groups of species. We map the results and find that the three indicators exhibit only second-order differences at global scale. We compare their spatial pattern with the pattern yielded by species occurrence region size, which the econometric analysis shows to be the most significant determinant of IUCN threat status for all species. We find more pronounced concentration of high-threat areas in the latter case, although the overall spatial pattern of threatened areas retains a close resemblance to the patterns for maximum, median and mean indicators.
This trend continues for the model-based indicator, which shows even more concentrated high-threat areas within a global threat pattern that resembles the patterns for the other four indicators.

To summarize, the results in this paper complement our previous mapping exercise (Dasgupta et al. 2023) by showing that the GBIF’s radical expansion of georeferenced species information has significant implications for global conservation planning. From a static perspective, spatial analysis suggests that broadening species representation also broadens the critical domain for biodiversity conservation. The concurrent availability of new risk indicators also broadens the scope for stakeholder participation because, as our results show, identification of high-priority areas can vary with differences in the assignment of weights to risk indicators.

We should close by adding a dynamic perspective as well. Continued rapid growth of the GBIF database will both expand the number of mapped species and alter the estimated boundaries of existing maps. As this process continues, the global stakeholder community will be best served by a georeferenced database that supports continually updated estimation of species occurrence maps and the associated extinction threat indicators.

References

Borgelt, J., J. Sicacha-Parada, O. Skarpaas et al. 2022. Native range estimates for red-listed vascular plants. Nature Scientific Data, 9:117.

Borgelt, J., M. Dorber, M. Høiberg and F. Verones. 2022. More than half of data deficient species predicted to be threatened by extinction. Communications Biology, 5:679.

Caetano, G., D. Chapple, R. Grenyer, T. Raz, J. Rosenblatt, R. Tingley et al. 2022. Automated assessment reveals that the extinction risk of reptiles is widely underestimated across space and phylogeny. PLoS Biol 20(5).

Cazalis, V., L. Santini, P. Lucas et al. 2023. Prioritizing the reassessment of data-deficient species on the IUCN Red List. Conservation Biology, 2023;e14139.

Dasgupta, S., B. Blankespoor, and D. Wheeler. 2023. Revisiting Global Biodiversity: A Spatial Analysis of Species Occurrence Data from the Global Biodiversity Information Facility. Policy Research Working Paper, World Bank, June 2024.

Flanders Marine Institute. 2019. Maritime Boundaries Geodatabase: Maritime Boundaries and Exclusive Economic Zones (200NM), version 11. Available online at https://www.marineregions.org/. https://doi.org/10.14284/386

Garcia-Rosello, E., J. Gonzalez-Dacosta, C. Guisande and J. Lobo. 2023. GBIF falls short of providing a representative picture of the global distribution of insects. Systematic Entomology, 48(4): 489-497.

Isaac, N. and M. Pocock. 2015. Bias and information in biological records. Biological Journal of the Linnean Society, 115: 522–531.

IUCN. 2022. The IUCN Red List of Threatened Species. Version 2022-2. https://www.iucnredlist.org. Accessed on [9 May 2023] at https://doi.org/10.15468/0qnb58

Jenkins, C., K. Van Houtan, S. Pimm and J. Sexton. 2015. US protected lands mismatch biodiversity priorities. PNAS, 112(16): 5081-5086.

Kraus, D., A. Enns, A. Hebb et al. 2023. Prioritizing nationally endemic species for conservation. Conservation Science and Practice, 5(1).

Kroodsma, D. A., Mayorga, J., Hochberg, T., Miller, N. A., Boerder, K., Ferretti, F., ... & Worm, B. 2018. Tracking the global footprint of fisheries. Science, 359(6378), 904-908.

Levin, M., J. Meek, B. Boom, S. Kross and E. Eskew. 2022. Using publicly available data to conduct rapid assessments of extinction risk. Conservation Science and Practice, 2022;e12628.

Lucas, P., M. Di Marco, V. Cazalis et al. 2023. Testing the predictive performance of comparative extinction risk models to support the global amphibian assessment. bioRχiv, The Preprint Server for Biology.

Manne, L. and S. Pimm. 2001. Beyond eight forms of rarity: Which species are threatened and which will be next. Animal Conservation, 4: 221–229.

Manne, L., T. Brooks and S. Pimm. 1999. Relative risk of extinction of passerine birds on continents and islands. Nature, 399: 258–261.

Paolo, F., D. Kroodsma, J. Raynor et al. 2024. Satellite mapping reveals extensive industrial activity at sea. Nature, 625: 85–91.

Pimm, S. L., Jenkins, C. N., Abell, R., Brooks, T. M., Gittleman, J. L., Joppa, L. N., ... & Sexton, J. O. 2014. The biodiversity of species and their rates of extinction, distribution, and protection. Science, 344(6187), 1246752.

Purvis, A., J. Gittleman, G. Cowlishaw and G. Mace. 2000. Predicting extinction risk in declining species. Proceedings of the Royal Society, Biological Sciences, 267: 1947–1952.

Reddy, S. and L. Dávalos. 2003. Geographical sampling bias and its implications for conservation priorities in Africa. Journal of Biogeography, 30: 1719–1727.

Ribeiro, B., K. Guidoni-Martins, G. Tessarolo et al. 2022. Issues with species occurrence data and their impact on extinction risk assessments. Biological Conservation, 273: 109674.

SEDAC/CIESIN (NASA Socioeconomic Data and Applications Center/Center for International Earth Science Information Network, Columbia University). 2023. gpw_v4_population_count_adjusted_to_2015_unwpp_country_totals_rev11_2020_2pt5_min.tif

Silva, S., T. Andermann, A. Zizka, G. Kozlowski and D. Silvestro. 2022. Global Estimation and Mapping of the Conservation Status of Tree Species Using Artificial Intelligence. Front. Plant Sci., 13:839792.

Strona, G. 2014. Assessing fish vulnerability: IUCN vs FishBase. Aquatic Conservation Marine and Freshwater Ecosystems, 10.1002/aqc.2439.

Tagliacollo, V., F. Dagosta, M. de Pinna, R. Reis and J. Albert. 2021. Assessing extinction risk from geographic distribution data in Neotropical freshwater fishes. Neotrop Ichthyol., 19(3).

UNEP-WCMC. 2019. User Manual for the World Database on Protected Areas and world database on other effective area-based conservation measures: 1.6. UNEP-WCMC: Cambridge, UK. Available at: http://wcmc.io/WDPA_Manual

Veach, V., E. Di Minin, F. Pouzols and A. Moilanen. 2017. Species richness as criterion for global conservation area placement leads to large losses in coverage of biodiversity. Diversity and Distributions, 23: 715–726.

Walker, B., T. Leão, S. Bachman, E. Lucas and E. Lughadha. 2023. Evidence-based guidelines for automated conservation assessments of plant species. Conservation Biology, 2023;37: e139.

Wieringa, J. 2022. Comparing predictions of IUCN Red List categories from machine learning and other methods for bats. Journal of Mammalogy, 103(3): 528–539.

Zizka, A., D. Silvestro, P. Vitt and T. Knight. 2020. Automated conservation assessment of the orchid family with deep learning. Conservation Biology, 35(3): 897–908.

Zizka, A., T. Andermann and D. Silvestro. 2022. IUCNN – Deep learning approaches to approximate species' extinction risk. Diversity and Distributions, 28: 227–241.