Policy Research Working Paper 11115

Building and Managing Local Databases from Google Earth Engine with the geeLite R Package

Marcell T. Kurbucz
Bo Pieter Johannes Andrée

Development Economics
Development Research Group
May 2025

Abstract

Google Earth Engine has transformed geospatial analysis by providing access to petabytes of satellite imagery and geospatial data, coupled with the substantial computational power required for in-depth analysis. This accessibility empowers scientists, researchers, and non-experts alike to address critical global challenges on an unprecedented scale. In recent years, numerous R packages have emerged to leverage Google Earth Engine's functionalities. However, constructing and managing complex spatio-temporal databases for monitoring changes in remotely sensed data remains a challenging task that often necessitates advanced coding skills. To bridge this gap, geeLite, a novel R package, is introduced to facilitate the construction, management, and updating of local databases for Google Earth Engine-computed geospatial features, which enables users to monitor their evolution over time. By storing geospatial features in SQLite format—a serverless and self-contained database solution requiring no additional setup or administration—geeLite simplifies the data collection process. Furthermore, it streamlines the conversion of stored data into native R formats and provides functions for aggregating and processing created databases to meet specific user needs.

This paper is a product of the Development Research Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at bandree@worldbank.org and mkurbucz@worldbank.org.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Graphical Abstract

[Graphical abstract figure omitted.]

Highlights

• Google Earth Engine offers vast satellite imagery and computational power.
• geeLite is an R package designed to leverage the power of Google Earth Engine.
• It enables creating, updating, and managing custom databases for real-time tracking.
• It stores data in SQLite format, enhancing both accessibility and portability.
• It streamlines the reading, aggregation, and processing of created databases.

Building and Managing Local Databases from Google Earth Engine with the geeLite R Package

Marcell T. Kurbucz a,b,1,∗, Bo Pieter Johannes Andrée b,1

a Institute for Global Prosperity, The Bartlett, University College London, 149 Tottenham Court Road, London, W1T 7NF, United Kingdom
b Development Economics Data Group, World Bank, 1818 H Street NW, Washington, D.C., 20433, USA

Keywords: Google Earth Engine, Geographic Information System, Remote Sensing, Raster Data, Spatio-Temporal Data, Software

JEL Codes: C21, C63, C81, C88, L17

∗ Corresponding author.
Email addresses: mkurbucz@worldbank.org (Marcell T. Kurbucz), bandree@worldbank.org (Bo Pieter Johannes Andrée)

1 These authors contributed equally to this work. Funding by the World Bank's Food Systems 2030 (FS2030) Multi-Donor Trust Fund program (TF0C0728 and TF0C7822) is gratefully acknowledged. We thank Andres Chamorro and Ben P. Stewart for code testing and comments, as well as Steve Penson, David Newhouse and Alia J. Aghjanian for helpful comments and input. This paper reflects the views of the authors and does not reflect the official views of the World Bank, its Executive Directors, or the countries they represent.

1. Introduction

The ever-growing volume of Earth observation data presents both opportunities and challenges for scientific inquiry. While vast datasets hold the potential to revolutionize our understanding of Earth systems, traditional desktop-based analysis methods often struggle with the computational burden associated with such data (Amani et al., 2020). Google Earth Engine (GEE) has emerged as a powerful solution, offering a cloud-based platform for efficient management, analysis, and visualization of geospatial big data (Gorelick et al., 2017). Its core strength lies in its ability to overcome the limitations of traditional approaches by providing (Tamiminia et al., 2020):

• Petabytes of public geospatial data: A comprehensive data catalog readily accessible through a web interface, including historical and current satellite imagery, environmental variables, and other geospatial information.

• High-performance parallel processing: Leveraging Google's cloud infrastructure for large-scale analysis, enabling researchers to tackle complex problems that would be infeasible on personal computers.

• Accessible development environment: A web-based interface and application programming interfaces (APIs) in JavaScript and Python, supporting widely used programming languages.
This combination of features allows scientists to investigate new scientific questions in a wide range of subjects, particularly those requiring large-scale spatial and temporal analysis.2 Examples include studies on climate change (Banerjee et al., 2024; Kazemi Garajeh et al., 2024), environmental degradation (Andrée et al., 2019), land cover change (Wang et al., 2020; Burgueño et al., 2023), deforestation (Chen et al., 2021; Brovelli et al., 2020), flood monitoring (Hamidi et al., 2023), wildfire prediction (Tavakkoli Piralilou et al., 2022), urbanization (Marconcini et al., 2021; Zheng et al., 2021), food security (Andrée et al., 2020; Wang et al., 2022; Penson et al., 2024), and sustainable development (Burke et al., 2021), to name a few.

2 For a comprehensive review of current and potential future GEE research trends, refer to Velastegui-Montoya et al. (2023).

Recognizing the broader potential of GEE within the R user community, Aybar et al. (2020) developed the rgee R package. Acting as a bridge, rgee seamlessly integrates GEE with R's extensive ecosystem of geospatial packages, making GEE's capabilities accessible to a wider user base. This integration is facilitated by the reticulate package (Ushey et al., 2024), allowing rgee to interact with GEE's official Python API and utilize all its modules, classes, and functions.

In recent years, numerous R packages have been published to extend the functionality of rgee and improve its user experience. Among these, tidyrgee (Arno and Erickson, 2022) aids users in filtering, joining, and summarizing GEE image collections, rgee2 (Kong, 2022) expands upon the original package with additional functions, and rgeeExtra (Aybar et al., 2023) simplifies its syntax. Other packages, such as SAEplus (Team, 2022) and LandsatTS (Berner et al., 2023), facilitate data extraction and processing from GEE, or provide geospatial decision-making tools, such as RePlant alpha (Morales et al., 2023).
While available third-party libraries improve the accessibility of GEE's functionalities, constructing and managing up-to-date temporal geospatial databases remains a complex, time-consuming task that requires advanced coding knowledge. This poses a significant barrier for users interested in monitoring applications but lacking such expertise. To address this gap, our paper introduces geeLite, a novel R package that enables users to easily build, maintain, and keep up-to-date local temporal databases of GEE-computed geospatial features. geeLite offers a flexible yet user-friendly data collection procedure and stores the gathered features in SQLite format—a serverless, self-contained solution that eliminates the need for additional setup or administration.3 Furthermore, it streamlines the conversion of stored data into native R formats and provides functions for aggregating and processing created databases to meet specific user needs. The metadata for the new package is detailed in Table 1.

3 More information can be found at: https://sqlite.org/features.html (retrieved: February 2, 2025).

Table 1: Metadata of the geeLite package

Current code version: 0.1.0
Permanent link: https://github.com/mtkurbucz/geeLite
Legal code license: Mozilla Public License Version 2.0
Code versioning system: Git
Software code languages: R
System requirements: OS agnostic (Linux, macOS, MS Windows). R 4.3.2 or later.
Required R packages: cli, crayon, data.table, dplyr, geojsonio, googledrive, h3jsr, jsonlite, knitr, lubridate, magrittr, optparse, progress, purrr, reshape2, reticulate, rgee, RSQLite, sf, stats, stringr, tidyr, tidyrgee, utils, rnaturalearth, rnaturalearthdata.
Suggested R packages: leaflet, rmarkdown, testthat, withr.
External dependencies: A virtual Conda environment with the following Python packages: earthengine-api, ee_extra, and numpy. The current version of the geeLite package (0.1.0) specifically requires version 0.1.370 of earthengine-api.
User manual: https://github.com/mtkurbucz/geeLite/blob/master/README.md
Support email: mkurbucz@worldbank.org

Note: The underlined R packages are not directly used by the geeLite package; however, they are essential for the database construction process. During the setup of geeLite via its gee_install function, the availability of these packages is checked, and they are automatically installed if not already available.

The rest of this paper is organized as follows. Section 2 outlines the installation steps, workflow, structure, and unit testing of the geeLite package. Section 3 presents an example of the package's application. Finally, Section 4 discusses the impact of the software and provides conclusions.

2. Software Description

2.1. Installation

The geeLite package can be installed from GitHub (Kurbucz and Andrée, 2024) using the devtools R package (Wickham et al., 2022). To interact with GEE, geeLite utilizes the rgee R package, which requires a virtual Conda environment containing the earthengine-api, ee_extra, and numpy Python packages. After installing geeLite, users need to set up this environment using the gee_install function as follows:

# Install geeLite package: [R code]
# install.packages("devtools")
devtools::install_github("mtkurbucz/geeLite")
geeLite::gee_install()

Note that the current version of the geeLite package (0.1.0) is specifically compatible with earthengine-api version 0.1.370. The gee_install function ensures the installation of the correct version of this package and also installs the geojsonio and rnaturalearthdata R packages if they are missing.

2.2. Workflow

The workflow of the geeLite package consists of two primary steps.
First, the configuration step (a) allows users to create and modify a configuration file that specifies the dimensions, variables, and spatial aggregation methods for the desired database. Subsequently, this file guides the second step, data collection (b), which begins with user authentication and then efficiently builds a new database or updates an existing one according to the user's specifications. Additionally, the package provides basic tools for managing the resulting database within an R session (c) and offers utilities to streamline the automation of database updates (d). The workflow and folder structure of the generated database are illustrated in Figure 1 and detailed in the following points.

Figure 1: Workflow and folder structure of the generated database

Note: The geeLite package operates through two primary steps: (a) configuration, where users define the parameters for the database to be generated, and (b) data collection, which begins with user authentication. Upon successful authentication, the package collects and preprocesses data from GEE, tailored to the user's specifications, to build or update a custom SQLite database.

a) Configuration

A configuration file, created using the set_config function, is stored locally in JSON format. This file allows users to define regions of interest (regions) for zonal statistics calculations at both the administrative level 0 (countries) and administrative level 1 (subnational states). These regions are identified using ISO 3166-2 codes, which consist of two letters for countries and additional characters for states.4 Users can select multiple data sources (source) by specifying one or more GEE datasets, along with their bands of interest and the statistical indicators for spatial aggregation.
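A sketch of what such a JSON configuration could look like (hypothetical: the field names mirror the set_config parameters regions, source, resol, start, and limit described in this paper, but the exact schema written by set_config may differ):

```json
{
  "regions": ["SO", "YE"],
  "source": {
    "MODIS/061/MOD13A2": {
      "NDVI": ["mean", "sd"]
    }
  },
  "resol": 3,
  "start": "2010-01-01",
  "limit": 10000
}
```

Here each dataset maps to its bands, and each band to the zonal statistics to be computed, following the nested names, bands, and zonal_stats structure of the source parameter.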
The configuration file also includes settings for the resolution of the H3 grid system (resol),5 the start date for data collection (start), and a scaling parameter to adjust image quality before processing (scale). Additionally, to avoid interruptions caused by exceeding GEE's limit on concurrent zonal statistics calculations, users can set a maximum number of calculations to be performed simultaneously in the configuration file (limit), with the default limit set to 10,000.

To simplify the configuration process for data collection, the fetch_regions function allows users to view the complete list of available region codes, while the get_config function displays the current contents of the configuration file. Users can also modify the selected regions (regions), data sources (source), and the limit on concurrent zonal statistics calculations (limit) either manually (outside the R session) or using the modify_config function. However, modifying other configuration parameters requires rebuilding the database based on a new configuration file.

b) Data Collection (with Authentication)

Data collection involves user authentication, as well as the building and updating of the database using the run_geelite function.

Authentication

Building on the authentication protocol of the rgee package, the run_geelite function initiates the authentication process by directing users to a web browser, where they are prompted to grant permission to link their Google accounts with GEE. Users are then instructed to copy the resulting token into the provided input field. Upon successful authentication, a directory is created under the path ∼/.config/earthengine/, named according to the specified username. This directory securely stores all credentials associated with the user's Google account. If no username is provided, the credentials are stored directly in the ∼/.config/earthengine/ folder. As long as the credentials remain valid, geeLite will automatically use them to initialize the selected Google account.

4 Shapefiles for the selected regions are retrieved using the rnaturalearth package (Massicotte and South, 2024).
5 geeLite applies Uber's H3 grid system (Uber, 2021) using the h3jsr package (O'Brien, 2023) to divide user-defined regions into hexagonal bins, from which zonal statistics are calculated.

In addition to the standard authentication method, users have the option to authenticate using Google Cloud service account credentials. This method allows them to create a service account via the Google Cloud Console and download a corresponding JSON key file. By configuring the GOOGLE_APPLICATION_CREDENTIALS environment variable to reference the downloaded JSON key file, users can bypass the interactive, browser-based authentication process.6 Once the environment variable is set, geeLite will automatically use the service account credentials for authentication. This method is particularly useful for automated workflows and production environments where graphical interfaces may not be available.

Database Building and Updating

After user authentication, the database-building operation gathers GEE-computed geospatial features based on the configuration file and stores them in SQLite format (geelite.db).7 Depending on the value of the mode parameter, the extraction process operates in either "local" mode—processing data in smaller chunks directly within R—or "drive" mode, which leverages Google Drive for exporting larger datasets. The resulting database contains information about H3 bins (referred to as grid) and collected datasets, organized into separate tables named according to their associated GEE datasets. The grid table includes an Open Geospatial Consortium (OGC) feature identifier (ogc_fid), an H3 index (id), a corresponding region code (iso), and a geometry shape (geometry).
Other tables, containing zonal statistics, are structured in a wide format. These tables include the H3 index (id) of the associated bins, the band name (band), the statistical indicator employed (zonal_stat), and the zonal statistics computed on specific dates (formatted as YYYY_MM_DD) when the dataset was updated.

In addition to the configuration file and SQLite database, the package generates two supplementary files: state.json and log.txt. These files respectively record the current state of the database and log session types (build or update) along with their timestamps.

6 To set this environment variable, use the following R code: Sys.setenv(GOOGLE_APPLICATION_CREDENTIALS = "path/to/service-account-key.json").
7 SQLite objects are managed by the RSQLite package (Müller et al., 2024).

To facilitate dataset management and scheduled updates, the geeLite package organizes a cli folder alongside the database. This folder contains an R script for each main function of the package—except for set_cli and read_db—allowing them to be executed from the Command Line Interface (CLI). To run these functions via the CLI, use the following structure: Rscript [cli/function] --parameter [parameter]. These scripts automatically use the path to the root directory of the generated database as an input parameter, eliminating the need for manual definition.8

Once the database is built, it can be updated (rebuild = FALSE) or rebuilt (rebuild = TRUE). For updates, the process begins with a comparison of the configuration and state files. If the configuration file has been modified since the last data collection procedure, the database will be adjusted and updated accordingly. A successful update overwrites the state file and modifies the log file. In the case of rebuilding, the data collection is based solely on the configuration file, and it overwrites the existing database and its supplementary files.
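Since geelite.db is a standard SQLite file, it can also be inspected directly with the DBI and RSQLite packages, independently of geeLite. The following sketch builds an in-memory toy database that mimics the layout described above (hypothetical H3 index and values; the grid geometry column is omitted for brevity):

```r
library(DBI)
library(RSQLite)

# In-memory stand-in for data/geelite.db: a grid table plus one
# wide-format zonal-statistics table named after its GEE dataset.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "grid",
             data.frame(ogc_fid = 1L, id = "hex_1", iso = "SO"))
dbWriteTable(con, "MODIS/061/MOD13A2",
             data.frame(id = "hex_1", band = "NDVI", zonal_stat = "mean",
                        `2010_01_01` = 0.21, check.names = FALSE))

dbListTables(con)  # lists both tables, as any SQLite client would
dbGetQuery(con,
           "SELECT id, zonal_stat FROM \"MODIS/061/MOD13A2\"
            WHERE band = 'NDVI'")
dbDisconnect(con)
```

Because the statistics table names contain slashes, they must be quoted as identifiers in raw SQL, as shown in the SELECT statement.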
Note that while some parameters of the configuration file (regions, source, and limit) can be modified manually, the state file is vulnerable to any manual changes. Manual modification or deletion of this file can corrupt future updates of the database and necessitate a rebuild.

c) Data management in R

The geeLite package provides two main functions to simplify the analysis of the generated SQLite database: fetch_vars and read_db. The fetch_vars function allows users to retrieve detailed information about the available variables in the database. This information includes variable names—comprising the database, band, and statistical indicator names separated by slashes—along with variable IDs, source details, start and end dates, and average frequencies in days. The read_db function is designed to read, aggregate, and process data from the database according to user-defined parameters.

When executing read_db, the selected variables are first converted to a daily frequency using a preprocessing function (prep_fun)—such as the default linear interpolation. The data is then aggregated at a specified frequency (freq) using one or more aggregation functions (aggr_funs); by default, monthly mean values are calculated. Additionally, read_db provides the option to apply further transformations through post-processing functions (postp_funs), which can be especially useful for feature extraction in machine learning applications.

8 For more detailed information on using these CLI scripts, please refer to Section 3.1 or visit the package's GitHub repository (Kurbucz and Andrée, 2024).

To enhance flexibility, geeLite allows users to fully customize both the aggregation and post-processing functions. This can be achieved not only by adjusting default settings, but also by applying specific functions to different variables.
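The default daily conversion can be illustrated with base R's approx(), which performs the same kind of linear interpolation (an illustrative sketch of the behavior described above, not geeLite's internal implementation; the dates and values are made up):

```r
# Two hypothetical 16-day NDVI measurements, expanded to a daily series
# by linear interpolation, mimicking read_db's default prep_fun behavior.
dates  <- as.Date(c("2010-01-01", "2010-01-17"))
values <- c(0.20, 0.28)

daily  <- seq(min(dates), max(dates), by = "day")
interp <- approx(x = as.numeric(dates), y = values,
                 xout = as.numeric(daily))$y

length(interp)  # 17 daily values from the 2 original measurements
```

The interpolated series keeps the original observations at their dates and fills the gaps linearly, after which the aggregation step (aggr_funs) can be applied at any frequency.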
Furthermore, geeLite includes an additional main function, init_postp, which provides the option to define post-processing functions externally, in separate files, thereby improving the transparency of the transformation process. This function creates a postp folder in the root directory of the generated database, containing an editable R script (functions.R) where users can define custom post-processing functions, along with a JSON file (structure.json) that specifies which functions to apply to which variables during post-processing.

After the database is imported, R's full range of capabilities for processing, modeling, and plotting the data can be utilized.

d) Automation

The geeLite package allows users to execute its functions via the command-line interface (CLI). Each database created with geeLite includes a cli folder containing CLI-compatible versions of all functions, except for set_cli and read_db. Alternatively, these functions can be generated directly using the set_cli function. This feature is designed to support automatic database updates through job-scheduling utilities, such as Linux Crontab (see, e.g., Bradley, 2016).

2.3. Structure

The eleven main functions of the geeLite package, along with their parameters, are detailed in Table 2.

Table 2: Description of main functions and parameters

Functions (Name | Parameters | File | Description):
- gee_install | 1 | gee_install.R | Creates a Conda environment and installs required dependencies.
- set_config | 2-9 | set_config.R | Creates the configuration file.
- run_geelite | 1-2, 9-12 | run_geelite.R | Collects and stores data, and updates the state and log files.
- modify_config | 2, 9, 13, 14 | modify_config.R | Modifies the configuration file.
- set_cli | 2, 9 | set_cli.R | Creates scripts to make main functions callable through the CLI.
- get_config | 2 | get_json.R | Prints the configuration file.
- get_state | 2 | get_json.R | Prints the state file.
- fetch_regions | 15 | fetch_regions.R | Displays ISO 3166-2 region codes.
- fetch_vars | 2, 16 | access_db.R | Displays information on the available variables in the SQLite database.
- read_db | 2, 17-21 | access_db.R | Reads, aggregates, and processes data from the SQLite database.
- init_postp | 2, 9 | access_db.R | Initializes the file structure for external post-processing.

Parameters (ID | Name | Type | Description):
- 1 | conda | character | Name of the virtual Conda environment.
- 2 | path | character | Path to the root directory for the generated database.
- 3 | regions | character | ISO 3166-2 codes of regions of interest, with two letters for countries and extra characters for subdivisions like states.
- 4 | source | list | Detailed description of the GEE datasets of interest. This is a nested list with three levels: names, bands, and zonal_stats.a
- 4.1 | names | list | Names of the GEE datasets to be used (e.g., "MODIS/061/MOD13A1").
- 4.1.1 | bands | list | Specific data bands from the datasets (e.g., "NDVI").
- 4.1.1.1 | zonal_stats | character | Type of spatial statistics to be computed for each region (options: "mean", "sum", "median", "min", "max", "sd").
- 5 | resol | integer | Spatial resolution of the H3 grid system.b
- 6 | scale | integer | The nominal resolution (in meters) for processing the image projection. If left as NULL (the default), a resolution of 1000 is used.c
- 7 | start | date | Starting date for data collection (default: "2020-01-01").
- 8 | limit | integer | Controls batch size for concurrent zonal statistics calculations. In local mode, it is ⌊limit/number of dates⌋. In drive mode, it sets the max H3 bins per export. Default: 10000.
- 9 | verbose | logical | Displays messages (default: TRUE).
- 10 | rebuild | logical | If TRUE, overwrites the database and associated files based on the configuration file (default: FALSE).
- 11 | user | character | Generates a folder in ∼/.config/earthengine/ to store credentials for a specific Google identity. Default is the root directory: NULL.
- 12 | mode | character | Specifies the mode of data extraction. Acceptable values are "local" and "drive". In "local" mode, data is extracted in smaller chunks directly within R, which is suitable for modest data volumes. In "drive" mode, data is exported via Google Drive, enabling the handling of larger datasets that require parallel processing or higher export limits. Defaults to "local".
- 13 | keys | list | Paths to the values designated for replacement.d
- 14 | new_values | list | New values to replace the original entries at the specified paths.d
- 15 | admin_lvl | integer | Specifies the administrative level: 0 for country, 1 for state, or NULL for all regions (default: 0).
- 16 | format | character | Specifies the output format. Options include "data.frame" (default), "markdown", "latex", "html", "pipe" (Pandoc pipe tables), "simple" (Pandoc simple tables), or "rst".
- 17 | vars | character | Names or IDs of the selected variables. Use the fetch_vars function to list available variables (default: "all").
- 18 | freq | character | Output frequency. Options are "day", "week", "month" (default), "bimonth", "quarter", "season", "halfyear", "year", or NULL (disable aggregation).
- 19 | prep_fun | function | A pre-processing function for time series data, applied before aggregation. For daily frequency, missing values are handled using the specified method. By default, if set to NULL, linear interpolation is applied.
- 20 | aggr_funs | function or list | A function or list of functions to aggregate data at the specified frequency (freq). The default is mean: function(x) mean(x, na.rm = TRUE).e
- 21 | postp_funs | function, list, or character | A function or list of functions applied to the time series data of a single bin after aggregation. The default is NULL, indicating no post-processing.e Alternatively, set to "external" to apply post-processing from external files initialized with the init_postp function. See Appendix A for more details.

Note: Parameters marked with underlines are considered optional. (Referenced pages retrieved on February 2, 2025.)
a The complete data catalog of GEE is accessible at: https://developers.google.com/earth-engine/datasets/catalog.
b Allowable values and related dimensions can be found at: https://h3geo.org/docs/core-library/restable/.
c More information can be found at: https://developers.google.com/earth-engine/guides/scale.
d For example, use keys = list("regions", c("source", "MODIS/061/MOD13A2", "NDVI")) and new_values = list(c("SO", "KE"), c("mean", "max")) to update the regions to Somalia and Kenya and modify the zonal_stats for "NDVI" to "mean" and "max".
e Both the aggr_funs and postp_funs parameters allow users to apply different functions to different variables. To do this, users can define a nested list, such as aggr_funs <- list("default" = FUN_1, "Var_1" = list(FUN_1, FUN_2)), where the default function is applied to any variable not explicitly specified. In postp_funs, the functions defined in aggr_funs can be referenced by their index, corresponding to the order in which they were listed. For example, postp_funs <- list("default" = NULL, "Var_1/2" = FUN_3) applies FUN_3 as a post-processing function after the data is aggregated using FUN_2.

2.4. Unit Test

The geeLite package includes a test file generated using the testthat package (Wickham, 2011). This file automatically tests nearly all functions provided by geeLite, with the exception of set_cli, fetch_regions, and init_postp. The tests cover a range of tasks, including configuration generation, database building, database reading, configuration modification, and database updates.

If the user's Google account is not authenticated through a regular session of the geeLite package before running the tests, they can authenticate manually using the internal function geeLite:::set_depend(conda = "rgee", user = NULL, drive = TRUE, verbose = TRUE).9

9 The parameters are described in detail in Table 2.

3. Illustrative Example

This section presents a practical example of the typical workflow for the geeLite package, demonstrating key steps such as configuration, database creation, updates, data retrieval, and processing. The example generates an up-to-date database containing the mean and standard deviation zonal statistics for the Normalized Difference Vegetation Index (NDVI) in Somalia and the Republic of Yemen. These statistics are calculated from January 1, 2010, using a grid system with a resolution level of three, which corresponds to areas of approximately 12,393 square kilometers.10

10 NDVI data are sourced from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument. The dataset's profile is available at: https://developers.google.com/earth-engine/datasets/catalog/MODIS_061_MOD13A2 (retrieved: February 2, 2025).

3.1. Related Workflow

a) Configuration

First, the fetch_regions function is used to identify the ISO 3166-2 country codes for Somalia and Yemen. These codes are then used to configure the data generation process as follows:

# Setting the config file: [R code]
set_config(path = "path/to/db",
           regions = c("SO", "YE"),
           source = list(
             "MODIS/061/MOD13A2" = list(
               "NDVI" = c("mean", "min")
             )
           ),
           resol = 3,
           start = "2010-01-01")

As a result, the configuration file (config/config.json) is generated at the specified target path. Alternatively, the data generation process can be configured via the CLI. First, the CLI versions of the geeLite functions are set up using the set_cli function. Then, the data generation process is configured through the CLI as follows:11

# Setting the CLI files: [Bash code]
Rscript ./geeLite/cli/set_cli.R --path "path/to/db"

# Change directory:
cd path/to/db

# Setting the configuration file:
Rscript cli/set_config.R --regions "SO YE" --source "list('MODIS/061/MOD13A2' = list('NDVI' = c('mean', 'min')))" --resol 3 --start "2010-01-01"

b) Data Collection

Once the configuration file is defined, the current database and its supplementary files (state/state.json, log/log.txt, along with the CLI versions of the functions) can be generated using the run_geelite function:

# Building or updating the database: [R code]
run_geelite(path = "path/to/db")

Re-running the run_geelite function updates the dataset. Both the building and updating processes can be easily handled through the corresponding CLI script, as shown below:12

# Building or updating the database: [Bash code]
Rscript cli/run_geelite.R

11 The CLI versions of the functions automatically load all necessary dependencies.
12 Using the CLI script is particularly useful for scheduling regular database updates.

c) Configuration Modification

To update regions of interest, data sources, or the limit parameter, modify the configuration file using the modify_config function, then execute the run_geelite function to refresh the database. In the configuration, mean (mean) and minimum (min) statistics were initially selected for zonal statistics calculations.
To tailor the database for the original task, the following code replaces the minimum statistic with the standard deviation (sd) in the configuration file and updates the database accordingly:

    # Modifying the configuration file:                   [R code]
    modify_config(path = "path/to/db",
                  keys = list(c("source", "MODIS/061/MOD13A2", "NDVI")),
                  new_values = list(c("mean", "sd")))

    # Updating the database:
    run_geelite(path = "path/to/db")

Alternatively, the same modifications to the configuration file and database can be made via the CLI using the following code:

    # Modifying the configuration file:                   [Bash code]
    Rscript cli/modify_config.R \
      --keys "list(c('source', 'MODIS/061/MOD13A2', 'NDVI'))" \
      --new_values "list(c('mean', 'sd'))"

    # Updating the database:
    Rscript cli/run_geelite.R

After the modifications, the generated SQLite database occupies 1.37 MB of disk space. Note that increasing the spatial resolution, number of indicators, aggregation methods, or geographic coverage will lead to greater disk usage. For example, raising the hexagonal resolution from size 3 to size 4 in the original configuration (reducing the bin size from approximately 12,393 to 1,770 square kilometers) would increase the file size to 9.33 MB. Despite the higher resolution, the file size would still be manageable, even for multiple countries.

3.2. Reading and Processing the Database

The generated SQLite database (data/geelite.db) contains two tables: grid, which stores information about the H3 bins, and MODIS/061/MOD13A2, which holds the collected zonal statistics.13 To read, aggregate, and pre-process the database in R, the read_db function from the geeLite package can be used.
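Before reading the full database, its contents can be inspected with the fetch_vars function mentioned in footnote 13. The call below is a sketch only: the exact signature and return format are assumptions, not taken from the package documentation.

```r
# Sketch: list the variables stored in the local SQLite database.
# Assumed usage: fetch_vars(path) returns the variable names, e.g.
# "MODIS/061/MOD13A2/NDVI/mean". Verify against the package docs.
library(geeLite)

vars <- fetch_vars(path = "path/to/db")
print(vars)
```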
In this example, the dataset was converted to a daily frequency using the default pre-processing method, which applies linear interpolation:

    # Reading SQLite tables:                              [R code]
    db <- read_db(path = "path/to/db", freq = "day")

This code snippet generates a list containing three elements: a simple features (sf) object representing the grid table, and two data frames corresponding to the variables MODIS/061/MOD13A2/NDVI/mean and MODIS/061/MOD13A2/NDVI/sd.

3.3. Visualizing the Database

After joining these tables using the H3 index (id), the spatiotemporal characteristics of the generated database are illustrated in Figure 2.14

13 The contents of the SQLite database can be reviewed using the fetch_vars function.
14 A code example for visualizing the mean NDVI values from the generated database is provided in Appendix B. To highlight the capabilities of the geeLite package, Figure A1 in Appendix C presents examples of higher-resolution H3 grid systems applied to NDVI data from Yemen.

Figure 2: Unscaled mean NDVI data retrieved from GEE (scale conversion factor: 0.00001)
Note: The generated database consists of forty-two H3 bins per country, each covering approximately 12,393 square kilometers. It tracks NDVI time series data across these bins, with 5,306 consecutive values derived from linear interpolation of the original 335 measurements collected between January 1, 2010, and July 11, 2024.

As shown in Figure 2, forty-two H3 bins were defined for both countries, each covering approximately 12,393 square kilometers. The number of time series generated for each bin corresponds to the number of variables collected. The original dataset, collected between January 1, 2010, and July 11, 2024, comprised 335 measurements. After converting the data to a daily frequency using linear interpolation, the number of measurements increased to 5,306.

3.4. Automation

To automate the updating of the generated database, Linux Crontab and the CLI version of the run_geelite function are used. For a monthly schedule, the following script can be utilized:15

15 For more information about Linux Crontab, see Bradley (2016).

    # Monthly update with Crontab:                        [Bash code]
    (crontab -l 2>/dev/null; echo "0 0 1 * * Rscript cli/run_geelite.R") | crontab -

4. Impacts and Conclusions

The development and release of geeLite represent a significant advancement in geospatial analysis, providing a user-friendly tool that simplifies the collection, management, aggregation, and processing of GEE data. By bridging the gap between GEE's powerful data capabilities and R users with varying levels of technical expertise, geeLite offers several key benefits:

• Facilitating real-time geospatial monitoring and early warning systems: Users can maintain continuously updated local geospatial databases, critical for real-time applications such as disaster response, environmental monitoring, and crisis management. Access to dynamic spatial data supports the development of early warning systems in these fields.

• Improving research efficiency and data accessibility: geeLite reduces the need for extensive coding, simplifying the collection, aggregation, and management of GEE data. Its intuitive design enables researchers to quickly gather and process large geospatial datasets, saving time and effort across multiple disciplines.

• Enhancing reproducibility and collaboration: With SQLite-based storage, databases are portable and self-contained, making it easy to share data and collaborate among research teams. This feature promotes reproducibility, allowing others to easily replicate experiments and validate findings.

• Supporting long-term geospatial studies: Designed for longitudinal research, geeLite excels at tracking geospatial features over time in studies such as climate change, land-use patterns, and biodiversity monitoring.
Its ability to manage and update local databases ensures consistency for long-term analysis.

In conclusion, geeLite marks a significant step forward in making geospatial analysis more accessible, efficient, and reproducible. As the scale and complexity of Earth observation data continue to grow, tools like geeLite will be essential in transforming raw data into actionable insights that support informed decision-making and impactful research.

Appendix A. Post-Processing with External Files

The init_postp function generates a postp folder in the root directory of the database. This folder includes an editable R script (functions.R), where users can define custom post-processing functions, and a JSON file (structure.json) that outlines which functions should be applied to each variable during post-processing. When the postp_funs parameter in the read_db function is set to "external", the post-processing logic specified in these external files is applied.

To illustrate the proper structure of these files, consider an example where the "MODIS/061/MOD13A2/NDVI/mean" variable is binarized based on its upper and lower deciles, while all other variables remain unchanged. First, define the functions FUN_1 <- function(x) as.numeric(x < quantile(x, 0.1)) and FUN_2 <- function(x) as.numeric(x > quantile(x, 0.9)) in the functions.R script. Then, to apply these functions to the mean NDVI variable, the structure.json file should be configured as follows:16

16 Note that, unlike in the R script, the absence of post-processing is denoted by lowercase null in the JSON file.

    {                                                     [JSON file]
      "default": null,
      "MODIS/061/MOD13A2/NDVI/mean": [
        ["FUN_1", "FUN_2"]
      ]
    }

Appendix B. Code Example for Data Visualization

Similar to Figure 2, the mean NDVI values from the database generated in Section 3 can be visualized using the following R script:

    # Install and load required packages:                 [R code]
    # install.packages("sf")
    # install.packages("dplyr")
    # install.packages("leaflet")
    library(sf)
    library(dplyr)
    library(leaflet)

    # Read database and merge grid with MODIS data:
    db <- read_db(path = "path/to/db")
    sf <- merge(db$grid, db$`MODIS/061/MOD13A2/NDVI/mean`, by = "id")

    # Select the date to visualize:
    ndvi <- sf$`2020-01-01`

    # Create a color palette function based on the values:
    color_pal <- colorNumeric(palette = "viridis", domain = ndvi)

    # Create the leaflet map:
    leaflet(data = sf) %>%
      addTiles() %>%                             # Add base tiles
      addPolygons(
        fillColor   = color_pal(ndvi),           # Fill color
        color       = "#BDBDC3",                 # Border color
        weight      = 1,                         # Border weight
        opacity     = 1,                         # Border opacity
        fillOpacity = 0.9                        # Fill opacity
      ) %>%
      addScaleBar(position = "bottomleft") %>%   # Add scale bar
      addLegend(
        pal      = color_pal,                    # Color palette
        values   = ndvi,                         # Data values to map
        title    = "Mean NDVI",                  # Legend title
        position = "bottomright"                 # Legend position
      )

Appendix C. High-Resolution H3 Grid

Figure A1: Examples of high-resolution H3 grid: unscaled mean NDVI data for Yemen retrieved from GEE (scale conversion factor: 0.00001)
Note: This figure demonstrates the capabilities of the geeLite package by collecting NDVI data for Yemen using high-resolution H3 grid systems. It illustrates two resolution levels: resol = 5 and resol = 6, which correspond to grid cells with approximate areas of 252 square kilometers and 36 square kilometers, respectively.
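As a supplement to Section 3.2, the daily-frequency conversion that read_db applies by default (linear interpolation, which expanded the 335 original measurements to 5,306 daily values in Figure 2) can be illustrated with base R's approx function. This is a sketch of the idea using synthetic values, not the package's internal code.

```r
# Sketch of the daily-frequency conversion performed by
# read_db(freq = "day"): composite observations are linearly
# interpolated onto a daily grid. Values below are synthetic.
obs_dates  <- as.Date("2010-01-01") + seq(0, 64, by = 16)  # 5 composites
obs_values <- c(0.21, 0.25, 0.31, 0.28, 0.24)              # synthetic NDVI

daily_dates <- seq(min(obs_dates), max(obs_dates), by = "day")
daily <- approx(x = as.numeric(obs_dates), y = obs_values,
                xout = as.numeric(daily_dates))$y

length(daily)  # 65 daily values interpolated from 5 measurements
```

The same expansion, applied to sixteen-day MODIS composites over the 2010-2024 study period, accounts for the growth from 335 to 5,306 values reported in the Figure 2 note.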
References

Amani, M., Ghorbanian, A., Ahmadi, S.A., Kakooei, M., Moghimi, A., Mirmazloumi, S.M., Moghaddam, S.H.A., Mahdavi, S., Ghahremanloo, M., Parsian, S., et al., 2020. Google Earth Engine Cloud Computing Platform for Remote Sensing Big Data Applications: A Comprehensive Review. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13, 5326–5350.

Andrée, B.P.J., Chamorro, A., Kraay, A., Spencer, P., Wang, D., 2020. Predicting Food Crises.

Andrée, B.P.J., Chamorro, A., Spencer, P., Koomen, E., Dogo, H., 2019. Revisiting the Relation Between Economic Growth and the Environment: A Global Assessment of Deforestation, Pollution and Carbon Emission. Renewable and Sustainable Energy Reviews 114. URL: http://www.scopus.com/inward/record.url?eid=2-s2.0-85070497157&partnerID=MN8TOARS, doi:10.1016/j.rser.2019.06.028.

Arno, Z., Erickson, J., 2022. tidyrgee: 'tidyverse' Methods for 'Earth Engine'. R Package Version 0.1.0, https://github.com/r-tidy-remote-sensing/tidyrgee.

Aybar, C., Montero, D., Barja, A., Herrera, F., Gonzales, A., Espinoza, W., 2023. Combining R and Earth Engine, in: Cloud-Based Remote Sensing with Google Earth Engine: Fundamentals and Applications. Springer, pp. 629–651.

Aybar, C., Wu, Q., Bautista, L., Yali, R., Barja, A., 2020. rgee: An R Package for Interacting with Google Earth Engine. Journal of Open Source Software 5, 2272.

Banerjee, A., Ariz, D., Turyasingura, B., Pathak, S., Sajjad, W., Yadav, N., Kirsten, K.L., 2024. Long-Term Climate Change and Anthropogenic Activities Together with Regional Water Resources and Agricultural Productivity in Uganda Using Google Earth Engine. Physics and Chemistry of the Earth, Parts A/B/C 134, 103545.

Berner, L.T., Assmann, J.J., Normand, S., Goetz, S.J., 2023. 'LandsatTS': An R Package to Facilitate Retrieval, Cleaning, Cross-Calibration, and Phenological Modeling of Landsat Time Series Data. Ecography 2023, e06768.

Bradley, J., 2016. OS X Incident Response: Scripting and Analysis. Syngress.

Brovelli, M.A., Sun, Y., Yordanov, V., 2020. Monitoring Forest Change in the Amazon Using Multi-Temporal Remote Sensing Data and Machine Learning Classification on Google Earth Engine. ISPRS International Journal of Geo-Information 9, 580.

Burgueño, A.M., Aldana-Martín, J.F., Vázquez-Pendón, M., Barba-González, C., Jiménez Gómez, Y., García Millán, V., Navas-Delgado, I., 2023. Scalable Approach for High-Resolution Land Cover: A Case Study in the Mediterranean Basin. Journal of Big Data 10, 91.

Burke, M., Driscoll, A., Lobell, D.B., Ermon, S., 2021. Using Satellite Imagery to Understand and Promote Sustainable Development. Science 371, eabe8628.

Chen, S., Woodcock, C.E., Bullock, E.L., Arévalo, P., Torchinava, P., Peng, S., Olofsson, P., 2021. Monitoring Temperate Forest Degradation on Google Earth Engine Using Landsat Time Series Analysis. Remote Sensing of Environment 265, 112648.

Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., Moore, R., 2017. Google Earth Engine: Planetary-Scale Geospatial Analysis for Everyone. Remote Sensing of Environment 202, 18–27.

Hamidi, E., Peter, B.G., Muñoz, D.F., Moftakhari, H., Moradkhani, H., 2023. Fast Flood Extent Monitoring with SAR Change Detection Using Google Earth Engine. IEEE Transactions on Geoscience and Remote Sensing 61, 1–19.

Kazemi Garajeh, M., Haji, F., Tohidfar, M., Sadeqi, A., Ahmadi, R., Kariminejad, N., 2024. Spatiotemporal Monitoring of Climate Change Impacts on Water Resources Using an Integrated Approach of Remote Sensing and Google Earth Engine. Scientific Reports 14, 5469.

Kong, D., 2022. rgee2. R Package Version 0.1.1, https://github.com/rpkgs/rgee2.

Kurbucz, M.T., Andrée, B.P.J., 2024. geeLite R Package. URL: https://github.com/mtkurbucz/geeLite.

Marconcini, M., Metz-Marconcini, A., Esch, T., Gorelick, N., 2021. Understanding Current Trends in Global Urbanisation: The World Settlement Footprint Suite. GI_Forum 9, 33–38.
Massicotte, P., South, A., 2024. rnaturalearth: World Map Data from Natural Earth. URL: https://docs.ropensci.org/rnaturalearth/. R Package Version 1.0.1.9000, https://github.com/ropensci/rnaturalearth, https://docs.ropensci.org/rnaturalearthhires/.

Morales, N.S., Fernández, I.C., Durán, L.P., Pérez-Martínez, W.A., 2023. RePlant Alfa: Integrating Google Earth Engine and R Coding to Support the Identification of Priority Areas for Ecological Restoration. Land 12, 303.

Müller, K., Wickham, H., James, D.A., Falcon, S., 2024. RSQLite: SQLite Interface for R. URL: https://rsqlite.r-dbi.org. R Package Version 2.3.7, https://github.com/r-dbi/RSQLite.

O'Brien, L., 2023. h3jsr: Access Uber's H3 Library. URL: https://obrl-soil.github.io/h3jsr/. R Package Version 1.3.1.

Penson, S., Lomme, M., Carmichael, Z., Manni, A., Shrestha, S., Andrée, B.P.J., 2024. A Data-Driven Approach for Early Detection of Food Insecurity in Yemen's Humanitarian Crisis. Technical Report. The World Bank.

Tamiminia, H., Salehi, B., Mahdianpari, M., Quackenbush, L., Adeli, S., Brisco, B., 2020. Google Earth Engine for Geo-Big Data Applications: A Meta-Analysis and Systematic Review. ISPRS Journal of Photogrammetry and Remote Sensing 164, 152–170.

Tavakkoli Piralilou, S., Einali, G., Ghorbanzadeh, O., Nachappa, T.G., Gholamnia, K., Blaschke, T., Ghamisi, P., 2022. A Google Earth Engine Approach for Wildfire Susceptibility Prediction Fusion with Remote Sensing Data of Different Spatial Resolutions. Remote Sensing 14, 672.

Team, S.S., 2022. SAEplus. R Package Version 0.1.0, https://github.com/SSA-Statistical-Team-Projects/SAEplus.

Uber Technologies, 2021. H3: A Hexagonal Hierarchical Geospatial Indexing System.

Ushey, K., Allaire, J., Tang, Y., 2024. reticulate: Interface to 'Python'. URL: https://rstudio.github.io/reticulate/. R Package Version 1.35.0, https://github.com/rstudio/reticulate.
Velastegui-Montoya, A., Montalván-Burbano, N., Carrión-Mero, P., Rivera-Torres, H., Sadeck, L., Adami, M., 2023. Google Earth Engine: A Global Analysis and Future Trends. Remote Sensing 15, 3675.

Wang, D., Andrée, B.P.J., Chamorro, A.F., Spencer, P.G., 2022. Transitions Into and Out of Food Insecurity: A Probabilistic Approach with Panel Data Evidence from 15 Countries. World Development 159, 106035. URL: https://www.sciencedirect.com/science/article/pii/S0305750X2200225X, doi:10.1016/j.worlddev.2022.106035.

Wang, L., Diao, C., Xian, G., Yin, D., Lu, Y., Zou, S., Erickson, T.A., 2020. A Summary of the Special Issue on Remote Sensing of Land Change Science with Google Earth Engine.

Wickham, H., 2011. testthat: Get Started with Testing. R J. 3, 5.

Wickham, H., Hester, J., Chang, W., Hester, M.J., 2022. Package 'devtools'.

Zheng, Z., Wu, Z., Chen, Y., Yang, Z., Marinello, F., et al., 2021. Analyzing the Ecological Environment and Urbanization Characteristics of the Yangtze River Delta Urban Agglomeration Based on Google Earth Engine. Acta Ecologica Sinica 41, 717–729.