Machine Learning Guided Outlook of Global Food Insecurity Consistent with Macroeconomic Forecasts

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Motivated by the deterioration in global food security conditions, this paper develops a parsimonious machine learning model to derive a multi-year outlook of global severe food insecurity from macro-economic projections. The objective is to provide forecasts that are internally consistent with wider economic assessments, allowing both food security policies and economic development policies to be informed by a cohesive set of expectations. The model is validated on holdout data that explicitly test the ability to forecast new data from history and extrapolate beyond observed intervals. It is then applied to the World Economic Outlook database of April 2022 to project the severely food insecure population across all 144 World Bank lending countries.
The analysis estimates that the global severely food insecure population may remain above 1 billion through 2027 unless large-scale interventions are made. The paper also explores counterfactual scenarios, first to investigate additional risks in a downside economic scenario, and second, to investigate whether restoring macroeconomic targets is sufficient to revert food insecurity back to pre-pandemic levels. The paper concludes that the proposed model provides a robust and low-cost approach to maintain reliable long-term projections and produce scenario analyses that can be revised systematically and interpreted within the context of available economic outlooks.
The model is applied to the World Economic Outlook (WEO) database of April 2022, a database of macro-economic projections that is updated semiannually by the International Monetary Fund (IMF), to reconstruct and project severe food insecurity from 2005 through 2027. 1 The resulting food insecurity outlook covers all 144 IDA and IBRD countries, which together account for approximately 98% of global historical food insecure populations, and is internally consistent with the economic outlooks that serve as the input for the projections.
The results show that the 2021-2023 severely food insecure population may reach above 1 billion in the World Bank's 144 IDA-eligible and IBRD countries, an increase of over 172 million people over the 2017-2019 pre-pandemic estimate. The projection exceeds the estimate of under 780 million for the 2008 World Food Price Crisis period and may remain above 1 billion through 2027 without large-scale targeted interventions. It is important to note that these results are a forecast of unmitigated impacts extrapolated from 2019 data. At the same time, the April 2022 WEO has not yet factored in the full effects of the Ukraine crisis. Because of the lightweight specification of the model, results can easily be updated and revised as new data arrive. The framework can also use country-specific growth projections available from other authorities and incorporate relevant data streams that may become available in the future.
Because finance is often unlocked contingent on some estimate of needs or projected response costs, the paper shows how the country-level food insecurity estimates can be combined with response cost assumptions to assess temporal trends in development assistance needs. The different predictions work in tandem to decompose projected development assistance needs into the combined effects of changes in the prevalence of severe food insecurity, continued growth in the overall population, and changes in per capita response costs. In the baseline forecast, the prevalence of severe food insecurity is projected to drop by about 0.5 percentage points by 2025 from 2021-2023 highs, but the combined effect of continued population growth and elevated commodity prices would still lead to a doubling in development assistance needs compared to pre-pandemic baselines, calling for a sustained approach to investing in improving food security worldwide.
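The decomposition described above can be illustrated with a minimal sketch. It assumes, as a simplification that is not stated in the paper, that needs equal prevalence times population times per capita response cost, so that the log change in needs splits additively into three contributions. The function names and all numbers are hypothetical and chosen only for illustration.

```python
import math

def needs(prevalence, population, cost_per_capita):
    """Total response cost: share severely food insecure x people x cost per person."""
    return prevalence * population * cost_per_capita

def decompose_change(base, new):
    """Split the log change in needs into additive contributions of each factor.

    `base` and `new` are (prevalence, population, cost_per_capita) tuples.
    """
    names = ("prevalence", "population", "cost_per_capita")
    parts = {n: math.log(b1 / b0) for n, b0, b1 in zip(names, base, new)}
    parts["total"] = math.log(needs(*new) / needs(*base))
    return parts

# Purely illustrative values, not taken from the paper.
base = (0.100, 1_000_000, 50.0)
new = (0.095, 1_100_000, 95.0)
parts = decompose_change(base, new)
```

In this toy case the prevalence contribution is negative while the population and cost contributions are positive, so needs can rise even when prevalence falls.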
The paper concludes that the proposed framework provides a robust and low-cost approach to maintain reliable projections and scenario analyses that can be interpreted within the context of available economic outlooks, contributing to the capacity to monitor global food insecurity trends and allowing both food security policies and economic policies to be informed by a cohesive set of expectations.
The paper is structured as follows. Section I reviews related literature and discusses the background to this work. The objective is to better understand the use case that leads to the relatively narrow specification of the proposed model. Section II then introduces the data and develops the modeling framework. Section III presents validation and key prediction results, Section IV continues the discussion, and Section V concludes. Additional results are found in the appendix.
ANDRÉE SEPTEMBER 28, 2022

I. Background
Food crises have devastating effects on populations. Black et al. (2013) conclude that almost half of global child deaths are associated with undernutrition and Devereux (2000) estimates that famines in the 20th century were responsible for as many deaths as World Wars I and II combined. The international community has long responded to critical food insecurity situations with humanitarian aid after the severity of a situation has been declared, a definition that may involve mortality itself. By this time, irreversible harm has already been done. Delay in unlocking financing resources until after a rise in mortality has been confirmed played a critical role in the severity of the Somali famine of 2010-2011 (Lautze et al., 2012; Seal and Bailey, 2013), which left almost 300,000 dead despite more than a billion dollars in aid spent on response (Checchi and Courtland, 2013). In recent years, that realization has resulted in a growing emphasis on investment in prevention and resilience (Haan et al., 2012). 2 In many cases, planning responses ahead has been shown to save more lives and be more cost effective than responding after situations have deteriorated (Meerkatt et al., 2015; Mechler, 2016). A proactive approach to emergency response financing requires a capacity to anticipate when, where, and how many people will be at risk of severe food insecurity conditions under baseline economic scenarios.
To contribute to this capacity, this paper explores a statistical model to estimate and project global food insecurity conditions using macro-economic outlooks. Extreme food insecurity situations are multifaceted in nature and frequently develop as part of multiple interlocking challenges (Maxwell et al., 2020). The paper's approach anchors the food insecurity outlooks to economic assumptions in an attempt to contribute to broader foresight that is internally consistent so that multiple policy actions can be informed jointly by a cohesive set of expectations. This choice of approach is motivated further by the fact that development finance involves large capital flows that are replenished following a slow multi-year cycle. The general trends in the magnitude of future at-risk populations should therefore ideally be projected years in advance so that existing financing mechanisms can be restructured, or new ones designed, according to the projected changes in needs. Country eligibility for development finance operations is typically also conditional on the existence of a developed macro-economic framework, and lending is geared toward investments that contribute to set development goals that are grounded in country economic analyses.
2 Recent financial innovations in this direction are for example the World Bank's Early Response Financing (ERF) tools under the Crisis Response Window (CRW) that may provide rapid disbursements to respond to slower-onset events which are identified as having the potential to escalate into major crises. Eligibility is conditional on having or developing a preparedness plan for food insecurity. Another example is the Crisis and Emergency Response Component (CERC), which can be used to rapidly channel pre-allocated emergency funds to respond to emergencies, such as droughts or floods triggered by an El Niño / La Niña event whose risks can be pre-identified.
Food insecurity in turn poses wide economic risks that may lead to serious setbacks in progress toward the development goals of affected countries. 3 Economic developments and food insecurity trends should therefore not be analyzed in isolation from one another. 4 The focus on prevention has motivated researchers to develop forward-looking capabilities. This paper specifically contributes to understanding the ability of statistical models to anticipate future food insecurity conditions in the particular context of assessing development financing needs and builds on several earlier efforts. Mellor (1986) discussed prevention strategies with an emphasis on economic weakness, crop failure, and subsequent price signals as leading indicators of famine, providing a modeling context that remains relevant to this day. Inflation signals in particular had also been discussed, for instance by Seaman and Holt (1980), who theorized that in anticipation of extreme food shortage, market prices should increase as populations begin hoarding food items. Cutler (1984) evidenced this using data from the Ethiopian famine in 1972-1974, and Khan (1994) during the famine of 1984-1985 in Niger. Andrée (2021a) provides a longer list of severe food crises that occurred against the backdrop of record inflation levels.
More recently, researchers have begun to implement these ideas using machine learning techniques that draw from a wider variety of indicators for prediction. The majority of these efforts focus on granular predictions derived from novel data sources. For example, Mwebaze et al. (2010) and Okori and Obua (2011) predict household famine in Uganda between 2004, Vu et al. (2022) predict changes in household level food insecurity in Vietnam, and Lentz et al. (2019) predict food insecurity for village clusters in Malawi. Martini et al. (2021) and Foini et al. (2022) develop high-frequency now-casting capabilities at a global scale focused on the food consumption score and Balashankar et al. (2021) extract leading signals from news streams. Andrée et al. (2020) predict the outbreaks of new food crises up to a full year ahead at the administrative level in 21 high-risk countries using the price data from Andrée (2021a,b) combined with conflict data and remote sensing data. Using the data set from Andrée et al. (2020), Wang et al. (2020, 2022) also provide a stochastic simulation model that projects tail risks up to 3 years ahead. The current paper essentially proposes a middle ground between the work of Wang et al. (2020, 2022) and Andrée et al. (2020), combining the multi-year outlook properties provided by the first approach and the predictive optimization of the latter, while aiming for global coverage such as attempted by Martini et al. (2021) and Foini et al. (2022).
The importance of maintaining current outlooks on global food security developments, along with mechanisms to update and revise projections systematically and in a timely manner, is likely to increase in coming years as many drivers of food insecurity are projected to worsen into the 21st century. Success in poverty reduction has largely coincided with substantial loss of pristine natural environments (Stern et al., 1996; Andrée et al., 2019a). Degraded environments, more frequent climatic extremes, and a continuously growing population will continue to put pressure on future agricultural systems (Grainger, 1990; Ingram et al., 2010; Myers et al., 2017; Diogo et al., 2017). These developments put the rural poor that rely on local natural assets for income and food consumption particularly at risk (Duraiappah, 1998; Barrett and Bevis, 2015; Barbier and Hochard, 2018).

A. Target Variable
The analysis targets the prevalence (rate) of severe food insecurity. This metric is derived from the Food Insecurity Experience Scale (FIES) and is used to monitor progress toward the UN's SDG of Zero Hunger by tracking food insecurity at three levels: food security, moderate food insecurity, and severe food insecurity. 5 This data is available for individual countries from the FAO (Food and Agriculture Organization) as 3-year centered moving averages. The FAO also publishes annual figures, but only at an aggregated level.
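To make the notion of a 3-year centered moving average concrete, the following sketch computes one from an annual series. It assumes a simple unweighted mean of the previous, current, and next year; the exact FAO procedure is not described in this text. The function name and the numbers are hypothetical.

```python
def centered_ma3(values):
    """3-year centered moving average; endpoints lack a full window and are None."""
    out = [None] * len(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3
    return out

# Hypothetical annual prevalence rates in percent, for illustration only.
annual = [9.0, 10.0, 14.0, 12.0]
smoothed = centered_ma3(annual)
```

Because the value centered on a given year needs the following year's observation, the most recent year in any series has no centered average yet.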
The prevalence of severe food insecurity is defined as the percentage of people in the population who live in households classified as severely food insecure. A household is classified as severely food insecure when at least one adult in the household has reported, for the last 12 months, having been forced to reduce the quantity of food consumed, having skipped meals, having gone hungry, or having gone a whole day without eating because of a lack of money or other resources. This food insecurity metric captures a chronic definition of food insecurity, and differs from acute measures such as the percentage of people that are in food crisis, and thus unable to meet minimum dietary needs without adopting irreversible coping strategies that have long-term impacts. 6 The severely food insecure population is however the wider population that is at risk of becoming acutely food insecure and is thus of particular interest to prevention and resilience strategies. An important distinction from a policy perspective is that when food insecurity is primarily chronic, markets are typically still functional so that a wider range of economic interventions may still be considered as part of a prevention and resilience strategy. 7
5 The population shares that experience each level of food insecurity are estimated based on survey responses to eight yes/no questions about access to adequate food; see details by Cafiero et al. (2018).
6 Chronic food insecurity is broadly speaking the result of overwhelming poverty while acute food insecurity may be considered to be more of a short-term phenomenon related either to man-made or natural shocks, such as drought, that often hit while chronic food insecurity has already become prevalent.
A challenge in using the data to track global developments is that the regional totals cannot be traced back to the underlying country figures. First, the annual figures are not available at the individual country level. Second, the moving averages are not published for all countries that make up the aggregated totals. For instance, 70% of the 2018-2020 average of the global number of severely food insecure people cannot be accounted for after summing up all available data points. This is mostly because the data is not available for several countries with high population counts like China, India, and Pakistan, and several of the major ongoing food crises including those in the Sahel region, the Syrian Arab Republic, and the Republic of Yemen. 8 In some of these countries, particularly the Sahel countries, other harmonized food insecurity assessments are available from the Cadre Harmonisé and the Integrated Phase Classification system. For a small set of countries, these data are utilized to produce proxies for the prevalence of severe food insecurity using linear regression scaling. The details are in the appendix, section A.A1.
A second challenge is that the data is produced following a deliberate sampling and measurement process that is not designed for monitoring rapid developments and guiding actions during crisis situations. There is a typical delay in the individual country figures of 2 years due to the 3-year averaging plus the typical publication delay of any country figure. For policy purposes, it would be beneficial to have preliminary estimates and future expectations and be able to track changes therein. As an example, upward or downward revisions in economic growth expectations are typically more informative for setting a new course of action than the official growth rate from one or more years ago.

B. Modeling Framework
In order to help evaluate shifts in food insecurity during a rapidly evolving crisis environment and provide adjustable forward guidance, the analysis takes a model-based approach to producing preliminary estimates, filling historical data gaps, and projecting future values. To accomplish this, a predictive model is constructed for severe food insecurity that uses covariates that describe macroeconomic drivers of food insecurity and structural vulnerabilities.
7 Focus on the more chronic measure is grounded in a rhetoric of prevention, adapted to the geographical and temporal scale at which development finance decisions take place. This proposes that long-term aspects of food insecurity can be tackled through long-term development assistance, while acute food insecurity situations may require direct humanitarian assistance. This also implies that humanitarian objectives may benefit from different modeling objectives that have already been pioneered in earlier work, for instance the 1 to 12 months ahead subnational food crisis risks forecast by Andrée et al. (2020) or the 36 months ahead projections at national scale by Wang et al. (2020, 2022).
8 The number of countries for which data are available varies over time (96 in 2015, 120 by 2019). It is unclear to what extent trends in country coverage of the assessments are associated with global totals, which, by itself, is another motivation for this work. It is worth noting that the need for this complex modeling work, and imputation in general, is in part a result of chronic underinvestment in transparent data collection and dissemination systems. The overall availability of reliable data, without which investments are difficult to justify, remains low, particularly in lower income countries.
The model is of a simple contemporaneous form:

$$y_{it} = f(X_{it}), \tag{1}$$

where $y_{it}$ is the prevalence rate of severe food insecurity for country $i$ at time $t$, and $X_{it}$ is a vector of length $d$ that describes several attributes of country $i$ at time $t$. For estimation purposes, the individual country data can naturally be stacked as vectors so that the prediction function can be estimated from $Y = f(X)$, where $Y$ is a vector of length $T \times N$ and $X$ is a $d$-column matrix with $T \times N$ rows. The annual predictions at the national level are optimized for out-of-sample squared correlations. The optimized model $\hat{f}$ can then be used to reconstruct historical food insecurity data based on official macro-economic statistics, fill missing observations, or produce preliminary estimates and projections by leveraging preliminary assessments of the macro-economic figures.
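As a rough illustration of the stacked form Y = f(X) and of the squared-correlation criterion, the sketch below flattens a small country-by-year panel into rows and scores a set of predictions. The data layout, function names, and numbers are assumptions made for this example; in practice the score would be computed on holdout rows that the model has not seen.

```python
def stack_panel(panel):
    """Flatten {country: [(x_vector, y), ...]} into row lists X (d columns) and Y."""
    X, Y = [], []
    for country in sorted(panel):
        for x, y in panel[country]:
            X.append(list(x))
            Y.append(y)
    return X, Y

def squared_correlation(actual, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (va * vp)

# Hypothetical panel: N = 2 countries, T = 3 years, d = 2 covariates.
panel = {
    "A": [([1.0, 0.2], 10.0), ([1.1, 0.3], 11.0), ([1.2, 0.1], 12.5)],
    "B": [([0.5, 0.9], 30.0), ([0.6, 0.8], 28.0), ([0.7, 0.7], 27.0)],
}
X, Y = stack_panel(panel)  # T x N = 6 rows
```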

C. Covariates
Since the outlook is intended to inform broad-based economic policies that can be defended in front of a non-expert audience, parsimony and straightforwardness in input variables carry a premium. A critical constraint for specifying $X$ is that credible outlooks or plausible scenarios for the future values $X_H : \{X_{t+1}, X_{t+2}, \ldots, X_{t+h}\}$ across a long horizon $h > 3$ will be required to produce meaningful food insecurity outlooks. The empirical application here applies the model to the WEO database to project food insecurity through 2027, and considers the following variables for model estimation purposes.
1) The poverty rate at US$1.90
2) GDP per capita, PPP adjusted
3) The 3-year average real GDP growth rate
4) The 3-year average population growth rate
5) The 3-year average CPI (Consumer Price Index) inflation rate
6) Agriculture, forestry, and fishing, value added (% of GDP)
7) Food imports (% of merchandise imports)
8) Fuel imports (% of merchandise imports)
9) Agricultural land cover (% of land)
10) Forest land cover (% of land)
11) Historical child mortality (1995-2015 average rate)

All the data, including the dependent variable, are taken from the WDI (accessed by API on September 1, 2022). All the data come from standard sources and are well maintained. Two additional fixed controls are added. 9 First, a fixed country size indicator is added to capture the nominal resource potential implied by the land cover shares. Second, the change from the 2006-07 minimum to 2008-12 maximum CPI inflation rates is added to capture overall domestic price (in)stability during the previous global food price crisis when global food prices spiked in 2008 and 2011. This also allows the model to relate the 3-year inflation rates to the extreme price changes that occurred in this period.
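The second fixed control can be sketched as follows, under the assumption that it is the highest annual CPI inflation rate observed in 2008-2012 minus the lowest rate observed in 2006-2007. The function name and the series are hypothetical and serve only to illustrate the calculation.

```python
def inflation_swing(cpi_inflation):
    """Max CPI inflation over 2008-2012 minus min over 2006-2007.

    `cpi_inflation` maps year -> annual CPI inflation rate in percent.
    """
    low = min(cpi_inflation[y] for y in (2006, 2007))
    high = max(cpi_inflation[y] for y in range(2008, 2013))
    return high - low

# Hypothetical series for one country, in percent.
series = {2006: 4.0, 2007: 3.0, 2008: 9.5, 2009: 2.0, 2010: 5.0, 2011: 8.0, 2012: 6.0}
swing = inflation_swing(series)
```

Because the control is computed once per country from a fixed historical window, it does not vary over time and needs no future projection.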
The covariates can be grouped roughly across five dimensions: macro-economic stability, food affordability, the ability to expand the short-term food supply, pressures on food resources, and pre-existing vulnerabilities. More specifically, the covariates inform the model about trends in productivity growth in relation to population growth and price growth, and whether growth in purchasing power occurs equitably. The 3-year inflation rate captures the rate at which prices increase and captures overall macro-economic stability. The percent of merchandise imports spent on food provides a measure of food import dependency, and the percent spent on fuel captures whether the domestic food production system, through diesel fuel, is dependent on imports. The availability of agricultural land, and the share of GDP produced by the agricultural system, capture the size and added value of the domestic food production system. Interactions between these two variables will be most informative to the model. A higher share of agricultural GDP for a given level of arable land suggests high-yield technologies are used. Conversely, low values in both suggest a low capacity of domestic production systems. The complement to agricultural land is roughly natural land since urban areas are usually just a fraction of total land cover. The presence of forest land cover thus captures whether natural land is barren or lush, which allows the model to roughly distinguish between different agro-climatic zones. Finally, a historical deep lag of child mortality is used to capture fragility in the recent past. In section IV, possible missing data dimensions and future avenues for data integration are discussed further.
The application builds on a 1999-2019 data set covering all 144 IDA plus IBRD countries. Kosovo, Somalia and Syria did not have sufficient data and are treated separately as described in the appendix, but 10 additional countries with complete data were added for model estimation purposes. 10 This means there are 3,171 observations (N=151, T=21). Some observations are missing and have to be interpolated. This is only needed for a small share of data points, primarily in the poverty data, and is discussed in detail in the appendix.

D. Estimation
The model is estimated non-parametrically, using the Cubist algorithm described by Quinlan (1992) and Witten et al. (2016) and implemented by Kuhn et al. (2012). 11 This is a piece-wise linear model that combines decision trees, boosting, and neighborhood smoothing to capture smoothed versions of Random Forest-type nonlinearities that can be locally extrapolated. In particular, where a Random Forest or Boosted Trees algorithm can only interpolate (by returning for instance a median within the range of values seen) at terminal nodes, the terminal node in Cubist ends in a linear regression that can extrapolate. This generally also results in smoother transitions around the cutoffs of decision trees, which is more suitable for numeric data. Due to the local linear nature, local extrapolations generally remain reasonably stable, which is a benefit over neural network predictions that can rapidly turn explosive outside observed data intervals.
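The difference between a constant terminal node and a linear one can be shown with a toy single-split tree. This is not the Cubist algorithm: it has no rule pruning, boosting, or neighborhood smoothing, and the threshold is fixed by hand. All names and data are hypothetical; the sketch only illustrates why a linear leaf can extrapolate beyond the training range while a constant leaf cannot.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def fit_one_split(xs, ys, threshold):
    """Fit both a constant and a linear model in each side of a single split."""
    leaves = {}
    for side, keep in (("left", lambda x: x <= threshold),
                       ("right", lambda x: x > threshold)):
        px = [x for x in xs if keep(x)]
        py = [y for x, y in zip(xs, ys) if keep(x)]
        leaves[side] = {"mean": sum(py) / len(py), "line": fit_line(px, py)}
    return leaves

def predict(leaves, threshold, x, linear):
    """Route x to its leaf, then use the leaf's line or its mean."""
    leaf = leaves["left" if x <= threshold else "right"]
    if linear:
        a, b = leaf["line"]
        return a + b * x
    return leaf["mean"]

xs = [float(i) for i in range(11)]   # training inputs 0..10
ys = [1.0 + 2.0 * x for x in xs]     # noiseless linear trend, illustration only
leaves = fit_one_split(xs, ys, threshold=5.0)
constant_pred = predict(leaves, 5.0, 15.0, linear=False)  # bounded by observed y
linear_pred = predict(leaves, 5.0, 15.0, linear=True)     # follows the local trend
```

At x = 15, outside the training interval, the constant leaf returns the mean of its partition, while the linear leaf continues the fitted slope.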
Cubist models have done well on a variety of prediction problems, often reaching accuracy not far below that of deep learning methods on complex prediction tasks while maintaining full model interpretability and often providing better performance on smaller data sets. See for instance Morellos et al. (2016), Ng et al. (2019), Andrée (2021a), and Sbahi et al. (2021) for applications. It also runs much faster than other ensembling methods such as Gradient Boosting Machines and eXtreme Gradient Boosting methods (Hagenauer et al., 2019). A final benefit of Cubist is that the full rule structure can be printed so it retains a relatively high degree of transparency compared to other "black-box" machine learning models.

Training on small data
The central modeling challenge is that the desire for predictions itself stems from a lack of data, and so the learning algorithm has to be optimized to learn efficiently from a small amount of data. An insufficient size of training data is well known to result in unsatisfactory generalization performance of machine learning algorithms, and researchers have suggested a variety of synthetic data techniques to combat the issue. 12
11 Cubist was developed as an extension of the better-known M5 regression trees model by incorporating pruning, neighborhood smoothing and boosting. Essentially it uses a computationally efficient strategy to recursively partition the data space and fit simple piece-wise linear prediction models within each partition, whose predictions are combined using neighborhood averaging of local model predictions. The advantages over M5 are that it can produce smoother transitions across numeric outputs and complete in a much faster runtime.
12 Niyogi et al. (1998) discussed that the natural way for an intelligent learner to counter the small data problem and successfully generalize is by exploiting prior information that may be available about the domain or that can be learned from prototypical examples. They also show that the notion of using prior knowledge by expanding the training data with virtual data is mathematically equivalent to incorporating the prior knowledge as a regularizer, providing a theoretical motivation for the use of synthetic data. The literature on small data prediction has subsequently shown that synthetic data can help improve a model's ability to predict rare outcomes (Menardi and Torelli, 2014) and has since suggested a wide variety of sampling techniques that have gone beyond simple up-sampling methods to generate entirely new cases that mix attributes adaptively using non-parametric neighborhood methods (Chawla et al., 2002; Han et al., 2005; Douzas and Bacao, 2019; Douzas et al., 2022). In contrast, models have also been trained on data that has artificially been created from generative models (Li and Wen, 2014; Abdul Lateh et al., 2017) or even within fully simulated environments (Nikolenko, 2021).
Any synthetic data case consists of elements describing both the L.H.S. and R.H.S. of the regression model. A drawback of popular non-parametric sampling approaches that mix observed attributes by creating synthetic cases out of dynamically weighted observed cases is that these procedures may result in combinations of covariate values that are internally inconsistent. For instance, non-parametric sampling methods could generate synthetic cases with higher poverty and lower inequality, or simultaneously increase the shares of all types of land cover, ignoring that there is typically a systematic relationship between such variables. This issue can be circumvented when there are a high number of entries in the data for which covariates are observed and only the dependent variable is missing. In this case, synthetic observations can be generated whose covariate values are consistent with actual observations. 13
Here, synthetic cases are generated stochastically using a historical prediction model applied to historical observed covariates. Recall that the restrictions on the use of covariates in the final model are based on the availability of reliable future values of those covariates. For a historical prediction model used only to impute past observations, those restrictions do not apply. This fact is exploited to generate synthetic cases from historical covariates combined with historical proxies for food insecurity. The observations for the prevalence of severe food insecurity start only in 2015, while other food insecurity proxies go back much further in time. Importantly, these include data on undernourishment rates, child mortality rates, life expectancy, poverty at national lines, but also environmental rents and the percentage of urban population in addition to the full set of covariates used in the main model. This allows designing a more accurate model for the sole task of completing historical data, which can be used to draw synthetic cases from. This historical imputation model in vector notation is

$$Y = g(X, Z). \tag{2}$$

In this model, $Z$ now includes the additional covariates that are historically predictive but cannot be used for projection purposes. Particularly the historical malnutrition rates, life expectancy and child mortality rates are strong predictors for historical food insecurity and allow creating good additional synthetic cases. Imputations are generated using the multiple prediction framework of Andrée (2021a), which iteratively estimates fully optimized machine learning models to replace previous imputations with better predictions generated from all other covariates. 14 This is an iterative Markov chain Monte Carlo algorithm that converges rather fast. The number of updating iterations is based on the convergence of cross-validated prediction performance as in Andrée (2021a). Since the combined predictor set $(X, Z)$ is relatively large, recursive feature elimination is performed by estimating variable importance using a Random Forest, then iteratively discarding the least important predictor. The variable selection that yields the best cross-validation performance for the severe food insecurity prevalence rates is used to run the first 10 imputation iterations, after which the algorithm continues with the fully specified regressions that use all covariates.
13 As a simple example, Andrée et al. (2020) train a model on monthly food insecurity assessments that are published every 4 months while covariates can be observed every month. They produce simple synthetic cases by assuming that the categorical outcome of the food insecurity assessment would be the same one month prior to or after the assessment. They argue that since the official assessments themselves are not exactly timed measurements but qualitative estimates prone to error, the errors in the synthetic cases would likely be of a similar degree as the measurement error in the observations. This allows them to expand the size of the training data by a factor of 3, while ensuring that all covariate values are internally consistent, and this improves prediction on holdout observations.
To adjust for the fact that the synthetic cases are estimated with uncertainty, with some cases generated in more uncertain regions than others, a total of 10 imputations are generated for each entry of observed covariates. This uses the stochastic properties of the estimated model of Equation 2 to generate multiple likely values. The final model is then trained by creating 10 training data sets $\{Y_1, \ldots, Y_{10}\}$, each replacing $Y_m^{\text{missing}}$ with different predictions $\hat{Y}_m^{\text{missing}}$ generated from models $\{\hat{g}_1, \ldots, \hat{g}_{10}\}$. This results in 10 different models $\{\hat{f}_1, \ldots, \hat{f}_{10}\}$, which can be ensembled into a single model $\hat{f}$ by aggregating the predictions of the individual base learners $\{\hat{f}_1, \ldots, \hat{f}_{10}\}$. 15 The accuracy of the final model can be validated using standard validation techniques.
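The multiple-imputation ensemble can be sketched in a few lines. The example below is a simplified stand-in: the base learner is a one-variable least-squares line rather than Cubist, the imputation model is a hypothetical noisy function, and aggregation is assumed to be a plain average of base-learner predictions, which the text does not specify.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def train_on_imputations(x_obs, y_obs, x_missing, impute, n_imputations=10, seed=0):
    """Fit one base learner per stochastic completion of the missing targets."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_imputations):
        y_drawn = [impute(x, rng) for x in x_missing]  # one draw per missing entry
        models.append(fit_line(x_obs + x_missing, y_obs + y_drawn))
    return models

def ensemble_predict(models, x):
    """Aggregate base learners by averaging their predictions (an assumption)."""
    return sum(a + b * x for a, b in models) / len(models)

def noisy_imputer(x, rng):
    """Hypothetical imputation model: a noisy guess around 2 + 0.5*x."""
    return 2.0 + 0.5 * x + rng.gauss(0.0, 0.3)

x_obs, y_obs = [5.0, 6.0, 7.0], [4.5, 5.0, 5.5]  # entries with observed targets
x_missing = [1.0, 2.0, 3.0]                      # covariates seen, target missing
models = train_on_imputations(x_obs, y_obs, x_missing, noisy_imputer)
forecast = ensemble_predict(models, 8.0)
```

Each base learner sees a different plausible history, so entries imputed with more uncertainty vary more across training sets and pull less consistently on the averaged prediction.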
What this strategy achieves is essentially a form of simulation-based training. The intuition is that the final forecast model has to project a global food insecurity situation for 2020 and after that is beyond anything seen in the 2015-2019 data, but likely comparable to the dynamics in the 2007-2011 data, when food prices spiked globally during a time when extreme poverty was more prevalent. Simulated data for this historical period therefore add crucial information about conditions outside of the 2015-2019 data perimeter. The primary aim of the synthetic data strategy is thus not to improve the average prediction for the 2015-2019 period (which a standard cross-validation exercise would measure), but to aid the extrapolation when covariates move out of 2015-2019 country intervals. If the non-parametric extrapolation is based solely on 2015-2019 dynamics, it may become highly unstable when many covariates go outside observed intervals in the post-2019 data, something a simple cross-validation exercise using the 2015-2019 data would not reveal. The application will show that the synthetic data strategy substantially increases prediction performance on holdout data selected around the edges of the sample space.
One important parameter in the synthetic data algorithm is the cutoff year that determines how many synthetic cases should be added, 1999 being all synthetic data and 2015 being no synthetic data. The synthetic cases are generated with a model that uses all 1999-2019 data, but the final forecast model may not necessarily benefit from training on the full synthetic history. The deeper history on the one hand allows the model to learn a richer representation of the data, but on the other hand the deeper history is more difficult to predict by the imputation model and so the quality of the synthetic cases may be lower. To find the right cut-off, holdout predictions of the final prediction model using all synthetic data cutoff years 1999, 2000, ..., 2015 have been validated. The optimal cutoff was selected as 2005, which thus covers synthetic cases leading up to and following the previous global food crisis around 2008.
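The cutoff search amounts to a one-dimensional grid search over candidate years. The sketch below assumes a hypothetical `holdout_score` callable that retrains the final model with synthetic cases from a given cutoff year onward and returns its holdout score; the toy score used in the example is a placeholder, not the paper's validation results.

```python
def select_cutoff(candidate_years, holdout_score):
    """Return the cutoff year with the highest holdout score, plus all scores."""
    scores = {year: holdout_score(year) for year in candidate_years}
    best_year = max(scores, key=scores.get)
    return best_year, scores

# Placeholder score with an interior optimum, for illustration only.
toy_score = lambda year: -abs(year - 2005)
best_year, scores = select_cutoff(range(1999, 2016), toy_score)
```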

A. Validation Results
Three main models are estimated. Linear uses a simple linear regression, trained only on observed data, thus without the addition of any synthetic cases. Cubist trains a Cubist regression on the same original data. Cubist synth estimates the Cubist regression with synthetic cases. Prediction performance is estimated using several setups. In all the experiments, the models are optimized for $R^2$ and the associated MAE and RMSE are given for completeness. 16 First, a standard 10-fold cross-validation exercise was performed. This estimates each model 10 times on a 90% sample so that predictions are validated 10 times on the 10% held-out observations. The results show that the Cubist model beats the linear model by a large margin, with an $R^2$ of 0.94 versus 0.60 for the linear regression, highlighting the nonlinear nature of the data relationships. The linear model is simply not able to cope with the cross-country heterogeneity and nonlinearity. 17 The Cubist model with synthetic cases performs even better. The cross-validated $R^2$ is 0.95 for the Cubist model with synthetic data, and the RMSE and MAE drop from respectively 3.49 and 1.75 to 3.10 and 1.62 compared to the vanilla Cubist model, a reduction of roughly 10%.

16 Cubist models can be optimized using two tuning parameters, Committees and Neighbors, that respectively control the boosting iterations and the degree of neighborhood smoothing that is applied. First, a full search across all possible Neighbors values was conducted using the standard grid of boosting iterations (1, 10, 20) considered by the implementation of Kuhn et al. (2012). Then, the optimal Neighbors value +/- 1 was used with a boosting grid of 1, 10, 20, 40, 60. The maximum number of boosting iterations possible is 100, but this increased the computational burden beyond manageable levels.
In the second experiment, the holdout sample consists of the highest severity rate observed in each country. This explicitly validates the models for their ability to anticipate an unprecedented value (outside observed history) and is specifically aimed at assessing each model's ability to extrapolate. This more closely resembles the intended use of forecasting the global crisis risks post 2019 based on 2015-2019 data. It follows the previous work of Celiku and Kraay (2017) and Andrée et al. (2020), who argue that for prevention purposes, a model must be able to forecast an outbreak (an extreme event) before it occurs, and thus be validated using statistical measures and holdout data that reflect this use case explicitly. This is opposed to using a standard combination of holdout data and statistical measures that are dominated by no-change events. In the extreme value holdout test, a strong drop in prediction performance is visible for both the standard linear and vanilla Cubist models. While the vanilla Cubist model had an $R^2$ of 0.94 in the standard 10-fold cross-validation test, performance drops to 0.82 when the model is trained without each country's highest observed prevalence rate and then validated against these difficult observations. The RMSE also more than doubles from 3.49 to 7.53, while the MAE increases 2.5-fold. The performance of the Cubist model that has trained on simulated data is not as badly affected. Drawing from richer memory, the model maintains an $R^2$ of 0.94, while the RMSE and MAE are respectively 34% and 42% below those of the vanilla Cubist model.

In the third experiment, the holdout consists of the 2019 observation. In this test, the models are thus explicitly validated for their ability to forecast future data, training only on history. This similarly zooms in on the intended use of forecasting. Compared to the basic 10-fold validation exercise, there is again a deterioration in prediction performance of the vanilla Cubist model. While the vanilla Cubist model had an $R^2$ of 0.94 in the standard 10-fold cross-validation test, performance drops to 0.87 when the model is trained on history and validated on future data. The RMSE also increases from 3.49 to 6.01. The performance of the model that has trained on simulated data is again not as badly affected. The synthetically trained Cubist has a very high $R^2$ of 0.97, and the RMSE and MAE are respectively 41% and 26% below those of the vanilla Cubist model.

17 A fixed effects model may cope better with the cross-country heterogeneity but would not be able to use those in-sample averages to extrapolate onto unobserved countries, and in some cases, there is only a single prevalence rate observed in the country, which would introduce estimation issues.
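The two targeted holdout designs (each country's highest observed rate, and the final observed year) can be sketched as follows. The record layout with `country`, `year` and `rate` keys and the toy values are assumptions made for illustration, not the paper's data.

```python
def extreme_value_holdout(rows):
    """Hold out, for each country, the row with its highest observed rate."""
    top = {}
    for i, row in enumerate(rows):
        country = row["country"]
        if country not in top or row["rate"] > rows[top[country]]["rate"]:
            top[country] = i
    held = set(top.values())
    train = [row for i, row in enumerate(rows) if i not in held]
    test = [rows[i] for i in sorted(held)]
    return train, test

def final_year_holdout(rows, year=2019):
    """Train on years before `year`, validate on `year` itself."""
    train = [row for row in rows if row["year"] < year]
    test = [row for row in rows if row["year"] == year]
    return train, test

# Toy records, values are illustrative only.
rows = [
    {"country": "A", "year": 2017, "rate": 10.0},
    {"country": "A", "year": 2018, "rate": 12.0},
    {"country": "A", "year": 2019, "rate": 11.0},
    {"country": "B", "year": 2018, "rate": 5.0},
    {"country": "B", "year": 2019, "rate": 7.0},
]
train_ext, test_ext = extreme_value_holdout(rows)
train_fy, test_fy = final_year_holdout(rows)
```

Note that a country with a single observation would lose all of its training rows under the extreme-value split; this sketch does not handle that case specially.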
The overall out-of-sample estimate for the final model, an $R^2$ of 0.95, suggests that the model is highly predictive. To test whether the model performance is stable across countries, the cross-validation scores are broken down by major World Bank regions. The results show that across all regions and income groups, the model remains accurate. The model performs best in middle-income countries, in which there is no noticeable bias and an $R^2$ of 0.96-0.98, but predictions have a slight downward bias in the large group of low-income countries (the average prediction is 25.5 compared to the observed average of 26.6) and a minor upward bias in the high-income countries (average prediction of 3.6 compared to the average observation of 3.3).

Note: Prediction performance estimated for various setups. In all experiments, the model is optimized for $R^2$ and the associated MAE and RMSE are given. The $R^2$ is the preferred metric for optimization. This is because the projections are generated by extrapolating the covariate matrix using the % change implied by the WEO (to cancel possible biases between historical level data and values used in the WEO, for instance when the WDI and WEO deviate from one another in terms of population counts), and subsequently the prevalence of severe food insecurity is extrapolated by taking the change in the response variable and extrapolating from the last known official data point. In countries without official data, the projection simply coincides with the point estimates generated by the model, in which case the RMSE and MAE are better indicators of accuracy. The RMSE also provides the approximate linear confidence interval and is best compared against the standard deviation of the observations. The difference between the average prediction and the average observation reveals bias, which should be compared against the MAE to understand whether the RMSE is mostly driven by variance in prediction error or bias in prediction error. In all cases, the model seems to behave well across all groups, with performance improving in middle-income areas where, taken together, large populations live in food insecure situations. Source: Results have been estimated by the author for this paper.
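The validation metrics and the bias comparison described above can be computed as in the sketch below. It assumes that $R^2$ is the squared Pearson correlation between observations and predictions; the function name and toy inputs are illustrative.

```python
from math import sqrt
from statistics import mean

def validation_metrics(obs, pred):
    """R^2 (squared correlation, an assumption), RMSE, MAE and mean bias."""
    mo, mp = mean(obs), mean(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    errors = [p - o for o, p in zip(obs, pred)]
    return {
        "r2": cov ** 2 / (var_o * var_p),
        "rmse": sqrt(mean([e ** 2 for e in errors])),
        "mae": mean([abs(e) for e in errors]),
        "bias": mp - mo,  # average prediction minus average observation
    }

# Predictions that track the observations perfectly but sit one point too high.
metrics = validation_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

In this toy case the correlation is perfect while the RMSE, MAE and bias all equal one point, which illustrates why the bias should be read alongside the MAE: here the entire error is bias rather than variance.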

Note that the $R^2$ for South Asia is calculated based on only 22 samples, and hence it should with high probability be lower than in the other regions purely due to the small sample size rather than actual worse prediction performance. 18 The model still seems to perform reasonably well, for instance when the focus is on the MAE of 0.99 rate points, which is below that of some other regions.

B. Prediction Results
In this section, the final optimized model is used to assess the likely compound effects of the pandemic and Ukraine war and to compare the forecast to estimates for the 2008 Global Food Crisis period. 19 The main text presents plots for all countries combined; additional plots for specific regions and income groups are in the appendix (figures A1 and A2).
The results are generated by projecting the covariates forward using the WEO. Outlooks are available for GDP, population sizes, and inflation figures. The WEO tends to have a positive bias, documented for instance by Gatti et al. (2022), and may be understood as economic targets that can be met if identified risks are managed. The April WEO is put together during January-March. As such, the data do not yet fully reflect the impacts of recent major events. Risks and associated downside trajectories are however discussed, so counterfactual scenario values of inflation, growth and poverty are constructed here from the downside multipliers of the April report to investigate the additional impacts of downside economic risks. Specifically, the downside projections consider a further increase in global food prices of +6% in 2022 and +3% in 2023, and reduced growth of -2% below default projection growth rates in 2022 and 2023, converging to -0.9% by 2027, as a result of increased tightening by central banks to curb inflation. 20

18 A proof is beyond the scope of the paper, but note that (squared) correlations cannot be averaged linearly as this results in a biased (under)estimate of total correlation. That is detailed in the literature around the Fisher z transformation and its adjusted versions. Therefore, by running the logic behind the Fisher z transformation in reverse, one could show that for a vector pair $(X_1, X_2)$, each of length $L$ with a correlation $R$, there exists a partition so that for sub-vector pairs with lengths $l_1, l_2, \dots, l_N$ with $l_1 \equiv l_2 \equiv \dots \equiv l_N$ and $L := l_1, l_2, \dots, l_N$ it holds true that the correlations of those sub-vectors satisfy $r_1 + \delta_1 \to r_2 - \delta_2 \to r_3 + \delta_1 \to \dots \to r_N + \delta_1 \to R$ as $l_N \to \infty$, $\forall\, \delta_1 < \delta_2$ and $0 < r_1 + \delta_1 < 1$ and $0 < r_2 - \delta_2 < 1, \dots, 0 < r_N + \delta_1 < 1$. A trivial deterministic numerical example is the limit $\{r_1 = 0.85, \delta_1 = 0.05\}$, $\{r_2 = 0.935, \delta_2 = 0.035\}$ and $R = 0.90$. In other words, as soon as there are even-length sub-vectors with uneven correlations, which should be expected to occur purely from random draws, then for every sub-vector pair with correlation $\delta_2$ above the total vector correlation, there must be a sub-vector with $\delta_1 > \delta_2$ below the total vector correlation. This disproportionality increases with a power of 2 for squared correlations. Since this argument can be repeated iteratively on the sub-vector pairs, it holds that a larger divergence $\lVert r_{N_2} - R \rVert > \lVert r_{N_1} - R \rVert \;\forall\, N_2 < N_1$ can always be found. Note that in all cases, the sub-vector correlations $r_{N_1}, r_{N_2}, \dots, r_{N_X}$ are generated by the same process that has a correlation $R$. Strictly, this also makes the sample estimates of the $R^2$ not directly comparable across unequal sample sizes, unless $N \to \infty$ so that we deal with the deterministic limit $R_N \to R_\infty$.

19 The 2020 prevalence data were not yet available when this work was done. At the time of reading, the official 2020 figures are likely published and the reader is encouraged to compare, bearing in mind that the model projects unmitigated impacts not reflective of supportive policies. Future updates of the projections will include the new data. Generally, there are three interesting moments in the year to update the results: in April when the WEO is published, after July when the historical annual food insecurity data are published, and in October when the WEO is revised. In January and June, some country projections may also be updated based on the World Bank's Global Economic Prospects report.
In both projections, the poverty rates are carried forward using an income group-specific poverty-GDP per capita elasticity. Simple GDP-based extrapolations have been shown to work well, even when compared to more sophisticated methods (Mahler et al., 2021). Early work on monitoring pandemic-related impacts on poverty by Lakner et al. (2020) estimated that in 2020, global poverty rates were hit approximately 2.5 times harder than GDP per capita (ppp), and up to 3 times in downside estimates. As a conservative estimate, the poverty rates in 2020 are increased by 2 times the country-specific GDP per capita shock. The percentage of imports, land cover, and agricultural GDP shares are kept fixed at their last known values. Outlooks are typically not available for these variables. The percentages of agricultural land and GDP typically evolve slowly over time, and fixing them may thus be reasonable. In section III.D, the analysis will use numerical optimization and counterfactual techniques to investigate optimal agricultural GDP values to simulate how a reorganization of the domestic food production system may help turn the tide on rising food insecurity.

Note: The total food insecure population is based on the predicted prevalence rates, combined with the country population totals taken from the WDI database, and the population growth rates projected by the WEO. The black solid line corresponds to the period with observed data (2015-2019); the historical part of the black dashed line shows predictions generated based on historical covariate values from the World Bank WDI database, and the future values are generated using the IMF's WEO outlook of April 2022. The red dashed line is a downside projection that considers the slowed growth rates and higher inflation rates identified in the WEO's downside analysis. Source: Figure prepared by the author for this paper.

20 As the WEO identifies only a global average risk, the downside scenario is generated under the simplistic scenario in which the GDP rate reductions are applied equally to all countries as point percentage reductions and the food price inflation rates are applied as relative multipliers (times 2 in 2022, times 1.5 in 2023) to the default WEO CPI inflation rates.
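The simplified downside scenario of footnote 20 can be sketched as a transformation of baseline growth and inflation rates. The text fixes the growth reduction at 2 points in 2022-2023 and 0.9 points by 2027; the linear path in between, and a multiplier of 1 after 2023, are assumptions made only for this illustration.

```python
# Relative multipliers on baseline CPI inflation, as stated in footnote 20.
INFLATION_MULTIPLIER = {2022: 2.0, 2023: 1.5}

def growth_adjustment(year):
    """Point reduction in GDP growth relative to the baseline projection."""
    if year < 2022:
        return 0.0
    if year <= 2023:
        return -2.0
    if year >= 2027:
        return -0.9
    # Assumed linear convergence from -2.0 (2023) to -0.9 (2027).
    return -2.0 + (year - 2023) * (2.0 - 0.9) / (2027 - 2023)

def downside(year, base_growth, base_inflation):
    """Return (growth, inflation) in percent under the downside scenario."""
    growth = base_growth + growth_adjustment(year)
    inflation = base_inflation * INFLATION_MULTIPLIER.get(year, 1.0)
    return growth, inflation

# Illustrative baseline of 3% growth and 5% inflation.
example_2022 = downside(2022, 3.0, 5.0)
```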

Using global growth expectations published in the World Economic Outlook (WEO) of April 2022, the results in figure 1 put the 2021-2023 severely food insecure population at above 1 billion (a prevalence rate of 15%) in the World Bank's 144 IDA and IBRD countries, an increase of over 172 million people (a relative increase of 21%, or a 2 point increase in the prevalence rate) over 2017-2019 pre-pandemic estimates. This projection concerns unmitigated impacts extrapolated from 2019 data that do not reflect the large amounts of aid and supportive policies that characterized the pandemic years, but have also not yet factored in the effects of sanctions and export restrictions imposed after March 31, 2022. Nevertheless, the projection is clear in direction and paints a protracted picture in which the previously anticipated rebound in headline growth occurs unequally, leaving many behind in food insecure conditions. The projected improvement of about 0.5 point in the prevalence rate after the 2022 peak is not sufficient to offset continued population growth. Without large-scale targeted interventions, the figure is projected to remain above 1 billion through 2027.
The projected prevalence of severe food insecurity also clearly exceeds the estimates for the 2008 World Food Price Crisis period. When comparing the current projection to the modeled estimates for this historical period, it is clear that the current rise in food insecurity is much more pronounced. In this regard, it is important to highlight that the current rise in food insecurity occurs in a vastly different global economic environment. First, from 2007 through 2012, global poverty rates at US$1.90 fell each year, including by almost a percentage point from 2008 to 2009, while the preliminary consensus is that the pandemic year has triggered a reversal in poverty for the first time in decades. Second, while the 2008 and 2011 global food price spikes led to several severe crises, the global prevalence of undernourishment continued to fall, albeit more slowly in 2008. 21 The current developments, however, extend from slowed poverty reduction since around 2013, an increase in global severe food insecurity since 2015 (start of measurement) and a reversal in the global undernourishment rate in 2018. Finally, global GDP per capita fell by 4.3% in 2020 compared to 2.9% in 2009, so the recent shock has also been sharper.

C. Projecting trends in response costs
Development assistance has to account not only for changes in prevalence, but also for population growth and changes in response costs. This section showcases how the results can be used together with response cost assumptions to project trends in development financing needs. 22

In April 2022, immediately following the outbreak of the Ukraine crisis, the FAO Food Price Index (FPI) stood at 154.9, a +27.5% increase over the previous year (+31.7% taking the March reading). To offset the impact of this level of price increase on household expenditures, an aid program would need to meet a 21.5% replacement cost (24.1% taking the March reading), or roughly 25% to account for basic overhead. The cost of a minimum calorie sufficient diet has been put previously at a conservative average figure of US$0.75 per capita per day for the IDA countries in the WFP cohort in 2020 (FAO, IFAD, UNICEF and WHO, 2021). 23 In some regions this figure is higher. 24 For tractability and simplicity, the cost coefficient is kept fixed across countries, thus assuming food for aid is bought at IDA-typical prices everywhere. Narrower future needs assessments leveraging the proposed projections could use region-specific values or take into account the prices that prevail in source countries.

Using the projected severely food insecure populations and the per capita response costs of US$0.75, adjusted annually for food price inflation, figure 2 projects the 25% replacement cost needed to offset an approximate 30% food price inflation in the minimum costs of daily caloric needs. 25 The compounding effects of rising food insecurity and rising food prices since 2020 have a drastic effect on the projected needs, reaching over US$106 billion in 2022 alone, up from under US$60 billion for an equivalent 25% expenditure replacement prior to the pandemic. In IDA countries alone, the projected needs exceed the total IDA commitments that followed the pandemic by about 14%.

The result also highlights that while food insecurity was already on the rise during 2015-2019, the impact on financing resources was largely evened out by favorable developments in global food prices, so that development assistance financing needs rose only modestly. This signifies the difference in the current global situation. The long-term projection highlights that even though the prevalence rate is projected to improve slightly, the continued growth in population totals offsets this, keeping the global population that is in need of assistance relatively flat through 2027. If this occurs jointly with 'only' a normalization of food price inflation toward the 2005-2022 long-term average, then development assistance needs are set to spiral largely out of control, reaching about US$150 billion by 2027 compared to below US$60 billion prior to the pandemic. In the downside scenario, the cash value of feeding the entire severely food insecure population for a full year (4 times the plotted values, a 100% replacement cost) increases by US$387 billion from 2018 to 2027. This top-up alone is 172% of the total cash value of feeding the entire 2017-2019 average size of the severely food insecure population. The basic arithmetic suggests that global food prices must come down in order for the global food security situation to become more manageable.

Note: The total food insecure population is based on the predicted prevalence rates, combined with the country population totals taken from the WDI database, and the population growth rates projected by the WEO. The black solid line corresponds to the period with observed data (2015-2019); the historical part of the black dashed line shows predictions generated based on historical covariate values from the World Bank WDI database, and the future values are generated using the IMF's WEO outlook of April 2022. The red dashed line is a downside projection that considers the slowed growth rates and higher inflation rates identified in the WEO's downside analysis. Projected population totals are converted to financing needs based on the assumption of a 25% replacement cost, roughly sufficient to offset a 30% price hike, and assuming a US$0.75 daily cost of a minimum calorie sufficient diet (2020), adjusted annually for food price inflation using the FAO FPI. Source: Figure prepared by the author for this paper.

21 The SOFI 2019 even found that world hunger did not rise during the 2007-08 global food crisis and 2008-09 financial crisis (box 10 on page 56 of the report), which runs counter to the modeled estimates here. The discrepancy lies in the definition: in this box the report uses undernourishment, rather than severe food insecurity, to refer to 'hunger'.

22 For strict humanitarian response cost estimation, focusing on acute hunger rather than chronic food insecurity may result in a more precise estimate. Nevertheless, the arithmetic here otherwise shows that an unsustainable amount of development funds would need to shift away from previously set development objectives, such as improved schooling or climate change resilience, toward addressing chronic food insecurity. The shortfall in funding itself introduces a major risk for humanitarian finances.

23 It is important to note that the costs of a nutrient adequate diet (US$2.33 global average) and a healthy diet (US$3.75 global average) are higher, and unaffordable even for the populations who live above the US$1.90 poverty line but are vulnerable to becoming poor.

24 The global average is US$0.79, while it is as high as US$0.88 in lower middle-income countries, and as low as US$0.70 in low-income countries.
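The replacement cost arithmetic of this section can be made explicit with a short sketch. It assumes that the replacement share is the fraction of the post-shock food bill needed to restore pre-shock purchasing power, that is, the price increase divided by one plus the price increase. This reproduces the quoted 21.5% and 24.1% to within rounding, but the formula is inferred from those figures rather than stated in the text. The function names and the population of 1 billion are illustrative.

```python
def replacement_share(price_increase):
    """Share of the post-shock food bill needed to restore purchasing power."""
    return price_increase / (1.0 + price_increase)

def annual_need(population, daily_cost=0.75, share=0.25, price_index_ratio=1.0):
    """Annual financing need in US$ for a partial expenditure replacement.

    price_index_ratio rescales the 2020 daily cost for food price inflation.
    """
    return population * daily_cost * price_index_ratio * 365 * share

april_share = replacement_share(0.275)   # April 2022 FPI reading, +27.5%
march_share = replacement_share(0.317)   # March 2022 FPI reading, +31.7%
need_2020_prices = annual_need(1e9)      # illustrative 1 billion people at 2020 prices
```

At 2020 prices, a 25% replacement for an illustrative 1 billion people comes to roughly US$68 billion per year; multiplying by 4 gives the 100% replacement value referred to above.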

D. Counter-factual estimates
This section uses the model to further investigate whether a basic scenario of normalizing growth, poverty and inflation is sufficient to revert food insecurity back to pre-pandemic levels. Several counterfactual scenarios are considered and the reductions in the predicted prevalence rates are compared. First, to understand the average importance of the different covariates in the model, table 3 summarizes nonlinear and linear variable importance in the final Cubist model.
There is a strong difference between linear (t-statistics of univariate linear regressions) and nonlinear (percentage of times a variable is used in Cubist's base learners) importance. The discrepancy suggests that the model reacts differently to changes in inputs depending on the value ranges and thus by region. Regardless, the high importance of unchanging factors (historical child mortality and country size) and relatively stable variables (the land cover shares) suggests that the Cubist model first partitions based on these variables and then uses relatively simple linear models that use different selections of the remaining variables as local predictors. Counterfactual results are therefore likely to vary substantially across regions.

Note: Linear (t-statistics in single-predictor linear model) and nonlinear (percentage of times variable is used in base learners) variable importance. Rows are ordered by nonlinear importance, as this reflects the role that the variables play in the final model. Neither metric truly represents the incremental predictive impact of the variables in a nonlinear context, but they are nonetheless easily interpreted. At each split of the tree, Cubist saves a linear model (after feature selection) that is allowed to have terms for each variable used in the current split or any split above it. Quinlan (1992) discusses a smoothing algorithm where each model prediction is a linear combination of the parent and child model along the tree. As such, the final prediction is a function of all the linear models from the initial node to the terminal node. The first column gives the percentage of times where each variable was used in a condition and/or a linear model. These percentages reflect all the models involved in prediction (as opposed to just the frequency in terminal models). The variable importance used here is thus a linear combination of the usage in the rule conditions and the model. The second column runs a linear regression with the prevalence rates as dependent variable and the row variable as predictor and extracts the linear t-statistic. This also shows that variables that are linearly important may not necessarily be key predictors in the nonlinear model, and vice versa. Source: Results have been estimated by the author for this paper.
The analysis considers five scenarios, each generated by modifying the 2025 input data and comparing the reduced prevalence rates predicted under these counterfactuals to the baseline 2025 prediction. The first four scenarios are constructed by simply 1) normalizing 3-year average GDP growth, 2) normalizing the 3-year average inflation rate, 3) normalizing both, 4) normalizing both GDP growth and inflation while reducing poverty at twice the historical average rate. The exact data modifications are detailed in the note below figure 3. Note that due to the increased income inequality that resulted from the pandemic year, the fourth scenario is most in line with a complete normalization toward pre-pandemic macro trajectories.

Note: The bars indicate the predicted reduction in the total size of the 2025 severely food insecure population across regions under 5 scenarios. The analysis excludes Somalia, South Sudan, Yemen Rep., Ukraine and the Russian Federation due to impacts from situations whose solutions likely lie beyond simple domestic policies. China and India are excluded due to their otherwise enormous population-driven weight in the results. The first bar compares the projected 2025 severely food insecure population to the lower predicted population total when the 3-year moving average GDP growth rates are restored to the highest of 1.5%, the projected WEO rate of 2025, and the pre-2019 5-year average. GDP per capita is adjusted assuming 6-year compound growth at this rate extrapolated from the pre-pandemic 2019 values. The poverty rate is reduced using the change in GDP per capita ppp and the regional average historical elasticity. The second bar performs the comparison with the 3-year inflation rate restored to the lower of the WEO outlook and the pre-2019 5-year median, and not below 1% or above 5%. The third bar combines both inflation and growth normalization. The fourth bar plots the result obtained when poverty improves by twice the historical regional average elasticity. The fifth and final bar plots the result obtained when using the fourth scenario data and optimizing the agricultural GDP share (using a grid of single percentage increments), allowing for a maximum deviation of 15 percentage points from the actual values of 2019. Source: Figure prepared by the author for this paper.
The fifth scenario simulates restructuring the domestic food supply systems through technological means (i.e. carbon neutral changes, as opposed to carbon positive strategies that rely on conversion of forest land to agricultural land). This is done by taking the covariates of the fourth scenario and minimizing the predicted prevalence of severe food insecurity over a grid of agricultural GDP shares. The values considered are single percentage point increments ranging 15 points above or below the last known share. Naturally, zero or negative values are not considered.
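The grid search over agricultural GDP shares can be sketched as below. The `predict` callable is a hypothetical stand-in for the fitted model, the `ag_gdp_share` key is an illustrative name, and the toy response surface exists only to make the example runnable.

```python
def optimize_ag_share(predict, covariates, last_share, max_dev=15):
    """Minimize predicted prevalence over single-point share increments.

    Candidate shares range max_dev points above or below the last known
    share; zero and negative shares are not considered.
    """
    candidates = [last_share + d for d in range(-max_dev, max_dev + 1)
                  if last_share + d > 0]
    best_share = min(
        candidates,
        key=lambda s: predict({**covariates, "ag_gdp_share": s}),
    )
    best_rate = predict({**covariates, "ag_gdp_share": best_share})
    return best_share, best_rate

# Hypothetical response surface with a minimum at a 20% share.
toy_predict = lambda c: (c["ag_gdp_share"] - 20) ** 2 / 10 + 5
share_a, rate_a = optimize_ag_share(toy_predict, {}, last_share=10)
share_b, rate_b = optimize_ag_share(toy_predict, {}, last_share=3)
```

In the second call the unconstrained optimum lies outside the allowed 15-point deviation, so the search returns the boundary value instead.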
The results in figure 3 provide a number of insights into the important drivers of food insecurity in the model, and possible future scenarios for improving food security. First, it is clear that the predicted improvements under basic scenarios of normalizing growth and inflation by 2025 are not sufficient to revert food insecurity back to pre-pandemic levels. Figure A1 shows that, to return to those levels, the severely food insecure population would need to fall by 34% in South Asia (SA), 28% in Europe & Central Asia (ECA), 19% in Middle East & North Africa (MENA), 24% in Sub-Saharan Africa (SSA) and also 24% in Latin America & Caribbean (LAC). East Asia & Pacific (EAP) is already projected to reach below pre-pandemic levels under the April 2022 WEO outlook. With all measures combined, only the predicted values in ECA and LAC improve below pre-pandemic values. In most regions, resolving domestic price inflation issues alone barely has an effect under the model. The notable exception is ECA, where solving domestic inflation alone is already sufficient to revert food insecurity to pre-pandemic levels.
In all other regions, food insecurity situations are only likely to improve considerably if 1) global food prices normalize so that development assistance needs are not impacted as severely as otherwise projected in the previous section, and 2) targeted interventions are made. For instance, the model predicts much stronger improvements, relative to restoring growth, inflation and poverty alone, when the agricultural GDP shares are optimized. The synthetic change in data is straightforward to simulate, but in reality it represents a complex set of country-specific agricultural reforms that in most cases imply some form of technological advance that improves yields. Regardless, the result suggests that agricultural reforms that boost productivity through technological improvement will likely be a necessary ingredient of a scenario in which the world solves food insecurity. Since in most regions even this scenario is not sufficient under the model, the needed policy actions are likely to run much broader.

A. Variables
There are two important pragmatic arguments for confining the modeling exercise to a narrow set of economic indicators, even though more advanced data sets or approaches may be available that could make predictions more accurate.

The model explains around 95% of variation in the data with only 11 indicators. At this point there are diminishing returns associated with additional complexity. Next, being able to project the covariates accurately or have reasonable scenario values for them is critical to the application and limits the data that can be used.
Most importantly, the model cannot rely on future unknowns for which we have only historical data. 26 Others have also put forward broader arguments in favor of parsimony. Baylis et al. (2021) argue that the use of domain knowledge of economic mechanisms is critical to interpretability of prediction results in the agricultural domain. Zhou et al. (2022) focus more concretely on narrowing down on inputs in their call for transparent machine learning models for food security tailored to specific policy needs, by proposing to carve out the path for future efforts and adoption with a simple prototype model. McBride et al. (2022) similarly observe that related but distinct objectives for food insecurity modeling will require choosing to use distinct data sources and urge for careful consideration of the purpose and use cases when making such decisions.
Expanding the set of input features well beyond what can intuitively be grasped immediately diminishes usability from these perspectives. Andrée (2021a) explains that food crises result from complex interactions between conflict, poverty, extreme weather, climate, and food price shocks (see the analyses by Misselhorn (2005); Headey (2011); Singh (2012); D'Souza and Jolliffe (2013) for more focus on individual elements) that compound in the presence of long-standing structural factors, as pointed out also by Maxwell and Fitzpatrick (2012). The vast complexity of food insecurity provides ample directions for more data and rich opportunity to develop a complex model. The intrinsic assumption behind the relatively simple model developed here is that while the exact causal mechanisms that describe how food crisis shocks play out may be vastly complex, much of the impact is endogenous to the economic state of a country. The corollary is that a good estimate of vulnerability may thus be produced from a few broad-based indicators that capture this internalized fragility. 27 The idea is supported by the good accuracy obtained in the empirical application; the 2019 data was forecast with an R² of 0.97 using historical data only. Even though prediction errors may possibly be even smaller when making the vast number of possible subtleties explicit in the model through high-dimensional data, it will be more difficult to understand the types of errors and biases that are likely to emerge from the more complex methods, and the model would become harder to support sustainably.
26 If food insecurity could accurately be predicted from, say, an indicator of the quality of governance, but a breakdown in that indicator is itself as hard to forecast as an outbreak of food insecurity, then little progress has been made in terms of developing a future outlook. Instead, the only thing achieved is to state one unknown in terms of another unknown. Keeping such an important indicator fixed in the outlook in turn bakes in the assumption that food insecurity will remain stable. Essentially, this restricts the exercise from using the wide variety of indicators whose unforeseen drastic deterioration goes hand in hand with a drastic deterioration in food insecurity, including protests, political violence, violent conflict, weather impacts, and other indicators that have been used to model acute food insecurity situations in previously cited work and related targets such as conflict outbreaks (Celiku and Kraay, 2017).
27 This is essentially a more elaborate way of summarizing Amartya Sen's famous statement that no famine ever occurred in a democracy; see a discussion by Rubin (2012), or de Waal's statement that all modern famines are man-made (De Waal, 2018), both of which are ways to say that critical food insecurity outcomes are produced fully endogenously from rising vulnerabilities in a country. It follows that if systems in a country are so broken or corrupt as to produce such devastating outcomes, then surely the impacts should be visible across a wide spectrum of socio-economic outcomes and a simple reading of basic economic indicators should provide ample predictive signal.

B. Possible missing data dimensions
While the model produces statistically good results, it is still useful to consider what data dimensions may possibly be missing as this helps understand what factors may typically drive prediction errors.
Ideally, food price inflation would be made explicit in the model. This is mostly complicated by data availability. Historically, the food component of CPI is available for many countries, and it would make sense to add it to the model as a description of food price developments compared to overall price developments would provide a vastly superior interpretation of food insecurity pressures stemming from price development compared to what can be inferred from CPI inflation alone. However, in highly food insecure countries and in fragile and conflict-affected countries particularly, the availability of detailed sub-indexes has remained rather limited. This is pointed out by (Andrée, 2021a,b) whose efforts have recently sought to produce food price inflation data from alternative sources to fill important data gaps. Such approaches are however not yet fully mainstreamed and would introduce a complex dependency into the model's data pipeline. More importantly, since the application is interested in projecting future food insecurity, there would also be the need to have an outlook for food price developments. The WEO, which is central to the empirical application here, only provides forward guidance on CPI inflation rates and not food price inflation rates, which complicates the issue further. 28 There is also some scope to argue that CPI inflation rates alone should provide a reasonable predictive signal. The earlier references already highlighted that a deterioration in inflation captures broad macro-economic deterioration that has preceded major historical food crises. Second, some degree of substitution is possible within household expenditures so that households can offset increased food prices by lowering expenditures in another part in their consumption basket. Only when price increases are also broad-based, and there is less flexibility to cope with food price rises by adjusting other expenditures, will households be forced to reduce food intake. 
Third, from a statistical perspective, CPI inflation and food price inflation tend to be strongly correlated, so much of the statistical information about the position of one of the variables is contained in the other from a model's perspective.
There is also a benefit to relying on CPI inflation alone. Macro-economic frameworks and monetary policies are centered around general consumer prices. For instance, central banks often focus on long-term inflation metrics less impacted by short-term price volatility, and consider a "core CPI" for this that excludes items such as food, shelter, energy, and used cars and trucks. In other cases, an index of Personal Consumption Expenditures (PCE) is preferred which similarly excludes volatile commodities. This suggests also that monetary policies aimed at stabilizing core CPI or PCE can in essence reach their targets without normalizing food prices, and so it is an important question whether restoring the CPI inflation rate is sufficient to improve global food insecurity conditions. The simple specification of the model made investigating that through counterfactual predictions much simpler. The counterfactual analysis revealed that stabilizing domestic inflation in the ECA region would be sufficient to revert hungry populations back to pre-pandemic levels there, but not in other regions. This is a potentially important finding that would be worth investigating further, particularly given the result that international prices need to come down to keep development assistance needs at manageable levels in these regions.
A second possible shortcoming is that the percentage of forest land cover provides only a meager description of climate factors. The first challenge to improving this is that most climate data sets describe phenomena that are inherently spatial and not at all straightforward to summarize at a national level, particularly for larger countries. For instance, rainfall and NDVI (Normalized Difference Vegetation Index) data easily capture area-specific droughts and floods, and have been shown to be highly predictive of local severe food insecurity conditions in some of the work that has been cited earlier. However, the average rainfall level measured over the entire area of countries like Brazil, the Russian Federation or China would hide any and all local climatic shocks. Recent work has focused on mapping subnational agricultural GDP (Blankespoor et al., 2022) and such approaches could in theory be combined with spatially explicit data sets on hazards and climate shocks to estimate the fraction of agricultural GDP that is produced in areas vulnerable to climate. Alternatively, Koomen et al. (2022) develop spatially explicit simulation methods that may project populations vulnerable to hazards and climatic shocks under alternative Shared Socioeconomic Pathways (SSP), the scenario definitions commonly used in climate scenario analysis. Efforts to summarize such vulnerability trends to a national level likely deserve a full paper on their own, particularly if the estimates were to be projected several years into the future. Finally, Andrée et al. (2019a) provide projections through 2030 for deforestation, human exposure to air pollution, and carbon footprints for countries that account for a combined 85% of the world's population.
This could capture some climate and environmental risk components, but their machine learning model relies on economic variables that have overlap with the covariates used here, showing that much of the information on these environmental aspects may already be contained in poverty, income and variables describing the structural composition of economies. Future efforts may however still be taken to enrich the list of covariates with climate variables for which climate models can produce annual projections.
A final shortcoming is that the list of variables related to pre-existing vulnerabilities seems rather short, with only historical child mortality providing a proxy that allows distinguishing between the quality of governance and social systems. 29 The choice of the variable was primarily motivated by data availability, but it is reasonable to believe that a country that has ranked worse on child mortality in the recent past also ranked worse on a wide variety of other indicators that are predictive of weak socio-economic systems. Vastly superior indicators of the quality of governance in particular could be obtained from the estimates of Kaufmann et al. (2011), who maintain the annual Worldwide Governance Indicators. The technical challenge would again be that these data may be predictive, but no outlooks can be produced without introducing new assumptions about the future. Forecasting government breakdown, after all, is at least as difficult as forecasting food crises. 30

V. Conclusion
This paper developed a statistical model to estimate and project global food insecurity conditions up to 6 years ahead using macro-economic outlooks. The long-term nature of the outlook aligns with the slow replenishment cycle of development finance, but the results can be revised on a semi-annual basis following the periodical assessments of broader economic outlooks. Extreme food insecurity situations are multifaceted in nature and frequently develop as part of multiple interlocking challenges. The paper's approach anchors the food insecurity outlook to economic assumptions in an attempt to contribute to broader foresight that is internally consistent so that multiple policy actions can be informed jointly by a cohesive set of expectations. The paper concludes that the proposed framework provides a robust and low-cost approach to maintain reliable projections and scenario analyses that can be interpreted within the context of available economic outlooks, contributing to the capacity to monitor global food insecurity trends.
The model is based on historical associations between the prevalence of severe food insecurity and covariates that capture economic and structural drivers of food insecurity. All the data can be obtained from official public sources, and outlooks are available for the key predictors. The model was validated using holdout data that explicitly tested the model's ability to forecast new data from history and extrapolate beyond observed intervals. Overall, the model explained holdout data with an R² above 0.95 using 11 basic indicators. The model was applied to the IMF's WEO database of April 2022 to project the severely food insecure population across the World Bank's full cohort of 144 IDA-eligible and IBRD countries that together cover approximately 98% of global historical food insecure populations.
29 One could argue that in order to be more accurate, the model should also include data on aid or supportive measures. However, since the modeling effort is itself motivated by a desire to trigger action, the resulting forecasts should not bake in the assumed effect of protective measures not yet taken. In a related early-warning early-action effort, Andrée et al. (2020) provide more discussion on isolating the impacts of humanitarian action from forecast risk signals.
30 A secondary challenge is that governments notoriously push back on governance indicators and no official data are widely available across countries. Since the objective of this paper is to produce outlooks that could help with resource allocation, the overall preference would be to reach sufficient accuracy using only official or measured data that cannot be disputed.
Based on the economic forecasts of the WEO of April 2022, the number of people that are severely food insecure for a sustained period of 3 years is estimated to reach over 1 billion people in 2022 globally (an IDA + IBRD wide prevalence rate of 15%). This constitutes an increase of over 172 million people (a relative increase of 21%, or a 2 point increase in the prevalence rate) over 2017-2019 prepandemic estimates from the same model. This projection concerns unmitigated impacts extrapolated from 2019 data that do not reflect the large amounts of aid and supportive policies that characterized the pandemic years, but have also not yet factored in the effects of sanctions and export restrictions imposed after March 31, 2022. Bearing these uncertainties in mind, the overall direction of the projection remains clear and paints a protracted picture in which the previously anticipated rebound in headline growth occurs unequally, leaving many behind in food insecure conditions. The projected improvement of about 0.5 point in the prevalence rate after the 2022 peak is not sufficient to offset continued population growth. Without large-scale targeted interventions, the size of the population in need of assistance is projected to remain above 1 billion through 2027.
While the projected population in need grows sharply, rising commodity prices have also increased the per capita costs of a humanitarian response. Using basic assumptions about response costs, adjusted annually for varying global food prices, the analysis estimated temporal trends in global development assistance needs. The analysis highlighted that while the prevalence of food insecurity had already been on the rise during 2015-2019, this development was largely countered by favorable developments in global food prices so that development assistance financing needs rose only modestly. The compounding effects of rising food insecurity and rising food prices since 2020, however, have a drastic effect on projected future needs. In a downside scenario, the annual cash value of feeding the entire severely food insecure population increases by US$387 billion from 2018 to 2027. This top up in single year needs alone is 172% of the total cash value of feeding the entire 2018 severely food insecure population a full year. The basic arithmetic suggests that global food prices must come down in order for the global food security situation to become more manageable.
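The arithmetic behind these cost figures can be sketched as follows. This is a minimal illustration, not the paper's implementation: the US$0.75 daily diet cost in 2020, the 25% replacement share and the 33% price spike are taken from the note to the humanitarian needs figure, while the headcounts and price index values in `scenario` and all function names are hypothetical placeholders chosen only to make the example run.

```python
# Illustrative sketch only. Headcounts and index values below are hypothetical
# placeholders, not the paper's data.

DAILY_DIET_COST_2020 = 0.75  # US$ per person per day in 2020 (figure note)
REPLACEMENT_SHARE = 0.25     # share of the diet cost replaced by assistance (figure note)


def replacement_share_for_spike(price_increase):
    """Share of the new food bill needed to restore pre-spike purchasing power."""
    return 1.0 - 1.0 / (1.0 + price_increase)


def full_feeding_value(headcount, price_index, base_index):
    """Annual cash value (US$) of a calorie-sufficient diet for `headcount` people."""
    daily_cost = DAILY_DIET_COST_2020 * price_index / base_index
    return headcount * daily_cost * 365


def replacement_value(headcount, price_index, base_index):
    """Annual cash value (US$) of replacing REPLACEMENT_SHARE of that diet."""
    return REPLACEMENT_SHARE * full_feeding_value(headcount, price_index, base_index)


def implied_base_value(increase, increase_as_share_of_base):
    """Back out a base-year value from an increase and its ratio to the base."""
    return increase / increase_as_share_of_base


# Hypothetical inputs: year -> (severely food insecure headcount, food price index).
scenario = {2020: (900e6, 100.0), 2022: (1000e6, 140.0)}
base_index = scenario[2020][1]
needs = {
    year: replacement_value(headcount, index, base_index)
    for year, (headcount, index) in scenario.items()
}
```

Under these assumptions, `replacement_share_for_spike(0.33)` is roughly 0.25, which matches the statement that a 25% replacement offsets a 33% price increase. Likewise, `implied_base_value(387e9, 1.72)` suggests that the quoted US$387 billion increase and 172% ratio correspond to a 2018 full-year feeding value of about US$225 billion.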
The analysis finally investigated whether normalizing growth, poverty, and inflation is sufficient to revert food insecurity back to pre-pandemic levels, and found that it is not. In most regions of the world, targeted interventions may be needed to mitigate the impacts of the current crisis even if previous economic trajectories are restored. Numerical optimization was used to simulate a restructuring of domestic food supply systems. The counterfactual prediction results showed that improved domestic food supply strategies are likely a necessary ingredient for reaching long-term food security in the presence of continued growth in global populations and deterioration in climatic conditions.
Recognizing the magnitude of the projected development assistance needs, the international community will need to tightly balance between monitoring the outbreak of local food crises and rapidly responding after the fact to prevent mortality in vulnerable populations, while also investing in prevention. The protracted multi-year nature of the forecast calls for a sustained approach to investing in bettering the state of food security throughout the world. Particularly in those areas where food insecurity is on the rise but still primarily chronic, markets are often still functional and so a wide range of economic interventions can still be considered as part of prevention and resilience strategies. This can help allow scarce humanitarian resources to be preserved for aid in the most critical situations.
While the costs of taking action seem towering, the costs of inaction could be larger. Compounding the long-lasting effects of unmitigated acute hunger over multiple countries and many decades may well dwarf the upfront investments now needed to prevent disastrous outcomes. Moreover, as food insecurity exacerbates existing frustrations and disrupts the social and political order of a country, governments have an opportunity to adopt food security policies to positively shape the nature of societal reactions to rising frustrations.
[...] from Koopman and Durbin (2000). Whenever the full time series of a variable is missing for a given country, a K-Nearest Neighbor (KNN) algorithm is used. KNN methods are a popular way to impute missing data, and have standard implementations in common Machine Learning packages such as the widely used Classification and Regression Training (CARET) library in R. A modified version of a KNN algorithm is implemented here that imposes additional structure on the imputed country time series. The exact approach is as follows. A Gaussian kernel is used to calculate similarities between countries, which can be expressed as an N × N contiguity matrix following Andrée et al. (2019b); Andrée (2020), based on a set of other covariates over which similarity is calculated. 31 Missing country series are imputed by taking the average series of the 4 countries that are most similar in terms of Child Mortality Rates, Life Expectancy, GDP ppp per capita, GDP, the share of Agriculture in GDP, and country coordinates. Whenever more than 5 observations are missing for a country, but some data is available, this 4-nearest neighbor average is indexed and used to extrapolate the observed data.
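The nearest-neighbor averaging step can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's code: the kernel is a standard Gaussian kernel with a bandwidth `eta` (an assumed parameterization), the covariates are assumed to be standardized beforehand, and the country labels, feature vectors and series are hypothetical. The indexing step used to extrapolate partially observed series is not shown.

```python
import math


def gaussian_similarity(z_i, z_j, eta=1.0):
    """Gaussian kernel similarity in (0, 1]; equals 1 when the vectors coincide."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z_i, z_j))
    return math.exp(-sq_dist / eta)


def impute_series(target, features, series, h=4, eta=1.0):
    """Average the series of the h countries most similar to `target`.

    features: country -> covariate vector (assumed standardized).
    series:   country -> list of yearly values, or None if fully missing.
    """
    donors = [c for c in series if c != target and series[c] is not None]
    ranked = sorted(
        donors,
        key=lambda c: gaussian_similarity(features[target], features[c], eta),
        reverse=True,
    )
    nearest = ranked[:h]
    length = len(series[nearest[0]])
    return [sum(series[c][t] for c in nearest) / len(nearest) for t in range(length)]


# Hypothetical example: country "A" has no data; "F" is far away in covariate space.
features = {
    "A": [0.0, 0.0], "B": [0.1, 0.0], "C": [0.0, 0.1],
    "D": [-0.1, 0.0], "E": [0.0, -0.1], "F": [5.0, 5.0],
}
series = {
    "A": None, "B": [1.0, 2.0], "C": [3.0, 4.0],
    "D": [5.0, 6.0], "E": [7.0, 8.0], "F": [100.0, 100.0],
}
imputed_a = impute_series("A", features, series)
```

In this toy example the distant country "F" is excluded from the four donors, so the imputed series for "A" is the plain average of the series of "B" through "E".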
Overall, only a negligible share of the data needs to be imputed. The only variable for which a noticeable share of data needs to be imputed is the poverty rate. Here, a simple moving average interpolation is first taken between any two years for which observations are available. The Kalman Filter is then used to extrapolate whenever a maximum of 5 data points remain missing. After applying these standard methods, 11% of the data remains missing and the KNN algorithm is used to complete the remainder.
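The order in which the imputation methods are applied to the poverty rate can be sketched as a simple decision rule. This is only an illustration of the cascade described above: plain linear interpolation stands in for the moving average interpolation, the Kalman Filter and KNN steps are represented by labels and not implemented, and the function names and the example series are hypothetical.

```python
def interpolate_interior(values):
    """Linearly fill gaps (None) that lie between two observed points.

    Linear interpolation is an assumed stand-in for the interpolation
    used in the paper; leading and trailing gaps are left untouched.
    """
    filled = list(values)
    observed = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(observed, observed[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def next_step(values, max_gap=5):
    """Label the method that would handle what remains after interpolation."""
    filled = interpolate_interior(values)
    remaining = sum(v is None for v in filled)
    if remaining == 0:
        return "complete"
    if any(v is not None for v in filled) and remaining <= max_gap:
        return "kalman_extrapolation"
    return "knn_pattern_matching"


# Hypothetical series with one interior gap and two edge gaps.
example = [None, 10.0, None, None, 16.0, None]
filled_example = interpolate_interior(example)
```

Here the interior gap is filled by interpolation, the two remaining edge values would be left to the Kalman Filter, and a series with no observations at all would fall through to the KNN step.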
The KNN method is essentially a pattern-matching algorithm that fills gaps with synthetic patterns that follow internally consistent time series dynamics. Similar approaches that rely on time series similarity to replicate and impute large time series gaps with complex and plausible patterns have been popularized by a class of (Dynamic) Time Warping algorithms that have been shown to work well for complex time series data sets such as voice data, seasonal data, or even price data. A large literature of books and papers covers these methods.
One critique against pattern-matching is that it is model-free and thus not optimized explicitly under a notion of statistical accuracy. To assess the possible impact of fixing this choice of imputation method, a sensitivity analysis was performed. Andrée (2021a) proposes a matrix-completion algorithm that leverages multiple machine learning models in a regression chain following the framework of van Buuren and Groothuis-Oudshoorn (2011); Azur et al. (2011) to iteratively build fully optimized machine learning models that can fill gaps in any one covariate based on the values in others. This is the same algorithm used to generate synthetic data for the prevalence of severe food insecurity. Due to the multivariate nature of the algorithm, it can handle missing values in multiple variables and so it can also produce multiple likely values for the poverty rate. The algorithm requires the missing data to be initialized at plausible values, and the regression chain will subsequently update the missing data estimates using optimized prediction models. When the imputations reach optimality in terms of the conditional densities, the imputation updates become random and a function of model error, which can easily be canceled out by imputing multiple times and averaging the results of the final prediction model that is constructed from the data. The framework was used to search for possible improvements in the KNN-imputed poverty rates, taking the pattern-matching imputations as the starting point. Table A1 here extends the cross-validation results previously presented with the ones obtained using this multiple imputation approach. The prediction accuracy of the linear model using this more complex imputation approach was virtually identical, the vanilla Cubist model saw a minor improvement or deterioration depending on the holdout specification, while the Cubist model with synthetic data improved only slightly on the holdout objectives when the poverty imputations were optimized using the multiple imputation approach. The results suggest that the initial KNN imputations are already competitive. It is still plausible that the imputation strategy improves results for countries that have substantial data gaps, something the validation cannot show, but regardless this would come at the cost of a sharp increase in the overall complexity and computational burden of the learning task.
31 Using the Gaussian kernel, define an element $k_{ij} := k(Z_i, Z_j; n_z) = \exp\left(-\lVert Z_i - Z_j \rVert^2 / n_z\right)$. This is a measure of similarity between $Z_i$ and $Z_j$ for any $n_z > 0$ (Andrée et al., 2019b; Andrée, 2020). This can be used to construct a matrix $K = [k_{ij}]_{i,j=1}^{N}$, where the diagonal elements will be 1 (each element $i$ is identical to itself). For all off-diagonal entries, elements closer to 1 indicate a higher degree of similarity between $Z_i$ and $Z_j$. When $Z_i$ and $Z_j$ are vectors that describe a time series sequence ($Z := \{z_t\}_{t=1}^{T}$), an estimate for missing data is $\hat{X}_i = H^{-1} \sum_{h=1}^{H} X_h$, obtained by selecting the $H$ off-diagonal entries of $K$ ($i \neq j$) for which $h$ maximizes similarity.
Note to Table A1: In all experiments, the model is optimized for R² and the associated MAE and RMSE are given. The Cubist synth column is reproduced from the main text and added here for ease. Source: Results have been estimated by the author for this paper.
Since missing data is only noticeably prevalent in the poverty data, the need for any form of imputation could be drastically reduced by leveraging poverty estimates from elsewhere. Recent work has sought to predict poverty, and the work by Mahler et al. (2021) could provide a promising direction, but the analysis here suggests that the current imputations may in fact be reasonably similar when the accuracy and types of predictors used by the different methods are compared. 32 Alternatively, due to the similarities in the poverty prediction models of Mahler et al. (2021) and the food insecurity model here, it may also be interesting to explore whether the simulation-based learning approach here improves the poverty predictions.
32 Mahler et al. (2021) nowcast poverty rates using other variables in the WDI, WEO, and some remote sensing indicators. Across numerous machine learning methods using hundreds of indicators, their best overall method has a Mean Absolute Error (MAE) of 3.65, trained and cross-validated using over 2,000 US$1.90 rates from 168 countries included in PovcalNet. They highlight that a few GDP-related predictors taken from the WDI carry most of the predictive power, and suggest that a simple GDP derivation is reasonably competitive. This is conceptually not far from the imputation performed here. Using the narrow WDI data set, the prediction-optimized chained machine learning equations approach is used to impute poverty at US$1.90 and national rates as well as undernourishment rates simultaneously, initialized with the pattern-matching algorithm. The predictions for the US$1.90 rates are cross-validated at an MAE of 1.30 using the 2,886 moving average interpolated observations taken from the full set of 152 countries in the training data. These figures are not directly comparable, since error metrics cannot be compared directly between different data sets. However, the data sets are at least similar, so the result suggests that the current implementation has performance that falls within a similar range. Expressed as out-of-sample explained percentage of absolute error (a robust Pseudo R² calculated from prediction error instead of correlation), the chained equations imputation accuracy sits above 0.93. However, using this as an alternative to the pattern-matched imputations did not significantly impact the main results of the paper. Future work on poverty projections may however explore the two-step technique of generating synthetic cases from an accurate historical model and using these to improve the more parsimonious model for forecasting.
Note: The total food insecure population is based on the predicted prevalence rates, combined with the country population totals taken from the WDI database, and the population growth rates projected by the WEO. Each region combines results from the IDA-eligible and IBRD countries and thus excludes (most of the higher income) countries that are not classified as lending countries by the World Bank. The black solid line corresponds to the period with observed data (2015-2019), the historical part of the black dashed line shows predictions generated based on historical covariate values from the World Bank WDI database, and the future values are generated using the IMF's WEO outlook of April 2022. The red dashed line is a downside projection that considers the slowed growth rates and higher inflation rates identified in the WEO's downside analysis.
Note: IDA-eligible combines the World Bank's IDA and Blend countries that consist mainly of low and lower middle-income countries. IBRD consists mainly of upper middle-income countries.
The black solid line corresponds to the period with observed data (2015-2019), the historical part of the black dashed line shows predictions generated based on historical covariate values from the World Bank WDI database, and the future values are generated using the IMF's WEO outlook of April 2022. The red dashed line is a downside projection that considers the slowed growth rates and higher inflation rates identified in the WEO's downside analysis. The prevalence rates are predictions from the model combined with observed data in the 2015-2019 period. The total food insecure population (middle row) corresponds to the plotted prevalence rates and the country population totals taken from the WDI database, projected using the population growth rates from the WEO. The bottom graphs plot associated humanitarian needs under basic assumptions: they combine the headcounts of the middle plots, assume a US$0.75 daily cost of a calorie-sufficient diet in 2020 adjusted annually for food price inflation using 2005-2022 annual values of the FAO Food Price Index, and plot the cash value of a 25% replacement cost that is sufficient to offset a 33% increase in prices, roughly the food price spike observed globally in the first half of 2022. Source: Figure prepared by the author for this paper.