Policy Research Working Paper                     10532




                        Minding the Gap
  Aid Effectiveness, Project Ratings and Contextualization

                               Diana Goldemberg
                                  Luke Jordan
                                Thomas Kenyon




Independent Evaluation Group
July 2023
Policy Research Working Paper 10532


  Abstract
 This paper applies novel techniques to long-standing ques-                         model on World Bank projects, the paper shows instead
 tions of aid effectiveness. It first replicates findings that                      that the strongest predictor of these projects’ contribu-
 donor finance is discernibly but weakly associated with                            tion to outcomes is their degree of adaptation to country
 sector outcomes in recipient countries. It then shows                              context, and the largest differences between ratings and
 robustly that donors’ own ratings of project success provide                       actual impact occur in large projects in institutionally weak
 limited information on the contribution of those projects                          settings.
 to development outcomes. By training a machine learning




 This paper is a product of the Independent Evaluation Group. It is part of a larger effort by the World Bank to provide
 open access to its research and make a contribution to development policy discussions around the world. Policy Research
 Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at
 dgoldemberg1@worldbank.org, lukej@mit.edu, and tkenyon@worldbank.org.




         The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
         issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
         names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
         of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
         its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.


                                                       Produced by the Research Support Team
  Minding the Gap: Aid Effectiveness, Project Ratings
                and Contextualization

               Diana Goldemberg1 , Luke Jordan2,, Thomas Kenyon1




Keywords: aid effectiveness, machine learning, World Bank projects
JEL Codes: O12, O15, O19



   ⋆ The findings, interpretations, and conclusions expressed in this paper are entirely those of

the authors. They do not necessarily represent the views of the World Bank and its affiliated
organizations, or those of the Executive Directors of the World Bank or the governments they
represent.
  ⋆⋆ The authors would like to acknowledge the valuable contributions of Lily Chu, St´    ephane
Guimbert, Jed Friedman, Michael Woolcock, Lily Tsai, Daniel Honig, Justin Shenk, Jos
Vaessen, Christopher Nelson, Stephen Francis Pirozzi, as well as seminar participantsat the
MIT GOV/Lab Seminar and the RMES Results Peer Learning Series. All errors are our own.
     Email addresses: dgoldemberg1@worldbank.org (Diana Goldemberg), lukej@mit.edu
(Luke Jordan), tkenyon@worldbank.org (Thomas Kenyon)
   1 World Bank Group
   2 MIT
1. Introduction

    Numerous empirical studies have investigated whether foreign aid effectively
improves development outcomes in recipient countries. This literature has relied
mainly on two levels of analysis. One focuses on the aggregate country-level
impacts of aid, typically on economic growth or sector outcomes. Another takes
a micro-level approach, with development projects as the unit of analysis, most
often using donors’ own ratings of project outcomes as a measure of effectiveness.
    This study bridges those two strands of research by focusing on the asso-
ciation between donor-financed projects and observable development impact,
treating project ratings as intermediating variables. This enables us to ask
whether project ratings convey information about those outcomes. We use rat-
ings from projects undertaken in 183 developing countries by eight donors since
the 1990s, concentrating on a few service delivery sectors with readily available
data on beneficiary-level outcomes. We succeed in replicating previous findings
of small positive effects of aid on sector outcomes. However, our results suggest
that the project ratings convey little information about impact.
    The second and more important contribution of this study is to describe
and analyze the correlates of projects’ contributions to improvements in sector
outcomes. Focusing on projects undertaken by the World Bank, for which more
granular information and extensive text documentation are available, we use
state of the art methods to assess what aspects of a project’s production process
are associated with stronger outcomes. We first create what are called “text
embeddings” of project documents using the latest generation of transformer
models,3 turning texts into numerical representations of their similarity and
differences. Then, we train machine learning models to predict projects’ sector
outcomes, and probe what features of the projects the model paid most attention
to. We find that projects with what appear to be high degrees of tailoring
to country context and concentration of funds in fewer sectors are associated
with stronger outcomes. In doing so, we use newly available data on project
characteristics and draw on methodological advances at the intersection of causal
inference and econometrics with machine learning. To our knowledge, this is
the first attempt to quantify the importance of project contextualization to
development effectiveness.
    Our findings have actionable implications for the system through which the
World Bank and other development institutions evaluate project performance,
by offering a cautionary tale against the over-reliance on project ratings as
impact metrics. It also has implications for the design and staffing of these
projects.

   3 The same class of models that power all state-of-the-art translation, search engines, AI
text generators, as well as most plagiarism detectors.




                                             1
2. Literature and Theory

2.1. Development Effectiveness
    A burgeoning literature on aid effectiveness has focused on development
projects as the unit of analysis, examining the association between project char-
acteristics and country-level variables on the one hand and project success on
the other (for a summary, see Ashton et al. 2023). The most commonly used
measure of project success is donors’ ratings of project outcomes, which this
literature considers a noisy but valid measure of project performance (Denizer,
Kaufmann, and Kraay 2013). Explanatory factors for project success cluster
around: (i) country characteristics, such as institutional quality, political and
economic stability, and regime type; (ii) project characteristics, such as dura-
tion, size, sector, and lending instrument; (iii) aspects of project design, such
as the clarity of results frameworks and the number of components; and (iv)
aspects of project supervision, such as the intensity, timing, and continuity of
oversight.
    The utility of this literature is limited in two respects: first, in that no
study of which we are aware succeeds in explaining more than 30 percent of
variance of project outcomes ratings; second, in that there are grounds for
questioning the meaning of these ratings in the first place. Individual donor-
financed projects often anticipate and rate only local impacts, seldom claiming
a linkage to economic growth or country-level outcomes. Yet, the local and
national effects of aid projects are linked by definition; the total impact of foreign
aid upon sector outcomes must be associated with the cumulative effect of the
individual projects. Nevertheless, it is an almost entirely neglected empirical
question as to whether these ratings are indeed correlated with the contribution
of external financing to development. We address this gap by investigating
whether project ratings convey information on observable development impact.
    A related literature strand focuses on the aggregate country-level impacts
of aid. A few studies have attempted to estimate the relationship between
donor financing and sector outcomes in education, energy, health, sanitation,
and water. They employ similar strategies: panel data estimation techniques
controlling for country-specific effects and potential endogeneity of regressors,
with sector outcomes as the dependent variable and aid flows as the central ex-
planatory variable. Mishra and Newhouse (2009) measure the reduction in the
infant mortality rate associated with increases in health aid per capita. Birch-
ler and Michaelowa (2016) examine the effect of education aid per capita on
net primary school enrolment rates. And Ndikumana and Pickbourn (2017)
investigate whether foreign aid to the water and sanitation sector has helped to
expand access to water and sanitation services in Sub-Saharan Africa. These
studies do not measure or estimate the relevance of the various characteristics
of development financing identified in the broader literature beyond its volume;
nor do they reflect the contribution of non-lending assistance, such as analytical
work, to development outcomes. They do nonetheless provide a point of depar-
ture for our analysis, and we replicate their results of small but positive effects
of aid on sector outcomes.


                                          2
2.2. The Role of Projects
    Projects are essential vehicles of development assistance, functioning as an
intervening and determinative structure between individual interventions and
sector outcomes.4 Their role encompasses not just the financing but also the
adaptation of interventions to local context, their implementation, evaluation,
and replication through the policy cycle. In doing so, they remediate the ‘im-
plementation gap’ between what is planned, or conceived on the basis of what
might have worked elsewhere, and what is achieved. Variations in project char-
acteristics can also undermine the generation of inferences from randomized
control trials and limit the extent to which they may be extrapolated across
sites.5
    It is not unreasonable to expect that projects should cumulatively be asso-
ciated with sector outcomes. At least since the early 2000s, aid agencies have
increasingly combined project financing with technical assistance, strengthen-
ing government systems and claiming to improve the quality of a program of
government expenditures beyond their own financing. Their effect should be
detectable not just on the management of interventions financed by the project
but, through institutional spillovers, on other areas of government activity.
    Qualitative analyses of the effectiveness of development projects have em-
phasized goodness of fit with country circumstances as a critical determinant of
success. This is partly because, to be successful, any policy has to be not only
technically correct but also politically supportable and administratively feasible
(Moore 1995); partly because technical correctness itself requires judgment as to
the similarity between local context and the factors that determined the outcome
of an intervention elsewhere. To the extent that this dance of contextualization
occurs in World Bank projects, it is mostly during project preparation. But
it has largely been altogether ignored in the quantitative literature on project
outcomes, for want of measurability.
    The scope of prior analyses has instead been constrained by the ready avail-
ability of publicly-disclosed data on aspects of project design and supervision.
These are for the most part either only weakly linked in theory to project ef-
fectiveness or only rough empirical proxies for theoretically-relevant variables.
Thus, for example, while a few studies have attempted to evaluate the contri-
bution of economic analysis or clear results frameworks to project effectiveness,
most have restricted themselves to easily observable characteristics like size, du-
ration, sector and sources of financing, often with inconsistent findings. To the
extent that they have examined the role of donor agency staff, this has been
limited to the project manager, with little attention to other participants in
the process. Similarly researchers have depended on country-level measures of
institutional quality, even though familiarity with and capacity to implement

   4 For a fuller description of the role of projects and their place within a broader conceptual

framework, see Section 1 of Ashton et al. (2023).
   5 For more on implementation gaps see Williams (2019) and for a discussion on similar

issues in evidence-based medicine, see Ford and Norrie (2016).



                                               3
donor-financed projects varies significantly within countries.
    We would expect projects to be more effective when they (i) incorporate
prior analysis of the conditions under which an intervention functioned else-
where and awareness of any material differences between it and the context to
which it is to be transplanted; (ii) identify any necessary adaptations and resist
external pressure towards over-rapid or unthinking replication; (iii) provide the
financial and human resources needed to implement the project (Honig 2018).
The likelihood of their doing so depends on a process involving not just the
project manager, but the leadership and other team members on the donor
side, and a project implementation unit generally staffed by civil servants on
the government side. All investment projects also depend to a greater or lesser
degree on the effectiveness of government procurement and financial manage-
ment systems. These inputs are often poorly captured by standard indicators
of bureaucratic quality (Blum 2014).

2.3. The Project Evaluation Process
    The Development Assistance Committee of the Organisation for Economic
Co-operation and Development (OECD-DAC) has long spearheaded an agenda
on evaluation practice, encouraging analysis of aid effectiveness and results (in-
stead of only inputs and activities), publishing its first set of principles for
evaluation of development assistance in 1991. Nowadays, most bilateral and
multilateral donors have an established process for evaluating their development
effectiveness, aligned with OECD-DAC’s normative framework that consists of
six evaluation criteria – relevance, coherence, effectiveness, efficiency, impact
and sustainability.6
    At the World Bank, project evaluations are overseen by the Independent
Evaluation Group (IEG). IEG rates several aspects of project performance, but
the focal metric - reported most saliently to its Board and most commonly
used by researchers - is the ‘outcome’ rating, which assesses whether the project
achieved its stated objectives. The ratings are the culmination of a two-stage
process: first the project management’s own self-evaluation – the Implementa-
tion Completion and Results Report (ICR) – and subsequently the ICR Review
(ICRR), in about 20 percent of cases followed two years later by a more detailed
report, the Project Performance Assessment Report (PPAR), both conducted
by IEG. Together these lead to a six-point outcome rating, ranging from highly
unsatisfactory to highly satisfactory.
    The other seven donors in our database - the Asian Development Bank
(ADB), the Global Fund to Fight AIDS, Tuberculosis and Malaria (GFATM),
the German Society for International Cooperation (GiZ), the International Fund
for Agricultural Development (IFAD), the Japan International Cooperation


   6 Together they describe the desired attributes of interventions: all interventions should be

relevant to the country context, coherent with other interventions, achieve their objectives,
deliver results in an efficient way, and have positive impacts that last.




                                               4
Agency (JICA), the German Development Bank (KfW) and the United King-
dom’s Department for International Development (DFID) - similarly summa-
rize their self-evaluations in a single ‘outcome’ rating. Precisely due to their
widespread availability and ease of harmonization, the use of such project out-
come ratings is prevalent in the aid effectiveness literature.
    However, there are several grounds for doubting whether these outcome rat-
ings capture either donor contribution or the likelihood of sustained improve-
ments in development outcomes. First, they are an aggregation of several sub-
ratings and therefore mask variance in the contribution of individual compo-
nents or interventions. Second, by assessing primarily whether a project has
achieved its stated objectives, they may encourage project designers to limit
their ambition to what can be easily, and sometimes already has been, achieved.7
Third, they can reorient time horizons to short-term outputs, which are easier
to measure within the project life-cycle, over longer-term efforts to resolve core
problems (Andrews 2021).


3. Methods

3.1. Research Questions
    We seek to advance the existing literature by considering four questions.
First, can any general statements be made about the impact of development aid
on sector outcomes? Second, does consideration of aggregate project outcome
ratings within each sector mediate the terms of that relationship? In other
words, do outcome ratings provide information on the relationship between aid
and outcomes? Third, can the application of novel econometric and machine
learning techniques better detect associations between project characteristics
and development outcomes? And finally, what can such methods illuminate
about the characteristics of development projects associated with positive sector
outcomes?

3.2. Data
   Data on sector outcomes and country characteristics are obtained from the
World Development Indicators (WDI, World Bank 2021), covering 1990−2015.
For official development assistance (ODA) flows, we used the AidData Core


   7 This may explain the inability of previous researchers to explain more than 30 percent

of the variance in outcome ratings. If every project defined objectives to achieve the highest
rating, the correlation between ratings and project/country characteristics would be zero. IEG
does assess the ‘Bank’s contribution’, defined as ‘the extent to which the services provided by
the World Bank ensured quality at entry of the project and supported effective implementation
through appropriate supervision.’ Its evaluation of project efficacy, or the extent to which
outcomes were achieved, is also required to examine ‘whether the achieved outcomes can
plausibly be attributed to the government program or project.’ But the first focuses largely
on compliance with fiduciary and reporting requirements; while the second does not evaluate
whether the objectives would have been achieved in the absence of the Bank’s involvement.




                                              5
Research Release (Tierney et al. 2011).8 For project outcome ratings, we used
the Project Performance Dataset (PPD), a consistent six-point project outcome
score based on donor-reported outcome data (Honig, Lall, and Parks 2022). We
aggregated from AidData to produce measures of total ODA per country-year,
and used the PPD in combination with AidData to construct simple and size-
weighted averages of ratings for projects completed in a given year.
    For World Bank projects we use a scraper to download the three main doc-
uments of every project - the project information document (PID), the project
appraisal document (PAD), and the implementation completion report (ICR).
Respectively, they contain the information available at the beginning and end
of project preparation, and at project closure. The documents had already been
subject to plain text extraction by the World Bank, and we performed a min-
imal amount of post-processing to clean the files.9 For project characteristics,
we used the data compiled for Ashton et al. (2023). This includes much that
is publicly available via the World Bank’s project portal, as well as some inter-
nal data on team and management characteristics, and project preparation and
supervision steps.
    We concentrate on projects in five sectors: health, education, water and
sanitation (WASH), energy and fiscal management, due to their combination of
prior literature, data availability and size. Together, they account for 35% of
aid flows, and one third of PPD projects. When combining sector aid flows and
projects with outcomes and country characteristics from the WDI we conducted
standard smoothing for noise and minimal interpolation for variables with high
missingness.

3.3. Linear Methods: Replication, Sectoral Extension and Ratings
    We first replicate the specifications in health, education, and WASH from
Mishra and Newhouse (2009) for health, Birchler and Michaelowa (2016) for
education, and Ndikumana and Pickbourn (2017) for WASH. Using identical
model forms, we are able to reproduce the results in each paper closely (see
Appendix A). Since our primary interest is in replicating these models as a
baseline, we do not extend them through instrumentation or other techniques,
nor do we make more than associative claims based on them.
    We then extend this prior analysis by varying controls to check for robustness
and adding two sectors: energy, and fiscal management. For energy, we use as a
baseline the same controls as in WASH. For fiscal management, we use a limited
set of controls, for income level and institutional quality. The general form of
the regression equations was as follows:

    8 The AidData is based on the Organisation for Economic Co-operation and Development

(OECD) Creditor Reporting System (CRS) donor-reported data, with added granularity on
purpose and activity coding.
    9 The scrapped dataset of public documents related to World Bank development projects

is openly available at HuggingFace (https://huggingface.co/datasets/lukesjordan/worldbank-
project-documentS).




                                            6
              Ycts = γ0,s + γX Xc,t−L,s + γW Wcts + fcs + fts + ϵcts              (1)
    Where Ycts is the relevant sector s outcome in country c at time t, Xc,t−L,s
is the relevant aid variables with a lag of L (e.g., volume of aid to sector s),
Wcts is the set of controls taken for each sector from the cited literature, and fcs
and fts are sector-specific sets of country and period fixed effects. The controls
include macro-economic (e.g., GDP per capita), demographic (e.g., youth share
of population), and institutional (e.g., Freedom House ratings) measures. The
regression tables in Appendix A provide the exact outcome and controls used
for each sector.
    We conduct our primary extension for each model by adding the project
outcome ratings as exogenous variables (in Xc,t−L,s ), varying the specification
for robustness. First, we construct a weighted average each year with the weights
provided by the relative size of the rated projects in aid flows. We then take
the mean and max of those ratings over rolling five year periods. As well as
the rating itself (6-point scale) we use a binary variable according to whether
the average rating was “moderately satisfactory” and higher, or “moderately
unsatisfactory” and below. We also restricted the volume of aid to only that
from donors in the PPD, or only World Bank Group projects.

3.4. Machine Learning Methods: Residual Outcomes and Text Embeddings
    We then focus our analysis on projects undertaken by the World Bank, for
which more granular information and extensive text documentation are avail-
able. At the project level, we applied debiased machine learning (Chernozhukov
et al. 2018) to estimate treatment parameters for project effects by utilizing a
linear model to partial out fixed effects and controls, then utilizing linear and
non-linear models to estimate the residual using only project-level characteris-
tics.
    We denote by Y  ¯cts the predicted value in country c at time t for the rele-
vant outcome in sector s, estimated using only the controls and fixed effects in
Equation 1. In other words, Y ¯cts = γ0,s + γW Wcts + fcs + fts . We then removed
this prediction to generate, in each sector, in a residual term Y               ¯cts .
                                                                  ˜cts ≡ Ycts − Y
Together:

                    ˜cts = Ycts − (γ0,s + γW s Wcts + fcs + fts )
                    Y                                                             (2)
    The coefficients in equation 2 are estimated independently for each sector,
and the resulting Y˜sct are each residual terms (and hence normalized scalar val-
ues). These residual outcomes are then the targets for project-level prediction.
    We then extend our analysis to encompass numerical representations of text
related to the project that, we argue, capture the degree to which project con-
tent is tailored to country and sector context (see the Appendix B for details
of their construction). These representations are known as “text embeddings”.
In theory, such embeddings can be very simple: for example, a vector repre-
senting counts of key words in a document is an embedding. The embeddings


                                          7
we utilize are several orders of magnitude more powerful than such counts, or
similar statistical measures of topic frequency, because they capture not only
the relative presence of key words and terms, but the interrelationship among
words. These embeddings capture not only what language is used but how it
is used. The same word in different parts of a block of text, or surrounded by
different language, will be embedded differently in the high-dimensional space.




Figure 1: Dimensionality-reduced contextual embeddings of Project Development Objectives

    We find strong indication that these embeddings are capturing meaning-
ful interrelationships among projects. To visualize this, we reduce these high-
dimensional embeddings to two dimensions. Though the axis have no direct
interpretability, the plot of embeddings for all World Bank projects shows clear
separation by sector, even though their area of focus was not included in any
of the information provided to the embedding pipeline (Figure 1). Moreover,
when considering health projects, the mean embedding is stable throughout the
period 1990−2019, with slight fluctuations in standard deviation per decade
(see Table 1). In 2020−21, however, the mean shifts dramatically, and variance
collapses - as we would expect, given the COVID-19 pandemic and consequent
focus on emergency response.
    We then use these embeddings to construct a novel measure for the degree of
tailoring of a project to its context, which we will call project contextualization.
To do so, we calculate the mean embeddings for each decade in each sector
and in each country, and then compute the Euclidean distance between each
project’s embedding and the mean for its sector and country in the decade it
was approved. Formally, for each sector s and decade d, with projects in the


                                           8
                Table 1: Health projects embeddings evolution over time
            Period         N      Mean X         Mean Y     Mean Distance
            1990−99       195      7.84           9.64          5.01
            2000−09       275      7.75           7.65          5.55
            2010−19       308      7.65           9.89          2.79
            2020−21       203      9.44           11.76         1.19
       Note:     Embeddings are projections into abstract high-dimensional vec-
       tor space expressing inter-relationships, so lack physical units. This table
       reports the reduction of health projects embedding in two dimensions (X
       and Y), centered on the 1990−1999 mean.



sector and decade Psd and |Psd | = Nsd , we calculate:
                                           Nsd
                              ∗       1
                             Esd =               Ep , p ∈ Psd                         (3)
                                     Nsd   n=1
                                ∗
    and similarly construct Ecd    for each country c. We then define the sector
                     s                                 ∗
distance, denoted Tp for a given project p, as ∥Ep − Esd  ∥. Similarly, the country
                      c                         ∗
distance, denoted Tp , is defined as ∥Ep − Ecd ∥. The sector distance can be
interpreted as the degree to which the project document is, in the deep tex-
ture of its language, adjusting sectoral knowledge to local country realities, and
the country distance a similar measure of how sectoral peculiarities are being
brought to bear on local problems.
    Further work is needed to determine what design aspects, as reflected in the
text of the project documents, are most clearly associated with development
outcomes. In this paper, we restrict ourselves to embeddings generated from
the project development objectives (PDO), results framework indicators and
implementation completion report (ICR). But even these appear to be capturing
the degree of specificity to country circumstances. To illustrate, here is the PDO
of a health project with very low (sector distance more than 1 standard deviation
smaller) contextualization:

      The revised PDOs are to: (i) improve coverage, utilization and qual-
      ity of health care services in the territory of the Recipient, and (ii)
      strengthen the Government’s stewardship functions in the health
      sector.

   By contrast, more contextualized PDOs are more specific with respect to
population groups, periods and outcomes. Here are another two examples with
high (more than one standard deviation above average distance to sector mean)
contextualization:

      The Project’s development objective is to ensure access to improved
      and sustained water and sanitation services in rural communities
      in [redacted country name]. This would be accomplished through
      the implementation of the new Rural Water Supply and Sanitation


                                            9
      (RWSS) sector policy and the preparation of a National RWSS pro-
      gram. To this end, the Project would support a decentralized and
      demand responsive delivery mechanism and help build the institu-
      tional foundation for implementing the National RWSS Program
      both at the central and local governments levels.

   And:

      The specific objective of this project is to support programs designed
      to halt transmission of HIV/AIDS among vulnerable populations
      (PLWHA, IDUs, CSWs, and their clients and sexual partners) and
      between these vulnerable populations and the general population.
      Key outcome indicators include: Percent of vulnerable groups in
      participating provinces reporting safer injection practices (from an
      estimated 20% at baseline to 70% at the project end ); Percent of
      vulnerable groups in participating provinces reporting condom use
      in sexual intercourse (from an estimated 40% at baseline to 80% at
      project end).

    Finally, we seek to identify relationships between the contextualization vari-
able and project preparation and supervision characteristics. These characteris-
tics comprise project region, recipient income levels, fragility, and institutional
strength; the time and cost of project preparation; the location (headquarters or
country office) and experience of management; whether or not analytical work
in the sector and country was conducted in the years prior to the project; and
various characteristics of the project manager, including education level (PhD
or not), age, experience, and prior work in the sector, analytical or lending.
Since we have embeddings for all historical World Bank projects, we are able to
probe for these relationships across a somewhat larger dataset (N = 4, 260).

3.5. Non-Linear Models
    Having constructed the residual outcomes and text embeddings, we set two
prediction tasks:
  1. Binary classification. For each project, the target prediction is whether the
     residual outcome in the project sector is positive five years after project
     completion.
  2. Regression. For each project, the target prediction is the precise residual
     outcome in the project sector five years after project completion.
    The lag-time for prediction follows the replicated studies and is used because
projects may not target the specific sector outcome during the project period
itself, or may take place in only a few districts at a time. However, as discussed
in section 2.2, this is justified in that almost all development projects aim for
systemic effects, whether via demonstration, capacity building or other channels.
    As inputs to the non-linear models used for each prediction we construct
vectors for each project p consisting of:


                                        10
   1. Basic quantitative data on the project, such as the size of the loan (in
      2010-adjusted US dollars), its duration (in months), and the percentage
      of its budget allocated to its primary sector. We extend several of these
      features, for example calculating the Herfindahl-Hirschman Index (HHI)
      for budget allocation across sectors (see Table D.11).
   2. Categorical features, such as the funding source (IBRD or IDA10 ) and the
      financing instrument (the loan or grant type). The features are one-hot
      encoded and described in Table D.12.
   3. Text embeddings for the project title and project development objective
      (PDO), as well as for the implementation completion report (ICR) and
      the results framework indicators, where they exist.
   4. Project contextualization features, i.e., sector-distance and country dis-
      tance for each of the embeddings.11

    We concatenate the numeric and categorical features, the text embeddings,
and the distance measures, to generate the combined project feature vector
Xp . Combining across health, education, WASH and energy, this results in
n = 1, 457 projects as inputs, with the lagged residual outcomes (as described
above) as targets for predictions. Following standard practice we construct a
test set of Ntest = 146 projects and train on Ntrain = 1, 311.
    We then utilize standard techniques to search among model architectures and
among hyper-parameters for the models. In each case we trivially use classifica-
tion and regression variants of the model architectures. We use a linear model
as a baseline, in both modern variants (Lasso and Ridge). We also consider sup-
port vector machines, decision tree ensembles (random forest), gradient boosted
trees (XGBoost), and fully-connected neural networks (small in size, given the
limited data). The full list of architectures and hyper-parameters is provided in
Table D.13.
    To measure predictive performance, we use the receiver-operator area under
curve (ROC AUC) metric for the binary classification, measured on the test set.
ROC AUC can be interpreted as the probability that the model will rank more
highly a random project associated with a positive residual than one associated
with a negative lagged residual of being positive. A ROC AUC of 0.5 means the
model is only as good as random choice in distinguishing between positive and
negative projects, and a ROC AUC of 1 means it distinguishes such projects
perfectly. We also report the r2 of the corresponding regression models on the
training set, in order to compare results to more traditional regression techniques
in the development literature, which do not use train-test splits.12

  10 International Bank for Reconstruction and Development or International Development

Association.
  11 We include both these distance features and each project’s raw embedding, since the

embedding on its own can (and by our empirical results does) contain information about a
given project’s relationship to others not captured in the sector- and country-distances alone.
  12 We do not use results on the training set to select models or make claims for them,

following standard practice.



                                              11
3.6. Robustness and Interpretation
    We perform multiple checks for robustness of both the linear and non-linear
models. For the linear models, we search over multiple possible linear speci-
fications by adding and removing controls and adjusting lags and observe the
effects on significance measures and coefficients for exogenous variables.13 . We
also test for orthogonality between X and W in equation 1, to test for the pos-
sibility that W is a function in part of past X and hence that the residual in
equation 2 is prematurely purged of the influences of X , weakening the associa-
tion unintentionally. In other words, we test that associations between past aid
and present controls are not muddying the results. The non-linear models are
all tested using the standard practice of K-fold cross-validation.14
    We interpret the models in part using standard methods specific to model
types, such as coefficient magnitudes and significance for linear models and
impurity-based feature importance in decision-tree ensembles, supplemented by
Shapley Additive Explanations (SHAP values, see Lundberg and Lee 2017).
SHAP values use a game theoretic approach to explain the output of a model by
attributing contributions to the final prediction to model features, analogously
to attributing the contribution of individual players within a team to the final
result of a game.

4. Results

4.1. Sectoral Aid Effects: Can any general statements be made about the impact
     of development aid on sector outcomes?
   The linear model regression results are reported in Appendix A. In each
case, the volume of aid per capita had a statistically significant effect on sector
outcomes, appropriately lagged and smoothed. The coefficients were, though,
modest in each sector:

    • Doubling per capita education aid is associated with an 8 percentage point
      increase in net primary school enrolment.
    • Doubling per capita health aid is associated with a 2 percentage point
      reduction in the infant mortality rate.
    • A 1 percentage point increase in WASH aid as percentage of GDP is
      associated with between 1-5 percentage point increase in rural access to
      water and sanitation.

  13 In other words, we introduce causal perturbation, following the practice in the DoWhy
library (Sharma and Kiciman 2020).
   14 The training set is itself divided five times into a “hold-out” (or validation) set and a

training set proper, with a candidate model trained on the training set proper and scored on
the validation set. After the five runs both the scores and the models themselves are averaged,
and excessive variance between each “fold” is examined.




                                              12
    • Doubling per capita energy aid is associated with a 2 percentage points
      increase in access to energy.
    • Doubling per capita aid to fiscal management is associated with a 4 per-
      centage points increase in the tax (net of social contributions) to GDP
      ratio.

    These results were largely robust to causal perturbation. Coefficients re-
mained significant with only minor changes in magnitude when controls were
added or removed, with the partial exception of education, where the addition
of a lagged prior enrollment figure resulted in the aid coefficient becoming in-
significant. The consistency of the results gives us confidence in saying that aid
is associated with improved sector outcomes, but the effect is generally modest,
and dwarfed by other variables.
    As a robustness check, we also find that the aid variable has low-to-trivial
correlation with the controls in almost all cases, and the cases of moderate
correlation argue more for W causing X in equation 1 than vice-versa. For
example, HIV prevalence (Pearson coefficient of 0.39 with period-average mean
per-capita health commitments) and fertility (0.26 on the same measure) are
the only health controls with more than a 0.1 correlation with per capita aid,
and the coefficient is positive, i.e, greater levels of HIV and fertility result in
more aid.
    These results do not change when longer lags are introduced to X . For
example, the correlation of last-five-years’ aid and pupil-teacher ratios is −0.16
and that between 5-to-10-years’ aid and the same ratio is −0.17. More aid is
then extremely weakly correlated to lower pupil-teacher ratios, but that, and
Freedom House ratings (−0.3), are the strongest correlation between the controls
and treatments and those correlations do not strengthen (even trivially) when
lagging X . This strongly suggests that X and W are largely orthogonal, and
that aid’s effects are not stronger through some lagged or cumulative effect on
the controls.15
    On the other hand, the coefficients on aid are probably an underestimate of
its true effect. First, we are comparing “flows” of aid to an effective “stock”
of sector outcome performance. Second, some proportion of the aid flows will
be unrelated to the specific outcome used as the dependent variable (e.g. to
higher rather than primary education, or to learning outcomes as opposed to
enrollment). But the degree of underestimation is likely limited by the inten-
tional alignment between official aid flows and the outcome-level indicators to

  15 Correlations remain small when lagging total aid as far back as a decade, and, while aid

within the sectors summed country-wise over the period are moderately to strongly correlated,
such aid is not so correlated when disaggregated over time. In other words, there is little reason
to believe that the effect of sectoral aid is being weakened by a relationship between overall
aid and growth followed by growth and sectoral outcomes. Further avenues to try to increase
the size of the effect of aid’s effectiveness are beyond this paper, whose principal purpose is
not to investigate this relationship in itself.




                                               13
measure progress towards the Millennium Development Goals (MDGs) that we
use as our dependent variables.

4.2. Project Rating Significance: Do project outcome ratings provide informa-
     tion on the relationship between aid flows and sector outcomes?
    The results for project ratings, reported in detail in Appendix A and sum-
marized in Table 2, are also clear. Only for fiscal management outcomes do
the weighted average ratings convey information about outcomes. In the other
sectors, the ratings are not significant: the coefficients are near zero and their
inclusion makes no difference to the coefficients on aid volume. These results
are robust to using the alternate measures of ratings and to the restriction of aid
flows to particular donors or the World Bank Group alone. The one exception is
in a specification for sanitation, but with a small sample size, a small coefficient
and a negative sign.
    The more sensitive “debiased/double machine learning” techniques confirm
the absence of effects seen in the traditional regressions. Table 2 shows, for
each sector, the r2 of the partialing out step, the r2 of the treatment test, the
coefficient on the treatment (weighted average rating) in the treatment test and
the p-value of the treatment test.
    It might be argued that ratings’ absence of information in four out of five
sectors is a result of projects focusing on other outcomes than those we are
testing against. But this cannot explain why aid volume enters significantly
against the sector outcomes, and, when examined in detail, requires implausible
assumptions to account for the results. Assume that some proportion X of the
aid in a sector targeted the MDG sector outcome, and the rest targeted entirely
unrelated outcomes. Then the “true” coefficient on aid volume would be 1/X
times the coefficient detected in our regression. If overall project ratings did
convey information, then they should convey information on the X proportion
targeting the outcome, and hence should modify the coefficient on volume, even
if attenuated. The ratings on the 1 − X share of aid explicitly targeted to
non-MDG outcomes would diminish the rating effect. But the coefficient could
only be reduced to insignificance if the ratings on the 1 − X proportion were
negatively correlated to the X share and neutralized them precisely.
    We also note that, during this period, projects were predominantly MDG
related and that dividing the period into two, one peak MDG period and one
after, does not alter the estimate. Further, we do detect a relationship for fiscal
management, even though not all fiscal projects explicitly aim to increase the
tax share of GDP. Finally, as noted in section 3.5, the use of lags makes it even
more reasonable to expect effects on the dominant MDG outcome from projects
in the sector. In all, it seems far more plausible that ratings are not providing
information than that ratings on a small share of non-primary-outcome projects
precisely cancel out information in the primary-outcome projects, even though
the results are unchanged in periods where that share was trivial and even ac-
counting for lags in within-sector spillovers, and not least because this neutering
effect would have to mysteriously vanish in one out of five sectors.


                                        14
               Table 2: Significance and Magnitude of Coefficient on Ratings
                                            2       2
                 Sector           N      rRc       rt    Rating coefficient
                 Education       731     0.48    0.00    -0.01
                 Health          250     0.80    0.00    -0.01
                 WASH            406     0.84    0.01    -0.04**
                 Energy          317     0.87    0.01    -0.02
                 Fiscal          539     0.62    0.80    0.07***
         Notes:    Sector outcome ratings were incorporated as the dollar-weighted
                                              2 denotes the adjusted r 2 on the initial
         average rating in the prior period. rc
         regression of the sector outcomes against the controls, and rt 2 denotes the

         adjusted r2 for the regression of the residual outcomes. Significance of
         rating coefficients indicated as ***p<0.01, **p<0.05, *p<0.10.


4.3. Contextualization and Sector Outcomes: Can we detect associations be-
     tween project characteristics and development outcomes?
    We used the strongest performing non-linear model to test for feature impor-
tance. The results are shown in Figure 2. The embedding features are dominant,
followed by the measures of sector concentration, and only afterwards country
institutional quality and other project characteristics more commonly used in
the literature.
    The most important embeddings are those associated with the PDO, espe-
cially the distances of that embedding to sector and country mean, followed by
results framework indicators and the ICR. The importance of the contextualiza-
tion features indicate that the model is learning to detect from the embeddings
the degree to which a project has been contextualized to its country, via its
distance to its sector mean, and to which country considerations have been ad-
justed in light of the sector’s characteristics, via the distance to the country
mean.16
    We consider feature importance when adding in during-project and at-review
features. We find that the embeddings of the ICR report itself then join the
PDO embedding and contextualization measures as one of the most important
features, and on some specifications becomes the most important feature. The
actual project length (as opposed to its proposed length) is similarly important.
However, neither fully displaces those from approval, and the PDO embeddings
and contextualization and concentration measures retain high importance. In
keeping with our other results, project ratings are unimportant.
    Roughly half of the projects in the dataset had a positive residual outcome,
which we would generally expect given their construction. Ensemble based

  16 It is important to remember here that these are measures of relative contribution to the

performance of a non-linear model on the entire dataset and cannot be used in the trivial
manner of a coefficient in a simple linear regression, to read off that, for example, a propor-
tional increase in contextualization leads immediately, ceteris paribus, to a certain increase in
predicted performance. More simply, nothing in these results should be taken to imply that
writing a longer PDO with the names of some local programs will lead to improved sector
outcomes (or even that simply an increase in intellectual effort across the PAD will do so).


                                                15
Figure 2: Feature importance with only at approval (left) and at review (right) features,
measured in relative weight (i.e., with all features’ importance summing to 1)


methods achieved ROC AUCs approaching 0.7, indicating that the model cor-
rectly predicted a positive or lagged residual outcome 7 out of 10 times. Sim-
ilarly, ensemble models’ regression performance on the training set was high,
with an adjusted r2 of 0.76. Further details and full results are provided in
Appendix B.2.
    We also attempt to identify the correlates of the gap between project ratings
and outcomes by training a non-linear model to predict this distance, which
might be characterized as the degree to which ratings have been gamed. We
find that loan size and region are important in explaining the distance between
the rating and the residual outcome (see Table C.10). Specifically, gaming
appears more likely for large loans, particularly those in countries with weak
institutional assessment scores at project approval.

4.4. Determinants of Contextualization: What do we know about the character-
     istics of development projects associated with positive sector outcomes?
    We cannot rule out that contextualization and sector outcomes are both
driven by other unobservable factors. It is plausible, for example, that very
effective actors from donor agencies and government counterparts design bet-
ter and more contextualized projects, and that these actors also drive better
results. Nevertheless, given the importance of contextualization in explaining
sector outcomes, we investigate what aspects of the project preparation process
might be conducive to it. Here our findings are only tentative. We find some ev-
idence that it is associated with longer preparation times and existence of prior
analytical work, but our results do not lend themselves to robust interpretation.
A random forest model is able to explain 30% of the variance in embedding
sector distance (as described in section 3.6), while linear models explain less
than 10% of the variance.




                                           16
5. Conclusion

    We present two main findings. The first is that in four of the five sectors
for which we have data, donor agencies’ project outcome ratings provide no
information as to the long-run impact of their projects. It follows that they
are a poor measure of aid effectiveness, though they may still be useful for
monitoring other aspects of project performance. Our second finding is that
the single most important correlate of impact is what appears to be a proxy for
the degree of contextualization of project design to country circumstances, far
ahead of country institutional quality, project size or other commonly identified
factors. Our methods enable us to explain around 70% of the residual variance
in development outcomes, a much higher proportion than previous analyses of
the determinants of project outcome ratings.
    These findings have significant implications for how we think about project
preparation. Further work is needed to establish what the embeddings mean.
But our results at least suggest that greater attention be paid to country contex-
tualization. This does not appear to correlate with the standard determinants
of project quality identified in the literature, such as project manager’s age,
education or prior experience. Nor does it correlate with whether the staff is
based in headquarters or in the field - consistent with IEG’s own assessment
that corporate field staffing targets have failed to ensure that decentralization
is tailored to country and program needs or applied to areas where it can bring
the most benefits (Independent Evaluation Group 2016). While we find some
evidence that the length of project preparation and existence of prior analytical
work are positively associated with impact, we are unable to say to what extent
they matter.
    Our analysis also has implications for how we think about project evaluation.
On the one hand, it is encouraging that ICRs provide sufficient information to
accurately predict the likely contribution of projects to long-run outcomes. On
the other hand, the incorporation of this information into summary ratings
appears to have resulted in them becoming disassociated from development
impact. This may warrant a more careful reading of the qualitative evidence in
ICRs, as well as more attention to how project outcome targets are calibrated,
perhaps by requiring teams to specify an ex ante counterfactual absent the
World Bank’s involvement.




                                       17
References

Andrews, Matthew (2021). “Successful Failure in Public Policy Work”. In: CID
   Faculty Working Paper Series No. 402. url: https://www.hks.harvard.
   edu/centers/cid/publications/faculty-working-papers/successful-
   failure-public-policy.
Ashton, Louise et al. (2023). “A Puzzle with Missing Pieces : Explaining the
   Effectiveness of World Bank Development Projects”. In: The World Bank
   Research Observer 38 (1), pp. 115–146. url: https://doi.org/10.1093/
   wbro/lkac005.
Birchler, Kassandra and Katharina Michaelowa (2016). “Making aid work for
   education in developing countries: An analysis of aid effectiveness for primary
   education coverage and quality”. In: International Journal of Educational
   Development 48, pp. 37–52. url: https://doi.org/10.1016/j.ijedudev.
   2015.11.008.
Blum, Jurgen Rene (2014). “What factors predict how public sector projects
   perform? A review of the World Bank’s public sector management portfolio”.
   In: World Bank Policy Research Working Paper 6798. url: http://hdl.
   handle.net/10986/17299.
Chernozhukov, Victor et al. (Jan. 2018). “Double/debiased machine learning for
   treatment and structural parameters”. In: The Econometrics Journal 21.1,
   pp. C1–C68. issn: 1368-4221. url: https : / / doi . org / 10 . 1111 / ectj .
   12097.
Denizer, Cevdet, Daniel Kaufmann, and Aart Kraay (Nov. 1, 2013). “Good
   countries or good projects? Macro and micro correlates of World Bank
   project performance”. In: Journal of Development Economics 105, pp. 288–
   302. issn: 0304-3878. doi: 10.1016/j.jdeveco.2013.06.003. url: https:
   //www.sciencedirect.com/science/article/pii/S0304387813000874.
Ford, Ian and John Norrie (2016). “Pragmatic trials”. In: New England journal
   of medicine 375.5, pp. 454–463. url: https://www.nejm.org/doi/full/
   10.1056/NEJMra1510059.
Honig, Daniel (2018). “Navigation by Judgment: Why and When Top Down
   Management of Foreign Aid Doesn’t Work”. In: doi: 10.1093/oso/9780190672454.
   001.0001. url: https://oxford.universitypressscholarship.com/10.
   1093/oso/9780190672454.001.0001/oso-9780190672454.
Honig, Daniel, Ranjit Lall, and Bradley C Parks (2022). “When does trans-
   parency improve institutional performance? Evidence from 20,000 projects
   in 183 countries”. In: American Journal of Political Science. doi: https:
   //doi.org/10.1111/ajps.12698.
Independent Evaluation Group (2016). Behind the Mirror: A Report on the
   Self-Evaluation Systems of the World Bank Group. World Bank. url: http:
   //hdl.handle.net/10986/24956.
Lundberg, Scott M and Su-In Lee (2017). “A Unified Approach to Interpret-
   ing Model Predictions”. In: Advances in Neural Information Processing Sys-
   tems 30. Ed. by I. Guyon et al. Curran Associates, Inc., pp. 4765–4774.



                                    18
   url: http://papers.nips.cc/paper/7062- a- unified- approach- to-
   interpreting-model-predictions.pdf.
McInnes, Leland, John Healy, and James Melville (2020). UMAP: Uniform
   Manifold Approximation and Projection for Dimension Reduction. doi: 10.
   48550/ARXIV.1802.03426. arXiv: 1802.03426 [stat.ML]. url: https:
   //arxiv.org/abs/1802.03426.
Mishra, Prachi and David Newhouse (2009). “Does health aid matter?” In:
   Journal of health economics 28.4, pp. 855–872. doi: 10.1016/j.jhealeco.
   2009.05.004. url: https://www.sciencedirect.com/science/article/
   pii/S0167629609000563.
Moore, Mark H (1995). Creating public value: Strategic management in govern-
   ment. Harvard university press. isbn: 9780735100046.
Ndikumana, L´  eonce and Lynda Pickbourn (2017). “The impact of foreign aid
   allocation on access to social services in sub-Saharan Africa: The case of
   water and sanitation”. In: World Development 90, pp. 104–114. doi: 10.
   1016/j.worlddev.2016.09.001. url: https://www.sciencedirect.com/
   science/article/pii/S0305750X1530543X.
Reimers, Nils and Iryna Gurevych (2019). “Sentence-bert: Sentence embeddings
   using siamese bert-networks”. In: arXiv preprint arXiv:1908.10084.
Sharma, Amit and Emre Kiciman (2020). DoWhy: An End-to-End Library for
   Causal Inference. arXiv: 2011.04216 [stat.ME].
Tierney, Michael J et al. (2011). “More dollars than sense: Refining our knowl-
   edge of development finance using AidData”. In: World Development 39.11,
   pp. 1891–1906. doi: https://doi.org/10.1016/j.worlddev.2011.07.
   029. url: https : / / www . sciencedirect . com / science / article / pii /
   S0305750X1100204X.
Vaswani, Ashish et al. (2017). “Attention is all you need”. In: Advances in neural
   information processing systems, pp. 5998–6008.
Williams, Martin J (2019). “External Validity and Policy Adaptation: From
   Impact Evaluation to Policy Design”. In: The World Bank Research Observer
   35.2, pp. 158–191. issn: 0257-3032. doi: 10.1093/wbro/lky010. url: https:
   //doi.org/10.1093/wbro/lky010.
World Bank (2021). World Development Indicators. url: https://databank.
   worldbank.org/source/world-development-indicators.




                                       19
Appendix A. Linear Regression Results




                              20
          Table A.3: Education Aid Volumes, Ratings and Sector Outcomes


                                 I              II          III          IV           V
Education Aid1                   0.08**         0.08**
                                 (0.03)         (0.03)
PPD-only Education Aid1                                     -0.01        -0.01        -0.01
                                                            (0.02)       (0.02)       (0.02)
Avg Education Rating2                           0.02        0.01
                                                (0.03)      (0.03)
Max Education Rating2                                                    -0.04
                                                                         (0.03)
Binary Education Rating2                                                              0.01
                                                                                      (0.12)
Young Population3                -0.08***       -0.08***    -0.03        -0.03        -0.03
                                 (0.02)         (0.02)      (0.02)       (0.02)       (0.02)
Pupil-teacher ratio4             -0.02          -0.02*      -0.02*       -0.02*       -0.02*
                                 (0.01)         (0.01)      (0.01)       (0.01)       (0.01)
GDP per capita PPP5              1.15***        1.15***     1.64***      1.69***      1.64***
                                 (0.28)         (0.28)      (0.31)       (0.31)       (0.31)
Cash surplus/deficit6            0.00           0.00        0.01*        0.02*        0.01*
                                 (0.01)         (0.01)      (0.01)       (0.01)       (0.01)
Inflation (%)                    0.00           0.00        0.00***      0.00***      0.00***
                                 (0.00)         (0.00)      (0.00)       (0.00)       (0.00)
Trade (% of GDP)                 -0.01**        -0.01**     -0.00*       -0.01*       -0.00*
                                 (0.00)         (0.00)      (0.00)       (0.00)       (0.00)
Freedom House7                   0.07           0.07        0.09         0.08         0.09
                                 (0.06)         (0.06)      (0.07)       (0.07)       (0.07)
R-squared                        0.53           0.53        0.56         0.56         0.56
R-squared Adj.                   0.49           0.49        0.51         0.51         0.51
No. observations                 1243           1243        999          999          999
      Notes:      The dependent variable is primary school enrolment (% net) in
      all specifications. Specification (I) closely matches Birchler and Michaelowa
      (2016). Independent variables: (1) Aid = log, average of $ commitments
      per capita in the prior 5y period; (2) Avg = $-weighted average project
      outcome rating in the prior 5y period, Max = maximum project outcome
      rating achieved in the prior 5y period, Binary = dummy for any satisfac-
      tory project outcome ratings in the prior 5y period; (3) Share of population
      ages 0-14; (4) in primary; (5) constant 2017 international $; (6) Govern-
      ment cash surplus/deficit as % of GDP; (7) Average of Freedom House
      Political Rights and Civil Liberties scores. Constant, fixed effects, and
      missing value indicators for imputed variables are included but not shown.
      Robust standard errors in parentheses. Significance of coefficients indicated
      as ***p<0.01, **p<0.05, *p<0.10.




                                           21
            Table A.4: Health Aid Volumes, Ratings, and Sector Outcomes


                                  I              II          III           IV           V          VI
Health Aid1                       0.00*          0.10***     0.01**        0.00         0.00       0.01
                                  (0.00)         (0.02)      (0.00)        (0.00)       (0.00)     (0.01)
Avg Health Rating2                                           0.08          0.05
                                                             (0.07)        (0.07)
Max Health Rating2                                                                                 -0.11
                                                                                                   (0.14)
Binary Health Rating2                                                                   0.00***
                                                                                        (0.00)
HIV prevalence3                   0.03***        0.02***     0.02***       0.02***      0.02***    -0.01
                                  (0.01)         (0.00)      (0.01)        (0.00)       (0.00)     (0.02)
Fertility rate4                   0.38***        0.34***     0.38***       0.37***      0.37***    0.30***
                                  (0.01)         (0.02)      (0.01)        (0.01)       (0.01)     (0.03)
GDP per capita PPP5               -0.05***       -0.07***    -0.05***      -0.07***     -0.07***   -0.04*
                                  (0.02)         (0.02)      (0.02)        (0.02)       (0.02)     (0.02)
Population total                  0.11***        0.11***     0.11***       0.10***      0.10***    0.05***
                                  (0.01)         (0.01)      (0.01)        (0.01)       (0.01)     (0.01)
Conflict (UCDP/PRIO)              0.02           0.00        0.03          -0.02        -0.02      0.14
                                  (0.09)         (0.08)      (0.09)        (0.08)       (0.08)     (0.11)
Access to water                                  0.02***                   0.02***      0.02***    0.01***
                                                 (0.00)                    (0.00)       (0.00)     (0.00)
Access to sanitation                             -0.01***                  -0.01***     -0.01***   -0.01***
                                                 (0.00)                    (0.00)       (0.00)     (0.00)
Physicians rate6                                 -0.03                     -0.02        -0.02      -0.05
                                                 (0.03)                    (0.03)       (0.03)     (0.06)
R-squared                         0.83           0.86        0.83          0.86         0.86       0.81
R-squared Adj.                    0.83           0.86        0.83          0.85         0.85       0.79
No. observations                  609            609         585           585          585        213
      Notes:     The dependent variable is under-5 mortality rate (per 1,000 live
      births) in all specifications. Specifications (I) and (II) closely match Mishra
      and Newhouse (2009). Independent variables: (1) Aid = log, average of
      $ commitments per capita in the prior 5y period; (2) Avg = $-weighted
      average project outcome rating in the prior 5y period, Max = maximum
      project outcome rating achieved in the prior 5y period, Binary = dummy
      for any satisfactory project outcome ratings in the prior 5y period; (3)
      Share of population ages 15-49; (4) births per woman; (5) constant 2017
      international $; (6) Number of physicians per 1,000 people. Constant, fixed
      effects, and missing value indicators for imputed variables are included
      but not shown. Robust standard errors in parentheses. Significance of
      coefficients indicated as ***p<0.01, **p<0.05, *p<0.10.




                                            22
            Table A.5: WASH Aid Volumes, Ratings and Sector Outcomes


                                Water I       II            III          Sanitation I    II         III
WASH Aid1                       0.01**        0.01**        0.02**       0.04***         0.04***    0.05*
                                (0.00)        (0.00)        (0.01)       (0.01)          (0.01)     (0.03)
Avg WASH Rating2                              0.00                                       -0.00
                                              (0.01)                                     (0.01)
Binary WASH Rating2                                         -0.02*                                  -0.06
                                                            (0.01)                                  (0.07)
Adult literacy3                 0.00          0.00          0.01**       0.01***         0.01***    0.00
                                (0.01)        (0.01)        (0.00)       (0.00)          (0.00)     (0.00)
Young Population4               0.02*         0.02*         0.02***      -0.02***        -0.02***   -0.01
                                (0.01)        (0.01)        (0.01)       (0.00)          (0.00)     (0.01)
GDP per capita PPP5             -0.72***      -0.72***      0.16***      0.12***         0.12***    0.46***
                                (0.10)        (0.10)        (0.05)       (0.04)          (0.04)     (0.08)
Conflict (UCDP/PRIO)            -0.11*        -0.11*        -0.02        0.22***         0.22***    0.09
                                (0.07)        (0.07)        (0.02)       (0.07)          (0.07)     (0.11)
Lag access to sanitation        0.04***       0.04***       0.00
                                (0.01)        (0.01)        (0.00)
Lag access to water                                                      0.02***         0.02***    0.01***
                                                                         (0.00)          (0.00)     (0.00)
R-squared                       0.63          0.63          0.99         0.62            0.62       0.84
R-squared Adj.                  0.60          0.60          0.98         0.61            0.61       0.82
No. observations                755           755           121          755             755        121
      Notes: The dependent variable is access to improved water source (% of
      population) in Water I, II and III specifications, and access to improved san-
      itation facilities (% of population) in Sanitation I, II and III. Specifications
      Water I and Sanitation I closely match Ndikumana and Pickbourn (2017).
      Independent variables: (1) Aid = average of WASH commitments as % of
      GDP in the prior 5y period; (2) Avg = $-weighted average project outcome
      rating in the prior 5y period, Binary = dummy for any satisfactory project
      outcome ratings in the prior 5y period; (3) Literacy rate (% of people ages
      15 and above); (4) Share of population ages 0-14; (5) log, constant 2017
      international $. Constant, fixed effects, and missing value indicators for
      imputed variables are included but not shown. Robust standard errors in
      parentheses. Significance of coefficients indicated as ***p<0.01, **p<0.05,
      *p<0.10.




                                             23
      Table A.6: Energy Aid Volumes, Ratings and Sector Outcomes

                                      I           II           III
     Energy Aid1                      0.02***     0.02***      0.03
                                      (0.01)      (0.01)       (0.03)
     Avg Energy Rating2                           0.00
                                                  (0.00)
     Binary Energy Rating2                                     0.02
                                                               (0.05)
     Adult literacy3                  0.01**      0.01**       0.00
                                      (0.00)      (0.00)       (0.01)
     Young Population4                0.06***     0.06***      0.07**
                                      (0.01)      (0.01)       (0.03)
     GDP per capita PPP5              0.34***     0.34***      0.55***
                                      (0.06)      (0.06)       (0.20)
     Conflict (UCDP/PRIO)             -0.05*      -0.05*       -0.01
                                      (0.03)      (0.03)       (0.07)
     R-squared                        0.96        0.96         0.98
     R-squared Adj.                   0.96        0.96         0.97
     No. observations                 753         753          104
Notes: The dependent variable is access to electricity (% of population)
in all specifications. Independent variables: (1) Aid = log, average of $
commitments per capita in the prior 5y period; (2) Avg = $-weighted av-
erage project outcome rating in the prior 5y period, Binary = dummy for
any satisfactory project outcome ratings in the prior 5y period; (3) Lit-
eracy rate (% of people ages 15 and above); (4) Share of population ages
0-14; (5) log, constant 2017 international $. Constant, fixed effects, and
missing value indicators for imputed variables are included but not shown.
Robust standard errors in parentheses. Significance of coefficients indicated
as ***p<0.01, **p<0.05, *p<0.10.




                                     24
Table A.7: Fiscal Policy Support Volumes, Ratings and Sector Outcomes

                                          I            II
           Fiscal Aid1                    0.04***      0.00
                                          (0.01)       (0.02)
           Avg Fiscal Rating2                          0.07***
                                                       (0.02)
           Conflict (UCDP/PRIO)           0.38***      0.18*
                                          (0.03)       (0.09)
           GDP per capita PPP3            0.01         0.07
                                          (0.01)       (0.07)
           ODA (% of GNI)                 0.01         -0.12***
                                          (0.01)       (0.02)
           Freedom House4                 -0.01        0.02
                                          (0.01)       (0.02)
           R-squared                      0.64         0.82
           R-squared Adj.                 0.62         0.80
           No. observations               2893         539
Notes: The dependent variable is tax (net of social contributions) to GDP
ratio in all specifications. Independent variables: (1) Aid = log, average
of $ commitments per capita in the prior 5y period; (2) Avg = $-weighted
average project outcome rating in the prior 5y period; (3) constant 2017
international $; (4) Average of Freedom House Political Rights and Civil
Liberties scores. Constant, fixed effects, and missing value indicators for
imputed variables are included but not shown. Robust standard errors in
parentheses. Significance of coefficients indicated as ***p<0.01, **p<0.05,
*p<0.10.




                                    25
Appendix B. Machine Learning Methods

Appendix B.1. Text Embeddings
    We extend our analysis to encompass numerical representations of text re-
lated to the project, known as “text embeddings”. These embeddings are
produced by complex functional forms (“transformer models”) that rely on a
mechanism called self-attention (Vaswani et al. 2017). The complexity of these
functional forms and the use of machine learning to set their parameters by
stochastic gradient descent means they are more difficult to interpret than sim-
pler statistical measures, as will be discussed further below, but compensate
with substantial gains in empirical results.17 We use a two-step process to
generate such embeddings for development projects. First, we use pretrained
transformer models to generate embeddings of each word in the text. Specif-
ically, we use an extension of these models to sentence embeddings, in which
whole sentences are encoded using a transformer architecture trained to embed
“close” sentences (measured by cosine-similarity of their word-level embeddings)
close to each other (Reimers and Gurevych 2019). That first-stage model pro-
duces a very high (n = 768) dimensional vector, too high to be used downstream
given the number of projects available. In our second step, therefore, we reduce
the embeddings’ dimension using a combination of principal-component analysis
(PCA) and uniform manifold approximation and projection (UMAP).18 UMAP
is a state-of-the-art technique for dimensionality reduction that combines ma-
chine learning and algebraic topology to learn a low dimensional manifold pro-
jection of a high dimensional set of data (McInnes, Healy, and Melville 2020).
We utilize PCA to reduce dimensionality to n = 76, then use UMAP to further
reduce to 2-dimensions. The resulting 2-dimensional embedding vector we label
Ep , for a given project p.
    As a caveat, although the embeddings capture interrelationships among texts
their absolute position is not in itself meaningful. That is even more the case
when the embeddings are passed through UMAP, which is a stochastic process
and therefore will result in random variation in the absolute position of any
particular embedding in its dimensionality-reduced form. The reduced-form
embeddings are meaningful only when used to construct intermediate relation-
ships, such as distances to means, and when conjoined with other features of
projects and fed through a training process as part of an entire dataset. As

   17 These models now power all state-of-the-art translation, search engines, AI text genera-

tors, as well as most plagiarism detectors.
   18 Dimensionality reduction is known to degrade the performance of downstream tasks uti-

lizing sentence embeddings, and so is avoided in ML research where possible. However, given
the limited size of our dataset, utilizing the full-width embeddings would have created its
own difficulties with over-fitting. On balance, we decided to reduce the embedding width,
but note that if a larger project-level dataset were constructed, more limited reduction might
lead to significant gains in the downstream model performance reported in Section Appendix
B.2. We explored alternate combinations of PCA, UMAP, as well as t-SNE for robustness,
but found that as well as having the most appealing theoretical justification the pipeline used
provided the most stable and accurate downstream results.



                                              26
an obvious robustness test, we rerun our non-linear model pipelines end-to-end
with different random instances of UMAP, and find that the results are stable.

Appendix B.2. Residual Outcome Predictions
    In each sector roughly half of the projects in the dataset had a positive
residual outcome (see Table B.8). Ensemble based methods achieved receiver-
operator area under curve (ROC AUC) approaching 0.7, indicating that the
model correctly predicted a positive or lagged residual outcome 7 out of 10
times. Similarly, ensemble models’ regression performance on the training set
was high, with an adjusted r2 of 0.76. Prior techniques had been able to explain
at most 30% of the variance in project ratings (which ratings are themselves,
as above, of doubtful importance). Non-linear models are able to explain a
substantially higher percentage of variation of a more meaningful target variable,
with the positive test set performance and similar results across the model types
and across folds giving confidence that this result is not simply the product of
over-fitting or label leaking. Full results are reported in Table B.9.
    One note is that performance collapses for linear models. All linear mod-
els had ROC AUCs below a coin toss, and explained little to no variance in
the target. Tree-based models outperformed Support Vector Machines (SVMs)
learning methods, although with minimal differences between ensemble methods
and gradient boosting. This may lead to concerns that the tree-based ensemble
methods are over-fitting. Such concerns should be alleviated by the relatively
large hold-out set and the use within the training set of K-fold cross validation.
    However, we conducted additional tests for robustness in several ways. First,
we examined the scores for hyperparameter combinations with heavy regular-
ization, that is, which significantly penalized over-fitting. When we did so, we
found some decline in performance, but only moderately. For example, reduc-
ing the maximum tree depth from 100 to 3 reduced the testing set AUC score
from 0.67 to 0.63, a modest reduction (and no reduction was observed at max
depth 10). Second, we dropped all but the top 20% of features (by feature im-
portance) and similarly saw declines of only 4 percentage points in the test set
ROC AUC and B in the training set adjusted r2 . When we add the features
found at review time to those at approval time, we find a slight performance
increase, with an ROC AUC score of 0.7 and an explained variance of 0.86. One
further concern might be that the models were simply detecting the presence or
absence of sectoral outcome keywords. To check for that possibility, we tested
for correlations between the presence of sectoral key words and loan size and
the residual outcomes, and found none (see B.3).




                                       27
                  Table B.8: Summary statistics for residual outcomes
                 Sector             N     Positive     Mean      StdDev
                 Education        352         218       0.17        0.89
                 Energy           225           99     -0.12        0.94
                 Health           580         242      -0.07        1.14
                 WASH             300         130       0.05        0.88
                 Total           1457         689       0.01        1.01
       Notes:    This table reports summary statistics of the residual sector out-
       comes for World Bank projects, estimated independently for each sector
       according to equation 2. The residual terms are normalized scalar values.
       Values for Fiscal projects are not reported as those were not included in
       the project-level non-linear models, given the positive result for the sector
       in the ratings models.




                              Table B.9: Prediction Results
               Model                                 ROC AUC            R2
               Linear (Lasso)                            0.500       0.000
               Linear (Ridge)                            0.603       0.076
               Ensemble (RF)                             0.672       0.564
               Ensemble (XGB)                            0.695       0.764
               Neural Network                            0.534       0.197
               SVCs                                      0.589       0.273
               Ensemble (RF, at approval)                0.700       0.861
       Notes:    RF = Random forest, XGB = gradient boosted trees, SVC =
       support vector classifier, ROC AUC = receiver-operator area under curve




Figure B.3: Outcome keyword presence in PDOs compared to residual outcomes and model
predictions




                                            28
Appendix C. Non-linear Results




                             29
        Table C.10: Divergence between Residual Outcomes and Normalized Ratings, Per Region and Loan Size Tertile
     Region                                 Loan size     Avg gaming prediction          Avg prob gamed         Avg gaming
     Africa East                                large                      0.90                     0.70               0.75
     Africa West                                large                      1.36                     0.77               1.64
     East Asia and Pacific                      large                      0.98                     0.77               1.01
     Europe and Central Asia                    large                      1.20                     0.76               1.22
     Latin America and Caribbean                large                      0.76                     0.72               0.69
     Middle East and North Africa               large                      0.05                     0.53              -0.16
     South Asia                                 large                      0.62                     0.65               0.62
     Africa East                             medium                        0.94                     0.68               0.94
     Africa West                             medium                        1.03                     0.73               1.11
     East Asia and Pacific                   medium                        0.50                     0.58               0.48
     Europe and Central Asia                 medium                        1.04                     0.73               1.09
30




     Latin America and Caribbean             medium                        0.71                     0.68               0.73
     Middle East and North Africa            medium                        0.16                     0.54               0.02
     South Asia                              medium                        0.49                     0.63               0.42
     Africa East                               small                      -0.74                     0.20              -0.85
     Africa West                               small                      -0.80                     0.17              -0.69
     East Asia and Pacific                     small                      -1.01                     0.11              -0.96
     Europe and Central Asia                   small                      -0.75                     0.10              -0.67
     Latin America and Caribbean               small                      -0.99                     0.08              -1.04
     Middle East and North Africa              small                      -1.14                     0.10              -1.64
     South Asia                                small                      -0.95                     0.13              -1.01
     Notes:      Loan size corresponds to observed tertiles of loan size. “Avg gaming prediction” = non-linear model’s
     predicted difference between normalized average rating and normalized sector outcomes (lagged). “Avg prob gamed” =
     prediction of likelihood that a project has a larger than average difference between its normalized rating and normalized
     sector outcomes. “Avg gaming” = observed difference between normalized rating and normalized sector outcomes
Appendix D. Non-linear Methods


                        Table D.11: Numeric Features
 Feature                Unit        Description
 Original Commitment    USD         Size of loan or grant at approval (in con-
                                    stant 2015 dollars)
 Project Duration       Months      Original intended duration of project
 CPIA                   1−6         WB Country Policy and Institutional
                                    Assessment for implementing country at
                                    project approval
 GDP per capita         USD         GDP per capita in constant PPP (at ap-
                                    proval FY), log scale
 Prep TTL experience    Projects    Number of prior projects prepared by
                                    the project’s task team leader
 Prep TTL “value add”   (VA)        Preparing TTL “value add“ in relation
                                    to project ratings
 Country Director VA    (VA)        Project rating value addition of country
                                    director at time of approval
 Sector Manager VA      (VA)        Project rating value addition of sector
                                    manager at time of approval
 Sector Percentage      %           Project budget allocated to primary
                                    sector
 Number Sectors                     Number of sectors the project spans
 Sector HHI             HHI         Herfindahl-Hirschmann Index of budget
                                    allocations across project sectors
 Freedom House Index    Index       Freedom House index for implementing
                                    country at time of approval




                                    31
                           Table D.12: Categorical Features
Feature                    Categories                            Description
Financing instrument       IPF, DPL, others                      The type of financ-
                                                                 ing used for the
                                                                 project
Funding source             IBRD, IDA, blend                      Source of funding
                                                                 within World Bank
Region                     Africa East, South Asia, etc.         World Bank region
                                                                 in which the project
                                                                 fell at approval
Primary Sector             Health, Education, etc.               The project’s pri-
                                                                 mary sector
Fragile/Conflict           Binary                                Whether the im-
                                                                 plementing country
                                                                 was fragile or post-
                                                                 conflict at approval




      Table D.13: Algorithms and hyper-parameters tested for project prediction
Algorithm                    Varieties                        Hyper-parameters
Linear Models                Linear, Logistic Lasso           L1 term multiplier
Support Vector       Ma-     Support Vector Classi-           Regularization    term,
chines                       fier (SVCs) and Sup-             kernel types
                             port Vector Regressor
                             (SVRs)
Ensemble Trees               Random Forest (RF)               Minimum samples in
                                                              leaf, maximum depth
Gradient Boosting            XGBoost                          Learning rate, mini-
                                                              mum child weight
Neural Network               Multilayer    Perceptron         Hidden layer sizes, reg-
                             (MLP)                            ularization




                                          32