Machine Learning in Evaluative Synthesis: Lessons from Private Sector Evaluation in the World Bank Group

Leonardo Bravo, Ariya Hagh, Roshin Joseph, Hiroaki Kambe, Yuan Xiang, Jos Vaessen

IEG Methods and Evaluation Capacity Development Working Paper Series

© 2023 International Bank for Reconstruction and Development / The World Bank
1818 H Street NW, Washington, DC 20433
Telephone: 202-473-1000; Internet: www.worldbank.org

ATTRIBUTION
Please cite the report as: Bravo, Leonardo, Ariya Hagh, Roshin Joseph, Hiroaki Kambe, Yuan Xiang, and Jos Vaessen. 2023. Machine Learning in Evaluative Synthesis: Lessons from Private Sector Evaluation in the World Bank Group. IEG Methods and Evaluation Capacity Development Working Paper Series. Independent Evaluation Group. Washington, DC: World Bank.

MANAGING EDITORS
Jos Vaessen, Ariya Hagh

EDITING AND PRODUCTION
Amanda O'Brien

GRAPHIC DESIGN
Luísa Ulhoa

This work is a product of the staff of The World Bank with external contributions. The findings, interpretations, and conclusions expressed in this work do not necessarily reflect the views of The World Bank, its Board of Executive Directors, or the governments they represent. The World Bank does not guarantee the accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any territory or the endorsement or acceptance of such boundaries.

RIGHTS AND PERMISSIONS
The material in this work is subject to copyright. Because The World Bank encourages dissemination of its knowledge, this work may be reproduced, in whole or in part, for noncommercial purposes as long as full attribution to this work is given. Any queries on rights and licenses, including subsidiary rights, should be addressed to World Bank Publications, The World Bank Group, 1818 H Street NW, Washington, DC 20433, USA; fax: 202-522-2625; e-mail: pubrights@worldbank.org.

Machine Learning in Evaluative Synthesis: Lessons from Private Sector Evaluation in the World Bank Group
Leonardo Bravo, Ariya Hagh, Roshin Joseph, Hiroaki Kambe, Yuan Xiang, Jos Vaessen
Independent Evaluation Group, June 2023

CONTENTS
Authors
Abstract
Abbreviations
Introduction
1. Machine Learning Applications in Evaluation
   What Is Machine Learning?
   Previous Applications
   Potential in Evaluation
2. Classification and Synthesis of Evaluative Evidence in The Independent Evaluation Group's Finance and Private Sector Evaluation Unit
   Objectives
   Problem Statement
   Methodology
   Model Refinement
   Summary of Results
   Limitations
Conclusion
Bibliography

AUTHORS
Leonardo Bravo,1 Ariya Hagh,2 Roshin Joseph,3 Hiroaki Kambe,4 Yuan Xiang,5 Jos Vaessen6
Corresponding Author: Leonardo Bravo, lbravo@ifc.org
Author Affiliations: 1 World Bank Independent Evaluation Group; 2 World Bank Group; 3 International Finance Corporation; 4 Japan International Cooperation Agency; 5 World Bank Independent Evaluation Group; 6 World Bank Independent Evaluation Group

ABSTRACT
The analysis of the implementation challenges private sector projects face has traditionally involved manual identification and categorization of project documents by evaluation officers. An approach of this type offers nuance, but that nuance comes at a significant cost in terms of time and effort expended. The labor required to manually classify project performance parameters and assess the factors that explain why a particular project did (or did not) successfully achieve its intended development outcomes is both intensive and extensive and calls for a more efficient approach. Such an approach should take advantage of evaluators' established experience in diagnosing critical challenges and impediments to project performance as well as recent advances in machine learning. These advances allow practitioners to overcome the challenges manual classification presents by extracting and classifying vast quantities of text in ways that would otherwise be prohibitively laborious. As a demonstration of this concept, we discuss the use of automated content analysis to identify and classify factors and issues commonly faced in the implementation of private sector projects, sorting them according to a curated taxonomy. We describe our approach, which started with the development of a taxonomy of project factors and issues identified by subject area experts. This subsequently provided the basis for employing a combination of machine learning algorithms to iteratively fine-tune the taxonomy. The factors and issues were then classified into 5 overarching categories and 51 subcategories. We show that once machine learning models are sufficiently well trained, they are able to correctly identify the majority of factors and issues under consideration in the taxonomy, including not only their probability of occurrence in a particular paragraph but also whether those factors and issues affected a particular project positively or negatively. The experiment suggests new avenues for machine-assisted classification of large corpora of documents for use in portfolio analysis and evaluative synthesis.

ABBREVIATIONS
IEG Independent Evaluation Group
IFC International Finance Corporation
LDA latent Dirichlet allocation
All dollar amounts are US dollars unless otherwise indicated.
INTRODUCTION

Faced with an ever-growing pool of evidence-rich text reports, evaluators are increasingly interested in extracting and synthesizing insights from these reports in a more efficient and reliable manner. A shift from manual identification and extraction of information to a more automated process is warranted in many cases, specifically in an institutional environment with a steady accumulation of reports that follow fairly standardized formats and types of content. Three issues necessitate such a shift. First, manual categorization can be time consuming, which can limit evaluators to classifying either a smaller number of evaluation documents or a smaller number of factors and issues within the documents than they otherwise would. Second, differences among evaluators' backgrounds and individual classification decisions can introduce inconsistencies in how insights of the same type are classified. These inconsistencies can result in potential over- or underestimation of the prevalence of certain factors and issues, introducing unintended differences in classification that bias the resulting output. Third, manual classification does not readily lend itself to updating existing data sets with new documents and inputs that might become available after the initial classification has been completed.

Machine learning for text classification provides an intuitive solution to these problems. The automation of information extraction and classification opens up exciting avenues for streamlining evaluative synthesis, enabling evaluators to render in seconds what would otherwise require hours or even days of labor-intensive manual identification and coding. Machine learning methods can accelerate content extraction, provided that practitioners train the extraction tool properly. In the context of the text analytics explored in this paper, machine learning involves a combination of unsupervised and supervised text-mining techniques that transform raw text data into a matrix of terms, which is then classified according to a taxonomy of issues pertinent to the analysis at hand. Integration of existing theoretical priors and evaluator experiences can ensure an appropriate balance between the granularity and generalizability of the insights extracted from project documents. Such an approach offers evaluators a powerful analytical tool for better understanding the various determinants of project success, potential challenges to project implementation, and practical lessons for future projects, among other matters.

Automated methods provide three main advantages over conventional approaches. First, they permit faster and more systematic analysis of a set of documents than manual coding alone can achieve. Machine learning does not invalidate systematic manual review; rather, automatic classification and extraction of knowledge can provide a first step to inform further analysis. Second, automated methods can place a larger quantity of relevant data at the disposal of evaluation officers, who can then draw insights from a broader set of inputs than would have been available had manual approaches alone been used. Third, once properly trained, classification algorithms can form the underlying infrastructure for real-time or just-in-time analysis to inform decision-making, whereas using a purely manual process would not produce the required analysis for weeks or even months.
Such algorithms can allow faster, custom manipulation of elements included in analyses based on user needs. In fact, providing real-time insights (for example, to the chair of an investment review meeting) could be the next use for this approach. The integration of machine learning into evaluative synthesis would represent a relatively low-cost intervention that would provide economies of scale for both current and future evaluations. As an investment, the approach would offer a tool that can be reused and modified for future analyses.1 Machine learning can catalyze positive feedback loops, translating insights from identified project challenges into lessons that feed into project design and improve the quality of project implementation and project performance in the long term.

This paper builds on these points as follows. Chapter 1 provides an overview of machine learning and discusses relevant applications in the field of evaluation, briefly outlining previous work and potential future applications. In chapter 2, we use the case of the Finance and Private Sector Evaluation Unit of the Independent Evaluation Group as an example to illustrate the benefits of machine learning for text classification in evaluation. A summary of the results of this experiment and a brief discussion of potential next steps conclude the paper.

Endnotes
1. The diagnosis of delivery challenges, their rank-ordering by salience, and viable strategies for iterative amelioration of future projects are some examples of ways in which multiuse machine learning applications can be employed.

1. MACHINE LEARNING APPLICATIONS IN EVALUATION

What Is Machine Learning?

Machine learning is based on pattern recognition and the theory that computers can autonomously learn to perform certain well-defined tasks (Samuel 1959). The procedure employed usually relies on algorithms: sets of unambiguous mathematical rules used to perform classification and data processing and draw basic inferences. At its core, machine learning is a Bayesian endeavor in which prior beliefs are updated based on new data introduced into analysis. Though the philosophy underlying this approach dates back to the eighteenth century, recent improvements in the efficiency and accessibility of computational methods have allowed scholars and practitioners to apply the tools of machine learning to a wide array of complex problems.

Training on a subset of the data, a machine learning algorithm extracts generalizable lessons from new data, becoming more precise as more information is inputted. Human classification often faces an upper limit on both efficiency and scalability. There are also limits on how perceptive human coders can be in regard to patterns hidden in very large data sets; given the complexity of the underlying phenomena under observation, more nuanced insights based on fewer observations might be lost in the sea of available data. The same shortcomings that limit the performance of manual methods can, however, serve as a source of strength in automated content analysis. Automated classification tends to become more accurate as the quantity of information increases and does not neglect more nuanced patterns, provided that the training data used are sufficiently well ordered.
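To make this workflow concrete, the minimal sketch below (not drawn from the paper; it assumes scikit-learn is available, and its toy paragraphs and labels are invented) shows the basic supervised pattern just described: raw text is transformed into a matrix of term weights, a classifier is trained on a subset of the data, and accuracy is checked on held-out examples.

```python
# A minimal sketch of supervised text classification, assuming scikit-learn.
# The paragraphs and labels below are illustrative toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

paragraphs = [
    "Currency depreciation eroded the project's revenues.",
    "The sponsor lacked the technical expertise to operate the plant.",
    "New legislation delayed the award of the operating license.",
    "Management turnover weakened the client's implementation capacity.",
]
labels = ["country", "sponsor", "country", "sponsor"]

# Transform raw text into a matrix of term weights.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(paragraphs)

# Train on a subset of the data; evaluate on the held-out paragraphs.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0
)
model = MultinomialNB().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```

In practice, accuracy improves as more labeled paragraphs are added to the training set, which is the property the text above emphasizes.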
Machine learning applications can involve supervised methods, unsupervised methods, or a mixture of the two. Supervised-learning algorithms rely on human-coded training sets to train a classification tool to generate predictions from a broader sample of data. Such algorithms are given a set of latent parameters to search for a priori, classifying raw data into categories according to those parameters. Among other uses, they can be trained to categorize text, detect spam, diagnose health issues, and discover fraudulent spending activity. The accuracy of supervised methods depends on how well the parameters for information classification are vetted and on the quality of the manual classification of information used as a training set for the algorithms employed. In short, supervised classification methods require essential inputs from human sources to function properly. However, once properly calibrated, they tend to make up for the initial time investment these inputs require, parsing and categorizing textual data relevant to a particular topic of interest faster and more accurately than manual approaches.

Unsupervised approaches, conversely, do not rely on human input. Instead, they independently search input data for potential correlates and clusters based on different underlying features. Both approaches offer unique advantages specific to different applications. Unsupervised methods can best be thought of as tools that support a Popperian "logic of discovery," serving as an exploratory probe for detecting clusters and patterns in texts (Aggarwal and Zhai 2012).1 However, though unsupervised classification tools can successfully detect patterns in complex and multidimensional data corpora, they can also be susceptible to misclassification errors and overfitting.

In rare cases, unsupervised approaches may unintentionally extrapolate substantively meaningless but statistically "significant" quirks in the data they are analyzing. Not every hidden association within data is useful in regard to a particular research topic. Human intervention is therefore needed to ensure that unsupervised training algorithms generate results that are substantively meaningful and not driven by stochastic noise in the underlying data. Such intervention becomes more pertinent as the complexity of the data increases. In practice, unsupervised learning can often be used with great success to detect hitherto unclassified clusters in data, highlight potential outliers in data sets, or reduce dimensionality within a complex framework.2 But practitioners should not rely on unsupervised learning to produce consistent and meaningful outputs without some degree of vetting by those with substantive knowledge of the underlying phenomena of interest.

Previous Applications

Practical applications of machine learning and text analytics in the realm of evaluation have primarily focused on three areas: automatic coding of key implementation challenges, risk identification, and impact evaluation. Though different machine learning methods can offer a variety of efficiencies related to the practice of evaluation, arguably the most pertinent method has involved supervised or semisupervised classification of large quantities of text. Previous applications have taken advantage of tools for such supervised classification in several different contexts.
A variety of studies applying machine learning to data in health care, pharmaceutical research, transportation, energy, and labor, among other areas, have noted the benefits of such an approach. Cimiano et al. (2005) use machine learning to categorize a large corpus of heterogeneous data, extracting common text features and examining interrelationships among the various terms identified. Tanguy et al. (2016) use support vector machine learning to classify and evaluate safety event records and archival documents, which enables them to categorize incident reports in the aviation sector. The resulting output improves the accuracy and reliability of analysis conducted by aviation experts, providing insights relevant to facets of aviation incidents. Schmidt, Schnitzer, and Rensing (2015) similarly take advantage of an automated classification algorithm for text-heavy source data, in this case a catalog of job offers based on hours of work, modes of employment, and functional work areas. The resulting output is a domain-specific search engine that enables subject-specific knowledge to be exploited more efficiently using a set of supervised subject filters. Padmanabhan (2015) applies a battery of supervised multilabel classifiers and natural-language-processing techniques to analyze policy documents and survey data on psychological counseling for military servicemembers. He then uses the resulting output as a framework for explaining how the policies of the United States' Military Health System influence servicemembers' access to psychological services. Burscher, Vliegenthart, and De Vreese (2015) use a supervised-learning algorithm to categorize policy issues, political articles, and parliamentary discourse by salience and topic. The authors then use the results to investigate the generalizability of policy issue classifiers, testing the relevance of different machine-coded topics relative to those yielded by hand-coded training sets.

In regard to risk assessments, machine learning can help policy makers identify category-specific risk factors and quantify their impact, drawing on insights from challenges and obstacles encountered in earlier projects. In this context, Rona-Tas et al. (2019) use supervised learning in the field of food safety to assess the two main issues related to food hazards, helping practitioners better understand underlying ambiguities and emergent risks related to monitoring and inspection practices. Quantification of risk factors provides specific benefits in this context, as the output of the model employed (assessing the need for potential safety warnings and recalls) demands accurate and timely assessments of food risk parameters. Similarly, Abdellatif et al. (2015) and Ali (2007) use neural networks to assess flood risks and river water quality, generating output that helps manage urban water systems and minimize loss of life and property after water-based disasters. Galindo and Tamayo (2000) apply supervised-learning algorithms such as classification and regression tree models and neural networks to evaluate risk among financial intermediaries, generating an important diagnostic tool for assessing institutional risks and volatility. Okori and Obua (2011) apply machine learning techniques to predict famines in Uganda, using data from the country's northern region to train their tool on inputs from other regions.
They employ a combination of support vector machine, k-nearest neighbors, naïve Bayes, and decision tree analysis to highlight meaningful relationships related to food security and famines, yielding output beneficial for evaluating causal variables related to theorized causes of food scarcity. Ofli et al. (2016) combine crowdsourcing and real-time supervised machine learning to evaluate large quantities of aerial and satellite imagery for time-sensitive disaster response. Jean et al. (2016) similarly apply machine learning to survey data and satellite imagery from Malawi, Nigeria, Rwanda, Tanzania, and Uganda, training a convolutional neural network to identify variations in local economic outcomes. The resulting output offers a scalable tool for predicting poverty according to a combination of data sources. Likewise, McBride and Nichols (2015) implement stochastic ensemble methods such as quantile regression forests to improve the accuracy of beneficiary targeting in poverty reduction, generating economies in areas in which conventional means testing can be prohibitively costly.

Impact evaluation has also benefited from advances in applied machine learning techniques. Counterfactual designs determine the effect of a policy intervention by comparing a treatment group with a control group over time, using experimental or quasi-experimental techniques to control for observable and non-observable causal factors. However, this type of comparison is not always feasible or desirable. In practice, achieving a proper balance between treatment and control groups is no easy feat, particularly when the active samples (such as specific social groups or geographical areas) tend to be structurally diverse. Matching techniques, including unsupervised learning, can be used in this area (see, for example, Gertler et al. 2016). In one example, Ruz, Varas, and Villena (2013) use k-means clustering algorithms to identify the common characteristics of households lacking internet access as a means of evaluating whether an unconditioned broadband subsidy campaign had a significant effect on broadband penetration in Chile.

Zheng, Zheng, and Ye (2016) also use machine learning methods to assess the development impact of environmental tax reform in China. Niu, Wang, and Duan (2009) rely on support vector machine analysis to evaluate the impact of power plant construction projects in China, and Burlig et al. (2017) examine, via machine learning, the impact of energy efficiency upgrades in primary and secondary schools. Machine learning can also yield useful meta-analytical insights. Mueller, Gaus, and Konradt (2016) note that progress in evaluation research depends on establishing a productive cycle of scholarly knowledge generation, dissemination, and implementation. Examining the uneven proliferation of scholastic work on evaluation, they employ a cross-national design for predicting evaluation research output, assessing the relative impact of country-specific research output in evaluation research.

In recent years, applications of machine learning and (more complex) deep learning models in the practice of evaluation have become more widespread.
For example, the Independent Evaluation Group (IEG), one of the early adopters of data science applications in evaluation, has applied these tools in the analysis of textual data in portfolio identification exercises and content analysis (for example, Franzen et al. 2022), as well as of imagery data in poverty mapping and geospatial impact evaluation (for example, Ziulu et al. 2022).

Potential in Evaluation

The use of machine learning approaches in evaluation is still in its early stages but shows significant potential, not only as part of advanced text analytics but also in the use of other data such as imagery data. Regarding advanced text analytics, machine learning techniques can be used to process and analyze text documents by automatically coding and categorizing key issues in the documents. For example, machine learning can be used to extract common challenges across various sectors studied and map the evolution of obstacles over time.

Machine learning applications can provide at least two significant advantages over manual approaches in the context of evaluation. First, they can systematically explore large or growing data sources (such as archives or document repositories), analyzing quantities of information that would be prohibitively time consuming for human coders. They can do this systematically, without a bias toward or against certain issues over others. The impact of various traits these applications discover in the data will therefore be directly related to the presence or absence of those traits in the data. This attribute of machine learning applications is quite valuable in evaluation, as assessments should reflect as closely as possible the underlying features of the evidence examined, without subjective biases or unintended variations of the type different human coders might introduce. Second, automated machine learning applications can continue to improve their assessments as new evaluative data are introduced. As a result, their output represents a "living" classifier: new categories and implementation challenges will be added, updated, and removed as the body of data assessed changes over time. In the case of the work presented in this paper, for example (see chapter 2), the use of machine learning applications allows real-time learning and adaptation by the model in response to evaluator output and the integration of project lessons in practice. Over time, as new data are integrated into supervised analysis, a positive feedback loop can develop between evaluation and practice, allowing future projects to integrate generalizable and context-specific lessons into their design and implementation. This ability to learn and adapt can provide notable efficiency gains relative to manual coding.

The application presented in this paper focuses on the extraction and classification of implementation challenges from private sector evaluation reports using machine learning techniques. In many ways it is similar to the Delivery Challenges in Operations for Development Effectiveness platform developed for public sector operations by the Global Delivery Initiative. The Delivery Challenges data set uses Implementation Completion and Results Reports from completed projects to generate a taxonomy of common issues that have an impact on project performance.
Practitioners can then use insights from the data set to improve implementation and supervision outcomes.3 The experiment outlined in this paper offers a similar output for private sector operations, generating a set of implementation challenges representing specific obstacles encountered in the project cycle.

Endnotes
1. For example, one particular type of unsupervised method (topic modeling) can be used to extract central themes and topics from documents, something that can be useful for parsing as well as classification (Blei 2012).
2. For example, unsupervised methods can be used to identify a latent construct represented in clusters of text that contain common words related to a particular construct, such as women's empowerment, poverty, or democracy.
3. For more on the taxonomy, see Ortega Nieto, Hagh, and Agarwal (2022).

2. CLASSIFICATION AND SYNTHESIS OF EVALUATIVE EVIDENCE IN THE INDEPENDENT EVALUATION GROUP'S FINANCE AND PRIVATE SECTOR EVALUATION UNIT

Objectives

IEG, an independent department within the World Bank Group, is charged with evaluating the activities of the World Bank (the International Bank for Reconstruction and Development and the International Development Association), the International Finance Corporation (IFC), and the Multilateral Investment Guarantee Agency. Specifically in regard to IFC's work, IEG conducts desk-based exercises to validate IFC's Investment Project Reports (Expanded Project Supervision Reports) and its Advisory Project Reports (Project Completion Reports). Three objectives drive the analysis outlined in this paper: (i) to support accountability by assessing the relevance, efficiency, and effectiveness of IFC's projects; (ii) to support organizational learning by identifying lessons from experience to improve IFC's operational performance; and (iii) to reinforce corporate objectives and values among IFC staff members.

Problem Statement

Automating the analysis of private sector project evaluations serves two major goals. First, we aim to build an automatic classifier so that the vast quantity of existing information in evaluation documents can be efficiently categorized according to distinct clusters of issues and challenges encountered in project implementation. Given the issues raised in chapter 1 related to the inefficiency of manual categorization, such an endeavor represents an intuitive next step in the parsing of evaluative evidence. Second, properly trained machine learning applications can help overcome issues related to intercoder reliability and evaluator subjectivity in classification. Based on the challenges of efficiency and accuracy discussed in the preceding chapter, automated classification and synthesis of project insights presents a viable solution for optimizing both the reliability and the objectivity of project analysis. The following sections summarize our methodological strategy, outline our implementation, and summarize our results.

Methodology

Using a combination of human expert knowledge and unsupervised- and supervised-learning algorithms (including naïve Bayes, random forest, support vector machine, and multilayer neural network methods), we generated a taxonomy of factors and issues that private sector projects typically encounter in regard to implementation.
Approximately 1,600 documents evaluating private sector projects, produced between 2008 and 2022, provided our source input data for generating this taxonomy.

First, experts (IEG sector leaders with subject area expertise in the evaluation of projects in the financial, infrastructure, manufacturing, agribusiness, and services, and funds sectors) discussed and shared the main factors and issues they faced in the development sectors in which they worked. We took the list of issues produced by the IEG sector leaders to conceptually account for the bulk of implementation issues private sector projects face throughout their life cycle. The sector leaders then manually classified these issues into five broad categories (country, market, sponsor, project specific, and other). Table 2.1 summarizes the taxonomy.

Table 2.1. Taxonomy of Project Insight (Categories)
- Country and macro factors
- Market, sector, and industry factors
- Sponsor or client (management, sponsorship, and leadership)
- Project-inherent challenges
- Other
Source: Independent Evaluation Group.

Drawing on evaluation documents (specifically, IEG Evaluative Notes), we then extracted terms and concepts that were relevant to the issues identified, creating a matrix of keywords that was used to refine the experts' draft taxonomy. In parallel, we applied automated text categorization to a list of more than 10,000 paragraphs to uncover potential subcategories from the corpus of supplied text. We used two unsupervised methods to complement the manual identification of conceptual categories.

First, we used latent Dirichlet allocation (LDA) to find mixtures of terms for salient topics in the text. An evaluation officer compared the topics and key terms generated by LDA with the existing categories in our taxonomy and found that four LDA-generated topics matched concepts identified by subject area experts. Keywords from those topics were added to the list of terms that would be used to identify those categories.1

Second, we used Google's Word2vec model, which represents each term as a unique vector.2 The model can easily identify similar word combinations in common contexts by measuring their spatial proximity to generate clusters of concepts that are relevant to the analysis being undertaken. Figure 2.1 shows the Word2vec cluster for the concept of "expertise." Using the interactive dashboard in the TensorBoard application, we then inputted keywords from our LDA topics and visualized the resulting word-proximity vectors in three-dimensional manifolds. Next, we compared the conceptual clusters generated by Word2vec with an existing keyword list created by an evaluation officer. Paired together, the resulting keyword matrix and modeled topics provided a preliminary assessment of the distribution of issues relating to private sector projects, how frequently they occurred in the documents, and how salient they were to the results the projects obtained.

Figure 2.1. Word2vec Word Cluster for "Expertise"
Source: Independent Evaluation Group.

This assessment fed back into the manual review of issue areas to help disaggregate the 5 categories into a set of 51 subcategories. The resulting taxonomy is shown in table 2.2. Keywords associated with each of the subcategories were further refined by subject area experts to produce a training set for supervised machine learning classification.
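As an illustration of these two unsupervised steps, the sketch below (a toy example, not IEG's actual code) runs LDA and Word2vec over a stand-in corpus using the gensim library; `tokenized_paragraphs` is a hypothetical placeholder for the preprocessed paragraphs described in the next section.

```python
# Illustrative sketch of the two unsupervised steps, assuming gensim.
# `tokenized_paragraphs` stands in for the preprocessed corpus of
# evaluation-document paragraphs (a list of token lists).
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

tokenized_paragraphs = [
    ["fare", "subsidy", "regulator", "license"],
    ["sponsor", "expertise", "management", "track", "record"],
    ["currency", "depreciation", "exchange", "rate"],
]

# Step 1: LDA surfaces mixtures of terms for salient topics, which
# experts can then compare against the draft taxonomy.
dictionary = Dictionary(tokenized_paragraphs)
corpus = [dictionary.doc2bow(p) for p in tokenized_paragraphs]
lda = LdaModel(corpus, num_topics=3, id2word=dictionary, random_state=0)
print(lda.print_topics())

# Step 2: Word2vec embeds each term as a vector; spatial proximity then
# suggests clusters of related concepts (as in the "expertise" cluster).
w2v = Word2Vec(tokenized_paragraphs, vector_size=50, min_count=1, seed=0)
print(w2v.wv.most_similar("expertise", topn=3))
```

On a corpus this small the output is meaningless; the point is only the shape of the workflow, with expert review sitting between the model output and the taxonomy.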
The classification procedure involved the following steps. First, we prepared the paragraphs for analysis by applying stemming, lemmatization, decapitalization, and stop-word removal. This eliminated small words such as "a," "the," and "and" and broke terms down to their roots (for example, terms such as "history" and "historical" would be reduced to the common stem "histor-"). We also removed special characters and numbers to reduce the text to a set of cleaned "tokens" that could be used for classification according to the training set.

The training set was used for classification according to a set of different algorithms (naïve Bayes, random forest, support vector machine, and multilayer neural network), which were compared to assess their relative performance in classifying a new sample of paragraphs according to the subcategories generated. Given that the same sentence in an evaluation document could potentially be tagged with multiple relevant keywords, we used multilabel and multioutput text classification to cluster the keywords. Based on the results of this testing, we decided to use naïve Bayes for categorizing paragraphs, and specifically, for assigning a probability that a particular paragraph would be assigned to a particular category in the taxonomy.3

This approach was used to classify paragraphs in the corpus of 1,600+ documents, with the system generating some 85,000 classified paragraphs overall. To allow categorization of paragraphs to more than one theme, the classification assigned a primary, secondary, and tertiary subcategory alongside a probability of assignment to each.4 As an additional measure to aid categorization, we also used sentiment analysis to assign a score to each paragraph, ranging between –1 (totally negative; the paragraph includes information on a factor or issue that is a barrier or impediment to project implementation) and +1 (totally positive; the paragraph includes information on a factor or issue that contributes to success in project implementation). This analysis was carried out using polarity scores from a Python natural language processing package.
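A hedged sketch of this pipeline appears below. It assumes NLTK, scikit-learn, and TextBlob; the training paragraphs and labels are invented for illustration, TextBlob merely stands in for the polarity scorer the team used, and the full multilabel setup is simplified to reading primary, secondary, and tertiary assignments off the naïve Bayes probabilities.

```python
# Sketch of preprocessing, naive Bayes classification, and sentiment
# scoring, assuming NLTK (requires nltk.download("stopwords")),
# scikit-learn, and TextBlob. Sample texts and labels are illustrative.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from textblob import TextBlob

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def preprocess(paragraph):
    # Decapitalize, strip non-letters, drop stop words, and stem to tokens.
    tokens = re.findall(r"[a-z]+", paragraph.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stops)

train_texts = [
    "Inflation and a weak macroeconomic environment reduced earnings.",
    "The regulator cancelled the company's operating license.",
    "The sponsor showed strong technical expertise and a solid track record.",
]
train_labels = ["economic factors", "legal or regulatory factors",
                "technical expertise, track record, and capacity"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([preprocess(t) for t in train_texts])
model = MultinomialNB().fit(X, train_labels)

# Score a new paragraph: the three highest-probability subcategories act
# as the primary, secondary, and tertiary assignments.
new = "Exchange rate volatility and austerity measures eroded profitability."
probs = model.predict_proba(vectorizer.transform([preprocess(new)]))[0]
top3 = sorted(zip(model.classes_, probs), key=lambda p: -p[1])[:3]
print(top3)

# Sentiment polarity in [-1, +1], the range used in the paper.
print(TextBlob(new).sentiment.polarity)
```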
Table 2.2. Taxonomy of Project Insight (Categories and Subcategories)

Country and macro factors
- Civil unrest and armed conflict: Factors related to civil unrest, armed conflict, and war
- Economic factors: Factors related to the macroeconomic environment, inflation, monetary policy, or austerity measures
- Epidemics and COVID-19: Factors related to epidemics (human, animal, and plant) and COVID-19
- Expropriation, nationalization, and transferability: Factors related to expropriation, nationalization, transfer, and convertibility
- Foreign exchange and local currency factors: Factors related to currency fluctuation, exchange rate, and local currency issuance instruments
- Legal or regulatory factors: Factors related to regulatory policies, government, legislation, and bureaucratic mechanisms
- Natural disasters: Factors related to natural disasters such as hurricanes and earthquakes
- Political factors: Factors related to the political environment, including legislative and electoral dynamics

Market, sector, and industry factors
- Business factors: Factors related to business model, cyclical business, or the operating environment
- Competition: Factors related to market competition: barriers to entry, monopolies, market dominance, and penetration
- Customers: Factors related to identifying correct target markets and clientele
- Market share: Factors related to market share
- Pricing: Factors related to price elasticity, supply, and marginal gains

Sponsor or client (management, sponsorship, and leadership)
- Capacity, capitalization, leverage: Factors related to sponsor capacity, capitalization, and leverage
- Commitment and motivation: Factors related to the strength and valence of strategic alignment, including compatibility, motivation, and ownership
- Conflicts of interest, corporate governance: Factors related to minority interest, conflicts of interest, and corporate governance
- Integrity, transparency, fairness, reputation: Factors related to integrity and transparency, such as disclosures of sensitive ethical issues, irregularities, and negative public perceptions
- Organizational structure: Factors related to organizational culture, institutional procedures, policies, and accountability
- Technical expertise, track record, and capacity: Factors related to the quality and expertise of the management team, their technical skills and track record, and contractor competency, familiarity, and acumen
- Succession: Factors related to succession, especially in family-owned businesses

Project-inherent challenges
- Asset quality: Factors related to asset quality
- Cost overruns and delays: Factors related to overruns or delays
- Earnings and profitability: Factors related to earnings and profitability
- Environment and sustainability: Factors related to environmental standards, social health and safety parameters, or other safety standards
- Expansion: Factors related to acquisition, modernization, and expansion
- Funding: Factors related to funding
- Greenfield: Factors related to greenfield projects
- Gender: Factors related to gender
- Liquidity: Factors related to liquidity
- Technology: Factors related to changes in technology that affected project performance
- Training, know-how, and implementation: Factors related to training and know-how

Other
- Additionality principle and catalytic role (a): Factors related to additionality and added value
- Coordination and collaboration with World Bank Group, other DFIs, donors, and other external stakeholders: Factors related to combined partnership and collaboration among the various stakeholders: the World Bank Group, donors, DFIs, and other external stakeholders
- Coordination and collaboration within IFC (AS-IS): Factors related to use of investment and advisory services to enhance IFC roles and contributions
- Project scoping and screening; country and stakeholder assessment; client needs assessment: Factors related to ex ante market analysis, due diligence, and consumer preferences
- Client selection, commitment, and capacity: Factors related to client or implementing-partner selection (appropriateness) and client commitment and involvement
- Project design: Factors related to project design
- Financial model, project cost, and sensitivity assumptions: Factors related to financial modeling assumptions, including issues regarding overambitious objectives, deviations from forecasting estimates, and scaling
- Market assessment: Factors related to market assessment, market analysis, and consumer preferences
- Resources and timeline: Factors related to staffing, budget, and timeline
- Supervision and reporting: Factors related to (i) supervision and reporting; and (ii) taking measures to enhance these, as well as proactive client and stakeholder follow-up
- Sensitivity analysis: Factors related to sensitivity analysis, worst-case scenarios, stress tests, and risks to achieving development outcomes
- Documentation: Factors related to the quality of monitoring, documentation, and reporting
- Loan issues: Factors related to loan agreements, operating policies, breaches, and technical defaults
- Relationship management: Factors related to the quality and scope of relationship management, including fruitful and proactive engagements with on-site staff
- Debt issues: Factors related to debt issues, such as syndication, repayment, security, and refinancing
- Equity issues: Factors related to equity, valuation, and shareholder rights
- Financial risk mitigation: Factors related to risk-mitigation mechanisms such as guarantees, securities, prepayment penalties, and restructuring mechanisms
- Prepayments: Factors related to prepayments
- Monitoring and evaluation: Factors related to compliance and monitoring, including measurement, reporting, auditing, the monitoring and evaluation plan and framework, appropriate indicators and targets, and clarity of data collection and evaluation approach
- Other issues: Factors related to other issues

Source: Independent Evaluation Group.
Note: a. The latest guidance on additionality can be found at https://km.ifc.org/sites/pnp/MainDocumentMigration/DI716AdditionalityFramework.pdf. AS = advisory services; DFI = development finance institution; IFC = International Finance Corporation; IS = investment services.

Model Refinement

It should be noted that the initial classification exercise yielded low-accuracy results. This may be related to two possible causes. First, the unrefined taxonomy originally included 81 subcategories, before the manual validation described earlier. This meant that many subcategories were too sparsely populated to enable accurate identification of themes.
Second, some of the keywords selected for use in classification occurred too commonly in evaluation documents to provide meaningful information for the models. By their nature, some of the themes included in the taxonomy overlapped conceptually. For example, the subcategories "client selection, commitment, and capacity" and "monitoring and evaluation" could be considered integral parts of the category "project-inherent challenges" as well as of the category "other," where they appear in our taxonomy. This required manual review to separate the themes (where possible) and refine the keywords.

Given the large number of subcategories generated in the taxonomy, several steps were taken to iteratively refine it to improve classification precision and relevance. This yielded the smaller taxonomy of 51 subcategories shown in table 2.2. First, the subject area experts addressed deficiencies by formulating new subcategories or deleting irrelevant or less-frequently occurring ones, expanding or consolidating categories when needed, and updating definitions. This helped us to avoid including catchall categories that would make the resulting classifications of issues discussed in project documents less meaningful.5 Likewise, the removal of subcategories with very few observations helped make the taxonomy more manageable.6 At the same time, the training set was refined to eliminate catchall words and phrases to improve classification precision. For example, manual classification led to more than 15 percent of the initial paragraphs being assigned to the subcategory "IFC work quality." We therefore assessed this subcategory as a catchall and divided it into several different subcategories, such as "market assessment," "sensitivity analysis," and "financial model, project cost, and sensitivity assumptions."

Streamlining and refinement of model subcategories also involved additional diagnostics like cosine similarity. Cosine similarity analysis is a heuristic method for assessing the distinctiveness of the vocabulary associated with a particular concept and can be used to identify categories that are problematically correlated with each other. Cosine similarity was used to find areas where underlying keywords or phrases used in conceptually distinct topics created issues in regard to classification accuracy: although the topics themselves might be conceptually distinct, the use of similar terms to identify relevant passages would result in overlaps among groups that reduce classification accuracy. In the case of high similarity scores, we checked keywords and categories to ensure that the groups identified in the taxonomy were (to the extent possible) mutually exclusive. After a few iterations, we were able to eliminate several categories with problematic overlaps, further improving the subcategories in the taxonomy.

The model refinement process offered three main benefits. First, it ensured that most categories were reasonably well balanced with respect to the number of paragraphs classified into them. Second, it improved the quality and informativeness of text tags and examples used in classification. Third, it generated sufficient observations per subcategory to allow for exploratory and descriptive statistical analysis of lesson categories. After this recalibration, the subcategory with the maximum number of paragraphs represented about 6 percent of the total population of paragraphs, and the average subcategory included about 2 percent.
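The cosine similarity diagnostic just described can be sketched as follows; this assumes scikit-learn, and the subcategory keyword lists and the 0.2 review threshold are hypothetical, chosen only to show how overlapping vocabularies get flagged for manual review.

```python
# Illustrative cosine-similarity check between the keyword vocabularies
# of subcategories, assuming scikit-learn; high scores flag pairs whose
# terms overlap enough to threaten classification accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One "document" of keywords per subcategory (hypothetical examples).
keyword_sets = {
    "client selection, commitment, and capacity":
        "client commitment capacity selection partner involvement",
    "commitment and motivation":
        "commitment motivation ownership alignment compatibility",
    "market share": "market share penetration sales volume",
}
names = list(keyword_sets)
X = TfidfVectorizer().fit_transform(keyword_sets.values())
sims = cosine_similarity(X)

# Report pairs above a (hypothetical) threshold for manual review.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if sims[i, j] > 0.2:
            print(f"review overlap: {names[i]} <-> {names[j]} "
                  f"({sims[i, j]:.2f})")
```

Flagged pairs are then candidates for merging, redefinition, or keyword pruning, which is the manual step the text describes.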
Classification accuracy improved to an average of about 70 percent across the refined subcategories.

Summary of Results

The results of the automated classification and synthesis procedure were compared against hand-coded samples generated by subject experts. Table 2.3 provides an illustration of the results of this analysis.

Table 2.3. Comparison of Hand-Coded (Human) and Machine-Coded Classification

Human coding
- Factor 1 (legal or regulatory factors): Lack of a properly regulated public transportation system led to uncertainty and high risk regarding the setting of fares and payment of subsidies.
- Factor 2 (political factors): Effective nationalization of [Company X] within the country operation. Cancellation of license (Country CDE Operation).

Machine coding
- Factor 1 (legal or regulatory factors): Lack of a properly regulated public transport system (at the national or municipal level) leads to uncertainty and therefore high risk regarding the setting of fares and payment of subsidies. The project was expected to have a demonstration effect for other governments and municipalities and encourage similar public-private partnerships.
- Factor 2 (legal or regulatory factors): An attempt could be to have the legal agreement (between the government agency and the company) subject to an outside jurisdiction. It needs to ensure that there is a functioning regulatory authority that determines the amount and timing of fare increases and subsidy payments. This should be (and act) as legally independent of local and/or national governments.
- Factor 3 (legal or regulatory factors): Subsidy turned out to be critical for the project. Take the form of international law governing the documents, or the presence of a strong independent regulatory authority in an environment where the judiciary is also strong and independent. If no effort to protect the project is undertaken, then it is subject to the changing whims of local regulators.
- Factor 4 (legal or regulatory factors): [Company X] could not meet its performance targets owing to "operational and regulatory difficulties with the regulator" as the government refused to pay the subsidies agreed upon or increase the agreed-upon tariffs.
- Factor 5 (political factors): Nationalization of [Company X] and cancellation of the license smacks of political interference and sets a lasting, negative effect which would deter future private investment in the public transport sector in both countries.
- Factor 6 (political factors): The project was structured through the parent operation and provided some insulation against project-level risks. Nevertheless, from a development perspective this oversight exposed the project to high and unmitigated political risk.
- Factor 7 (political factors): The political movement had a significant political and financial impact on the country, with (among other things) several national government changes. It is very difficult to structure a project so that it achieves its development objectives while going through a once-in-a-generation political and social revolution.
- Factor 8 (political factors): The project relied on two important factors: (i) subsidies from FGH and (ii) implementation of agreed tariff increases.
The subsidy only amounted to a small portion of receipts from traffic violations and thus this was not seen as an issue. Without control mechanisms, the project was entirely reliant on political will, which is uncertain at best and was completely lacking after the political movement.
- Factor 9 (expansion): [Company X] was to invest approximately US$[X] million to modernize their facilities and expand their fleet. The loan was disbursed in two tranches.
- Factor 10 (expansion): [Company X] planned to invest US$[X] million, most of it in the form of a capital increase. Additional investment as well as capital provided by the existing shareholders to modernize its facilities and expand its fleet.
- Factor 11 (expansion): [Company X], as part of an expansion plan, signed an agreement to invest US$[X] million through a capital increase. The capital increase would be used toward financing a capital expenditure program over the coming years with modern maintenance facilities, as well as a major fleet renewal and expansion.
- Factor 12 (additionality principle and catalytic role): The project went ahead without adequately mitigating development risks (as distinct from the credit risks), as both deserve equal attention given the corporate mandate and purpose.
- Factor 13 (additionality principle and catalytic role): It was expected that the project would have a strong developmental impact with increased transport access for the urban poor and the disabled, leading to improvements in service levels overall. In addition, the project was expected to encourage other governments and municipalities to create public-private frameworks.

Source: Independent Evaluation Group.
Note: Firm names and specific dollar amounts are withheld for reasons of confidentiality.

As expected, the model showed a high degree of accuracy in classifying content into well-defined subcategories such as "legal or regulatory factors," "political factors," and "market share," whereas classifications into less well-defined categories such as "commitment and motivation" yielded a higher number of false positives. Overall, classification according to supervised machine learning techniques offered clear advantages over manual classification of factors and issues in project implementation. Manual classification relies on individual practitioners, each drawing on a set of unique theoretical priors, influenced by knowledge and experience that could potentially affect the way they search evaluation documents for factors and issues in implementation. Furthermore, human coders focus on high-level or highly salient issues with greater frequency, potentially ignoring substantively meaningful but more subtle features that evaluation documents may also discuss. Drawing on a vetted training subset, supervised learning generated considerably higher classification efficiency than human coding with a comparable degree of accuracy. Properly calibrated machine analysis produced faster and more efficient synthesis of evaluative evidence.
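A minimal sketch of how such a human-machine comparison can be scored follows; it assumes scikit-learn, and the two label lists are invented stand-ins for coded paragraphs, not IEG data.

```python
# Scoring agreement between human and machine subcategory labels,
# assuming scikit-learn; the label lists below are illustrative only.
from sklearn.metrics import accuracy_score, classification_report

human = ["legal or regulatory factors", "political factors",
         "political factors", "expansion", "economic factors"]
machine = ["legal or regulatory factors", "legal or regulatory factors",
           "political factors", "expansion", "economic factors"]

# Overall share of paragraphs where the machine matches the human label.
print(accuracy_score(human, machine))
# Per-subcategory precision and recall help locate weak subcategories.
print(classification_report(human, machine, zero_division=0))
```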
After this initial test was undertaken, IEG undertook a wider analysis, with both human coders and algorithms classifying content in more than 170 Evaluative Notes published between 2020 and 2022 across four industries in which IFC funds projects (Financial Institutions Group; Manufacturing, Agribusiness, and Services; Infrastructure and Natural Resources; and Disruptive Technologies and Funds). Human coders were asked to record (i) the top three factors (taxonomy subcategories) that explained the success or failure of a project in terms of achieving its desired development outcome, ordered from most important to least important; (ii) the direction in which each factor (subcategory) affected project success (+1 if the factor supported project success, –1 if the factor presented a risk affecting the project); and (iii) a copy of the paragraph from the Evaluative Note that supported why a factor (subcategory) was chosen.

Once the initial coders had classified the content in their project documents, a specialist or sector leader validated the classifications, as a form of peer review intended to make classification consistent across the four IFC industry groups. There was also an additional review across industries to make sure that classifications were consistent over the total portfolio of Evaluative Notes analyzed.

After human coders had classified the content in the Evaluative Notes and their classifications had been reviewed as discussed in the preceding paragraph, the same machine learning protocol was applied to the content. The average accuracy of machine-generated classification was about 70 percent across the subcategories evaluated, with classification in some subcategories, such as "economic factors," achieving greater than 90 percent accuracy.7

To ensure the relevance and adaptability of our machine learning model against the evolving risk landscape, model performance is assessed periodically to reflect new evidence and adapt the subcategories in our taxonomy. Rigorous quality and change control procedures are in place to ensure the robustness, stability, and reliability of the model output.

Limitations

No methodology is without flaws, and machine learning is no exception to that rule. This section outlines some of the limitations of the approach explored in this paper. As discussed earlier, the inclusion of many overly granular subcategories resulted in low accuracy rates, especially in areas where there were very few observations to help classify a particular concept. We addressed issues of excessive granularity through a refinement of problematic subcategories. In addition, the use of diagnostics like cosine similarity ensured that the remaining categories were conceptually exclusive. However, this also implied that some of the nuances requested by subject experts and practitioners had to be omitted from the taxonomy. In those cases, the subcategories were often too subtle or complex to allow for accurate classification.

The output of a supervised model is only as good as the reliability of the training data inputted. There are numerous pathways to suboptimal machine classification, but sufficient diligence and meticulous calibration of input parameters can guard against the more pernicious errors and biases. If overarching categories in the taxonomy were not well defined or not mutually exclusive, the machine learning algorithm had difficulty categorizing content into them accurately. Two examples illustrate this point.
First, the model initially omitted the classification of factors and issues related to advisory services projects. When it became clear that the initial taxonomy was insufficiently equipped to classify such factors and issues, we modified the subcategories to address the omission. Once pertinent examples of such factors and issues had been provided to train the model, machine learning was then successfully used to identify other instances of similar issues. Second, the model initially used overly broad keywords, such as "commitment" in the subcategory "commitment and motivation." This resulted in an overestimation of challenges related to that subcategory, as commitment can mean "the state or quality of being dedicated to a cause, activity, and so on" but can also mean "obligation to provide a pledged amount of capital." Its prevalence in evaluation reports therefore made it an inefficient classifier for machine learning applications. In both cases, we identified and corrected for this type of error through cross-validation of the output data and by providing the machine learning algorithm with examples instead of keywords.

Endnotes
1. Though relatively efficient, the latent Dirichlet allocation approach often generated groupings without a clearly interpretable significance. While these clusters could have represented potential categories, they were more likely a by-product of random associations without significant substantive meaning. We therefore omitted them from the analysis.
2. Google developed Word2vec to reconstruct the linguistic context of sentence fragments. It maps inputted text data into a vector space.
3. We used the four algorithms to classify paragraphs that human experts had previously classified, and the algorithm with results closest to those of the manual classification was naïve Bayes.
4. For example, in cases in which a paragraph spoke exclusively about "economic factors," the probability for that subcategory would be 100 percent, and the probability for the next two categories would be 0 percent. In one example in which the majority of the paragraph was about economic factors, the probabilities assigned were 70 percent for "economic factors," 20 percent for "foreign exchange and local currency factors," and 10 percent for "legal or regulatory factors."
5. After refinement, average per-subcategory inclusion rates approached 2.0 percent, and the most broadly defined subcategory had an inclusion rate of 6.0 percent. To correct for the inclusion of frequent but substantively uninformative categories, we normalized the frequency with which categories were predicted by dividing the number of predictions for a particular category by the overall distribution of the predicted categories in the universe of coded keyword tags. We then chose the categories that had greater than 1.2 times the average along the distribution. This yielded a workable hierarchy of the most salient factors included in each document.
6. We eliminated any subcategories that included fewer than 50 paragraphs or merged them with conceptually proximate categories to increase identification accuracy. For example, we merged the subcategories "conflicts of interest" and "corporate governance," as we found that they were both capturing similar concepts and each accounted for less than 1 percent of total paragraphs classified.
CONCLUSION

This paper has discussed the advantages and challenges of using machine learning in evaluative synthesis; more specifically, it has looked at the identification and classification of project-level implementation factors and issues. Our analysis showed that with the right combination of manual and automated approaches, machine-learning-based information classification can lead to significant efficiency gains without loss of accuracy in information extraction and classification. Indeed, the incorporation of quality control practices can even result in gains in accuracy in certain cases. We discussed the concrete experience of IEG’s Finance and Private Sector Evaluation Unit as a basis for a systematic discussion of this process. We first discussed the principles for generating a taxonomy for classification. We then applied a combination of unsupervised and supervised learning techniques to generate word clusters, keywords, and examples from evaluation documents as features for classification. These were integrated into a taxonomy and used to classify the features into multiple categories of factors and issues.

Following several rounds of cross-validation and calibration, we were able to achieve classification accuracy rates comparable to those achieved by human coders in this field (about 70 percent accuracy) but at substantially higher levels of efficiency, because the model we designed can perform the classification task at a much faster rate than human coders. As expected, our model classified features into well-defined subcategories such as “legal or regulatory factors,” “political factors,” and “market pricing” with much higher accuracy (that is, fewer incorrect classifications) than into broader subcategories such as “commitment and motivation.” Where we had specified subcategories imprecisely, the model had greater difficulty converging on the correct subcategories into which to classify the features. The use of overly broad keywords also initially resulted in misclassification errors. Subsequent refinements to the model and inputs from subject experts helped improve the training data, enabling the model to generate more relevant tags for the features it classified.

Currently, the output of our extraction and classification process is captured in a data visualization tool (based on the Tableau platform), which generates descriptive statistics on implementation factors and issues disaggregated by geographic area and private sector industry. In addition, the output is used for writing synthetic evaluative analyses.
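As a minimal sketch of how such descriptive statistics can be derived from classified output, the following Python snippet aggregates hypothetical classification records by industry, subcategory, and direction. The records and column names are invented for illustration and do not reflect the schema of IEG’s actual dataset.

    # Minimal sketch: aggregate classified paragraphs into descriptive statistics.
    # Records and column names are hypothetical, not IEG's actual schema.
    import pandas as pd

    records = pd.DataFrame(
        [
            {"region": "Africa", "industry": "Infrastructure and Natural Resources",
             "subcategory": "economic factors", "direction": -1},
            {"region": "Africa", "industry": "Financial Institutions Group",
             "subcategory": "legal or regulatory factors", "direction": -1},
            {"region": "East Asia", "industry": "Financial Institutions Group",
             "subcategory": "market pricing", "direction": 1},
            {"region": "East Asia", "industry": "Infrastructure and Natural Resources",
             "subcategory": "economic factors", "direction": -1},
        ]
    )

    # Count how often each factor appears by industry, split by whether it
    # supported (+1) or hindered (-1) project success.
    summary = (
        records.groupby(["industry", "subcategory", "direction"])
        .size()
        .rename("mentions")
        .reset_index()
    )
    print(summary)

A table like this can then be loaded into a dashboard tool such as Tableau for interactive disaggregation by region or industry.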
The inclusion of readily accessible and searchable parameters for factors and issues allows project practitioners in the Bank Group to observe commonalities and patterns across large numbers of successful and unsuccessful projects and to disaggregate the output by sectoral or regional factors where useful. This combination of features thus allows the model to leverage decades of institutional experience in project implementation and apply it to both synthetic evaluative analysis and project design more efficiently and systematically than has been possible before.

As with any other form of analysis, the accuracy of our model’s results is contingent on the quantity and quality of the input data, as well as on adequate supervision and cross-validation. Given these conditions, automated parsing and tagging of project information shows promise as an intuitive improvement over a manual approach. The output from our taxonomy allows evaluators to access the entire universe of project insights from all available project evaluations and learn about salient factors influencing project performance. With future revisions and refinements to the taxonomy (particularly the inclusion of more examples in the training set), the classification accuracy rates achieved by the model should continue to improve. Taken together, the gains in efficiency and data accessibility that result from the use of machine learning techniques will allow evaluators and practitioners to better incorporate lessons from the past into future practice.

BIBLIOGRAPHY

Abdellatif, M., W. Atherton, R. Alkhaddar, and Y. Osman. 2015. “Flood Risk Assessment for Urban Water System in a Changing Climate Using Artificial Neural Network.” Natural Hazards 79 (2): 1059–77. https://doi.org/10.1007/s11069-015-1892-6.

Aggarwal, C. C., and C. Zhai, eds. 2012. Mining Text Data. New York: Springer Science and Business Media. https://doi.org/10.1007/978-1-4614-3223-4.

Ali, M. Z. 2007. “The Application of the Artificial Neural Network Model for River Water Quality Classification with Emphasis on the Impact of Land Use Activities: A Case Study from Several Catchments in Malaysia.” PhD thesis, University of Nottingham, Nottingham, UK. https://eprints.nottingham.ac.uk/11867/.

Bail, C. A. 2015. “Commentary: Lost in a Random Forest; Using Big Data to Study Rare Events.” Big Data & Society 2 (2). https://doi.org/10.1177/2053951715604333.

Blei, D. M. 2012. “Probabilistic Topic Models.” Communications of the ACM 55 (4): 77–84. https://doi.org/10.1145/2133806.2133826.

Burlig, F., C. Knittel, D. Rapson, M. Reguant, and C. Wolfram. 2017. “Machine Learning from Schools about Energy Efficiency.” NBER Working Paper 23908, National Bureau of Economic Research, Cambridge, MA. https://www.nber.org/papers/w23908.

Burscher, B., R. Vliegenthart, and C. H. De Vreese. 2015. “Using Supervised Machine Learning to Code Policy Issues: Can Classifiers Generalize across Contexts?” Annals of the American Academy of Political and Social Science 659 (1): 122–31. https://doi.org/10.1177/0002716215569441.

Camillo, F., and I. D’Attoma. 2010. “A New Data Mining Approach to Estimate Causal Effects of Policy Interventions.” Expert Systems with Applications 37 (1): 171–81. https://doi.org/10.1016/j.eswa.2009.05.072.

Cimiano, P., A. Pivk, L. Schmidt-Thieme, and S. Staab. 2005.
“Learning Taxonomic Relations from Heterogeneous Sources of Evidence.” In Ontology Learning from Text: Methods, Evaluation and Applications (Frontiers in Artificial Intelligence and Applications, vol. 123), edited by P. Buitelaar, P. Cimiano, and B. Magnini, 59–73. Amsterdam: IOS Press.

Dayan, P., M. Sahani, and G. Deback. 1999. “Unsupervised Learning.” In The MIT Encyclopedia of the Cognitive Sciences, edited by R. A. Wilson and F. C. Keil. Cambridge, MA: MIT Press.

Franzen, S., C. Quang, L. Schweizer, A. Budzier, J. Gold, M. Vellez, S. Ramirez, and E. Raimondo. 2022. Advanced Content Analysis: Can Artificial Intelligence Accelerate Theory-Driven Complex Program Evaluation? IEG Methods and Evaluation Capacity Development Working Paper Series. Independent Evaluation Group. Washington, DC: World Bank. https://ieg.worldbankgroup.org/methods-resource/advanced-content-analysis-can-artificial-intelligence-accelerate-theory-driven-complex.

Galindo, J., and P. Tamayo. 2000. “Credit Risk Assessment Using Statistical and Machine Learning: Basic Methodology and Risk Modeling Applications.” Computational Economics 15 (1): 107–43. https://doi.org/10.1023/A:1008699112516.

Gertler, P. J., S. Martinez, P. Premand, L. B. Rawlings, and C. M. J. Vermeersch. 2016. Impact Evaluation in Practice. 2nd ed. Washington, DC: Inter-American Development Bank and World Bank. http://hdl.handle.net/10986/25030.

Ghahramani, Z. 2004. “Unsupervised Learning.” In Advanced Lectures on Machine Learning, edited by O. Bousquet, U. Luxburg, and G. Rätsch, 72–112. Berlin: Springer. https://link.springer.com/book/10.1007/b100712.

Grimmer, J., and B. M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97. https://www.jstor.org/stable/24572662.

Hillard, D., S. Purpura, and J. Wilkerson. 2008. “Computer-Assisted Topic Classification for Mixed-Methods Social Science Research.” Journal of Information Technology and Politics 4 (4): 31–46. https://doi.org/10.1080/19331680801975367.

Ittoo, A., L. M. Nguyen, and A. van den Bosch. 2016. “Review: Text Analytics in Industry; Challenges, Desiderata and Trends.” Computers in Industry 78: 96–107. https://doi.org/10.1016/j.compind.2015.12.001.

Jean, N., M. Burke, M. Xie, M. Davis, D. B. Lobell, and S. Ermon. 2016. “Combining Satellite Imagery and Machine Learning to Predict Poverty.” Science 353 (6301): 790–94. https://doi.org/10.1126/science.aaf7894.

McBride, L., and A. Nichols. 2018. “Retooling Poverty Targeting Using Out-of-Sample Validation and Machine Learning.” The World Bank Economic Review 32 (3): 531–50.

Mueller, C. E., H. Gaus, and I. Konradt. 2016. “Predicting Research Productivity in International Evaluation Journals across Countries.” Journal of MultiDisciplinary Evaluation 12 (27): 79–92. https://doi.org/10.56645/jmde.v12i27.459.

Ofli, F., P. Meier, M. Imran, C. Castillo, D. Tuia, N. Rey, J. Briant, et al. 2016. “Combining Human Computing and Machine Learning to Make Sense of Big (Aerial) Data for Disaster Response.” Big Data 4 (1): 47–59. https://doi.org/10.1089/big.2014.0064.

Okori, W., and J. Obua. 2011. “Machine Learning Classification Technique for Famine Prediction.” In Proceedings of the World Congress on Engineering 2011 (London, July 6–8, 2011), vol. 2, edited by S. I. Ao, L. Gelman, D. W. L. Hukins, A. Hunter, and A. M. Korsunsky, 991–96.
Hong Kong SAR, China: International Association of Engineers. https://www.iaeng.org/publication/WCE2011/.

Ortega Nieto, D., A. Hagh, and V. Agarwal. 2022. “Delivery Challenges and Development Effectiveness: Assessing the Determinants of World Bank Project Success.” Policy Research Working Paper 10144, World Bank, Washington, DC. http://hdl.handle.net/10986/37902.

Padmanabhan, J. P. 2015. “Applying Machine Learning Techniques to the Analysis of Policy Data of the Military Health Enterprise.” Master’s thesis, Massachusetts Institute of Technology, Cambridge, MA. https://dspace.mit.edu/handle/1721.1/106270.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research 12 (February): 2825–30. https://doi.org/10.5555/1953048.2078195.

Popper, K. 2002. The Logic of Scientific Discovery. London: Routledge. https://doi.org/10.4324/9780203994627.

Rennie, J. D. M., L. Shih, J. Teevan, and D. R. Karger. 2003. “Tackling the Poor Assumptions of Naive Bayes Text Classifiers.” In ICML ’03: Proceedings of the Twentieth International Conference on Machine Learning (Washington, DC, August 2003), edited by T. Fawcett and N. Mishra, 616–23. Washington, DC: AAAI Press. https://doi.org/10.5555/3041838.3041.

Rona-Tas, A., A. Cornuéjols, S. Blanchemanche, A. Duroy, and C. Martin. 2019. “Enlisting Supervised Machine Learning in Mapping Scientific Uncertainty Expressed in Food Risk Analysis.” Sociological Methods & Research 48 (3): 608–41. https://doi.org/10.1177/004912411772970.

Ruz, G. A., S. Varas, and M. Villena. 2013. “Policy Making for Broadband Adoption and Usage in Chile through Machine Learning.” Expert Systems with Applications 40 (17): 6728–34. https://doi.org/10.1016/j.eswa.2013.06.039.

Samuel, A. L. 1959. “Some Studies in Machine Learning Using the Game of Checkers.” IBM Journal of Research and Development 3 (3): 210–29. https://doi.org/10.1147/rd.33.0210.

Schmidt, S., S. Schnitzer, and C. Rensing. 2016. “Text Classification Based Filters for a Domain-Specific Search Engine.” Computers in Industry 78: 70–79. https://doi.org/10.1016/j.compind.2015.10.004.

Sebastiani, F. 2002. “Machine Learning in Automated Text Categorization.” ACM Computing Surveys 34 (1): 1–47. https://doi.org/10.1145/505282.505283.

Tanguy, L., N. Tulechki, A. Urieli, E. Hermann, and C. Raynal. 2016. “Natural Language Processing for Aviation Safety Reports: From Classification to Interactive Analysis.” Computers in Industry 78: 80–95. https://doi.org/10.1016/j.compind.2015.09.005.

Tong, S., and D. Koller. 2002. “Support Vector Machine Active Learning with Applications to Text Classification.” Journal of Machine Learning Research 2 (March): 45–66. https://doi.org/10.1162/153244302760185243.

Zheng, Y., H. Zheng, and X. Ye. 2016. “Using Machine Learning in Environmental Tax Reform Assessment for Sustainable Development: A Case Study of Hubei Province, China.” Sustainability 8 (11): 1124. https://doi.org/10.3390/su8111124.

Ziulu, V., J. Meckler, G. Hernández Licona, and J. Vaessen. 2022. Poverty Mapping: Innovative Approaches to Creating Poverty Maps with New Data Sources. IEG Methods and Evaluation Capacity Development Working Paper Series. Independent Evaluation Group. Washington, DC: World Bank. https://ieg.worldbankgroup.org/evaluations/poverty-mapping-innovative-approaches-creating-poverty-maps-new-data-sources.