Guidance Note
Balancing Innovation and Rigor: Guidance for the Thoughtful Integration of Artificial Intelligence for Evaluation
5/13/2025

Summary
Within the evolving landscape of artificial intelligence, large language models (LLMs), a type of generative artificial intelligence, offer significant potential for improving the collection, processing, and analysis of large volumes of text data in evaluation. In this note, we present key lessons and good practices for leveraging LLMs based on our recent experiments. The experiments’ results reveal that the LLMs tested could perform text classification quite well, achieving satisfactory recall, precision, and F1 scores. The models also performed well on tasks such as text summarization and synthesis, achieving high scores on metrics related to relevance, coherence, and faithfulness of the generated text. However, challenges remain in ensuring completeness and relevance in information extraction and text synthesis tasks. We found iterative prompt validation and refinement, measurement of model performance with relevant metrics, and representative sampling to be important considerations to ensure the success of these applications. We hope this document will serve as a practical resource for multidisciplinary teams across evaluation departments seeking to responsibly integrate LLMs into their workflows while maintaining analytical rigor.

Keywords
Artificial intelligence; data science; evaluation; generative artificial intelligence; large language model; natural language processing.

This publication was jointly produced by the Independent Evaluation Group (IEG) of the World Bank (WB) and the Independent Office of Evaluation (IOE) of the International Fund for Agricultural Development (IFAD).

Contents
Key Takeaways
Abbreviations
Acknowledgments
Introduction
Key Considerations for Experimentation
  Identifying Use Cases
  Identifying Opportunities Within Use Cases
  Finding Agreement on Resources and Outcomes
  Selecting Appropriate Metrics to Measure LLMs’ Performance
Our Experiments and Results
Emerging Good Practices
  Representative Sampling
  Developing an Initial Prompt
  Evaluating Model Performance
  Refining Prompts
Going Forward
Bibliography

Figures
Figure 1. Structured Literature Review Workflow
Figure 2. Prompting and Validation Loop

Tables
Table 1. Assessment Criteria
Table 2. Our Four Experiments
Table 3. Experiment Results for Discriminative Task
Table 4. Experiment Results for Generative Tasks

Key Takeaways
Identify relevant use cases. Thoughtful experimentation begins with identifying evaluation methods in which LLMs can be integrated to add significant value compared with traditional approaches within the same resource constraints. Leveraging LLMs will not be suitable for every use case; therefore, it is essential to align experiments with those use cases where LLMs’ capabilities can be leveraged effectively.

Plan workflows within use cases. Breaking down use cases into detailed steps and tasks helps teams understand where and how to apply LLMs effectively. This modular design also allows for the reuse of successful components within and across use cases.

Understand and agree on resource allocation and outcomes. Teams must clearly understand and agree on the necessary resources and expected outcomes for an experiment. This includes human resources (evaluator, data scientist, research design and domain experts), technology, timeline, and a definition of success for each experiment.

Form an appropriate sampling strategy. A robust sampling strategy is essential, such as dividing a data set into training, validation, testing, and prediction sets to facilitate effective prompt development and model evaluation. Such division can help a team refine prompts iteratively and assess their generalizability, ultimately leading to more aligned responses from LLMs.

Select appropriate model evaluation metrics. Selecting and calculating metrics to measure LLM performance, along with appropriate intercoder reliability assessments for human-annotated data, is crucial to determine the success of an experiment.
For discriminative tasks such as text classification, standard machine learning metrics such as recall, precision, and F1 scores can be useful. For generative tasks such as text summarization and synthesis, human assessment criteria such as faithfulness, relevance, and coherence can be meaningful.

Iteratively develop and validate prompts. Developing effective prompts involves iteratively testing and refining. For example, a team could start with a basic prompt and gradually add more specific instructions based on LLMs’ responses. Including requests for justification in prompts can provide insights into a model’s reasoning and help with prompt refinement.

Abbreviations
AI  artificial intelligence
GenAI  generative artificial intelligence
IEG  Independent Evaluation Group
LLM  large language model
SLR  structured literature review
All dollar amounts are US dollars unless otherwise indicated.

Acknowledgments
This guidance note was authored by Harsh Anuj, Hannah Den Boer, and Estelle Raimondo. Dawn Roberts, Jenny Gold, Mercedes Vellez, and Joy Butscher collaborated with the authors on the experiments. Jenny Gold and Ridwan Bello provided helpful comments on an earlier draft. Arunjana Das, Amanda O’Brien, Wendy Rubin, and William Stebbins assisted with the editing, production, and dissemination of the guidance note. The authors are grateful to Sabine Bernabè and Dr. Indran A. Naidoo for their support. Microsoft Copilot was leveraged during the production of this document.

Introduction
Within the evolving landscape of artificial intelligence (AI), large language models (LLMs)—a type of generative artificial intelligence (GenAI) for text (see Brown et al. 2020; Google 2025)—have the potential to enhance the efficiency, breadth, and validity of the collection, processing, and analysis of text as data in evaluation practice (see Raimondo et al. 2023a, 2023b, 2023c; Ziulu et al. 2024; Anuj et al. 2025).1 However, LLMs do not always generate aligned, authoritative, or accurate responses (see Ouyang et al. 2022; Martineau 2023; OpenAI 2024), which means that their responses must be validated before use in our work. Furthermore, the importance of analytical rigor in our practice, combined with our institutions’ ability to affect the lives of people around the world, makes it clear that we must take a thoughtful approach to integrating such tools.

1 Some LLMs, such as OpenAI’s GPT-4o, are inherently multimodal—that is, they can accept and/or generate images, speech, or other types of data along with text. See, for example, Huyen 2023 for a helpful description of multimodality.

How can we realize the potential of LLMs while maintaining rigor? This guidance note aims to answer that question by demonstrating good practices for experimenting with LLMs based on a frequently occurring use case in our evaluations: structured literature review (SLR). This use case serves as a concrete example of how LLMs can be thoughtfully integrated into evaluation workflows. Our findings are based on a series of on-the-job experiments conducted by the Independent Evaluation Group (IEG) over a two-month period in late 2024. These experiments were carried out within a multidisciplinary team comprising IEG and International Fund for Agricultural Development staff with expertise in evaluation, data science, and research design.

In the next section, Key Considerations for Experimentation, we describe how to identify relevant use cases and opportunities within use cases for the application of LLMs, the importance of finding agreement on resources and outcomes, and the selection of appropriate metrics to measure LLM performance.
The section includes a detailed workflow for an SLR, while the workflow for an evaluation synthesis is presented in the appendix, along with a more “traditional” SLR workflow. The section Our Experiments and Results presents the design and results of our experiments and includes tables summarizing the performance of LLMs on text classification, summarization, synthesis, and information extraction, as measured by selected metrics. The next section, Emerging Good Practices, offers guidance for developing effective prompts, creating subsets of data to compute model evaluation metrics, and refining prompts based on validation findings. Finally, in the last section, Going Forward, we discuss the ongoing journey of experimentation with AI in evaluation offices, emphasizing continuous learning, adaptation, and collaboration.

Key Considerations for Experimentation
Based on our experience, we identified the following key considerations to assess the potential for thoughtful integration of LLMs in use cases related to evaluative analyses and syntheses.

Identifying Use Cases
Thoughtful experimentation begins with careful planning and the identification of areas in which LLMs could add sufficient incremental value for a given set of resources and constraints (for example, staff, budget, time) compared with more traditional approaches to the analysis of text data. This foundational step ensures that experiments are purposeful and relevant. Although LLMs are quite versatile and seemingly all-knowing, their usefulness depends on the way they are applied for particular use cases. Misaligned experimentation risks wasting resources and compromising quality. Such use cases typically meet the following conditions: (i) the literature on LLMs (and/or previous work) identifies the case as having high-value applications, such as text classification, text summarization, sentiment analysis, and information retrieval (see Puri et al. 2019; Lewis et al. 2020; Gera et al. 2022; Alaofi et al. 2024; Glickman et al. 2024); and (ii) the current evaluation practice is either inefficient, ‘shallow’, or impossible due to the sheer volume of text.

For this guidance note, we built on the eight limited experiments on applications of LLMs for evaluation practice that we had carried out and published as a series of blogs (Raimondo et al. 2023a, 2023b, 2023c). We chose to focus on one of the two use cases that had yielded unimpressive results: SLRs. We also examined the other use case that had not worked well: evaluation synthesis. We expect LLMs to enhance the way in which these two important methods for our major evaluations are implemented.

Identifying Opportunities Within Use Cases
We learned from previous experiments that for complex use cases such as SLRs, it is important to unpack the various analytical steps and to carefully examine for what and how LLMs can be leveraged. This step requires the development of a granular understanding of the analytical steps involved, as well as the capabilities of LLMs. Although it is tempting, for example, to try to produce an SLR or evaluation synthesis with a few documents and simple prompts, we had found this approach to be unsuccessful.
Therefore, we started by creating a relatively detailed workflow—through a data science lens—for the various steps in the selected use cases. (For reference, we have provided a visualization of the workflow for a standard SLR as per IEG 2017 in the appendix.) In doing so, we found that the workflows for the two use cases (as well as other ones that interest us, such as portfolio review and analysis and interview transcript analysis) are broadly very similar. We also noticed that within the steps in the workflow, specific components can be repeated and provide opportunities to use the capabilities of LLMs that we know can work well (based on the literature and our previous applied work and experiments). These capabilities include (i) text classification, (ii) text summarization, (iii) text synthesis, and (iv) summative information extraction.

The workflow for SLR is provided in figure 1, and a workflow for the evaluation synthesis is provided in figure A.2. Both figures show that components such as text search, manual review, text classification, and LLM appear multiple times within and across workflows. This modularity is by design and assists with the identification of task-specific opportunities for successfully applying LLMs. (In practice, the steps can overlap because the process is iterative, with multiple feedback loops.) The modularity can also be helpful when developing similar workflows for other use cases, such as portfolio review and analysis and analysis of interview transcripts, which we are currently implementing at IEG. From a developer’s perspective, this modularity is helpful when developing Python code to semi-automate various steps with humans in the loop. It is important to note that the manual review component is mandatory in our workflows when LLMs or machine learning are used.

Figure 1. Structured Literature Review Workflow
Source: Independent Evaluation Group.
Note: LLM = large language model; SLR = structured literature review.

Figure 1 also shows that there are five moments that present opportunities to leverage LLMs: (i) when screening documents for inclusion in the review or synthesis based on their relevance to the topic; (ii) when extracting relevant information from documents; (iii) when annotating extracted text to various typologies; (iv) when summarizing annotated text within types; and (v) when synthesizing annotated text across types.

Finding Agreement on Resources and Outcomes
After completing the task of developing a clear road map for the application of LLMs in the use cases, team members need to harmonize expectations. Our experience shows it is important for all the team members to understand the types and amounts of resources required to undertake the experiments, and to arrive at a clear collective agreement on expected outcomes or what success would look like. This agreement is especially crucial given the multidisciplinary nature of the teams carrying out such experiments. Coming to a shared agreement on resources and outcomes can also help with dispelling or at least tempering the notion that working with LLMs is straightforward and inexpensive and will produce phenomenal results each time. In terms of types of resources, it is important to consider the availability of full-time staff, including data scientists, evaluators, subject matter experts, and research design specialists.
The technology needed to carry out the work should be identified and acquired, including compute to efficiently process large volumes of data, and budget to use proprietary LLMs via their respective application programming interfaces (APIs).2

2 To learn more about compute, see Amazon Web Services (n.d.-b). To learn about APIs, see Goodwin 2024. In our experiments, we used OpenAI’s proprietary GPT-4o model via their API as well as the playground. Access to compute, especially sophisticated NVIDIA graphics processing units (GPUs), is necessary for using open-source models directly. We conducted some tests with open small models from Mistral AI, Microsoft, Google, and Meta, but due to our limited access to GPUs at the time, we could not test the larger models that might be able to compete with GPT-4o. However, the cost for GPT-4o was not insignificant, and free, open models with similar performance would certainly be a strong choice going forward, for a variety of reasons, given that they can be securely integrated into an institution’s information technology systems.

Finally, it is important to define the expected outcomes from the use of LLMs, including what would be considered a successful or helpful application. Expected outcomes should be commensurate with the resources allocated. For example, in our application of an LLM in the identification step of an SLR, we agreed to consider it a success because the process allowed us to identify (via a semantic search), bulk download, and screen the full text of over 10,000 research papers for relevance in a short duration and with an acceptable level of accuracy (see Selecting Appropriate Metrics to Measure LLMs’ Performance). This application made the process significantly more efficient and comprehensive than a purely manual one would have been, while reducing the overall effort required.

Selecting Appropriate Metrics to Measure LLMs’ Performance
While the criteria for assessing whether an experimental application of LLMs for an evaluation use case is successful are subjective, it is important to think about clear dimensions to measure LLMs’ performance on more narrowly defined tasks, such as text classification, summarization, synthesis, and information extraction. Continuing with the SLR example, use of an LLM for literature identification (classification) with a recall score of 0.75 and a precision score of 0.6 could be considered a success in one evaluation, whereas in another evaluation recall and precision scores of 0.9 and 0.5, respectively, might be considered successful. However, to establish whether the applications were successful, the recall and precision scores need to be selected and computed first.

For the text classification task in our experiment, we leveraged standard machine learning model evaluation metrics: binary classification accuracy, recall, precision, balanced accuracy, and F1 scores. These metrics measure the degree of overlap between machine-annotated “predicted” labels and human-annotated “ground-truth” labels.3 Furthermore, we split the underlying samples of papers into distinct training, validation, testing, and prediction sets. As is standard practice in machine learning, the testing set was not used to develop or refine the prompts or other inputs to the process, which enabled us to compute unbiased estimates for our selected performance metrics with it (see Emerging Good Practices below).

3 When developing the “ground-truth” labels, it is important to take intercoder reliability into account.
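To make these metrics concrete, the following is a minimal sketch of how they can be computed with scikit-learn (Pedregosa et al. 2011), the source cited for these metrics in table 1 below. The label lists are hypothetical placeholders for human-annotated “ground-truth” labels and LLM-generated “predicted” labels, not results from our experiments.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score)

# Hypothetical labels for a testing set: 1 = relevant to the SLR topic, 0 = not relevant.
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # human-annotated "ground truth"
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # labels generated by the LLM

print("Accuracy:         ", accuracy_score(y_true, y_pred))
print("Recall:           ", recall_score(y_true, y_pred))
print("Precision:        ", precision_score(y_true, y_pred))
print("F1:               ", f1_score(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```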
However, for the text summarization, synthesis, and information extraction tasks, we did not develop a human benchmark to use for assessing responses. This was because we did not apply these tasks to a real evaluation use case, and therefore did not have the resources to produce human-annotated data.4 In the absence of a directly comparable “ground truth,” how can we assess the quality of model responses? We used the following criteria—faithfulness, relevance, and coherence—which can provide comprehensive and accurate feedback, as they allow for a subjective assessment of the generated texts’ alignment with an evaluation task’s objective and an evaluator’s expectations.

4 A human-generated reference text also offers the option to leverage relevant model evaluation metrics for natural language generation such as BLEU, METEOR, and ROUGE.

• Faithfulness measures whether the information generated is factually consistent with the information in the source or not (see Durmus et al. 2020; Zhang et al. 2024).
• Relevance measures whether the selected content from the source is the most important content following the prompt (see Fabbri et al. 2021; Zhang et al. 2024).
• Coherence measures the overall collective quality of the sentences: the response text should build from sentence to sentence into a coherent body of information about a topic (see Fabbri et al. 2021; Zhang et al. 2024).

Table 1 provides details on the above criteria. To determine what minimum values for each metric would be acceptable for the application to be considered a success, we took a context-specific approach. For literature identification (a classification task), recall and precision scores higher than 0.6 and 0.7, respectively, were deemed necessary. This was due to two factors: (i) the conceptual complexity of the classification task, owing to the complexity of the SLR topics, and (ii) the class imbalance in the underlying search results from the Semantic Scholar open data platform (Kinney et al. 2023).5 Similarly, users can determine what values of the metrics measuring faithfulness, relevance, and coherence would be satisfactory for their tasks. For use cases with higher stakes (where, for example, a real-world decision must be made using LLMs’ responses, even in part), higher values would be required.6 Finally, it is important to note that human judgments on Likert scales can vary; therefore, it is recommended that evaluators measure and report interrater agreement through a metric such as Cohen’s kappa (see McHugh 2012).

5 That is, the results from the Semantic Scholar bulk search API contained a high proportion of false positives. This was an intended outcome of our strategy for the initial search. We kept our search terms relatively broad to maximize recall (see IEG, Forthcoming).

6 Only the text classification task was used for an evaluation, so no practical thresholds were set in advance for the text summarization, text synthesis, and information extraction use cases, as these were applied to purely experimental tasks.

Table 1. Assessment Criteria

Criterion | Definition | Assessment scale | Source(s) | Task(a)
Faithfulness | Being factually consistent with information in the source document | 0 (unfaithful) or 1 (faithful); if 0, then H(b) or IC(c) | Durmus et al. 2020; Zhang et al. 2024 | Summarization, synthesis, extraction
Relevance | Selection of important content from the source document | Likert scale of 1–5 (1 = highly irrelevant, 5 = highly relevant) | Fabbri et al. 2021; Zhang et al. 2024 | Summarization, synthesis, extraction
Coherence | Collective quality of all sentences | Likert scale of 1–5 (1 = highly incoherent, 5 = highly coherent) | Fabbri et al. 2021; Zhang et al. 2024 | Summarization, synthesis
Binary classification accuracy score | Fraction of correct classifications | 0 (completely inaccurate) to 1 (completely accurate) | Pedregosa et al. 2011 | Classification
Precision score | TP / (TP + FP)(d) | 0 (completely inaccurate) to 1 (completely accurate) | Pedregosa et al. 2011 | Classification
Recall score | TP / (TP + FN)(d) | 0 (completely inaccurate) to 1 (completely accurate) | Pedregosa et al. 2011 | Classification
Balanced accuracy score | Arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate) | 0 (completely inaccurate) to 1 (completely accurate) | Pedregosa et al. 2011 | Classification
F1 score | Weighted harmonic mean of precision and recall scores | 0 (completely inaccurate) to 1 (completely accurate) | Pedregosa et al. 2011 | Classification

Source: Independent Evaluation Group; Pedregosa et al. 2011; Durmus et al. 2020; Fabbri et al. 2021; Zhang et al. 2024.
Notes: a. Multiple metrics can, and should, be combined to assess the results of a particular task, as discussed earlier in this section. b. H = hallucination (that is, information expressed is not contained in the source). c. IC = incorrect concatenation (that is, information expressed conflicts with the source). d. TP = true positive; FP = false positive; TN = true negative; FN = false negative.
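As a small illustration of how human ratings against these criteria might be recorded and aggregated, the following sketch uses pandas; the ratings shown are hypothetical and simply follow the scales in table 1 (binary faithfulness, 1–5 Likert scores for relevance and coherence), producing aggregate scores of the kind reported in the next section.

```python
import pandas as pd

# Hypothetical human ratings for five generated summaries, following table 1's scales.
ratings = pd.DataFrame({
    "doc_id":       ["d1", "d2", "d3", "d4", "d5"],
    "faithfulness": [1, 1, 0, 1, 1],   # 0 = unfaithful, 1 = faithful
    "relevance":    [5, 4, 4, 5, 5],   # Likert 1-5
    "coherence":    [5, 5, 4, 5, 5],   # Likert 1-5
})

# Aggregate scores across all rated responses.
print(ratings[["faithfulness", "relevance", "coherence"]].mean())

# Flag unfaithful responses for manual follow-up (classify each failure as H or IC by hand).
print(ratings.loc[ratings["faithfulness"] == 0, "doc_id"].tolist())
```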
Our Experiments and Results
Given the reuse of components within and across the SLR and evaluation synthesis workflows—as well as the resources and time required to undertake such experiments in practice for a major evaluation at IEG—we did not conduct experiments for the full SLR workflow or the evaluation synthesis workflow. Instead, we focused on robustly testing the components of the literature identification step, including LLM-based text classification, for an SLR in an ongoing IEG thematic evaluation of the World Bank Group’s support for epidemic preparedness (World Bank, forthcoming). We then used random samples from identified literature to conduct experiments with text summarization, text synthesis, and information extraction. Table 2 provides details on the design of our experiments.

Table 2. Our Four Experiments

Item | Task | Sample | Model response | Unit of scoring | Model and parameters
Text classification | Binary classification to identify literature on private sector engagement in epidemic preparedness | 30 papers in test set; selected via text clustering | Categorization and justification for each paper | Each categorization response | OpenAI GPT-4o model via API; temperature = 0.0
Text summarization | Generation of abstracts from full papers | 30 papers; selected randomly from search results | Abstract for each paper | Each generated abstract | OpenAI GPT-4o-mini model via API; temperature = 0.0
Text synthesis | Generation of a synthesis from six summaries on private sector engagement in epidemic preparedness | Six summaries of 200 words each; selected randomly from the text summarization results | One 500-word synthesis | Each of the five paragraphs; each paragraph included the pattern, the examples, and a conclusive overarching sentence | OpenAI GPT-4o-mini via playground; temperature = 0.0
Information extraction | Extraction of information on public–private sector engagement in epidemic preparedness contained in papers; three types of information were to be extracted: actors, mechanism, and goals | 12 papers; selected randomly from the validation set used in one of the text classification tasks | 57 responses returned (three categories for 19 examples, as one paper could contain multiple examples) | Each response per paper (that is, three responses per paper) | OpenAI GPT-4o-mini model via API; temperature = 0.0

Source: Independent Evaluation Group.
Note: API = application programming interface.

Tables 3 and 4 summarize the results for each experiment.

Table 3. Experiment Results for Discriminative Task

Task | Accuracy | Recall | Precision | F1 | Balanced accuracy
Text classification (testing set) | 0.90 | 0.75 | 0.60 | 0.67 | 0.67

Source: Independent Evaluation Group.
Note: We assume here that a ‘discriminative task’ is one for which the required response is in the form of a decision regarding the appropriate category for an observation. See also the entry for discriminative models in Google (2025) for a definition.

Table 4. Experiment Results for Generative Tasks

Task | Faithfulness (IC) | Faithfulness (H) | Relevance | Coherence
Text summarization | 0.90 | 1.00 | 4.87 | 4.97
Text synthesis | 1.00 | 1.00 | 4.20 | 5.00
Information extraction | 1.00 | 1.00 | 3.25 | n.a.

Source: Independent Evaluation Group.
Notes: We assume here that ‘generative tasks’ are those for which the required response from a model is in the form of a narrative. See also the entry for generative model in Google (2025) for a definition (or lack thereof). n.a. = not applicable because the responses only included one sentence. IC = incorrect concatenation (that is, information expressed conflicts with the source); H = hallucination (that is, information expressed by the model is not contained in the reference text).

As can be seen in tables 3 and 4, the LLMs we tested generally performed quite well in each of the generative tasks based on the metrics used. The model responses for the text summarization task were remarkably relevant, coherent, and faithful. The high relevance score (4.87) shows that the abstracts generated contained the most important information, often outperforming original abstracts where those were present. A coherence score of 4.97 highlights the ability to produce unified, logically connected responses, whereas a faithfulness score of 0.90 reflects strong factual alignment, with only some isolated issues with incorrect aggregation of information. Importantly, no hallucinations were observed.

For the information extraction task, faithfulness was excellent: information was accurately retrieved (faithfulness incorrect concatenation [IC] = 1.00), and no hallucinations took place (faithfulness hallucination [H] = 1.00). However, the relevance score (3.25) shows that the model had difficulty extracting the most relevant information from the papers, particularly in identifying specific requested details, and omissions of relevant information were noted.
In the text synthesis task, which was a summary of summaries, information was accurately retrieved (faithfulness IC = 1.00), and no hallucinations took place (faithfulness H = 1.00).7 Additionally, the LLM correctly referenced the respective summaries it had used to produce the 500-word synthesis more than 10 times, as we had stipulated in the prompt. However, some relevant information was omitted, hence the lower relevance score of 4.20.

The text classification task yielded strong results after multiple iterations to refine the prompt using the validation set. Given the complexity of the task owing to the topics of the literature review, the need to keep overfitting in check,8 and the efficiency introduced by the overall workflow, the recall and precision scores of 0.75 and 0.60, respectively, were deemed satisfactory in this particular use case (see Liu et al. 2018). Indeed, the use of the same workflow and prompt format for different SLR subtopics yielded helpful results, likely due to the use of representative sampling based on a semi-supervised learning strategy (see Géron 2019; Liu et al. 2018) that supported generalizability.

7 Because the synthesis was conducted with summaries of the source documents, the results were likely better compared with what we might have achieved by synthesizing the source texts directly.

8 For more information about overfitting, see Google (2024).

Emerging Good Practices
Given that our experiments yielded satisfactory results, sometimes after a few or many iterations, we identified some good practices that helped us achieve useful results. Most of our guidance in this section focuses on the prompting and validation loop because this is an important factor for achieving satisfactory results on our selected LLM evaluation metrics (see Shin et al. 2020 for a discussion on the importance of prompting).9 Figure 2 describes this iterative process. This guidance is based on our work on the various experiments described earlier. These practices emerged as ones that contributed to satisfactory results in this set of experiments and were identified by us either during this work or during our past work with LLMs.

9 Various steps before the application of LLMs are important and were applied in our experiments—for example, an efficient and accurate retrieval system before LLM application (see Lewis et al. 2020) and minimizing context length (see Liu et al. 2024), among others. See IEG, Forthcoming for more details on the methodology.

Figure 2. Prompting and Validation Loop
Source: Independent Evaluation Group.

As is standard practice in machine learning, the data set on which a prompt is applied to get the desired response should first be divided into training, validation, testing, and prediction sets.10 The training set consists of a few human-annotated examples that are included in the prompt for the model to learn from when analyzing each unlabeled observation. The validation set consists of several human-annotated examples on which the prompt is applied, and model evaluation metrics are established. If these metrics are found to be unsatisfactory for the context of the task, then the prompt is refined until the results for the validation set are deemed satisfactory. Then, the prompt that provided the best results on the validation set is applied on a testing set and metrics are computed once more. Further prompt refinement is not done at this stage. The values of the metrics from the testing set allow us to assess the prompt’s generalizability on observations that differ from those in the validation set and provide an unbiased picture of the accuracy we can expect on the unlabeled prediction set. If the values of the metrics from the testing set are found to be unsatisfactory, then the whole exercise should be restarted, and a different set of observations should be included in the new testing set to avoid data leakage (see Mucci 2024 for more information on data leakage). Finally, if or when the metrics for the test set are deemed satisfactory, the prompt is applied to the unlabeled prediction set. These results then need to be manually screened for relevance. Ideally, model evaluation metrics should also be computed for this set, at least for a sample of 30 randomly selected observations. This approach will give the truest assessment of the model’s performance and might provide lessons to improve accuracy in future work.

10 The observations in the first three sets must be annotated by humans. While the first is used to provide examples to the model, the second and third sets serve as the “ground truth” against which model evaluation metrics will be calculated. This annotation requires the judgment of at least one subject matter expert. Once again, calculating measures of intercoder reliability for the human-annotated data set is important for putting LLM performance metrics into context. For example, if two human coders only agree on 80% of the labels, and an LLM achieves 75% accuracy on the labels from one annotator, we might want to accept it as good performance.
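The loop in figure 2 can be sketched in code as follows. This is a schematic outline rather than our actual implementation: classify and refine stand in for an LLM call and a human prompt-revision step, the data sets are assumed to be dictionaries with "texts" and "labels" keys, and the thresholds are illustrative.

```python
from sklearn.metrics import precision_score, recall_score

def scores(y_true, y_pred):
    return {"recall": recall_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred)}

def satisfactory(s, min_recall=0.6, min_precision=0.6):
    # Context-specific thresholds, agreed on by the team before the experiment.
    return s["recall"] >= min_recall and s["precision"] >= min_precision

def prompting_validation_loop(prompt, classify, refine,
                              train, validation, test, prediction, max_rounds=10):
    """Schematic version of the prompting and validation loop in figure 2.
    classify(prompt, examples, texts) calls the LLM and returns one label per text;
    refine(prompt, diagnostics) is a human-in-the-loop prompt revision step."""
    # 1. Iterate on the validation set: apply the prompt, score it, refine if needed.
    for _ in range(max_rounds):
        val_scores = scores(validation["labels"],
                            classify(prompt, train, validation["texts"]))
        if satisfactory(val_scores):
            break
        prompt = refine(prompt, val_scores)

    # 2. Score the frozen prompt once on the held-out testing set; no refinement here.
    test_scores = scores(test["labels"], classify(prompt, train, test["texts"]))
    if not satisfactory(test_scores):
        raise RuntimeError("Unsatisfactory test metrics: restart with a new testing set to avoid leakage.")

    # 3. Apply the prompt to the unlabeled prediction set; responses are then screened manually.
    return classify(prompt, train, prediction["texts"])
```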
Representative Sampling
As mentioned in the previous section, it is advisable to split the data set into four distinct sets before developing even an initial prompt. Taking the following steps will ensure that the model evaluation metrics help improve the generalizability of the prompts to the prediction set.

First, understand the distribution of your input data. Understanding the basic nature of your input data (for example, the text of research papers returned by an initial search) can be helpful throughout the process, including for setting and managing expectations. Simple characteristics such as the extent of homogeneity or heterogeneity of the input documents can be informative. For example, if there are multiple, relatively distinct topics in the scope of your literature review, instead of trying to identify papers for all topics at once, work on one topic at a time. Or, if the topic of your review is very broad, you can use text clustering with document embeddings to identify topic clusters, split your scope into multiple topics, and work on each in turn.11

11 For information about clustering, see OpenAI (2022).

Second, identify and include representative observations. For example, you can identify approximately 55 of the most representative documents in your set of unlabeled documents by using purposeful sampling or document clustering.12 We used the latter technique in our application.13 Ask your subject matter expert to annotate these documents and include around 5 of the most representative documents as examples in the training set (that is, in the prompt), the next 20 most representative ones in the validation set, and around 30 of the next most representative ones in the testing set. The remaining unlabeled documents should be automatically assigned to the prediction set. (See Liu et al. 2018 for why this component can be helpful; a condensed sketch of the approach follows the notes below.)

12 The number can be higher if your documents tend to be long compared with the LLM’s context length, and vice versa.

13 We first used Semantic Scholar’s bulk search API (Kinney et al. 2023) to identify a long list of potentially relevant papers for each topic of the SLR. The search was performed using a set of queries with Boolean logic, developed iteratively by the team, along with filters for document type, date range, and so on. The Semantic Scholar results contained hyperlinks to open-access full-paper PDF files where available. We then scraped these files from their respective hyperlinks. Then, we split the papers into smaller chunks under the token limit of OpenAI’s text-embedding-3-large model (that is, 8,091 tokens) and retrieved the 3072-dimensional embeddings for each chunk. Then, we took the mean of the embedding vectors across each paper’s chunks to arrive at document-level aggregate embeddings. We then used the scikit-learn implementation of the k-means clustering algorithm (MacQueen 1967) to cluster the documents in the 3072-dimensional embedding space. Then, we identified the documents closest to each cluster’s theoretical centroids as the cluster centroid proxies and included those in our various subsets. The work was conducted using the open-source Python programming language and various user-contributed libraries. Full details of the methodology will be shared in IEG, Forthcoming.
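The sketch below condenses the clustering-based sampling described in note 13, assuming document-level embedding vectors have already been computed. The function name, set sizes, and the simplification of treating cluster-centroid proxies in arbitrary order as “most representative” are illustrative assumptions, not our exact implementation.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def representative_split(doc_ids, embeddings, n_train=5, n_val=20, n_test=30, seed=0):
    """Pick cluster-centroid proxies as representative documents and split them into
    training (prompt examples), validation, and testing sets; all remaining documents
    go to the prediction set. `embeddings` is an (n_docs, dim) array."""
    n_representative = n_train + n_val + n_test
    km = KMeans(n_clusters=n_representative, random_state=seed, n_init=10).fit(embeddings)

    # The document closest to each theoretical centroid serves as that cluster's proxy.
    proxy_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, embeddings)
    proxy_idx = list(dict.fromkeys(proxy_idx))  # drop accidental duplicates, keep order

    train = [doc_ids[i] for i in proxy_idx[:n_train]]
    val = [doc_ids[i] for i in proxy_idx[n_train:n_train + n_val]]
    test = [doc_ids[i] for i in proxy_idx[n_train + n_val:n_representative]]
    labelled = set(train + val + test)
    prediction = [d for d in doc_ids if d not in labelled]
    return train, val, test, prediction
```

The training, validation, and testing documents returned by such a function would then be annotated by the subject matter expert before any prompting begins.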
This sampling strategy has several advantages, as it allows us to conduct model performance assessments across meaningful categories of documents.14 First, it ensures semantic diversity of the samples. By sampling from multiple clusters in the high-dimensional text embedding space, we ensure our model evaluation and prompt refinement span a range of semantic contexts rather than over-representing dominant classes, as might happen with random sampling of skewed data, which can lead to biased values for model performance metrics. Second, it bolsters interpretability and supports prompt refinement. Evaluating model performance across clusters reveals strengths and weaknesses of the prompt in specific types of cases, which is especially helpful when relying on prompts for classification, as it allows us to address specific issues by adjusting the prompt format and/or content. Furthermore, using prototypical examples from clusters for prompt refinement can increase its effectiveness for different types of observations. Lastly, this sampling strategy also helps to avoid sampling of near-duplicate or highly similar documents.

14 That is, the document clusters tend to group together documents that have similar semantic properties. In other words, the meanings of the words in the documents within the same cluster are similar or related to the extent that they are close to each other in the embedding space. This happens due to the richness of semantic information captured by high-quality, high-dimensional text embeddings.

Developing an Initial Prompt
A good prompt for an instruction-tuned LLM (see Bergmann 2024b) typically includes the following components or sections: (i) the persona to be adopted by the model (for example, evaluation analyst); (ii) detailed instructions for the task the model must undertake; (iii) the relevant text with the context in which the instructions should be carried out; and (iv) requirements such as the length and format of the response.
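A minimal sketch of how these four components might be assembled into a single prompt string is shown below. The persona, instructions, examples, and response requirements are hypothetical placeholders rather than the prompts we used; our actual template is shown in figure 3.

```python
def build_prompt(persona, instructions, examples, item, response_format):
    """Assemble a prompt from the four components described above.
    `examples` is a list of (text, label) pairs drawn from the training set."""
    example_block = "\n\n".join(
        f"Example:\nText: {text}\nLabel: {label}" for text, label in examples
    )
    return (
        f"{persona}\n\n"
        f"Instructions:\n{instructions}\n\n"
        f"{example_block}\n\n"
        f"Now classify the following item.\nText: {item}\n\n"
        f"Response requirements: {response_format}"
    )

prompt = build_prompt(
    persona="You are an evaluation analyst screening research papers for a structured literature review.",
    instructions="Read the title and abstract, compare them with the inclusion criteria, "
                 "decide whether the paper is relevant, and justify your decision.",
    examples=[("<title and abstract of a relevant paper>", "relevant"),
              ("<title and abstract of an irrelevant paper>", "irrelevant")],
    item="<title and abstract of the unlabeled paper>",
    response_format='Reply in JSON: {"label": "relevant" | "irrelevant" | "unknown", "justification": "..."}',
)
```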
There are various community-produced resources online for how to craft the best prompts (that is, prompt engineering; see Google 2025), and best practices change as new models emerge, so we do not include general prompting tips in this guidance note and instead focus on the following specific considerations that worked well in our experiments.15

15 See DAIR.AI 2025 for a helpful prompt engineering guide.

Check the model’s prompt template. Different models (and, at times, model versions) require slightly different templates for the prompts that they can understand, so make sure to check and adapt a prompt to the specific template once you have selected a model (see Amazon Web Services n.d.-a).

Break down the task into specific steps. Be explicit about the steps the model needs to undertake to follow your instructions, a technique known as chain-of-thought prompting (see Gadesha et al. 2025). For example, if you provide the LLM with the titles and abstracts of research papers to classify based on their relevance to your SLR topic, then it would be helpful to mention in the prompt that you will give the model the titles and abstracts of the research papers and that it should first read these text fields, then compare them with the classification criteria and instructions provided to it, then make its classification decision, and then respond in the requested format.

Try different prompt formats. It can be worthwhile to experiment with different prompt formats before applying a prompt to the validation set for assessment and refinement. The format of a prompt refers to the types of information it includes and the order in which information is included. Both are crucial. For example, the format of our final text classification prompt for literature identification involved starting with defining the persona for the model to adopt, followed by a high-level overview of the task, then detailed instructions, then a few labeled examples, and finally the unlabeled example for the model to classify. Our template is shown in figure 3.

Include a request for justification. Due to the opaque nature of LLMs’ inner workings, it is not possible for humans to interpret their “decision-making process”16 or to understand how exactly they arrived at their responses. This challenge can be mitigated to some extent by including instructions in the prompt for the model to justify its reasoning in its responses.17 This technique is helpful in prompt refinement and the manual verification of model responses, though it also has some limitations (see Chen et al. 2025).

16 Humans obviously understand broadly how LLMs work, since we designed and developed them.

17 See World Bank 2024, app B.

Include representative examples across categories. Including a few highly representative examples in the prompt is critical for ensuring that the model generates relevant responses, a process known as in-context learning (see Zewe 2023) or multi-shot prompting (see Anthropic n.d.). Aim to use at least five examples, depending on the model’s context length (see Bergmann 2024a). For the literature identification task, we included approximately five representative, manually labeled papers in our prompts, with at least one relevant and one irrelevant example.

Include a request for references. Asking the model to include references to the source document(s) in its response can help with prompt refinement. For instance, if you want the model to generate a synthesis of 20 summaries, be clear that it should cite the specific summaries in its response. The references can be in the form of summaries of the key points from the reference text.
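The following sketch ties several of the above considerations together—persona, explicit steps, a request for justification, and a machine-readable response—in a call to GPT-4o via the openai Python package (v1 interface) with temperature 0.0, as in our experiments. The criteria and placeholder text are hypothetical, and this is not the exact prompt we used.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

messages = [
    {"role": "system",
     "content": "You are an evaluation analyst screening research papers for a structured literature review."},
    {"role": "user",
     "content": (
         "First read the title and abstract below, then compare them with the inclusion criteria, "
         "then decide whether the paper is relevant, and then respond in JSON with the keys "
         '"label" ("relevant", "irrelevant", or "unknown") and "justification" (quote the sentences you relied on).\n\n'
         "Inclusion criteria: <criteria>\n\nTitle and abstract: <text of the unlabeled paper>"
     )},
]

response = client.chat.completions.create(
    model="gpt-4o",                           # model used in our experiments
    temperature=0.0,                          # favor deterministic, reproducible responses
    response_format={"type": "json_object"},  # request machine-readable output
    messages=messages,
)
result = json.loads(response.choices[0].message.content)
print(result["label"], "-", result["justification"])
```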
Provide “unknown” or “not applicable” as a category. A limitation of closed LLMs such as OpenAI’s GPT-4o is that they are configured to always generate some response, however unlikely it might be given the model’s training data. This implies that the model may generate speculative results when it encounters insufficient or low-quality instructions or input data. To mitigate this issue, provide an “unknown” or “not applicable” option, which allows the model to acknowledge when it does not have sufficient information to carry out the given instructions.

Include a desired response format. It is useful to indicate the desired format in which the LLM should structure its responses. For instance, for the information extraction task, we instructed the model to deliver responses strictly in a specific JSON format, which facilitated the transfer of the responses into a table.

Check edge cases. Review the model’s responses on infrequent or highly ambiguous cases to understand the limits of the model’s performance in such contexts.

Figure 3. Prompt Format for Literature Identification
Source: Independent Evaluation Group.

Evaluating Model Performance
As mentioned in previous sections, a manual review of model responses is necessary when using LLMs in real evaluation use cases. This section offers some points to keep in mind when developing a strategy for evaluating model performance for an experiment or application.

Assess the faithfulness of responses. Regardless of the type of task for which an LLM is being used, the evaluator should review the faithfulness of the model’s responses. For example, in text classification, it is useful to assess the model’s justifications using this criterion. Lower-than-expected levels of faithfulness can indicate a flaw in the design of the task, such as very long contexts.

Set context-specific thresholds for selected metrics. Set clear thresholds for the respective model evaluation metrics and ensure that all relevant stakeholders agree with these thresholds, as the threshold defines the level of LLM performance that would be considered satisfactory. Refine the prompt or other aspects of the design until such results are achieved.

Use annotation and validation guidelines. To maintain consistency throughout the validation process, reviewers should use annotation guidelines in the form of a shared codebook (see, for example, Kallos 2023). The codebook should include the instructions that manual reviewers will need for tasks such as labeling observations for classification or assigning values to assessment metrics for summarization or synthesis.

Check intercoder reliability. During the processes of human data annotation or model response validation, despite the use of a detailed codebook or instructions, disagreements can arise between two or more coders. Calculating an intercoder reliability score such as Cohen’s kappa (Cohen 1960) is one way to “demonstrate the rigor of coding procedures” (Cheung et al. 2021, 1155) and can help evaluators settle on realistic target values for model performance metrics. In our experiments, the subject matter expert provided the labeled data for the text classification task. During prompt validation, the team discussed the cases where the model’s labels differed from the expert’s to arrive at a common understanding of the correct labels. In future experiments, we aim to capture this iterative process of arriving at “ground-truth” labels more systematically, for example by using and reporting metrics such as Cohen’s kappa.

Use a confusion matrix for text classification. A confusion matrix (see Murel 2024 for practical guidance) is helpful for summarizing the performance of a classification model because it displays key metrics of interest. This matrix can help an evaluator diagnose a model’s classification performance by displaying the number of results that are true positives, true negatives, false positives, and false negatives. Use this knowledge during prompt refinement (see Refining Prompts) based on what matters most for a use case. For example, in the SLR use case, we wanted to ensure a low rate of false negatives and could accept a higher rate of false positives because we wanted to ensure that we did not miss any relevant papers to include in our review.
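To make these two checks concrete, here is a small sketch using scikit-learn; the label lists are hypothetical placeholders, not data from our experiments.

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical labels: 1 = include the paper in the review, 0 = exclude it.
expert = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]        # subject matter expert ("ground truth")
model = [1, 0, 1, 1, 0, 0, 0, 0, 1, 1]         # LLM predictions
second_coder = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]  # second human coder, for intercoder reliability

# Confusion matrix for binary labels: rows are true labels, columns are predictions,
# so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(expert, model).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")

# Cohen's kappa between the two human coders puts the LLM's performance into context.
print("Intercoder kappa:", cohen_kappa_score(expert, second_coder))
```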
Refining Prompts
Use validation findings for prompt refinement. If the results on the validation set do not meet your expected or required threshold, analyze the cause of the inaccuracies and use your findings to refine the prompt. For example, you might notice that the model makes some incorrect assumptions, so you need to include instructions to avoid those. You can see what impact your changes to the prompt have on the confusion matrix for the validation set and adjust the prompt accordingly. For text classification, the confusion matrix serves as a critical tool to help the team understand the sources of errors (for example, false positives or false negatives).

Avoid creating convoluted prompts. As experimentation progresses, it is tempting to continually add instructions to prompts to address edge cases and improve performance. However, over time, doing so can lead to overly complex prompts with a patchwork of fixes, making the prompt susceptible to overfitting the validation set (see Google 2025).

Going Forward
In the World Bank and International Fund for Agricultural Development independent evaluation departments, we have embarked on a journey of experimentation with the application of AI in our practice. This journey is primarily about thoughtful risk taking, continuous learning and adaptation, and dialogue between staff with different areas of expertise. Learning to use AI is not a one-time effort but rather a continuous process of questioning, testing, learning, and refining. In this guidance note, we focused on two fundamental aspects of this journey: (i) defining and adapting our typical evaluation workflows to include LLMs where they fit best, and (ii) building trust through thorough performance testing (that is, adapting typical criteria of rigor to the specificity of LLM usage).

Further research, experimentation, and collaboration are needed to standardize and expand on frameworks for assessing the performance of LLMs in evaluation. Collaboration should include sharing experiences and findings from experiments and pilots across organizations and contexts. Much has already been written on the potential and perils of leveraging LLMs in research and analytical tasks, but it is in concrete, practical, context-specific experimentation that we can find out what works, what does not work, and under what circumstances something either works or does not. We are committed to keep exploring and sharing what we find as widely as possible.

Bibliography
Alaofi, Marwah, Negar Arabzadeh, Charles L. A. Clarke, and Mark Sanderson. 2024.
“Generative Information Retrieval Evaluation.” arXiv preprint, April 11. arXiv:2404.08137v3. https://arxiv.org/abs/2404.08137. Amazon Web Services. n.d.-a. “Prompt Templates and Examples for Amazon Bedrock Text Models.” Amazon Bedrock User Guide. Accessed May 5, 2025. https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-templates-and- examples.html. Amazon Web Services. n.d.-b. “What Is Compute?” Amazon Web Services. https://aws.amazon.com/what-is/compute/. Anthropic. n.d. “Use Examples (Multishot Prompting) to Guide Claude’s Behavior.” Prompt Engineering. Accessed May 5, 2025. https://docs.anthropic.com/en/docs/build-with-claude/prompt- engineering/multishot-prompting. Anuj, Harsh, Virginia Ziulu, Ariya Hagh, Estelle Raimondo, and Jos Vaessen. 2025. “World Bank IEG Evaluations and the Role of Data Science: Reflections from Recent Experiences.” In Artificial Intelligence and Big Data: Lessons from Evaluations of the Rule of Law and Development, edited by Frans L. Leeuw and Michael Bamberger, 231–251. Edward Elgar Publishing. https://www.elgaronline.com/edcollchap/book/9781803925677/chapter11.xml. Arize. 2025. “The Definitive Guide to LLM Evaluation: A Practical Guide to Building and Implementing Evaluation Strategies for AI Applications.” Arize AI. https://arize.com/llm-evaluation. Banerjee, Satanjeev, and Alon Lavie. 2005. “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.” In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72. Association for Computational Linguistics. https://aclanthology.org/W05-0909/. Bergman, Dave. 2024a. “What Is a Context Window?” IBM Think, November 7. https://www.ibm.com/think/topics/context-window. Bergmann, Dave. 2024b. “What Is Instruction Tuning?” IBM Think, April 5. https://www.ibm.com/think/topics/instruction-tuning. 20 Brown, Tom B., Benjamin Mann, Nick Ryder et al. 2020. “Language Models Are Few- Shot Learners.” arXiv preprint, May 28. arXiv:2005.14165v4. https://arxiv.org/abs/2005.14165. Chang, Yupeng, Xu Wang, Jindong Wang et al. 2024. “A Survey on Evaluation of Large Language Models.” ACM Transactions on Intelligent Systems and Technology 15 (3): Article 39, 1–45. https://doi.org/10.1145/3641289. Chen, Yanda, Joe Benton, Ansh Radhakrishnan et al. 2025. “Reasoning Models Don’t Always Say What They Think.” Anthropic. https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_pap er.pdf. Cheung, Diana. 2024. “An Introduction to LLM Evaluation: How to Measure the Quality of LLMs, Prompts, and Outputs.” Codesmith (blog), May 15. https://www.codesmith.io/blog/an-introduction-to-llm-evaluation-how-to- measure-the-quality-of-llms-prompts-and-outputs. Cheung, Kason Ka Ching, and Kevin W. H. Tai. 2021. “The Use of Intercoder Reliability in Qualitative Interview Data Analysis in Science Education.” Research in Science & Technological Education 41 (3): 1155–75. https://doi.org/10.1080/02635143.2021.1993179. Cohen, Jacob. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37- 46. https://doi.org/10.1177/001316446002000104. DAIR.AI. 2025. “Prompt Engineering Guide.” https://www.promptingguide.ai/. Durmus, Esin, He, and Mona Diab. 2020. “FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5055–70. 
Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.454/. Fabbri, Alexander R., Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. “SummEval: Re-evaluating Summarization Evaluation.” Transactions of the Association for Computational Linguistics 9: 391–409. https://doi.org/10.1162/tacl_a_00373. Gadesha, Vrunda, Vanna Winland, and Eda Kavlakoglu. 2025. “What is chain of thought (CoT) prompting?”. IBM Blog, April 23. https://www.ibm.com/think/topics/chain- of-thoughts. 21 Gera, Ariel, Alon Halfon, Eyal Shnarch, Yotam Perlitz, Liat Ein-Dor, and Noam Slonim. 2022. “Zero-Shot Text Classification with Self-Training.” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 1107–19. Association for Computational Linguistics. https://aclanthology.org/2022.emnlp- main.73/. Géron, Aurélien. 2019. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 2nd ed. O’Reilly Media. https://www.oreilly.com/library/view/hands-on-machine- learning/9781492032632/. Glickman, Mark, and Yi Zhang. 2024. “AI and Generative AI for Research Discovery and Summarization.” arXiv preprint, January 8. arXiv:2401.06795v2. https://arxiv.org/abs/2401.06795. Goodwin, Michael. 2024. “What is an API (application programming interface)?” IBM blog, April 9. https://www.ibm.com/think/topics/api. Google. 2024. “Overfitting.” Machine Learning Concepts. https://developers.google.com/machine-learning/crash- course/overfitting/overfitting. Google. 2025. “Machine Learning Glossary.” Google for Developers. https://developers.google.com/machine-learning/glossary. Huyen, Chip. 2023. “Multimodality and Large Multimodal Models (LMMs)”. Blog, October 10. https://huyenchip.com/2023/10/10/multimodal.html. Kallos, Alecia. 2023. “Creating a Qualitative Codebook.” Eval Academy. https://www.evalacademy.com/articles/creating-a-qualitative-codebook. Kinney, Rodney, et al. 2023. “The Semantic Scholar Open Data Platform.” arXiv preprint, January 24. arXiv:2301.10140. https://doi.org/10.48550/arXiv.2301.10140 Lewis, Patrick, Ethan Perez, Aleksandra Piktus et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” In Advances in Neural Information Processing Systems 33, 9459–9474. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e 5-Paper.pdf Lin, Chin-Yew. 2004. “ROUGE: A Package for Automatic Evaluation of Summaries.” In Text Summarization Branches Out, Association for Computational Linguistics, July 25, Barcelona. https://aclanthology.org/W04-1013/. 22 Liu, Jun, Prem Timsina, and Omar El-Gayar. 2018. "A comparative analysis of semi- supervised learning: The case of article selection for medical systematic reviews. Inf Syst Front 20, 195–207. https://doi.org/10.1007/s10796-016-9724-0 Liu, Nelson F., Kevin Lin, John Hewitt et al. 2024. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics 12: 57–173. https://doi.org/10.1162/tacl_a_00638. Liu, Yang, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. “G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2511–22. Association for Computational Linguistics. https://aclanthology.org/2023.emnlp-main.153/. Martineau, Kim. 2023. 
MacQueen, James. 1967. “Some Methods for Classification and Analysis of Multivariate Observations.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, 281–98. University of California Press.
McHugh, Mary L. 2012. “Interrater Reliability: The Kappa Statistic.” Biochemia Medica 22 (3): 276–82. https://pmc.ncbi.nlm.nih.gov/articles/PMC3900052/.
Mucci, Tim. 2024. “What Is Data Leakage in Machine Learning?” IBM Think, September 30. https://www.ibm.com/think/topics/data-leakage-machine-learning.
Murel, Jacob. 2024. “Create a Confusion Matrix with Python.” IBM Developer, March 7. https://developer.ibm.com/tutorials/awb-confusion-matrix-python/.
OpenAI. n.d. “Best Practices for Prompt Engineering with the OpenAI API.” OpenAI Help Center. Accessed May 4, 2025. https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api.
OpenAI. 2022. “Clustering.” OpenAI Cookbook, March 10. https://cookbook.openai.com/examples/clustering.
OpenAI. 2024. GPT-4o System Card. OpenAI. https://cdn.openai.com/gpt-4o-system-card.pdf.
Ouyang, Long, Jeff Wu, Xu Jiang et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” In Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “BLEU: A Method for Automatic Evaluation of Machine Translation.” In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–18. Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30. https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html.
Puri, Raul, and Bryan Catanzaro. 2019. “Zero-Shot Text Classification with Generative Language Models.” arXiv preprint, December 10. arXiv:1912.10165v1. https://arxiv.org/abs/1912.10165.
Raimondo, Estelle, Harsh Anuj, and Virginia Ziulu. 2023a. “Setting Up Experiments to Test GPT for Evaluation.” IEG Blog (blog), August 16. https://ieg.worldbankgroup.org/blog/setting-experiments-test-gpt-evaluation.
Raimondo, Estelle, Virginia Ziulu, and Harsh Anuj. 2023b. “Fulfilled Promises: Using GPT for Analytical Tasks.” IEG Blog (blog), August 23. https://ieg.worldbankgroup.org/blog/fulfilled-promises-using-gpt-analytical-tasks.
Raimondo, Estelle, Harsh Anuj, and Virginia Ziulu. 2023c. “Unfulfilled Promises: Using GPT for Synthetic Tasks.” IEG Blog (blog), August 30. https://ieg.worldbankgroup.org/blog/unfulfilled-promises-using-gpt-synthetic-tasks.
Shin, Taylor, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. “AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts.” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), edited by Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, 4222–35. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.346.
van der Lee, Chris, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. “Best Practices for the Human Evaluation of Automatically Generated Text.” In Proceedings of the 12th International Conference on Natural Language Generation, 355–68. Association for Computational Linguistics. https://aclanthology.org/W19-8643/.
Wang, Zhiqiang, Yiran Pang, and Yanbin Lin. 2023. “Large Language Models Are Zero-Shot Text Classifiers.” arXiv preprint, December 2. arXiv:2312.01044v1. https://arxiv.org/abs/2312.01044.
World Bank. 2017. Conducting a Structured Literature Review in the Framework of IEG (Major) Evaluations. IEG Methods Literature. Independent Evaluation Group. World Bank.
World Bank. 2024. Biodiversity for a Livable Planet: An Evaluation of World Bank Group Support for Biodiversity (FY15–24). Approach Paper. Independent Evaluation Group. World Bank. https://ieg.worldbankgroup.org/sites/default/files/Data/reports/ap_biodiversity.pdf.
World Bank. Forthcoming. Epidemic Preparedness. Approach Paper. Independent Evaluation Group. World Bank.
Yan, Ziyou. 2024. “Task-Specific LLM Evals That Do and Don’t Work.” eugeneyan.com. https://eugeneyan.com/writing/evals/.
Zewe, Adam. 2023. “Solving a Machine-Learning Mystery.” MIT News, February 7. https://news.mit.edu/2023/large-language-models-in-context-learning-0207.
Zhang, Tianyi, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2024. “Benchmarking Large Language Models for News Summarization.” Transactions of the Association for Computational Linguistics 12: 39–57. https://doi.org/10.1162/tacl_a_00632.
Ziulu, Virginia, Harsh Anuj, Ariya Hagh, Estelle Raimondo, and Jos Vaessen. 2024. “Extracting Meaning from Textual Data for Evaluation: Lessons from Recent Practice at the Independent Evaluation Group of the World Bank.” In Artificial Intelligence and Evaluation: Emerging Technologies and Their Implications for Evaluation, edited by Steffen Bohni Nielsen, Francesco Mazzeo Rinaldi, and Gustav Jakob Petersson, 57–73. Routledge. https://www.taylorfrancis.com/chapters/oa-edit/10.4324/9781003512493-5/extracting-meaning-textual-data-evaluation-virginia-ziulu-harsh-anuj-ariya-hagh-estelle-raimondo-jos-vaessen.

Appendix A. Additional Workflows

Figure A.1 provides an alternative workflow for structured literature reviews (SLRs) in the framework of Independent Evaluation Group major evaluations. This workflow is based on a checklist for conducting such reviews, provided as internal methodological guidance, and is closer to the “traditional” approach. We provide it as a comparison with the workflow presented in figure 1 to show how the framing of the same use case shifts when it is viewed from the perspective of different specializations or domains. We hope this comparison will help evaluators think through how to translate workflows similar to figure A.1 into ones more like figure 1, enabling the application of large language models.

Figure A.2 provides our current proposed workflow for evaluation synthesis. The same workflow can be replicated across other use cases, including portfolio review and analysis and interview transcript analysis. IEG is currently piloting the former as a set of AI-powered web-based applications developed in-house jointly with the WB Information Technology Solutions department (ITS).
Figure A.1. Alternate Workflow for Structured Literature Reviews
Source: World Bank 2017.

Figure A.2. Evaluation Synthesis Workflow
Source: Independent Evaluation Group.
Note: LLM = large language model.

Reference

World Bank. 2017. Conducting a Structured Literature Review in the Framework of IEG (Major) Evaluations. IEG Methods Literature. Independent Evaluation Group. World Bank.