Psychometric considerations for implementing phone-based learning assessments

© 2022 International Bank for Reconstruction and Development / The World Bank
1818 H Street NW, Washington, DC 20433
Telephone: 202-473-1000; Internet: www.worldbank.org

Some rights reserved.

This work is a product of the staff of The World Bank with external contributions. The findings, interpretations, and conclusions expressed in this work do not necessarily reflect the views of The World Bank, its Board of Executive Directors, or the governments they represent. The World Bank does not guarantee the accuracy of the information included in this work. Nothing herein shall constitute or be considered to be a limitation upon or waiver of the privileges and immunities of The World Bank, all of which are specifically reserved.

Rights and Permissions

This work is available under the Creative Commons Attribution 4.0 International license (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/, with the following mandatory and binding addition: Any and all disputes arising under this License that cannot be settled amicably shall be submitted to mediation in accordance with the WIPO Mediation Rules in effect at the time the work was published. If the request for mediation is not resolved within forty-five (45) days of the request, either You or the Licensor may, pursuant to a notice of arbitration communicated by reasonable means to the other party, refer the dispute to final and binding arbitration to be conducted in accordance with UNCITRAL Arbitration Rules as then in force. The arbitral tribunal shall consist of a sole arbitrator, and the language of the proceedings shall be English unless otherwise agreed. The place of arbitration shall be where the Licensor has its headquarters. The arbitral proceedings shall be conducted remotely (e.g., via telephone conference or written submissions) whenever practicable, or held at the World Bank headquarters in Washington, DC.

Attribution – Please cite the work as follows: Luna-Bazaldua, Diego; Julia Liberman; Victoria Levin; and Aishwarya Khurana. 2022. Guidance note on psychometric considerations for implementing phone-based learning assessments. Washington, DC: The World Bank. License: Creative Commons Attribution CC BY 4.0 IGO.

Translations – If you create a translation of this work, please add the following disclaimer along with the attribution: This translation was not created by The World Bank and should not be considered an official World Bank translation. The World Bank shall not be liable for any content or error in this translation.

Adaptations – If you create an adaptation of this work, please add the following disclaimer along with the attribution: This is an adaptation of an original work by The World Bank. Views and opinions expressed in the adaptation are the sole responsibility of the author or authors of the adaptation and are not endorsed by The World Bank.

Third-party content – The World Bank does not necessarily own each component of the content contained within the work. The World Bank therefore does not warrant that the use of any third party-owned individual component or part contained in the work will not infringe on the rights of those third parties. The risk of claims resulting from such infringement rests solely with you. If you wish to reuse a component of the work, it is your responsibility to determine whether permission is needed for that reuse and to obtain permission from the copyright owner.
Examples of components can include, but are not limited to, tables, figures, or images.

Any queries on rights and licenses, including subsidiary rights, should be addressed to World Bank Publications, The World Bank Group, 1818 H Street NW, Washington, DC 20433, USA; fax: 202-522-2625; e-mail: pubrights@worldbank.org.

PHOTO BY: © 2021 FARID TAJUDDIN/SHUTTERSTOCK.

PSYCHOMETRIC CONSIDERATIONS FOR IMPLEMENTING PHONE-BASED LEARNING ASSESSMENTS

DIEGO LUNA BAZALDUA, JULIA LIBERMAN, VICTORIA LEVIN, AISHWARYA KHURANA

SUPPORTED WITH FUNDING FROM THE GLOBAL PARTNERSHIP FOR EDUCATION

Contents

Acknowledgments
Introduction
Validity in interpreting phone-based assessment scores
Reliability of phone-based assessments
Country Case: Phone-based assessment in Botswana and lessons learned concerning validity and reliability

Acknowledgments

This guidance note was prepared by Diego Luna Bazaldua, Julia Liberman, Victoria Levin, and Aishwarya Khurana of the World Bank’s Learning Assessment Platform (LeAP) team. This work was sponsored by the Global Partnership for Education’s (GPE’s) grant to support continuity of learning during the pandemic. The team worked under the overall guidance of Omar Arias (Practice Manager, Global Engagement and Knowledge Unit, Education Global Practice). Valuable feedback was obtained from Noam Angrist during the preparation of this note. Colleagues who contributed to this document with reviews and feedback include Rabia Ali, Marguerite Clarke, Martin Elias de Simone, Koen Geven, James Gresham, and Mari Shojo. The team was supported by the Russia Education Aid for Development (READ) Trust Fund program.

This note is part of a broader set of global knowledge products on phone-based learning assessments, including a guidance note on considerations for the implementation of phone-based formative assessments, a landscape review of existing phone-based assessment interventions and their key features, and a checklist template to assess implementation prerequisites and enabling conditions for phone-based formative assessment solutions.

Introduction

The COVID-19 pandemic has had a significant impact on education worldwide, affecting whether and how students learn. World Bank estimates and initial analysis at the country level indicate that school closures and interruptions to in-person education are likely to negatively impact student learning.1 Moreover, these estimates predict that students in low- and middle-income countries will suffer from more significant learning losses than students in high-income countries, worsening the existing global learning crisis.

1 See Maldonado & De Witte 2020; Betkowski 2020; Masters 2021.
At the same time, the pandemic also represents an opportunity to think outside the “classroom box” and to explore and introduce innovative approaches to support learning remotely. As part of their education response to the pandemic, many countries have provided learning content to students at home through various means, including through the internet via devices such as computers, tablets, and smartphones, as well as through television, radio, printed materials, and feature (basic) phones. With schools closed, teachers have also had to adapt to using these different media to continue their teaching.

While providing learning content to students through remote means is necessary to support learning, it is critical to know whether students are actually using the learning resources provided to them and whether they are learning while schools are closed. Assessing students to determine what they know, understand, and can do is the only way for stakeholders to identify areas where students need support and to close any learning gaps.

Learning assessments have typically been implemented in person, such as in a school or at a testing center. However, in the context of school closures and limitations on in-person gatherings, one of the biggest challenges in implementing learning assessments is the difficulty of administering them in person safely. Where network and internet connectivity and access to internet-connected devices (e.g., computers, tablets, or smartphones) are widespread, learning management systems and online assessment platforms make it possible to assess students outside the classroom. In low-resource contexts, however, where network and internet connectivity are weak or access to internet-connected devices is limited, it may not be feasible to use online assessment applications. In those contexts, organizations and policymakers have been exploring the use of phone-based assessments delivered through basic (feature) phones via direct phone calls, short message service (SMS), and/or interactive voice response (IVR) technologies.

Such phone-based assessment solutions can be used for at least two purposes. The first purpose is to conduct formative assessment, for example to gauge how well students have absorbed the learning content, identify any misconceptions in understanding, provide constructive feedback to students or caregivers, and offer additional learning resources and activities to support learning. In this first use, formative assessments are a pedagogical tool that can help teachers and other stakeholders promote learning by identifying students’ learning status and planning next steps in their learning process. For readers interested in the formative use of phone-based assessments, the World Bank has also developed an additional guidance note outlining the critical elements necessary to make learning assessments feasible and sustainable over the phone through the use of phone calls, SMS, and IVR solutions. The second purpose is to conduct impact evaluations, for example analyzing the effect on student learning outcomes of interventions introduced in response to the pandemic (see the “Country Case” box at the end of this document). In this second use, learning assessments are a source of information on the overall effectiveness of a particular program.
For the purposes of making high-stakes decisions or monitoring learning at the system level, phone-based assessments are generally not suitable due to several constraints. These include limitations in student sampling and in the standardization of assessment administration, test security issues, and potential malpractice (e.g., item leakage, difficulty corroborating test-taker identity, or cheating).

As policymakers and other stakeholders design and implement phone-based assessments, they should consider educational measurement principles and standards related to the technical properties of any learning assessment to help prevent potential biases in the assessment results and their interpretation. These principles and standards relate to the assessment concepts of validity and reliability. Ensuring that learning assessments possess these technical qualities is crucial for impact evaluations, particularly those that will use learning outcomes as evidence of an intervention’s effectiveness. In the case of formative assessments, compliance with these technical standards may promote the use of phone-based activities to support remote learning, but it is not a strict requirement for implementation.

Critical validity and reliability considerations are introduced in the next section of this note. Organizations implementing phone-based learning assessments must work directly with specialists in psychometrics and assessment to address these considerations and ensure that the assessment process adheres to technical principles and standards. These technical experts can provide insights from the planning stage of any phone-based assessment initiative to the moment results become public (if that is an intended dissemination strategy) and are disseminated to different stakeholder groups.

Validity in interpreting phone-based assessment scores

Validity concerns the extent to which interpretations and claims based on test scores can be supported with evidence. A validity framework requires the continuous accumulation of evidence to support interpretations and uses of assessment results. Because some assessments traditionally administered in the classroom are now being adapted for administration over the phone, it is necessary to consider the following five key questions related to validity.

First: What are the assessment’s objectives and intended uses?

This is the first and overarching question that should guide all subsequent decisions in the assessment planning and design process. A clear statement identifying the purpose of the assessment can help determine what further validity evidence is needed to ensure that the phone-based assessment can produce valid results. For instance, a phone-based assessment could be used with the overarching diagnostic purpose of evaluating changes in students’ numeracy skills since the start of the pandemic and, in turn, making it possible to provide them with learning materials aligned to their numeracy ability level. Likewise, an assessment based on television or radio learning content could be administered over the phone with the objective of determining whether children are absorbing this remote learning content and whether learning is occurring.

Second: Is the assessment content relevant and representative of the knowledge domain (e.g., content delivered through television or radio), subject (e.g., science or mathematics), or skill (e.g., reading) that is to be measured?
Is there any learning content that will be omitted from the assessment if it cannot be assessed by phone?

These questions imply the existence of a comprehensive definition of the knowledge domain, subject, or skill that will be measured. Such a definition is the starting point for formulating items and tasks that can help stakeholders determine whether students have achieved the intended learning. If key aspects of the knowledge domain cannot be assessed by phone, this omission must be clearly documented to facilitate the interpretation of the assessment results. For instance, when a paper-based assessment includes plots, graphs, figures, or long reading passages that cannot be included in a phone-based assessment, these omitted assessment elements may alter the interpretation of assessment findings in terms of what test-takers know and can do.

Third: Compared to paper-based assessments, is the use of phones making it harder for students to understand and answer the items? Are adaptations of paper-based assessments necessary to reduce the complexity of their administration over the phone? In addition: What cognitive processes are students using to answer the questions over the phone? Are these cognitive processes similar to those required to answer paper-based assessments?

These questions are linked to cognition, another key aspect of validity evidence. Because phones have been used to deliver remote learning assessments only recently and on a relatively small scale compared with paper-based or computer-based assessments, it is critical to ensure that this alternative assessment modality does not add cognitive complexity that prevents students from accurately demonstrating what they know and can do. To answer these questions, assessment developers may need to conduct and document a small pilot study in which they ask students to explain in detail how they solve each question posed to them through the phone. For instance, Australia conducted pilot studies as it prepared to transition from paper-based to online computer-based national large-scale assessments. These pilot studies focused on determining how assessment delivery using computers affected student achievement, including the impact of using a keyboard to complete the writing section of the test. In addition, students were interviewed to determine their level of engagement with the computer-based assessment. Results provided evidence supporting the Australian government’s claim that the change in delivery mode would not substantially change the learning assessment properties.

Fourth: Are the item scores showing internal coherence with each other?

The remaining validity-related questions require quantitative evidence to be produced after assessment data has been gathered. Many of these statistical analyses are only relevant for standardized learning assessments that involve many items. For example, if all assessment items measure the same knowledge domain or subject, then item scores tend to show positive correlations among them. Advanced statistical techniques (e.g., factor analysis models) can be used to confirm whether these correlations reflect the measurement of the intended knowledge domain.
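To make this type of check concrete, the minimal sketch below illustrates, in Python, how inter-item correlations and corrected item-total correlations might be inspected before moving to full factor analysis models. The data frame, item names, and responses are hypothetical placeholders used only for illustration; actual analyses would run on the assessment’s own item-level data, typically with specialized psychometric software.

```python
# A minimal sketch of an internal-coherence check on item-level data.
# Assumes a hypothetical DataFrame in which each row is a student and each
# column holds a scored item response (1 = correct, 0 = incorrect).
import pandas as pd

def item_coherence_report(items: pd.DataFrame) -> pd.Series:
    """Corrected item-total correlation for each item (item vs. sum of the others)."""
    report = {}
    for col in items.columns:
        rest_total = items.drop(columns=col).sum(axis=1)  # total score excluding the item
        report[col] = items[col].corr(rest_total)          # Pearson correlation
    return pd.Series(report, name="corrected_item_total_r")

# Illustrative (invented) responses from six students to three items:
responses = pd.DataFrame({
    "item_1": [1, 0, 1, 1, 0, 1],
    "item_2": [1, 0, 1, 0, 0, 1],
    "item_3": [0, 1, 0, 0, 1, 0],  # an item that may not behave like the others
})
print(responses.corr())                  # inter-item correlation matrix
print(item_coherence_report(responses))  # low or negative values flag items for review
```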
If the statistical results indicate that any item is not correlated with the rest of the items as expected, item content revisions made by subject matter experts can help to decide whether the item should be retained or excluded from the total score.

Fifth: Are the assessment scores correlated with any other external variables?

This question is directly related to what was previously known as predictive validity. Examples of these external variables may include independent measures of student learning (including scores on large-scale assessments) or students’ school grades. This question may be harder to answer if enumerators are able to collect only limited information from students over the phone; however, in some instances stakeholders may be able to determine whether test scores predict external outcomes (see the example in the Country Case box below).

Reliability of phone-based assessments

In addition to validity, test scores must characterize student achievement reliably. When it comes to reliability, the main question to answer is, Are phone-based assessment scores accurate? Because different data collection designs and statistics can shed light on an assessment’s reliability properties, the answer to this main question depends on the type of reliability that needs to be analyzed. Most of the time, the interest is in analyzing whether all item scores are consistent with each other (i.e., students who answer item 1 correctly also tend to answer other assessment items correctly). This matters because it ensures that the assessment total score is an accurate representation of students’ knowledge and skills. Thus, another key question is, Are item scores consistent among themselves when administered over the phone? Once assessment data is gathered, this type of reliability is captured by the degree to which items co-vary, most commonly estimated with Cronbach’s alpha coefficient.

If the same assessment is administered twice to the same students, then the most important question becomes, Are phone-based assessment scores accurate over time? Correlations between the two total scores collected at different time points provide evidence of this type of reliability and help to determine whether scores are stable over time.

Finally, with respect to the consistency of the enumerators administering the assessment, the key question is, Are assessment scores produced by different enumerators administering the assessment over the phone consistent with each other? For this type of reliability evidence, the precision and consistency of enumerators’ scoring should be documented during their training on assessment administration. For instance, enumerators can practice by scoring phone-based assessments in simulated situations, and their scores can be compared against those assigned by expert enumerators. Likewise, inter-rater consistency coefficients (e.g., Cohen’s kappa coefficient) can be calculated to quantify the extent to which the scores assigned by trained and expert enumerators coincide.

Depending on the uses and administration mode of an assessment (that is, phone calls, SMS, or IVR), one or more sources of reliability evidence should be documented in any assessment results report or assessment technical report. Teams developing and administering phone-based assessments can increase the reliability of results by developing standardized protocols and training materials for enumerators to maximize equivalence in the results.
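As a concrete companion to these questions, the sketch below shows one way the two coefficients mentioned above, Cronbach’s alpha and Cohen’s kappa, could be computed with standard Python tooling. All data and variable names are hypothetical placeholders; in practice these statistics would be estimated from the actual item-level scores and double-scored enumerator data, and established statistical packages can be used instead of the hand-rolled functions shown here.

```python
# A minimal sketch of the two reliability statistics discussed above, assuming
# hypothetical scored data: `scores` is a students-by-items array of 0/1 item scores,
# and `trained` / `expert` are categorical scores assigned to the same responses
# by a trained enumerator and an expert enumerator.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal-consistency reliability of a total score built from item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def cohen_kappa(rater_a, rater_b) -> float:
    """Chance-corrected agreement between two raters scoring the same responses."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)
    # Expected agreement by chance, from each rater's marginal category proportions.
    p_expected = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Illustrative (invented) data: five students by three items; ten double-scored responses.
scores = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]])
trained = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
expert  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
print(f"Cohen's kappa:    {cohen_kappa(trained, expert):.2f}")
```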
As the COVID-19 pandemic has prompted education leaders to explore and introduce innovations to support learning continuity, it is critical that these innovations, especially those that support the learning process outside school walls, are in line with the standards and best practices in educational assessment. In many places, student learning can only be assessed remotely at this time. Basing assessment findings on valid and reliable evidence, whether the assessments are administered in person, on a computer, or over the phone, makes them more valuable for supporting students in their learning.

Country Case: Phone-based assessment in Botswana and lessons learned concerning validity and reliability

Validity and reliability can be maintained even in the context of a learning assessment administered over the phone. For example, the organization Young 1ove in Botswana recently presented results of a phone-based assessment that followed students during the COVID-19 pandemic lockdown. This assessment was conducted as part of a randomized controlled trial focused on analyzing the impact on learning outcomes of a targeted instruction component within a remote learning intervention. Learning outcomes of students randomly assigned to one of several groups were compared to determine the impact of different remote learning interventions.

The assessment tool used by Young 1ove was an adaptation, for phone administration, of the Annual Status of Education Report (ASER) mathematics paper-based assessment, which measures foundational numeracy skills. The phone adaptation covered a range of numeracy skills included in the original assessment tool: number recognition, two-digit subtractions, and three-digit divisions. Like the paper-based version of the assessment, the assessment administered over the phone delivered instructions orally to students. Paper-based assessment items had to be simplified for phone administration. For instance, Young 1ove implementers had to simplify instructions and include practice items to familiarize students with the remote assessment exercise.

Following are answers to the key questions on validity and reliability raised in this note, as applied to this Botswana assessment.

1. What are the assessment’s objectives and intended uses?

■ The phone-based assessment was implemented to evaluate the targeted instruction component of a remote learning intervention during the time schools were closed because of the pandemic.
■ As part of the learning assessment objectives, learning outcomes of students randomly assigned to different arms were compared to determine the impact of different remote learning interventions.

2. Is the assessment content relevant and representative of the knowledge domain (e.g., content delivered through television or radio), subject (e.g., science or mathematics), or skill (e.g., reading) that is to be measured? Is there any learning content that will be omitted from the assessment if it cannot be assessed by phone?

■ The assessment tool was adapted for phone administration from the ASER mathematics paper-based assessment, which measures foundational numeracy skills.
■ The phone adaptation covered a range of numeracy skills included in the original assessment tool: number recognition, two-digit subtractions, and three-digit divisions.
■ Like the paper-based version of the assessment, the phone-based adaptation delivered instructions orally to students.
■ Items with minimal visual stimuli were found to be more easily adapted to phone-based assessments.
■ Assessment content coverage is linked to the time spent on the phone.
■ Lessons learned: Items that are rich in visual stimuli may require the delivery of information through SMS messages to students before the assessment takes place through phone calls. It is also recommended that phone calls be kept brief and that students be assessed more frequently to keep the phone-based assessment administration informative and easy to implement.

3. Compared to paper-based assessments, is the use of phones making it harder for students to understand and answer the items? Are adaptations of paper-based assessments necessary to reduce the complexity of their administration over the phone? What cognitive or thought processes are students using to answer the questions? Are these cognitive processes similar to those in paper-based assessments?

■ Students were asked to explain their response process as part of the quality assurance and validity evidence checks for the phone-based assessment.
■ Young 1ove implementers had to simplify instructions and include practice items to familiarize students with the remote assessment exercise.
■ The lack of visual information made phone-based assessments rely more heavily on oral and written language. Therefore, language adjustments were made to simplify the assessment process over the phone.
■ Pilot studies allowed enumerators to identify problematic questions not amenable to phone-based assessment administration.
■ Lessons learned: The authors of the research findings report suggest comparing phone-based assessments against face-to-face assessments, if possible. The resulting information will permit assessment developers and users to determine what changes have to be made to facilitate the phone-based adaptation.

4. Are the questions or items showing internal coherence with each other? Are the assessment scores correlated with any other external variables?

■ Because pre-pandemic assessment data for students was available, it was possible to correlate performance on tasks delivered over the phone with student learning outcomes before schools closed. Results from a correlation analysis showed a high, positive correlation between students’ answers to the phone-based assessment and a mathematics assessment administered in school in the months before school closures (a minimal sketch of this type of check appears after this case).
■ No information regarding internal structure results has been documented yet.

5. Are phone-based assessment scores accurate?

■ The authors of the research findings article highlighted the benefits of training enumerators to standardize and introduce quality assurance into the phone-based assessment process. Enumerators were trained over the phone (with phone calls and messages) and received scripts to administer the assessment. The quality assurance protocols introduced to standardize the phone-based assessment process included limiting response times for each item to two minutes and asking students to explain their response process.
■ In addition, pilots were conducted for two weeks to ensure the feasibility of the phone-based assessment administration and the acceptance of this assessment mode by students and their caregivers.
■ The authors of the research findings report acknowledge the relevance of psychometric reliability, but no empirical reliability coefficients have been documented yet.
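The sketch below, referenced under question 4 above, illustrates the kind of correlation analysis described there: relating total scores from a phone-based assessment to an external measure such as an in-school assessment administered before school closures. The data and column names are invented for illustration and are not taken from the Botswana study.

```python
# A hypothetical illustration of relating phone-based assessment scores to an
# external measure (e.g., an in-school mathematics assessment administered
# before school closures). Values and column names are invented for the example.
import pandas as pd

students = pd.DataFrame({
    "phone_total":      [4, 7, 5, 9, 3, 8, 6, 2],            # phone-based assessment total score
    "pre_closure_math": [40, 68, 55, 85, 35, 74, 60, 30],    # earlier in-school mathematics score
})

# Pearson (and, for ordinal scores, Spearman) correlations between the two measures;
# a high positive value supports interpreting phone-based scores as evidence of learning.
pearson_r = students["phone_total"].corr(students["pre_closure_math"])
spearman_r = students["phone_total"].corr(students["pre_closure_math"], method="spearman")
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
```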
References & Bibliography

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for Educational and Psychological Testing. Washington, DC: AERA.

Angrist, N., Bergman, P., Evans, D. K., Hares, S., Jukes, M., & Letsomo, T. (2020). Practical lessons for phone-based assessments of learning. BMJ Global Health 5. doi:10.1136/bmjgh-2020-003030

Angrist, N., Bergman, P., & Matsheng, M. (2020). School’s Out: Experimental Evidence on Limiting Learning Loss Using “Low-Tech” in a Pandemic. Working Paper no. w28205. Cambridge, MA: National Bureau of Economic Research.

Australian Curriculum, Assessment and Reporting Authority (ACARA). (2014). Tailored Test Design Study 2013: Summary Research Report. Sydney, Australia: ACARA.

Azevedo, J. P. (2020, December). How could COVID-19 hinder progress with Learning Poverty? Some initial simulations. World Bank Blogs. https://blogs.worldbank.org/education/how-could-covid-19-hinder-progress-learning-poverty-some-initial-simulations?cid=SHR_BlogSiteEmail_EN_EXT

Betkowski, B. (2020, November). Online learners falling behind in their reading skills. Troy Media. https://troymedia.com/education/online-learners-falling-behind-in-their-reading-skills/#.YBYY4uhKiUl

Brown, T. A. (2015). Confirmatory Factor Analysis for Applied Research. New York, NY: Guilford Publications.

Clarke, M. (2012). What Matters Most for Student Assessment Systems: A Framework Paper. Washington, DC: The World Bank Group.

Educational Testing Service. (2014). 2014 ETS Standards for Quality and Fairness. Princeton, NJ: Educational Testing Service.

Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement 50(1), 1-73.

Kline, P. (2014). An Easy Guide to Factor Analysis. New York, NY: Routledge.

Maldonado, J., & De Witte, K. (2020). The Effect of School Closures on Standardised Student Tests. Leuven, Belgium: KU Leuven Department of Economics. https://lirias.kuleuven.be/retrieve/589583

Masters, K. (2021, January). Early data shows extent of learning loss among Virginia students. Virginia Mercury. https://www.virginiamercury.com/2021/01/29/early-data-shows-extent-of-learning-loss-among-virginia-students/

Sireci, S. G. (2009). Packing and unpacking sources of validity evidence: History repeats itself again. In R. W. Lissitz (Ed.), The Concept of Validity: Revisions, New Directions, and Applications (pp. 19–37). Charlotte, NC: Information Age.

For more information regarding the implementation of phone-based learning assessments, please consult the other knowledge products developed under the Continuous and Accelerated Learning in response to COVID-19 ASA project.

SUPPORTED WITH FUNDING FROM THE GLOBAL PARTNERSHIP FOR EDUCATION