Policy Research Working Paper 6526

It’s Only Words: Validating the CPIA Governance Assessments

Stephen Knack
The World Bank
Development Research Group
Human Development and Public Services Team
June 2013

Abstract

This study analyzes the validity of the World Bank’s Country Policy and Institutional Assessments governance ratings, an important factor in allocating the Bank’s concessionary International Development Association funds. It tests for certain biases in the ratings, and examines the quality of the written justifications that accompany the ratings. The study finds no evidence of bias in favor of International Development Association-eligible countries, despite a potential moral hazard problem inherent in the ratings process. However, there is some evidence of an upward bias in ratings for one region, relative to the other five regions. The study finds significant regional differences in the quality of the written justifications accompanying the six World Bank regions’ proposed ratings. The length of these write-ups has exploded over time. Although higher-quality write-ups are also longer on average, there is wide dispersion in the word count at any given quality level, and some long write-ups provide little relevant information. Higher-quality write-ups are associated with a lower likelihood that central unit reviewers will either disagree with proposed ratings, or request additional information to assess the proposed rating. Controlling for quality, longer write-ups are associated with a greater probability that central reviewers will disagree with a proposed rating. Although checks and balances built into the process appear to work reasonably well, the author concludes that a more proactive role for central unit reviewers and regional chief economists’ offices could further enhance the quality of write-ups and reduce regional bias.
This paper is a product of the Human Development and Public Services Team, Development Research Group. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The author may be contacted at sknack@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

It’s Only Words: Validating the CPIA Governance Assessments

Stephen Knack [1]

JEL: O10, O17, O19
Keywords: Aid effectiveness, aid allocation, governance, corruption, rule of law

[1] Lead Economist, DECHD & PRMPS, World Bank (sknack@worldbank.org). The conclusions of this paper are not intended to represent the views of the World Bank, its Executive Directors, the countries they represent, or the members of the Public Sector Governance Board. Claudia Berg provided able research assistance. The paper benefited from valuable comments by Rui Coutinho, David Gould, Roumeen Islam, Kimberly Johns, Markus Kitzmuller, Jana Kunicova, Nick Manning, Shilpa Pradhan, Nadeem Rizwan, Halsey Rogers, Smriti Seth, and Hernan Winkler. However, the author assumes full responsibility for the analysis and interpretation, including any remaining errors.
1. Introduction

The World Bank’s Country Policy and Institutional Assessment (CPIA) ratings are the primary factor in determining allocations of its concessionary IDA (International Development Association) funds across recipient countries. The premise is that the development effectiveness of aid and other resources is conditional on the quality of macroeconomic and other policies, and on the quality of public sector management (including budgetary and legal systems). Under IDA’s “performance-based allocation” (PBA) formula, countries with lower per capita incomes receive higher allocations per capita, but CPIA ratings are weighted more heavily than income. The African and Asian Development Banks have their own CPIA ratings, used in their own PBA systems. Collectively, these two regional development banks and IDA have accounted for roughly $17 billion annually in gross ODA disbursements (about $11 billion net) in recent years, with the majority allocated partly on the basis of their respective CPIA ratings. Over two-thirds of these funds are from IDA. Each of these three development banks produces its own CPIA ratings, based on assessments by its own staff implementing a “questionnaire” with detailed criteria on 16 policy areas, grouped into 4 “clusters” (World Bank, 2010).

The purpose of this study is to contribute to our understanding of the CPIA process in the World Bank, and thereby of the likely accuracy of the ratings. It analyzes the length and quality of the written justifications that accompany the regions’ proposed ratings, as well as the responses of central unit reviewers to these write-ups and ratings. We also test for certain biases in the ratings, and find some evidence of bias in favor of one region, but no evidence of any bias toward IDA countries. Finally, we recommend reforms to strengthen the process where apparent weaknesses are identified.
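The paper does not reproduce the PBA formula itself. As a rough illustration of the key property described above, that performance outweighs income in the allocation rule, a stylized version might look like the following. The functional form, exponents, and variable names here are illustrative assumptions for exposition, not IDA’s actual published parameters.

```python
def stylized_pba_allocation(cpr, gni_per_capita, population,
                            perf_exponent=5.0, income_exponent=-0.125):
    """Stylized performance-based allocation rule (illustrative only).

    A large exponent on the Country Performance Rating (CPR) makes
    policy and institutional quality dominate the allocation, while a
    small negative exponent on income gives poorer countries modestly
    larger allocations per capita. Exponents are assumptions, not
    IDA's actual parameters.
    """
    return (cpr ** perf_exponent) * (gni_per_capita ** income_exponent) * population

# Under this stylized rule, a half-point gain in the CPR (on the 1-6
# CPIA scale) moves the allocation far more than a halving of income.
base = stylized_pba_allocation(cpr=3.0, gni_per_capita=800, population=10e6)
better_policy = stylized_pba_allocation(cpr=3.5, gni_per_capita=800, population=10e6)
poorer = stylized_pba_allocation(cpr=3.0, gni_per_capita=400, population=10e6)
assert better_policy / base > poorer / base
```

The exponents chosen make the contrast vivid: raising the CPR from 3.0 to 3.5 more than doubles the stylized allocation, while halving income raises it by under 10%, mirroring the paper’s point that CPIA ratings are weighted more heavily than income.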
This study is not designed to be a comprehensive treatment of the CPIA as an appropriate mechanism for allocating aid. In particular, it does not address issues regarding the content of the CPIA questions. Nor does it address the weights that the IDA donors assign to the different CPIA questions and other variables in the IDA allocation formula. Moreover, it focuses exclusively on the questions in one of the four clusters. The choice of cluster D, on quality of public sector management and institutions, is motivated by the fact that it has by far the largest weight in determining IDA allocations (World Bank, 2010).

The validity of CPIA governance ratings matters for other purposes, in addition to allocating IDA funds to countries where donors believe they will be used most effectively. The CPIA ratings are often used in research studies (e.g. Collier and Dollar, 2002; Dollar and Levin, 2006; Knack, 2009) on the largely untested assumption that the ratings are valid and reliable. They are sometimes used to monitor performance of individual countries or groups of countries. For example, question 13 on quality of budgetary and financial management was used as one of the Paris Declaration monitoring indicators (OECD, 2011; Knack 2013).

The remainder of the study is organized as follows. Section 2 briefly reviews the (sparse) related literature, and places this contribution in the context of that literature. Section 3 describes the CPIA ratings process, and section 4 describes the data and hypotheses to be tested. Results are presented in sections 5-8, and section 9 summarizes and concludes with several recommendations for strengthening the process, as well as suggestions for additional research.

2. Related Literature

Only a few studies have examined the validity of the CPIA ratings, despite the CPIA’s importance in determining aid allocations.
They have mostly focused on the content of the CPIA questions, and on how the IDA donors weight the different questions in allocating IDA funds. [2] The scope of this study is limited to analyzing the validity of the ratings themselves, and not how the IDA donors choose to use them in allocating aid among recipient countries.

Several observers have criticized the CPIA content for reflecting a neoliberal or “Washington Consensus” view of what policies matter for development (e.g. Cage 2009). Kanbur (2005) proposes that the CPIA focus more on measuring outcomes. He argues that if countries are generating favorable development outcomes there is good reason to believe they will use aid funds effectively, even if their policies do not conform closely to the CPIA’s prescriptions. Steets (2008) recommends adding more content on infrastructure, “opportunities for participation” and “empowerment of communities.” Steets (2008) and IEG (2010) advocate an increased emphasis on protection of human rights.

The validity of the CPIA ratings has rarely been explicitly addressed in empirical studies. The IEG (2010) evaluation of the CPIA finds moderate to high cross-country correlations between ratings on various questions and other conceptually related indicators. For example, it reports a .86 correlation between CPIA question 16 (on accountability, transparency and corruption in the public sector) and Transparency International’s Corruption Perceptions Index. The IEG report does not test for possible biases, however, despite noting that the use of CPIA scores for IDA allocations creates incentives for country teams to inflate ratings. Gelb, Ngo and Ye (2004) provide evidence of the CPIA’s external validity, showing it predicts income growth in the medium term. They explicitly test for a pro- or anti-Africa bias in the ratings, and find that ratings for countries in the region are neither significantly higher nor lower than predicted by their scores on related indicators.
Other empirical analyses making use of the CPIA implicitly support the external validity of the ratings. At least two World Bank (2011a, 2007) reports find that high or increasing CPIA ratings are associated with a stronger likelihood of achieving several of the Millennium Development Goals. Performance of World Bank projects is also stronger in countries with higher CPIA ratings (Denizer, Kaufmann and Kraay, 2011; IEG, 2010: Appendix G). Donors are more likely to use recipient countries’ public financial management systems, rather than managing their aid through parallel systems, when ratings on CPIA question 13 (on strength of budgetary systems) are higher (Knack, 2013). If the CPIA ratings contain a large random-error component, that would tend to bias empirical tests against finding correlations such as the ones in these three studies. [3]

This study contributes to the evidence on the validity of the CPIA ratings in several ways. It uses related indicators from other sources to test for systematic ratings bias favoring the IDA-eligible countries. We find no evidence for such a bias, despite the fact that their ratings are linked to aid allocations, unlike the case with non-IDA countries. We also test for regional biases in the data, and confirm the finding by Gelb, Ngo and Ye (2004) on the absence of an African bias. Instead, we find some evidence of ratings bias in favor of countries in the Eastern Europe and Central Asia (ECA) region.

[2] Steets (2008), IEG (2010) and an external review panel’s report (World Bank, 2004) conclude that there is insufficient empirical evidence to justify a higher weight on cluster D on grounds of development effectiveness. Rather, the extra weight on governance is likely motivated instead by donors’ fiduciary concerns (IEG, 2010: 60).
By comparing the original regional proposals with the final ratings, we show that the network review process has only a marginal impact in curbing this bias, despite the fact that ratings disagreements are usually resolved in favor of the network’s recommendations over the region’s proposals.

The study also provides the first quantitative analysis of the written justifications (“write-ups”) accompanying the regions’ proposed ratings. We find significant differences among the regions in the length of these write-ups, and more importantly in their quality, with East Asia (EAP) the leader and Latin America (LCR) the laggard, relative to the other four regions. On average, longer write-ups are higher in quality (as measured by how many of the criteria in the question are addressed in the write-up). Despite this overall positive relationship, many long write-ups actually cover fewer of the criteria than many short write-ups. This fact suggests that in some cases much of the information contained in them is of limited relevance in justifying the proposed ratings.

The Bank’s central units, including the network anchors, have a key quality control role in the CPIA. We therefore also analyze responses of network reviewers to the regions’ proposed ratings as a function of write-up characteristics and other factors. The reviewers were more likely to request additional information to assess a proposed rating if (1) it was accompanied by a shorter write-up; (2) the write-up addressed fewer of the criteria in the question; (3) it represented an increase or decrease from the previous year; or (4) it belonged to an IDA-eligible country. Finally, network reviewers were more likely to disagree with a region’s proposed rating when the write-up addressed fewer of the criteria in the question, or when it represented an increase. Controlling for ratings levels and increases, there were no significant regional differences in the likelihood that reviewers disagreed with a proposed rating.
These findings indicate that the network reviews are serving a valuable quality-control function in the CPIA process, and are likely improving the accuracy of the ratings, although this role could be strengthened further.

[3] Most sources of systematic bias in the ratings would also tend to weaken relationships with development outcomes. However, if quality of policies and institutions as measured by CPIA were merely inferred from observing development outcomes (e.g. donors use country systems in X, so X must have sound budgetary systems), then correlations would be biased upwards.

3. CPIA Process

The CPIA’s 16 questions are intended to assess a recipient country’s ability to make effective use of aid resources in furthering development and poverty reduction. The set of questions and their criteria have evolved over time, and are revised periodically to reflect changes in the collective knowledge of practitioners and specialists, both inside and outside the World Bank, regarding policies and public sector management institutions that matter for these outcomes (IEG, 2010; World Bank, 2004). The questions are grouped into four “clusters” as follows:

A. Economic Management
1. Macroeconomic Management
2. Fiscal Policy
3. Debt Policy

B. Structural Policies
4. Trade
5. Financial Sector
6. Business Regulatory Environment

C. Policies for Social Inclusion/Equity
7. Gender Equality
8. Equity of Public Resource Use
9. Building Human Resources
10. Social Protection and Labor
11. Policies and Institutions for Environmental Sustainability

D. Public Sector Management and Institutions
12. Property Rights and Rule-based Governance
13. Quality of Budgetary and Financial Management
14. Efficiency of Revenue Mobilization
15. Quality of Public Administration
16. Transparency, Accountability, and Corruption in the Public Sector

The questions are designed to assess government policies and institutions to the extent possible, rather than outcomes.
Some of the sub-criteria in the questions are quantitative, but a certain amount of expert judgment is required in determining the scores for all 16 questions. Ratings originate with the country teams in the Bank’s regional departments. This can be viewed as a strength of the CPIA ratings process, relative to cross-country assessments produced by some other organizations: “closeness to the client is needed to provide an in-depth knowledge of the policies and institutions in a given country” (Gelb, Ngo and Ye, 2004: 2). However, it also creates a potential conflict of interest, as “having staff rate the countries on which their work programs depend could lead to rating biases” (IEG, 2010: xv).

For this reason, the Chief Economist’s office in each of the six World Bank regional departments has a quality control function in the process. These regional chief economists’ (RCE) offices have considerable discretion over how they conduct their respective regional-level reviews, and procedures vary. For example, quantitative checks for ensuring intra-regional consistency are more feasible and useful for regions with numerous countries such as Africa (AFR) than for others such as South Asia (SAR). Gelb, Ngo and Ye (2004) report that in the Africa region “intensive discussions take place between the country economists” and sector specialists, and ratings are debated at several meetings convened by the Chief Economist’s office.

Following the intra-regional reviews, the six RCE offices then forward their proposed ratings to the World Bank’s central units for a cross-regional review coordinated by the Vice-President’s Office in the Operational Procedures and Country Services (OPCS) department. Each of the 16 questions is assigned to a particular network anchor with primary responsibility for the review, although there is no one-for-one mapping: multiple central units comment on each question, and some network anchors comment on multiple questions.
The Public Sector Governance anchor in the PREM network (PRMPS) has primary responsibility for four of the five questions in cluster D, all but question 13. [4] These are the four questions that will be the subject of the analyses in the remainder of the study.

The main value added of the network review is to strengthen cross-country comparability of ratings across regions. The central units are usually better positioned than the RCE offices to conduct comparative statistical analyses using a range of other indicators pertaining to the content of the CPIA questions. Comments from network reviewers are collected by OPCS and submitted to the regions for their responses. Many comments take issue with a particular ratings proposal, in the majority of cases (but by no means all) recommending a lower rating than the one proposed by the region. Some comments disagree only with a proposed sub-rating that has no implications for the overall question score. Many other comments do not express disagreement with a rating or sub-rating, but instead point out that some of the criteria contained in the question are neglected in the written narrative submitted by the regions in support of each proposed rating.

These “write-ups” were introduced into the CPIA process in 2001, and are of enormous importance in helping regional and network reviewers, and government officials, to understand the reasoning behind the ratings proposed by the country teams. To encourage candid input from staff, the write-ups are not publicly released. Government officials can see their own country’s write-up (but not those of other countries) and discuss it with the Bank country team. In contrast to some other expert-based subjective governance ratings, the CPIA’s write-ups therefore can provide more transparent indications of how governments can improve their ratings. Some write-ups are more thorough than others, however.
Omissions in the write-ups often prompt requests by network reviewers for additional information to be provided in revised write-ups, to enable a more informed network review.

The write-ups were extremely brief when first introduced in 2001, but have expanded steadily over the years, with little sign of converging toward a stable length. Figure 1 shows the trend in total word count for the largest region’s (AFR) write-ups. The average annual increase in the word count over the period is 26.3%. The rate of increase slowed from an average 47.7% between 2001 and 2007 to 13.2% from 2007 to 2012. In the most recent two years (2010-2012), however, the annual growth rate was 13.8%. The five other regions have experienced similar increases.

[4] In recent years the Financial Management Board in OPCS has taken over lead responsibility from PRMPS for question 13 (on Quality of Budgetary and Public Financial Management). This change allows PRMPS to focus more resources on reviewing the remaining four questions.

In the IDA country allocation formula, cluster D is weighted much more heavily than the other three clusters combined (World Bank, 2010). [5] Participants in the CPIA process also regard this cluster as the most difficult to assess, because of the relative scarcity of hard data and hence greater reliance on expert judgment (IEG, 2010: 50). If staff members are trying to raise a country’s ratings in order to increase its IDA allocation, the IDA formula provides a motive for focusing on the cluster D questions, and the subjective nature of some of its questions provides the opportunity. It is therefore particularly important to investigate the process and validity of the cluster D ratings, although we welcome future research into the other three clusters.

4. Data and Hypotheses

In this study the unit of analysis is the country-question; with 4 CPIA questions and 136 countries there are 544 observations in total.
Two dependent variables were constructed from the write-ups accompanying the regions’ 2011 proposed ratings for four of the five cluster D questions. [6] The first variable is a simple word count, measuring the length of each write-up. The second variable assigns a grade to each write-up based on the proportion of the criteria contained in the question that are addressed at all. A grade of “A” is assigned if all or nearly all (more than roughly 85%) of the criteria are addressed, and a “B” is assigned to other write-ups that cover at least one half of the criteria. The lowest grade of “D” is assigned to the relatively few cases in which all or nearly all (again, roughly 85%) of the criteria are not addressed. A grade of “C” applies to the remaining cases, where fewer than half of the criteria are covered, but not so few as to merit a “D” grade. Credit is given for covering a particular criterion in the question even if the write-up addresses it only in a perfunctory manner. Moreover, no attempt was made to grade write-ups with respect to accuracy or impartiality of the information they contain. The write-up grade is therefore only a simple and partial measure of write-up quality. A more comprehensive measure of quality would not only have been far more time consuming to construct, but would have further increased the subjectivity in determining quality grades.

The mean word count is 735, with a standard deviation of 396 words. The shortest write-up is only 159 words, and the longest is 4539. Although word count is highly skewed, the log of word count approximates a normal distribution quite well. We therefore use log of the word count in all of the analyses below.

[5] The four clusters are weighted equally in computing an overall CPIA score, termed the “IRAI” (IDA Resource Allocation Index).
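The grading rule described above can be sketched as a simple function. The thresholds follow the text directly; the 15% cutoff for a “D” is implied by “all or nearly all (again, roughly 85%) of the criteria are not addressed.”

```python
def writeup_grade(criteria_covered, criteria_total):
    """Assign a quality grade from the share of a question's criteria
    that the write-up addresses at all (thresholds as described in the text)."""
    share = criteria_covered / criteria_total
    if share > 0.85:   # all or nearly all criteria addressed
        return "A"
    if share >= 0.50:  # at least one half of the criteria
        return "B"
    if share < 0.15:   # all or nearly all criteria NOT addressed
        return "D"
    return "C"         # remaining cases: between roughly 15% and 50%

# Illustrative calls for a question with ten criteria:
assert writeup_grade(9, 10) == "A"
assert writeup_grade(6, 10) == "B"
assert writeup_grade(3, 10) == "C"
assert writeup_grade(1, 10) == "D"
```

Note that, as in the text, the grade rewards coverage only: a write-up that mentions a criterion in a perfunctory manner gets the same credit as one that treats it thoroughly.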
In the Country Performance Rating (CPR) that actually determines IDA allocations, cluster D is assigned a weight of .68, the average of clusters A, B and C is assigned a weight of .24, and a portfolio performance rating receives the remaining .08 weight. These weights have been adjusted several times in determining the CPR, and the weighting of the CPR relative to per capita income in determining IDA allocations is also occasionally adjusted. However, in the CPR the CPIA’s governance questions have been weighted more heavily than the other questions since 1998 (World Bank, 2010).

[6] These four include all of those for which the PREM Public Sector Governance Group has primary responsibility for the network review. The exception is question 13.

The median quality grade is a “B”. Nearly 65% of write-ups covered one half or more of the criteria (i.e. received a grade of either “A” or “B”), but only 19% received an “A” grade. Only 5.5% of write-ups received a grade of “D”, for addressing less than 15% of the relevant criteria. As discussed below, however, there is substantial variation across regions in quality grades.

Three other dependent variables in the analysis are constructed from the network review outputs and finalized CPIA ratings. First, we code each country-question observation for whether or not at least one network reviewer requests more information from the regions; such requests typically reflect a perception that the write-up is inadequate in one or more respects for making a confident assessment of the proposed rating. Second, we code for whether or not at least one network reviewer disagrees with the proposed rating and recommends a different rating. Third, we code for ratings adjustments, or cases where the final “official” rating differs from the one originally proposed by the regions.

IDA countries

The CPIA ratings are produced for purposes of allocating IDA funds.
However, the CPIA assessments are conducted for IBRD-eligible countries as well as for IDA-eligible countries. Ratings are publicly disclosed only for the IDA countries (and only beginning in 2005). Intuitively, one would expect staff to devote extra effort to the CPIA assessments for the IDA countries, because more is at stake, and because their ratings (if not the write-ups) are publicly released and so potentially subject to wider scrutiny and criticism. We therefore hypothesize that write-ups will be longer and cover more of the question criteria for IDA than for IBRD countries. If this increased effort produces more accurate and well-justified ratings, then network reviewers, other things equal, should disagree less often with the regions’ proposals for IDA countries, and request additional information less often.

Other things may not be equal, however. Network reviewers in turn may devote more attention to IDA country ratings, because of their importance for allocations and because they are public. Moreover, network reviewers may scrutinize IDA country ratings more closely, recognizing a certain degree of moral hazard inherent in the process:

    Although the expert judgment of Bank staff is clearly an asset in the CPIA exercise, at the same time there is a potential conflict of interest in having staff provide ratings, particularly for IDA countries. This potential for conflict of interest arises from the fact that ratings produced by staff are in turn used for allocating IDA resources for the same countries on which the work programs of those staff depend. Therefore, staff may potentially be upwardly biased in assigning ratings for their countries. (IEG, 2010: 50)

This reasoning suggests that any extra effort devoted to IDA ratings by country teams may not necessarily improve the accuracy of scores.
Accordingly, network reviewers might disagree more often with IDA than with non-IDA ratings even if the write-ups for the former are lengthier and cover the criteria in the questions more completely. Most notably, the moral hazard issue introduces a strong asymmetry into the network reviews; it implies that a majority of disagreements will take the form of network reviewers recommending a rating lower than the one proposed by the region.

Benchmark countries

The CPIA process is implemented in two phases. In the benchmark phase about 20 countries are assessed. The benchmark countries change somewhat from year to year, but the sample is designed to include at least one IDA country in each region, and to represent a mix of higher- and lower-performing countries. The remaining 115 (approximately) countries are assessed in a second phase after ratings for the benchmark sample are finalized. The timetable for the benchmark assessments is relatively long, given that there are only about 20 countries. For example, the time allotted for the network review is usually about two weeks for the benchmark phase and about three weeks for the second phase. It is doubly important to set ratings at the appropriate level for the benchmark countries, as those ratings are often used as comparators by the regions and reviewers in the second phase. In this context, the relatively generous timetable for the benchmark phase is appropriate.

The extra time available for assessing the benchmark countries, and the added importance of getting their ratings right, should affect the write-ups and review outputs. Other things equal, we would expect write-ups for the benchmark countries to be longer and to cover more of the criteria in each question. If a longer and more thorough benchmarking process produces more accurate ratings and better write-ups, then network reviewers should be less likely to request additional information from the regions and to disagree with the proposed ratings.
However, network reviewers have more time available to investigate each proposed rating in the benchmark phase. Therefore, controlling for the more thorough process (using our indicator of how many of the criteria are addressed in the write-ups), reviewers might disagree and request additional information more often in the benchmark phase.

Ratings increases and decreases

Ratings changes should generally require more elaborate justifications than unchanged ratings. Country teams are perceived as having stronger incentives to increase than to decrease ratings, so they may anticipate more skeptical reactions from regional and network reviewers to proposed increases by providing lengthier and more comprehensive justifications. If it is both easier, and less important, for country teams to convince reviewers to go along with a ratings decrease than with an increase, write-ups accompanying proposed decreases may differ little from those for unchanged ratings.

Openness

More information relevant to the CPIA governance questions is likely to be available in countries with more open governments that encourage or at least tolerate freedom of the press. We measure openness using the Freedom House index of press freedoms, and hypothesize that more openness will be associated with longer and higher-quality write-ups.

Country size and income

Less information is generally available for smaller countries, particularly the “micro-states” (IEG, 2010: 46; Gelb, Ngo and Ye, 2004). For this reason write-ups are hypothesized to be longer and cover more of the question criteria in larger countries, as measured by log of population. We also control for log of per capita income. More information on governance and public sector institutions, particularly at sub-national levels, is likely to be available for higher income countries, which tend to have better communications and transportation infrastructure and more foreign investment.
Among IDA-eligible countries, however, the poorer ones receive higher IDA allocations. More intensive engagement associated with higher aid levels is likely to provide country teams with more information. The net impact of per capita income is therefore theoretically ambiguous.

CMU in country

For 32 of the 136 countries, the World Bank’s Country Management Unit (CMU) is based in the country. For these cases, the country team may have access to more relevant information for the CPIA. Other things equal, therefore, we would expect lengthier write-ups that cover more of the criteria, and fewer requests from reviewers for additional information. If the added information produces more accurate ratings, we might also expect to observe fewer disagreements by network reviewers with the proposed ratings. On the other hand, country teams based in Washington (or, as is sometimes the case, in a neighboring country in the region) may find it easier to maintain objectivity. Where staff are based in the country, there may be more subtle pressures to produce higher ratings, either to increase country allocations or to maintain friendly relations with government counterparts. If so, network reviewers may disagree more often with proposed ratings for these countries. The net effect of these counteracting influences may be either positive or negative.

Regional effects

All IDA-eligible and IBRD-eligible countries in the World Bank are administratively assigned to one of six regions: Sub-Saharan Africa (AFR), East Asia and Pacific (EAP), Eastern Europe and Central Asia (ECA), Latin America and Caribbean (LCR), Middle East and North Africa (MNA), or South Asia (SAR). The regional chief economists’ offices have substantial discretion in how they manage the CPIA process and regional reviews. Some may encourage more input from sector specialists than others, or conduct more intensive comparative analyses, or debate more vigorously with their country teams on specific ratings.
Any regional effects emerging from the tests here may reflect such differences in how the process is conducted. There are no strong theoretical reasons to expect a particular region to produce more thorough write-ups. However, one plausible argument is that the CPIA process is given higher priority in regions where most countries are IDA eligible. If so, then even when we control for IDA status of individual countries we might observe longer and more thorough write-ups on average for countries in AFR than in ECA, LCR and MNA. In the 2007 CPIA, the network reviewers disagreed with ECA’s proposed ratings most often (IEG, 2010). For all 16 CPIA questions, reviewers disagreed with 17.5% of ECA’s ratings. For the other regions this figure varied from 8.6% for SAR to 12.3% for LCR. In the absence of any substantive explanation for these regional differences, it is unclear whether these results should generalize to our analysis of the 2011 CPIA, which is limited to the cluster D ratings.

Question effects

Some questions cover more extensive issues or sub-criteria than others, so we would expect their write-ups to be longer on average. Questions have multiple components that are each assigned a sub-rating, and each component in turn lists several sub-criteria to address in the write-ups. Questions 12 and 15 each have three components, while question 16 has four and question 14 has only two. We should therefore observe longer write-ups for question 16 and shorter ones for question 14, relative to questions 12 and 15.

5. Write-ups: Word Counts

Regressions presented in Table 1 show partial correlations of write-up length with the variables described in section 4. The baseline specification in equation 1.1 shows results for the full sample of 136 countries and 544 questions. Standard errors (in this and subsequent tables) are adjusted for clustering by country, as errors for the four observations for each country are not likely to be independent.
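The clustering adjustment described here can be sketched with simulated data; the `cov_type="cluster"` option in statsmodels computes cluster-robust standard errors. The variable names, coefficients, and data-generating process below are illustrative assumptions, not the paper's actual dataset.

```python
# Illustrative sketch of the Table 1 setup: OLS of log word count on
# country characteristics, with standard errors clustered by country
# (four question-level observations per country). All data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_countries, questions = 136, [12, 14, 15, 16]
df = pd.DataFrame({
    "country": np.repeat(np.arange(n_countries), len(questions)),
    "question": np.tile(questions, n_countries),
    "ida": np.repeat(rng.integers(0, 2, n_countries), len(questions)),
    "increase": rng.integers(0, 2, n_countries * len(questions)),
})
# Country-level error component: the reason errors within a country are
# correlated, and why clustering by country matters.
u_country = np.repeat(rng.normal(0, 0.3, n_countries), len(questions))
df["log_words"] = (6.5 + 0.15 * df["ida"] + 0.15 * df["increase"]
                   + u_country + rng.normal(0, 0.3, len(df)))

res = smf.ols("log_words ~ ida + increase + C(question)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["country"]})
# A dummy coefficient b in a log-linear model implies an approximate
# proportional effect of exp(b) - 1; e.g. 0.15 implies roughly 16% longer.
print(res.params["ida"], np.expm1(res.params["ida"]))
```

The final line shows the conversion used throughout the section: log-point coefficients on dummies are translated into percentage effects via exp(b) - 1.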
As hypothesized, write-ups for IDA countries are significantly longer than for non-IDA countries. The IDA dummy coefficient estimate of 0.15 implies that IDA country write-ups are about 16% longer on average, other things equal. Lengths of write-ups for benchmark and non-benchmark countries do not differ significantly, although the coefficient for the benchmark dummy is positive as hypothesized. Write-ups for ratings increases are also about 16% longer, other things equal. Ratings decreases, on the other hand, are not accompanied by longer justifications. The index of press freedoms, population, per capita income, and the dummy for in-country CMU are also not associated with significantly longer (or shorter) write-ups, contrary to predictions. Coefficients for the regional dummies are interpretable relative to the omitted category, LCR. Other things equal, LCR has the shortest write-ups on average. Coefficients in equation 1.1 are positive for the other five regions, and statistically significant for AFR, EAP and ECA. Relative to LCR, write-ups are longer on average by 19% for AFR, 32% for EAP, and 24% for ECA. Other than LCR, the next-shortest write-ups are for MNA, but the difference between MNA and the other four regions is significant only for EAP. Question effects in equation 1.1 are consistent with intuition. Relative to the omitted category of question 12, write-ups are significantly shorter on average (by about 13%) for question 14, and longer for questions 15 and 16 (by about 15% and 35% respectively). Question 12 has more sub-questions (three) than question 14 (two), but fewer than question 16 (four). Question 15 also has three sub-questions, consistent with the finding that its write-ups are longer than question 12’s but shorter than question 16’s. Until it was revised for the 2011 CPIA, question 15 had four sub-questions. The write-ups each year are not entirely re-written in most cases, but edited from the previous year.
Many of the question 15 write-ups in 2011 still contained information applicable to some of the obsolete sub-questions. This perhaps explains why write-ups for question 15 in 2011 were significantly longer than for question 12, despite the fact that both now include three sub-questions. If equation 1.1 is run separately for each of the four CPIA questions, the IDA dummy is highly significant in question 16, and the dummy for ratings increases is significant for questions 14 and 16. Relative to LCR, 18 of the 20 regional coefficients are positively signed. Six of these 18 positive coefficients are statistically significant at the .05 level. The two negative coefficients belong to MNA, in the regressions for questions 12 and 14, but neither of these two differences with LCR is statistically significant. Differences between LCR and other regions in write-up length are most pronounced for question 15.[7] Equations 1.2 and 1.3 in Table 1 report results for sub-samples of IDA and non-IDA countries respectively. Surprisingly, the marginal effect of ratings increases on word count is more significant and larger for non-IDA (26%) than for IDA countries (11%), although ratings affect resource allocations only for the latter. Differences between LCR and other regions are much smaller for the non-IDA than for the IDA sample. So are question effects. The explanatory power of the model is therefore much smaller for the non-IDA (R2=.18) than for the IDA (.39) sample.

6. Write-ups: Quality Grades

As expected, longer write-ups tend to address more of the criteria in the CPIA questions. Figures 2-5 show the mean, maximum and minimum word counts for each question and quality grade combination. The average word count is consistently higher for higher grades, but there is huge variation in word counts among write-ups for a given question and a given quality grade.
For question 15 (see Figure 4), the average word count is about 560 for write-ups graded “D”, increasing to 690 for “C” grades, 840 for “B”, and 920 for “A”. Despite the positive correlation between write-up length and quality, it is clear that a lengthy write-up is neither necessary nor sufficient to address most of the criteria in the questions. The shortest write-up for question 15 with a grade of “A” is only 380 words, about two-thirds of the average length and one-third of the maximum length of “D”-graded write-ups. No write-up of 1000 or more words for questions 12, 14 or 16 was graded as low as “D”, but at least one write-up of that length was graded “C” for each question. Those write-ups that address all of the criteria in at least a cursory manner can receive the same “A” grade as a 1000-word write-up that addresses all of them in greater depth. However, the message from Figures 2-5 that lengthy write-ups sometimes neglect most of the criteria in the question is consistent with anecdotal accounts and impressions of reviewers that write-ups are often not sufficiently focused or even pertinent. Many write-ups contain extensive descriptions of ongoing efforts to pass legislation or reform procedures in ways that might eventually improve performance on some of the criteria in the question, but without providing any indication of the current level of performance. Average write-up grade differs by question: grades tend to be highest for question 16 and lowest for question 14. Note that, unlike the case with write-up length, these differences cannot be attributed to the number of sub-ratings or criteria contained in the questions.

[7] Results in this paragraph are not shown in tables for space reasons, but are available on request from the author. Disaggregating by region instead of by question adds little of interest to the analysis, in part because some regions (MNA, SAR) have very few countries.
Relatively low grades for question 15 can be attributed at least in part to major revisions in the question prior to the 2011 CPIA exercise. Many question 15 write-ups for 2011 still addressed the old criteria better than they did the new criteria. There is no obvious explanation for the low grades for question 14, on revenue mobilization. Fewer than 1 in 13 write-ups for question 14 received a grade of “A”, compared to more than 1 in 4 for question 16. Regional variations portrayed in Figure 6 are even larger than variations across questions. Only about 1 in 10 grades in EAP countries are “C” or lower, compared to more than one-half in LCR. No EAP write-ups were graded “D”, compared to 1 in 8 of LCR’s write-ups. More than half of EAP write-ups received an “A” grade, compared to only 5.4% in LCR. Table 2 presents multivariate tests of the association of write-up grades with word count and other variables. Equation 2.1 reports results from an ordered probit regression, while equation 2.2 reports an OLS regression for the same model specification. The dependent variable is an ordinal scale with only four categories, so ordered probit is the preferred method. We run OLS as well, however, because its coefficients (unlike the case with probit) are directly interpretable as marginal effects. Those coefficients could be misleading if the two methods generate very different results. However, the t-statistics are remarkably similar in equations 2.1 (ordered probit) and 2.2 (OLS), and the same set of variables are statistically significant in both tests. Moreover, marginal effects computed from binary probit regressions on sub-samples of observations with adjacent grades[8] indicate that little information is lost by assuming linearity and using OLS. For simplicity, the remainder of this section will therefore focus on results from OLS tests. Lengthier write-ups receive significantly higher grades, but quantitatively the average effect is rather modest.
A 1.5-unit increase in the log of word count (more than three standard deviations, or an increase from the mean of 735 words to about 3000 words) is required to produce an increase of one grade level, e.g. from “C” to “B”. Grades are not significantly different for IDA or benchmark countries, or for proposed ratings increases or decreases from the previous year. Grades are significantly higher in larger countries and in those with more press freedoms. In-country CMUs are associated with lower average grades. Grades are lower for LCR than for other regions, and differences between LCR and three regions (EAP, ECA and MNA) are statistically significant. Other things equal, grades are nearly a full level higher on average for EAP than for LCR countries. The mean grade (calculated by assigning scores of 4, 3, 2 and 1 respectively to grades A, B, C and D) is 3.41 for EAP and 2.40 for LCR, for a difference of 1.01. The EAP regression coefficient of .835 in equation 2.2 implies that differences in write-up length and other variables in the model account for less than one-fifth of this full-grade difference between average write-up grades in EAP and LCR. Grades for questions 14 and 15 are significantly lower, other things equal, than for questions 12 and 16. The largest negative coefficient belongs to question 15, the one with criteria that were substantially revised between 2010 and 2011. Equations 2.3 and 2.4 report similar regressions, but for the IDA and non-IDA sub-samples respectively. Results are broadly similar. The word count coefficient is somewhat larger in the non-IDA sample, and the EAP coefficient is somewhat larger for the IDA sample.

[8] Specifically, three probit regressions were run on observations with “A” and “B”, “B” and “C”, and “C” and “D” grades respectively. Marginal effects for most variables (calculated at the mean values of all other regressors) are very similar across the three sub-samples.
The negative effect of in-country CMU is significant only for the non-IDA sample. The large difference in write-up quality between question 12 and questions 14 and 15 observed for the full sample widens further in the IDA sample, but narrows in the non-IDA sample. To summarize, two regions stand out from the other four with respect to average write-up grades, EAP positively and LCR negatively. Although length of write-ups is positively related to quality, it accounts for very little of the difference between EAP and LCR. Somewhat surprisingly, grades are not significantly higher for IDA or benchmark countries, or for ratings changes.

7. Network Review

The network review phase of the CPIA process generates a set of comments on some of the proposed ratings. Many comments express disagreement with proposed ratings, but others simply request a revised write-up that better addresses the criteria in the question. This section analyzes the determinants of both of these types of responses, information requests and disagreements with ratings. It also looks at factors associated with ratings adjustments, i.e. the subset of disagreements resolved in favor of network recommendations. In the 2011 CPIA process, network reviewers requested additional information for 8.1% of all proposed ratings. As shown in Figure 7, there is substantial regional variation in this figure, ranging from a minimum of 1.3% for EAP to a maximum of 10.2% for AFR. Much of this regional variation can be attributed to write-up length and quality grades. Equation 3.1 in Table 3 reports results from a probit regression, where the dependent variable is coded 1 for the 44 country-question observations (8.1% of the sample) where network reviewers requested additional information, and 0 for the other 500 (91.9%). The variable coefficients reported have been transformed to represent the marginal effects of a one-unit increase, evaluated at the mean value of all other independent variables.
Each letter-grade increase in quality of the write-up is associated with a highly significant three percentage point drop in the likelihood that a network requests more information. Controlling for the write-up grade, higher word counts also significantly reduce the probability of a request for more information. Presumably this result reflects the fact that write-up grade is an incomplete measure of quality, and that longer write-ups not only tend to address more of the criteria in the question but also (more often than not) to address them in greater depth. Equation 3.2 presents the reduced-form estimate of the word count effect. When write-up grade is not controlled for in equation 3.2, the word count coefficient is four times as large as in equation 3.1. Note that the explanatory power of the model drops by one-third when grade is omitted: the R2 is .39 in equation 3.1 but only .26 in equation 3.2. Network reviewers appear to devote extra attention to ratings for IDA countries. Other things equal, the probability of an information request is 8 percentage points higher if the proposed rating belongs to an IDA country. For proposed increases in a rating, the impact is an even larger 18 percentage points. Reviewers were also significantly more likely to request additional information when a ratings decrease was proposed, but this marginal effect is only 6 percentage points, one-third as large as for a ratings increase. While proposed changes generate more reviewer requests, higher ratings levels do not. Conceivably, reviewers might want to scrutinize more closely any proposals for a high rating, whether or not it represents a change. However, no support is found in the data for this conjecture. Benchmark countries are also not associated with more frequent information requests.
Reviewers may have more time in the benchmark phase to identify informational shortcomings in the write-ups, but country teams may also have more time in that phase to produce more complete write-ups. Given these countervailing forces, the absence of a significant benchmark effect is not surprising. Regional differences in information requests are small in equation 3.1, controlling for write-up grades. In equation 3.2, where write-up grade is not controlled for, the likelihood of information requests differs trivially among AFR, ECA, LCR and MNA, but relative to those four is significantly lower in EAP and SAR (by about 4 percentage points). Information requests are significantly more frequent for question 15 (in equation 3.2), the question which experienced the most substantial revisions to its criteria leading into the 2011 CPIA exercise. Controlling for the lower average quality grade of question 15 write-ups in equation 3.1, however, this difference is not significant. Network reviewers disagreed with the regions’ proposals in 7.9% of cases. As Figure 8 shows, there is again large variation across regions, from a minimum of 2.6% for EAP to a maximum of 17.4% for ECA. Most of this variation turns out to be attributable to differences in the number of proposed ratings increases; the frequency of ECA’s proposed increases (19.6%) is more than double that of the five other regions collectively (9.1%). Table 4 analyzes in detail the determinants of networks’ propensity to disagree with proposed ratings. The dependent variable, “disagreement,” is coded 1 for the 42 cases where network reviewers expressed disagreement with a proposed rating and recommended a different rating, and is coded 0 for the other 502 observations. Of these 42 cases, 31 represent proposed increases, 1 a proposed decrease, and 10 were unchanged from the previous year. There were 59 proposed increases in total, so networks disagreed with a slight majority of them.
They disagreed with only 3% (1 of 32) of decreases, and 2% (9 of 453) of unchanged ratings.[9] Network reviewers are less likely to take issue with a proposed rating when it is accompanied by a more thorough write-up. As shown in equation 4.1, each one-grade increase in write-up grade is associated with a (statistically significant) reduction of 2.2 percentage points in the probability of a disagreement. By far the most important predictor of disagreement is the ratings increase dummy. The likelihood of disagreeing rises by 45 percentage points, other things equal, when a ratings increase is proposed. Regional differences are small. The omitted category is ECA. Coefficients for the other five regions are negative, but small in magnitude (1.5 percentage points or less). Only the MNA coefficient is (borderline) significant. Question effects are more substantial. Other things equal, reviewers were significantly less likely to disagree with ratings for questions 14 and 15 than for question 12. The substantial revisions to the question 15 criteria leading into the 2011 CPIA exercise may be responsible for lower write-up quality grades (Table 2) and thus for more requests for additional information in the review process (Table 3, equation 3.2). But results in Table 4 suggest that they did not produce more frequent disagreements over ratings. Equation 4.2 excludes the ratings increase dummy from the model. Most notably, the explanatory power of the model plunges from .52 to only .23. Also, regional effects become much more strongly negative (relative to ECA) and more significant in equation 4.2, where we are not controlling for the fact that ECA proposes a far greater number of ratings increases than the other regions. This finding is consistent with IEG (2010), which reports that network reviewers disagreed more often with ECA’s proposed ratings, in its analysis of all 16 questions for the 2007 CPIA.
Word count of write-ups is not significant in equation 4.1, but in equation 4.2, where ratings increases are not controlled for, it is associated with an increase in the likelihood of disagreement over ratings. Regional staff may anticipate network disagreement with more dubious proposals for an increase, and provide longer write-ups in attempting to justify them. Ratings level is also significant in equation 4.2, but not in equation 4.1 where proposed increases are controlled for. When ratings level is dropped from the model in equation 4.3, the R2 declines even further (from .23 to .17), and regional differences are further accentuated. Taken together, results in equations 4.1-4.3 indicate that the greater propensity of network reviewers to disagree with ECA ratings is mostly attributable to the fact that this region, compared to the other five, proposes more high ratings and more increases. Results are broadly similar for the sub-sample of IDA countries. In equation 4.4, which replicates equation 4.1 for that sub-sample, the coefficient on proposed increases remains highly significant, but is somewhat smaller in magnitude. For the IDA sample, in-country CMUs are associated with a significantly lower probability of disagreement.

[9] In several other cases, network reviewers recommended increasing sub-ratings, but with no implications for the overall question rating.

Table 5 analyzes ratings “adjustments.” For these probit regressions, country-question observations are coded 1 if the final, official rating for 2011, set following the review process, differs from the one proposed by the regions and sent to the networks and central units for review. “Adjustment” is coded 0 if and only if the final rating is the same as the one proposed by the region. There were 36 adjustments, out of the 42 instances in which network reviewers had disagreed with the proposed rating.
As in the IEG’s (2010) analysis of 2007 ratings, therefore, in most cases of disagreement the network recommendations prevailed in 2011. By this measure, the network review appears to have a significant influence on the ratings. Presumably, the neutral role and cross-regional perspectives of the networks serve to improve the accuracy of the ratings overall, even if there is no way of knowing what the “true” rating should be in each instance of ratings disagreement or adjustment. Because most adjustments are downward rather than upward, the network reviews also help counteract ratings inflation over time, although again there is no way of knowing the “true” time trend of average ratings for each question. The probability of an adjustment is about 5 percentage points higher for ECA’s ratings than for the other regions. Equation 5.1 of Table 5 presents regional and question effects, without controlling for any other variables. Ratings for question 16 are significantly more likely (again, by about 5 percentage points) to be adjusted than those for the other three questions. Equation 5.2 controls for the effects of proposed increases and other variables on the likelihood of ratings adjustments. Proposed increases are the most powerful predictor of ratings adjustments. Higher ratings levels and lower write-up grades are also associated with an increased likelihood that the final rating is different from the one proposed by the region. The regional effects in equation 5.1 largely disappear in equation 5.2: controlling for ECA’s propensity to propose more ratings increases, its ratings are no more likely to be adjusted than those of other regions. Ratings for questions 14 and 15 are significantly less likely to be adjusted than those for questions 12 and 16.

8. Testing for Regional or IDA Bias

As IEG (2010: 50) notes, the reviews conducted by the six regional Chief Economists’ offices are intended mostly to correct for over-exuberance regarding individual countries’ ratings, but “there could still be issues” regarding inter-regional comparability “even if the relative rankings of countries are adjusted” appropriately within each intra-regional review. Gelb, Ngo and Ye (2004) test for an Africa bias in the CPIA, but do not test for any other regional biases. Specifically, they regress CPIA cluster D scores on a simple average of the six Worldwide Governance Indicators (WGI) indexes and on a dummy for Sub-Saharan African countries. The CPIA-WGI relationship is very strong and significant, but the Africa dummy is insignificant. Similarly, they report no Africa bias in regressing either cluster D or cluster B scores on the Heritage Foundation’s Economic Freedom Index, or in regressing cluster C scores on the UNDP’s Human Development Index. In attempting to shed light on the validity of CPIA ratings, IEG (2010) reports correlations between some of the CPIA ratings and other related indicators. It does not test for possible regional biases, however. Nor does IEG (2010) test for an IDA country bias, despite noting repeatedly that the use of CPIA scores for IDA allocations created incentives for country teams to inflate ratings. If it is common for regional staff to inflate ratings proposals for the purpose of increased IDA allocations, then we would expect to observe a positive, significant coefficient on an IDA country dummy included in regressions of CPIA ratings on related indicators. There is never perfect conceptual overlap between any single CPIA question (or cluster of questions) and related indicators produced by other organizations. The lack of a perfect or even close relationship empirically does not necessarily indicate that the CPIA (or the related indicator) is invalid.
For any individual country, a higher (or lower) ranking in the CPIA than on the related indicator could reflect less-than-perfect conceptual overlap, or measurement error in the CPIA, or measurement error in the related indicator. For a large group of countries, however, it is more difficult to dismiss systematic discrepancies between the CPIA and related indicators that tend to favor IDA countries, or countries from a particular region. If CPIA ratings are significantly higher than predicted – based on the values of a related indicator – for IDA countries or for an entire region, it cannot plausibly be attributed to random measurement error in the full sample of more than 130 countries covered by the CPIA. Conceivably, a source of comparator data may contain its own regional biases that would contaminate tests for bias in CPIA. For this reason our tests use a range of related indicators, from several largely independent data sources. It is highly unlikely that these sources would all exhibit the same regional biases. Table 6 presents a pair of OLS regressions for each of the four cluster D questions (12, 14, 15 and 16). The dependent variable in the first of each pair is the region’s proposed rating in the 2012 CPIA exercise, and in the second of each pair is the final 2012 rating. Independent variables include a conceptually related “comparator� indicator, log of per capita income, an IDA dummy, and an ECA dummy. Preliminary tests showed that ECA’s coefficients were consistently larger (more positive) than those for other regions and variations among the other five regions were small. We therefore test only one regional dummy in Table 6, focusing on the issue of whether ECA ratings are significantly higher than those of countries in the other regions combined, controlling for their ratings on comparator indicators and per capita incomes. 
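The bias test just described can be sketched on simulated data. The variable names, effect sizes, and data below are illustrative assumptions, not the paper's actual comparator indicators or ratings.

```python
# Sketch of the Table 6 bias test on simulated data: regress a CPIA-style
# rating on a comparator indicator, log per capita income, an IDA dummy,
# and an ECA dummy. A significant dummy coefficient would signal a
# systematic deviation from what the comparator predicts.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 136
df = pd.DataFrame({
    "comparator": rng.normal(0, 1, n),
    "log_income": rng.normal(7.5, 1.0, n),
    "ida": rng.integers(0, 2, n),
    "eca": (rng.random(n) < 0.2).astype(int),
})
# Built-in regional bias of about a quarter point and no IDA bias,
# mimicking the pattern the paper reports for proposed ratings.
df["rating"] = (3.0 + 0.5 * df["comparator"] + 0.27 * df["eca"]
                + rng.normal(0, 0.3, n))

res = smf.ols("rating ~ comparator + log_income + ida + eca", data=df).fit()
# A positive, significant ECA coefficient alongside a near-zero IDA
# coefficient is the signature of a regional (but not IDA) bias.
print(res.params[["ida", "eca"]])
```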
The regressions within each pair are identical other than using proposed ratings and final ratings as alternative dependent variables. Reporting these tests side-by-side shows the extent to which any IDA or regional biases in proposed ratings are dampened or eliminated by the network review.[10] Equations 6.1 and 6.2 analyze question 12 ratings. The related comparator indicator is an index of “guidepost indicators” constructed for purposes of the network review, and designed to match the criteria in question 12 as closely as possible.[11] The guidepost index is positively and significantly related to question 12 ratings. The IDA dummy coefficient is very small and insignificant in both equations. Based on this finding, the large weight assigned to CPIA governance questions in the IDA allocation formula does not appear to be distorting the ratings. The ECA coefficient is positive and highly significant in both equations 6.1 and 6.2. Question 12 proposed ratings for ECA countries are more than one-fourth of a point higher than predicted from their incomes and guidepost index scores (equation 6.1). Following the network review, this positive ECA effect is basically unchanged, increasing from .275 to .276. Equations 6.3 and 6.4 analyze question 14 ratings, on revenue mobilization. The two related regressors are from the “Paying Taxes” component of the Doing Business project. Tax rates paid by a “typical firm” pertain to the first sub-rating in question 14, on tax policy. The number of distinct tax payments that a “typical firm” must make pertains mostly to the second sub-rating, on tax administration. Both variables are significantly and negatively associated with question 14 ratings, as expected.

[10] This method captures only the immediate direct effects of the network review. From a longer-term perspective, the existence of the reviews can deter regions from proposing ratings that are higher than justified.
The effects are quantitatively small, however: a two standard deviation increase in both the tax rate (equal to 80% of firm profits) and the number of tax payments (equal to 40) would be required to increase the question 14 rating by one-half point. Moreover, these two indicators do not address many of the criteria in question 14. Accordingly, the R2 in equations 6.3 and 6.4 is much lower than in the other regressions reported in Table 6. As in equations 6.1 and 6.2, the IDA coefficient is negative but insignificant. The ECA coefficient is positive and marginally significant in equations 6.3 and 6.4. Proposed ratings for ECA countries are nearly one-quarter of a point higher than predicted on average. For the final ratings, this effect shrinks to one-fifth of a point. The review process appears to reduce the bias somewhat for question 14. Question 15’s proposed and final ratings are the dependent variables in equations 6.5 and 6.6, respectively. These regressions control for an index of two related indicators from the Economist Intelligence Unit’s (EIU) country risk ratings. One is on bureaucratic quality (including meritocracy), and the other is on red tape encountered in dealing with the government bureaucracy. This EIU bureaucracy index is very strongly related to question 15 ratings. The IDA dummy coefficient is again very small and insignificant in equations 6.5 and 6.6. The ECA coefficient is positive and significant at the .01 level in both regressions. The average question 15 proposed ratings for ECA countries are more than one-third of a point higher than predicted. The network review process has only a minimal impact on this bias, as the coefficient declines from .363 in equation 6.5 to .355 in equation 6.6.[12] Finally, equations 6.7 and 6.8 test for IDA and regional biases in question 16 ratings.
The comparator variable is an index of guidepost indicators, constructed to match as closely as possible the criteria in question 16.[13] Again, the IDA dummy is not significant, but the ECA dummy is positive and significant at the .05 level in both regressions. The average ECA country proposed rating (equation 6.7) is about one-fifth of a point higher than predicted. Following the network review process, the magnitude of the bias is reduced slightly, from .192 to .182.[14] The network review has the effect of strengthening (albeit modestly) the relationship between CPIA ratings and comparator indicators. The coefficient on the comparator indicator is slightly greater in the second regression within each pair, for questions 12, 15 and 16. This result is not surprising, as the network reviewers make use of these indicators (as well as other quantitative and qualitative information) in formulating their ratings recommendations. Table 7 reports regressions similar to those for the even-numbered equations in Table 6, i.e. using the final rather than proposed ratings as dependent variables. They differ, however, by specifying dummy variables for the other five regions, with ECA as the base category. The regional dummy coefficients in this table thus show pair-wise comparisons between ECA and any other single region. Of the 20 regional dummy coefficients in Table 7, 19 are negative (all but the one that is smallest in absolute value), and 8 are statistically significant.

[11] The index includes 19 variables from five sources: the World Economic Forum’s (WEF) “Executive Opinion Survey,” the International Country Risk Guide (ICRG), the Economist Intelligence Unit (EIU), Freedom House, and the Heritage Foundation. Data are all from 2012. The ICRG and EIU ratings are updated monthly, and the others annually.
[12] Results are very similar if we replicate this exercise substituting the bureaucratic quality indicator from the International Country Risk Guide for the EIU indicators.
Differences with ECA are smaller for SAR than for the other four regions. The largest single coefficient is for the MNA dummy in equation 7.6: ECA ratings on question 15 are more than one-half point higher than MNA's, controlling for the EIU comparator indicator and other factors. The results in Table 6 are reassuring in finding no evidence at all that the CPIA governance ratings incorporate an IDA country bias. These findings do not definitively rule out the possibility of an upward bias in the ratings, but any such bias must apply equally to non-IDA countries. An IDA bias does not even show up in the regions' proposed ratings, suggesting that the regions' own internal review procedures are successful in deterring or purging any IDA bias in their ratings before they are forwarded to central units for review. There is evidence, however, of a significant regional bias. Coefficients for the ECA dummy are positive and at least marginally significant in all eight regressions. A comparison of proposed and final ratings indicates that the network review has only a modest impact in curbing this bias. The mean bias over the four questions is slightly more than one-fourth of a point (0.265) in the proposed ratings, and it is reduced only by about 4% (to 0.254) in the final ratings. If reform progress is spatially correlated and the guidepost indexes reflect lagged information, then a positive and significant regional dummy coefficient in these tests would not necessarily reflect ratings bias. The CPIA ratings might simply reflect more up to date information about progress in one region relative to the other five.
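The bias tests in Table 6 amount to regressing each rating on a matched comparator index plus IDA and regional dummies. The following is a minimal sketch of that specification on synthetic data; all numbers are invented for illustration (they are not the CPIA data), and `guidepost`, `ida` and `eca` are stand-ins for the paper's variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data mimicking the Table 6 setup (all numbers are made up):
# a CPIA-style rating driven by a comparator index, with a small extra
# premium for one region ("ECA") that the dummy should pick up.
n = 130
guidepost = rng.normal(0.0, 1.0, n)        # comparator guidepost index
ida = rng.integers(0, 2, n)                # IDA-eligibility dummy
eca = (rng.random(n) < 0.2).astype(float)  # regional dummy
rating = 3.0 + 0.9 * guidepost + 0.25 * eca + rng.normal(0.0, 0.3, n)

# OLS via least squares: columns are constant, index, IDA, ECA.
X = np.column_stack([np.ones(n), guidepost, ida, eca])
beta = np.linalg.lstsq(X, rating, rcond=None)[0]
# beta[3] estimates the "ECA effect": the average gap between ECA ratings
# and what the comparator index predicts. beta[2] is the IDA coefficient,
# which should be near zero here because no IDA premium was built in.
print(beta.round(2))
```

If a region's ratings systematically exceed what the comparator index predicts, the regional dummy absorbs the gap; this is exactly how the “ECA effect” is read off the Table 6 coefficients.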
However, the network reviews conduct regional comparisons every year similar to those reported in Table 6, and a similar “ECA effect” has been present for at least several years. It seems unlikely that an information lag for the EIU, WEF and other sources relative to the CPIA would persist for more than a few years at most. Not all “expert” ratings are necessarily equal: some sources may have less accurate or up to date information than others on governance and public sector reform in ECA. One source that specializes in ECA countries (with the exception of Turkey) is the annual Freedom House “Nations in Transit” (NIT) report. This report provides detailed country narratives as well as quantitative indicators designed to be comparable over time. The NIT produces ratings on three indicators pertaining to questions 12 and 16 in the CPIA: “judicial framework and independence,” “corruption,” and “independent media.”[15] In its 2012 edition, closely corresponding to the 2011 CPIA in its timing, only 2 of the ratings were upgraded (for countries also included in the CPIA), and 10 were downgraded. In contrast, the region proposed 12 increases and 0 decreases for question 12 and 16 ratings.

[13] The index includes 20 variables from five sources: the World Economic Forum's (WEF) “Executive Opinion Survey,” the International Country Risk Guide (ICRG), Economist Intelligence Unit (EIU), Freedom House, and Reporters Without Borders. Data are all from 2012. The ICRG and EIU ratings are updated monthly, and the others annually.
[14] Results for questions 12 and 16 in the table are very similar if the 2011 Worldwide Governance Indicators (WGI) “Rule of Law,” “Control of Corruption” and “Voice and Accountability” indexes are substituted for the guidepost indexes. The WGI indexes include more sources than the guidepost indexes, but are not designed to match the content of the CPIA criteria and contain less up to date information.
In its 2013 edition, the NIT upgraded 6 ratings on these questions but downgraded 13 others, for countries covered by the CPIA. The source of comparator indicators that is arguably the best informed about ECA thus appears to disagree markedly with Bank staff on the extent of improvements in governance and public sector institutions. The ECA region comprises several distinct sub-regional groupings, ranging from new EU members to Central Asian republics. We therefore experimented with different sub-regional dummies, to determine which group or groups might be driving the positive ECA effect found in Table 6. The group with the largest positive bias (i.e. CPIA ratings higher than predicted by comparator indicators) turns out to be the EU accession (Croatia, entering in July 2013) and candidate (Macedonia, Montenegro, Serbia and Turkey) countries. Accession and candidacy should be associated with improvements in governance, so this result may not be surprising. Although the three ECA countries in the 2012 CPIA that joined the EU earlier (Poland, Bulgaria and Romania) exhibit less of a positive bias, it could be that sources of comparator indicators have had more time to learn about their governance improvements, and to catch up with any information advantage temporarily possessed by Bank staff and reflected in their CPIA assessments. Conversely, the comparator indicators may be lagging, relative to the CPIA, with respect to information on improvements in the five accession and candidate countries. However, the Freedom House NIT reports are the least likely of the comparator indicators to contain lagging information, and arguably should be as accurate and up to date as the CPIA on the issues they cover. In the four accession and candidate countries covered by the NIT, it downgraded one rating (Macedonia on “judicial framework and independence”) and upgraded none in its 2013 report.
In 2012, it upgraded one and downgraded one, and in 2011 it upgraded two and downgraded two. The NIT's ratings are thus inconsistent with the view that governance in these accession and candidate countries is improving so rapidly that assessments by less specialized sources of comparator indicators are hopelessly lagging.

[15] These NIT indicators are not used as components of the “guidepost indexes” for questions 12 and 16 used in Table 6, because the NIT does not cover countries in any of the other 5 regions.

In any event, while the positive “ECA effect” demonstrated in Table 6 is more attributable to the accession and candidate countries than to other sub-groups in the region, it is not solely attributable to them. Even if these 5 countries are dropped entirely from the Table 6 regressions, the ECA coefficient remains positive and is still statistically significant in many of the regressions. If ratings disagreements are usually resolved in favor of the networks' recommendations, one might ask why the positive “ECA effect” observed in Table 6 diminishes only slightly between the ratings proposed before the network review and those finalized after it. There are two likely explanations. First, in borderline cases the networks tend to defer to the regions' views; ECA is by far the most assertive region in terms of proposed increases, and many of these “extra” increases represent borderline cases. Second, the ECA coefficient reflects a cumulative effect over many years, and the reviews in any one year mostly (but not entirely) focus on newly-proposed increases.

9. Implications and Recommendations

This analysis of the validity of the CPIA governance questions finds significant regional differences in the coverage of criteria by the written justifications accompanying the regions' proposed CPIA ratings. It shows that the length of the write-ups has steadily increased over time, with little sign of leveling off.
Although write-ups with higher quality grades are also longer on average, there is wide dispersion in the word count for any given grade, and some long write-ups provide little relevant information. The analysis also examined network reviewer responses to ratings proposals, as a function of the quality and length of write-ups. Higher grades are associated with a lower likelihood that central unit reviewers will either disagree with proposed ratings or request additional information to assess them. Controlling for grades, longer write-ups are actually associated with a greater probability that central reviewers will disagree with a proposed rating. If a region wishes to avoid time-intensive back-and-forth exchanges with the central units over its ratings proposals, it should therefore provide relevant and thorough but reasonably concise write-ups to support them. Using related “comparator” indicators from other sources, we find no evidence of a pro-IDA country bias in the ratings. This is a striking finding, given the central role of country teams in the CPIA process and the fact that aid allocations are highly sensitive to ratings on the governance questions. However, we find a significant upward bias in ratings for one region (ECA), and show that it is only slightly reduced by the network review process. This study's findings have several implications for improvements in the CPIA process. First, although the write-ups justifying the proposed ratings are a major strength of the process, there is room for further improvement. The write-ups continue to increase in volume, but much of the incremental information may be of only limited relevance. In the most extreme case, one write-up of well over 1,000 words neglects to address nearly all of the criteria in the question. Moreover, there are enormous differences across the regions in how well the write-ups cover the criteria.
Only one tenth of EAP write-ups are graded “C” or below (meaning they address less than one half of the criteria in the question), compared to over one half for LCR. Strengthening the guidance provided by central units to regional staff regarding the cluster D questions could help somewhat in steering the write-ups toward provision of more pertinent information. For example, certain data sources (such as the World Bank's own Enterprise Surveys) are under-utilized, while others (such as TI's Corruption Perceptions Index) are often cited in ways that do not provide meaningful comparative information across countries or over time. However, the payoff to strengthening guidance notes may be limited, as it is unclear whether many of the relevant staff members currently make use of the existing guidance. Second, the network reviews could also more assertively request additional information in a much larger number of cases. The network with the lead role in reviewing question 13 (on budgetary management) employed this strategy with some success over the course of several years. It might be overly ambitious to implement this approach simultaneously for all four cluster D questions for which the Public Sector Governance Group has the lead review role. However, it might be feasible to implement it more gradually, e.g. focusing initially on the question with the lowest average quality grades (question 14, on revenue mobilization), or starting with the grade “D” cases (where most criteria go unaddressed) for all four questions and moving on the following year to the “C” cases. Third, the network reviews should be more assertive about questioning proposed ratings even when no increase is being proposed, and in borderline cases. Reviews that are overly focused on rating changes produce a status quo bias in the ratings, and reduce the scope for eliminating regional or other biases when they appear in the ratings.
Finally, the process would benefit from increased checks and balances within each region, including an enhanced role in at least some regions for sector specialists. The most striking results from the analyses above involve regional comparisons. On quality of write-ups, EAP is the positive outlier, and could serve as a useful model for other regions to follow. In contrast, LCR is a negative outlier as the region with the least pertinent write-ups. On realism of proposed ratings, ECA stands out as the region that appears to be most over-optimistic in its assessments. The network reviews can play a useful role in minimizing any regional or other biases that emerge in the ratings, as well as in alerting regional staff to the need to improve the quality of write-ups in more instances. The regional chief economists' offices arguably can achieve the same or better results, however, at lower cost[16] and via interactions with regional colleagues that country teams might find more credible. According to IEG (2010: 49), interviews with regional reviewers suggested that some of these regional chief economists' offices play a much more active role than others.[17] As merely one small example, these offices could provide their own guidance on useful region-specific information sources. More importantly, in those regions where sector specialists have a relatively minimal role in the process, the regional chief economists' offices should take steps to enhance their role.

[16] Disagreements by network reviewers typically occur after in-depth analysis of individual outliers identified from cross-country statistical analyses. Conducting these analyses and crafting the written arguments to challenge 42 proposed ratings is a fairly time-intensive task for network reviewers. If other regions were as assertive as ECA in proposing increases, and network reviewers disagreed with them at the same rate, there would have been about 75 disagreements instead of 42.
This study is not intended to represent the last word on the validity and reliability of the CPIA ratings. Assessments of the quality of write-ups could also be broadened and deepened. Another important limitation of this study is that it does not analyze changes in write-up length, criteria coverage or ratings over time. Adding a time dimension to the analysis would also make it possible to test other independent variables, such as turnover of key staff involved in the CPIA process. Results of such a study would be useful in assessing the validity of CPIA time series data. Although the CPIA is intended primarily for comparing countries to each other at a point in time, analysts are often tempted to use it for assessing trends (e.g. World Bank, 2006). Use of the data in this way is premised on the untested assumption that year-to-year changes primarily reflect real progress (or deterioration). Alternatively, a large proportion of changes over time may merely reflect belated corrections to incomplete information, or inferences from changes in economic performance, or even changes in the identity of staff responsible for the ratings. Finally, further research could also add to the debates on the appropriate content of the CPIA and on the appropriate question or cluster weighting in the IDA allocation formula. Analyses of the validity of the other three CPIA clusters could usefully complement this one. Results for cluster D questions – which are more subjective, and count more for IDA allocations – will not necessarily generalize to clusters A (macro policies), B (structural policies) and C (social sector and environmental policies).

[17] It should be noted, however, that regional differences likely go well beyond the identity of staff in the RCE offices. For example, the ECA region has been easily the most assertive region in proposing increases for a period encompassing the tenure of at least two RCEs.

REFERENCES

Cage, Julia, 2009.
“Growth, Poverty Reduction and Governance in Developing Countries: A Survey.” CEPREMAP Working Papers 0904, CEPREMAP.

Collier, Paul and David Dollar, 2002. “Aid Allocation and Poverty Reduction.” European Economic Review 46: 1475-1500.

Denizer, Cevdet, Daniel Kaufmann and Aart Kraay, 2011. “Good Countries or Good Projects? Macro and Micro Correlates of World Bank Project Performance.” World Bank Policy Research Working Paper 5646. Washington, DC: The World Bank.

Dollar, David and Victoria Levin, 2006. “The Increasing Selectivity of Foreign Aid, 1984-2003.” World Development 34(12): 2034-46.

Gelb, Alan, Brian Ngo and Xiao Ye, 2004. “Implementing Performance-Based Aid in Africa: The Country Policy and Institutional Assessment.” Africa Region Working Paper Series No. 77. Washington, DC: The World Bank.

IEG (Independent Evaluation Group), 2010. The World Bank's Country Policy and Institutional Assessments: An IEG Evaluation. Washington, DC: The World Bank.

Kanbur, Ravi, 2005. “Reforming the Formula: A Modest Proposal for Introducing Development Outcomes in IDA Allocation Procedures.” Revue d'Économie du Développement: 79-99.

Knack, Stephen, 2013. “Building or Bypassing Recipient Country Systems: Are Donors Defying the Paris Declaration?” World Bank Policy Research Working Paper 6423. Washington, DC: The World Bank.

Knack, Stephen, 2009. “Sovereign Rents and Quality of Tax Policy and Administration.” Journal of Comparative Economics 37(3): 359-71.

Knack, Stephen, F. Halsey Rogers and Nicholas Eubank, 2011. “Aid Quality and Donor Rankings.” World Development 39(19): 1907-17.

OECD, 2011. Aid Effectiveness, 2005-10: Progress in Implementing the Paris Declaration. Paris: OECD.

Steets, Julia, 2008. “Adaptation and Refinement of the World Bank's Country Policy and Institutional Assessment (CPIA).” Global Public Policy Institute, on behalf of the German Federal Ministry for Economic Cooperation and Development.

World Bank, 2011a.
Global Monitoring Report 2011: Improving the Odds of Achieving the MDGs. Washington, DC: World Bank.

World Bank, 2011b. “Country Policy and Institutional Assessment 2011: Assessment Questionnaire.” Washington, DC: World Bank.

World Bank, 2010. “IDA's Performance Based Allocation System: Review of the Current System and Key Issues for IDA16.” Washington, DC: World Bank.

World Bank, 2007. “Selectivity and Performance: IDA's Country Assessment and Development Effectiveness.”

World Bank, 2006. Global Monitoring Report 2006: Strengthening Mutual Accountability, Aid, Trade and Governance. Washington, DC: World Bank.

World Bank, 2004. “Country Policy and Institutional Assessment: An External Panel Review.” Washington, DC: World Bank.

Table 1  Word count in write-ups

Equation                      1.1 (All)          1.2 (IDA)          1.3 (non-IDA)
IDA                           0.151** (1.97)
Benchmark country             0.064 (0.88)       0.111 (1.24)       0.020 (0.22)
Increase proposed             0.153*** (2.97)    0.098 (1.63)       0.228** (2.56)
Decrease proposed             0.012 (0.15)       0.091 (0.82)       -0.039 (-0.36)
Freedom House press freedoms  -0.003 (-1.56)     -0.002 (-1.18)     -0.002 (-0.65)
Log of population             0.001 (0.07)       0.017 (0.70)       -0.010 (-0.48)
Log of per capita GNI         -0.035 (-0.93)     -0.047 (-0.85)     -0.024 (-0.37)
CMU in country                0.075 (0.99)       0.112 (0.89)       0.072 (0.85)
Sub-Saharan Africa            0.174** (2.30)     0.201* (1.71)      0.239** (2.28)
East Asia & Pacific           0.274*** (4.11)    0.371*** (3.83)    0.259** (2.44)
East Europe & Central Asia    0.213** (2.56)     0.429*** (3.23)    0.095 (0.90)
Middle East & North Africa    0.071 (0.76)       0.378** (2.12)     0.027 (0.25)
South Asia                    0.182 (1.15)       0.199 (1.19)
Question 14                   -0.142*** (-3.54)  -0.169*** (-3.56)  -0.104 (-1.46)
Question 15                   0.142*** (3.36)    0.175*** (3.48)    0.096 (1.29)
Question 16                   0.303*** (8.19)    0.413*** (9.07)    0.154*** (2.66)
Constant                      1.910 (3.68)       1.766 (2.65)       2.003 (2.64)
No. of observations           544                316                228
No. of countries              136                79                 57
R2                            .32                .39                .18

Dependent variable is log of number of words in question write-up. Unit of analysis is country-question.
Omitted region is Latin America and Caribbean. Omitted question is 12. T-statistics, reported in parentheses below point estimates, are based on standard errors adjusted for non-independence of errors within regional clusters of observations, with *** p<0.01, ** p<0.05, * p<0.1.

Table 2  Coverage of criteria in write-ups

Equation                      2.1                2.2                2.3                2.4
Method                        Ordered probit     OLS                OLS                OLS
Sample                        All                All                IDA                Non-IDA
Word count (log)              1.143*** (8.02)    0.672*** (8.44)    0.655*** (5.21)    0.734*** (6.28)
IDA                           0.082 (0.43)       0.039 (0.35)
Benchmark country             0.061 (1.45)       0.036 (0.42)       0.036 (0.30)       0.130 (0.87)
Increase proposed             0.005 (0.03)       0.002 (0.02)       0.105 (0.98)       -0.235 (-1.50)
Decrease proposed             0.225 (0.92)       0.133 (0.92)       0.010 (0.05)       0.161 (0.89)
Freedom House press freedoms  0.009*** (2.68)    0.006*** (2.78)    0.006* (1.86)      0.005** (2.07)
Log of population             0.139*** (3.04)    0.083*** (3.01)    0.075** (2.09)     0.074* (1.92)
Log of per capita GNI         0.078 (0.83)       0.045 (0.79)       0.002 (0.03)       0.165* (1.69)
CMU in country                -0.379** (-2.22)   -0.224** (-2.14)   -0.074 (-0.46)     -0.338** (-2.56)
Sub-Saharan Africa            0.233 (1.37)       0.158 (1.51)       0.190 (1.17)       0.067 (0.50)
East Asia & Pacific           1.440*** (5.73)    0.835*** (6.06)    0.984*** (5.06)    0.760*** (3.81)
East Europe & Central Asia    0.371* (1.92)      0.233** (1.97)     0.348* (1.87)      0.201 (1.46)
Middle East & North Africa    0.589** (2.27)     0.359** (2.29)     0.400 (1.18)       0.372* (1.92)
South Asia                    0.519 (1.44)       0.313 (1.45)       0.307 (1.32)
Question 14                   -0.399*** (-3.51)  -0.241*** (-3.44)  -0.327*** (-3.41)  -0.120*** (-1.12)
Question 15                   -0.509*** (-3.44)  -0.311*** (-3.55)  -0.538*** (-5.17)  0.013 (0.09)
Question 16                   0.102 (0.93)       0.049 (0.78)       -1.08 (-1.31)      0.273*** (2.66)
Constant                                         -0.510 (-0.68)     0.007 (0.01)       -1.607 (-1.50)
No. of obs., countries        544, 136           544, 136           316, 79            228, 136
Pseudo R2 or R2               .17                .33                .40                .31

Dependent variable is “Grade” (A, B, C or D) indicating coverage of criteria in question. Unit of analysis is country-question. Omitted region is Latin America and Caribbean.
Omitted question is 12. T-statistics, reported in parentheses below point estimates, are based on standard errors adjusted for non-independence of errors within regional clusters of observations, with *** p<0.01, ** p<0.05, * p<0.1.

Table 3  Network review (probit regressions)

Equation                      3.1                3.2
Dependent variable            Request info       Request info
Proposed rating (1-6 scale)   -0.001 (-0.14)     -0.006 (-0.52)
Grade (coverage of criteria)  -0.029*** (-5.60)
Word count (log)              -0.017** (-2.03)   -0.069*** (-3.72)
IDA                           0.083*** (5.33)    0.113*** (4.55)
Benchmark country             -0.005 (-0.64)     -0.011 (-0.88)
Increase proposed             0.179*** (4.61)    0.219*** (4.52)
Decrease proposed             0.058** (2.17)     0.067* (1.76)
Freedom House press freedoms  0.0002 (0.97)      0.0003 (0.60)
Log of population             0.007*** (2.65)    0.006 (1.42)
Log of per capita GNI         0.010** (1.96)     0.019* (1.70)
CMU in country                0.007 (0.77)       0.036 (1.61)
Sub-Saharan Africa            -0.002 (-0.22)     -0.006 (-0.32)
East Asia & Pacific           -0.013 (-1.44)     -0.040** (-2.43)
East Europe & Central Asia    0.007 (0.57)       0.001 (0.05)
Middle East & North Africa    0.015 (0.70)       0.002 (0.07)
South Asia                    -0.014*** (-2.71)  -0.035*** (-2.99)
Question 14                   0.004 (0.35)       0.030 (1.16)
Question 15                   0.013 (1.01)       0.071*** (2.61)
Question 16                   0.013 (1.15)       0.008 (0.44)
No. of obs., countries        544, 136           544, 136
Pseudo R2                     .39                .26

Dependent variable is “Request,” coded 1 if either the PREM or CFP network reviews requested that additional information be provided in a revised write-up, and 0 otherwise. Omitted region is Latin America and Caribbean. Omitted question is 12. Coefficients represent marginal effects evaluated at mean of other independent variables. T-statistics, reported in parentheses below point estimates, are based on standard errors adjusted for non-independence of errors within regional clusters of observations, with *** p<0.01, ** p<0.05, * p<0.1.
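The table notes state that the probit coefficients are reported as marginal effects evaluated at the means of the other regressors. For a probit, that quantity is the standard normal density at the fitted index times the coefficient, φ(x̄′β)·βk. A minimal sketch with made-up coefficients and covariate means (illustrative numbers only, not estimates from the paper):

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def probit_marginal_effect(beta, x_bar, k):
    """Marginal effect of regressor k in a probit model, evaluated at the
    covariate means: d Pr(y=1)/d x_k = phi(x_bar . beta) * beta_k."""
    index = sum(b * x for b, x in zip(beta, x_bar))
    return normal_pdf(index) * beta[k]

# Hypothetical values (invented for illustration):
beta = [-1.2, 0.8, -0.3]   # constant, IDA dummy, coverage grade
x_bar = [1.0, 0.6, 2.5]    # intercept, share of IDA obs, mean grade
print(probit_marginal_effect(beta, x_bar, 2))
```

Because φ(·) is largest near a fitted index of zero, the same coefficient implies a bigger marginal effect when predicted probabilities are near one-half than near the extremes, which is why the evaluation point matters.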
Table 4  Network review: Disagreement (probit regressions)

Equation                      4.1          4.2          4.3          4.4
Sample                        All          All          All          IDA
Proposed rating (1-6 scale)   0.021        0.060***                  0.021
                              (1.58)       (2.90)                    (1.58)
Grade (coverage of criteria)  -0.022***    -0.031***    -0.032**     -0.009**
                              (-3.30)      (-2.62)      (-2.10)      (-2.43)
Word count (log)              0.021        0.057***     0.071***     0.006
                              (1.61)       (2.82)       (2.83)       (0.90)
IDA                           0.005        -0.017       -0.008
                              (0.22)       (-0.55)      (-0.24)
Benchmark country             0.026        -0.016       -0.016       -0.002
                              (1.63)       (-0.93)      (-0.65)      (-0.60)
Increase proposed             0.454***                               0.287***
                              (8.26)                                 (6.53)
Decrease proposed             0.035        -0.003       -0.023       0.057
                              (0.96)       (-0.10)      (-0.61)      (1.33)
Freedom House press freedoms  0.021        -0.001       0.001        -0.0002
                              (1.58)       (-1.29)      (0.77)       (-1.62)
Log of population             0.006        0.003        0.008        0.007
                              (1.11)       (0.40)       (1.06)       (4.28)
Log of per capita GNI         -0.007       -0.039**     -0.024       0.003
                              (-0.57)      (-2.25)      (-1.58)      (0.74)
CMU in country                -0.010       0.020        0.017        -0.007**
                              (-0.87)      (0.89)       (0.66)       (-2.01)
Sub-Saharan Africa            -0.023       -0.066***    -0.079***    -0.008
                              (-1.38)      (-2.83)      (-3.05)      (-1.00)
East Asia & Pacific           -0.014       -0.041**     -0.059***    -0.008**
                              (-1.29)      (-2.15)      (-2.64)      (-2.20)
Latin America & Caribbean     -0.007       -0.020       -0.036*      -0.001
                              (-0.61)      (-1.05)      (-1.63)      (-0.08)
Middle East & North Africa    -0.015*      -0.032**     -0.043*      -0.002
                              (-1.90)      (-2.23)      (-1.92)      (-0.30)
South Asia                    -0.012       -0.036**     -0.050**     -0.005
                              (-0.79)      (-2.21)      (-2.29)      (-1.32)
Question 14                   -0.022**     -0.039**     -0.012       -0.004
                              (-2.06)      (-2.07)      (-0.47)      (-0.62)
Question 15                   -0.023***    -0.039**     -0.040*      -0.008
                              (-2.93)      (-2.46)      (-1.71)      (-1.45)
Question 16                   0.008        0.043**      0.051**      0.019**
                              (0.84)       (2.14)       (2.16)       (2.00)
No. of obs., countries        544, 136     544, 136     544, 136     316, 79
Pseudo R2                     .52          .17          .17          .57

Dependent variable is “Disagreement,” coded 1 if either the PREM or CFP network reviews recommend a different rating from the one proposed by the region, and 0 otherwise. Omitted region is East Europe and Central Asia. Omitted question is 12. Coefficients represent marginal effects evaluated at mean of other independent variables.
T-statistics, reported in parentheses below point estimates, are based on standard errors adjusted for non-independence of errors within regional clusters of observations, with *** p<0.01, ** p<0.05, * p<0.1.

Table 5  Network review: Ratings Adjustment (probit regressions)

Equation                      5.1                5.2
Sample                        All                All
Disagreement
Proposed rating (1-6 scale)                      0.021** (2.23)
Grade (coverage of criteria)                     -0.017*** (-3.16)
Word count (log)                                 0.005 (0.58)
IDA                                              0.002 (0.10)
Benchmark country                                0.033* (1.71)
Increase proposed                                0.408*** (8.13)
Decrease proposed                                0.040 (1.23)
Freedom House press freedoms                     -0.1001 (-0.60)
Log of population                                0.004 (0.96)
Log of per capita GNI                            -0.001 (-0.09)
CMU in country                                   -0.006 (-0.62)
Sub-Saharan Africa            -0.050** (-2.00)   -0.010 (-0.72)
East Asia & Pacific           -0.070*** (-2.75)  -0.011 (-0.83)
Latin America & Caribbean     -0.046* (-1.87)    -0.009 (-0.79)
Middle East & North Africa    -0.042* (-1.58)    -0.010 (-1.06)
South Asia                    -0.036 (-1.23)     -0.001 (-0.04)
Question 14                   -0.003 (-0.11)     -0.019*** (-2.58)
Question 15                   -0.012 (-0.47)     -0.013* (-1.76)
Question 16                   0.054** (2.04)     0.001 (0.07)
No. of obs., countries        544, 136           544, 136
Pseudo R2                     .08                .46

Dependent variable is “Rating Adjustment,” coded 1 if the final rating differs from the one originally proposed by the region. Omitted region is East Europe and Central Asia. Omitted question is 12. Coefficients represent marginal effects evaluated at mean of other independent variables. T-statistics, reported in parentheses below point estimates, are based on standard errors adjusted for non-independence of errors within regional clusters of observations, with *** p<0.01, ** p<0.05, * p<0.1.
Table 6: Testing for Regional or IDA Bias

Equation              6.1        6.2        6.3        6.4        6.5        6.6        6.7        6.8
Rating                proposed   final      proposed   final      proposed   final      proposed   final
Question              12         12         14         14         15         15         16         16
IDA                   -0.089     -0.135     -0.016     -0.010     -0.036     -0.037     0.017      0.021
                      (-0.79)    (-1.23)    (-0.12)    (-0.08)    (-0.23)    (-0.24)    (0.15)     (0.20)
ECA                   0.275***   0.276***   0.231*     0.201*     0.363***   0.355***   0.192**    0.182**
                      (3.00)     (3.20)     (1.93)     (1.86)     (3.15)     (3.20)     (2.01)     (2.04)
Log per capita GNI    0.012      -0.007     0.156**    0.161**    -0.004     -0.010     0.084      0.082
                      (0.19)     (-0.11)    (2.32)     (2.41)     (-0.05)    (-0.13)    (1.50)     (1.57)
Index of guidepost    0.896***   0.907***
indicators (Q12)      (11.10)    (11.62)
DB number of tax                            -0.004**   -0.005***
payments                                    (-2.05)    (-2.42)
DB tax rate                                 -0.003**   -0.002*
                                            (-2.20)    (-2.15)
EIU red tape/quality                                              0.415***   0.428***
of bureaucracy                                                    (7.79)     (8.34)
Index of guidepost                                                                      0.975***   1.002***
indicators (Q16)                                                                        (11.32)    (12.81)
Constant              3.159      3.321      2.692      2.664      3.118      3.161      2.555      2.567
                      (5.54)     (5.81)     (4.53)     (4.55)     (4.74)     (4.87)     (5.19)     (5.65)
No. of observations   132        132        130        130        115        115        132        132
R2                    .69        .71        .23        .25        .47        .48        .69        .74

Dependent variable is CPIA rating on indicated question. T-statistics, reported in parentheses below point estimates, are based on robust standard errors, with *** p<0.01, ** p<0.05, * p<0.1.
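The magnitude claim made in the text for the Doing Business indicators in equation 6.3 (a two standard deviation increase in both moves the rating by roughly one-half point) can be checked directly against the printed point estimates:

```python
# Back-of-the-envelope check using the equation 6.3 point estimates and the
# two-standard-deviation increments quoted in the text (80 percentage points
# of firm profits for the tax rate, 40 for the number of payments).
coef_tax_rate = -0.003   # equation 6.3, DB tax rate
coef_payments = -0.004   # equation 6.3, DB number of tax payments

two_sd_tax_rate = 80.0   # percentage points of firm profits
two_sd_payments = 40.0   # number of tax payments

effect = coef_tax_rate * two_sd_tax_rate + coef_payments * two_sd_payments
print(round(effect, 2))  # prints -0.4: roughly one-half point on the 1-6 scale
```

The two pieces contribute -0.24 and -0.16 respectively, so even very large shifts in both indicators move the predicted question 14 rating by well under one full point.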
Table 7: Other regions relative to ECA (final ratings)

Equation              7.2        7.4        7.6        7.8
Question              12         14         15         16
IDA                   -0.149     -0.031     -0.091     0.013
                      (-1.31)    (-0.23)    (-0.58)    (0.13)
AFR                   -0.344**   -0.169     -0.387***  -0.193
                      (-2.48)    (-0.89)    (-2.83)    (-1.56)
EAP                   -0.438***  -0.328**   -0.205     -0.233*
                      (-3.26)    (-1.98)    (-1.37)    (-1.68)
LCR                   -0.211**   -0.120     -0.342**   -0.228**
                      (-2.40)    (-0.96)    (-2.55)    (-2.12)
MNA                   -0.217     -0.430***  -0.545**   -0.061
                      (-1.63)    (-2.85)    (-2.31)    (-0.44)
SAR                   -0.191     -0.272     -0.192     0.010
                      (-1.13)    (-1.42)    (-1.27)    (0.06)
Log per capita GNI    -0.043     0.147      -0.026     0.084
                      (-0.51)    (1.55)     (-0.30)    (1.35)
Index of guidepost    0.943***
indicators (Q12)      (11.54)
DB number of tax                 -0.006***
payments                         (-2.70)
DB tax rate                      -0.003**
                                 (-2.23)
EIU red tape/quality                        0.418***
of bureaucracy                              (7.55)
Index of guidepost                                     1.023***
indicators (Q16)                                       (12.89)
Constant              3.915      3.032      3.665      2.747
                      (5.13)     (3.47)     (4.73)     (4.87)
No. of observations   132        130        115        132
R2                    .73        .28        .50        .74

Dependent variable is CPIA rating on indicated question. T-statistics, reported in parentheses below point estimates, are based on robust standard errors, with *** p<0.01, ** p<0.05, * p<0.1.
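Table 7 re-estimates the even-numbered Table 6 models with ECA as the omitted category, so each regional coefficient is a pairwise comparison with ECA. The mechanics of re-basing can be seen in a toy example: in a dummies-only regression, the coefficient on a region equals its mean rating minus the base region's mean (all numbers below are invented):

```python
import numpy as np

# Toy illustration of why switching the omitted category turns regional
# dummies into pairwise comparisons: with no other regressors, the
# coefficient on region r equals mean(r) - mean(omitted base region).
ratings = {"ECA": [4.0, 4.5, 4.1], "LCR": [3.6, 3.9], "SAR": [3.8, 4.0, 3.7]}

def dummy_coefs(data, base):
    """OLS coefficients with `base` as the omitted region."""
    y, rows = [], []
    regions = [r for r in data if r != base]
    for region, vals in data.items():
        for v in vals:
            y.append(v)
            rows.append([1.0] + [1.0 if region == r else 0.0 for r in regions])
    beta = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)[0]
    return dict(zip(["const"] + regions, beta))

coefs = dummy_coefs(ratings, base="ECA")
# coefs["LCR"] equals mean(LCR) - mean(ECA) = 3.75 - 4.2 = -0.45,
# and coefs["const"] equals mean(ECA) = 4.2.
print({k: round(v, 2) for k, v in coefs.items()})
```

With controls added, as in Tables 6 and 7, the coefficients are conditional rather than raw mean differences, but the interpretation relative to the omitted base category is the same.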
[Figure 1: Word count (in thousands) of AFR write-ups, 2001-2012. The annual total rises every year, from 61 in 2001 to 797 in 2012 (61, 94, 167, 277, 311, 380, 429, 499, 560, 615, 703, 797).]

[Figure 2: Mean, maximum and minimum word count (in hundreds) for Question 12, by coverage grade (A-D).]

[Figure 3: Mean, maximum and minimum word count (in hundreds) for Question 14, by coverage grade (A-D).]

[Figure 4: Mean, maximum and minimum word count (in hundreds) for Question 15, by coverage grade (A-D).]

[Figure 5: Mean, maximum and minimum word count (in hundreds) for Question 16, by coverage grade (A-C).]

[Figure 6: Percentage of write-ups graded “C” or “D”, by region; ordered from lowest to highest: EAP, AFR, MNA, ECA, SAR, LCR.]

[Figure 7: Network reviewer requests for additional information (%), by region; ordered from lowest to highest: EAP, SAR, MNA, ECA, LCR, AFR.]

[Figure 8: Network reviewer disagreements with proposed ratings (%), by region; ordered from lowest to highest: EAP, MNA, LCR, SAR, AFR, ECA.]

Appendix  CPIA Cluster D Short Definitions

12.
Property Rights and Rule-based Governance

This criterion assesses the extent to which economic activity is facilitated by an effective legal system and rule-based governance structure in which property and contract rights are reliably respected and enforced. Each of three dimensions should be rated separately: (a) legal framework for secure property and contract rights, including predictability and impartiality of laws and regulations; (b) quality of the legal and judicial system, as measured by independence, accessibility, legitimacy, efficiency, transparency, and integrity of the courts and other relevant dispute resolution mechanisms; and (c) crime and violence as an impediment to economic activity and citizen security.

14. Efficiency of Revenue Mobilization

This criterion assesses the overall pattern of revenue mobilization: not only the tax structure as it exists on paper, but revenue from all sources as it is actually collected. Separate sub-ratings should be provided for: (a) tax policy; and (b) tax administration. For the overall rating, these two dimensions should receive equal weighting.

15. Quality of Public Administration

This criterion covers the core administration, defined as the civilian central government (and sub-national governments, to the extent that their size or policy responsibilities are significant), excluding health and education personnel and police. The criterion assesses the functioning of the core administration in three areas: (a) managing its own operations; (b) ensuring quality in policy implementation and regulatory management; and (c) coordinating the larger public sector Human Resources Management regime outside the core administration (de-concentrated and arms-length bodies and subsidiary governments).

16.
Transparency, Accountability, and Corruption in the Public Sector This criterion assesses the extent to which the executive, legislators, and other high-level officials can be held accountable for their use of funds, administrative decisions, and results obtained. Accountability is generally enhanced by transparency in decision-making, access to relevant and timely information, public and media scrutiny, and by institutional checks (e.g., inspector general, ombudsman, or independent audit) on the authority of the chief executive. The criterion covers four dimensions: (a) the accountability of the executive and other top officials to effective oversight institutions; (b) access of civil society to timely and reliable information on public affairs and public policies, including fiscal information (on public expenditures, revenues, and large contract awards); (c) state capture by narrow vested interests; and (d) integrity in the management of public resources, including aid and natural resource revenues. 42 Descriptions for Rating of “3� Question 12 a. The law protects property rights in theory, but in fact registries and other institutions required to make this protection effective function poorly, making the protection of private property uncertain. Contract enforcement through formal mechanisms is costly and unreliable. Laws and regulations are not changed arbitrarily, but may not be publicly available. b. Judges and prosecutors are sometimes subject to political interference, and laws are sometimes selectively applied (e.g., against the political opposition). Merit plays some role in judicial appointments. Legal claims against government officials or other elites are commonly prosecuted, but rulings against them are not always enforced. Courts are costly and time-consuming to use, even for small claims. Delays are common. Bribes are known to occur occasionally in the system. Judicial decisions are sometimes publicly available. c. 
The state is somewhat effective in limiting violence and crime against citizens and their property. The state actively attempts to combat organized crime, which accounts for a relatively small share of economic activity. A majority of victims report crimes to the police, and citizens generally do not view the police as a source of crime and violence. Question 14 a. Taxes on trade are a major source of revenue; turnover and other distortionary taxes and levies remain. Consumption-based taxes (e.g., a VAT) are planned, or in limited use. Import tariffs are moderate, but there are too many rates. The income tax base is narrow, and the rate structure is only partly rationalized. Exemptions are moderate. b. Tax administration is weak, but tax laws are not inordinately complex, and discriminatory enforcement is the exception rather than the rule. Information systems are functioning (e.g., unique taxpayer identification numbers are used). Corruption exists, but there are efforts to improve integrity as well as capacity. Collection and compliance costs are nevertheless somewhat excessive, and collection rates are relatively low. Question 15 a. The core administration demonstrates modest internal management capacity: major personnel actions, such as recruitment and selection, promotions, and dismissals sometimes reflect merit and performance; terms of employment, and pay are barely sufficiently attractive to ensure that the public administration can compete reasonably effectively for any scarce skill sets it requires; the public sector pay regime is sometimes unable to motivate effort within the public service. b. 
The core administration demonstrates modest capacity to ensure quality in policy and regulatory management: Cabinet decisions, presidential or ministerial policy announcements are occasionally dropped or otherwise not implemented; the institutional responsibilities for data collection, analysis and reporting in the sectors are occasionally weak or unclear; and the bodies with responsibility for sector regulation (infrastructure, transport, etc.) are occasionally not regarded as independent in practice and few have adequate regulatory quality management arrangements in place. c. The core administration demonstrates modest capacity to coordinate the broader public sector HRM regime: (i) merit is the predominant factor in obtaining appointments or promotion in many entities; and (ii) the aggregate public sector wage bill is at some risk of unsustainability. 43 Question 16 a. Checks and balances on executive power are somewhat effective. External accountability mechanisms may exist, but they have inadequate resources or authority. Regulation of political financing is poorly enforced, usually to the benefit of incumbents. Anticorruption efforts tend to focus on the political opposition. Citizens are sometimes able to bring claims against the state, and legitimate claims are sometimes successful. b. Decision-making is generally not transparent, and public dissemination of information on government policies and outcomes is a low priority. Some key budget documents are not publicly available. Official restrictions on the media, as well as violence against and harassment of journalists, limit the media’s potential for information-gathering and scrutiny. c. Boundaries between the public and private sectors are moderately well defined, but violations are frequent and often not investigated or sanctioned. Elected and other high-level public officials often have private interests that conflict with their professional duties. 
Conflict of interest and asset disclosure rules do not apply to high-level officials or are enforced only selectively. d. Public funds are sometimes diverted to unintended uses by high-level officials, but the prospect of sanctions has some deterrent effect. Bribery and collusion between bidders are common in public contracting, and value for money is often a minor consideration in contract awards. 44