PEFA, Public Financial Management, and Good Governance

The research is based on the PEFA framework and methodology for assessing public financial management performance and the data set that is generated from the PEFA assessments. The research quantified PEFA scores and aggregated them into overall scores, which required developing assumptions on weighting scores, measures, and assessments. The research acknowledges methodological limitations of using the PEFA data set, including the assumptions. In general, the research follows the approach taken by previous researchers who have used PEFA data for quantitative analysis, but this does not eliminate the challenges that persist in transforming grades to numerical values and aggregating them. The time inconsistency issues and the limited number of observations also influenced the regression analysis using the PEFA data set. The team acknowledges that the PEFA data set was not designed for statistical analysis and that using it in quantitative regressions presents a series of econometric issues that cannot be fully resolved in this book, or in other papers that apply a similar approach. The research report builds on general recognition that PFM is important for development and recognizes that there is limited evidence on the nontechnical determinants of PFM performance, as well as on the outcomes of a good PFM system. The report therefore aims to bridge some of this gap between theory and practice using data on PFM performance from PEFA assessments. The report undertakes a closer examination of the key debates on what constitutes a good PFM system by providing an overview of the PEFA framework and the data set that is generated through PEFA assessments, including its strengths and weaknesses. This was done to enable the research team to undertake quantitative analysis of the relationship between PFM performance and other governance indicators and outcomes.

Calculating an overall Public Expenditure and Financial Accountability (PEFA) score
4.1 Average quality of the public financial management (PFM) system in fragile and nonfragile countries (Fragile 1)
4.2 Average quality of the public financial management (PFM) system in fragile and nonfragile countries (Fragile 2)
5.1 Distribution and correlation of the Public Expenditure and Financial Accountability (PEFA) and control of corruption (WGI_COC) scores
5.2 Distribution of scores for subindexes of transparency of budget execution reporting
5.3 Correlations between subindexes and control of corruption (WGI_COC)
6.1 Mean tax-to-GDP ratio, by dimension score for Public Expenditure and Financial Accountability (PEFA) indicators PI-13, PI-14, and PI-15
6.2 Frequency distribution (number), by dimension score for Public Expenditure and Financial Accountability (PEFA) indicators PI-13, PI-14, and PI-15
6.3 Distribution of scores, by dimension and income group for Public Expenditure and Financial Accountability (PEFA) indicators PI-13, PI-14, and PI-15
6.4 Changes in PI-14ii and PI-14i scores between assessments

Tables
2.1 Number of pillars, indicators, and dimensions of the Public Expenditure and Financial Accountability (PEFA) framework
2.2 How to score an A on the three dimensions under PI-11: orderliness and participation in the annual budget process
2.3 Other diagnostic tools
2.4 Numerical conversion of Public Expenditure and Financial Accountability (PEFA) scores
2.5 Summary statistics for different methodologies for calculating an overall score
2.6 Correlations between different methodologies for calculating an overall score
3.1 Spearman rank coefficients for nonbinary macropolitical variables
3.2 Cross-sectional analysis for presidential regimes vs. nonpresidential regimes and other country characteristics
3.3 Alternative definition of democratic presidential regimes
3.4 Cross-sectional analysis for majoritarian vs. nonmajoritarian systems and other country characteristics
3.5 Cross-sectional analysis for partisan fragmentation and other country characteristics
3.6 Cross-sectional analysis using alternative measure of divided government
3.7 Cross-sectional analysis for programmatic party systems using other country characteristics
3.8 First-differences analysis with absolute change in programmatic party measure
3A.1 Summary statistics
3A.2 Cross-sectional analysis for presidential regimes vs. nonpresidential regimes controlling for democracy level and other country characteristics
3A.3 Cross-country regression using Country Policy and Institutional Arrangements indicator 13 (CPIA-13)
3A.4 First-differences model using absolute change in Country Policy and Institutional Arrangements indicator 13 (CPIA-13)
4.1 Summary of hypothesized links with specific public financial management (PFM) elements
4.2 Cross-country ordinary least squares using budget credibility as the dependent variable
4.3 Conditional coefficients for overall public financial management (PFM) quality in fragile states
4.4 Conditional coefficients for quality of specific elements of public financial management (PFM) in fragile states with compositional budget credibility as the dependent variable
4.5 Cross-country ordinary least squares using fiscal outcomes as the dependent variable
4.6 Conditional coefficients for public financial management (PFM) quality in fragile and nonfragile states with fiscal outcomes as the dependent variable
4.7 Conditional coefficients for public financial management (PFM) quality in fragile and nonfragile states using sovereign credit rating as the dependent variable
4A.1 Cross-sectional sample of 116 countries by income group using the narrow definition of fragility
4A.2 Cross-sectional sample of 116 countries by income group using the broad definition of fragility
4B.1 Regression results controlling for having an International Monetary Fund program between 2012 and 2015
4B.2 Robustness check using de jure measure of public financial management (PFM) quality as the dependent variable
4B.3 Robustness check using Country Policy and Institutional Assessment indicator 13 (CPIA-13) as the alternative measure of overall public financial management (PFM) quality
4B.4 Robustness check using baseline models restricted to a sample of countries with Public Expenditure and Financial Accountability (PEFA) assessments from 2012 onward
4B.5 Robustness check using sovereign credit rating as the dependent variable
4C.1 Regression results using primary balance (% of GDP) as the dependent variable
4C.2 Regression results using public external debt (% of GDP) as the dependent variable
4C.3 Regression results using a narrow definition of fragility with aggregate budget credibility as the dependent variable
4C.4 Regression results using a broad definition of fragility with aggregate budget credibility as the dependent variable
4C.5 Regression results using a narrow definition of fragility with compositional budget credibility as the dependent variable
4C.6 Regression results using a broad definition of fragility with compositional budget credibility as the dependent variable
5.1 Examples of corruption, by type of government expenditure
5.2 Public Expenditure and Financial Accountability (PEFA) indicators for transparency in budget preparation (TRANS1)
5.3 Public Expenditure and Financial Accountability (PEFA) indicators for transparency in budget executing reporting (TRANS2)
5.4 Public Expenditure and Financial Accountability (PEFA) indicators for transparency in audit (TRANS3)
5.5 Public Expenditure and Financial Accountability (PEFA) indicators for budget execution controls (CONTROLS)
5.6 Spearman correlation coefficients for public financial management (PFM) subindexes
5.7 Control variables
5.8 Weighted least squares estimates for Public Expenditure and Financial Accountability (PEFA) indicators and control of corruption
Tax administration assessment indicators and dimensions in the 2011 Public Expenditure and Financial Accountability (PEFA) framework
6.4 Spearman correlation coefficients for tax-to-GDP ratio and dimensions under Public Expenditure and Financial Accountability (PEFA) indicators PI-13, PI-14, and PI-15
6.5 Cross-sectional sample
6.6 Control variables
6.7 Sample size, by small island developing states (SIDS) status and region
6.8 Unbalanced panel data sample, 2005-15
6.9 Ordinary least squares (OLS) estimates for the relationship between performance indicators and the tax-to-GDP ratio
6.10 Panel estimates for the relationship between performance indicators and the tax-to-GDP ratio controlling for country-specific factors
6A.1 Cross-sectional sample of 112 countries by income group at time of most recent assessment
6A.2 Panel sample of 61 countries by income group at time of most recent assessment
6B.1 Ordinary least squares (OLS) estimates for the relationship between performance indicators and the tax-to-GDP ratio using dummy variables for PI-14(ii) (1 of 2)
6B.2 Ordinary least squares (OLS) estimates for the relationship between performance indicators and the tax-to-GDP ratio using dummy variables for PI-14(ii) (2 of 2)
6B.3 Ordinary least squares (OLS) estimates for the relationship between performance indicators and the tax-to-GDP ratio for reduced sample sizes
6B.4 Ordinary least squares (OLS) estimates for the relationship between performance indicators and the tax-to-GDP ratio using alternative samples for general government data

Preface
This book examines the interplay between public financial management (PFM) and other key aspects of governance in low- and middle-income countries, using the Public Expenditure and Financial Accountability (PEFA) framework and related data sets to measure the quality of PFM systems. The PEFA framework was developed on the premise that effective PFM institutions and systems play a crucial role in the implementation of national policies for development and poverty reduction. It is part of a broader set of initiatives aimed at strengthening public sector governance frameworks. Governments and development partners have been using PEFA to support analysis of PFM since 2005. They have also used it to provide a baseline for reform initiatives and to inform action plans for improving performance. This book uses the PEFA assessment results to understand the impact of PFM performance on other governance initiatives.
The book is part of a project to improve the evidence base for understanding the impact of PEFA and PFM reforms with respect to political institutions, fragility, anticorruption, and revenue mobilization. The research was undertaken by the Overseas Development Institute (ODI) in close cooperation with the PEFA Secretariat.
The research seeks to strengthen the understanding of the relationship between political institutions (including forms and types of government, electoral systems, and political parties) and the quality of PFM systems. It further explores the credibility of the budget and fiscal outcomes in fragile contexts and compares those to nonfragile contexts to highlight the role that PFM can play in environments with weak institutional capacity. The book also aims to disentangle the relationship between perceptions of corruption and PFM performance. Finally, it looks at the role of revenue administration in domestic resource mobilization and particularly at the credible use of penalties for noncompliance for improving tax performance.
The primary audience includes government officials, staff of bilateral and international organizations, researchers, and members of civil society involved in PFM reforms and other governance initiatives. This book contributes to discussions on the role of PFM in strengthening governance frameworks by offering a cross-country analysis to outline determinants and outcomes associated with better PFM performance. It also provides an overview of key debates on what constitutes a good PFM system, highlights which parts of the PFM system matter more for different governance initiatives, and attempts to quantify the impact of PFM reforms.
• A coordinated program of support from donors and international finance institutions in relation to analytical work, reform financing, and technical support for implementation
• A shared information pool on public financial management (PFM): information on PFM systems and their performance, which is commonly accepted by and shared among the stakeholders at country level, thus avoiding duplicative and inconsistent analytical work
The PEFA program produced the PEFA framework, which assesses the status of a country's PFM. It measures the extent to which PFM systems, processes, and institutions contribute to the achievement of desirable budget outcomes: aggregate fiscal discipline, strategic allocation of resources, and efficient service delivery. For more information about PEFA, visit www.PEFA.org.

Summary
The United Kingdom's Department for International Development (DFID) funded a research project to generate a robust evidence base for understanding the impact of Public Expenditure and Financial Accountability (PEFA) and public financial management (PFM) reforms. The purpose of the research project, based on the PEFA data set, was to understand how PEFA could be used to shape policy development at the interface of PFM and other relevant major policy areas such as anticorruption, revenue mobilization, political economy analysis, and fragile states. Four research papers were produced to outline the relationship between PEFA, PFM, and the four selected policy areas. Additional papers on methodology, outlining the study approach and PEFA data set specifics, were produced to accompany the research project outputs. The Overseas Development Institute (ODI) was contracted to carry out the project work.
The research is based on the PEFA framework and methodology for assessing PFM performance and the data set that is generated from the PEFA assessments. The research quantified PEFA scores and aggregated them into overall scores, which required developing assumptions on weighting scores, measures, and assessments. The research acknowledges methodological limitations of using the PEFA data set, including the assumptions. In general, the research follows the approach taken by previous researchers who have used PEFA data for quantitative analysis, but this does not eliminate the challenges that persist in transforming grades to numerical values and aggregating them. The time inconsistency issues and the limited number of observations also influenced the regression analysis using the PEFA data set. The team acknowledges that the PEFA data set was not designed for statistical analysis and that using it in quantitative regressions presents a series of econometric issues that cannot be fully resolved in this book, or in other papers that apply a similar approach.
The research report builds on general recognition that PFM is important for development and recognizes that there is limited evidence on the nontechnical determinants of PFM performance, as well as on the outcomes of a good PFM system. The report therefore aims to bridge some of this gap between theory and practice using data on PFM performance from PEFA assessments. The report undertakes a closer examination of the key debates on what constitutes a good PFM system by providing an overview of the PEFA framework and the data set that is generated through PEFA assessments, including its strengths and weaknesses. This was done to enable the research team to undertake quantitative analysis of the relationship between PFM performance and other governance indicators and outcomes.
The report looks at the question of what shapes the PFM system in low- and middle-income countries by examining the relationship between political institutions and the quality of the PFM system. The report builds on the existing theoretical and empirical literature by refining and nuancing previous hypotheses on the relationship, retesting hypotheses using a larger sample, and testing new hypotheses. The report finds little evidence that these relationships hold in low- and middle-income country contexts and notes several relationships that are in fact counterintuitive. Although the report finds some evidence that having multiple political parties controlling the legislature is associated with better PFM performance, more generally, the report findings point to the need for further refinement and testing of the theories on the relationship between political institutions and PFM in low- and middle-income countries.
The report deals with the question of the outcomes of PFM systems, distinguishing between fragile and nonfragile states. Specifically, it explores whether the credibility of the budget and fiscal outcomes improve with better PFM performance using various definitions of fragility. The report findings are mixed. The report finds that better PFM performance is associated with more reliable budgets in terms of expenditure composition in fragile states, but not with aggregate budget credibility. Moreover, in contrast to existing studies, it finds no evidence that PFM quality matters for deficit and debt ratios, irrespective of whether a country is fragile or not. The research study also concluded that future research would benefit significantly from case studies of governments that have systematically met fiscal targets over a defined period of time.
The report also explores the relationship between corruption and PFM performance. The analysis is limited by the constraint that there is no cross-country measure of actual corruption. The report therefore relies on corruption perception indexes as a proxy, with the potential measurement error that comes with such an instrument. Nevertheless, the report finds strong evidence of a relationship between better PFM performance and better perceptions of corruption. It also finds that PFM reforms associated with better controls have a stronger relationship with better perceptions of corruption than PFM reforms associated with more transparency. However, it finds the magnitude of the relationships underwhelming when compared with the magnitude of the relationship between economic growth and perceptions of corruption. This is in line with the findings of other studies. The report findings suggest that PFM reform may be part of an effective anticorruption campaign, or that contexts where perceptions of corruption are improving are more amenable to PFM reform. However, there remains much scope for further research in this area that more tightly links individual PFM measures to more relevant measures of corruption.
The last chapter of the report looks at the relationship between PEFA indicators for revenue administration and domestic resource mobilization. It focuses specifically on the credible use of penalties for noncompliance as a proxy for the type of political commitment that is necessary for improving tax performance. The analysis shows that countries that credibly enforce penalties for noncompliance collect more taxes on average. Because of the potential for measurement error, further in-country research on the dynamics of penalties for noncompliance is warranted.

PFM as a means to achieving other desirable outputs and outcomes
To turn these laudable goals into a reality, there has been increasing recognition of the "instrumental" role PFM plays in delivering services on which human and economic development rely. For example, "Better payments systems and better cash management make it more likely that payments can be made on time, including for wages, transfers, operations and management, and investments" (World Bank 2012, 51). This link between inputs and service delivery outputs and outcomes led to the use of public expenditure tracking surveys (PETSs) that trace the actual flow of public funds in a program or a sector and establish the extent to which public funds and other resources reach service providers. Although different public services will require a different mix of these inputs, regular payment of staff salaries is likely to be critical to the delivery of all public services (Welham, Krause, and Hedger 2013; Welham et al. 2017).

PFM and development
However, both during the MDG era and now in the SDG era, donors seeking to promote state-led development through country PFM systems face a dilemma: many of the countries that they are seeking to support have extremely weak PFM systems. Indeed, early PETSs revealed large amounts of leakage in the flow of funds. This leakage exposes donor support to fiduciary risk or the more general risk that their support will have little impact. It has also led to an increase in technical support to improve PFM systems through reforms.
Conditionality has aimed to strengthen the PFM system in aid-recipient countries to help to ensure that aid is used effectively for the purposes intended (DFID 2009). This conditionality was particularly important with the shift toward budget support as an aid modality during the MDG era, with funds channeled directly to a recipient government's treasury account and thereafter executed using the country's own allocation, procurement, and accounting systems. Similarly, debt relief programs launched in the late 1990s and 2000s have been used as leverage to move the indebted country into a new mode of operations to ensure that resources freed up through debt relief are used to reduce poverty or increase growth. To meet aid conditionalities, countries have had to develop action plans to strengthen systems for public expenditure management.

The emergence of diagnostic tools for assessing PFM systems
However, during the MDG era, each donor was initially using its own diagnostic tool to assess whether it should provide budget support through country systems, creating a massive compliance burden for recipient countries. The Paris Declaration on Aid Effectiveness (2005) committed donors to implement harmonized diagnostic reviews and performance assessment frameworks in PFM.
The PEFA framework emerged as the instrument to harmonize these various diagnostic tools and, as a result, has become the most widely used assessment of PFM performance in low- and middle-income countries.
The PEFA framework was introduced with three goals in mind: (a) to strengthen the ability of governments to assess systems of public expenditure, procurement, and fiduciary management and contribute to a government-led reform agenda; (b) to support the development and monitoring of reform and capacity development programs and facilitate a coordinated program of support; and (c) to contribute to the pool of information on PFM. Since its launch in 2005, nearly 600 formal assessments (national and subnational) in 150 countries and territories have been undertaken and verified by the PEFA Secretariat. Today, most development partners use the PEFA framework as the basis for their diagnostics of PFM systems and assessment of associated fiduciary risks, especially to determine when to use country systems for individual operations. It has become the go-to measure of PFM.
Because of the international recognition of PEFA, there has also been a proliferation of other institutional diagnostics that largely replicate the approach and methodology of the PEFA framework. Most of these diagnostics focus on specific elements of the PFM system. Examples include the World Bank's Debt Management Performance Assessment (DeMPA) as well as the International Monetary Fund's Tax Administration Diagnostic Assessment Tool (TADAT) and Public Investment Management Assessment (PIMA).

Donor spending for strengthening PFM systems
Donors provide considerable financial support to PFM. Data from the OECD's Development Assistance Committee database show a dramatic increase in disbursed funds for activities related to public sector financial management, which trebled from US$406 million in 2002 to US$1.3 billion in 2016 after peaking at roughly US$1.8 billion in 2011 (figure 1.2). This surge in financing has naturally led to questions about whether this spending is achieving the desired results.
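As a quick arithmetic check on the figures quoted above, the short Python sketch below computes the growth multiple between 2002 and 2016 and the decline from the 2011 peak. It uses only the three rounded values stated in the text (the 2011 figure is described as "roughly" US$1.8 billion), so the results are approximate; the variable names are illustrative.

```python
# Disbursements for public sector financial management, in US$ millions,
# as quoted in the text (rounded; the 2011 peak is approximate).
disbursed_usd_millions = {2002: 406, 2011: 1800, 2016: 1300}

# Growth multiple from 2002 to 2016 (the text describes this as "trebled").
growth_multiple = disbursed_usd_millions[2016] / disbursed_usd_millions[2002]

# Share by which 2016 disbursements fell short of the 2011 peak.
decline_from_peak = 1 - disbursed_usd_millions[2016] / disbursed_usd_millions[2011]

print(f"2016 vs. 2002: {growth_multiple:.1f}x")
print(f"Decline from 2011 peak: {decline_from_peak:.0%}")
```

On these rounded inputs, 2016 disbursements are a little over three times the 2002 level and a bit more than a quarter below the 2011 peak.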

RESEARCH CONTRIBUTION TO DISCUSSIONS ON PFM PERFORMANCE AND ISSUES IN PFM REFORM
While there is general recognition that PFM is important for development, there is limited empirical evidence on what determines "better" PFM performance and the outcomes associated with a "good" PFM system. This report seeks to bridge some of this gap between theory and practice using data on PFM performance from PEFA assessments.
In the next chapter, we undertake a closer examination of the key debates on what constitutes a good PFM system by providing an overview of the PEFA framework and the data set that is generated through PEFA assessments. This overview includes an analysis of the pros and cons of undertaking quantitative analysis using PEFA and similar governance indicators. Our aim is to address specific criticisms of the PEFA framework and similar diagnostic tools and to provide a guide to interpreting the analysis in the remaining chapters, including understanding its inherent strengths and weaknesses.
Chapters 3 to 6 examine the relationship between PFM performance and other indicators of governance. Across all four chapters, we try to tease out which parts of the PFM system matter more for different questions and attempt to quantify the impact of PFM reforms where relevant, albeit with important caveats.
In chapter 3 we investigate what shapes PFM systems in developing contexts by examining the relationship between political institutions and the quality of PFM systems. This chapter builds on the existing theoretical and empirical literature by refining and nuancing previous hypotheses on this relationship, retesting hypotheses using a larger sample, and testing new hypotheses. Much of this theoretical and empirical literature is based on observations for higher-income countries. We find little evidence that these relationships hold in low- and middle-income countries and note some counterintuitive relationships. Although we do find some evidence that having multiple political parties controlling the legislature is associated with better PFM performance, more generally, our findings point to the need for further refinement and testing of the theories on the relationship between political institutions and PFM in low- and middle-income countries.
Chapter 4 assesses the outcomes of PFM systems, distinguishing between fragile and nonfragile states. Specifically, we explore whether the credibility of the budget and fiscal outcomes improves with better PFM performance using various definitions of fragility. Our findings are mixed. We find that better PFM performance is associated with more reliable budgets in terms of the composition of expenditures in fragile states, but not with aggregate budget credibility. Moreover, in contrast to existing studies, we find no evidence that PFM quality matters for deficit and debt ratios, irrespective of whether a country is fragile or not.
In chapter 5, we turn our attention to the relationship between corruption and PFM performance. Our analysis is limited by the constraint that there is no cross-country measure of actual corruption. We therefore use corruption perception indexes as a proxy, with the potential measurement error that comes with using such a blunt instrument. Nevertheless, we find strong evidence of a relationship between better PFM performance and better perceptions of corruption. We also find that PFM reforms associated with better controls have a stronger relationship with better perceptions of corruption than PFM reforms associated with more transparency. However, the magnitude of the relationship is underwhelming when compared with the magnitude of the relationship between economic growth and perceptions of corruption. This finding is in line with the findings of other studies. Our findings suggest that PFM reform may be part of an effective anticorruption campaign or that contexts where the perceptions of corruption are improving are more amenable to PFM reform. However, much scope remains for further research in this area to link individual PFM measures more tightly with more relevant measures of corruption.
We follow this advice in chapter 6 by looking at a more tightly defined relationship between domestic resource mobilization and revenue administration. We focus on the impact on tax performance of the credible use of penalties for noncompliance. This tool has become somewhat neglected from a research perspective, as more modern revenue administrations have shifted their focus toward voluntary compliance and taxpayer services. Our analysis shows that countries that credibly enforce penalties for noncompliance collect significantly more taxes on average. Because of the potential for measurement error, further in-country research on the dynamics of penalties for noncompliance is warranted. This would allow for analysis of the individual responses of taxpayers to the use of penalties for noncompliance.
In this chapter we provide an overview of the Public Expenditure and Financial Accountability (PEFA) framework and methodology for assessing public financial management (PFM) performance and the data set that is generated from these PEFA assessments. We present the methodological issues we encounter when using the data and how we deal with these issues in the chapters that follow.
The rest of the chapter proceeds as follows. First, we describe the PEFA framework, how it measures PFM performance, how it compares with other diagnostic tools, and how the framework has changed over time. We also present descriptive statistics on the coverage and performance of our data set and provide a summary of the discussion. Then, we discuss the various approaches to quantifying PEFA scores for quantitative analysis and highlight common issues encountered when using these scores in regression analysis. We conclude by summarizing the key points.

The PEFA framework
The PEFA methodology has changed over time. The first PEFA framework was released in 2005, and updates followed in 2011 and 2016. These frameworks were developed to assess PFM performance at the national level. The framework has been applied at the subnational level as well. This report is based on a data set compiled from assessments using the national-level 2011 PEFA framework, which is the main focus of discussion in this chapter. However, we also discuss the 2005 and 2016 national-level PEFA frameworks given that some of the revisions are relevant to the analysis in subsequent chapters.
As discussed in chapter 1, the PFM system is commonly described in terms of the stages of the annual budget cycle, and a good PFM system is instrumental in supporting the objectives of aggregate fiscal discipline, strategic allocation of resources, and efficient delivery of services. This is the approach taken in the PEFA framework, which organizes key PFM processes into pillars and links process quality to budgetary outcomes. Figure 2.1 illustrates the PFM system as outlined in the 2011 PEFA framework. It includes four pillars corresponding to the phases of the budget cycle (policy-based budgeting; predictability and control in budget execution; accounting, recording, and reporting; and external scrutiny and audit) and one cross-cutting pillar on comprehensiveness and transparency (see box 2.1 for further discussion). In addition to well-aligned budget support from donors, improvements in these five core dimensions are expected to deliver budget credibility in the form of aggregate fiscal discipline, allocative efficiency, and operational efficiency (PEFA Secretariat 2011). The features of the budget cycle vary from country to country, but the outline is similar to what is found in most countries and what others have proposed.

Measuring performance
Under each pillar of the 2011 PEFA framework are indicators of PFM performance (table 2.1). There are 28 performance indicators in total, denoted as PI-1 to PI-28, as well as three donor performance indicators, denoted as D-1 to D-3. Predictability and control in budget execution make up the largest pillar, with nine indicators (three of these indicators are related to tax administration and are the focus of chapter 6). Policy-based budgeting is the smallest pillar, with just two indicators. Under each PI are 1-4 dimensions that are assessed to determine the PI score. Each dimension measures performance against a four-point ordinal scale from D to A that captures levels of compliance with good practices in PFM. There are 76 dimensions within the 2011 framework, of which 5 are related to donor practices.

Table 2.1 Number of pillars, indicators, and dimensions of the Public Expenditure and Financial Accountability (PEFA) framework
Pillar | Indicators | Dimensions
Policy-based budgeting | 2 | 7
Predictability and control in budget execution | 9 | 29
Accounting, recording, and reporting | 4 | 9
External scrutiny and audit | 3 | 10
Comprehensiveness and transparency | 6 | 10
Donor practices | 3 | 5

Table 2.2 How to score an A on the three dimensions under PI-11 (orderliness and participation in the annual budget process)

PI-11(i): Existence of and adherence to a fixed budget calendar
A: A clear annual budget calendar exists, is generally adhered to, and allows ministries, departments, and agencies (MDAs) enough time (at least six weeks from receipt of the budget circular) to complete their detailed estimates meaningfully and on time.
B: A clear annual budget calendar exists, but some delays are often experienced in its implementation. The calendar allows MDAs reasonable time (at least four weeks from receipt of the budget circular) so that most of them are able to complete their detailed estimates meaningfully and on time.
C: An annual budget calendar exists but is rudimentary, and substantial delays may often be experienced in its implementation. It allows MDAs so little time to complete detailed estimates that many fail to complete them in a timely manner.
D: A budget calendar is not prepared, OR it is generally not adhered to, OR the time allowed for MDAs' budget preparation is clearly insufficient to make meaningful submissions.

PI-11(ii): Guidance on the preparation of budget submissions
A: A comprehensive and clear budget circular is issued to MDAs, which reflects ceilings approved by the cabinet (or equivalent) prior to the circular's distribution to MDAs.
B: A comprehensive and clear budget circular is issued to MDAs, which reflects ceilings approved by the cabinet (or equivalent). This approval takes place after the circular is distributed to MDAs, but before MDAs have completed their submission.
C: A budget circular is issued to MDAs, including ceilings for individual administrative units or functional areas. The budget estimates are reviewed and approved by the cabinet only after they have been completed in all details by MDAs, thus seriously constraining the cabinet's ability to make adjustments.
D: A budget circular is not issued to MDAs, OR the quality of the circular is very poor, OR the cabinet is involved in approving the allocations only immediately before the submission of detailed estimates to the legislature, thus providing no opportunities for adjustment.

PI-11(iii): Timely budget approval by the legislature
A: The legislature has, during the last three years, approved the budget before the start of the fiscal year.
B: The legislature approves the budget before the start of the fiscal year, but a delay of up to two months has happened in one of the last three years.
C: The legislature has, in two of the last three years, approved the budget within two months of the start of the fiscal year.
D: The budget has been approved with more than two months' delay in two of the last three years.

Source: PEFA Secretariat 2011.
Note: The M2 method is based on an approximate average of the scores for the individual dimensions of the Performance Indicator (PI); it is also referred to as the "averaging method." MDA = ministries, departments, and agencies.
For example, under the policy-based budgeting pillar, PI-11 measures orderliness and participation in the annual budget process. Table 2.2 shows the minimum required for a country to score an A on each of the three dimensions under PI-11. In addition, the PEFA Secretariat regularly provides training for assessors carrying out assessments, and the PEFA Fieldguide provides further guidance for assessors on the evidence that is required to assign a dimension score (see PEFA Secretariat 2012a). Nevertheless, the frequently asked questions that form part of the Fieldguide highlight the fact that at times assessors may find it difficult to apply the performance measurement framework easily and consistently. Moreover, because of the breadth of a PEFA assessment, performance measurement is generally carried out by a team of assessors, and some countries have established their own PEFA Secretariat and carry out self-assessments. These difficulties have raised concerns about quality control both within and across assessments. They are discussed in this chapter in the context of recent changes in the PEFA framework and in the context of measurement error.
To arrive at the PI scores, the assessor must combine the dimension scores using one of two methods referred to as method 1 (M1) and method 2 (M2). The scoring method is clearly prescribed for each of the indicators. Regardless of the method used, the first step in assigning a score to a PI is to score each of its dimensions separately based on the D through A ranking. For multidimensional indicators, where poor performance on one dimension of the indicator is likely to undermine the impact of good performance on other dimension(s) of the same indicator, assessors must apply the M1 method. Under this method, the indicator is assigned the score of the lowest dimension, but a "+" is added if one of the other dimension scores is higher. If a three-dimensional indicator scores two Ds and one C, then the indicator is assigned a D+ score. Because the score is determined primarily by the lowest score, the M1 method is also referred to as the "weakest link" method.
The M2 method is applied for some multidimensional indicators where a low score on one dimension of the indicator does not necessarily undermine the impact of higher scores on other dimensions of the same indicator. Because it applies equal weighting to each of the dimension scores within the PI, the M2 method is also referred to as the "averaging method." The PEFA framework provides conversion tables for two-, three-, and four-dimensional indicators. For our PI-11 example in table 2.2, a score of two Cs and one A would combine for a PI score of C+ under M1, but would be considered a B under M2, which is actually how PI-11 is assessed. Single-dimension indicators simply take the score of the single dimension and are not eligible for a "+" rating. As shown in figure 2.2, most indicators are scored according to the M1 methodology (the figure excludes donor indicators). The implications of the different scoring methodologies for quantitative analysis are discussed in chapter 3.
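The M1 rule described above can be written as a short function. The following is a minimal Python sketch, not part of the PEFA methodology itself; the names `GRADES` and `m1_score` are illustrative assumptions. M2 is deliberately not sketched, because it relies on the framework's conversion tables, which are not reproduced in this chapter.

```python
# Ordinal PEFA dimension grades, ordered from worst to best.
GRADES = ["D", "C", "B", "A"]


def m1_score(dimension_scores):
    """Combine dimension grades using the M1 ("weakest link") method.

    The indicator takes the lowest dimension grade; a "+" is appended
    when at least one other dimension scored higher. A single-dimension
    indicator keeps its one grade and never receives a "+".
    """
    if not dimension_scores:
        raise ValueError("at least one dimension score is required")
    ranks = [GRADES.index(score) for score in dimension_scores]
    lowest = min(ranks)
    plus = "+" if max(ranks) > lowest else ""
    return GRADES[lowest] + plus


print(m1_score(["D", "D", "C"]))  # two Ds and one C -> D+
print(m1_score(["C", "C", "A"]))  # two Cs and one A -> C+
print(m1_score(["B"]))            # single dimension -> B
```

The second call reproduces the PI-11 illustration in the text: under M1 the combination would be C+, whereas the M2 conversion table used for PI-11 yields a B.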

Framework comparability with other assessment frameworks
Although the scoring system, performance measures, and other aspects of the framework have been the subject of some criticism, other assessment frameworks apply similar methodologies, and PEFA remains the primary tool for measuring performance in PFM. Nevertheless, other tools exist for measuring aspects of PFM performance in more depth that are complementary rather than comparable to the PEFA framework. Some of these comparable and complementary tools are discussed further below.
Several diagnostic instruments are available to assess public expenditure, financial management, and procurement. Broad diagnostic tools include the World Bank's Public Expenditure Reviews (PERs). A PER assesses public expenditure policies and programs to provide governments with an external review of their policies in order to strengthen budget analysis and processes and achieve a better focus on growth and poverty reduction. Despite the number of tools and instruments available, PFM performance is increasingly measured by PEFA. PEFA has several advantages over other frameworks. First, it is the most comprehensive measure of PFM to date, covering the entire budget cycle as well as other key PFM areas. Second, it is standardized so that it can be repeated and changes can be tracked over time. Third, it includes a narrative report that discusses qualitative evidence to complement the quantitative scores. Fourth, the PEFA Secretariat provides quality assurance to ensure that the standards are met consistently across countries and time. As a result, PEFA has the most coverage globally.
Moreover, PEFA tends to produce scores comparable to those of similar diagnostics (figures 2.3-2.5). CPIA-13 (CPIA indicator 13) data have been collected for longer than PEFA data and are generated annually for most low- and middle-income countries, but ratings are made publicly available only for countries receiving International Development Association lending. CPIA-13 is rated on a scale from 1 (worst) to 6 (best). General trends between the two data sets are the same. Low-income countries are underperforming compared with lower- and upper-middle-income countries. Likewise, Europe and Central Asia is performing better than the other regional groups, and Sub-Saharan Africa is performing the worst. However, the variations among income groups and regions are much smaller for CPIA-13 than for PEFA data. As mentioned previously, a disadvantage of the CPIA indicator is that it provides a single measure rather than a more disaggregated and detailed perspective on PFM performance, such as that provided by PEFA assessments. This narrow perspective is reflected in the narrow dispersion of averages, ranging from only 3.10 for low-income countries to 3.63 for upper-middle-income countries and from 3.17 for Sub-Saharan Africa to 3.79 for Europe and Central Asia.
Another publicly available PFM-related indicator is the Open Budget Index (OBI) score for budget transparency. The ratings (1-100) cover various years between 2006 and 2017, and components of the OBI are scored in a fashion similar to PEFA's M1 "weakest link" methodology. As with CPIA-13, global patterns are similar to those for PEFA, with low-income countries underperforming and the Europe and Central Asia region having a higher average than the rest. The main difference is the Middle East and North Africa region, which has the lowest average OBI score, but an overall score for PEFA and CPIA-13 in the middle of the other regions.

PEFA framework revisions
The data set we use in this report is based on the PEFA 2011 framework. The 2011 framework did not represent a significant departure from the 2005 framework, with the revision of just three indicators. As such, the data set includes assessments carried out under the 2005 framework that are comparable with assessments carried out under the 2011 framework. The 2016 framework represents a more significant revision (figure 2.6). The conceptual framework, though still based on the annual budget cycle, has been revised and now includes 7 pillars, 31 indicators, and 94 dimensions (figure 2.7). Whereas some indicators remain directly comparable, other indicators have been revised, dropped, or added, rendering them less comparable or, in some cases, incomparable. Moreover, the scoring guidance has been revised to clarify what constitutes a D score and to clarify issues that arose using the 2011 framework. However, the transition to the 2016 framework has been managed using a 2011 annex, whereby dual assessments are carried out using both the 2011 and 2016 frameworks. This treatment has had the benefit of generating one more wave of comparable assessments within the data set used in this report, which allows us to observe a larger sample of changes in PFM performance over time.
The upgrade was introduced to reflect evolution in the field of PFM and address shortcomings in the 2011 framework. It was developed with feedback from development partners, government officials, and other users and experts, as well as through public consultation. Significant changes between the 2011 and 2016 versions of the framework include the following:
• The addition of four new indicators
• The expansion and refinement of existing indicators
• Clearer and more consistent structure for reporting PEFA findings as well as improved terminology and measurement 3
• Increased emphasis on the use of macrofiscal forecasts, the medium-term fiscal strategy and outlook, a medium-term perspective in expenditure budgeting, and the alignment of strategic plans with budget allocations
• Expansion of coverage of revenue administration to include both tax and nontax revenues
• Elimination of specific indicators of donor practices
• Application of a D score for all practices below the basic level of performance and where there is insufficient information to validate a higher score. D also replaces the NR (not rated) code used previously where there is insufficient information on an indicator.
Figure: The public financial management (PFM) system according to the 2016 Public Expenditure and Financial Accountability (PEFA) framework
The PEFA Secretariat has been reviewing assessments for quality since the launch of the framework in 2005; however, the decision to include proposed changes in the final assessment report rests with assessment managers, teams, and funding agencies. Prior to the introduction of PEFA Check, an official endorsement by the PEFA Secretariat, quality assurance was less standardized; therefore, the data from the assessments may be more susceptible to measurement error. PEFA Check sought to improve confidence in the findings of the PEFA assessment. This quality assurance process ensures the accuracy of supporting evidence and compliance with the PEFA methodology. PEFA Check indicates that the assessment followed the PEFA methodology and fulfilled six formal criteria (PEFA Secretariat 2012b). Since its introduction in 2012, 85 out of 121 national assessments, or 70 percent, have received the PEFA Check. But the improvements to the quality assurance process made in the 2016 framework also point to weaknesses in the 2011 framework and in assessments that were not quality assured. By extension, these weaknesses translate into weaknesses in our data set. In the next section, we provide further descriptive statistics from our data set.

Coverage of PEFA assessments
Our data set contains the scores from 307 PEFA assessments completed in 144 countries between June 2005 and March 2017. Per figure 2.8, almost all of today's low-income countries, lower-middle-income countries, and upper-middle-income countries have undertaken one or more assessments. In contrast, very few of today's high-income countries have undertaken an assessment. Moreover, some of today's higher-income countries undertook assessments when they were classified as lower income, which further biases the number of observations in the data set toward lower-income countries.
This lower-income-country bias inevitably leads to geographic bias within the data set. Coverage is almost complete across the world's poorest countries in South Asia and Sub-Saharan Africa, as highlighted in figure 2.9. Although the East Asia and Pacific and the Latin America and the Caribbean regions also have high coverage ratios, they are overrepresented by small island developing states (SIDS). Norway is the only high-income OECD country to have undertaken an assessment.

Frequency of PEFA assessments
Between 2006 and 2016, approximately 27 countries, on average, completed PEFA assessments annually. The overall number of countries carrying out assessments has declined from a peak of 37 in 2008 to 22 in 2016 (see figure 2.10). Repeat assessments now make up most assessments undertaken, although 40 of the 144 countries have yet to undertake a second assessment. While some of these countries may undertake repeat assessments in the future, more than a decade has passed since some countries carried out their first and only assessment, suggesting that they may be one and done. Of the 104 countries that have carried out at least one repeat assessment in our data set, 2 are on their fifth assessment, 9 are on their fourth assessment, and 35 are on their third assessment. The average length of time between assessments in our data set is 50 months (approximately four years), with the shortest time span between assessments being 9 months and the longest being 104 months.

Publication of PEFA assessments
Approximately 66 percent (202) of the assessments in the data set have been made publicly available through the PEFA Secretariat website. In some cases, the failure to publish is simply due to delays, while in others, the government has chosen not to publish the report. In addition, 30 assessments are drafts that have yet to be finalized, while a further 75 have been finalized but not published. The data set does not distinguish between an explicit decision not to publish and a failure to publish arising for more mundane reasons. Nevertheless, the standard time from draft to publication (six months to one year according to the PEFA Secretariat) suggests that few of the older assessments are likely to become publicly available whether they are draft or final. While some countries tend to publish all or none of their assessments, others choose to publish some but not others (figure 2.11). For example, 45 countries have made public all of their assessments, 13 have made none available, while 46 have chosen to make some but not others available. For countries that have carried out just one assessment, 18 have published, while 22 have not.

Donor involvement in PEFA assessments
As discussed in chapter 1, one of the original objectives of the PEFA assessment was to coordinate donor assessments of PFM performance in the countries in which they provide financial and technical support. The seven PEFA partners continue to commission PEFA assessments, but an additional 33 international organizations have been involved in some capacity (figure 2.12). Almost 20 bilateral and multilateral development organizations have led PEFA assessments, with the European Union and the World Bank undertaking by far the most to date, followed by the IMF and the Swiss State Secretariat for Economic Affairs. As discussed earlier, a growing number of governments are managing the assessment process and writing the reports themselves. 4

PFM performance
Using the conversion, weighting, and aggregation methods described in more detail in the next section, we observe an upward trend in the aggregate overall PEFA score over time, rising from an average of between C and C+ in 2006 to slightly above C+ in 2016 (figure 2.13, panel a). But some variation is also evident in the median score across years and in the spread of the overall score within years (figure 2.13, panel b).
Part of the upward trend arises because most assessments undertaken since 2010 have been repeat assessments, which have tended to produce higher overall scores on average than in previous years (figure 2.14). This finding is not surprising given the incentives associated with improving PFM performance and attracting donor financing. Nevertheless, the overall trend in year-on-year performance has been relatively slow moving and well below "good practice" A scores.
Lower-middle-income countries have contributed more to the improvement in average overall performance over time than low-income countries and upper-middle-income countries, whose performance has been more stagnant (figure 2.15). In fact, since 2015 overall scores have been higher, on average, for lower-middle-income countries than for upper-middle-income countries. Sub-Saharan Africa has been consistently the lowest-performing region, on average, while Europe and Central Asia has generally produced the highest average overall scores over time (figure 2.16). The average score for South Asia has climbed, although the sample is relatively small. Just 14 assessments have been undertaken by the 8 countries in South Asia over the sample period, compared with 115 by the 48 countries in Sub-Saharan Africa. The average overall performance of other regions has been more variable from year to year.
Over time, the external scrutiny and audit pillar has had consistently the worst average performance, while the cross-cutting comprehensiveness and transparency pillar has had the best performance. In recent years, there has been an upward trend in performance on the predictability and control in budget execution and policy-based budgeting pillars and a downward trend in performance on the accounting, recording, and reporting pillar. Over the long run, there are signs of improvement across all five pillars (figure 2.17). Within pillar 1 (policy-based budgeting), countries have consistently performed better on PI-11 (orderliness and participation in the annual budget process) than on PI-12 (a multiyear perspective in fiscal planning, expenditure policy, and budgeting). Although performance on the latter has trended upward over time, while performance on the former has been relatively stagnant, a gap of 0.5 to 1.0 on the ordinal scale remains (figure 2.18).
Within pillar 2 (predictability and control in budget execution), there is an equally distinct upward trend for most indicators as well as distinct differences in performance across indicators (figure 2.19). On average, countries have performed best on PI-17 (recording and management of cash balances, debt, and guarantees) and worst on PI-21 (effectiveness of internal audit) on a fairly consistent basis, with a 1.0 to 1.5 differential in the ordinal scale. Average performance over time on the other indicators is more bunched, although countries tend to perform better on PI-16 (predictability in the availability of funds for commitment of expenditures) compared with indicators related to expenditure controls on payroll (PI-18), procurement (PI-19), and other expenditures (PI-20), which are the subject of discussion in chapter 5 on PFM and corruption.
On pillar 3 (accounting, recording, and reporting), PI-23 (availability of information on resources received by service delivery units) stands out as the indicator where performance is consistently poor on average and relatively stagnant over time (figure 2.20).
Although PI-25 (quality and timeliness of annual financial statements) has been fairly consistently the second worst-performing indicator, average annual performance has improved over time. In contrast, average annual performance for both PI-22 (timeliness and regularity of accounts reconciliation) and PI-24 (quality and timeliness of in-year budget reports) has barely changed over time.
As shown in figure 2.21, under pillar 4 (external scrutiny and audit), we observe a similar separation in performance across indicators, with countries performing better on PI-27 (legislative scrutiny of the annual budget law) and, to a lesser extent, on PI-26 (scope, nature, and follow-up of external audit) and scoring close to a D+ on average over time for PI-28 (legislative scrutiny of external audit reports). PI-26 has displayed a more discernible trend of improvement in average annual performance compared with the other two indicators over time.
Finally, for the cross-cutting pillar 5, we observe an upward trend in the average score for all six indicators from approximately 2011, which suggests that countries undertaking repeat assessments have improved on these indicators. But again, as shown in figure 2.22, we observe a fairly consistent hierarchy of scoring over time, with countries performing better on average on PI-5 (classification of the budget), PI-6 (comprehensiveness of information included in budget documentation), and PI-8 (transparency of intergovernmental fiscal relations), compared with PI-7 (extent of unreported government operations), PI-9 (oversight of aggregate fiscal risk from other public sector entities), and PI-10 (public access to key fiscal information).
The foregoing results suggest that it is easier to achieve better scores on some indicators than on others. Figure 2.23 shows the distribution of scores for the interquartile range by indicator (that is, scoring for the middle 50 percent of the distribution), further demonstrating that the distribution of some indicator scores is skewed. Notable examples are PI-22 and PI-23, where scores are concentrated in the D+ to C+ range, compared with PI-11 and PI-17, where scores are concentrated in the C+ to B+ range. For the purposes of statistical inference used in later chapters, it is preferable to have more normally distributed data.
Indeed,  notes that it is easier to improve on some indicators than on others by changing the form of parts of the PFM system rather than how they function, which he describes as isomorphic mimicry. He notes that de jure, upstream, and concentrated functions of the PFM system are more amenable to isomorphic mimicry than de facto, downstream, and deconcentrated functions and characterizes each dimension of the PEFA 2011 framework in these terms. Using this characterization of the data, we construct indexes to compare relative performance. Panels a to c of figure 2.24 clearly show that performance is stronger on de jure, upstream, and concentrated dimensions on average over time compared with performance on de facto, downstream, and deconcentrated dimensions, respectively. However, performance on the latter has also trended up over time, in line with the trend in overall performance in panel d, implying that functional dimensions have also improved over time.

Summing up
The PEFA framework has changed over time, with the most significant changes occurring between the 2011 and 2016 frameworks. Nevertheless, the framework remains based on the annual budget cycle, the scoring methodology has remained broadly similar, and therefore the data across assessments remain comparable. However, in using the data, which are based predominantly on the 2011 framework, it is important to be cognizant of the revisions made in 2016, as they represent weaknesses in our data. These weaknesses include poor coverage of some PFM functions, lack of clarity on the scoring of some dimensions, and issues regarding quality assurance of assessments. All of these issues have been addressed in the 2016 framework revisions but remain pertinent when using the data set, particularly with respect to earlier assessments.
We have also noted that many other PFM diagnostic tools cover similar areas and use similar scoring methodologies to PEFA. These include the CPIA and OBI assessments, which produce comparable findings to PEFA assessments. Other diagnostics, including PERs and Public Expenditure Tracking Surveys (PETSs), should be viewed as complementary, more in-depth analyses. The main strength of PEFA over similar assessments is its breadth of coverage, which has made it the most frequently used PFM diagnostic tool globally, with repeat assessments allowing PFM performance to be tracked over time. Nevertheless, several biases are evident in the data set, with poorer regions and smaller countries overrepresented compared with higher-income regions and larger countries. The former may be driven by donor engagement in these countries, with donors still commissioning the vast majority of assessments. The latter may be explained somewhat by the fact that larger countries have switched their attention to carrying out subnational PEFA assessments. A final analytical concern is the lack of variation in performance, both across countries and across time, which is discussed in more detail in the following section.

ISSUES IN QUANTIFYING AND ANALYZING PFM PERFORMANCE
As noted in the previous section, PEFA dimension and indicator scores are based on an ordinal scale from D to A. Unlike performance assessments such as the OBI, a PEFA assessment carries no overall score. However, this has not stopped researchers from quantifying and aggregating PEFA assessment scores to investigate their relationships with other indicators. The main advantage of quantifying and aggregating the assessment scores is to facilitate the analysis and comparison of PFM performance across a large sample of countries and over time. This report is no different in this regard. Chapters 3-6 all convert PEFA scores to numerical values to investigate the relationship between aspects of PFM performance and other governance indicators. In this section, we explain the conversion, weighting, and aggregation methodologies used in subsequent chapters and their limitations. We also discuss other limiting factors associated with using PEFA assessment scores for quantitative analysis.

Quantifying PEFA scores
The PEFA Secretariat has noted that there is no scientific method for conversion and aggregation, which requires assumptions about the weighting to be applied to scores, measures, and assessments (PEFA Secretariat 2009). With respect to scores, numerical conversion requires a judgment about the distance between the ordinal rankings D to A (that is, should progressing from C to B carry the same weighting as improving from B to A?). With respect to measures, there is a question of whether some dimensions, indicators, or pillars are more important than others. And with regard to assessments, there is a need to consider whether some assessments should be assigned lower importance or disregarded because of concerns over the quality of the assessment. As discussed in chapter 1, issues of quality may arise because of biases generated by assessment teams and a lack of quality assurance over some assessments. There are also related questions over how to treat missing data.
Initially, the PEFA Secretariat made no recommendations on how to undertake conversion and aggregation aside from appealing to researchers to document the reasons for their assumptions (PEFA Secretariat 2009). Recently, the PEFA Secretariat has recommended converting indicators using the methodology employed in de Renzio (2009). 5 However, in general, researchers have tended to take the PEFA framework as they find it, using only limited subjective judgment. As such, most research has applied equal weights to the distance between scores, weighted either indicators or dimensions with equal importance, and treated all assessments with equal status. This report does not diverge significantly from previous research in this respect.

Weighting scores
The standard approach of researchers has been to convert the categorical PEFA scores D to A to numerical values 1 to 4, as shown in table 2.4 (see, for example, de Renzio 2009). This approach is sometimes applied to dimension scores and sometimes to indicator scores, depending on the assumptions related to calculating an aggregate score. For the individual dimension and indicator scores, the implied assumption is that the same level of effort is required to move from D to C, from C to B, and from B to A. Andrews (2009) provides an alternative approach to scoring individual dimensions, assigning dummy variables to separate lower scores (that is, assigning a 0 to D or C) from higher scores (that is, assigning a 1 to a B or A). This conversion methodology is used and discussed further in chapter 6, where we examine the effect of individual tax administration dimensions on domestic resource mobilization. However, the analysis in chapters 4 to 6 is based either on an overall score or on composite scores and therefore requires numerical conversion to make aggregation possible. Following previous studies, we make the assumption of equal weights between categorical scores.
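The two conversion approaches described above can be sketched in a few lines of Python. This is a minimal illustration rather than the report's own code: the names `GRADE_TO_NUMBER`, `to_number`, and `to_dummy` are hypothetical, and the half-point values for "+" grades are an assumption, because table 2.4 is not reproduced in this section.

```python
# Equal-interval conversion of PEFA grades to numbers (D = 1 through A = 4).
# Assumption: "+" grades sit halfway between two whole grades.
GRADE_TO_NUMBER = {
    "D": 1.0, "D+": 1.5,
    "C": 2.0, "C+": 2.5,
    "B": 3.0, "B+": 3.5,
    "A": 4.0,
}


def to_number(grade):
    """Return the numerical value of a grade, or None if it is not rated."""
    return GRADE_TO_NUMBER.get(grade)


def to_dummy(grade):
    """Binary split of a dimension grade: 0 for D or C, 1 for B or A."""
    if grade in ("D", "C"):
        return 0
    if grade in ("B", "A"):
        return 1
    return None  # missing or not rated


print(to_number("C+"), to_dummy("B"), to_dummy("C"))
```

The equal-weights assumption is visible in the mapping: each one-grade step, from D to C, C to B, or B to A, adds exactly 1.0.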

Weighting measures
de Renzio (2009) pioneered the approach to quantifying and aggregating PEFA assessment scores and investigating their relationship with other indicators, including income, aid dependency, population, and governance indicators. The conversion method he uses involves assigning numerical values from 1 to 4 to the ordinal scale from D to A for each indicator (table 2.4) and calculating an overall score as the average for the 28 indicators. He excludes the three indicators of donor practice because of the possible bias to the overall score of the "country PFM system performance" and because of the number of missing values for these indicators. Subsequent studies using PEFA data have taken a slightly more nuanced approach to calculating an overall PEFA assessment score depending on their research question. In an evaluation of donor support to PFM in low- and middle-income countries, de Renzio, Andrews, and Mills (2010) calculate their overall score based on indicators PI-5 to PI-28. They justify the exclusion of PI-1 to PI-4 on the basis that these are indicators of outcome rather than indicators of PFM quality per se. Similarly, Haque, Knight, and Jayasuriya (2012) omit both the PI-1 to PI-4 and the D-1 to D-3 indicators from their analysis of PFM in the Pacific to avoid "results being biased by macroeconomic factors or the different practices of development agencies operating in different countries." A study investigating the drivers and effects of PFM performance further omits indicators PI-13 to PI-15, which measure the quality of tax administration, to obtain an overall score that covers "the quality of PFM systems on the expenditure side." A study on the political economy of PFM reform experiences takes the same approach.
Another difference between these studies is the choice of whether to aggregate the scores for indicators or their underlying dimensions. Aggregating indicators recognizes the M1 "weakest link" scoring methodology and gives equal weighting to each indicator. Aggregating dimensions disregards the M1 "weakest link" scoring methodology and gives equal weighting to each dimension. de Renzio, Andrews, and Mills (2010) do the latter, justifying the decision on the basis of fully exploiting the information underlying indicators as well as avoiding the downward bias associated with the M1 "weakest link" scoring methodology. Other studies aggregate an overall score using the converted indicator scores.
In this report, we borrow from these previous methodologies, adding our own nuances. Like the studies that omit indicators PI-13 to PI-15, we exclude the revenue administration indicators because our research questions in chapters 4, 5, and 6 are more relevant to the expenditure side of PFM. However, we follow the example of de Renzio, Andrews, and Mills (2010) by disregarding the M1 "weakest link" scoring methodology and aggregating on the basis of dimension rather than indicator scores. But our approach to aggregation is slightly more nuanced. As described in figure 2.25, it involves three aggregation steps in calculating an overall score. Our justification for this approach is that it provides an equal weighting to each of the pillars of the 2011 PEFA framework, rather than ascribing more importance to phases of the budget cycle that have more indicators.
Nevertheless, we investigate the implications of calculating the overall score in different ways. Method 1 recognizes the M1 "weakest link" scoring methodology and gives equal weight to the indicators. Methods 2, 3, and 4 are all variations that disregard the M1 "weakest link" scoring methodology. Method 2 is simply an average of the dimensions, following de Renzio, Andrews, and Mills (2010), and so provides equal weighting to each dimension. Method 3 gives equal weight to indicators through a two-step calculation. Method 4, our preferred method, gives equal weight to pillars through a three-step calculation. As expected, the first methodology provides the lowest scores due to the downward bias of the M1 "weakest link" scoring methodology. Nevertheless, all four scoring methodologies provide approximately similar summary statistics. The largest difference between mean scores excluding tax administration indicators is 0.092 or 3 percent between method 1ii and method 3ii. Standard deviations and variances are also similar across methodologies. Moreover, as shown in table 2.6, all four scoring methodologies are highly correlated with one another, at the 95 percent level or higher. As such, the question of which to use for the purposes of statistical analysis is a question of judgment as to the weighting of the constituent parts of the PEFA framework. In this report, we base our calculation of the overall score on the view that all four stages of the budget cycle and the cross-cutting theme of transparency as represented by the pillars of the PEFA 2011 framework should carry an equal weighting.
A fundamental problem with the relatively equal weighting applied to dimensions and indicators in all of these methodologies is the issue of form over function. Some PEFA dimensions measure form (often categorized as de jure) as opposed to function (de facto) (see Andrews 2009). Ronsholt (2011) contrasts de jure dimension PI-11(i), where "a C score is attained as long as an annual budget calendar exists, even though there may be substantial delays in implementation, with not enough time allowed to budget entities to complete detailed estimates," with de facto dimension PI-12(i), where a C score "requires that two-year forecasts of fiscal aggregates are actually produced on a rolling annual basis." De facto dimensions are frequently correlated with upstream and concentrated 6 activities as opposed to downstream and deconcentrated activities. Andrews (2009) estimates that de jure, upstream, and concentrated dimensions account for 41 percent, 25 percent, and 41 percent of PEFA dimensions, respectively, noting that progress on these dimensions has been slower for African countries. Concerns have also been raised that, given the donor-recipient dynamics associated with PEFA scores, recipient countries may engage in "gaming" by targeting easier-to-move indicators.
These issues provide further justification for providing equal weighting to the pillars of the PEFA 2011 framework rather than to indicators or dimensions. Moreover, although the overall score is used throughout the chapters that follow, we also formulate hypotheses based on individual dimensions, individual indicators, and composites of indicators. This discussion follows the example of more recent research examining specific questions using specific PEFA dimensions. For example, Knack, Biletska, and Kacker (2017) focus on PI-19 to investigate the effect of better procurement practices on corruption. Similarly, Ricciuti, Savoia, and Sen (2016) use tax administration dimensions to investigate the effect of political institutions on fiscal capacity.
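The four ways of calculating an overall score discussed in this subsection can be illustrated with a short Python sketch. It rests on several assumptions: the nested layout (pillar, then indicator, then numeric dimension scores), the pillar and indicator labels, and the function names are hypothetical, and the M1 rule is simplified to "lowest dimension score, plus half a point if any other dimension scores higher" and applied to every indicator.

```python
from statistics import fmean

# Hypothetical assessment: pillar -> indicator -> dimension scores (1 = D ... 4 = A).
scores = {
    "Pillar X": {"PI-a": [4.0, 2.0], "PI-b": [3.0]},
    "Pillar Y": {"PI-c": [1.0, 1.0, 4.0]},
}


def m1_indicator(dims):
    """Simplified M1 'weakest link': lowest dimension, +0.5 if any other is higher."""
    lowest = min(dims)
    return lowest + 0.5 if max(dims) > lowest else lowest


def method1(data):
    """Equal weight to indicators, each scored with the M1 rule."""
    return fmean([m1_indicator(d) for inds in data.values() for d in inds.values()])


def method2(data):
    """Equal weight to every dimension (single average over all dimensions)."""
    return fmean([s for inds in data.values() for d in inds.values() for s in d])


def method3(data):
    """Equal weight to indicators in two steps: dimensions -> indicators -> overall."""
    return fmean([fmean(d) for inds in data.values() for d in inds.values()])


def method4(data):
    """Equal weight to pillars in three steps: dimensions -> indicators -> pillars -> overall."""
    return fmean([fmean([fmean(d) for d in inds.values()]) for inds in data.values()])


overall = {name: fn(scores) for name, fn in
           [("method1", method1), ("method2", method2),
            ("method3", method3), ("method4", method4)]}
```

In this toy example the M1-based method yields the lowest overall score, which is the direction of bias the text describes; the size of the gap here is an artifact of the made-up numbers.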

Weighting assessments
Several interrelated issues arise with respect to the weighting of assessments. These issues include how to treat missing values, how to treat earlier assessments, and how to assure the quality of assessments. In general, we seek to maximize the sample by retaining as many assessments as possible. Nevertheless, we ascribe more importance to more recent assessments.
The validity of converting categorical scores to numerical scores and then aggregating is also affected by missing data for some dimensions. With regard to the 2011 PEFA framework, data may be missing for three reasons: the data were NA (not applicable to the context), NU (not used for the assessment), or NR (not rated due to insufficient information). The revised 2016 methodology ascribes missing values to NA and NU categories and assigns a D score when sufficient information is not available to establish actual performance (equivalent to NR under the 2011 framework). Therefore, researchers using the 2011 framework data set generally assign a D score to an NR score and missing values to NA and NU scores.
However, according to discussions with the PEFA Secretariat, prior to the introduction of the 2016 guidance, assessors may have ascribed NR scores unsystematically. Therefore, we assign NR scores as missing values rather than D scores. Similarly, earlier assessments carried out under the 2005 framework include missing values for dimensions that were added through the introduction of the 2011 framework. The effect of the missing values for some dimensions is that, when multidimensional indicator scores are aggregated (as described above), the missing dimension value assumes the value of the other dimensions. This implies an upward bias if the dimension would have been assessed at a lower score and a downward bias if the dimension would have been assessed at a higher score. As most missing values apply to earlier assessments, in the chapters that follow we construct samples that focus on a country's most recent assessments rather than pooling observations.
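The effect of the two treatments of NR scores can be shown with a small Python sketch. The grade mapping and the function names `dimension_value` and `indicator_average` are assumptions made for illustration; the example only demonstrates that averaging over the rated dimensions implicitly gives a missing dimension the average of the others.

```python
from statistics import fmean

GRADE_VALUES = {"D": 1.0, "C": 2.0, "B": 3.0, "A": 4.0}


def dimension_value(grade, nr_as_d=False):
    """Map a dimension grade to a number; NA and NU (and, by default, NR) become None."""
    if grade == "NR" and nr_as_d:
        return GRADE_VALUES["D"]
    return GRADE_VALUES.get(grade)  # None for NA, NU, NR, or unknown labels


def indicator_average(grades, nr_as_d=False):
    """Average the rated dimensions only; return None if no dimension is rated."""
    values = [dimension_value(g, nr_as_d) for g in grades]
    rated = [v for v in values if v is not None]
    return fmean(rated) if rated else None


grades = ["B", "NR", "D"]                       # hypothetical three-dimension indicator
as_missing = indicator_average(grades)          # NR treated as a missing value
as_d = indicator_average(grades, nr_as_d=True)  # NR scored as D
```

Treating NR as missing gives a higher average here than scoring it as D, which is the upward bias described above when the unrated dimension would in fact have scored poorly.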

Analyzing PFM performance
In the chapters that follow, we employ regression analysis to examine the relationship between PFM performance and political institutions, budget credibility, corruption, and domestic resource mobilization (DRM). In general, we use ordinary least squares (OLS), but also use weighted least squares (WLS) and panel estimators where the data are amenable. 7 However, our research design, data, and estimators suffer from inherent problems, including endogeneity and limited sample size. The extent to which these problems can be and are addressed in the next four chapters is discussed below, along with the implications for interpreting the results.
In the chapters that follow, we generally estimate equations in the form of equation (2.1):

$$Y_i = \alpha + \beta X_i + \gamma Z_i + \varepsilon_i \qquad (2.1)$$

where $Y_i$ is our dependent variable, $X_i$ is our explanatory variable with estimated coefficient $\beta$, $Z_i$ is a control variable with estimated coefficient $\gamma$, $\alpha$ is the estimated constant term, and $\varepsilon_i$ is the estimated error term. Furthermore, we generally use PEFA scores as our explanatory variable, apart from chapter 4, where we use PEFA scores as the dependent variable.
Technically, endogeneity refers to a situation where the explanatory variable and the estimated error term are correlated. This presents a problem for our estimated coefficients because least squares estimation works on the assumption of no endogeneity. When this assumption is violated, least squares estimation may produce biased results. In other words, relationships may be estimated to be higher or lower than their true relationships. Endogeneity concerns can arise because of omitted variable bias, measurement error, and simultaneity, all of which are present to varying degrees in the chapters that follow.
Omitted variable bias arises when the estimated equation is poorly specified. For example, in chapter 5, although we hypothesize that there is a relationship between corruption and PFM, we also recognize that corruption is not wholly explained by PFM, and some of the factors influencing corruption may be unobservable. To deal with this issue, we include control variables based on the existing literature on the relationship. However, adding control variables reduces the degrees of freedom available to estimate the parameters' variability. To circumvent omitted variable bias arising from unobservable factors, we estimate the relationship over time using panel estimators. This method is possible with our data set because of the presence of repeat assessments. However, it is a valid method for dealing with omitted variable bias only when the suspected omitted variable is not expected to change over the sample period. For example, panel estimators are a good way of dealing with the fact that "culture" is often an important but unobservable determinant of corruption that changes only slowly over time.
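A toy example may help show why repeat assessments matter for omitted variable bias. The sketch below uses synthetic numbers, not PEFA data, and assumes that the panel estimator in question is a fixed-effects ("within") estimator; the function names are hypothetical. Each country has an unobserved, time-invariant effect, and the outcome rises by exactly 2 for each unit of the explanatory variable.

```python
from collections import defaultdict


def slope(xs, ys):
    """Bivariate least squares slope: sum(dx * dy) / sum(dx ** 2)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def within_slope(rows):
    """Fixed-effects slope: demean x and y within each country, then pool."""
    groups = defaultdict(list)
    for country, x, y in rows:
        groups[country].append((x, y))
    dx, dy = [], []
    for obs in groups.values():
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        dx += [x - x_bar for x, _ in obs]
        dy += [y - y_bar for _, y in obs]
    return sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)


# Synthetic panel: y = country effect + 2 * x, with effects of 10 (A) and 0 (B).
rows = [("A", 1.0, 12.0), ("A", 2.0, 14.0), ("B", 3.0, 6.0), ("B", 4.0, 8.0)]
pooled = slope([x for _, x, _ in rows], [y for _, _, y in rows])
within = within_slope(rows)
```

Pooling the observations ignores the country effect and, in this constructed case, even reverses the sign of the estimated slope, whereas demeaning within countries recovers the slope of 2. The correction works only because the omitted effect is constant over time, which mirrors the caveat in the text.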
Measurement error is another potential source of bias in each of the chapters that follow. As discussed previously, measurement error in our PEFA variables may arise from incorrect weighting of the scores, measures, and assessments. As noted previously, we assume equal weights on the distance between scores that may not reflect the "true" level of effort required to improve from D to C compared with improving from C to B and so forth. Measures have been given relatively equal weighting in calculating both the overall score and composite scores within some of the chapters. In chapter 6, we avoid the weighting issue associated with aggregation by examining the relationship between DRM and specific PEFA dimensions related to tax administration. However, this approach does not fit the research design across all chapters, so measurement error due to the biases associated with aggregation remains a concern in chapters 4, 5, and 6.
Of course, measurement error may also arise in our dependent variables. This is of particular concern in chapter 5, which investigates the relationship between PFM and corruption using perceptions of corruption as a proxy for the latter. This approach has been criticized for not capturing corruption accurately. Similarly, in chapter 6, which investigates the relationship between PFM and DRM, our ratio of tax to gross domestic product may induce bias because there are inconsistencies in the treatment of subnational revenues. To address these concerns, we also use alternative variables as robustness tests when possible and appropriate.
Measurement error may also arise because of inconsistency in scoring across assessments. As discussed above, the PEFA Secretariat has been reviewing the quality of assessments since the launch of the framework in 2005; however, the decision to include proposed changes rests with assessment managers, teams, and funding agencies. To strengthen the quality of PEFA reports, the PEFA Secretariat introduced a quality assurance system (the PEFA Check). However, this was only done in 2012, and, although compliance is improving, it is far from perfect. As a result, there may be measurement errors within some assessments. We attempt to circumvent the issue at least partially by focusing on the most recent country assessments. We also maximize the sample size where feasible to reduce the risk of measurement error biasing our estimated coefficients.
In addition, concerns regarding measurement error arise because of time inconsistencies between PEFA variables and other variables of interest. These concerns may arise because of potential inconsistencies relating to the "date of assessment" within the data set. While a PEFA assessment provides an evidence-based analysis of PFM performance at a specific point, it takes four to five months to conduct the assessment and prepare a draft report (PEFA Secretariat 2012a). Although two assessments may have a date of assessment of June 2010, the evidence may represent 2006-08 in one country and 2007-09 in the other country. These concerns are generally addressed by matching the PEFA score to three-year moving averages for the year of the assessment and the previous two years for other variables in the studies. For example, if the PEFA score is for 2015, the associated variable takes the average value for 2013, 2014, and 2015.
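The matching rule described above can be written as a short Python helper. This is a sketch under stated assumptions: the function name `three_year_average` and the dictionary layout are hypothetical, and averaging over whichever of the three years are available (rather than dropping the observation) is a choice made here, not one documented in the text.

```python
def three_year_average(series, assessment_year):
    """Average a variable over the assessment year and the two preceding years.

    `series` maps year -> value. Years with no value are skipped (an assumption);
    the function returns None if none of the three years is available.
    """
    window = [assessment_year - 2, assessment_year - 1, assessment_year]
    values = [series[y] for y in window if series.get(y) is not None]
    return sum(values) / len(values) if values else None


# Hypothetical annual values of another variable of interest.
variable = {2012: 13.0, 2013: 14.0, 2014: 15.0, 2015: 16.0}
matched = three_year_average(variable, 2015)  # averages 2013, 2014, and 2015
```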
Simultaneity bias, or reverse causality, arises when the direction of causality between the dependent and explanatory variable is unclear. Taking the example of PFM and corruption once again, while we argue that "better PFM" can reduce corruption, it is equally plausible that lower levels of corruption allow for "better PFM" or that both are jointly determined by other factors such as country income level. Methods to address endogeneity concerns arising from simultaneity are beyond the scope of this research and the sample size of the data set.
Biases in our sample with respect to income levels, geography, and donor influence, discussed in the previous section, are pertinent issues for regression analysis. Although our data set includes more than 307 assessments in 144 countries, in the chapters that follow the number of usable observations falls because of the unavailability of other data required to address the research questions. For example, in chapter 5, which investigates the relationship between corruption and PFM, the sample size falls to 99 in our cross-sectional regression analysis. Moreover, not all countries have completed repeat assessments, reducing the sample size of our panel estimations further. As such, although the data set is the most comprehensive source of data on PFM performance to date, sample size remains a limiting factor, and the robustness of the results in later chapters needs to be interpreted in this light.

Summing up
Quantifying PEFA scores and aggregating them into overall scores require assumptions on weighting scores, measures, and assessments. There is no theoretical or scientific basis for these assumptions. In general, we follow the approach taken by previous researchers who have used PEFA data for quantitative analysis, but this does not eliminate the significant challenges that persist in transforming letter grades to numerical values. We also note that, from a statistical perspective, differing methodologies for calculating the overall score are highly correlated with one another, so the choice of methodology is largely academic.
We also note that significant endogeneity concerns arise when using PEFA data. Although we attempt to circumvent some of these problems, others are beyond the scope of this research and data set. Consequently, estimated coefficients in the results section of later chapters may be biased, which would affect the integrity of the results. They should be interpreted as indicators of the direction of the relationship rather than as actual effects. The time inconsistency issues and the limited number of observations compound the challenges of regression analysis with PEFA data, and these issues are exacerbated in the panel regressions that use repeated PEFA assessments.
Overall, it is worth emphasizing that the PEFA assessments were not designed for statistical analysis and that using them in quantitative regressions presents a series of econometric issues that cannot be fully resolved in this book, or in other papers that apply a similar approach.

SHAKIRA MUSTAPHA
This chapter investigates the extent to which political institutions are associated with public financial management (PFM) performance. Using cross-country data on PFM performance from the Public Expenditure and Financial Accountability (PEFA) data set, we find no evidence in support of theoretical propositions that ex ante legislative budgetary institutions are stronger in presidential systems or majoritarian systems. We also find no evidence that having a more programmatic political party system is associated with better systems for strategic budgeting or better institutions for overseeing the handling of public finances. We do, however, find some evidence that having multiple political parties controlling the legislature is associated with better PFM systems, both overall and with respect to ex ante legislative budgetary powers.

INTRODUCTION
Practitioners active in the field of public sector reform have long recognized that reform is far from a purely technocratic exercise whereby technical solutions based on best practices can be transferred easily from one country to the next irrespective of context. This is perhaps more pertinent in the field of PFM, where reforms affect the budget, an inherently political process that entails politicians allocating scarce resources to competing priorities (von Hagen and Harden 1995). In addition to the "public politics" of negotiating trade-offs, there are the "private politics" of special interests engaging in rent seeking and pursuing political advantage. Analysis of the political economy of PFM suggests that actors with incentives to obstruct reforms are a more critical bottleneck than weak capacity (Bunse and Fritz 2012; Keefer 2011). Political incentives for reforming the PFM system often stem from the wider political and institutional environment. Most of the existing theoretical and empirical literature on the political and institutional determinants of PFM performance comes from countries that are already at an advanced stage of economic development (Lienert 2005; Wehner 2010; Wehner and de Renzio 2013). In this chapter, we use this literature to formulate hypotheses relating to the form of government, electoral system, programmatic parties, and divided government and then use the PEFA data set to probe whether hypotheses developed with reference to high-income countries travel to other contexts. This is important given that formally similar institutions can have quite different "real-life" implications and consequences in high- versus low- and middle-income countries, as described by North, Wallis, and Weingast (2006) and Rodrik, Subramanian, and Trebbi (2002). Although a few papers have sought to do this using the PEFA data set, we add value to the discussion in three ways. First, we focus on the relationship between political institutions and specific elements of the PFM system rather than the entire PFM system. Second, we retest some hypotheses from previous studies using a larger sample size. Third, we consider two additional characteristics, specifically the electoral system and divided government.
The analysis presented in this chapter seeks to assess the advantages and disadvantages of using the PEFA data set to deepen our understanding of the contextual factors that can influence the potential scope for PFM reforms in a given country. This is important given the increasing recognition of the importance of good PFM for the effectiveness of the state. Good PFM not only supports fiscal discipline and macroeconomic stability but also is critical for effectively delivering the services on which human and economic development rely. For these reasons, many donors consider PFM to be a priority.
The chapter is laid out as follows. We begin with a brief overview of relevant literature and the hypotheses to be tested. We then describe the variables and data sources used in the analysis, some basic bivariate analysis, and the empirical models to be tested. This is followed by a presentation and discussion of the results of the econometric analysis.

LITERATURE REVIEW
Several studies have used the PEFA data set to investigate country characteristics associated with strengthening the overall PFM system. Of the political and institutional variables considered to date, state fragility and political instability have been found to have a statistically significant negative correlation with the quality of PFM systems. The argument is that political stability is a prerequisite for developing and improving institutions because, in its absence, capacity tends to be very weak, informality predominant, and political will lacking. In contrast, the link between PFM quality and other political variables such as forms of government and democracy level is much less compelling, with studies often finding either weak (in magnitude and statistical significance) or no relationship.
This chapter adds value to this existing literature in the following ways. First, we focus exclusively on political and institutional contextual factors that are likely to influence the incentives of politicians to reform specific elements of the PFM system, such as legislative budgetary powers, strategic budgeting, and accountability structures. Notably, we use the literature on higher-income countries (Lienert 2005; Wehner 2010) to formulate our hypotheses and use the PEFA data set to probe whether hypotheses developed with reference to Organisation for Economic Co-operation and Development (OECD) countries apply to other contexts. Second, although we consider the association between the quality of the aggregate PFM system and each of these macropolitical or institutional factors, we do not consider all PFM elements individually. Instead, we limit our focus to those areas for which the theoretical relationship with certain political and institutional variables tends to be more compelling and for which the required data are available. This approach has three advantages:
1. It allows us to retest previous variables (for example, form of government) that were found to be weak or statistically insignificant in previous studies that focused on explaining the performance of the aggregate PFM system or very broad PFM pillars.
2. It allows us to consider a wider set of political and institutional variables than those that have been considered to date (for example, electoral system and divided government).
3. It makes it easier to assess the plausibility of the underlying causal arguments by focusing on specific elements of the PFM system rather than the entire system.
Our theoretical propositions are as follows:

Forms of government
According to Posner and Park (2007), legislatures in OECD countries tend to have a stronger role in presidential systems than in Westminster-style parliamentary systems, where the executive often dominates. 1 Lienert (2005) contends that in presidential systems "the legislature is a powerful agenda-setter and decision-maker." For a sample of 28 (mostly) high-income countries, he examines the linear relationship between an index of legislative budgetary powers and an index of separation of political power. He finds that the legislative authority to shape the size of the annual budget is strong in a presidential form of government and particularly weak in countries with Westminster parliamentary systems. Using a multiple ordinary least squares (OLS) regression, Wehner (2005) finds no evidence of an inherent difference in legislative budgetary powers between presidential 2 and nonpresidential systems for a sample of 43 national legislatures in OECD countries. Similarly, for a sample of 43 (mostly) low- and middle-income countries, de Renzio (2009) finds no statistically significant relationship between the overall quality of the PFM system (as measured by PEFA) and the form of government after controlling for other factors, including democracy. One reason for these contradictory results is that the sample size may be too small and lacking in variation to uncover these relationships. Another reason is that the hypothesis may be too broad. Although presidential systems often create a separation of powers allowing a greater role for the legislature in the management of public finances, this role may not translate immediately into improvements in the overall quality of the PFM system, because of other factors beyond the legislature's scope of control, such as technical capacity. We are therefore interested primarily in the relationship between the political system and the parts of the PFM system that are specific to the role of the legislature. We further distinguish between the role of the legislature in ex ante budgeting and ex post oversight, because these tend to differ depending on the type of political system. 3 Our first hypothesis is that countries with presidential regimes are likely to allow the legislature to be more involved in the management of public finances.

Hypothesis 1:
Countries with presidential regimes are more likely to have an incentive to develop PFM systems that allow for more legislative involvement in the management of public finances, especially ex ante.

Electoral systems
The type of electoral system also shapes legislative behavior. The argument is as follows. Plurality or majoritarian rule is geared toward holding politicians accountable, and proportional representation is geared toward representing different voters in the legislative process (Persson and Tabellini 2005). This means that in plurality systems, it is possible for the voters to identify who is responsible for policy decisions and to oust officeholders whose performance they find deficient. Politicians in majoritarian systems are therefore more likely to face sharper individual incentives to please their territorially defined constituencies than politicians under proportional elections (Persson and Tabellini 2005) and thus will have an incentive to push for a greater role in formulating the budget. In contrast, politicians under a majoritarian system are likely to have less interest in exercising oversight ex post, since they might not be able to hold any minister closely associated with the president to account and doing so may have limited relevance for their chances of reelection.
Although there are several empirical studies on the relationship between electoral systems and fiscal outcomes (Persson and Tabellini 2005; von Hagen 2002), there is none exploring the relationship between electoral systems and the quality of PFM institutions. The first set of studies generally finds that overall government spending and deficits are smaller in majoritarian countries, supporting the idea that the design of electoral rules entails a trade-off between accountability and representation. Our second hypothesis is that countries with a majoritarian electoral system are more likely to allow the legislature to be more involved in the management of public finances.

Hypothesis 2:
Countries where legislators are elected under a majoritarian electoral system are more likely to have an incentive to develop PFM systems that allow for more legislative involvement in the management of public finances, especially ex ante.

Divided government
The dispersion of political power among different political parties in the government may also be associated with the quality of the PFM system. Divided government is defined as "the absence of simultaneous same-party majorities in the executive and legislative branches of government" (Elgie 2001). According to this definition, divided government in parliamentary regimes takes the form of minority government (Wehner 2005).
The study of the effects of divided versus single-party governments on PFM systems has been confined largely to OECD countries. Wehner (2010), for example, found that divided government is associated with greater legislative financial scrutiny in a sample of 30 OECD countries. The underlying argument is that, in countries experiencing protracted spells of divided government, legislatures have an incentive to champion reforms to strengthen their capacity for scrutiny in order to have the means to challenge executive-led fiscal policy. Our third hypothesis is therefore that divided governments are associated with higher-quality legislative involvement in the management of public finances.

Hypothesis 3:
Countries with divided governments are more likely to have an incentive to develop PFM systems that allow a higher quality of legislative involvement in the management of public finances, ex ante and ex post.

Programmatic parties
Cruz and Keefer (2012) argue that, when politicians are not collectively organized, particularly into programmatic political parties, they have weak incentives to pursue broad public policies that rely on a well-functioning administration. They further contend that, in the absence of programmatic parties, politicians are less able to act collectively to demand that the executive implement transparent and rule-bound administrative practices. They support their argument using the ratings of 511 World Bank public sector reform loans in 109 countries as the dependent variable in a logistic regression. Subsequent work applies these insights to examine the relationship between programmatic parties and the quality of public financial management (as measured by a country's most recent PEFA), finding a relationship of potentially substantial impact, though of weak statistical significance, for a sample of 102 countries. In fact, the authors conclude that the relationship is "significantly weaker compared to the relationship between the presence of programmatic parties and the success of World Bank projects supporting public sector reforms that Cruz and Keefer (2012) report and more likely to be influenced by which countries are included and how specific countries and parties are coded." In contrast, a revised version of the paper that uses all PEFA observations for each country, including repeat assessments, finds that programmatic parties appear to have a positive and strong impact on PFM quality. However, when using the World Bank's Country Policy and Institutional Assessment indicator 13 (CPIA-13) as the proxy for PFM performance, programmatic parties no longer appear as a significant factor.
Here, instead of looking solely at the quality of the overall PFM system, we focus on the relationship between the programmatic party variable and specific elements of the PFM system that are likely to be of particular salience to politicians organized into programmatic parties: (a) strategic budgeting; (b) internal audit; (c) accounting, recording, and reporting; and (d) external audit.
Programmatic political parties provide electorates with meaningful choice over policies by reaching out to them through coherent political programs (Cheeseman et al. 2014). Politicians belonging to such parties therefore have an incentive to support reforms to ensure that systems are in place to link high-level policy decisions to the PFM system to maintain credible stances on broad public policies. These high-level policy decisions may include the overall fiscal strategy and the allocation of resources in line with politically determined priorities. However, in countries where political parties do not campaign on a coherent policy program, politicians are less likely to have an incentive to develop systems that would allow the budget to be used as a planning tool for achieving the government's policy goals. They may even be averse to such systems, which can undermine their ability to allocate resources according to their own private interests. Our fourth hypothesis is therefore that countries with programmatic political parties are more likely to develop higher-quality strategic budgeting as a feature of their PFM systems.
Countries with programmatic parties should also prefer financial management systems that allow them to monitor the possible diversion of financial resources away from their priorities. These systems might include higher-quality arrangements for accounting and reporting, internal audit, and external audit. Weaknesses in these areas allow for leakages and other corrupt practices that would undermine the credibility of the electoral commitments of a programmatic party. If such a party does not govern according to its programmatic platform, it could be held accountable in the next electoral round (Cheeseman et al. 2014). Our fifth hypothesis is therefore that countries with programmatic political parties are likely to have higher-quality accountability mechanisms for their PFM.

Hypothesis 4:
Countries with programmatic political parties are more likely to have an incentive to develop PFM systems for strategic budgeting.

Hypothesis 5:
Countries with programmatic political parties are more likely to have an incentive to develop PFM systems that allow for higher-quality accountability mechanisms.

Quality of the PFM system
Our primary measure of the quality of PFM systems is based on the PEFA data set as described in chapter 2. We exclude countries with missing scores on several dimensions when measuring the quality of the overall PFM system (note 4) or specific elements. In addition to measuring the aggregate PFM system, we also compute measures of specific elements of the PFM system that are relevant to our theoretical propositions. Given that we are looking at specific elements rather than the overall PFM system, we use the M1 scoring methodology where applicable. These elements are as follows:
• Legislative budgetary powers (budget preparation). Average of scores of the following PEFA indicators: PI-6, comprehensiveness of information included in budget documentation (submitted to the legislature for scrutiny and approval), and PI-27, legislative scrutiny of the annual budget law.
• Legislative budgetary powers (execution and evaluation). Score of the following PEFA indicator: PI-28, legislative scrutiny of external audit reports.
• Strategic budgeting. Average of scores of the following four PEFA dimensions: PI-12(i), preparation of multiyear fiscal forecasts and functional allocations; PI-12(ii), scope and frequency of debt sustainability analysis; PI-12(iii), existence of sector strategies with multiyear costing of recurrent and investment expenditure; and PI-12(iv), links between investment budgets and forward expenditure estimates.
• Internal audit. Average of scores of the following PEFA dimensions and indicators: PI-18(iv), existence of payroll audits, and PI-21, effectiveness of internal audit.
• Accounting, recording, and reporting. Average of scores of the following PEFA dimensions and indicators: PI-22(i)-(ii), timeliness and regularity of accounts reconciliation; PI-23, information at service delivery level; PI-24, quality and timeliness of in-year budget reports; and PI-25, quality and timeliness of annual financial statements.
• External audit. Score of the following PEFA indicator: PI-26, scope, nature, and follow-up of external audit.
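The element definitions above can be sketched in code as follows. This is only an illustration: it assumes that letter grades have already been converted to numbers (the conversion itself is described in chapter 2 and is not reproduced here), it treats PI-22(i)-(ii) as a single entry keyed "PI-22", and the rule of returning no score when any component is missing is an assumption, as are the function name and the example values.

```python
from statistics import mean

# Components of each element, following the list above.
# "PI-22" stands in for PI-22(i)-(ii); that simplification is an assumption.
ELEMENTS = {
    "legislative_ex_ante": ["PI-6", "PI-27"],
    "legislative_ex_post": ["PI-28"],
    "strategic_budgeting": ["PI-12(i)", "PI-12(ii)", "PI-12(iii)", "PI-12(iv)"],
    "internal_audit": ["PI-18(iv)", "PI-21"],
    "accounting_reporting": ["PI-22", "PI-23", "PI-24", "PI-25"],
    "external_audit": ["PI-26"],
}


def element_scores(numeric_scores):
    """Average the numeric scores of each element's components.

    Returns None for an element if any component is missing
    (a hypothetical rule chosen for this sketch).
    """
    result = {}
    for element, components in ELEMENTS.items():
        values = [numeric_scores.get(name) for name in components]
        if any(value is None for value in values):
            result[element] = None
        else:
            result[element] = mean(values)
    return result


# Hypothetical country with only some indicators scored.
example = {"PI-6": 3.0, "PI-27": 2.0, "PI-28": 1.5,
           "PI-18(iv)": 2.0, "PI-21": 3.0, "PI-26": 2.5}
scores = element_scores(example)
```

Keeping the element definitions in one dictionary makes it easy to check them against the list in the text.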
As a robustness check, we also use the World Bank's CPIA-13, which measures the quality of budgetary and financial management. The correlation between CPIA-13 and the aggregate PEFA score is quite high, at 0.775.

Measuring forms of government
To test whether the form of government affects the quality of the PFM system and legislative budgeting more specifically, we use the Inter-American Development Bank's Database of Political Institutions (2015) to construct a dummy variable for presidential governments (PRES 1) that is equal to 1 for systems with unelected executives, with presidents who are elected directly or by an electoral college, or with no prime minister (note 5). In systems with both a prime minister and a president, we consider the following factors to categorize the system:
a. Hold veto power. President can veto legislation and the parliament needs a supermajority to override the veto.
b. Appoint prime minister. President can appoint and dismiss prime minister, other ministers, or both.
c. Dissolve parliament. President can dissolve parliament and call for new elections.
The system is presidential if (a) is true or if both (b) and (c) are true (note 6). Governments are parliamentary (PRES 1 = 0) when the legislature elects the chief executive or if that assembly or group can recall him or her.
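The decision rule for systems with both a president and a prime minister can be written as a short function. The sketch below covers only that rule; the function and argument names are hypothetical, and the fallback described in note 6 for missing or ambiguous information is not modeled.

```python
def pres1_for_mixed_system(holds_veto, appoints_pm, dissolves_parliament):
    """Return 1 (presidential) or 0 for a system with a president and a prime minister.

    Rule from the text: presidential if (a) is true, or if (b) and (c) are both true.
    """
    is_presidential = holds_veto or (appoints_pm and dissolves_parliament)
    return 1 if is_presidential else 0


# Hypothetical cases, not taken from the data set.
examples = {
    "veto only": pres1_for_mixed_system(True, False, False),
    "appoints and dissolves": pres1_for_mixed_system(False, True, True),
    "appoints only": pres1_for_mixed_system(False, True, False),
}
```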
We also consider a more straightforward classification that is based solely on whether the government in democratic countries (note 7) can be removed by a legislative majority during its constitutional term in office (also known as a confidence requirement). Following the literature (Persson and Tabellini 2005), systems in which governments cannot be removed by the assembly are coded as "presidential" (PRES 2 = 1), while systems in which they can be removed are coded as nonpresidential (PRES 2 = 0) (note 8).

Measuring electoral systems
Our most basic measure is a simple classification of the electoral formula into "majoritarian," "mixed," or "proportional" electoral rules using the Varieties of Democracy Institute's V-Dem database (note 9), resulting in a binary indicator (dummy) variable, majoritarian. More precisely, countries electing their lower house exclusively by plurality rule in the year of the PEFA assessment (note 10) are coded as MAJ = 1, and all others as MAJ = 0.

Measuring divided government
Our measure of divided government is based on the degree of fragmentation of the legislature (Divided govt 1). The divided party control of legislature index from the V-Dem database assesses the extent to which legislative chambers are controlled by different political parties. Extreme positive values represent "divided party control," intermediate values signify "unified coalition control," and extreme negative values signify "unified party control." This variable is available for 46 countries in our sample, with observations for at least six years (inclusive) prior to the year of the earliest or most recent PEFA assessment (note 11). We calculate a 10-year average of this variable for these countries.
As an alternative measure, we construct a divided government index, which is the share of years in which the government did not command a legislative majority in the lower house (Divided govt 2). It covers the 10-year period immediately before the year of the country's most recent PEFA assessment. We consider the fraction of seats held by all government parties (note 12) using the Database of Political Institutions (2015), assigning a score of 0 for a year in which the government held more than 50 percent of seats and a score of 1 otherwise. We then compile the index by summing across the 10 years for each country and dividing by 10. Possible index values therefore range between 0 (never minority government) and 1 (always minority government). According to the data, 45 out of the 101 countries for which this measure is available had experience with minority government at some point during the 10-year period considered.
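The construction of the second index can be illustrated with a short sketch. The function name and the seat shares below are hypothetical; the scoring rule (0 when the government holds more than 50 percent of seats, 1 otherwise) and the division by 10 follow the description above.

```python
def divided_govt_2(gov_seat_shares):
    """Share of the 10 years in which the government lacked a lower-house majority.

    gov_seat_shares: ten yearly values, each the fraction of seats
    held by all government parties (between 0 and 1).
    """
    if len(gov_seat_shares) != 10:
        raise ValueError("expected exactly 10 yearly seat shares")
    # A year scores 0 if the government held more than 50 percent of seats, else 1.
    yearly_scores = [0 if share > 0.5 else 1 for share in gov_seat_shares]
    return sum(yearly_scores) / 10


# Hypothetical country: seven majority years followed by three minority years.
index = divided_govt_2([0.6] * 7 + [0.45, 0.5, 0.48])
```

Note that a government holding exactly 50 percent of seats counts as a minority year under this rule.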

Measuring programmatic parties
The "programmatic parties" variable is constructed in a manner similar to that of Cruz and Keefer (2012) and Fritz, Sweet, and Verhoeven (2014), both of which assume that a party is programmatic if it has a specific political orientation (right, left, or center) using variables from the Database of Political Institutions (2015). However, where applicable, we consider the three largest government parties and the largest opposition party, weighting each party by its share of seats in the legislature, and sum these values across the four parties (note 13). Our second measure is unweighted and is the fraction of parties in a country that are programmatic (either left, right, or center). Both measures therefore range from 0 to 1. Although programmatic parties exist in several middle-income countries, they are rather rare in low-income environments. Of the 124 countries in our sample, 105 countries have a measure of programmatic parties: 31 are low-income countries (weighted average of 0.29), 42 are lower-middle-income countries (weighted average of 0.42), and 32 are upper-middle-income countries (weighted average of 0.62).
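One way to read the two measures is sketched below. It assumes that each party's weight is its share of all seats in the legislature and that a party contributes its weight only if it is coded as programmatic; the function name, the data layout, and the example parties are hypothetical.

```python
def programmatic_measures(parties):
    """Compute the weighted and unweighted programmatic party measures.

    parties: list of (seat_share, is_programmatic) pairs for the parties
    considered (up to three government parties and one opposition party).
    """
    # Weighted: sum of seat shares of the parties coded as programmatic.
    weighted = sum(share for share, is_programmatic in parties if is_programmatic)
    # Unweighted: fraction of the considered parties that are programmatic.
    unweighted = sum(1 for _, is_programmatic in parties if is_programmatic) / len(parties)
    return weighted, unweighted


# Hypothetical legislature: three government parties and the largest opposition party.
example_parties = [(0.40, True), (0.15, False), (0.05, True), (0.30, True)]
weighted, unweighted = programmatic_measures(example_parties)
```

Because seat shares sum to at most 1, both measures stay within the 0 to 1 range stated in the text.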

Bivariate analysis
As a first step, we use Spearman rank correlation coefficients to see the extent to which our data confirm previous findings from the literature as well as some of our hypotheses. Of the two nonbinary political variables considered, only the programmatic party system measure (unweighted) has a weak but statistically significant positive relationship with the quality of the overall PFM system (at the 10 percent level) (see table 3.1). Regarding the specific elements of the PFM system, the programmatic party system measure (weighted and unweighted) has a weak but statistically significant positive relationship with legislative budgetary powers, both overall and ex ante. The divided government variable is positively and weakly associated with only one specific PFM element: ex ante legislative budgetary powers.
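For readers unfamiliar with the statistic, the Spearman coefficient is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below uses only the Python standard library; the helper names are hypothetical, the data are invented, and no significance test is included.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values for five countries: programmatic measure and overall PFM score.
rho = spearman([0.1, 0.4, 0.4, 0.7, 0.9], [1.8, 2.0, 2.6, 2.4, 3.1])
```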
Concerning the relationship between the form of government and the quality of the PFM system, we do not find a statistically significant difference in the means between presidential and nonpresidential governments with regard to the quality of the overall PFM system as well as legislative budgetary powers (ex ante and ex post).
However, contrary to our expectations, we do find that nonmajoritarian electoral systems have better-quality PFM systems than majoritarian ones, both overall and for ex ante legislative budgetary powers, with the difference statistically significant at the 5 percent and 1 percent level, respectively. Majoritarian systems, however, perform better on average with respect to ex post legislative budgetary systems (at the 5 percent level). Overall, simple bivariate statistics do not provide strong evidence in support of our theoretical propositions. However, these tests might not be very informative, because the countries included in our sample are heterogeneous and the quality of their PFM systems is potentially influenced by some important factors that may obscure the impact of the macropolitical variables. We therefore take an econometric approach.

ESTIMATION APPROACH
In this section, we test our hypotheses using multivariate analysis to understand how these and our other variables jointly affect PFM quality. Given the mostly cross-sectional nature of our data, the standard econometric method to be used is OLS regression, the limitations of which are discussed in chapter 2.
• Cross-sectional regressions. For these models, we exploit cross-country variation in the quality of PFM in low- and middle-income countries as measured by their most recent PEFA assessment (note 14). We regress each country's PEFA score on a five-year lagged average (unless stated otherwise) of the other variables (depending on data availability) prior to the year of the most recent PEFA assessment for the country.
• First-differences method. Although one of the political and institutional features is relatively fixed, some features exhibit within-country variation, specifically with regard to programmatic parties. The measure of divided government is also likely to vary across time, but an insufficient number of observations prevents its use for this method. In order to understand patterns of institutional change as well as to control for possible time-invariant omitted variables, we run a first-differences regression model for countries with repeat PEFA assessments. However, we cannot run a fixed-effects estimation because of the varying time interval between PEFA assessments across countries. Instead, we compute the absolute change in PEFA scores and the absolute change over the same period in the variables capturing country characteristics. This approach allows us to relate changes in PFM quality to changes in these country characteristics. Specifically, we ask: if characteristics change within a country, how much is PFM quality expected to change on average?
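The differencing step in the first-differences method can be sketched as below. The record layout, variable names, and values are hypothetical; the point is only that each country contributes one observation made up of changes between its first and most recent assessment, with the change in year giving the number of years between assessments.

```python
def first_differences(first, latest):
    """Change in every shared variable between the first and most recent assessment."""
    return {key: latest[key] - first[key] for key in first if key in latest}


# Hypothetical country with two PEFA assessments.
first = {"year": 2008, "pefa": 2.0, "programmatic": 0.25, "log_population": 16.0}
latest = {"year": 2015, "pefa": 2.5, "programmatic": 0.5, "log_population": 16.125}

change = first_differences(first, latest)
# change["year"] is the number of years between the two assessments.
```

The resulting changes, one row per country, would then be used in an OLS regression of the change in the PEFA-based score on the changes in country characteristics.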
Apart from the variables of interest, namely "quality of PFM systems" (dependent variable) and "macropolitical variables" (independent variable), some other independent variables are included in the analysis. They represent other country-specific factors that have been identified in previous studies as influencing the level and change in the quality of the PFM system (de Renzio, . To avoid the trap of "garbage-can" regressions, we only include variables that have tended to be statistically significant in previous analyses and that have a strong theoretical foundation. These include gross domestic product (GDP) per capita, GDP growth, resource dependence, population size (note 15), and political stability. Their theoretical relationship with the PFM system is as follows:
• Income level. Income is likely to be strongly associated with a wide range of variables that would enable better PFM systems such as financial, human, and technical resources. Citizens in higher-income countries may also have a higher demand for outcomes associated with a well-functioning PFM system, such as better fiscal performance and public service delivery.
• Economic growth. Higher rates of recent growth are expected to facilitate institutional improvements through their impact on resource availability and possibly growing expectations of what government ought to achieve.
• Resource dependency. Resource dependency may undermine the quality of a PFM system in several ways. It can weaken the social contract and accountability between citizens and state elites and create greater incentives for lack of transparency in the management of public funds. In addition, volatile revenues due to commodity price shocks and other types of fiscal shocks might negatively affect budget planning and execution.
• Population size. A large population may be associated with more resources (financial and human) as well as a greater need for advanced PFM systems. Similarly, larger states may find the cost of centralized PFM systems to be low and their return on investment high.
• Political stability. Politically unstable countries find it more challenging to carry out PFM reforms because of weak capacity, widespread informality, and lack of political will.
We also included dummies for colonial heritage, specifically Anglophone and Francophone dummies, although previous studies found them not to have significant effects (de Renzio, Andrews, and Mills 2011). However, we included these variables because cross-national commonalities may be due to institutional replication from colonial powers transferring institutional features to their colonies; once in place, these institutions may be resistant to change (Acemoglu, Johnson, and Robinson 2001; Lienert 2003). Andrews (2010) also found some preliminary evidence that colonial heritage may matter for the quality of certain elements of the PFM system, with Francophone countries substantially lagging other groups (note 16) in external audit and legislative audit review. Wehner (2005), in contrast, found that British colonial heritage is negatively associated with legislative budget capacity. The summary statistics of these variables are presented in annex 3A, table 3A.1.
We estimate two models using OLS: a cross-sectional model across countries and a first-differences model focusing on within-country changes over time. In both models, i indexes countries, Y is the dependent variable of interest (PFM performance), X is the political institutional variable, Z is a matrix of socioeconomic and political macro-level variables, δ is fixed effects, and ε is the error term.
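The display equations for these two models are not reproduced in this text. A form consistent with the variable definitions given here would be the following; the coefficient symbols α, β, and γ, the use of Δ for the change between a country's first and most recent assessment, and the unindexed placement of δ are all assumptions, since the text does not specify them.

```latex
\begin{align}
  Y_i        &= \alpha + \beta X_i + \gamma' Z_i + \delta + \varepsilon_i \\
  \Delta Y_i &= \alpha + \beta\,\Delta X_i + \gamma'\,\Delta Z_i + \delta + \varepsilon_i
\end{align}
```

Under this reading, β is the coefficient of interest in both models: it relates the political institutional variable to PFM performance in the first, and the change in that variable to the change in PFM performance in the second.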

Forms of government
Contrary to our hypothesis, we do not find that countries with presidential systems have better PFM systems. This finding is similar to de Renzio (2009), who found a negative (though statistically insignificant) coefficient on his presidential dummy variable when looking at correlates of the PEFA overall score. Using our broad definition of presidential government in table 3.1, we find that having a presidential regime is negatively associated with the overall quality of the PFM system as well as with legislative budgetary powers (overall, ex ante, and ex post). However, none of these coefficients is statistically significant in table 3.2, even when we control for a country's democracy level (annex 3A, table 3A.2) (note 17). Nevertheless, in line with earlier studies and our expectations, we find that larger population size, lower reliance on natural resources, and greater political stability are generally associated with better PFM quality, albeit at different confidence levels in columns 1-3 in table 3.2. Notably, when we consider narrower PFM definitions, the fit of the model declines, falling from almost 50 percent in column 1, when we investigate the determinants of the overall PFM system, to as low as 7 percent in column 4, when we measure only ex post legislative budgetary powers. In the case of the latter, only the economic growth variable is statistically significant at the 10 percent level, with faster-growing economies tending to have stronger systems for ex post legislative involvement in the budget process. Using our more simplistic classification of forms of democratic government (PRES 2) in table 3.3 also produces results contrary to our hypothesis, although the negative relationship between the presidential dummy and overall PFM quality is now statistically significant at the 1 percent level in column 1.
Furthermore, contrary to our hypothesis that presidential systems are relatively strong ex ante, the coefficient for ex ante legislative powers in column 3 is negative, but not statistically significant.

Electoral system
Contrary to our theoretical proposition, but in line with our bivariate analysis, a majoritarian electoral system is not associated with greater legislative budgetary powers during budget formulation, as shown in column 3 of table 3.4. In fact, although the coefficient is not statistically significant at conventional levels, it is negative rather than the expected positive.

Divided government
Using our first measure of divided government, we find that more divided party control of the legislature is associated with better PFM systems, both overall (at the 1 percent level in column 1 of table 3.5) and for specific elements related to legislative powers (at the 10 percent and 1 percent level, respectively, in columns 2-3 of table 3.5). The size of the coefficient is also largest for ex ante budgetary powers (0.32).
Conversely, using our more simplistic measure, we find that having a more divided government is associated with a lower quality of the overall PFM system (as shown in column 1 of table 3.6) as well as specific elements relating to legislative budgetary powers (as shown in columns 2-4). However, none of these coefficients is statistically significant at conventional levels, with the exception of ex post budgetary powers.

Programmatic parties
Contrary to expectations, we find a negative relationship between how programmatic the party system is and the quality of the aggregate PFM system as well as specific elements (table 3.7). This negative relationship, however, is only statistically significant at the 10 percent level in column 4, when the dependent variable is the quality of accounting, recording, and reporting. Moving from having a party system that is completely nonprogrammatic to one that is completely programmatic is associated with a decrease of 0.28 for accounting, recording, and reporting. Notably, the magnitude of the programmatic party coefficient increases to 0.38 when we use an unweighted programmatic party measure in column 6. Ultimately, both results are counter to our proposition that, in a political system in which parties have clear policy agendas, politicians are more likely to have an incentive to demand systems that can provide information on the cost of programs and the use of resources to ensure that resources are allocated to their priorities. Our results differ from those of , because our measures of both programmatic parties and overall PFM quality are different.
We also test this hypothesis using a first-differences model in table 3.8. This model uses the absolute change in the PEFA-based measure of PFM quality as the dependent variable. The coefficients of the absolute change in the political variable of interest, programmatic parties (weighted) in table 3.8, are not statistically significant at conventional levels. The number of years between assessments also appears to have no statistical correlation with the change in PFM quality. However, increases in both population size and political stability tend to be associated with a small improvement in PFM quality in some models in table 3.8 at varying confidence levels. For example, in column 3, a 1 percent increase in total population size is associated with an increase in the internal audit score of 0.03.
We also find no statistically significant relationship between the change in our macropolitical variables and the change in our alternative measure of PFM quality, CPIA-13 average (as shown in annex 3A, table 3A.4). However, in these models, the absolute change in GDP per capita is positively associated with a small improvement in the CPIA score at the 10 percent level. More specifically, a 1 percent increase in GDP per capita is associated with an improvement in the CPIA score of 0.0035.

Summary of results
Our analysis shows that, with the exception of divided government, our macropolitical variables generally have a weak or no relationship with the quality of the PFM system (as measured by PEFA and CPIA) when we control for other country characteristics. In fact, to a large extent, we find no evidence in support of our theoretical propositions that the ex ante legislative budgetary institutions are stronger in presidential systems or majoritarian systems, with the sign of the coefficient in the opposite direction from what we predicted. Similarly, we find no evidence that having a more programmatic political party system is associated with better systems for strategic budgeting or better institutions for overseeing the handling of public finances. This lack of evidence in favor of our hypotheses, especially those developed on the basis of the experience of higher-income countries, may be because formally similar political institutions may function differently in low- and middle-income countries for reasons discussed below. We find that more divided party control of the legislature (Divided govt 1) is associated with better PFM systems, both overall and for specific elements related to legislative budgetary powers, especially ex ante at the 1 percent level. We also find the following weak, and counterintuitive, relationships:
• A presidential regime (as defined in terms of a confidence requirement) is negatively associated with the quality of the overall PFM system (at the 1 percent level).
• A more divided government (defined in terms of whether the government had a legislative majority in the lower house) is negatively associated with ex post legislative budgetary powers (at the 5 percent level).
• A more programmatic party system is associated with a lower quality of accounting, recording, and reporting (at the 10 percent level).
Furthermore, when we exploit within-country variation in our first-differences models, we find no statistically significant correlation between the absolute change in our measure of programmatic parties and the absolute change in the quality of the overall PFM system or specific elements. However, a larger population size and political stability are generally associated with an improvement in the quality of the PFM system overall and for specific elements.

Limitations of the study
The lack of a clear empirical relationship between these macropolitical variables and PFM quality should not be interpreted to mean that these factors have no strong predetermining effect on the quality of the PFM system and can therefore be disregarded when designing PFM reforms. There are four reasons for caution. First, our PEFA-based measure of PFM quality is not without limitations, as noted in chapter 2. This weakness is currently insurmountable given the absence of other available indicators for measuring the quality of PFM systems (overall and most elements) with the coverage and timeliness required for regression analysis.
Second, our measures of the political variables may also be subject to measurement error or be imperfect proxies for the characteristics they are intended to capture. For example, although we have improved on the programmatic party measure that has been adopted in previous studies by considering the share of seats in the legislature, this measure is less precise than other empirical work investigating the effect of programmatic party systems. Wantchekon (2003), for example, distinguishes between electoral platforms based on clientelism as opposed to the ones based on public policy (public goods) in Benin. Moreover, although the political and institutional variables used in this chapter are relatively well defined for high-income countries, they may not be reliable in some low- and middle-income countries, because they focus on formal aspects of democratic institutions that do not necessarily reflect the actual exercise of political power in these contexts. Informal institutions, such as family and kinship structures, traditions, and social norms, play a critical role in many political systems, and it may be misleading to examine the political incentives for reforms only through the lens of the formal institutions captured by the variables used in this chapter. As Rodrik, Subramanian, and Trebbi (2002, 24) conclude on the question of formal institutions and development, "Desirable institutional arrangements have a large element of context specificity, arising from differences in historical trajectories, geography, and political economy or other initial conditions." Hence, whether or not institutions lead to better PFM systems is as much a question of the incentive and enforcement mechanisms of the institutions themselves as the environment in which the institution operates.
Third, for each of the macropolitical variables, the fit of the model is generally lower when we investigate the correlates of specific elements of the PFM system compared with when we look at the quality of the overall PFM system. For example, the country variables used in the regression models in table 3.3 jointly account for only 7 percent of the variation in PFM quality across countries in column 4 as compared with 49 percent when the PEFA-based measure of quality of the overall PFM system is the dependent variable in column 1.
Finally, our first-differences models only consider a relatively short time period, with the average time between a country's first and most recent PEFA assessment being 6.5 years. The lack of a statistically significant coefficient on the change in the programmatic party variable is therefore not surprising given that this variable shows little variation over time compared with the change in other country characteristics considered, specifically income and population size.

Next steps
Given the limitations of quantitative analysis to generate insight into the political incentives for PFM reforms, further research is needed to inform how PFM reforms should be calibrated to country context. This research may not necessarily be empirical, but it might use a country's PEFA score as one of the key selection criteria when undertaking case studies. For example, a group of countries that are highly similar in most respects including their formal political features, but that perform quite differently in regard to their PEFA scores, could be investigated for possible reasons for this divergence, including the interaction between formal and informal institutions. Similarly, because most countries have had repeat PEFA assessments, we can also identify countries that are mostly similar, including similar initial PEFA scores, but that subsequently diverged on the basis of their most recent PEFA assessment. Furthermore, given that most studies in this area have focused on macropolitical variables, another potential area of research is to investigate the relationship between features of the microinstitutional environment and the quality of the PFM system. An example of a more microinstitutional feature is the degree of fragmentation of central finance functions. Such a study can be beneficial for two reasons. First, these institutional features can typically be adjusted more easily than high-level political variables and thus can potentially be altered as part of a PFM reform strategy if a convincing argument can be made. Second, the relationship between microfactors and PFM quality may be more direct than one with macropolitical characteristics, and thus causality may be more easily inferred.
Moreover, some of the existing research that has used the PEFA data set to measure PFM improvements has suggested that the existing quantitative analysis needs to be complemented by qualitative research. Qualitative research is envisaged to provide a more comprehensive understanding of the role of specific contexts, the role of different stakeholders and their motivations in pursuing PFM reforms, and how this influences the results and impact of reforms.
Finally, the growing number of subnational PEFA assessments and the growing popularity of decentralization reforms provide another research opportunity for assessing the extent to which certain political and institutional characteristics may explain differences in the quality of the PFM system. Ultimately, significant scope remains for using PEFA assessments to gain a greater understanding of the determinants of PFM quality, but further work is needed to overcome the challenges to using PEFA scores for statistical analysis.

NOTES
3. For example, the Westminster model is often weak ex ante. The U.K. Parliament abdicates the right of financial initiative to the executive. In contrast, the U.S. Congress is strong ex ante, with a complex system of specialized committees in both houses to make budgetary decisions with the support of extensive analysis from the Congressional Budget Office. Conversely, the Westminster model is relatively strong ex post, whereas the U.S. Congress conducts less ex post scrutiny, with no public accounts committee or equivalent (Pelizzo, Stapenhurst, and Olson 2006).
4. When looking at the overall quality of the PFM system, we exclude five countries with 10 or more missing scores for PEFA dimensions: Fiji, Lebanon, Myanmar, Nauru, and Uruguay.
5. This definition excludes countries with a communist government system. For the database, see https://mydata.iadb.org/Reform-Modernization-of-the-State/Database-of-Political-Institutions-2015/ngy5-9h9d.
6. If no information or ambiguous information is available on factors (a), (b), and (c), then if sources mention the president more often than the prime minister, the system is considered presidential (Estonia, the Kyrgyz Republic, Romania).
7. A regime is considered a democracy if the executive and the legislature are directly or indirectly elected by popular vote, multiple parties are allowed, there is de facto existence of multiple parties outside of the regime, there are multiple parties within the legislature, and there has been no consolidation of incumbent advantage (for example, unconstitutional closing of the lower house or extension of the incumbent's term by postponing of subsequent elections).
8. Nonpresidential systems include countries in which the head of state is popularly elected for a fixed term in office.
9. For the V-Dem database, see https://www.v-dem.net/en/data/data-version-8/.
10. For countries missing observations for the year of the PEFA assessment, we use the most recent observation three years before or after the PEFA assessment.
11. Data are available for 11 countries before the earliest PEFA assessment and 35 countries before the most recent PEFA assessment.
12. Calculated by dividing the number of government seats by total (government plus opposition plus nonaligned) seats.
13. We consider four political parties: the three largest government parties and the largest opposition party.
14. Ukraine is the exception across all models, with the PEFA scores from 2012 used instead of those from the most recent 2016 assessment because of the large number of missing indicator scores in the latter.
15. We do not control for being a small island state given that population size and small island dummy are highly correlated (−0.71) in our sample.
16. Relative to Anglophone countries and those of Portuguese heritage in Africa.
17. Democracy level is measured using Freedom House's level of democracy index.

SHAKIRA MUSTAPHA
In this chapter, we explore whether the credibility of the budget and fiscal outcomes improve with the quality of the public financial management (PFM) system in fragile and nonfragile states. Using a cross-sectional multiplicative interaction model, we exploit the variation in PFM quality as measured by Public Expenditure and Financial Accountability (PEFA) indicators and outcomes across countries. Our results are mixed. We find that, controlling for other determinants of credibility, better PFM quality is associated with more reliable budgets in terms of expenditure composition in fragile states, but not with aggregate budget credibility. Moreover, in contrast to existing studies, we find no evidence that PFM quality matters for fiscal outcomes, such as deficit and debt ratios, irrespective of whether a country is fragile or not. This is despite controlling for other key determinants of fiscal outcomes and running several robustness checks.

INTRODUCTION
The literature on PFM reforms has grown extensively in recent years; as a result, we now know much more about the effectiveness of PFM reforms than we did a decade ago. But significant gaps in knowledge remain. One such gap pertains to the outcomes of PFM reforms in "fragile states," that is, countries that either recently experienced conflict or have weak institutional capacity. This chapter aims to address this gap by examining possible links between the quality of the PFM system as measured by PEFA indicators and outcomes such as budget credibility and fiscal outcomes in fragile states.
Building or rebuilding fiscal institutions in fragile states is generally perceived as an important part of state building (Boyce and O'Donnell 2007; Ghani, Lockhart, and Carnahan 2005; World Bank 2011b). The underlying logic is that, if a state cannot tax reasonably or spend responsibly, a key element of statehood is missing because it would be unable to deliver basic goods and services or to manage expenditure in a manner that its citizens regard as effective and equitable. Although the evidence on what works when it comes to strengthening PFM systems in fragile states is growing (Fritz 2012; IMF 2017a; Williamson 2015), much less is known about the actual effects of these improved systems in these environments. Traditionally, a sound PFM system supports aggregate control, prioritization, accountability, and efficiency in the management of public resources and delivery of services. However, PFM systems in fragile states, even those conforming to "best practice," may fail to function as expected because of a crippling combination of factors that often leaves these states stuck in a "capability trap" (Pritchett and de Weijer 2011). Low human capacity, lack of physical infrastructure, and persistence of parallel informal systems are some of the factors that can impair the proper functioning of a well-designed PFM system in a fragile state.
This chapter investigates the wider question of whether PFM reforms can produce the desired outcomes in fragile states. From a political economy perspective, evidence that a well-functioning PFM system can be linked to tangible results even in fragile environments is important to convince decision makers in these countries to commit to these reforms. Furthermore, focusing on building sound fiscal institutions in fragile states may bring relatively high returns. For example, even though the development of effective budget institutions takes time and resources, these requirements tend to be much smaller than those needed for more general institutional improvements. Here we consider both a narrow and a broad definition of fragility because, although fragile states share some broad common characteristics, they are all different in their own ways. Context matters and needs to be understood.
We focus on understanding the impact of the PFM system on budget credibility and fiscal discipline in fragile states for two reasons. First, credibility and discipline are often the first and foremost concern in many low- and middle-income countries, with any efforts to address the other PFM objectives (strategic allocation of resources and efficient delivery of services) coming next. In addition, various macroeconomic goals and national objectives for development and public service delivery are easier to achieve when funds are disbursed as allocated. As a result, a credible budget is seen as a priority for many fragile states. According to the former president of Liberia, Ellen Johnson Sirleaf, "Perhaps our greatest fiscal challenge lies in focusing the expenditure of cash inflows from domestic revenue and from donors on established priorities. The better we can manage our public finances, the better we can deliver on our poverty reduction and job creation agenda" (World Bank 2011a, 3).
Achieving fiscal discipline also tends to be a priority for fragile states. Better fiscal outcomes are expected to widen the fiscal space, providing room to meet pressing development needs as well as the ability to respond to adverse shocks by running expansionary fiscal policies and therefore mitigating the impact of shocks on the population (Gelbard et al. 2015). This improvement can, in turn, enhance state legitimacy as well as avoid or minimize the risk of relapse to conflict. A second reason for focusing on budget credibility and fiscal outcomes relates to data availability. Measuring other PFM outcomes such as efficient service delivery or corruption tends to require special studies or imperfect proxies (see chapter 5 on PFM and corruption). In investigating the interaction between fragility and the effects of the PFM system, it is therefore reasonable to look first at budget credibility and fiscal outcomes.
Using a cross-country interactive regression model and a PEFA-based measure of PFM quality, we find mixed evidence regarding the relationship between PFM quality and budget credibility in fragile states, depending on the definitions of credibility and fragility used. On the one hand, better PFM quality is associated with better budget credibility, both aggregate and compositional, in nonfragile states. More important, although this relationship with aggregate budget credibility generally becomes insignificant in fragile states, there is some evidence that a positive and statistically significant relationship persists in fragile states when we look at compositional budget credibility and adopt the World Bank's definition of fragility. Better systems for predictability and control in budget execution, in particular, are associated with a higher level of compositional credibility in fragile states. On the other hand, there is no evidence that the quality of the overall PFM system matters for fiscal outcomes in either fragile or nonfragile states. However, given that estimating the impact of budget institutions on fiscal performance is plagued by several identification challenges (such as reverse causality and omitted variable bias) as well as potential limitations of the PEFA data set, results should be treated as preliminary.
The remainder of the chapter is structured as follows. We begin by summarizing the literature on the effects of budget institutions on budget credibility and fiscal outcomes before describing how we measure the key variables of interest and our empirical strategy. We then outline and discuss our results.

LITERATURE REVIEW
In this section, we first consider the broader literature concerning the track record of PFM reforms with regard to improving budget credibility and fiscal outcomes and then focus on these same outcomes in fragile states specifically. Although most studies find evidence that a stronger PFM system is associated with a more credible budget and better fiscal outcomes, very little can be gleaned from the existing literature about the achievements of PFM reforms in fragile states.

PFM system and budget credibility
We assume that a credible budget is one that displays minimal deviation from approved allocations, in aggregate and in composition. That budgets in most low- and middle-income countries deviate considerably from budget plans has been recognized for some time, with Wildavsky and Caiden (1980) identifying the numerous political and technical challenges that affect the ability of poor countries to manage budgets effectively. Schick (1998) also has classified various types of harmful budgeting practices in low- and middle-income countries that contribute to unreliable budgets. These practices include unrealistic budgeting that authorizes more spending than the government can mobilize; hidden budgeting, where the real priorities are known only to a narrow clique within government; and deferred budgeting, where real spending patterns are obscured by the generation of arrears (Schick 1998, 36).
Deviating from budget plans, however, is not necessarily deliberate, with unforeseen budgetary pressures often requiring unplanned expenditures. This is ultimately due to the inherent uncertainty of budgeting. When the assumptions made during preparation of the budget do not materialize, perhaps because of a macroeconomic shock or natural disaster, difficult questions on how to choose between competing priorities can reemerge. Where budgets are overly rigid, there is a risk that spending will be locked into choices made in the past when the world looked very different. At the other extreme, where budgets are constantly remade, the whole credibility of the budget process is undermined.
The few empirical papers that explore the relationship between the quality of the PFM system and these budget deviations generally find that a better PFM system is associated with a more credible budget after controlling for other variables. Using data on expenditure deviations extracted from PEFA reports for a small sample of 45 countries,  finds that compositional accuracy improves with the quality of the PFM system, 1 but that the correlation between aggregate expenditure deviations and the capacity for PFM is small. 2 Using an ordered logit model and looking specifically at expenditure deviations in the health and education sectors for a sample of 73 countries, Sarr (2015) finds that a more transparent budgetary system 3 increases the likelihood of having a credible and reliable budget. 4 Similarly,  find that better PFM systems are associated with a higher rate of overall budget execution for 102 countries and with a more credible budget for 97 countries, meaning that sector allocations are aligned with original allocations. Although the sample is largest for , the model controls only for gross domestic product (GDP) per capita, which increases the likelihood of omitting key predictors, which can sometimes bias the coefficients of included variables.

PFM system and fiscal outcomes
A good PFM system is essential for achieving aggregate fiscal discipline by restraining expenditures. Theoretically, unless regulated by strong institutional arrangements, the deficit (and debt) bias inherent in the political process will lead to an unsustainable fiscal position in the form of excessive expenditures, deficits, and debt levels. This bias has been studied extensively in the literature as the product of two distinct but interrelated theoretical phenomena. The first is the common-pool resource problem that arises when the various decision makers involved in the budgetary process compete for public resources and fail to internalize the current and future costs of their choices. The second pertains to information asymmetry and incentive incompatibilities (the agency phenomenon) between the government and voters. This phenomenon leads to rent seeking in which politicians appropriate resources for themselves at a cost to citizens (Persson and Tabellini 2000). Strong PFM arrangements, such as a top-down approach to planning the budget, can mitigate the tendency to overspend by ensuring that the budgetary consequences of policy decisions are considered appropriately. Strong accountability mechanisms and supporting structures that comprehensively and transparently monitor and enforce budget decisions can minimize the agency problem (Hallerberg, Strauch, and von Hagen 2004; Hallerberg and von Hagen 1999; Ljungman 2009).
Although many factors affect the behavior of public finances, most of the empirical work confirms a relationship between better PFM systems and a more sustainable fiscal balance, albeit with various caveats and nuances. This evidence covers different time periods, geographic regions, and countries with varying political setups and income levels and generally involves constructing indexes of budget institutions. See Yläoutinen (2010), von Hagen (1992), and von Hagen and Harden (1996) for Europe; Perotti and Kontopoulos (2002) for Organisation for Economic Co-operation and Development (OECD) countries; Alesina et al. (1999), and Filc and Scartascini (2007) (2011) for 40 African countries. Several studies explore the relationship between specific aspects of the PFM system and fiscal discipline. For example, by exploiting within-country variation for a panel of 181 countries over the period 1990-2008, Vlaicu et al. (2014) find that fiscal discipline improves after the adoption of a medium-term expenditure framework.
In contrast, , using a PEFA-based measure of the quality of the PFM system and controlling only for per capita income, find that a stronger PFM system is not associated with lower deficits for 56 countries. 5 However, the limited number of observations makes it more difficult to establish statistical relationships. In fact, the coefficient, though statistically insignificant, is negative rather than the expected positive. The lack of relationship with deficit levels may also be related to the time period, with many PEFA assessments undertaken as part of the process toward debt relief and during the global financial crisis, which has prompted larger deficits in many countries, including those with stronger PFM systems.

PFM system in fragile countries
Reforms to improve public financial management have been high on the agenda in fragile states for both governments and donors alike. Although there is a growing body of evidence that these reforms improve the quality of the PFM system (Fritz 2012; IMF 2017a; Williamson 2015), much less is known about whether these reforms achieve their ultimate objectives of improving the credibility of the budget as well as fiscal outcomes. In fact, a qualitative study of eight fragile countries found no clear relationship between overall progress made on strengthening PFM systems and processes and achievements on budget credibility (World Bank 2012). The authors conclude that outcomes like budget credibility are substantially influenced by political incentives and considerations and that these can fluctuate and change in negative directions, even where PFM systems as such are improved. In addition, although fiscal deficits have been controlled across the eight case studies, a clear caveat is that current stability does not necessarily imply long-run fiscal sustainability, because grants from development partners still play a significant role in funding public expenditures. To our knowledge, no quantitative study has looked at the relationship between the PFM system and these outcomes in fragile states.
The literature also suggests some plausible reasons why PFM reforms may not have the desired impact in fragile states:
• Low human capacity. The effectiveness of formal systems is likely to be weakened by the low human capacity in fragile settings. Emigration, the absence or deterioration of the education system, distorted incentives, and clientelistic appointments are likely to contribute to this low capacity. At the same time, there is great competition for the few skilled staff from other strategic areas in the government or from donors to manage in-country projects.
• Weak physical capacity and basic operating systems and processes to make budgetary institutions function. This feature may be heavily dependent on the nature of the conflict and the emerging political settlement. Physical infrastructure may need to be developed or rebuilt, the banking system may have extremely limited reach, and basic systems and processes may need to be established or reestablished. In Liberia, for example, human resource capacity constraints as well as power and connectivity problems hamper the functioning of the PFM system, particularly the usefulness of the Integrated Financial Management Information System.
• Persistent parallel, informal systems and practices based on personalized arrangements. Such systems and practices ensure that formal systems for PFM remain functionally weak, painfully slow and unreliable, illegitimate, and widely corrupted (Levi and Sacks 2009).
Following from this literature, we test the following two hypotheses.

Hypothesis 1:
A well-functioning PFM system will increase the credibility of the budget if and only if the country is not fragile.

Hypothesis 2:
A well-functioning PFM system will improve fiscal outcomes (that is, lower budget deficits and debt ratios) if and only if the country is not fragile.

Measuring PFM quality
Our primary measure of the quality of PFM systems is the set of indicators developed under the PEFA initiative using the 2005 and 2011 versions of the framework. PEFA is the most comprehensive attempt thus far to construct a framework to assess the quality of budget systems and institutions across countries and over time. The 2011 framework comprises 28 indicators that assess institutional arrangements at all stages of the budget cycle, together with cross-cutting dimensions and indicators of budget credibility. Before the 2016 revision, it also included three additional indicators of donor practice. The PEFA data set, however, is not without limitations, including limited availability of time-series data; inconsistent time periods of PEFA assessments (between countries and within countries); the fact that some PEFA 2011 indicators measure processes rather than PFM functionality; and potential sample selection bias, with PEFA assessments being largely donor driven. Our findings should therefore be interpreted in the context of these limitations. We worked with a data set that included the results of 307 PEFA assessments completed in 144 countries between June 2005 and March 2017. Several countries were subsequently excluded from our sample because of limited availability of other relevant data. Our main regression models included observations ranging from 93 to 116 countries (see annex 4A for country coverage).
To transform PEFA scores into the measure of PFM quality used in our empirical analysis, we followed a series of steps. First, we only considered indicators that cover the quality of PFM systems on the expenditure side. We therefore excluded PI-1 through PI-4, which measure PFM outcomes; indicators PI-13 to PI-15, which cover transparency and effectiveness of tax administration; and D-1 to D-3, which are donor-related indicators. This allowed us to compare our results to previous studies that have also tended to focus on expenditure management. Moreover, although the donor-related indicators are likely to affect the credibility of the budget, especially in aid-dependent countries, we excluded these indicators given data quality concerns. Second, for multidimensional indicators, we used dimension scores rather than summary indicator scores to exploit all of the information contained in the PEFA scores. This decision allowed us to avoid the downward bias introduced by the M1 scoring methodology, whereby summary indicators are based on the lowest-scoring dimension or "weakest link." Third, we converted the letter scores included in PEFA reports into numerical scores, with higher scores denoting better performance (from A = 4 to D = 1).
In addition to measuring the aggregate PFM system, we also computed measures of specific elements of the PFM system to shed light on which components matter for these outcomes.
Although it is not as comprehensive and transparent as PEFA, we also used the World Bank's Country Policy and Institutional Assessment indicator 13 (CPIA-13), averaged over the period 2012-15, to test the robustness of our results. CPIA-13 measures the quality of budgetary and financial management on a six-point scale along three dimensions: (a) a comprehensive and credible budget, linked to policy priorities; (b) effective financial management systems to ensure that the budget is implemented in a controlled and predictable way; and (c) timely and accurate accounting and fiscal reporting, including audits.
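The scoring steps described for the PEFA-based measure of PFM quality can be sketched in a few lines of Python. This is a minimal illustration and not the chapter's actual procedure: the function names, the sample grades, and the equal-weight average across dimensions are assumptions. Only the exclusions (PI-1 to PI-4, PI-13 to PI-15, D-1 to D-3), the use of dimension scores, and the A = 4 to D = 1 mapping come from the text. The weakest-link function is a simplified stand-in for M1 scoring, shown only to illustrate the downward bias.

```python
# Illustrative sketch: sample grades and the equal-weight average are assumptions.

LETTER_TO_NUMBER = {"A": 4, "B": 3, "C": 2, "D": 1}  # mapping stated in the text

# Indicators excluded from the expenditure-side measure of PFM quality.
EXCLUDED_INDICATORS = {
    "PI-1", "PI-2", "PI-3", "PI-4",   # PFM outcomes
    "PI-13", "PI-14", "PI-15",        # tax administration
    "D-1", "D-2", "D-3",              # donor practices
}


def indicator_of(dimension_id):
    """Return the parent indicator, e.g. 'PI-11(ii)' -> 'PI-11'."""
    return dimension_id.split("(")[0]


def pfm_quality(dimension_grades):
    """Average the numeric scores of all retained, scored dimensions."""
    kept = [
        LETTER_TO_NUMBER[grade]
        for dim, grade in dimension_grades.items()
        if indicator_of(dim) not in EXCLUDED_INDICATORS and grade in LETTER_TO_NUMBER
    ]
    return sum(kept) / len(kept) if kept else None


def weakest_link(dimension_grades, indicator):
    """Simplified M1-style summary: the lowest-scoring dimension of one indicator."""
    scores = [
        LETTER_TO_NUMBER[grade]
        for dim, grade in dimension_grades.items()
        if indicator_of(dim) == indicator and grade in LETTER_TO_NUMBER
    ]
    return min(scores) if scores else None


# Hypothetical grades for one assessment ("NR" stands for an unscored dimension).
sample = {
    "PI-1": "B",
    "PI-11(i)": "A",
    "PI-11(ii)": "C",
    "PI-11(iii)": "B",
    "PI-13(i)": "D",
    "PI-16(i)": "NR",
}

print(pfm_quality(sample))            # average over the three PI-11 dimensions
print(weakest_link(sample, "PI-11"))  # lowest PI-11 dimension
```

In this hypothetical case the dimension average for PI-11 is higher than its weakest-link score, which is the kind of information loss that using dimension scores is meant to avoid.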

Measuring fragility
Fragility is a broad term whose definition is highly contested because of its complex, multidimensional nature. Given that a key feature of fragile situations is the risk or presence of conflict, we start with a very narrow definition of fragility (Fragile 1) based on the number of battle-related deaths: a country is considered fragile if it had more than 100 battle-related casualties in any year between 2012 and 2015. We then use a broader definition of fragility (Fragile 2) and consider the countries included in the World Bank's list of fragile states between 2012 and 2015. For a given year, this list classifies countries as fragile either based on their macroeconomic administrative capacity (the World Bank's CPIA score of 3.2 or lower) or based on their capacity to deliver security (signaled by the presence of a peace-keeping or peace-building operation during the past three years). The CPIA rates countries on a set of criteria grouped in four clusters: economic management, structural reforms, policies for social inclusion and equity, and public sector management. Our choice of CPIA as a measure of fragility comes after considering several indicators of fragility used by different donor agencies and international financial institutions. The advantage of the CPIA score is that it goes through a rigorous review process, although it still reflects a degree of subjective judgment.
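The two fragility definitions can be expressed as simple dummy-variable rules. The sketch below is illustrative only: the function names and input layout are hypothetical, and treating a country as Fragile 2 if it appears on the World Bank list in any year between 2012 and 2015 is an assumption, since the text does not state how multiple years are combined.

```python
# Illustrative sketch: function names, inputs, and the "any year" rule for
# Fragile 2 are assumptions, not the chapter's documented procedure.

YEARS = range(2012, 2016)  # 2012 through 2015


def fragile_1(battle_deaths_by_year, threshold=100):
    """Narrow definition: more than 100 battle-related deaths in any year."""
    return int(any(battle_deaths_by_year.get(year, 0) > threshold for year in YEARS))


def meets_world_bank_criteria(cpia_score, peace_operation_in_past_three_years,
                              cpia_cutoff=3.2):
    """Listing rule for a single year: CPIA of 3.2 or lower, or a peace operation."""
    weak_capacity = cpia_score is not None and cpia_score <= cpia_cutoff
    return weak_capacity or peace_operation_in_past_three_years


def fragile_2(listed_by_year):
    """Broad definition (assumed): on the World Bank list in any year 2012-15."""
    return int(any(listed_by_year.get(year, False) for year in YEARS))


# Hypothetical country: low CPIA in 2014 only, one year above the death threshold.
deaths = {2012: 40, 2013: 150, 2014: 0, 2015: 10}
listed = {
    2012: meets_world_bank_criteria(3.4, False),
    2013: meets_world_bank_criteria(3.3, False),
    2014: meets_world_bank_criteria(3.1, False),
    2015: meets_world_bank_criteria(3.3, False),
}
print(fragile_1(deaths), fragile_2(listed))
```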

Comprehensiveness and transparency
A comprehensive budget reduces the risk that public spending outside the budget could redirect resources from the approved budget, while budgetary transparency makes the common-pool problem and the agency problem less likely by increasing the degree of accountability felt by public officials.

Policy-based budgeting
The more closely public expenditure is aligned with public goals, the higher the probability that the budget will respect the originally approved allocations as well as the fiscal and macroeconomic framework defined by government.

Predictability and control in budget execution
Orderly execution of the budget may strengthen fiscal management by facilitating appropriate in-year adjustment to the budget totals in accordance with the fiscal framework. Strong control arrangements may also prevent expenditures from deviating from what was planned and from leading to higher deficit or debt levels.

Accounting, recording, and reporting
Timely, adequate information on expenditure flows and debt levels strengthens the capacity of government to decide and control budget totals as well as manage long-term fiscal sustainability and affordability of policies.

External scrutiny and audit
Scrutiny of government's budget and its implementation by parliamentarians and by external audit agencies may motivate a better quality of budgetary execution as well as increase the pressure on government to consider long-term fiscal sustainability issues and to respect its targets.
Some basic descriptive analysis of the data set is suggestive of relative strengths and weaknesses in budget institutions across fragile and nonfragile countries. As expected and in line with the findings of others, the average quality of the PFM system, both overall and for specific components, is generally weaker in fragile states than in nonfragile states (as shown in figures 4.1 and 4.2). The gap between fragile and nonfragile countries is most pronounced when we use the broad definition of fragility, with the difference in means statistically significant at the 1 percent level. In general, the weakest component of the PFM system in both fragile and nonfragile countries is external scrutiny and audit, whereas the strongest component tends to be comprehensiveness and transparency. 6

Aggregate budget credibility
In many countries, particularly low-income or fragile states, national budgets are often poor predictors of expenditures. Our first measure of budget credibility is based on PEFA indicator PI-1 and measures whether governments are able to plan aggregate expenditures ex ante and keep to the broad parameters during execution. According to the PEFA methodology, countries in which deviations between actual and budgeted expenditures were less than 5 percent in at least two of the last three years receive a score of A or 4. At the other end, countries in which deviations between actual and budgeted expenditures were greater than 15 percent in two or three of the last three fiscal years receive a D or 1.

Compositional budget credibility
Our second measure of budget credibility is based on PEFA indicator PI-2(i), which measures the extent to which reallocations between budget heads during execution have contributed to variance in the composition of expenditures. Countries get a score of A or 4 if the variance in expenditure composition was less than 5 percent in at least two of the last three years. At the other end, countries for which the variance in expenditure composition exceeded 15 percent in at least two of the last three years get a score of D or 1.
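Both credibility measures follow the same pattern: three years of percentage deviations are mapped to a score from 4 (A) to 1 (D). The sketch below illustrates that pattern under stated assumptions. The text gives only the rules for A (5 percent) and D (15 percent); the intermediate cutoffs used here for B and C (10 and 15 percent) and the function names are assumptions, and the authoritative rules are in the PEFA scoring guidance.

```python
# Illustrative sketch: the B and C cutoffs and all names are assumptions.


def deviation_pct(actual, budgeted):
    """Absolute deviation of actual from budgeted spending, in percent of budget."""
    return 100.0 * abs(actual - budgeted) / budgeted


def credibility_score(deviations_pct, cutoffs=(5.0, 10.0, 15.0)):
    """Score three years of deviations on the 4 (A) to 1 (D) scale.

    A grade is awarded when the deviation exceeds its cutoff in no more than
    one of the three years; failing every cutoff gives a D (1).
    """
    if len(deviations_pct) != 3:
        raise ValueError("expected deviations for exactly three years")
    for cutoff, score in zip(cutoffs, (4, 3, 2)):
        years_over = sum(1 for d in deviations_pct if abs(d) > cutoff)
        if years_over <= 1:
            return score
    return 1


# Hypothetical budgets and outturns for three fiscal years.
budgeted = [100.0, 120.0, 150.0]
actual = [103.0, 115.0, 171.0]
deviations = [deviation_pct(a, b) for a, b in zip(actual, budgeted)]
print(deviations, credibility_score(deviations))
```

The same function could be applied to the compositional variance used for PI-2(i), since the text describes identical 5 and 15 percent rules for that indicator.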

Measuring fiscal outcomes
Consistent with the literature, we consider two measures of fiscal discipline:
1. General government primary net lending or borrowing (percent of GDP)
2. Public external debt (percent of GDP).
We focus on the average primary balance as a preferred measure of the government's fiscal stance because it abstracts from the effect of inflation on interest payments, since interest payments are a function of accumulated debt and not the present fiscal stance. The reason to focus on debt is that primary deficits in some countries may not be driven by a systematic bias but instead may reflect temporary effects. We use official public external debt because the data on total government debt are unavailable for a large number of countries in the sample. We examine the relationship between PFM quality and these fiscal variables during the 2012-15 period, because the fiscal positions in many countries were affected by the food and fuel crisis and subsequently by the global financial crisis between 2008 and 2011.

ESTIMATION APPROACH
In this section, we empirically test whether better PFM systems (as measured by PEFA) are associated with better fiscal outcomes and more reliable budgets, after controlling for relevant explanatory variables and differentiating between fragile and nonfragile countries. We estimate cross-sectional multiplicative interaction models because our hypotheses are conditional in nature, that is, we test whether "an increase in X is associated with an increase in Y when condition Z is met, but not otherwise." These interaction models are common in the quantitative social science literature because institutional arguments frequently imply that the relationship between certain inputs and outcomes varies depending on the institutional context (Brambor, Clark, and Golder 2006), or in this case, fragility. Our model is as follows:

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \beta_3 X_i Z_i + \varepsilon_i \tag{4.1}$$

where i indexes countries, Y is the dependent variable of interest (average of fiscal balance or public external debt as a percentage of GDP between 2012 and 2015 or our PEFA-based measure of budget credibility), X measures the quality of the overall PFM system or PFM element (based on the country's most recent PEFA assessment), Z is a fragility dummy that equals 1 where a country is fragile and 0 otherwise, and ε is the error term. We estimate this model using ordinary least squares (OLS) because the data are cross-sectional. The model presented in equation (4.1) captures the intuition behind our hypotheses. When the country is nonfragile, that is, when Z = 0, equation (4.1) simplifies to:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \tag{4.2}$$

Here β1 captures the effect of a one-unit change in X on Y in a nonfragile state. When the country is fragile, that is, when Z = 1, equation (4.1) can be simplified to:

$$Y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_i + \varepsilon_i \tag{4.3}$$

Figure 4.2 Average quality of the public financial management (PFM) system in fragile and nonfragile countries (Fragile 2)
In constructing these models, we follow good practice (Brambor, Clark, and Golder 2006). First, we include PFM quality (X) and fragility (Z) variables separately alongside the interaction term (XZ) in the model. Second, we do not interpret the coefficients on the constitutive terms (X and Z) as if they were unconditional marginal effects and instead compute substantively meaningful marginal effects and standard errors, that is, we report the estimated effect of X when Z = 0 and when Z = 1 in a separate table. This is important because it is possible for the marginal effect of X on Y to be significant for different values of the modifying variable Z even if the coefficient on the interaction term is insignificant. In line with previous studies, we expect β1 to be positive when the dependent variable is budget credibility or fiscal balance and negative when it is public debt, indicating that on average better PFM quality is associated with more credible budgets or more favorable fiscal outcomes in nonfragile countries. If fragility offsets this effect, we expect β3 to have the opposite sign, and β1 + β3 = 0.
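The conditional marginal effects described above can be computed directly from the OLS coefficients and their covariance matrix: the effect of X is β1 when Z = 0 and β1 + β3 when Z = 1, with the standard error of the latter given by the square root of Var(β1) + Var(β3) + 2 Cov(β1, β3). The following sketch shows the calculation in plain Python. It is an illustration under stated assumptions and not the chapter's estimation code: the data are made up, the function names are hypothetical, control variables are omitted, and classical (homoskedastic) standard errors are used, whereas the chapter does not state which standard errors it reports.

```python
import math

# Illustrative sketch: synthetic data, hypothetical names, no controls,
# and classical standard errors are all assumptions.


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_interaction_model(y, x, z):
    """OLS of y on a constant, x, z, and x*z; returns (beta, covariance)."""
    rows = [[1.0, float(xi), float(zi), float(xi) * float(zi)] for xi, zi in zip(x, z)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (len(y) - k)
    cov = [[sigma2 * inv[i][j] for j in range(k)] for i in range(k)]
    return beta, cov


def marginal_effects(beta, cov):
    """Effect of x (and its standard error) when z = 0 and when z = 1."""
    var_fragile = cov[1][1] + cov[3][3] + 2.0 * cov[1][3]
    return {
        "nonfragile": (beta[1], math.sqrt(max(cov[1][1], 0.0))),
        "fragile": (beta[1] + beta[3], math.sqrt(max(var_fragile, 0.0))),
    }


# Made-up example: PFM quality (x), fragility dummy (z), credibility score (y).
x = [1.5, 2.0, 2.5, 3.0, 3.5, 1.2, 1.8, 2.2, 2.6, 3.0]
z = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y = [2.1, 2.2, 2.8, 2.9, 3.4, 1.9, 2.2, 1.8, 2.3, 2.0]

beta, cov = fit_interaction_model(y, x, z)
for group, (effect, se) in marginal_effects(beta, cov).items():
    print(group, round(effect, 3), round(se, 3))
```

Reporting both conditional effects with their own standard errors, instead of reading significance off the interaction coefficient alone, is the practice the text attributes to Brambor, Clark, and Golder (2006).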
In this chapter, we also control for a larger number of variables than Fritz, Sweet, and Verhoeven (2014), adding W to equation (4.1) to give equation (4.4). W refers to a series of control variables dictated by the existing literature.
To assess the impact of PFM quality on budget credibility, we identify other factors that may influence a country's budget credibility. On the basis of  and Sarr (2015), these factors include the level of GDP per capita because governments in wealthier countries can pay for better talent and better systems of control than other governments. The quality of public and civil services and the degree of their independence from political pressures can also be expected to have a significant impact on the formulation and implementation of the budget and are proxied by the government effectiveness index. Finally, we include countries' dependency on natural resource revenue and foreign aid because the volatility of these revenue sources can be expected to affect the way in which the budget is implemented.
When fiscal outcomes are the dependent variables, the selection of control variables also draws heavily on the earlier literature (Dabla-Norris et al. 2010), which serves as a benchmark to compare our results. The control variables include real economic growth (Growth) to control for economic circumstances, the log of initial GDP per capita in 2011 (Initial GDP per capita) to control for differences in economic and overall institutional development, a dummy for resource-rich countries (Resource), and a trade variable (Trade). Following Alesina et al. (1999), changes in the terms of trade (Trade) are scaled by the degree of openness of the economy, measured as the sum of exports and imports to GDP. Because in some countries tax revenues are heavily linked to export activities, we expect improvements in the terms of trade to be associated with lower deficits and debt levels and these effects to be more important for economies that are more open to international trade. Growth, terms of trade changes, and openness are measured as annual averages for the period 2012-15 to control for cyclical effects. For the Resource dummy, we use the same definition as IMF (2011), which classifies countries as resource rich if their resource rents exceed 10 percent of GDP. In our debt regressions, we control for two additional variables shown to be important in previous studies: a dummy for highly indebted poor country (HIPC), post-completion-point countries (HIPC dummy), and the initial debt-to-GDP ratio (Initial debt). The HIPC dummy controls for low-income countries that have benefited from official debt relief and, as a result, are expected to have stronger fiscal positions, while initial debt, proxied by external debt in the year prior to the beginning of the sample (2011), is included to focus on the effect of budget institutions on recent fiscal policy settings. 
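As an illustration of how two of these controls might be built, the sketch below averages annual values over the period, scales the terms-of-trade change by openness, and flags resource-rich countries. It rests on assumptions not stated in the chapter: "scaled by" is read here as multiplication, openness is expressed as a share of GDP, and all function names and input figures are hypothetical.

```python
def period_average(values):
    """Simple annual average over the period (for example, 2012-15)."""
    return sum(values) / len(values)

def scaled_terms_of_trade(tot_change, exports_gdp, imports_gdp):
    """Terms-of-trade change weighted by openness.

    Assumption: 'scaled by openness' means multiplying the change in the
    terms of trade by (exports + imports) / GDP.
    """
    openness = exports_gdp + imports_gdp
    return tot_change * openness

def resource_dummy(resource_rents_pct_gdp, threshold=10.0):
    """1 if resource rents exceed the threshold (10 percent of GDP), else 0."""
    return 1 if resource_rents_pct_gdp > threshold else 0

# Hypothetical country: annual values for four years.
tot_changes = [2.0, -1.0, 0.5, 1.5]   # percent change in terms of trade
exports = [0.5, 0.5, 0.5, 0.5]        # exports as a share of GDP
imports = [0.3, 0.3, 0.3, 0.3]        # imports as a share of GDP

trade = scaled_terms_of_trade(
    period_average(tot_changes),
    period_average(exports),
    period_average(imports),
)
resource = resource_dummy(12.5)
print(trade, resource)
```

Under this reading, a given terms-of-trade improvement carries more weight for a more open economy, which matches the expectation stated in the text.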

Budget credibility
Table 4.2 shows the results from estimating equation (4.4) using OLS, with aggregate budget credibility as the dependent variable in columns 1 and 2 and compositional budget credibility as the dependent variable in columns 3 and 4. As shown in the first row, we find that better PFM systems are associated with more credible budgets (aggregate and compositional) in nonfragile states, with this effect statistically significant (at the 5 percent or 10 percent level). Our estimated coefficient implies that nonfragile states that score 1.0 point higher on our measure for the quality of the aggregate PFM system will score 0.4-0.5 higher on the PEFA budget credibility indicators on average (columns 1 to 4). However, the linear combinations of the PFM coefficients calculated in table 4.3 suggest that this effect for credibility at the aggregate level generally weakens in size and loses statistical significance in fragile states (using both definitions of fragility). We do find, however, that a better PFM system is associated with better compositional budget credibility in fragile states using the broad definition of fragility, with a conditional coefficient of 0.49 that is statistically significant at the 5 percent level (as shown in table 4.3). 7 Similarly, although we generally find no significant relationship between aggregate budget credibility and five specific PFM elements in fragile states, 8 table 4.4 provides evidence of a positive and statistically significant relationship between credibility at the sectoral level and three specific elements of the PFM system in fragile states (using the broad definition of fragility, Fragile 2). The effect is largest for predictability and control in budget execution. Other things equal, better systems for ensuring predictability and control in budget execution are associated with a higher level of compositional credibility in fragile states at the 5 percent level or better, irrespective of the definition of fragility used.

Fiscal outcomes
Table 4.5 shows the relationship between the quality of the overall PFM system and fiscal outcomes, other things equal. The primary balance is the dependent variable in columns 1 and 2, whereas public external debt is the dependent variable in columns 3 and 4. As shown in tables 4.5 and 4.6, we find no statistically significant relationship between our PEFA-based measure of overall PFM quality and the fiscal balance in both nonfragile and fragile states. 9 This finding is in stark contrast to the results of most of the studies reviewed in this chapter. Our results are, however, in line with those of , despite our larger sample size of 116 observations and wider set of control variables. However, given the poor fit of the model, with an R² as low as 0.08 in column 1 of table 4.5 and with only the resource dummy statistically significant, these results should be treated with caution.

Note: Robust standard errors are in parentheses. To reduce the impact of outliers, the coverage of models using debt as the dependent variable in columns 3 and 4 is limited to countries with an average external debt within two standard deviations of the average debt levels for the sample of countries. HIPC = highly indebted poor country; PFM = public financial management. *** p<0.01, ** p<0.05, * p<0.1

When debt is the dependent variable in column 3, we find a statistically significant relationship (at the 5 percent level) between PFM quality and the public external debt ratio, although the sign of the coefficient is opposite to what we expected. This suggests that, on average, a better PFM system is associated with higher external debt ratios in nonfragile states. These results are contrary to our hypothesis as well as the findings of previous studies like those of Dabla-Norris et al. (2010), which found that a better PFM system is associated with lower public external debt ratios. Moreover, the conditional coefficient of the PFM variable in fragile states shown in table 4.6 is negative and not significant at conventional levels, suggesting that there is no relationship between debt and PFM quality in fragile states. As shown in table 4C.1 in annex 4C, we also find that better systems for policy-based budgeting and external scrutiny and audit are associated with higher public debt ratios at the 1 percent and 5 percent levels, respectively, in nonfragile states (using the narrow definition of fragility, Fragile 1). This relationship, however, becomes negative and loses statistical significance in fragile states. The economic variables, specifically the initial debt ratio and economic growth, are consistently statistically significant at the 1 percent and 10 percent levels, respectively, which is largely in line with a priori assumptions and previous findings. On the basis of these results, we find no evidence that better PFM systems (as measured by PEFA) go hand in hand with better fiscal outcomes (defined as larger primary balances and lower debt ratios) in nonfragile and fragile states. 10 This is also the case when we look at specific elements of the PFM system in annex 4C,  4), and an alternative dependent variable (using sovereign credit rating in table 4B.5).
Overall, our results are largely unchanged and do not seem to suggest a relationship between the quality of budget institutions and fiscal performance in the period following the financial crisis in both nonfragile and fragile states.
Notably, using the sovereign credit rating as the dependent variable in table 4B.5 suggests that the positive relationship between PFM quality and the public external debt ratio in nonfragile countries may be because countries with better PFM systems are more likely to convince the markets about their ability and willingness to repay their debt and as a result are able to borrow more externally relative to the size of their economy. However, this relationship between PFM quality and credit rating in nonfragile states may be spurious, with the PFM variable proxying for quality of the broader institutional environment. 11 We test this possibility by controlling for government effectiveness, which results in the conditional coefficient for PFM quality becoming insignificant in both nonfragile and fragile states, as shown in the last two rows of table 4.7.

DISCUSSION
Overall, we find mixed evidence in support of our hypothesis that fragility impairs the functioning of the PFM system. Contrary to our hypothesis, our results suggest that investing in improving the quality of the PFM system can have a positive impact even in fragile environments (as defined by the World Bank) by increasing the credibility of the budget and reducing the variance in the composition of expenditure. Controlling for other factors, better predictability and control in budget execution appear to have a strong relationship with ensuring that functional or sectoral budget allocations are implemented close to plan in both nonfragile and fragile states. Conversely, at the aggregate level, whereas a stronger PFM system is associated with a more credible budget in nonfragile states, this is not the case in fragile states.
With regard to the effects of PFM quality on fiscal outcomes, we find no evidence that the quality of the PFM system matters for the size of deficits and debt ratios in both fragile and nonfragile states. Resource dependency instead tends to be the main factor associated with larger fiscal deficits. This holds when we look at the quality of more specific elements of the PFM system. Moreover, the statistically significant, but counterintuitive, positive relationship between PFM quality and public external debt ratios found in nonfragile states is potentially because countries with better PFM systems also tend to have stronger institutions more broadly and thus are perceived as having a higher capacity to repay. This, in turn, enables them to access more external financing. We find evidence of this when we use sovereign credit rating as an alternative dependent variable and control for government effectiveness.
Our findings are largely in line with those of previous studies of a more qualitative nature, which concluded that the impact of PFM reforms in fragile states remains less than what might be hoped (World Bank 2012). This finding underscores the need to exercise caution when assuming that the outcomes associated with a well-functioning PFM system in nonfragile states will automatically be realized in conditions of fragility.
Nonetheless, our lack of evidence of a relationship between PFM quality and fiscal outcomes in both nonfragile and fragile states can potentially be attributed to several methodological limitations. The main econometric challenge in establishing a relationship between fiscal outcomes and PFM quality in the models presented above is the problem of reverse causality. 12 Reverse causality refers to the possibility that budget outcomes influence the evolution of fiscal institutions, rather than the other way around, as presumed. Further complicating matters are some limitations with using the PEFA data set to test our hypotheses. PEFA indicators do not adequately measure certain aspects of the PFM system that are perceived as important for fiscal discipline in the literature, such as the fragmentation of budgetary authority and the existence of fiscal rules or expenditure ceilings for line ministries (Dabla-Norris et al. 2010). Another critique of the 2011 PEFA framework that is relevant to this chapter is that its indicators may not be appropriate in different contexts. "Best practice for whom?" is the central question. For example, the development of a multiyear budget may not be best or even appropriate in fragile countries where it is very difficult to plan ahead over a longer period of time. A fruitful direction for future research may therefore involve enhancing the PEFA indicators with data that can be extracted from other publicly available data sources and repeating the analysis conducted in this chapter. In addition, there is the possibility that omitted variables will arise from unobservable determinants of the outcomes considered. Despite these limitations, this chapter facilitates a more nuanced understanding of the outcomes of a strong PFM system (as measured by PEFA) in fragile states, while also highlighting the challenges of relying solely on the PEFA data set to explore these complex relationships.

NEXT STEPS
Further research on the outcomes of PFM reforms in fragile environments should focus on outcomes that may be particularly relevant for fragile countries, such as improving budget execution in specific sectors or specific aspects of state building. As noted in this chapter, aggregate outcomes such as fiscal discipline and credibility of the overall budget are good starting points, but a more nuanced analysis that is tailored to the priorities of fragile states is recommended. It would be also worthwhile to focus specifically on the elements of the PFM system that are expected to be most relevant to fragile countries such as cash management. A mixed-methods approach with country case studies is recommended in order to give adequate attention to the contexts and dynamics of specific countries.

Robustness check controlling for having an IMF program
To address the possibility that fiscal outcomes may influence the quality of the PFM system, we control for a country having an IMF program between 2012 and 2015. It is highly possible that budgetary reforms are tightly linked to IMF programs that are introduced in response to fiscal performance. In that case, the quality of budget institutions could be expected to be endogenous to prior fiscal performance. We tested this possibility by including an IMF program dummy variable in the baseline models. The results are summarized in

Robustness check using quality of de jure PFM elements
To mitigate concerns about reverse causality, the working assumption in earlier papers is that budget institutions are costly to change and should therefore be more stable than fiscal outcomes-at least in the short to medium run. This assumption is likely to be stronger for de jure PFM (or procedural) elements rather than de facto elements because legal frameworks (especially when grounded in the constitution) 13 can take a long time to amend, whereas informal practices can be quickly altered. We therefore repeat the baseline regression models in table 4.1 using this de jure PFM measure, but again we generally find no statistically significant relationship between PFM quality and

Robustness check using CPIA-13 as an alternative measure of PFM quality
Given the limitations of the PEFA data set, specifically the fact that PEFA assessments are conducted in different years and that some indicator scores may measure improvements in form rather than PFM functionality, we consider an alternative measure of PFM quality, CPIA-13 (table 4B.3). Notably, the PEFA-based measure of overall PFM quality and the CPIA measure are highly correlated (0.7776). The results are mostly unchanged. Better PFM quality as measured by CPIA-13 is associated with larger primary balances, but the finding is not statistically significant.

Robustness check using PEFA assessments from 2012 onward
Our sample includes countries whose PEFA assessments were undertaken as far back as 2007. Given that these PEFA assessments may not reliably capture the recent quality of the PFM system, especially in fragile states where reversals are common, we restrict our analysis to countries whose most recent PEFA assessments are from the year 2012 onward. Our results in table 4B.4, however, remain largely unchanged when compared with those recorded in table 4.5, with no relationship between PFM quality and primary balance and a positive but weak relationship between PFM quality and public external debt ratio in nonfragile states, but not in fragile states.

Robustness check using sovereign credit rating as dependent variable
Although we do not find a statistically significant relationship between PFM quality and primary balance in table 4.5, the positive relationship between PFM quality and public debt level is large and almost statistically significant in nonfragile states, while it appears to weaken in fragile states using the narrow definition of fragility. This counterintuitive relationship may be due to a supply-side issue-that is, countries with better PFM systems are more likely to convince the markets about their ability and willingness to honor their debt. This relationship should be reflected in better credit ratings, 14 so a logical question to ask is whether countries with better PFM systems have better ratings after controlling for other economic fundamentals. As shown in table 4B.5, a better PFM system is associated with a higher sovereign credit rating in nonfragile states, with this relationship statistically significant at the 5 percent level using both definitions of fragility. This relationship appears to weaken and lose statistical significance for fragile countries. However, it is likely that the quality of PFM system is proxying for quality of the broader institutional environment. This is confirmed when we include a measure of government effectiveness. Nonetheless, like previous studies, we find that rating assignments are related to economic fundamentals, including initial external debt across all models, and per capita income and growth when we do not control for government effectiveness in columns 1 and 2.
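The rating variable described in note 14 can be sketched as follows. The notes give only the endpoints of the scale (D = 1 and AAA = 22). The ordering of the intermediate notches below follows a standard S&P/Fitch-style long-term scale and is an assumption, as is reading "most dominant" as the most frequently held rating over the period. The names `RATING_SCALE`, `rating_to_score`, and `dominant_rating` are hypothetical.

```python
from collections import Counter

# Assumed 22-notch ordering from worst to best; only the endpoints
# (D = 1, AAA = 22) are stated in the chapter's notes.
RATING_SCALE = [
    "D", "C", "CC", "CCC-", "CCC", "CCC+",
    "B-", "B", "B+", "BB-", "BB", "BB+",
    "BBB-", "BBB", "BBB+", "A-", "A", "A+",
    "AA-", "AA", "AA+", "AAA",
]
SCORES = {rating: i + 1 for i, rating in enumerate(RATING_SCALE)}

def rating_to_score(rating):
    """Convert an alphabetical rating to its numerical rank (higher is better)."""
    return SCORES[rating]

def dominant_rating(ratings):
    """Most frequent rating over the period.

    Assumption: 'most dominant' means the modal rating; ties resolve to
    the rating seen first.
    """
    return Counter(ratings).most_common(1)[0][0]

# Hypothetical country rated once a year between 2012 and 2015.
history = ["BB", "BB+", "BB", "BB"]
score = rating_to_score(dominant_rating(history))
print(score)
```

Ratings from an agency with a different notation (for example, Moody's Aaa) would need to be mapped onto this scale first; that mapping is not specified in the text and is left out here.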

NOTES
1. An index of PFM capacity was constructed as an average of the 24 PEFA indicators in dimensions 2 through 6.
2. Controls for drivers of the common-pool behavior as well as political institutions.
3. This transparency is measured using the Open Budget Survey.
4. Controls for GDP per capita, population size, government effectiveness, level of democracy, centralization of the budget process, strength of the legislature, and dependency on oil and foreign aid.
5. The fiscal balance is calculated as a three-year forward average beginning the year of the country's first PEFA score.
6. The exception is fragile countries using the broad definition of fragility, with the average score for policy-based budgeting (average of 2.32) being slightly higher than the average score for comprehensiveness and transparency (average of 2.28).
7. This is not due to the slightly smaller sample size when compositional budget credibility is the dependent variable instead of aggregate budget credibility. We test this by running the models again on the same sample.
8. Results not shown.
9. Our results are unchanged when we include regional dummies as well as a democracy measure.
10. This holds when we exclude three countries (Fiji, Lebanon, and Myanmar) whose PEFA assessments are missing scores for 10 or more dimensions.
11. The relationship between institutional quality and repayment capacity is well established in the literature and is a key assumption underlying the debt sustainability framework of the IMF and World Bank (IMF 2017b).
12. Alesina and Perotti (1996), Knight and Levinson (2000), Perotti and Kontopoulos (2002), and Stein, Talvi, and Grisanti (1999) discuss the difficulties in dealing with this problem of reverse causality.
13.  made this distinction between de jure and de facto elements of the PFM system.
14. Our credit ratings variable is the most dominant sovereign rating on foreign currency long-term debt between 2012 and 2015. The alphabetical ratings are converted into numerical ratings using a simple alphabetical ranking with D (Default) = 1 and AAA (Aaa for Moody's) = 22, with a higher credit rating indicating a better rating.

CATHAL LONG
International development institutions frequently prescribe improving public financial management (PFM) as part of the response to lowering corruption levels in low- and middle-income countries. But to date there has been little cross-country analysis on whether better PFM is associated with lower levels of corruption. This chapter investigates the relationship between PFM and corruption using the most widely available cross-country measures of both. We use measures from Public Expenditure and Financial Accountability (PEFA) assessments to construct indexes for transparency and controls in public expenditure. We find statistically significant relationships between all of our indexes and perceptions-based measures of corruption, but stronger relationships and more evidence for controls. We also find that the estimated relationships are small compared with other determinants of corruption, particularly economic growth. This finding is in line with the findings of others.

INTRODUCTION
Perspectives on corruption vary. For many, particularly those working in international development, corruption is a constraint on economic growth and development because it results in the inefficient allocation of a country's own resources and limits the quantity of resources that the country can attract from abroad, either through foreign aid or through investment. This view frequently leads to a policy prescription of institution building. As a result, aid agencies have spent large sums supporting the betterment of PFM institutions in low- and middle-income countries, based on an understanding that this institutional development will increase government transparency and accountability, reduce opportunities for corruption, and allow for more and better spending, ultimately resulting in development progress. Domestic actors also frequently include improving PFM as part of their anticorruption strategies. The fact that countries with higher measured PFM performance have lower measured corruption is often used as evidence to support this view and justify an institution-building approach to international development. However, it is equally plausible that causation runs in the opposite direction. Some scholars hypothesize that development progress itself, through the emergence of a market-based economy, gives rise to demands for better institutions, leading to declining levels of measured corruption. Others point to a coevolutionary process in which markets and institutions mutually adapt to one another (Ang 2017). Moreover, measures of both corruption and PFM are hotly disputed. Various problems are associated with measuring corruption, most notably the fact that it is difficult to observe and therefore measures tend to be based on perceptions. Measures of PFM are also the subject of much criticism, for sometimes emphasizing the measurement of form over function (see chapter 2 for further discussion).
Regardless, both sets of measures remain influential, particularly with respect to developing donor-funded programs of technical assistance for institution building. This chapter reviews some of the hypothesized links between PFM and corruption and whether they are borne out empirically, using data from PEFA assessments and various corruption indexes and controlling for other determinants of corruption. Our findings suggest that expenditure controls are more important for combating corruption than PFM reforms related to transparency in budgeting, reporting, and audit.
The chapter proceeds as follows. We begin by reviewing the literature on PFM and corruption, developing hypotheses for testing, and providing an overview of the data on corruption and PFM and their empirical relationships. We then outline the methodology for estimating the relationship between corruption and PFM using these data, discuss other determinants of corruption to be used in the model, and present the results from our estimation models. We conclude with further discussion and conclusions regarding our results.

LITERATURE REVIEW
The most widely accepted definition of corruption is "the abuse of public office for private gain" (IMF Staff Team 2016). However, this definition is very broad. Andvig and Fjeldstad (2000) distinguish between "bureaucratic corruption" 1 and "political corruption." Nevertheless, even within the category of bureaucratic corruption, activities may range from the solicitation of bribes by police officers to the embezzlement of large sums of money by government officials through creative accounting. 2 Although such activities are illegal in most countries, political corruption can encompass both illegal and legal activities (Khan 2006). In the extreme case of state capture, the law-making process itself can become perverted (IMF Staff Team 2016). More common examples of legal corruption include the allocation of rents to political constituencies through the budget process in the form of pork-barrel projects (Ware et al. 2007) or through preferential regulation and land allocation (Khan 2006). Political corruption is also distinguishable from bureaucratic corruption in its relationship with campaign financing (Tanzi 1998).
In many countries, public spending and the public sector are synonymous with corruption. The PFM system itself presents opportunities for corruption. As a result, many low- and middle-income countries and donors view strengthening the PFM system as an anticorruption strategy.

The PFM system provides opportunities for corruption
Most corruption takes place during the budget execution stage of the budget cycle, where resources actually flow and assets change hands. One way of thinking about corruption in public expenditure is to consider the types of expenditure and the corrupt practices associated with each (table 5.1). But what happens at the budget formulation stage also matters for corruption. 3 Weak budget formulation allows for the development of faulty practices, such as open-ended budgeting, which present problems for budget execution and opportunities for corruption later in the budget cycle (Schiavo-Campo and Tomasi 1999).
In contrast, the reporting and audit stages of the budget cycle are frequently held up as effective anticorruption strategies because they increase the probability of detection and therefore act as disincentives to engage in corruption in the first instance (see, for example, Johnsøn, Taxell, and Zaum 2012; Rocha Menocal and Taxell 2015). Accounting and reporting do not generally offer direct opportunities for engaging in corruption, 4 but a lack of accurate and timely reports on revenue and expenditure reduces the probability of detecting it. Similarly, delays or political interference in external audit and oversight limit the possibilities for detecting and punishing corruption.

Strengthening PFM as an anticorruption strategy
Although strengthening PFM as an anticorruption strategy has some sound theoretical underpinnings, the evidence base is limited (French 2013). Reform of the PFM system can reduce opportunities for corruption in two broad ways: directly, by introducing controls that reduce opportunities for corruption (often by minimizing the discretion of politicians and bureaucrats), or indirectly, by increasing the probability of detection and punishment (often by increasing transparency). We discuss each in turn and develop testable hypotheses for the relationships between PFM and corruption.

"Sunlight is said to be the best of disinfectants"
The oft-quoted statement, by U.S. Supreme Court Justice Louis D. Brandeis, captures the essence of the argument for greater transparency in public finances: transparency enables citizens to hold governments accountable. The International Monetary Fund (IMF) considers fiscal transparency-which it defines as "the comprehensiveness, clarity, reliability, timeliness, and relevance of public reporting on the past, present, and future state of public finances"-as "critical for effective fiscal management and accountability." 5 Similarly, the PEFA Secretariat (2011) considers transparency to be a desirable cross-cutting feature of the PFM system and budget cycle. From an anticorruption perspective, it is useful to consider transparency in budget preparation, transparency in the reporting of budget execution, and transparency in the auditing of public expenditures.

Transparency in budget preparation
There is relatively little evidence to support the hypothesized link between budget transparency and corruption, particularly with respect to low- and middle-income countries (French 2013). Moreover, what studies exist tend to establish a statistically significant association rather than a causal link (de Renzio and Wehner 2015). Furthermore, they tend to focus on budget transparency with respect to the entire PFM cycle rather than on budget preparation specifically. Finally, they estimate the relationship between transparency and perceptions-based measures of corruption rather than actual corruption. The problems associated with using perceptions-based measures of corruption are discussed in the next section. Hameed (2005) finds that fiscal transparency has a positive and statistically significant effect on controlling corruption. 6 However, the effect is quite small following the introduction of other controls, 7 and the sample size is small (56 countries) and limited in coverage for low- and middle-income countries. Moreover, further estimations using four subindexes of the composite fiscal transparency indicator find that only the indicator for medium-term budgeting is statistically significant, whereas the subindex more closely related to budget transparency is not. Bastida and Benito (2007) find a negative relationship between budget transparency and corruption, but their sample is limited to 41 predominantly higher-income countries. Martí and Kasperskaya (2015) find that the correlation between budget transparency and corruption 8 decreases in size and is statistically insignificant once segmented by economic development. They conclude, "Countries with similar governance perception scores show different patterns of PFM practices, suggesting that there is no one-size-fits-all approach," although they acknowledge the limitations of their sample size (49 countries).
Bellver and Kaufmann (2005) use an institutional transparency index to estimate a statistically significant effect on reducing corruption 9 that is robust to the inclusion of other controls 10 for a large sample (104) of countries. However, although their measure of institutional transparency includes measures of budget transparency, 11 it also includes numerous other measures of transparency. Using the same measure of transparency as Bellver and Kaufmann (2005), Lindstedt and Naurin (2010) find that the effects of transparency on corruption are conditional on press freedom and democracy.
Building on this literature, our first hypothesis tests whether a cross-country relationship exists between transparency in budget preparation and corruption.

Hypothesis 1:
Countries with a more transparent and orderly budget process will have lower levels of corruption.

Transparency in budget execution reporting
Of course, governments may say they are going to do one thing and then do another. Budgets in poor countries are characterized by a lack of credibility (Simson and Welham 2014). Martinez-Vazquez, Boex, and Arze del Granado (2004) note that corruption is particularly prevalent when oversight by the legislature and civil society is limited.
Perhaps more than any other study, Reinikka and Svensson (2011) make the case for the effect on corruption of transparency in budget execution reporting. Their study established a plausible causal effect of increased transparency (in the form of reporting disbursements to primary schools through newspapers) on a reduction of funds captured by local government bureaucrats. The study simultaneously made the case for Public Expenditure Tracking Surveys (PETSs) and likely influenced the inclusion of the PI-23 indicator, which measures the availability of information on resources received by service delivery units, in the PEFA framework.
However, this is something of a special case, whereby the link between government spending (by local officials) and those affected (pupils, their parents, and head teachers) was very tangible. More common forms of budget execution reporting are in-year budget reports (measured by PI-24) and annual financial statements (measured by PI-25), which tend to be less specific in nature and may be less digestible to broad groups of stakeholders. 12 Using a transparency index that measures the frequency with which governments update economic data that they make available to the public, Islam (2006) finds that countries with better information flows also govern better. 13 Our second hypothesis builds on this latter strand of literature by examining whether a cross-country relationship exists between transparency in budget execution reporting and corruption.

Hypothesis 2:
Countries with more transparent budget execution reporting will have lower levels of corruption.

Transparency in external auditing
Of course, whether transparency creates the necessary accountability between government and citizens depends on the latter using the information to hold government to account. According to Heald (2006), "For transparency to be effective, there must be receptors capable of processing, digesting, and using the information. . . . It is possible for an organization to be open about its documents and procedures yet not be transparent to relevant audiences if the information is perceived as incoherent." He further notes, "The expected benefits [may] not materialize because the receptors have been disabled by overload and/or government spin." One way in which this "transparency illusion" might be bridged is through more specialized surveillance, namely through an independent audit function.
Ex post review through external audit is a means by which institutions with technical expertise can hold government accountable for its performance and use of resources. Generally, this role is carried out by a supreme audit institution, 14 which typically reports to the legislature. The theoretical link between auditing and corruption is straightforward. Audits increase the probability that corruption will be detected, thereby increasing the ex ante cost of engaging in corrupt activities, assuming that corrupt practices are sanctioned. However, the evidence for the impact of audit on corruption is context specific. In contrast to most of the cross-country literature on PFM and corruption discussed thus far, studies on the impact of audit on corruption generally take the form of researchers focusing on a particular sector in a specific country and using a measure of actual corruption, similar to the study by Reinikka and Svensson (2011).
Di Tella and Schargrodsky (2003) find that a large increase in audit intensity, during an anticorruption crackdown, was associated with a 15 percent decline in input prices paid by hospitals in Buenos Aires. 15 Lagunes (2017) finds that Peruvian districts subject to monitoring by both civil society and the supreme audit institution spent 51.39 percent less in the execution of public works than comparable districts that were less scrutinized. This contrasts with the findings of Olken (2007), who finds that grassroots monitoring of corruption has limited impact on corruption in Indonesian village road projects, but that increasing the probability of external audits from 4 percent to 100 percent reduced missing expenditures 16 by 8.5 percentage points, from 27.7 percent to 19.2 percent. He suggests that the effect was not larger because 100 percent probability of audit does not translate into 100 percent probability of detection and punishment and further suggests that providing audit results to the public, who can then use them in making their electoral choices, may be a useful complement to formal punishments.
A study by Ferraz and Finan (2008) examines the effect on municipal electoral outcomes when something much like Olken's suggestion was implemented in Brazil. They find that the dissemination of audit reports revealing corrupt practices to the general media reduced the likelihood of incumbent mayors being reelected. Two and three violations associated with corruption reduced the likelihood of reelection by 7 percent and 14 percent, respectively. Furthermore, they find that, in the presence of a local radio station, the effect on incumbents' likelihood of reelection was reduced further in the presence of corruption and increased in the absence of corruption. The random nature of the timing of the audits, both before and after the elections, allowed them to establish a plausible causal link between the revelations of corruption in the reports and the election outcomes.
In contrast to the more precise nature of these studies, PEFA indicators of the audit function tend to be more broad based. Nevertheless, they are sufficient to construct a hypothesis around the relationship between a more transparent audit function and corruption.

Controls limit discretion and reduce opportunities for corruption
Weak regulatory and control environments offer the best opportunities for corruption in public spending. It is therefore not surprising that sequencing strategies for PFM reform such as the "platform approach" recommend establishing the integrity of very basic data and control systems (such as payroll and procurement) before undertaking more complex reforms (Brooke 2003). Ware et al. (2007) describe public procurement as "a perennial challenge" from a corruption standpoint because of its specific characteristics. Public procurement expenditures, such as public investments or contracts for the supply of goods and services, are typically low-volume but high-value transactions. This characteristic makes public procurement an attractive arena for corrupt individuals, because bribes are generally extracted as a percentage of the contract. Furthermore, public procurement is frequently characterized by high levels of discretion, both in terms of politicians' discretion over the location of investments and bureaucrats' discretion over the award and management of the related contracts. Moreover, these problems are exacerbated further in low- and middle-income countries where the private sector is more dependent on public procurement. Using PEFA indicators and firm-level survey responses, Knack, Biletska, and Kacker (2017) find that firms pay less in kickbacks in countries with better procurement systems.
Although payroll and welfare abuses present potentially lower monetary incentives for corruption, human resource systems in low- and middle-income countries are frequently plagued by nepotism, ghost workers, and absenteeism. Reforms in this area are typically related to creating better links between personnel systems, social welfare systems, and payment systems, which is often the equivalent of enforcing data sharing across government entities. Gupta et al. (2017) present evidence from case studies in Ghana on the number of ghost workers and in India on the power of digitalization to reduce corruption in welfare benefits.
Our final hypothesis considers the relationship between budget execution controls and corruption.
In the next section, we outline the data we use to test these hypotheses and discuss their strengths and limitations with respect to investigating the relationships between PFM and corruption.

Hypothesis 3:
Countries with more transparent external audit institutions will have lower levels of corruption.

Hypothesis 4:
Countries that more closely adhere to best practice in budget execution controls will have lower levels of corruption.

DATA AND ANALYSIS
As with previous research on the determinants of corruption, our study is constrained by the absence of comparable cross-country data on actual corruption. Like others, we rely on perceptions-based indicators of corruption. The primary data source we use for our dependent variable is from the World Bank's Worldwide Governance Indicators for control of corruption (hereafter the WGI_ COC). Our data on PFM performance come from PEFA assessments. Both data sets have important limitations with respect to how well they represent the hypotheses outlined in the previous section. We discuss these limitations in more detail below.
The WGI_ COC "captures perceptions of the extent to which public power is exercised for private gain, including both petty and grand forms of corruption, as well as 'capture' of the state by elites and private interests." 17 It is constructed using 16 questions from 7 representative sources and 27 questions from 15 nonrepresentative sources. Sources include surveys of households and firms such as the Afrobarometer survey and the Gallup World Poll and expert opinions from commercial providers of business information (for example, the Economist Intelligence Unit), nongovernmental organizations (for example, Freedom House), and public sector organizations (for example, the African Development Bank). Questions vary with respect to their direct relevance to PFM. For example, although the WGI_ COC indicator includes a measure related to the diversion of public funds (from World Economic Forum 2017), it also includes more general questions, for example, on whether corruption among government officials is perceived to be widespread (from the Gallup World Poll). This perception has implications for our hypotheses. If improvements in PFM are correlated with improvements in components of the WGI_ COC that are wholly unrelated to improvements in PFM, then we may find support for our hypotheses in spurious relationships.
More generally, perceptions-based indexes of corruption are the subject of criticism. Cobham (2013) is particularly critical of the use of expert opinion surveys within Transparency International's corruption perceptions index, which, he says, "embeds a powerful and misleading elite bias in popular perceptions of corruption, potentially contributing to a vicious cycle." As noted, the WGI_COC uses expert opinion surveys, although it also uses citizen perceptions surveys. Donchev and Ujhelyi (2013) highlight particular biases within perceptions indexes with respect to measurement errors for low- and middle-income countries and large countries. Furthermore, microlevel data on actual corruption suggest that perceptions of corruption may be off the mark in either direction by a wide margin (Olken and Pande 2012). These criticisms again raise concerns about measurement error in our dependent variable.
A related problem is that improvements in PFM, particularly those related to increased transparency, may result in revelations that actually lead to a worsening in perceptions of corruption. As noted in the discussion of the findings of Ferraz and Finan (2008), information about corrupt practices revealed by transparency in budgets, reporting, and audits could have the opposite effect of our hypotheses regarding their relationships with perceptions of corruption. Indeed, Fisman and Golden (2017) point out that, since the commencement of President Xi Jinping's crackdown on corruption, China's ranking on Transparency International's corruption perceptions index has actually worsened, lending credence to the notion that perceptions are driven more by revelations than by corruption itself. As such, our hypotheses that transparency in budgeting, reporting, and auditing is associated with lower levels of corruption may be compromised by using perceptions of corruption as a proxy for actual corruption.
However, the WGI_COC also has advantages. Chief among them is its cross-country coverage with respect to countries that have also undertaken a PEFA assessment. Moreover, Treisman (2000) makes a strong case for the usefulness of corruption perceptions indexes, noting high correlations between different indexes across countries, within indexes over time, and between surveys of business people and citizens. In addition, as a composite of numerous corruption surveys or a poll of polls, the WGI_COC includes a measurement error term. This inclusion allows for analysis that places greater weight on composite scores where the various surveys produce similar scores. Nevertheless, the previously noted problems remain pertinent. In particular, perceptions of corruption may not correlate with actual corruption and may be driven by the type of revelations that sometimes come with improvements in PFM. We consider these factors further when interpreting our results in the concluding section.
WGI_COC scores are based on a scale from −2.5 to 2.5, with higher scores indicating better outcomes. For comparison with other indexes and easier interpretation, we rescale the index to a scale of 0 to 100. 18 For comparison with backward-looking PEFA assessments, we also calculate our dependent variable as the moving average of the WGI_COC score corresponding to the year of the PEFA assessment and the two preceding years. For the 99 countries in our sample (annex 5B, table 5B.1), the scores are concentrated toward the lower end of the scale, with most countries having a WGI_COC score between 20 and 60 (figure 5.1, panel a). This distribution is not surprising given that low- and middle-income countries dominate the sample. The mean score for the sample is 39.8, corresponding most closely to the score of Peru in 2015. The highest score is 90.2, for Norway in 2008, while the lowest is 21.2, for Myanmar in 2012. The WGI_COC scores are also strongly correlated with both the Transparency International and the International Country Risk Guide (ICRG) country risk indexes. 19
PEFA scores are also prone to criticism. A common complaint is that some PEFA indicators emphasize the measurement of form over function. There is also debate around which indicators matter most, particularly when it comes to designing PFM reforms. 20 Our aim in this chapter is to select the indicators that may matter most for corruption and examine whether these relationships are observable in the data. We therefore construct indexes that best match the hypotheses outlined in the previous section. PEFA scores are converted to numeric values using the methodology outlined in chapter 2 of this report.
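The rescaling and three-year averaging of the control of corruption score described above can be illustrated with a minimal sketch. It assumes a simple linear mapping from the original −2.5 to 2.5 scale onto 0 to 100; the exact formula used in the report is not reproduced here, and the function names and sample values are hypothetical.

```python
def rescale_wgi(score):
    """Map a score on the [-2.5, 2.5] scale linearly onto [0, 100] (assumed mapping)."""
    return (score + 2.5) / 5.0 * 100.0


def three_year_average(scores_by_year, assessment_year):
    """Average the rescaled scores for the assessment year and the two preceding years."""
    years = [assessment_year - k for k in range(3)]
    values = [rescale_wgi(scores_by_year[year]) for year in years]
    return sum(values) / len(values)


# Hypothetical raw scores for one country (illustrative only, not real data).
raw_scores = {2013: -0.50, 2014: -0.40, 2015: -0.30}
dependent_variable = three_year_average(raw_scores, 2015)
print(round(dependent_variable, 1))
```

Because the mapping is linear, rescaling before or after averaging gives the same result.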
Compared with the distribution of the WGI_COC scores, the distribution of the overall PEFA scores is concentrated toward the upper end of the scale, with 75 percent of countries scoring 2 or higher, despite the lower-income bias in the sample (figure 5.1, panel b). Nevertheless, an observable relationship exists between the two measures (figure 5.1, panel c), with a correlation coefficient of close to 0.5. At the same time, panel c also shows quite a number of outliers, particularly with respect to countries that have performed well on PEFA assessments but have poor WGI_COC scores.
To test hypothesis 1 (countries with a more transparent and orderly budget process will have lower levels of corruption), we construct an index of the former (TRANS1) using indicators PI-5, PI-6, PI-11, PI-12, and PI-27 of the PEFA framework (table 5.2). The TRANS1 index is calculated as the average of the scores for the dimensions underlying these indicators. However, we exclude PI-12(ii) (on debt sustainability analysis) because the link to corruption is more ambiguous and PI-12(iv) (on in-year amendments) because this budget execution control issue is included in the relevant index for budget execution controls. The distribution of scoring on this index is concentrated toward the upper end of the scale, similar to the overall PEFA index (figure 5.2, panel a), and is weakly correlated with the WGI_COC index (figure 5.3, panel a).

Table 5.2

PI-5
Classification of the budget: calls for the use of a standardized chart of accounts in line with government financial statistics
The use of a standardized chart of accounts makes it easier for other stakeholders to understand and engage with budget documents, increasing the probability that corrupt allocations are detected.

PI-6
Comprehensiveness of information included in budget documentation: calls for the inclusion of nine types of budget documentation
The more information is provided, the more other stakeholders can engage with the budget process, increasing the probability that corrupt allocations are detected.

PI-11
Orderliness and participation in the annual budget process: calls for a timely and structured budget process, using a budget calendar, call circulars, and timely submissions and reviews
An orderly and timely budget preparation process should limit the opportunities for corruption in the budget formulation process by introducing a structured set of checks and balances into the preparation process and reducing discretionary practices (such as open-ended budgeting).

PI-12 a
Multiyear perspective in fiscal planning, expenditure policy, and budgeting: calls for a longer-term perspective in planning and budgeting
Better information on future allocations increases the probability that corrupt allocations are detected.

PI-27 b
Legislative scrutiny of the annual budget law: calls for the legislature to have a clearly defined and timebound role in the scrutiny of the annual budget law
Legislative oversight increases the probability that corrupt allocations are detected.
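The construction of a subindex such as TRANS1, an unweighted average of dimension scores with selected dimensions left out, can be sketched as follows. The mapping of A to 4 through D to 1 is an assumption consistent with the 1-4 scale used in this chapter; the report's own conversion table is in chapter 2 and may differ in detail. The dimension labels and grades below are hypothetical, and skipping unscored dimensions is likewise an assumption.

```python
# Assumed grade-to-number mapping on a 1-4 scale (see chapter 2 for the report's own table).
GRADE_VALUES = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}


def subindex(dimension_grades, exclude=()):
    """Average the numeric scores of all scored dimensions not listed in `exclude`."""
    kept = [
        GRADE_VALUES[grade]
        for name, grade in dimension_grades.items()
        if name not in exclude and grade in GRADE_VALUES  # skip unscored dimensions
    ]
    if not kept:
        raise ValueError("no scored dimensions left to average")
    return sum(kept) / len(kept)


# Hypothetical grades for one country (illustrative only).
grades = {
    "PI-5": "B",
    "PI-6": "A",
    "PI-11(i)": "C",
    "PI-12(i)": "B",
    "PI-12(ii)": "D",   # excluded: debt sustainability analysis
    "PI-12(iv)": "D",   # excluded: counted under budget execution controls
    "PI-27(i)": "A",
}
trans1 = subindex(grades, exclude=("PI-12(ii)", "PI-12(iv)"))
print(trans1)
```

The same helper would serve for the other subindexes by passing a different set of dimensions.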
To test hypothesis 2 (countries with more transparent budget execution reporting will have lower levels of corruption), we construct an index (TRANS2) based on the dimensions of indicators PI-23, PI-24, and PI-25 (see table 5.3). Indicators PI-24 and PI-25 measure the quality and timeliness with which the government prepares standard financial reports. PI-23 is more of a special case that obliges central government to take steps to ensure that resources are reaching schools and health facilities. 21 The TRANS2 index is distributed more normally and correlated more strongly with the WGI_COC than the TRANS1 index (figure 5.2, panel b; figure 5.3, panel b). These are indicators of internal transparency in budget execution reporting. Whether they are made publicly available is measured separately under PI-10. However, the PI-10 indicator does not provide enough precision to determine whether budget execution reporting is made publicly available. 22 Nevertheless, the fact that they are produced makes it more plausible that they will make it into the public sphere.
In contrast, TRANS3, which was constructed to test hypothesis 3 (countries that have more transparent audit institutions will have lower levels of corruption), shows the greatest variation in its distribution and the weakest relationship with the WGI_COC (figure 5.2, panel c; figure 5.3, panel c, respectively). Most notable is the level of variation in scoring on the TRANS3 index across those countries that score below the mean of approximately 40 on the WGI_COC (figure 5.3, panel c). The index is constructed as the average of the dimensions under PI-26 and PI-28 (table 5.4). Our hypothesis is that adherence to best practice in auditing increases the probability of detection and that placing audit reports before the legislature increases the probability of sanction.
Table 5.4

PI-26
Scope, nature, and follow-up of external audit: calls for comprehensive scope of audits, timely submission to the legislature, and evidence that issues raised have been followed up
Increases the probability that corruption will be detected and sanctioned

PI-28
Legislative scrutiny of external audit reports: calls for timely scrutiny of audit reports, in-depth hearings on qualified or adverse audit opinions, and evidence that the legislature's recommendations on action have been implemented by the executive
Increases the probability that corruption will be sanctioned

Our final subindex (CONTROLS) is a composite of indicators PI-18, PI-19, and PI-20 and the fourth dimension of PI-27 (table 5.5), constructed to test our fourth hypothesis (countries that adhere more closely to best practice in budget execution controls will have lower levels of corruption). Our hypothesis is that these types of controls are associated with lower levels of corruption on the basis that they limit opportunities and incentives for specific types of corruption. The fact that they are in place may also demonstrate political commitment to budget priorities. Of our four indexes, the CONTROLS index has the most normal distribution and is correlated most strongly with the WGI_COC (figure 5.2, panel d; figure 5.3, panel d, respectively).
All of our subindexes are positively correlated with one another. TRANS2 and CONTROLS have the highest correlation of 0.57, while the other subindexes are weakly correlated with one another (table 5.6). All of the subindexes are strongly correlated with the overall PEFA index.
The analysis in this section has served to establish relationships between various parts of the PFM system and corruption based on the hypotheses outlined in the previous section. However, this analysis comes with the important caveat that our data on corruption, the WGI_ COC, is a perceptions-based indicator rather than a measure of actual corruption. Furthermore, our measurement of components of the PFM system using aggregated PEFA scores may not perfectly reflect our hypotheses. The implications for the interpretation of the observed relationships are revisited in the concluding discussion. In the next section, we describe our approach to estimating these relationships when controlling for other determinants of corruption.

ESTIMATION APPROACH
Following the example of Treisman (2000), we take a two-step approach to estimating the relationship between our PFM indexes and the WGI_COC. We use linear regression as the estimation technique rather than maximum likelihood estimation because the dependent variable is closer to being a continuous variable than a categorical variable. As a first step, in equation (5.1) we employ weighted least squares (WLS) to estimate the relationship in levels:

$$Y_i = \alpha + \beta X_i + \gamma Z_i + \varepsilon_i \tag{5.1}$$

where $Y_i$ is the WGI_COC, $X_i$ is the relevant PFM index for country $i$, $Z_i$ is a matrix of country-level controls, and $\varepsilon_i$ is our error term. The equation is estimated using data for country $i$'s most recent PEFA assessment, which covers the period from 2005 to 2017 for the 99 countries in our sample (see annex 5B, table 5B.1). Following Treisman (2000), observations are weighted using the inverse of WGI_COC variance between surveys, which gives less emphasis to countries with wide variations in the components making up the WGI_COC.
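To make the weighting scheme concrete, the following is a minimal sketch of inverse-variance weighted least squares for a single regressor. It omits the control variables for brevity, so it is a simplified illustration rather than the specification estimated in this chapter; the function name and the sample data are hypothetical.

```python
def wls_fit(x, y, variances):
    """Weighted least squares for y = a + b * x, with weights equal to 1 / variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    # Weighted means of the regressor and the outcome.
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    # Weighted covariance and variance around those means.
    s_xy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: PFM index (1-4 scale), corruption score (0-100 scale),
# and the between-survey variance of each country's corruption score.
pfm_index = [1.0, 2.0, 3.0, 4.0]
corruption_score = [15.0, 20.0, 25.0, 30.0]
score_variance = [1.0, 4.0, 2.0, 0.5]

intercept, slope = wls_fit(pfm_index, corruption_score, score_variance)
print(round(intercept, 6), round(slope, 6))
```

Countries whose component surveys disagree widely have a large variance and therefore a small weight, so they pull less on the fitted line.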
Our control variables are based on the findings of similar studies on the determinants of corruption (table 5.7). Countries with large natural resource endowments are more susceptible to rent seeking and corruption, whereas openness to trade is associated with less corruption (Ades and Di Tella 1999). We use natural resource rents as a percentage of gross domestic product (GDP) to control for the former and trade as a percentage of GDP to control for the latter. Higher-income countries tend to have lower perceptions of corruption, which we control for using the log of GDP per capita. Following the example of Treisman (2000), we use lagged values for each of these first three controls in recognition that current levels of corruption and development are likely to be jointly determined. Specifically, we use the four-year moving average of the year of the PEFA assessment lagged by five years (for example, for a PEFA assessment carried out in 2015, the natural resource endowments variable will be the average of natural resource rents as a percentage of GDP for the four years 2011, 2010, 2009, and 2008).
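The lag structure can be made explicit with a short sketch. The window below is chosen to reproduce the worked example in the text (a 2015 assessment uses 2011, 2010, 2009, and 2008); the parameter names and the sample series are hypothetical.

```python
def lagged_window(assessment_year, first_lag=4, length=4):
    """Years entering the lagged moving average; defaults follow the worked example."""
    return [assessment_year - first_lag - k for k in range(length)]


def lagged_average(series_by_year, assessment_year):
    """Average a control variable over the lagged window."""
    years = lagged_window(assessment_year)
    return sum(series_by_year[year] for year in years) / len(years)


# Hypothetical natural resource rents as a percentage of GDP (illustrative only).
rents = {2008: 10.0, 2009: 12.0, 2010: 14.0, 2011: 16.0}

print(lagged_window(2015))          # -> [2011, 2010, 2009, 2008]
print(lagged_average(rents, 2015))  # -> 13.0
```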
We also control for country size using the log of population because of its association with political structures such as federalism, although the effects on corruption are ambiguous (Treisman 2000). And again, following the example of Treisman (2000), we control for both democracy and press freedom using indexes of each and differences in region and colonial origin using dummy variables. Finally, following Knack, Biletska, and Kacker (2017), we employ year dummies for the year in which the PEFA assessment was carried out, because this varies across our sample.
The model outlined in equation (5.1) suffers from obvious endogeneity concerns, particularly simultaneity bias arising from the likelihood that corruption and its determinants (including PFM performance) may be jointly determined (Olken and Pande 2012), measurement error (because both our dependent and independent variables may not accurately reflect actual corruption and PFM performance, respectively), and omitted variable bias arising from unobservable determinants of corruption, for example, culture. Overcoming the endogeneity issues related to the first two is beyond the scope of this chapter, which aims to investigate relationships rather than establish causal mechanisms. The second step of our two-step approach is aimed at addressing some of the concerns regarding omitted variable bias. In this second estimation, we exploit the repeat assessments within the PEFA data set to estimate the relationship using the fixed-effects estimator for panel data in equation (5.2):

$$Y_{it} = \alpha_i + \beta X_{it} + \gamma Z_{it} + \varepsilon_{it} \tag{5.2}$$

This model estimates the relationship between changes in the WGI_COC ($Y$) and changes in our PFM indexes ($X$) in country $i$ over a period of time $t$, also controlling for changes in our other controls ($Z$) and nonobservable, nonchanging country fixed effects ($\alpha_i$). However, if the model fails to specify important determinants of corruption that do change over time, the problem of omitted variable bias remains. This discussion is taken up further in our conclusions later in the chapter. Not every country that has undertaken a first PEFA assessment has undertaken a second, which reduces our sample size for estimating equation (5.2) to 60 countries (see annex 5B, table 5B.2). The next section discusses the results from our estimations. Table 5.8 outlines the estimates of our model using WLS estimation.
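The logic of the fixed-effects estimator can be illustrated with a minimal sketch of the within transformation for a single regressor: each country's observations are demeaned, which removes any country-specific constant, and the slope is estimated from the remaining variation. Controls and year effects are omitted, and the function name and data are hypothetical.

```python
from collections import defaultdict


def within_slope(panel):
    """Slope of y on x after subtracting each country's own means (fixed effects)."""
    by_country = defaultdict(list)
    for country, x, y in panel:
        by_country[country].append((x, y))

    s_xy = 0.0
    s_xx = 0.0
    for observations in by_country.values():
        x_mean = sum(x for x, _ in observations) / len(observations)
        y_mean = sum(y for _, y in observations) / len(observations)
        for x, y in observations:
            # Deviations from country means; the country effect drops out here.
            s_xy += (x - x_mean) * (y - y_mean)
            s_xx += (x - x_mean) ** 2
    return s_xy / s_xx


# Hypothetical panel: two assessments per country, generated with a common
# slope of 3 and different country-specific levels (30 for A, 50 for B).
panel = [
    ("A", 2.0, 36.0), ("A", 2.5, 37.5),
    ("B", 3.0, 59.0), ("B", 3.5, 60.5),
]
print(within_slope(panel))
```

In this constructed example a pooled regression would attribute the level difference between the two countries to the regressor, whereas the within estimator recovers the common slope.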

RESULTS
We find that our overall PEFA index (PEFA) as well as the three subindexes for transparency (TRANS1-3) and our subindex for controls (CONTROLS) have a positive relationship with the control of corruption index (WGI_COC) after controlling for other factors. Furthermore, these relationships are statistically significant at the 5 percent level or better. Our results suggest that scoring 1 point higher on the PEFA index scale of 1-4 is associated with a score that is 10.7 points higher on the WGI_COC index scale of 0-100 on average (column 1). The results are lower for each of our subindexes, but our CONTROLS index has a stronger relationship than our TRANS1-3 indexes (columns 2-5). Moreover, when each of the subindexes competes in the same model (column 6), the effect of the CONTROLS index dominates the effect of the transparency indexes, which suggests that controls are a more important determinant of perceptions of corruption than transparency.
With respect to our control variables, our findings are largely in line with theory and previous empirical findings. We find that a natural resource base (NAT_RES) that is 1 percent of GDP higher on average is associated with a WGI_COC score that is 0.3 point lower on average and that GDP per capita (LOGINCOMEPC) that is 1 percent higher on average is associated with a WGI_COC score that is 5.5-6.5 points higher on average. Country size is a significant determinant of WGI_COC in our model. Countries whose population (LOGPOP) is on average 1 percent larger have WGI_COC scores that are on average 2-3 points lower. We also find a small effect for press freedom (PRESS). A 1-point improvement in the press freedom index is associated with a 0.2-point improvement in the WGI_COC score. 23 For trade openness (TRADE) and democracy (POLITY), we find small negative associations with the WGI_COC. This is the opposite of what is predicted by theory and contradicts the findings of others, although we do not find either relationship to be statistically significant within our model.
As a robustness check, we reestimate our model excluding the top and bottom 5 percent of observations for the WGI_ COC. 24 Our results remain broadly similar (annex 5C, table 5C.1). The estimated coefficients of our PFM indexes are slightly lower, with the exception of TRANS1, which is found to be slightly higher. TRANS2 is no longer found to be statistically significant; neither are population size (LOGPOP) and press freedom (PRESS). As a second robustness check, we estimate the model using the ICRG index instead of the WGI_ COC. This reduces the sample size from 99 countries to 76. Our estimated coefficients are smaller, and the coefficients for TRANS1 and TRANS2 are no longer found to be statistically significant, but we again find the CONTROLS index to have the largest and most statistically significant effects (annex 5C, table 5C.2).
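The first robustness check can be sketched as a simple trimming step. The exact trimming rule is not spelled out in the text, so this sketch assumes that the lowest and highest 5 percent of observations are dropped by count, rounding down; the function name and data are hypothetical.

```python
def trim_extremes(observations, share_percent=5):
    """Drop the lowest and highest `share_percent` of (country, score) pairs by score."""
    ranked = sorted(observations, key=lambda pair: pair[1])
    k = len(ranked) * share_percent // 100  # number dropped at each end (rounded down)
    return ranked[k:len(ranked) - k]


# Hypothetical sample of 20 countries with scores 1.0 to 20.0 (illustrative only).
sample = [("C%02d" % i, float(i)) for i in range(1, 21)]
trimmed = trim_extremes(sample)
print(len(trimmed), trimmed[0], trimmed[-1])
```

The model would then be reestimated on the trimmed sample and the coefficients compared with the full-sample estimates.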
The panel results in table 5.9 broadly corroborate our core findings from the WLS estimates. Again, we find positive relationships between our PFM indexes and the WGI_COC. We find the largest effect for the overall PEFA index, where the estimated coefficient suggests that a 1-point improvement on the PEFA index 1-4 scale is associated with an improvement of 3 points along the WGI_COC index 0-100 scale (table 5.9, column 1). Also consistent with the WLS estimates are the weaker estimated relationships for the transparency subindexes (columns 2-4) compared with the controls index (column 5). Again, we find that, when forced to compete in the same model, the controls index dominates the effects of the other indexes in both magnitude and statistical significance. Moreover, we find the estimates of overall PEFA and controls indexes to be statistically significant at the 10 percent and 5 percent levels, respectively, but we do not find the estimates of the transparency indexes to be statistically significant. However, our estimates of the relationship between improvements in our PFM indexes and improvements in the WGI_COC index are quite small. When one considers that Organisation for Economic Co-operation and Development (OECD) countries do not score perfectly on a range of these indicators 25 and the amount of time it took for them to get to that point, a movement of one score (from an average PEFA index score of B to A, C to B, or D to C) would require a lot of effort to achieve a 3-point improvement on the WGI_COC index. In contrast, our estimates for the relationship between increases in income per capita and increases in the WGI_COC index are substantially higher, ranging from a 5-point to a 6-point increase in the WGI_COC index for a 1 percent increase in GDP per capita (LOGINCOMEPC).
We also find a statistically significant relationship between democracy (POLITY) and the WGI_ COC index, with a 1-point improvement in the score of the former corresponding to a 0.4-point to a 0.5-point improvement in the latter. We do not find statistically significant relationships between changes in the natural resource base, trade openness, country size, or press freedom and the control of corruption index. Our panel estimate results are sensitive to the inclusion of the top and bottom 5 percent of countries in terms of absolute change in WGI_ COC scores (annex 5C, table 5C.3). 26 Once those countries are excluded, we no longer find any of our PFM indexes to be statistically significant. The estimated coefficients for changes in income (LOGINCOMEPC) and democracy (POLITY) remain statistically significant at the 10 percent and 5 percent levels, respectively.
We also try to replicate similar results using the ICRG index for a smaller sample of 44 countries (annex 5C, table 5C.4). In this instance, we do not find statistically significant relationships for any of our PFM indexes individually and actually find negative coefficients for TRANS2 and TRANS3. We again find the largest coefficient for the CONTROLS index, and when we include all four indexes, the effect of CONTROLS is positive and statistically significant at the 5 percent level. However, we also find negative effects for the TRANS2 and TRANS3 indexes in this specification. The negative signs of the estimated effects for TRANS2 and TRANS3 provide some weak evidence for the alternative hypothesis that improvements in PFM that increase transparency lead to revelations that worsen the perceptions of corruption.

DISCUSSION
PFM reform often forms part of a low- or middle-income country's anticorruption strategy, frequently with external support from its development partners in the form of funding and technical assistance. It is therefore important for the governments of both donor and recipient countries, as well as PFM practitioners, to consider whether there is evidence that PFM reforms have an impact on corruption. The literature that tries to establish a causal link between PFM reforms and corruption tends to have a reform niche and country focus. Cross-country examination of the relationship is limited to higher-income countries. This chapter tries to fill the gap in the literature by looking at the relationship for a large sample of predominantly lower-income countries. We also try to provide a more nuanced examination of the relationship by testing four hypotheses related to PFM reforms regarding transparency in budgeting, reporting, and auditing and in expenditure controls. Our analysis provides evidence that there is a relationship between "better" PFM, particularly expenditure controls, and lower levels of corruption. But these results come with important caveats.
Our estimation of the cross-country relationship in levels shows a statistically significant correlation between our four measures of PFM performance and perceptions of corruption after controlling for other determinants of the latter. Compared with greater transparency in budgeting, reporting, and auditing, we find a stronger correlation between lower perceptions of corruption and "better" expenditure controls. Moreover, when allowed to compete in the same model, the effect of our expenditure controls index dominates the effects of our transparency indexes. To address potential omitted variable bias concerns, we also estimate the relationship between PFM performance and perceptions of corruption over time. Using a fixed-effects estimator, we once again find a statistically significant correlation between "better" expenditure controls and lower perceptions of corruption, but our estimates of the relationship between transparency in budgeting, reporting, and auditing and in perceptions of corruption are no longer found to be significant.
We interpret these findings as supporting the idea that expenditure controls are likely to be useful in an environment characterized by political commitment to budget credibility. We find weaker evidence of a relationship between transparency in budgeting, reporting, and auditing and in perceptions of corruption. But we do not find evidence to support an alternative hypothesis that greater transparency leads to revelations that worsen perceptions of corruption. We also note that, compared with the effect of economic growth, PFM performance has a very limited impact on perceptions of corruption.
However, our analysis has important limitations. Robustness checks show that our results are sensitive to changes in sample size and alternative measures of corruption. But more fundamentally, our estimation technique suffers from endogeneity issues and our data are far from perfect. These weaknesses were largely insurmountable for this research and leave our results open to more than one interpretation.
A causal interpretation suggests that improvements in PFM performance lead to improvements in perceptions of corruption. However, several endogeneity concerns lead us to caution against this interpretation. The first is reverse causality. Our estimation technique cannot rule out the possibility that causality flows in the opposite direction, that is, that lower levels of corruption allow for improved PFM performance. However, our results fit with the general theory outlined in our hypotheses as well as the results of previous studies.
A larger concern is omitted variable bias. Although our panel estimation controls for nonvarying determinants of corruption, PFM reforms do not occur in a vacuum. Often the more salient political issue is corruption. Politicians rarely campaign on the promise of PFM reform, but they frequently campaign on an anticorruption platform. PFM reform tends to be part of a package of wider anticorruption reforms. Because our estimations do not include controls for other anticorruption strategies, our results may be biased. As such, we cannot rule out the possibility that our results are picking up a co-movement in improvements in PFM performance and corruption perceptions that are not directly related. This possibility becomes more of a concern given the potential for measurement error. As noted throughout, perceptions of corruption and actual corruption may diverge, while our measures of PFM performance may be unrelated to the types of corruption that are driving our perceptions of corruption indicator. Wages paid to civil servants may also be an important determinant of corruption, 27 but the paucity of a comprehensive source of data on wages across countries meant that controlling for wages was beyond the scope of this paper.
Further research in this area should focus on identifying more specific cross-country measures of corruption that can be linked to more specific PFM reforms. This chapter has outlined the relationships between the most commonly used measures of both PFM and corruption. Our findings suggest that, during windows of opportunity when there is strong high-level commitment to combatting corruption, the focus of support for PFM reform should be on improving expenditure controls.

PI-11 Orderliness and participation in the annual budget process
PI-12 Multiyear perspective in fiscal planning, expenditure policy, and budgeting

NOTES
1. Bureaucratic corruption is sometimes called "routine" corruption because it often plays out in the form of bribes for government services by junior to midlevel officials. It is also sometimes referred to as "survival" corruption because of the low wages received by those extracting bribes (Fjeldstad 2005).
This chapter focuses solely on the expenditure side of the PFM system. The revenue side also provides ample opportunity and incentives for corruption. For a review, see Fjeldstad (2005).
4. Although anecdotal evidence suggests that clean audits are for sale in some countries.
5. See https://www.imf.org/external/np/fad/trans/.
6. Fiscal transparency was constructed from IMF Reports on the Observance of Standards and Codes, and corruption was based on the Worldwide Governance Indicators (WGI) control of corruption index.
7. These other controls include controls for the log real gross domestic product (GDP), a dummy for high-income economies, dummies for geographic location, dummies for legal origin, trade openness, fractionalization, and education.
8. Using the International Budget Partnership's open budget index score and Transparency International's corruption perceptions index.
9. Using the WGI control of corruption index and the World Economic Forum Executive Opinion Survey.
10. Including income per capita and administrative regulations.
11. Using data from the International Budget Partnership and the Organisation for Economic Cooperation and Development (OECD) on budget transparency.
12. These PEFA indicators of transparency in budget execution reporting do not measure whether information is made publicly available, only that the relevant analysis is prepared. Public dissemination is measured separately through PI-10 (public access to key fiscal information), although it is not a perfect measure of transparency in budget execution reporting alone, because it also includes publication of budget documents, audit reports, and procurement contracts.
13. As measured by the WGI for government effectiveness, regulatory burden, and control of corruption.
14. Usually referred to as the auditor general in Anglophone contexts.
15. They further find that wages played no role in reducing corruption when audit intensity was at its peak but did have an effect on lowering corruption when audit intensity returned to normal levels in the aftermath of the crackdown.
16. As measured by the difference between actual project cost and estimates of engineers.
17. For a list of the surveys and sources used to compile the WGI_COC, see https://info.worldbank.org/governance/wgi/pdf/cc.pdf.
18. The transformation is as follows: [cc_est - (-2.5)] * [100 - (0)] / [2.5 - (-2.5)] + 0.
19. Spearman correlation coefficients for the WGI_COC with these indexes are 0.92 (98 observations) and 0.71 (76 observations), respectively.
20. See  for a review of the arguments.
21. It is notable that in its PEFA assessment Norway scored a D on this indicator and decided that it was not a problem that needed rectifying, arguing that it was an issue to be taken up at the subnational level if at all.
22. The PI-10 indicator calls for the publication of six types of documents: three related to budget execution reporting and three related to budget documents, procurement contracts, and audit reports.
23. The press freedom index runs counterintuitively; that is, negative scores are better.
24. As a result, Angola (2016)

GUNDULA LÖFFLER, CATHAL LONG, AND ZAC MILLS
In this chapter, we estimate the cross-country relationship between penalties for noncompliance and tax collection. Our central hypothesis is that more consistent administration of penalties for noncompliance is a proxy for the type of political commitment required to increase domestic resource mobilization (DRM) in low- and middle-income countries. We find that countries that score higher on the measure of penalties for noncompliance in Public Expenditure and Financial Accountability (PEFA) assessments have ratios of tax to gross domestic product (GDP) that are 1.3 percent higher on average after controlling for other established determinants of cross-country variation in tax collection. We also find that improvements regarding penalties for noncompliance are associated with increases in the tax-to-GDP ratio over time. We further discuss the plausibility of a causal interpretation of these results. Although our results come with some caveats, we conclude that the credible administration of penalties for noncompliance is potentially a much better indicator of the commitment of low- or middle-income countries to DRM than those indicators currently in use. Unfortunately, the measure was discontinued in the updated 2016 PEFA framework without being assimilated into the frameworks of other international financial institutions that assess public administration.

INTRODUCTION
With the advent of the Sustainable Development Goals (SDGs) and the related Addis Ababa Financing for Development Agreement, domestic resource mobilization is again a hot topic in international development circles (Long and Miller 2017). The Addis Ababa agreement states that the international community "welcome[s] efforts by countries to set nationally defined domestic targets and timelines for enhancing domestic revenue as part of their national sustainable development strategies and will support developing countries in need in reaching these targets" (United Nations 2015a). But with this renewed focus on DRM came a renewed focus on revenue targets. Indeed, in the run-up to the conference in Addis Ababa, the setting of revenue-to-GDP targets was hotly debated, with the zero draft of the document proposing that countries with "government revenue below 20 percent of GDP agree to progressively increase tax revenues, with the aim of halving the gap toward 20 percent by 2025" (United Nations 2015b). However, many took issue with these targets, and they were ultimately abandoned (Moore et al. 2015). Nevertheless, they remain pervasive. The standard recommendation of the International Monetary Fund (IMF) is that low-income countries should target a tax-to-GDP ratio of 15 percent. And, though dropped as a target, the revenue-to-GDP ratio was retained as an indicator under SDG 17, the rationale being that it "enables easy comparisons across countries, . . . facilitate[s] transparent policy dialogue, and provide[s] policy makers with an important tool to assess alternative fiscal reforms and to undertake relevant policy actions." 1 Donors are often overly focused on these types of targets (European Court of Auditors 2016). This is not surprising given that they are accountable to their taxpayers to achieve results.
Arguably, of most interest for donors supporting DRM reforms are indicators of the political will required "to collect taxes efficiently and effectively without fear or favor" (Bird 2015), so that they can program their financial support where it will add most value.
In this chapter, we argue that, for low- and middle-income countries, more coercive measures of tax administration, specifically the credible administration of penalties for noncompliance, are potentially a good indicator for the type of political will necessary to generate higher revenue. In the next section, we review the literature on tax compliance and note a gap in the literature with respect to cross-country analysis of the use of penalties for noncompliance, particularly in low- and middle-income countries. Then we present some initial analysis of the relationship between tax collection and indicators of revenue administration using data from PEFA assessments and discuss why we think that development agencies should pay more attention to the indicator of penalties for noncompliance. Next, we outline our methodology for examining the relationship between revenue outcomes and penalties for noncompliance in the presence of other explanatory factors of the former. We conclude by presenting and discussing our findings.

LITERATURE REVIEW
Getting citizens to comply with their tax obligations and liabilities is central to increasing DRM. Early theorists, most prominently Allingham and Sandmo (1972), looked at tax compliance as a question of rational choice, where taxpayers weigh the benefit of the additional income they get to keep if they do not pay taxes against the cost of being caught for not doing so. The latter was considered a function of the likelihood of being caught and the severity of punishment. Rational actors were expected to evade their taxes if the benefits they gain from the retained income outweigh the expected cost of being caught and having to pay a penalty. Accordingly, the original deterrence model emphasized tax enforcement, which identified effective tax administration as the key ingredient to improving compliance.
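The rational-choice logic of the deterrence model can be illustrated with a minimal sketch. The Python below is an illustration under simplifying assumptions, not the original formulation: it assumes a risk-neutral taxpayer, a penalty proportional to the evaded tax, and an all-or-nothing evasion decision. The function and parameter names are hypothetical.

```python
def expected_gain_from_evasion(tax_due, detection_prob, penalty_rate):
    """Expected net gain from evading `tax_due`, relative to full compliance.

    Assumptions (illustrative only): the taxpayer is risk neutral; if
    undetected, the taxpayer keeps the full tax; if detected, the taxpayer
    pays the tax anyway plus a penalty of penalty_rate * tax_due.
    """
    gain_if_undetected = tax_due
    loss_if_detected = penalty_rate * tax_due
    return ((1 - detection_prob) * gain_if_undetected
            - detection_prob * loss_if_detected)


def evades(tax_due, detection_prob, penalty_rate):
    """A rational, risk-neutral actor evades only if the expected gain is positive."""
    return expected_gain_from_evasion(tax_due, detection_prob, penalty_rate) > 0


# Low detection risk: evasion pays in expectation.
print(evades(100, detection_prob=0.1, penalty_rate=1.0))
# Higher detection risk with the same penalty: evasion no longer pays.
print(evades(100, detection_prob=0.6, penalty_rate=1.0))
```

Under these assumptions, evasion pays only when the detection probability is below 1 / (1 + penalty_rate), which is why both the likelihood of being caught and the severity of punishment matter in this framework.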
A rich empirical literature began emerging from this theoretical concept, relying mostly on findings from laboratory experiments that largely confirmed the stipulated mechanism (see, for example, Andreoni, Erard, and Feinstein 1998; Cowell 1990; Friedland, Maital, and Rutenberg 1978; Smith and Stalans 1991; Spicer and Hero 1985; Thomas and Spicer 1982 for a review of this literature). However, much of this empirical research routinely found compliance levels to be significantly higher than what the deterrence model would predict (Andreoni, Erard, and Feinstein 1998; Cowell 1990; Cummings et al. 2009; Torgler 2007). This realization prompted research on tax compliance to evolve into two lines of thinking.
The first line of thinking is based on the role of uncertainty regarding the likelihood of detection. Empirical findings from both lab experiments and field research made it increasingly clear that, in the real world, taxpayers are unsure about their chances of getting audited, and this has a considerable effect on their compliance decision (Andreoni, Erard, and Feinstein 1998; Beck, Davis, and Jung 1991; Kleven et al. 2011; Slemrod, Blumenthal, and Christian 2001; Spicer and Hero 1985; Thomas and Spicer 1982). Uncertainty with regard to the risk of facing a penalty for evasion tends to make taxpayers overly cautious, resulting in increasing compliance (Mascagni 2017).
The second line of thinking, which has dominated the more recent research in this area, explores the role of nonmonetary motivations for compliance, sometimes referred to as the positive incentives for tax compliance (Smith and Stalans 1991). The literature in this area incorporates a wide range of factors that enhance people's tax morale, that is, their intrinsic motivation to comply with their tax obligations and liabilities (Alm and Martinez-Vazquez 2003; Alm, Martinez-Vazquez, and Torgler 2010; Cummings et al. 2009; Feld and Frey 2002; Smith and Stalans 1991; Torgler 2007). The factors affecting tax morale include people's understanding that they pay taxes in return for receiving public services, known as contractual taxation or fiscal exchange (Ali, Fjeldstad, and Sjursen 2014; Fjeldstad and Semboja 2001; Luttmer and Singhal 2014; Moore 2004, 2007; Tilly 1985); the social norms dominating their reference group, as in social influence theory (Ali, Fjeldstad, and Sjursen 2014; Fjeldstad and Semboja 2001; Levi 1988; Torgler 2007); and their perceptions of vertical and horizontal equity, or comparative treatment (Ali, Fjeldstad, and Sjursen 2014; D'Archy 2011; Luttmer and Singhal 2014). This research predominantly conducts field experiments or exploits natural experiments to measure the effect of changes in people's tax morale or perceptions of taxation on their compliance behavior.
This shift in attention away from deterring noncompliance and toward encouraging voluntary compliance has led to an increased focus on the ability of tax authorities to engage in taxpayer education and communication and generally to be more transparent and accountable to taxpayers. Although this may result in positive outcomes in terms of tax morale and people's attitudes toward taxation, it has shifted attention away from the core enforcement functions of tax authorities. But in low- and middle-income countries where the use of third-party information to make evasion more difficult is less common, a "healthy fear" of the tax authority is still an important way to get people to comply with their tax obligations and liabilities. Furthermore, the focus on individual-level data of taxpayers has limited cross-country comparisons, resulting in only a small number of studies exploring the determinants of tax compliance across countries (Ali, Fjeldstad, and Sjursen 2014; Riahi-Belkaoui 2004; Richardson 2006).
This chapter seeks to address this gap, while focusing on the specific area of penalties for noncompliance. We test the following hypothesis.

Hypothesis 1:
More consistent administration of penalties that are set sufficiently high to deter noncompliance is a proxy for the type of political will required for higher levels of taxation.

Our contention is that countries that do this well provide political support to their tax administrations, resulting in higher tax-to-GDP ratios on average. In the next section, we outline our motivations for this hypothesis using data from PEFA assessments.

DATA AND ANALYSIS
Our data on penalties for noncompliance come from PEFA assessments in 112 predominantly low- and middle-income countries over the period 2005-15 (see annex 6A), specifically indicator 14, dimension 2 (hereafter PI-14(ii)) of the 2011 PEFA framework. 2 The advantages of using penalties for noncompliance as a proxy for political will are that penalties are considered a more functional measure of revenue administration than other dimensions; scoring on the relevant dimension has a more distinct relationship with revenue outcomes than other dimensions; scoring on the dimension is distributed more normally than other dimensions; and the relationship between dimension scores and revenue outcomes is less susceptible to reverse causality concerns than other dimensions. Some endogeneity concerns exist with respect to measurement error, and these concerns are discussed further below.
An important distinction to be made between the dimensions is whether they measure the form or the function of the revenue administration. In an analysis of the PEFA framework, Andrews (2011) distinguishes between de jure dimensions that measure form and de facto dimensions that measure function (see chapter 2 for further discussion). Table 6.1 outlines Andrews's categorization of the nine dimensions of the PEFA assessment concerned with revenue administration. Of the nine dimensions, he considers just four to be de facto measures of revenue administration, including PI-14(ii). In line with our hypothesis, we would expect that scoring on de facto measures would require political will for the revenue administration to be more functional. Therefore, we might expect to see stronger relationships between revenue outcomes and better scoring on these dimensions.
PEFA dimensions are measured on a scale from A to D, with As indicating the achievement of "good practice," Bs and Cs representing some progress toward good practice, and Ds representing lack of effort. As shown in table 6.2, to score an A on PI-14(ii), a country must show evidence that penalties are set sufficiently high to deter noncompliance and are administered consistently. In the guidance material (PEFA Secretariat 2012), assessors are expected to consider the following questions: Are there penalties for noncompliance with registration and tax declaration in existing legislation or current administrative procedures? If the answer is yes, are they sufficient to affect compliance, or are changes needed? How do the penalties work in practice? Are they enforced? Between its first assessment in 2008 and its second in 2015, Nepal moved from a C score to an A score on PI-14(ii). In the 2008 report, the most notable justification given for the C score was that penalties for noncompliance existed for most relevant taxes, but they were not always effective because of inconsistent administration (Nepal PEFA Secretariat 2008). In contrast, in the 2015 report, the most notable point made in favor of the A score was that Nepal's Inland Revenue Department investigated 373 cases of tax evasion in fiscal 2012 and found NPR 1.75 billion in payables (tax and fines). In fiscal 2013, it investigated 737 cases and found NPR 2.09 billion in payables (Nepal PEFA Secretariat 2015).
PI-14(ii) is one of nine dimensions that measure good practice in tax administration under the 2011 PEFA framework. PI-13, PI-14, and PI-15 each has three dimensions measuring the transparency of taxpayer obligations and liabilities, effectiveness of measures for taxpayer registration and tax assessment, and effectiveness in the collection of tax payments, respectively. Table 6.3 lists each of the dimensions by indicator.
Our data on tax collection come from the Government Revenue Dataset (GRD) of the International Centre for Tax and Development (ICTD) and the United Nations University World Institute for Development Economics Research (UNU WIDER). 3 The GRD provides the best coverage of revenue collection and its disaggregates for low- and middle-income countries. Of the 124 countries that carried out at least one PEFA assessment between 2006 and 2015, the GRD holds revenue time series for 112 (see annex 6A). For this chapter, we use taxes excluding social contributions as a percentage of GDP (hereafter tax-to-GDP ratio). The option to exclude social contributions is useful because social contributions exist in some countries but not in others. However, according to notes accompanying the GRD, for some countries they are not easily separated, raising concerns about potential measurement error. For the purposes of comparison with PEFA scores, we use a three-year moving average of the tax-to-GDP ratio throughout the chapter to reflect the fact that PEFA is a backward-looking assessment. Figure 6.1 shows the trends for the mean tax-to-GDP ratio by dimension score for PI-13, PI-14, and PI-15. The trends are indicative of the potential importance of some tax administration functions for increasing the tax-to-GDP ratio.

Table 6.2 (fragment): scoring criteria for PI-14(ii), scores C and D
C: Penalties for noncompliance generally exist, but substantial changes to their structure, levels, or administration are needed for them to have a real impact on compliance.
D: Penalties for noncompliance are generally nonexistent or ineffective (that is, set far too low to have an impact or rarely imposed).
Source: PEFA Secretariat 2011.
The evidence for the dimensions under PI-13 is mixed. There seems to be no relationship of note for PI-13(i) (clarity and comprehensiveness of tax liabilities) and PI-13(ii) (taxpayer access to information on tax liabilities and administrative procedures). A more distinct positive relationship is evident between the average tax-to-GDP ratio and PI-13(iii) (existence and functioning of a tax appeals mechanism), although its correlation coefficient is among the weakest (table 6.4). Similarly, we find mixed evidence for the dimensions under indicator PI-15. There is no clear relationship for PI-15(i) (collection ratio for gross tax arrears) 4 and PI-15(iii) (frequency of complete accounts reconciliation between tax assessments, collections, arrears records, and receipts by the treasury). The average tax-to-GDP ratio is higher for countries scoring an A on PI-15(ii) (effectiveness of transfer of tax collections to the treasury by the revenue administration), but no notable difference is evident between the average tax-to-GDP ratios associated with scoring a B, C, or D.
In contrast, PI-14(i) (controls in the taxpayer registration system) displays a clear trend of stepped increases in the average tax-to-GDP ratio along the PEFA scale from D to A and has the strongest correlation with the tax-to-GDP ratio (table 6.4). PI-14(ii) and PI-14(iii) (planning and monitoring of tax audit programs) 5 display less obvious trends and have weaker correlations.
The distribution of scores is also revealing (see figure 6.2). We observe more normal distributions for the dimensions under PI-14 as well as PI-13(i) and PI-13(iii), with most countries scoring a B or C. Most countries perform well on PI-13(ii) and PI-15(ii) and poorly on PI-15(i) (for the 92 countries where it was even possible to assess the dimension), whereas performance on PI-15(iii) is at the extremes, with most countries scoring either an A or a D. These findings fit with the discussion in chapter 2, namely, that some indicators measure form over function and are susceptible to isomorphic mimicry and gaming, with countries focusing on those measures that are easier to change in order to satisfy external funders.  puts forward evidence that countries tend to perform better on measures of de jure reforms (that is, legal and procedural changes) than on measures of de facto reforms (that is, actual changes in practice). Of the nine dimensions, he considers only PI-14(ii) and the three dimensions under PI-15 to be de facto reforms. This supports our hypothesis that PI-14(ii) serves as a good proxy for the political will required to improve tax performance.
PI-14(ii) is also less susceptible to critiques that the relationship with the tax-to-GDP ratio is endogenous because of simultaneity or reverse causality. Many of the reforms related to the nine dimensions are associated with having market-based economies and higher levels of income, which are also associated with higher tax-to-GDP ratios. For example, it is likely to be difficult to perform well on PI-14(i), which requires rather sophisticated links between government and financial market databases to score an A, unless the government can retain software engineers, who are often in short supply in low- and middle-income countries. Similarly, scoring well on PI-14(iii) likely requires the retention of a well-paid cadre of tax auditors, which is often not possible in lower-income countries. As a result, low- and middle-income countries frequently receive support to overhaul their tax administrations in the form of technical assistance, which might improve PEFA scores without improving tax performance. This is because there are limits to what can be achieved through technical assistance. With respect to PI-14(ii), it seems plausible that external advisers could assist with setting credible penalties for noncompliance, but they are unlikely to be able to do much about the administration of penalties in the absence of political will. Figure 6.3, which shows similar variation in scoring for PI-14(ii) across income levels, provides some evidence in this respect, in contrast to the distribution of scoring for other dimensions. As such, our contention is that the relationship between the tax-to-GDP ratio and PI-14(ii) is a more plausible measure of the political will required to improve tax performance than other measures, because politicians at all income levels may be motivated to raise more taxation or to stymie efforts to do the same.
The case against a causal interpretation is that the enforcement of penalties for noncompliance is expensive, and therefore higher scores can only be achieved in countries that have resources. However, the countries that score an A on PI-14(ii) are spread relatively evenly across income groups.
Potentially larger endogeneity concerns are measurement error and omitted variable bias. Concerns about measurement error apply to both our dependent variable (the tax-to-GDP ratio) and our independent variable (PI-14(ii)). As previously stated, our data for the tax-to-GDP ratio are from the GRD, which for 30 countries in our sample of 112 uses general rather than central government data on tax revenues. The justification for this is simple: ICTD and UNU WIDER use general government data where they are available, which is generally the case for larger or federal states, and are less concerned about using central government data for unitary or highly centralized states where local taxation is often negligible, particularly in smaller lower-income countries (Prichard, Cobham, and Goodall 2014). But PEFA assessments are carried out at the central government level, so this presents a potential problem for our hypothesis, unless we can assume that penalties for noncompliance are set and administered similarly at both the central and lower levels of government in nonunitary states. The ability to account for subnational revenues may indicate a certain level of coherence that makes this assumption plausible. Nevertheless, we consider this potential source of measurement error in more detail in the sections that follow. Potential sources of measurement error in PI-14(ii) include the bias of assessors and a mismatch between the report date and reporting period. Many PEFA assessments are self-assessments carried out by government officials themselves, including the example of Nepal cited above. Although the PEFA Secretariat provides quality assurance, some countries choose not to avail themselves of this offer. Unfortunately, the data set does not provide sufficient detail to distinguish between, or control for, self-assessments and more independent assessments.
There are also potential mismatches between the date of the assessment report in the PEFA data set and the actual time period covered due to publication lags. The example of Nepal's 2008 report is a case in point. Although the date in the database is 2008, the document clearly states that it covers the year ended 2005/06. These concerns as well as omitted variable bias are discussed in more detail in the next section.

ESTIMATION APPROACH
Our preliminary approach is to estimate the relationship between the tax-to-GDP ratio and penalties for noncompliance using ordinary least squares (OLS) in equation (6.1):

$$Y_i = \alpha + \beta X_i + Z_i \gamma + \varepsilon_i \tag{6.1}$$

where $Y_i$ is the tax-to-GDP ratio, $X_i$ is PI-14(ii) as measured by the PEFA assessment for country $i$, $Z_i$ is a matrix of country-level controls, and $\varepsilon_i$ is our error term; $\alpha$ is a constant, $\beta$ is the coefficient on PI-14(ii), and $\gamma$ is the vector of coefficients on the controls. Our control variables are based on the findings of similar studies on the cross-country determinants of tax collection (table 6.5). Variables from PEFA assessments enter the equation with an ordinal assignment of A = 4, B = 3, C = 2, and D = 1, which has been in common use in research papers since de Renzio, . For our sample of 112 countries, we use the latest PEFA assessment available over the period 2007-15 (table 6.6).
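The ordinal assignment and the cross-sectional OLS step in equation (6.1) can be sketched as follows. This is a minimal, dependency-free Python illustration and not the code used for the chapter: the function names are hypothetical, the data are invented for the example, and a single control (the share of agriculture in GDP) stands in for the full set of controls. In practice, a statistics package would normally be used and would also report standard errors.

```python
# Ordinal assignment described in the text: A = 4, B = 3, C = 2, D = 1.
SCORE_TO_NUMBER = {"A": 4, "B": 3, "C": 2, "D": 1}


def pefa_to_ordinal(scores):
    """Convert PEFA letter scores to their ordinal values."""
    return [float(SCORE_TO_NUMBER[s]) for s in scores]


def solve_linear_system(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    beta = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * beta[k] for k in range(row + 1, n))
        beta[row] = (aug[row][n] - known) / aug[row][row]
    return beta


def ols(y, x, controls):
    """Return [intercept, coefficient on x, coefficients on controls].

    `controls` is a list with one row of control values per country.
    Coefficients solve the normal equations (X'X) beta = X'y.
    """
    design = [[1.0, xi, *zi] for xi, zi in zip(x, controls)]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Invented example data: six countries, one control variable.
scores = ["A", "C", "B", "D", "B", "A"]
agriculture_share = [[10.0], [30.0], [20.0], [35.0], [25.0], [12.0]]
tax_to_gdp = [18.0, 9.5, 14.0, 7.0, 12.5, 17.5]

coefficients = ols(tax_to_gdp, pefa_to_ordinal(scores), agriculture_share)
print(coefficients)  # [intercept, PI-14(ii) coefficient, agriculture coefficient]
```

Treating the letter scores as equally spaced numbers is itself an assumption of this approach: it implies that a move from D to C is worth the same as a move from B to A.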
There is a standard approach to modeling tax performance using proxies for the tax base and the structure of the economy. Those proxies used most commonly for low-and middle-income countries for this purpose are the share of agriculture in GDP as a proxy for the size of the informal economy, international trade as a share of GDP as a measure of the openness of the economy, and GDP per capita (Morrissey et al. 2017). We expect tax performance to be negatively associated with the share of agriculture in GDP because the sector is difficult to tax and, in the case of subsistence agriculture, does not generate taxable income. In contrast, trade taxes are easier to collect, so we expect a positive association between the trade share of GDP and tax performance. GDP per capita, a proxy for the level of economic development, is expected to be positively correlated with tax performance, but other studies have often found the opposite (Morrissey et al. 2017).
When modeling the determinants of tax collection in low- and middle-income countries, natural resources are often considered. For example, Gupta (2007) uses dummy variables for oil-producing and mineral-exporting countries. We control for the share of natural resource rents in GDP, but the expected direction of the relationship is ambiguous. Natural resource government revenues are included in revenue-to-GDP ratios, but not in tax-to-GDP ratios. However, taxation on the companies that generate these revenues is included. Therefore, there is the potential for a negative association where natural resource rents deter tax effort, but also a positive association where taxation on the activities of extractive industries mechanically generates more tax revenue (Bornhorst, Gupta, and Thornton 2009). In keeping with the literature on tax morale, we use the Worldwide Governance Indicators (WGI) for voice and accountability as a proxy for democracy. We expect democracy to be positively correlated with the tax-to-GDP ratio in line with the literature on fiscal contracting.
We also employ dummy variables for regions as defined by the World Bank and include a dummy variable to account for the presence of 24 small island developing states (SIDS) within the sample (table 6.7) and for Botswana, Lesotho, Namibia, and Swaziland (BLNS), which are members of the Southern African Customs Union and subject to the peculiarities of its revenue-sharing formula (Basdevant 2012). To account for potential measurement error arising from the gap between the assessment date in the data set and the period covered by the report, the dependent and control variables in table 6.6 enter the equation as a three-year moving average of the year of the assessment and the two preceding years. To account for potential measurement error associated with the use of general government data, as discussed above, we employ a dummy variable for federal states for which the tax-to-GDP ratio in the ICTD and UNU-WIDER data set is for general government.
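The three-year moving average described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `three_year_average`, the dictionary layout, and the sample values are hypothetical and not taken from the data set. It also assumes that years with no observation are simply skipped, which the text does not specify.

```python
from statistics import mean


def three_year_average(series, assessment_year):
    """Average a variable over the assessment year and the two preceding years.

    `series` maps year -> value. Years with no observation are skipped
    (an assumption; the chapter does not say how gaps are handled).
    """
    window = range(assessment_year - 2, assessment_year + 1)
    values = [series[year] for year in window if year in series]
    if not values:
        raise ValueError("no observations in the three-year window")
    return mean(values)


# Hypothetical tax-to-GDP ratios (percent of GDP) for one country.
tax_to_gdp = {2005: 11.0, 2006: 12.0, 2007: 13.0, 2008: 14.0}

# An assessment dated 2008 uses the average of 2006, 2007, and 2008.
print(three_year_average(tax_to_gdp, 2008))
```

Averaging over the assessment year and the two preceding years means that a report dated later than the period it covers, as in the Nepal example, still draws on data from the years it actually describes.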
Our secondary approach is to control for omitted variable bias. Omitted variable bias is a concern for cross-sectional estimation using OLS in equation (6.1) if tax-to-GDP ratios are determined by unobservable national characteristics, such as culture. If they are, our OLS estimates of the coefficient for PI-14(ii) will be biased. However, if these unobservable variables are fixed over time, then estimation over time allows us to remove this bias. Because the PEFA data set contains repeat assessments, we can exploit variation over time by estimating equation (6.2).

This model estimates the relationship between changes in the tax-to-GDP ratio (Y) and changes in PI-14(ii) (X) in country i over a period of time t, also controlling for changes in our other controls (Z) and country fixed effects (FE_i). Our data set has two time periods: the year of the first assessment and the year of the most recent assessment. But it is unbalanced; that is, countries have undertaken their first and most recent assessments at different times (table 6.8).
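Equation (6.2) can be written in a form consistent with this description. It is a reconstruction from the surrounding definitions, so the coefficient symbols β and γ and the error term ε_it are assumed notation; only Y, X, Z, and FE_i come from the text.

```latex
% Equation (6.2): panel specification with country fixed effects.
% beta, gamma, and epsilon_{it} are assumed symbols.
\begin{equation}
Y_{it} = \beta X_{it} + \gamma' Z_{it} + FE_i + \varepsilon_{it} \tag{6.2}
\end{equation}

% With two observations per country, differencing the two periods removes FE_i:
\begin{equation*}
\Delta Y_i = \beta \, \Delta X_i + \gamma' \Delta Z_i + \Delta \varepsilon_i
\end{equation*}
```

The differenced form shows why the model can be read as relating changes in the tax-to-GDP ratio to changes in PI-14(ii): any country characteristic that is constant across the two assessments drops out.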
Finally, because the estimators in equations (6.1) and (6.2) assume continuous rather than ordinal variables, we also estimate both equations with the independent PEFA variable, PI-14(ii), entering as a series of dummy variables in order to obtain a better estimate of the relationship with the tax-to-GDP ratio.

RESULTS
Table 6.9 shows the results from estimating equation (6.1) using OLS. The sample covers the most recent PEFA assessment for 112 countries spanning the period from 2007 to 2015. Our estimates show a positive relationship between PI-14(ii) and the tax-to-GDP ratio that is statistically significant at the 5 percent level or better across all specifications. Our estimated coefficient implies that countries scoring one grade higher on PI-14(ii) have tax-to-GDP ratios that are 2 percent higher on average (columns 1 to 4). When we add PI-14(i) as a control (column 5), this effect declines to 1.3 percent. This stands to reason, given that we would expect the impact of penalties for noncompliance to wane as registration controls are improved. Because PEFA scores are ordinal and OLS estimation assumes continuous variables, we also estimate equation (6.1) using dummy variables for PI-14(ii) (see annex 6B, table 6B.1). These estimates indicate that A scores drive the results for PI-14(ii) in table 6.9. Countries scoring an A have tax-to-GDP ratios that are 2.7 percent higher on average than countries scoring a B, C, or D, and this estimate is statistically significant at the 5 percent level.
Our estimates of controls for the structure of the economy are largely in line with a priori assumptions and previous findings. The estimated coefficients for the size of the agriculture sector, our proxy for the informal economy, and for natural resource rents are negative, as expected. Similarly, our estimate for the trade share in GDP is positive, as expected. Our estimated coefficient for the voice and accountability score, our proxy for democracy, is also positive, as expected. Moreover, all of these estimates are statistically significant at the 10 percent level or better across all specifications. A counterintuitive result is our estimate of the coefficient for income per capita, which is consistently both negative and large and statistically significant at the 10 percent level in our full specification, although this is a common finding in the literature. The estimated coefficients for all three of our dummy variables (BLNS countries, SIDS, and federal states using general government data) are positive, but only the BLNS dummy is statistically significant in our full specification.
The size and statistical significance of estimates using data from PEFA assessments are susceptible to being driven by a small number of observations at the fringes. In annex 6B, table 6B.3, we run the same estimation procedure for smaller sample sizes to test the robustness of our estimates and find that they remain statistically significant after dropping the top and bottom 5-10 percent of tax-to-GDP ratio observations. Another concern is with our dependent variable data. For 30 countries in our sample of 112, the data in the GRD are for general rather than central government. PEFA assessments are carried out at the central government level. This matters for our hypothesis if penalties and fines for noncompliance are enforced differently at the subnational and national levels. In annex 6B, table 6B.4, we find that our results for PI-14(ii) are robust to (a) dropping the 30 countries with general government data and (b) using central government data from the IMF Government Finance Statistics database for 19 of those countries.
As previously noted, various endogeneity issues are associated with this cross-sectional analysis. Omitted variable bias is a concern if tax-to-GDP ratios are determined by unobservable national characteristics. To control for this potential source of bias, we estimate equation (6.2) for a sample of 61 countries (see annex 6B, table 6B.3) that had repeat PEFA assessments. The results in table 6.10 show a statistically significant relationship between PI-14(ii) and the tax-to-GDP ratio over time. In our full specification in column 3, a one-score improvement in PI-14(ii) is associated with a 1.2 percent increase in the tax-to-GDP ratio that is statistically significant at the 5 percent level. Most of the estimated coefficients of our other controls are not statistically significant. The shares of agriculture, trade, and natural resource rents take on the expected signs, but their estimated coefficients are quite small. In contrast to our cross-sectional models, our estimated coefficient for income per capita is positive and large and statistically significant at the 10 percent level in our full specification. Surprisingly, our democracy control, the Worldwide Governance Indicators voice and accountability (WGIVA) score, has a negative estimated coefficient. Our estimated coefficient for PI-14(i) is also counterintuitively negative and statistically insignificant, which contrasts with our cross-sectional results. This may simply be because, in contrast to PI-14(ii), fewer countries have made progress toward an A grade between assessments on PI-14(i) (see figure 6.4). Another reason may be that improvements in de jure indicators do not reflect the political will necessary to increase revenue outcomes in line with our hypothesis. The counterintuitive signs on some of our other controls may be the result of the small sample size, in terms of both the number of countries and the length of the time series.
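The two ways in which PI-14(ii) enters the estimations above, as an ordinal number (A = 4, B = 3, C = 2, D = 1) and as a series of dummy variables, can be illustrated with a short Python sketch. The helper names `to_ordinal` and `to_dummies`, the sample grades, and the choice of D as the omitted reference category are assumptions made for illustration; the text does not state which category is omitted.

```python
# Ordinal assignment described in the text: A = 4, B = 3, C = 2, D = 1.
ORDINAL = {"A": 4, "B": 3, "C": 2, "D": 1}


def to_ordinal(grade):
    """Convert a PEFA letter grade to its numerical value."""
    return ORDINAL[grade]


def to_dummies(grade, reference="D"):
    """Encode a grade as 0/1 indicators, omitting the reference category.

    Omitting one category avoids perfect collinearity with the intercept.
    Using "D" as the reference is an assumption for illustration.
    """
    return {f"score_{g}": int(grade == g) for g in ORDINAL if g != reference}


grades = ["A", "C", "D"]  # hypothetical PI-14(ii) scores
print([to_ordinal(g) for g in grades])
print(to_dummies("A"))
```

The ordinal coding imposes equal spacing between grades, whereas the dummy coding lets each grade have its own estimated association with the tax-to-GDP ratio, which is how the effect of an A score can be isolated.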

DISCUSSION
Overall, our results demonstrate a positive and statistically significant cross-country relationship between the credible enforcement of penalties for noncompliance, as measured by PI-14(ii), and DRM, as measured by the tax-to-GDP ratio, while controlling for a range of other determinants. Our cross-sectional results for 112 countries show that a one-score improvement on the PEFA scale is associated with a tax-to-GDP ratio that is 1.3 percent higher on average, while achieving a "good practice" A score on PI-14(ii) is associated with a tax-to-GDP ratio that is 2.7 percent higher on average. We also address a major endogeneity concern associated with this type of estimation by controlling for unobservable country-specific factors that might influence both a country's PI-14(ii) score and its tax-to-GDP ratio. We do this by including country fixed effects for an unbalanced panel of 61 countries.

Figure 6.4 Changes in PI-14(ii) and PI-14(i) scores between assessments
Our results show that a one-score improvement on the PEFA scale is associated with a 1.2 percent increase in the tax-to-GDP ratio. Although we find that improving from a C or D to a B or A score on PI-14(ii) is associated with a statistically significant improvement in the tax-to-GDP ratio of 2.2 percent, we fail to find a statistically significant effect for improving to a "good practice" A score. This may be because our sample period spans the Great Recession.
Our hypothesis and analysis of the underlying data make a plausible case for a causal interpretation of these findings. However, these results are not without important caveats. Our estimates are based on an unbalanced panel of observations over the period from 2005 to 2015, making interpretation of our coefficient for PI-14(ii) potentially less straightforward; moreover, our panel sample is relatively small and therefore lacking in variation for the independent variable. Although we have addressed issues of measurement error pertaining to the use of general government data, we cannot assuage these concerns fully. Similarly, we cannot account for the potential that the collection of penalties itself is driving increases in the tax-to-GDP ratio, although it seems unlikely. Furthermore, we cannot account for potential bias within the measurement of PI-14(ii) itself arising from self-assessment. Although the PEFA Secretariat provides detailed field guidance to assessors, it is hard to imagine that the assessment is not biased by the judgment of assessors because of the limited availability of data across tax categories and levels of government.
Further research is likely required before developing concrete policy prescriptions. This effort might include attempting to address some of the caveats noted above, taking a more qualitative look at the enforcement of penalties for noncompliance in a sample of countries, and conducting quantitative analysis using the tax administration databases of revenue administrations in low- and middle-income countries. The latter has become a burgeoning industry for experiments in quasivoluntary compliance but has thus far been relatively silent on more coercive measures of compliance. For example, shedding more light on whether the prescribed measure is the size of the penalty or the credibility of enforcement would be informative for both donors and revenue administrations themselves.
Nevertheless, our empirical findings combined with the theoretical underpinnings we have laid out suggest that PI-14(ii) may provide a much better indicator of the commitment of low- and middle-income countries to DRM under the Addis Ababa Financing for Development Agreement. Compared with the existing practice of simply observing revenue-to-GDP ratios, PI-14(ii) likely requires genuine domestic political commitment. Whereas modern tax systems focus more on voluntary compliance and risk management, donors interested in supporting DRM should not lose sight of the fact that coercive measures may also be an important indicator of the political will necessary to improve revenue outcomes, particularly in lower-income countries. Unfortunately, however, the indicator was not retained in the updated PEFA 2016 framework and appears not to have been assimilated into the Tax Administration Diagnostic Assessment Tool (TADAT). So, if the credible enforcement of penalties for noncompliance is to be monitored going forward, some other institution will have to lead the process of data collection.

This project, based on the Public Expenditure and Financial Accountability (PEFA) data set, researched how PEFA can be used to shape policy development in public financial management (PFM) and other major relevant policy areas such as anticorruption, revenue mobilization, political economy analysis, and fragile states.
The report explores what shapes the PFM system in low- and middle-income countries by examining the relationship between political institutions and the quality of the PFM system. Although the report finds some evidence that control of the legislature by multiple political parties is associated with better PFM performance, the report finds the need to further refine and test the theories on the relationship between political institutions and PFM.
The report addresses the question of the outcomes of PFM systems, distinguishing between fragile and nonfragile states. It finds that better PFM performance is associated with more reliable budgets in terms of expenditure composition in fragile states, but not aggregate budget credibility. Moreover, in contrast to existing studies, it finds no evidence that PFM quality matters for deficit and debt ratios, irrespective of whether a country is fragile or not.
The report also explores the relationship between perceptions of corruption and PFM performance. It finds strong evidence of a relationship between better PFM performance and improvements in perceptions of corruption. It also finds that PFM reforms associated with better controls have a stronger relationship with improvements in perceptions of corruption compared to PFM reforms associated with more transparency.
The last chapter looks at the relationship between PEFA indicators for revenue administration and domestic resource mobilization. It focuses on the credible use of penalties for noncompliance as a proxy for the type of political commitment required to improve tax performance. The analysis shows that countries that credibly enforce penalties for noncompliance collect more taxes on average.
ISBN 978-1-4648-1466-2 SKU 211466