Building State Capacity: What Is the Impact of Development Projects?

Although research has established the importance of state capacity in economic development, less is known about how to build that capacity and the role of external partners in the process. This paper estimates the impact of a typical development project designed to build state capacity in a low-income country. Specifically, it evaluates a multilateral development bank project in Tanzania, which incentivized investments in local state capacity by offering grants conditional on institutional performance scores. The paper uses a difference-in-differences methodology to estimate the project impact, comparing outcomes between 18 project and 22 non-project local governments over 2016–18. Outcomes were measured through two rounds of primary surveys of nearly 500 local government officials and nearly 3,000 households. Over the course of the project, measured state capacity improved in project areas, but due to comparable gains in non-project areas, the project’s value-added to change in state capacity is estimated to be zero across all the dozens of relevant variables in the surveys. The data suggest that state capacity is evolving in Tanzania through endogenous changes in trust and legitimacy in the country rather than from financial incentives offered by external partners.


Introduction
State capacity is essential for economic development. 1 The hallmark of countries characterized as "developing," which distinguishes them from "developed" countries, is the lack of some basic features of state capacity that prosperous societies possess (Besley & Persson, 2011; Herbst, 2000; Migdal, 1988). For example, countries with limited state capacity may lack the ability to raise taxes to finance basic public services, enforce contracts, or protect property rights. International development agencies have thus invested in capacity building projects to strengthen state institutions in developing countries. Although research has established the importance of state capacity in economic development, little is known about how state capacity is built and what role external development partners play in this regard (Besley & Persson, 2009; Acemoglu et al., 2015; Fergusson et al., 2020). 2 This paper addresses that gap, providing quantitative results from the evaluation of a typical World Bank project designed to strengthen the capacity of the local state.
Since the 1990s, the World Bank has invested in various projects across regions to strengthen the capacity of the local state (Independent Evaluation Group, 2008). The core design features are shared by multiple World Bank projects that aim to strengthen local governments across regions. Hence, while this study reports the impacts of such a project in Tanzania, the findings may have implications for how a major global development agency approaches state capacity building projects around the world. The essential design feature is to provide incentives to local governments to undertake verifiable actions that the project regards as essential for state capacity. These actions include showing records of urban planning, internal audits, establishment of tender boards for public procurement, recruitment of key state personnel, adoption of technologies for property tax management, and increase in local own-source revenues, among others. The size of fiscal grants to local governments under the project is conditional on their performance in achieving institutional outcomes, as assessed by an independent third party. On average, the fiscal grants represented about 15 percent of total revenues for local governments in 2011. 3 In addition to the performance grants, these projects also include mechanisms to help local governments with advice and technical assistance on how to plan, audit, procure, recruit personnel, use technology, increase revenue collection, etc. In other words, local governments receive technical support in all the areas of institutional performance on which they are being assessed.
1 This statement is both axiomatic in economics and one that has recently been supported by rigorous evidence. Modern economics starts with the assumption that state capacity exists to protect property rights, maintain law and order, enforce contracts, and collect the tax revenues needed for these public services (Dincecco & Katz, 2016; Acemoglu, García-Jimeno, & Robinson, 2015; Besley & Persson, 2011). One strand of empirical work finds a robust positive correlation of tax to GDP ratios, and other measures of state capacity, with economic development (Acemoglu, 2005; Besley & Persson, 2009; Dincecco & Katz, 2016). Another strand finds persistent differences in income and prosperity across places which have a history of strong state institutions versus those that do not (Bandyopadhyay & Green, 2016; Gennaioli & Rainer, 2007; Michalopoulos & Papaioannou, 2013).
2 Qualitative discussion of whether World Bank projects build state capacity is provided in de Janvry & Dethier (2012) and in Stern, Dethier, & Rogers (2005). Some previous studies have examined the cross-country relationship between World Bank projects and subsequent state capacity, finding a suggestive positive relationship for highly rated projects (Hanson & Sigman, 2019). A recent paper, Erman et al. (2021), uses a fixed effects identification strategy and administrative data on municipal revenues to examine the impact of a project in Mozambique. They find that project municipalities increased own revenue collection over time more than non-project municipalities. In contrast, as we report below, we not only find no impact of the project in Tanzania on local tax payment and attitudes, but also find a striking illustration of how context matters for project design. While the project in Tanzania had initially targeted own-source revenues as a key component of the performance grant, it dropped this component after property tax collection was re-centralized. Nevertheless, we find evidence of increasing willingness to pay property taxes, a measure of state capacity, in both project and non-project urban local governments in Tanzania.
This paper provides evidence on whether institutional outcomes targeted by the capacity building project in Tanzania are different across local governments that received the project and those that did not. Since the selection of project local governments is not amenable to randomization in these types of programs, 4 any evaluation design has to contend with both observable and unobservable initial differences between treatment (those receiving the World Bank capacity building project) and comparison (those not receiving the project) Local Government Areas (LGAs). To address these issues, we use a difference-in-differences method to estimate the project impact by comparing outcomes between project and non-project LGAs over time. The difference-in-differences method does not require treatment and comparison areas to be statistically identical at the start of the program, but it does require that the two groups would develop at similar rates in the absence of the project. This is often called the "parallel trends" assumption.
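As a hedged illustration of the comparison just described, the simplest two-group, two-period form of the estimator and of the parallel trends assumption can be written as follows. The notation is introduced only for this sketch and may differ from the formal specification in Section 4: $\bar{Y}_{g,t}$ is assumed to denote the mean outcome for group $g$ (project LGAs $P$ or comparison LGAs $C$) in survey round $t$ (baseline $0$ or endline $1$), and $Y_{t}(0)$ the outcome that would be observed in round $t$ in the absence of the project.

```latex
\hat{\delta}_{\mathrm{DiD}}
  = \left(\bar{Y}_{P,1} - \bar{Y}_{P,0}\right)
  - \left(\bar{Y}_{C,1} - \bar{Y}_{C,0}\right),
\qquad
\mathbb{E}\left[\,Y_{1}(0) - Y_{0}(0) \mid P\,\right]
  = \mathbb{E}\left[\,Y_{1}(0) - Y_{0}(0) \mid C\,\right].
```

Under the second equality, the change observed in comparison LGAs stands in for the change project LGAs would have experienced without the project, so any baseline gap in levels between the two groups cancels out of $\hat{\delta}_{\mathrm{DiD}}$.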
The evidence comes from 40 Local Government Areas of Tanzania over the period 2016-2018, of which 18 LGAs had been selected in 2013 to receive the World Bank capacity building project. The project team indicated at the outset that the selection of these 18 LGAs was purposive, based on negotiation between the World Bank and the Government of Tanzania (World Bank, 2012). Thus, we knew at the outset that these 18 LGAs were likely to be systematically different from other LGAs in the country on both observable and unobservable variables. To minimize these differences, we worked closely with the government and the World Bank project team to identify comparable LGAs (22 LGAs were identified to serve as comparators) which would not be receiving the project, complementing this identification process with propensity score matching. The difference-in-differences strategy we adopted was supported at the outset by evidence that in the years preceding the project, project and non-project LGAs had similar growth in local "own source" revenues, a key measure of state capacity. This similar growth in the absence of the intervention is the fundamental assumption behind the difference-in-differences estimation strategy.
We then implemented two rounds of surveys across these 40 LGAs (18 receiving the project and 22 comparators) of 474 government officials and 2,998 citizens, two years apart, during a period in which the project was well established, had disbursed funds, and was expected to have undertaken capacity building activities. The first survey was undertaken in February 2016, and the second in April 2018. The surveys included a variety of questions about the capacity of local government officials to plan, recruit personnel, manage public funds, raise revenues and deliver services, and the experience of citizens in receiving these services.
Across all the survey measures, we find no evidence of significant differences between the 18 project and 22 non-project LGAs over the two years during which the project disbursed its funds and enabled technical assistance to build local state capacity. Specifically, many indicators improved in project LGAs, but they improved at similar rates in non-project LGAs. Some of the effects are precisely estimated at zero, based on the responses of government officials. For example, on questions of whether master urban plans have been updated (and can be verified by the interviewer), whether internal audits have been undertaken (and can be verified by the interviewer), whether budgets are executed in a timelier fashion, and efforts towards own revenue collection, there is similar improvement over time in both project and non-project LGAs. That is, non-project LGAs were able to make comparable improvements even in the absence of the incentives and training available under the project.
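The pattern described above, in which an indicator improves in project LGAs yet the estimated project impact is zero, can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the records are invented binary responses (for example, 1 if an official could show the interviewer an updated plan), not survey data, and the function names are hypothetical. The sketch uses simple cell means rather than the paper's actual estimation procedure.

```python
from statistics import mean

# HYPOTHETICAL binary outcomes, invented for illustration only (not survey data).
# Each record is (group, survey_round, outcome).
records = [
    ("project", 2016, 0), ("project", 2016, 1), ("project", 2016, 0), ("project", 2016, 1),
    ("project", 2018, 1), ("project", 2018, 1), ("project", 2018, 0), ("project", 2018, 1),
    ("comparison", 2016, 0), ("comparison", 2016, 0), ("comparison", 2016, 0), ("comparison", 2016, 1),
    ("comparison", 2018, 1), ("comparison", 2018, 0), ("comparison", 2018, 0), ("comparison", 2018, 1),
]


def cell_mean(group, survey_round):
    """Mean outcome for one group in one survey round."""
    return mean(y for g, r, y in records if g == group and r == survey_round)


def single_difference(group):
    """Change over time within one group (endline minus baseline)."""
    return cell_mean(group, 2018) - cell_mean(group, 2016)


def diff_in_diff():
    """Change in project LGAs net of the change in comparison LGAs."""
    return single_difference("project") - single_difference("comparison")


print("project change:   ", single_difference("project"))
print("comparison change:", single_difference("comparison"))
print("diff-in-diff:     ", diff_in_diff())
```

In this invented example the project group improves by 25 percentage points, but so does the comparison group, so the difference-in-differences estimate is zero even though the two groups start from different levels. A single-difference reading of the project group alone would attribute the whole improvement to the project.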
We discuss alternative explanations for our results, beyond the possibility that the project added little to the ongoing evolution of state capacity in Tanzania. Perhaps the two most important of these alternative possibilities are the following. First, the project's effects might have been concentrated at the early stages of implementation, before the first survey we undertook in February 2016. Second, the project LGAs' performance might have played the role of demonstrating to non-project LGAs the importance of investing in their own capacity or, more generally, copying project LGAs' practices. Addressing the first point, we show that the project's own scoring of local government performance does not indicate that all improvement across the project life-cycle happened before our baseline survey. Furthermore, approximately half of the total disbursement of the project happened between baseline (or first round of survey) and endline (or second round), so the period between our surveys covers a substantial amount of the project cycle. Addressing the second point is more difficult: there may have been learning between LGAs that we cannot capture in our data. However, this interpretation would cast doubt on the incentive design of the project as the driver of improvements in state capacity, since the non-project LGAs did not receive these incentives but nevertheless improved over time.
These results raise further questions. What is driving the changes observed in state capacity in local governments that did not receive the project? What can we learn from this about how state capacity evolves over time? What is the role of external partners in the process? What can external partners do differently than the project we have evaluated to bring additional value to within-country processes of endogenous change? The rich survey data we gathered allow us to offer some answers to these questions, along with ideas for how external partners might innovate for greater impact.
Prior research steers us towards looking at the role of citizens for an explanation of how improvements came about in non-project local government areas. Available research has argued that state capacity is built over time as societies become more complex and demand public goods that only a state can provide, such as defense against external aggressors (Tilly, 1992; Besley & Persson, 2009). The driving force of state capacity in this research is citizen willingness to pay taxes to a state they regard as legitimate (Besley, 2020; Fergusson et al., 2020; Weigel, 2020). Consistent with this research, the survey data show improvements over time in citizens' reports of actual payment as well as willingness to pay taxes so that local governments can develop their capacity. That is, the survey data suggest that state legitimacy to collect taxes has improved over time, as reflected in greater reports of compliance with taxes. The pattern of citizen responses also exhibits improvements in trust that local governments will deliver services and listen to citizen demands. Importantly, there is no difference in these survey measures of increasing trust in and legitimacy of the local state across the project and non-project areas. The data thus suggest that improvements in state capacity in Tanzania stem from endogenous changes in trust and legitimacy in the country rather than from financial incentives offered by external partners.
This interpretation can be illustrated by examining a particular project component which targeted processes of oversight by and accountability to citizens. One-fifth of the performance score of LGAs under the project is allocated to indicators of consultation with citizens and disclosure of budget information for accountability (World Bank, 2012). In fact, this accountability and oversight component is the one where almost all project LGAs, 17 out of the 18, achieved the highest possible score by the time of their performance assessment in -17 (World Bank, 2018b). Yet, we find no difference between project and non-project LGAs in the 2016 survey (as we might expect if project communities had already improved dramatically and non-project communities had not), nor in changes between 2016 and 2018, in household responses to questions about knowledge of public budgets and consultation by the local governments. That is, when measuring citizens' self-reported knowledge of and engagement with local governments' initiatives, we do not find evidence that the project improved citizen oversight and accountability. Citizens in non-project areas reported similar increases over time in knowledge about local government activities, and project areas showed no increases in self-reported indicators of citizen participation (e.g., whether they contacted any government officials or participated in meetings).
The evidence of a lack of value-added of an external project to endogenous processes of change within the country has linkages with several different strands of the literature on state capacity. First, it adds to the growing literature on state capacity in economic development (Acemoglu, 2005; Besley & Persson, 2009; Besley & Ghatak, 2005; Acemoglu et al., 2015; Dal Bó et al., 2013; Muralidharan et al., 2016; Fergusson et al., 2020; Besley, 2020; Bisin, 2020; Bowles, 2020; Papaioannou, 2020). Research examining how the current high income countries of the world built state capacity has concluded that the impetus came from growing demand in society for public goods such as defense against external aggressors, and for municipal infrastructure during the Industrial Revolution (Tilly, 1992; Besley & Persson, 2009; Lizzeri & Persico, 2004). In the case of low income, developing countries, research has found variation within and across countries in historical institutions of state capacity which continue to have persistent impact on contemporary outcomes, long after those formal institutions have disappeared (Bandyopadhyay & Green, 2016; Gennaioli & Rainer, 2007; Michalopoulos & Papaioannou, 2013; Dell & Olken, 2020; Dell et al., 2018; Dell, 2010; Lowes et al., 2017). 5 These persistent effects of history suggest that state capacity takes time to build because it involves not just physical or concrete investments in recruiting personnel, collecting taxes and enforcing compliance, but also the evolution of norms within societies (World Bank, 2016c; Khemani, 2019).
Second, the paper provides an empirical test relevant to prior qualitative critiques of development aid (Bourguignon & Gunning, 2018; Andrews et al., 2013, 2017; World Bank, 2017). Critics argue that development agencies focus on building formal institutional capacity in the image of developed countries' state institutions, which may result in developing countries "looking like a state" but lacking real state capabilities (Pritchett et al., 2013). The evidence we find can be regarded as consistent with this critique, although the project did try to go beyond transplanting formal institutions, and into areas of citizen oversight and accountability. In general, the paper contributes micro-empirical evidence to a large cross-country literature on aid effectiveness (Rajan & Subramanian, 2007; Bourguignon & Sundberg, 2007; Bourguignon & Platteau, 2015; Brautigam & Knack, 2004; Knack, 2001).
Third, the paper links to a growing body of evaluations (many of them randomized controlled trials, or RCTs) of policy interventions on how to build state capacity. For example, one study provides evidence on how to successfully recruit state personnel (Dal Bó et al., 2013) and another shows how technology can be used to better manage state finances and establish a "leakage-free" payment infrastructure (Muralidharan et al., 2016). Indeed, because of growing evaluations in this field, there is more evidence available to policymakers about what concrete policy actions to pursue than ever before (Banerjee & Duflo, 2012). The open questions pertain to the incentives of policymakers who have the power to take up such evidence and make policy choices and investments in state capacity on its basis (Hjort et al., 2019). This paper examines whether external development agencies can create these incentives through their projects, grants and loans. Our results of no difference in changes over time between project and non-project local government areas suggest that, at least in this context, the financing incentives provided by external partners did not add significant value beyond processes of change already unfolding in the country.
Fourth, the paper links to research on decentralization or the role of devolving powers to locally elected governments (Bardhan & Mookherjee, 2000; Faguet, 2003, 2014; Devarajan et al., 2009; Khemani, 2015). State building across many developing countries, especially those afflicted by conflict, has focused on locally elected leaders who may have information about and standing in their communities to develop trust and legitimacy (World Bank, 2011; Myerson, 2011). The project we evaluate in Tanzania is part of a large portfolio of lending and grant-making by international development partners to strengthen locally elected governments (Independent Evaluation Group, 2018). The results we find suggest that national governments are able to invest in building the local state even in the absence of international aid incentives and conditionalities, as research has found in other countries (Acemoglu et al., 2015). The results in the institutional context of Tanzania further link to research on how states governed by a single national political party or the military, with concentration of power at the center, choose to invest in locally elected governments to build their capacity to deliver services at the frontlines (Martinez-Bravo, Mukherjee, & Stegmann, 2017; Ferraz, Finan, & Martinez-Bravo, 2020).
Finally, the paper links to a large literature on institutions and development. Reviews of this literature all point to political institutions as key to bringing about state capacity by shaping the incentives of powerful policy makers to scale up concrete actions, such as those that have been shown to be effective in previous evaluations (World Bank, 2017, 2016c; Dal Bó & Finan, 2016; Khemani, 2019; Olken & Pande, 2013). The comparative advantage or value-added of external partners has been explored in this research and yields ideas for innovation and experimentation in development projects (World Bank, 2016c; Devarajan et al., 2009). More evaluation of projects in different institutional contexts is needed to understand the role of external partners in building state capacity. Previous research has also shown the importance of considering the legitimacy of locally originated programs versus those led by external partners (Dal Bó et al., 2010). More experimentation or innovation is needed, drawing insights from recent advances in research on norms, trust and legitimacy as the driving forces of state capacity (Khemani, 2020; Besley, 2020).
The paper is organized as follows. Section 2 discusses the concept of state capacity, how it is measured in the research literature, and how it links to the design of capacity building projects pursued by external development partners. Section 3 describes the data that were gathered to evaluate the impact of such a project, using the opportunity available in Tanzania. Section 4 presents the methodology and results. Section 5 considers various explanations for these results, puts forward our interpretation, and discusses associated caveats. Section 6 concludes by offering some recommendations for innovation in capacity building projects.

State Capacity: Theory and Measurement
Economics has long assumed the existence of basic state capacity to protect property rights, enforce contracts, and maintain law and order, as the fundamental institutional conditions which are needed for market-led economic development. This recognition of state capacity is above the debate about the size of government or where it should intervene. 6 The role of state capacity in economic development has come to the fore in recent research motivated by the observation that higher income countries have systematically higher tax to GDP ratios than lower income countries (Besley & Persson, 2009; Acemoglu, 2005). As Besley & Persson (2009) observe: "A striking feature of economic development is an apparent symbiotic evolution of strong states and strong market economies." 7 State capacity has been measured in economic research primarily as the ratio of government tax revenues to gross domestic product (Besley & Persson, 2011). The tax to GDP ratio serves as a summary statistic of sorts of the ability of governments to raise revenues to invest in the protection of property rights and the establishment of law and order, what sociologists and political scientists have termed a "monopoly over violence" (Weber, 1946; Anter, 2020). The reach of the state into local areas is also measured by the ability of local governments to collect revenues and administer state policies (Acemoglu et al., 2015). The process of building state capacity involves investments by national governments in the ability of local government agencies to administer policies (Dal Bó et al., 2013; Muralidharan et al., 2016).
The driving force of the process of building state capacity has been identified as citizens' demand for public goods that only a state can provide; state capacity in developed countries has been explained as a result of citizens being willing to pay the taxes needed to finance public goods and the state institutions that would provide them (Tilly, 1992; Besley & Persson, 2009; Lizzeri & Persico, 2004). In the contemporary world where some countries are clustered around high income and high state capacity and others at the opposite end of low income and low state capacity, international development partners have assumed a role in building capacity in developing countries (Jones et al., 2006; Levy & Kpundeh, 2004). Furthermore, over the past three decades, the practice of international development has moved from financial transfers and policy advocacy as the primary way of effecting development to increasingly focusing on building institutions and country ownership (de Janvry & Dethier, 2012). Both practice and research have revealed that when state institutions are weak, which is too often the case, developing countries are unable to put external aid to effective use in growing their economies (Rajan & Subramanian, 2007). Since the 1990s, the World Bank, the largest development bank in the world, has designed projects to build state capacity both at national and sub-national levels of local government (Independent Evaluation Group, 2008). However, there is little research available on the impact of these capacity building projects.
The Tanzania Urban Local Government Strengthening Program (ULGSP) we examine contains essential features of how the World Bank has approached local state capacity building in its projects. For example, a project with the same features was undertaken in India and is described in World Bank (2016a). At the core of these programs are fiscal incentives based on assessments of institutional performance, termed the Annual Performance Assessments (APAs), for which the project mandates guidelines, scoring methodology, and the engagement of an independent third party (typically an accounting and audit firm) for its execution. The project also includes facilities for local governments to access training or advice on how to improve the outcomes measured by the APAs.
The rationale behind the design is that these performance grants will strengthen the incentives of local governments to undertake activities, and access the training needed, to increase their scores, which in turn will be equivalent to building state capacity, as defined by the indicators in the APAs.
The specific indicators in the APA in the Tanzania project are:
(i) Urban planning system: documentation and indicators of having a General Planning Scheme in place.
(ii) Fiduciary or financial management system: documentation of internal audit reports undertaken by a fully constituted Internal Audit Committee, and scores on a system of public procurement.
(iii) Infrastructure management: documentation and verification of the utilization of financing to deliver physical infrastructure such as roads and sanitation services.
(iv) Accountability and oversight by citizens: verifying public disclosure of information about local budgets and convening of public meetings.
In addition to these four, a fifth area of own source revenue generation (from the property taxes assigned to local governments) was targeted in the original evaluation design:
(v) Local property tax system: increase in own source revenue from the collection of local property taxes.
The goal was to enable local governments to gradually move away from dependence on fiscal grants to becoming self-reliant on own source revenue, of which local property taxes are typically the most important. However, in July 2016, three years into the program, the national government recentralized property tax collection, shifting it from local governments to the Tanzanian Revenue Authority (TRA). In response, after 2016 the program decided not to assign any scores to performance in generating own source revenues.
This evaluation draws on data from both the APAs and our independent surveys administered to government officials and households. Each of these sources has value. Changes in the APAs document whether project LGAs improved on the specific indicators that the project targeted. The survey data play three key additional roles. First, many of the survey indicators provide an independent check on data gathered via the APAs. 8 Second, the surveys with government officials complement the APA data with broader measures of governance quality. Third, the surveys with households in the LGAs provide a further measure of governance improvements, which is whether the public observes improved local governance.
A fundamental question for the impact evaluation is whether local governments that did not receive the project, and thus access to the incentives and training under the project, made similar improvements in the areas measured under the APAs. One obvious way to measure these outcomes in non-project local governments would be to engage the same firms that are scoring the project local governments on the APA to undertake the same process of data gathering and scoring for non-project LGAs. However, due to the intensity of data gathering required for the APAs under the project, the same auditing firm could not administer APAs in comparison LGAs under the same timeline as project LGAs. Given both this and the additional cost, the evaluation draws on the survey data for the difference-in-differences analysis. We also report single-difference results using the APAs in project LGAs (see Table B1).
Through extensive field testing and in collaboration with the government to ensure we measured indicators of state capacity that were locally valued, we designed the surveys to gather the following types of data grouped into modules under the different categories covered by the APAs. In each module, we aimed to include at least some questions in accordance with the project guidelines on how to undertake the APAs, so that the measures are as close as possible to the outcomes incentivized by the project. In the list below, we discuss the consistency between the APA questions/scoring guidelines and the questions in our survey.

1. Urban planning systems:
Government officials were asked the following questions that are likely to be informative about institutional capacity for urban planning.
• Whether a General Plan had been approved since 2015, and conditional upon an affirmative answer, whether the official in charge of planning could show interviewers the plan. This set of questions follows the guidelines issued by the project to the firms that were contracted to undertake the APAs. Specifically, the APA scoring guide assigns points if LGAs are compliant with steps including the plan preparation process, data analysis, and plan adoption and approval.
• The respondents' estimate of the percentage of unplanned settlements in the LGA
• The respondents' view of whether the LGA experiences delays in receiving guidelines for preparing their budget. This question and the following one directly measure one of the key outcomes targeted by the project: to reduce delays in communication between the President's Office for Regional and Local Governments (PO-RALG) and the local governments.
• The respondents' view of whether disbursements from central government are timely
• The respondents' view of whether the budget was executed in accordance with expected results. APAs review planning and utilization of annual plans for development budget.
To illustrate how such data can be useful in evaluating the impact of the project on institutional capacity, we can examine whether project local governments are more likely than local governments not targeted by the program to have a General Plan and to be able to produce it; whether government officials in project local governments estimate that a smaller share of settlements are unplanned; and whether they are more likely to report no delays in communication and disbursements from the central government, and that budgets are executed in accordance with expected results.
Citizens were asked the following questions that are likely to be informative about their assessment of the quality of urban planning.
• Extent to which they think the local council guarantees good use of revenues (standard 5-point scale used for such survey questions)
• Extent to which they think the local council makes good investment plans
• Whether they have observed problems with local government

2. Fiduciary or financial management system:
Government officials were asked the following questions to assess both whether de jure internal audit systems are in place, and their views of how effective internal audits are in monitoring the use of funds.
• Whether all of the positions on the internal audit committee are filled. This is consistent with the measure in the APAs on whether audit committees are in place and operational.
• Whether the internal auditor can show the interviewer copies of internal audits, and how many of these. APAs review audit reports from previous fiscal years.
• Views on whether internal audits are independent of political interference
• Views on whether internal audits are effective in monitoring the use of funds
• Whether there has been turnover in the membership of the internal audit committee between the two survey rounds of 2016 and 2018. Again, APAs measure whether audit committees are in place and operational.
• Whether there is a tender board in place, what its composition is, and the frequency of its meetings. APAs review the existence and functioning of tender boards.
Citizens were asked the following questions to measure their assessment of local corruption:
• The extent to which they trust the local council
• The extent to which they think councilors are honest in handling public money
• Whether they think most, some, or none of council members are corrupt
• Their experience with bribe payments for various services

3. Infrastructure management:
Government officials were asked the following questions relevant to their experience of managing infrastructure project implementation:
• Whether they think the number of engineers is adequate for the LGA's needs
• Whether they think payments to suppliers were carried out on time
• Whether measures have been taken to publicly disseminate information about the physical progress on infrastructure investments. APAs review whether development plans' progress is disseminated to the general public.
In the absence of data on the quality of infrastructure investments, drawn directly from measuring that quality at source (such as by taking samples of roads as in one project in Indonesia (Olken, 2007), or engineering assessments of sanitation infrastructure), the evaluation relies on citizen reports of the performance of governments in delivering urban infrastructure services: • Citizens' assessments of whether the local government maintains roads well • Whether the government keeps the community clean • Whether the government manages land well • Whether the governments maintains health standards well • Ease of access to a variety of services, such as building permits • Assessments of whether the neighborhoods in their ward are connected by paved roads, have garbage removed regularly, etc.

4. Accountability and oversight by citizens:
The following questions were asked of government officials:
• Whether there exists a formal mechanism for citizen feedback. APAs review whether procedures for dissemination and public participation are in place for the preparation and implementation of annual development plans, environmental and social impact assessments, and resettlement action plans.
• Whether there exists an official system to handle grievances
• Whether the respondent has handled grievances personally
• In how many public meetings infrastructure investments have been discussed. Again, APAs review whether development plans' progress is disseminated to the general public.
The following questions were asked of citizens:
• Whether the local council provides information on budgets
5. Local property tax system:
Government officials were asked the following questions:
• Whether some form of incentives to citizens to increase tax revenues had been tried.
• Whether the respondent pays property tax on residential/commercial buildings. This question may be a particularly good measure of the ability of the state to collect local taxes, regardless of whether the collection is administered by the national revenue authority or the local government. A government report on the challenge of domestic revenue mobilization from property taxes has identified the recalcitrance of local politicians, who tend to be property owners, as a problem (Government of the United Republic of Tanzania, 2013). Hence, whether local officials have paid their property taxes is an indicator of the fiscal power of the state, the key measure of state capacity in the research literature.
• What percentage of tax invoices is collected.
• The percentage of total properties registered.
• Whether the respondent agrees there are opportunities to raise local revenues
• Whether the respondent agrees that the challenge to raising local revenue is political
The project, and so the APAs, dropped the component on local property tax systems in 2016 because the national government re-centralized the collection of property taxes to the Tanzania Revenue Authority.
Citizens were asked the following question:
• Willingness to pay taxes, as measured by the extent to which they agree with statements along the following lines: citizens should pay taxes so the local government develops; better to pay high taxes to get more services; not paying taxes is wrong and punishable; tax code and collection is fair.
These types of questions are the focus of current research on state capacity (Besley, 2020; Papaioannou, 2020; Bisin, 2020; Bowles, 2020).
An agnostic approach to measuring the impact of the project involves examining all the variables described above as equally likely to be important for state capacity and allowing the data to reveal where there are any differences in changes over time across project and non-project local governments. As we will discuss in detail below, a striking finding from the data is that only one of the many variables listed above exhibits greater improvement over time in project than in non-project local governments. We will also focus on the outcome emphasized in the current research literature as a measure of state capacity: the ability to raise revenues. Although the project dropped the component on local property tax systems in 2016 because the national government re-centralized the collection of property taxes to the Tanzania Revenue Authority, if state capacity is increasing in the country over time, we would expect to see this captured in the questions listed above on willingness to pay taxes to contribute to the development of the local state.
In addition to the variables discussed so far, our surveys included additional modules to gather data on the administrative capacity of local government officials, such as whether they have access to computers, can write emails and memos, and have received any training recently. We also estimate the impact of the project on these measures of administrative capacity of local personnel, in line with how the literature has examined local state capacity (Acemoglu et al., 2015).
Finally, other modules in the survey drew upon advances in research on the importance of informal institutions, such as culture or norms and beliefs in complex organizations, which shape their capacity to perform (Bloom & Van Reenen, 2010; Rasul & Rogger, 2018). Khemani (2019) provides a review focusing on public sector organizations. These modules included the following types of questions posed to local government officials:
• Extent to which officials feel peer pressure to perform their tasks well
• Extent to which officials take pride in their work
• Extent to which they trust their peers
• Extent to which they share values with their peers
In sum, the outcome variables on which impact is evaluated are numerous and rich, with careful surveys of two types of respondents: local government officials and households. These outcomes include all the areas targeted by the project in its assessment of institutional performance: urban planning, fiduciary systems, infrastructure management, revenue generation, and accountability and oversight by citizens. In addition, we examine impact on measures of state capacity emphasized in the growing research literature: ability to raise revenues; administrative capacity of state personnel; and culture and norms of performance in the organizations of local government.

Data sources and timeline of data collection
Two rounds of data were collected as part of the impact evaluation. We present a timeline of the surveys' implementation and the APAs in Figure 1. The first survey was completed in February 2016, when some project interventions had been implemented in most LGAs. As such, the first survey can be considered as capturing the situation at early stages of project implementation. Likewise, the second survey was conducted in April-May 2018, when not all project activities had been completed, and therefore can be considered as capturing late stages of project activity. Specifically, the temporal distance between the first and second survey rounds is 26 months, which is around 45% of the temporal distance between the 1st APA and the 6th APA (≈60 months). During this period, about half of the project funding commitment was disbursed. 10 As part of the project, APAs were also collected at regular intervals.

Panel data
The data used for this study are built into two panel datasets (i.e., two datasets with data from the same LGAs and/or households at two different times): one of households and one of government officials. Households and government officials were interviewed in all 40 original evaluation LGAs (both in the first and second rounds) and in 12 newly added LGAs (only in the second round). The 40 LGAs consisted of 18 project LGAs and 22 comparator LGAs. The process of selecting comparator LGAs, drawing on a mix of propensity score matching and local expert surveys, is detailed in Appendix A. Since the main goal of this paper is to assess changes in outcomes between the first and second rounds of surveys, for the remainder of the text we will only refer to the 40 LGAs originally present in the first-round survey, ignoring the additional LGAs added at the second round.
The Household panel dataset includes 5,996 observations across the two waves: 2,998 households were interviewed in each wave. The sampling of households in the first survey round was performed in the following manner: in each of the 40 LGAs, 3 wards were selected 11 and then one enumeration area (based on the census) was sampled in each ward. All households living in that enumeration area were listed and 25 were randomly selected to be interviewed, for a total of 3,000 households. Around 80% of those were tracked in the follow-up survey and re-interviewed. For those not found, a new household was interviewed as a replacement. 12
The Government Officials panel dataset includes 948 observations: 474 individuals in each of the waves. In the first round, the following 12 government officials were targeted to be interviewed in each LGA: Mayor/Council Chairperson; Council Director; Council Internal Auditor; Council Economist; Council Human Resources officer; 3 Elected Ward Councilors; and 3 Ward Executive Officers. Government officials were assured their answers would be kept confidential, and in 95% of cases the respondent was alone for the entire interview. At the second survey, over 70% of officials still worked at the same LGAs and were re-interviewed even if working in a new position. For the remaining officials, the interview was conducted with the individual currently occupying the position of the respondent in the initial survey. 13 For most of these officials, PO-RALG oversees all decisions related to appointments, transfers, promotions, etc. As PO-RALG also implemented the project and commissioned the survey, this could create some bias in government officials' responses. However, this potential bias would affect both treatment and comparison groups.
For both households' and government officials' surveys, our main difference-in-differences specification compares average changes in outcomes in project vs. non-project respondents, pooling together respondents that participated in both waves and replacements for those who could not be tracked. Our main estimates, in other words, treat the panel dataset as two separate cross-sectional surveys. 14

Outcomes of interest and aggregate indices
As discussed above, the survey includes indicators of institutional capacity among officials and citizens in several dimensions that can be directly related to the five core systems targeted by the project: the urban planning system; the fiduciary or financial management system; infrastructure management; accountability and oversight by citizens; and the local property tax system. Accordingly, we structured the outcomes of interest, in both the household and government official surveys, in those five areas.
11 Wards were selected in a manner that permits a balanced distribution between areas that received infrastructure projects and areas that did not. Specifically, the ward in which the LGA headquarters is based was always selected and then one ward with a recent infrastructure project and one without a recent infrastructure project were randomly sampled. No sampling weights were included in the analysis as our ward sampling strategy should produce a sample representative of the LGA level.
12 Attrition of original households was higher in ULGSP LGAs: 20% of households were replaced in those regions vs. 15% in control LGAs.
13 The share of officials still holding the same position at endline was not statistically different between ULGSP and control areas (67% in ULGSP vs. 62% in control LGAs).
14 Restricting the sample to only those households/officials that were interviewed in both surveys and estimating a fixed-effects panel model does not significantly change the results.
In what follows, we will present results using questions that aim to evaluate responses of state capacity in each of those areas. In addition to presenting results for an exhaustive set of individual outcomes, we also construct indices that aggregate those outcomes for each system. The construction of these indices follows closely the methodology in Anderson (2008). For each index, we first code all components so that higher values indicate a "better" outcome, then standardize all transformed variables and finally construct the index as a weighted average of components. 15 The indices for each survey and their underlying components are as follows:
• Government officials' survey:
1. Staff Capacity Index (6 items): Responses on staff skills in using computers and software.
2. Management Index (5 items): Responses on management willingness and ability to attract, retain, and promote staff.
3. Performance Culture Index (8 items): Responses about staff shared values, commitment to deliver, pride in serving, and ability to withstand political pressure.
• Household survey:
1. Urban Planning Systems Index (3 items): Responses about local government use of revenues and planning.
2. Fiduciary Responsibility Index (8 items): Responses about government honesty and bribe payments.
3. Infrastructure Management Index (12 items): Views on government capacity to maintain infrastructure and access to state services.
4. Accountability & Transparency Index (21 items): Views on local government transparency, participation in political meetings, ease of access to information.
5. Views on Taxation and Fees Index (11 items): Normative views on taxation and fees (e.g., is it wrong not to pay taxes?) and positive views about whether individuals are punished for not paying fees and taxes.
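The three steps described above for building each index (orient, standardize, average) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for the paper: the component names and toy responses are hypothetical, standardization here uses the full-sample mean and standard deviation, and the default weights are equal. Anderson (2008) proposes weights based on the inverse covariance matrix of the components, and the weights used in the paper are not specified in this section, so the `weights` argument is left open.

```python
from statistics import mean, stdev


def standardize(values):
    """Convert raw values to z-scores using the sample mean and standard deviation."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        raise ValueError("component has no variation and cannot be standardized")
    return [(v - mu) / sd for v in values]


def summary_index(components, higher_is_better, weights=None):
    """Build a summary index from several survey items.

    components: dict mapping item name -> list of responses (same respondent order).
    higher_is_better: dict mapping item name -> bool; False items are sign-flipped
        so that higher values always indicate a "better" outcome.
    weights: optional dict of item weights (assumption: equal weights if omitted).
    """
    names = list(components)
    if weights is None:
        weights = {name: 1.0 for name in names}
    z_scores = {}
    for name in names:
        sign = 1.0 if higher_is_better[name] else -1.0
        z_scores[name] = standardize([sign * v for v in components[name]])
    total_weight = sum(weights[name] for name in names)
    n_obs = len(components[names[0]])
    return [
        sum(weights[name] * z_scores[name][i] for name in names) / total_weight
        for i in range(n_obs)
    ]


# Hypothetical toy data: five respondents, two items.
toy_components = {"trust_council": [1, 2, 3, 4, 5], "bribes_paid": [4, 3, 2, 1, 0]}
toy_direction = {"trust_council": True, "bribes_paid": False}
toy_index = summary_index(toy_components, toy_direction)
```

The sketch does not handle missing responses; in practice each item would need a rule for respondents who skipped it.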
In Table 1 we present correlations across the summary indexes from the household survey on government performance in different areas. All the measures are broadly positively correlated: positive responses or evaluations in one dimension are usually accompanied by positive evaluations in similar dimensions. It is also clear, however, that these measures are not perfectly correlated, i.e., they likely capture different dimensions of citizens' interaction with the state.
The strongest correlation is between the Infrastructure Management index, which reflects quality of services, and the Accountability and Transparency index, which focuses on ease of access to information and transparency, at 0.33.
We also include data on the relationship between the survey measures and the APAs, in Appendix B. Since APAs were only collected at ULGSP LGAs, the sample is restricted to 18 observations. While for the household survey we observe overall positive correlations between outcomes measured at baseline, endline and changes over time using survey and APA indicators (Figure B3 and Figure B4), results for the government official survey are more mixed, with overall smaller correlations in magnitude and both positive and negative relationships (Figure B5 and Figure B6). We should treat these correlations with caution, not only due to the limited number of observations but also because in some indicators the amount of variation is quite limited. For example, almost all ULGSP LGAs had scored the maximum amount in the Oversight component of the APA by our first-round survey, so variation between our first and second surveys is zero for the majority of them.

Evidence of successful project implementation
An important initial question regards the implementation of the program: does the survey provide evidence that government officials were aware of the program and that they received training and funding as expected? The answer is a definitive yes: in the first round, officials in project LGAs recognize the project as the main source of capacity training and budget support for infrastructure. In Figure 2, panel A, we show that among officials in project areas, 65% report the project as the agency most supporting capacity building vs. 12% in comparison LGAs. For comparison LGAs, the main agencies reported as supporting capacity building are sector ministries (30%) and community-based organizations (18%).
Regarding budget support for infrastructure, panel B of Figure 2 shows that 50 percent of officials in treated LGAs choose the project as the main source for recent increases in budget for infrastructure vs. 0 percent in comparison. Twenty-six percent of project officials and 46 percent of non-project officials indicate own revenues as the most important source of finance, whereas 41 percent of officials in non-project LGAs indicate government grants as the most relevant source.
As discussed above, the first survey round was fielded after the 3rd APA, when government officials would have been aware of the project. The data above suggest that officials were not only aware of the existence of the program, but also report the project as being the most relevant source of capacity building and new finance for infrastructure. At the same time, panel B of Figure 2 shows that project LGAs still substantially rely on government transfers and own source revenues to finance infrastructure, which suggests that a redirection of central government resources from project to non-project LGAs is not happening, or at least not on a large scale.

Evidence of systematic observable differences between project and comparison LGAs
As previously discussed, while the project was not randomly assigned to LGAs, the government and the evaluation team selected comparison LGAs that were likely to be similar to the ones receiving the program across a range of observable indicators. 16 This matching happened before the fielding of the surveys; however, it is not obvious that respondents in project and comparison LGAs would be identical in the main indicators given the targeting of the program. The key assumption behind our estimation strategy is not that project and comparison LGAs be identical but rather that they be developing at similar rates in the absence of the project (often called the "parallel trends" assumption). In this section we characterize the project and non-project LGAs and provide evidence supporting the parallel trends assumption.
We perform this comparison for an exhaustive range of indicators in Table 2 and Table 3, for households and officials, respectively. The results suggest that respondents in project areas are, as expected, consistently different from those in non-project areas. In both tables, the first and second columns present the average value of the indicator for project and non-project areas, respectively. The third column presents the difference between those means, while the fourth presents the p-value of a t-test of mean equality. The last column reports the number of respondents with non-missing values for each test.
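One row of such a balance table can be sketched as follows. This is only an illustration under stated assumptions: the function name and the toy samples are hypothetical, the test is an unequal-variance (Welch) comparison of means, and the two-sided p-value uses a normal approximation to the t distribution, which is reasonable only for large samples. The sketch also ignores any clustering of respondents within LGAs, which the paper's tests may account for.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def balance_row(project, comparison):
    """Return the five columns of one balance-table row for a single indicator."""
    diff = mean(project) - mean(comparison)
    # Standard error of the difference in means, allowing unequal variances.
    std_err = sqrt(variance(project) / len(project) + variance(comparison) / len(comparison))
    t_stat = diff / std_err
    # Assumption: normal approximation to the t distribution for the p-value.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return {
        "mean_project": mean(project),
        "mean_comparison": mean(comparison),
        "difference": diff,
        "p_value": p_value,
        "n": len(project) + len(comparison),
    }


# Hypothetical binary responses (1 = agrees), 100 respondents per group.
toy_project = [1] * 62 + [0] * 38
toy_comparison = [1] * 47 + [0] * 53
row = balance_row(toy_project, toy_comparison)
```

With the actual survey sample sizes, the same 15 percentage point gap would yield a much smaller p-value than in this 100-per-group toy example.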
The first panel of Table 2 documents that respondents are demographically different in ULGSP and comparison areas: those in project areas come from smaller households; are 10 p.p. more likely to be literate and 15 p.p. more likely to have completed more than upper secondary education; and are 30 p.p. less likely to work in agriculture. Overall, these demographic characteristics suggest that households in project areas are wealthier than those interviewed in non-project LGAs. This is confirmed by answers on asset ownership: respondents in project areas are 4 times as likely to own a car as those in comparison areas, almost 3 times as likely to own a TV, 6 times as likely to own a computer and more than 2 times as likely to own a refrigerator.
Not only are demographic characteristics very different between project and comparison areas, but responses on government performance and accessibility to services are also consistently better in project areas. While 62 percent of respondents in those areas agree that the government maintains the roads well, only 47 percent of respondents in comparison areas agree. The gaps are smaller but still meaningful and statistically significant for responses about health standards, cleanliness and land management. Respondents in project areas are also consistently more positive about ease of accessibility to services: they are 8 percentage points and 12 percentage points more likely to agree that it is easy to access building permits and household services such as water, respectively.
Differences among government officials are less stark, but they still consistently suggest that project LGAs are better performers than comparison LGAs (Table 3). When asked whether the master plan is updated, for example, almost half of government officials in project areas answer positively vs. 27 percent in comparison areas. Possibly directly related to the efforts of the project, almost all officials in those areas say the LGA has a formal plan for capacity building vs. 62 percent in comparison areas; and conditional on having a plan, less than half of respondents in comparison areas could show the plan vs. three-quarters of respondents in project LGAs. While these differences are statistically significant, we do not observe significant differences in other indicators such as reporting of budget preparation delays, budget execution or share of unplanned settlements. On the topic of fiduciary systems, project areas also perform better: respondents are 16 percentage points more likely to report having the internal auditor position filled, and they are more likely to agree that the internal audit office is independent and effective in monitoring. We do not observe significant differences, however, on most indicators under the topics of infrastructure management and accountability and transparency.
Taken together, these results suggest that systematic differences exist in the initial survey between respondents in comparison LGAs and those in LGAs receiving the project. Given the project's targeting rules, this is an expected result. Our identification strategy, a difference-in-differences approach, is well placed to estimate the causal effect of the program under preexisting differences between treatment and comparison groups. We assess the main identification assumption behind a difference-in-differences approach by checking for parallel trends in state capacity levels before the start of the program. Specifically, we examine trends before the first round of data collection on a key measure of local state capacity, LGAs' own source revenues. An important limitation here is that we only have own source revenue data at the LGA level for the years 2007 to 2011, and data are available for the entire period of 5 years for only 36 out of the 40 LGAs (16 project and 20 comparison).
In Figure C1 we plot the mean per-capita revenues for both project and comparison LGAs. (We normalize for LGA population size in 2011.) The trends follow a very similar path between 2007 and 2009, starting to show some divergence after 2009. Given the paucity of data, we cannot run a more thorough exercise which would include an event study (including the periods before and after the start of the program). However, when we run a simple statistical test of the difference between the mean per-capita revenue in project and comparison groups, we cannot reject the null of equality of the means. 17 To further explore the parallel trends assumption, we also use nighttime lights data. Nighttime lights data are highly granular and are available over a long period of time, and so enable us to complement the pre-trend analysis of revenues. We focus on the period 2000 to 2018. 18 However, while a good proxy of local economic activity, 19 nighttime light data might not be a good proxy for local state capacity. With this important caveat in mind, we show that trends in nighttime light are remarkably similar over the period 2000 to 2013 (which is when the program came into effect) between project and comparison areas (see panel (a) in Figure C1). This finding is confirmed by the event study we run and show in panel (b) of the same figure. A similar conclusion can be reached if we look (again in Figure C1) at the time period from 2000 to 2016 (which is the year of the first round of data collection) or if we focus on the treatment period of 2016 to 2018.
To produce Figure C1 we focus on the 3 wards that were sampled in each LGA for this study. 20 In Appendix C we include graphs on different samples which summarize different ways of computing the average nighttime lights for each group (see Figure C2, Figure C3, and Figure C4). Overall, these further explorations suggest that trends are not drifting apart between the evaluation groups. At the same time, there is some indication that comparison areas might be catching up with treatment areas in terms of local economic activity over time.

Empirical Strategy
As discussed above, our goal is to assess whether LGAs that received the project improved more between the first and second survey rounds when compared to non-project LGAs. For both the government officials and household surveys, we use responses at the individual level and, in order to formally test for different improvement rates, estimate a difference-in-differences model of the form:

$$Y_{ilt} = \alpha + \gamma \, \mathbb{1}\{\text{SecondRound}\}_{ilt} + \delta \, \mathbb{1}\{\text{ULGSP}\}_{ilt} + \beta \, \big( \mathbb{1}\{\text{SecondRound}\}_{ilt} \times \mathbb{1}\{\text{ULGSP}\}_{ilt} \big) + X_{ilt}'\theta + \varepsilon_{ilt}$$

where $Y_{ilt}$ is some outcome of interest of individual $i$ in LGA $l$ and period $t$; $\mathbb{1}\{\text{SecondRound}\}_{ilt}$ is an indicator that takes the value 1 if the respondent belongs to the second survey round in 2018, and 0 otherwise; $\mathbb{1}\{\text{ULGSP}\}_{ilt}$ is an indicator that takes the value 1 if the respondent resides in a project LGA, and 0 otherwise; $X_{ilt}$ are individual/household characteristics used as controls (welfare index, household size, age, gender, marital status, and education levels); and $\varepsilon_{ilt}$ is an error term. 21 To evaluate the hypothesis that project LGAs were differentially affected by the program, we formally test whether $\beta = 0$; the coefficient on the interaction between the indicator for 2018 and project LGAs gives us the difference between changes in project vs. non-project LGAs between 2016 and 2018 (i.e., this is the difference-in-differences coefficient). In our results, we also often present the estimates for the coefficient $\gamma$, which indicates the change in outcomes between 2016 and 2018 for the non-project LGAs.
In estimating this difference-in-differences model, we can only interpret the resulting estimates as causal if, in the absence of the intervention, the trends in outcomes for comparator and project LGAs would have been the same. In other words, for any given outcome (trust in local council, for example), our assumption is that were the project not to be implemented, the average change in that indicator for LGAs that actually received the project would have been the same as that observed in LGAs that did not receive the program. Some evidence on the parallel trends assumption was provided in the section above.
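To make the roles of the interaction coefficient and the second-round coefficient concrete, the specification can be sketched as an ordinary least squares regression on pooled respondent-level data. The sketch below is illustrative only: it omits the individual and household controls, computes no standard errors (which in practice would likely need to account for clustering within LGAs), and uses hypothetical toy data and function names rather than the survey data.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(design, y):
    """Return OLS coefficients by solving the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def did_estimate(rows):
    """rows: list of (outcome, second_round, project) tuples with 0/1 indicators.

    Returns the intercept, the second-round coefficient (gamma), the project
    coefficient, and the interaction coefficient (beta, the DiD estimate).
    """
    design = [[1.0, float(s), float(p), float(s * p)] for _, s, p in rows]
    y = [float(outcome) for outcome, _, _ in rows]
    intercept, gamma, project_effect, beta = ols(design, y)
    return {"intercept": intercept, "gamma": gamma, "project": project_effect, "beta": beta}


# Hypothetical toy data: both groups improve by 0.25 between rounds,
# so beta is zero (up to floating-point error) and gamma is 0.25.
toy_rows = (
    [(y, 0, 0) for y in (1, 0, 0, 0)]    # comparison, first round: mean 0.25
    + [(y, 1, 0) for y in (1, 1, 0, 0)]  # comparison, second round: mean 0.50
    + [(y, 0, 1) for y in (1, 1, 0, 0)]  # project, first round: mean 0.50
    + [(y, 1, 1) for y in (1, 1, 1, 0)]  # project, second round: mean 0.75
)
estimates = did_estimate(toy_rows)
```

Without controls, the interaction coefficient equals the difference between the two groups' changes in cell means, which is why the toy data, where both groups improve by the same amount, produce an estimate of zero.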

Results
We now present the main results of the paper: did LGAs receiving the project present a different trend in indicators of state capacity when compared to the comparator areas? We highlight the trends within project areas, then present how the comparator areas perform, and finally compare the performance between treatment and comparator areas to explore the impact of the program.

How did outcomes change within ULGSP LGAs over time?
We start by presenting how the responses of households about institutional capacity changed in project areas. Many indicators are presented in Table 4, Table 5 and Table 6. Columns (1) and (2) present average responses among households in project LGAs, in the 2016 (first round) and 2018 (second round) surveys, respectively, while column (5) presents the difference in means.
Overall, responses improved across the board. In Table 4, we see that the share of households reporting that the local council makes good use of revenues increased from 44% to 66%, for example, while the share saying that local governments make good investment plans increased by 21 p.p. Responses about government performance in delivering services such as road maintenance and cleanliness have also improved, as have responses about ease of access to building permits, household services and medical treatment. Table 5 reports indicators related to responses of transparency and accountability. The shares of respondents agreeing that the local council provides information on budget, allows participation, consults other actors before decisions, and handles complaints well have all increased by over 10 p.p. For all these indicators, less than 60 percent of respondents agreed with the statements by the second survey, but the performance increased significantly in the two years between surveys. It is also worth noting that direct political participation, by contacting officials or participating in meetings, is low and does not seem to be increasing: less than one in five respondents ever contacted a village official, a number that remained unchanged between surveys, and only 40 percent ever participated in village meetings.
Finally, Table 6 presents results related to responses on taxation and preferences over public goods and government decisions. We do observe an increase in the share of respondents reporting that citizens should pay for electricity and that not paying for electricity is wrong and punishable, but only half of respondents say that not paying taxes is wrong and punishable, a number that does not change in 2018. We do see, however, a large increase in the number of individuals agreeing that the tax system is fair.
Beyond households in project areas, both objective measures and views reported by government officials also improved. In Table 7, the share of respondents reporting an up-to-date master plan increased from 46% to 69%, for example. It is also remarkable that, despite starting at a high level, the share of project areas with a formal plan for capacity building attained 100 percent by the second survey, most likely a direct result of the intervention. The shares of respondents agreeing that the internal audit office is independent and that it is effective both increased significantly between the two rounds.
Overall responses about capacity building activities also improved, as did opinions about staff capacity and norms: the shares reporting that employees trust one another, share a strong set of values and take pride in their duties increased by 34 percentage points, 19 percentage points, and 23 percentage points, respectively. This broad range of improvements over time is consistent with the fact that APA measures also improved over time for project LGAs (Table B1).

How did outcomes change within comparison LGAs over time?
The previous section documented significant improvements in the responses of households and officials about state capacity in project areas, as well as in objective measures such as having master plans up to date and auditor positions filled. We cannot, however, jump to the conclusion that these changes were caused by the program: they might have happened even in the absence of the intervention, if state capacity was on a trajectory of improvement. To assess whether impacts can be attributed to the project, we now present what happened in LGAs that did not receive the program.
The results are presented in the same Table 4, Table 5 and Table 6. Columns (3) and (4) present the average for each indicator among non-project respondents, for the first (2016) and second (2018) rounds, respectively. The difference in means is presented in column (6).
Column (6) suggests that overall improvements in responses were also observed among households and officials in comparator LGAs. Despite remaining at lower levels than those in project areas, the share of respondents that agree that the local council makes good use of revenues increased from 38% to 57%, for example, and the share that believe the government makes good investment plans increased from 42% to 64%. Responses about government service delivery also improved, as did opinions on ease of access to services: the share of individuals reporting easy access to medical treatment, for example, increased by 15 p.p. Some indicators did deteriorate, notably the share of individuals reporting that councilors are honest in handling public money and the share reporting easy access to waste collection. But overall, responses about government performance and ease of access improved across the board.
That was also true, as reported in Table 5, for a wide range of indicators on accountability of government (such as access to information on the budget and handling of complaints). The share of respondents reporting that it is easy to find out what taxes they need to pay doubled to 16%, as did the share responding that it is easy to find out how the LGA spends revenues. Objective measures of political participation, like contacting officials or reporting the existence of village meetings, recorded small decreases.
Finally, as in project villages, Table 6 shows that normative responses about paying taxes and electricity also improved between the two survey rounds: the share of individuals that agree that citizens should pay taxes so local government develops increased from 41% to 50%, and the share who think it is wrong not to pay for electricity improved by 9 p.p. Similar improvements were observed in the survey of government officials, presented in Table 7 and Table 8. The share of respondents affirming the LGA has a plan that is up to date improved from 27% to 44%; the share reporting a plan approved in the last 2 years increased by 26 p.p.; and the share of LGAs with formal capacity building plans increased by 30 p.p. Improvements were also observed in responses about budget delays and execution, agreement with the independence of internal audits, and responses on both staff capacity and staff morale.

The impact of the project on household responses and government official reports of institutional quality
In order to estimate the causal effect of the project on areas that received it, as discussed above, we use areas that did not receive the program as counterfactuals. Under the assumption that these areas are valid counterfactuals (i.e., they would have followed similar trajectories in the absence of the program), assessing whether the project had an effect on the outcomes of interest is equivalent to examining whether areas that received the program had a differential change in outcomes, when compared to the comparison areas. The simplest (and most transparent) way to make this difference-in-differences estimation is to compare columns (5) and (6) in Tables 4 through 8: were the changes in project areas larger than those in comparison areas? Simple visual inspection suggests that, overall, this is not the case: project LGAs do not seem to consistently outperform comparison areas, which often presented larger improvements.
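As a worked illustration of the column (5) versus column (6) comparison, the sketch below computes a raw difference-in-differences for one indicator, the share of households saying the local council makes good use of revenues, using the round means quoted in the text (44% to 66% in project areas, 38% to 57% in comparison areas). The function name `diff_in_diff` is a hypothetical helper, and the calculation ignores the controls and clustered inference used in the regressions, so it is only a back-of-the-envelope check.

```python
def diff_in_diff(project_pre, project_post, comparison_pre, comparison_post):
    """Raw difference-in-differences from four group means (percentage points)."""
    change_project = project_post - project_pre            # analogue of column (5)
    change_comparison = comparison_post - comparison_pre   # analogue of column (6)
    return change_project - change_comparison


# Means quoted in the text for "local council makes good use of revenues".
did = diff_in_diff(project_pre=44, project_post=66,
                   comparison_pre=38, comparison_post=57)
print(f"Change in project areas:    {66 - 44} p.p.")
print(f"Change in comparison areas: {57 - 38} p.p.")
print(f"Difference-in-differences:  {did} p.p.")
```

The raw gap of a few percentage points is small relative to the gains both groups recorded, which is the pattern the visual inspection describes.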
To formally test whether changes in project areas were different from changes in comparison areas, we estimate equation 1  LGAs. The right panel presents the same for coefficient β, and it is the difference-in-differences coefficient: it represents the differential change in outcome among project respondents, when compared to non-project respondents.
Focusing first on the left-hand panel, we observe that for a vast number of indicators, as discussed above, comparison areas recorded improvements between 2016 and 2018. All indicators are constructed such that higher values indicate normatively better outcomes or improved views, and very few estimates indicate worsening performance between first and second rounds, in either survey. From the household survey, trust in the local council did fall, as did the share of respondents reporting that most neighborhoods are accessible by road and that had satisfactory responses from officials regarding complaints. Among government officials, no results suggest worsening responses about state capacity, and several indicators related to planning systems, taxation, capacity building and culture of performance registered improvements.
22 Controls include household size, age, gender, marital status and education level. Note that we are not estimating individual fixed-effect models, so we can include time-invariant controls. Furthermore, as discussed above, some households and officials are replaced in the second round survey, so controls would still vary over time for some units. All standard errors are clustered at the LGA level. Due to the small number of clusters, we also report 95% confidence intervals for our estimates using wild-bootstrapping. Results are not sensitive to the inference method.
23 For ease of reading, all dependent variables were standardized before the regressions, so coefficients are to be interpreted as changes in standard-deviation units of the dependent variable.
Turning to the right-hand panel, the observation is that for almost all indicators, changes in project areas were not different from those observed in comparison areas (i.e., estimates of the β coefficient are often very small in magnitude and statistically indistinguishable from zero). Among households' responses about state capacity, for example, no indicators related to urban planning, fiduciary systems, accountability or taxation changed differentially in project areas between the first and second survey rounds. The only indicator that recorded a differential change in the treated areas was the share of neighborhoods accessible by road, which decreased in comparison areas while increasing in project LGAs.
Among government officials, the same pattern repeats: for the vast majority of indicators we observe no differential trend in project vs. comparison LGAs. For the few indicators that did change differentially, we observe better performance of comparison LGAs: for several indicators related to taxation, like seeing opportunity to increase tax revenue and respondents paying taxes on their own property, performance improved more in comparison than in project areas.
We also present results for the aggregate indices created, in Table 9 and Table 10. In both tables, Panel A presents results for a simple difference-in-differences specification, without including any controls, while Panel B includes individual-level controls. For both panels, Table 9 shows that, except for the Accountability and Oversight index, respondents in comparison LGAs were more positive in the second survey than in the first one, as documented by the positive coefficients of the indicator variable for the 2018 survey round. The improvement in project LGAs, however, was no different from that observed in comparison LGAs: the coefficient on the interaction between the project indicator and the second round survey indicator is small and indistinguishable from zero. Across all outcomes, we can reject impacts larger than 0.3-0.4 s.d. 24 This is consistent with the fact that we observed no differential improvement in responses in the vast majority of the underlying variables used to construct these indices.
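The simple specification in Panel A can be illustrated with a minimal sketch that regresses a standardized outcome on a project indicator, a second-round indicator and their interaction. Everything here is an assumption made for illustration: the data are simulated, the names `did_ols`, `project` and `post` are hypothetical, and the sketch omits the controls, LGA-level clustering and wild bootstrap used in the paper. It only shows that, without controls, the interaction coefficient equals the difference-in-differences of the four cell means.

```python
import numpy as np


def did_ols(y, project, post):
    """OLS of a standardized outcome on [1, project, post, project*post].

    Returns the standardized outcome and the coefficient vector; the last
    coefficient is the difference-in-differences estimate.
    """
    y = np.asarray(y, dtype=float)
    project = np.asarray(project, dtype=float)
    post = np.asarray(post, dtype=float)
    y_std = (y - y.mean()) / y.std()  # standard-deviation units
    X = np.column_stack([np.ones_like(y_std), project, post, project * post])
    coef, *_ = np.linalg.lstsq(X, y_std, rcond=None)
    return y_std, coef


# Simulated respondents: a common improvement over time, no project effect.
rng = np.random.default_rng(0)
n = 400
project = rng.integers(0, 2, size=n)
post = rng.integers(0, 2, size=n)
y = 0.2 * project + 0.5 * post + rng.normal(size=n)

y_std, coef = did_ols(y, project, post)


def cell_mean(p, t):
    """Mean standardized outcome for one project-by-round cell."""
    return y_std[(project == p) & (post == t)].mean()


did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print("Interaction coefficient:", round(float(coef[3]), 4))
print("DiD of cell means:      ", round(float(did_means), 4))
```

Because the model is saturated in the two indicators, the two printed numbers coincide; adding individual-level controls, as in Panel B, would break this exact equality.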
The same overall result is found for the government officials' survey in Table 10. Here we present three aggregate indices (the staff capacity index, the management index, and the performance culture index) as well as three of the main indicators of state capacity: having an updated master plan, perceiving the internal audit office as independent, and capacity to raise local revenue. For all outcomes, the point estimates suggest an improvement in the second survey round, though not always a statistically significant one. The difference-in-differences coefficients, on the other hand, are always smaller in magnitude and never statistically different from zero. For several estimates, point estimates are negative, suggesting that, if anything, comparison LGAs outperformed project areas. For the standardized indices, we can reject differential improvement in project areas as small as 0.3 standard deviations (Management), 0.4 s.d. (Staff capacity) and 0.5 s.d. (Performance culture). Once again, we find no evidence that project LGAs registered a larger improvement in responses on performance when compared to comparison LGAs. 25

Discussion
In this section, we consider possible explanations for the lack of statistical difference in these outcome measures and what these results imply for the design of external capacity building projects going forward. First, we observe a pattern of positive changes over time in the responses from government officials about their experience with managing local responsibilities and delivering services, and from citizens about receiving these services and being willing to pay taxes to finance them, in both project and non-project LGAs. That pattern is consistent with overall change in Tanzania in the direction of strengthening state capacity.
Second, the additional value (beyond country-level processes of change) of the project's specific activities to strengthen incentives of local government officials, such as through the Annual Performance Assessment (APA), appears to be low, with no evidence of a difference in responses of government officials across project and non-project areas. There is also no evidence that the project made a difference for the citizen oversight and accountability channel through which the project sought to strengthen incentives of local government officials. Citizens in non-project areas reported comparable increases over time in knowledge about local government activities, and project areas showed no increases in concrete indicators of citizen participation (such as whether they contacted any government officials or participated in meetings) despite the APA giving the highest possible score to project LGAs on this component.
What alternative explanations could fit the pattern of results that we observe? First, it is possible that the lack of difference between project and non-project areas could be due to project LGAs demonstrating superior performance and thus having other LGAs in the country learn from and copy their practices. It is also possible that investing in local state capacity is a strategic complement across local governments (Acemoglu et al., 2015), such that as state capacity increased in project areas, other LGAs perceived greater returns from investing in their own capacity. However, the complete lack of any statistical difference across a rich set of variables casts doubt on these explanations. If the project had such large demonstration effects, we would expect to see at least some difference in some of the variables, rather than have all the benefits spill over in this short span of time. Even if LGAs were learning from each other, the significant improvements over time in non-project LGAs, which did not receive the project's incentives or capacity building activities, cast doubt on the mechanism design of the project, namely incentives generated by performance grants.
Second, the project may have had its principal impact before the first survey in February 2016, in which case the difference between the two surveys might not capture project-induced improvements. We examine the Annual Performance Assessments (APAs) of the project to discern which components show the greatest increase in project measures of institutional performance. In Figure B1 we present the mean scores across project LGAs in each of the performance assessments.
25 We present alternative specifications to the standard difference-in-differences estimator (fixed effects and a semiparametric difference-in-differences estimator) using government officials and households in Appendix D: Table D1 and Table D2. Estimates are qualitatively similar and suggest an overall null impact of the intervention.
Our first round survey was implemented in early 2016, around the fourth APA, while our second round survey happened in 2018, close to the sixth APA. We find that the "Accountability" component of the APA shows a large increase between the 2nd and 3rd APAs, while the other components (Revenues, Infrastructure and Urban Planning) do not show a pattern of concentrated growth before our first survey round. (We discuss the available evidence in more detail in Appendix B.) It could be that, had we undertaken a citizen survey in 2013, we would have found that the 18 project LGAs had much lower levels of citizen-survey-based Accountability measures than the 22 non-project LGAs. The project may have incentivized the 18 LGAs, which were reluctant to publicly post their budget information in the absence of these project incentives, and our 2016 survey measure, compared to this hypothetical 2013 survey measure, might then show that Accountability significantly improved as a result of the project incentivizing the LGAs to reach out to citizens. We cannot rule this out because we do not have survey data from 2013 across both project and non-project LGAs. We find no significant difference between project and non-project LGAs in the 2016 survey measures of citizen engagement targeted by the Accountability component, nor in changes in citizen engagement between 2016 and 2018.
We argue that our two rounds of surveys appear well positioned to capture something proxying a baseline for outcomes in February 2016, before the project was disbursing substantially, and improvements over a two-year period as the project was actively implemented. The 2018 Quality Assurance Review (a report that reviews the project's implementation and is led by an external consultant) indicates that disbursements of project funds were delayed in the early years of the project, with funds starting to flow and investment activities happening only towards the mid-
Third, measuring experience with service delivery is subject to error and reporting biases. For example, the period 2016-2018 in Tanzania is one where a new president took office (in October 2015), announced and implemented several policy measures to crack down on corruption, and strengthened performance norms among government officials. These announcements and actions may have created a perception that things are improving, coloring the survey responses equally in both project and non-project areas. However, our survey is careful to avoid pure perception questions such as degree of "satisfaction" with government performance or agreement/disagreement with whether things are "improving." Instead, the questions probe for actual experience, and some show improvements over time only in the non-project LGAs, not in project LGAs. For example, in 2016, only 29 percent of government officials in non-project LGAs reported paying property taxes on commercial buildings they own, a share that rose to 62 percent in 2018, compared to around 75 percent in both years in the project LGAs.
Fourth, despite our checks of pre-project parallel trends in a key measure of local state capacity (own source revenues), since the project LGAs were not selected at random, there remains a possibility that their trends would have been different in the absence of the program, leading to bias in our difference-in-differences estimates. Nevertheless, the following facts make it difficult to defend an argument that the project had an impact that we are unable to discern: that the survey data are exceptionally rich, gathering data on many different aspects of local state capacity, from the experience of both government officials and citizens; that we find no evidence at all of significant improvements in the project areas that are different from improvements in the non-project areas; and that the improvements reported over time in both project and non-project areas are consistent with country-wide changes in Tanzania, regardless of the project. For example, the new President re-centralized the collection of local property taxes. The re-centralization of local property taxes directly affected the project design. While at the outset one of the main indicators of increasing local state capacity was expected to be increases in local own source revenues, the APA entirely dropped the scoring of this component in its 6th round (2017-2018). Yet, we find increases in local government official reports of local tax efforts, and especially so in non-project areas. Local officials' payment of their own property taxes increases substantially in non-project areas, and they also are more likely than project-area officials to report efforts to improve local tax collection.
To offer ideas and recommendations from research for further innovation in the design of such projects in Tanzania and beyond, we closely reviewed project documents to understand the general design and theory of change on which these capacity building projects are founded. We find that these projects share a common, global template that is being applied across different countries and contexts, centered on the role of an Annual Performance Assessment which is expected to verify whether local governments have certain institutional features that are found in high state capacity countries, such as the existence of planning documents, council meeting minutes, audit reports, procurement tenders and the like. 26 Qualitative research has critiqued this approach to building state capacity as "isomorphic mimicry" (Andrews et al., 2017), whereby developing countries are made to produce documents and establish protocols that resemble institutions in donor countries, but fail to effectively perform the functions of a state. The lack of difference in measured outcomes of state functioning in the data from Tanzania offers quantitative evidence that is consistent with such critiques.
Alternatively, the amounts committed or the scope of the capacity building component might be too small to make a dent. 27 Further, the rationale for using a financial incentive approach is not necessarily substantiated by strong ex-ante evidence and hence it cannot be ruled out that a more traditional development financing approach (without explicit incentives) could be more effective, as recent evidence from the health sector demonstrates (Kandpal et al., 2020).

Conclusion
Going forward, we recommend, first, that projects targeted at building state capacity invest more resources in learning through policy experimentation within the project, given the lack of established knowledge about how state capacity comes about and the existing critiques of donor approaches to transplant formal institutions (Bourguignon & Gunning, 2018; Andrews et al., 2017; World Bank, 2017). Second, we recommend more research to understand forces of change in countries that may be strengthening incentives of local governments to deliver services. From reviews of research available so far, it seems that greater political contestation and demands from citizens for improved governance and service delivery are behind these improving incentives, but nevertheless with several risks and pitfalls (Olken & Pande, 2013; World Bank, 2016c; Dal Bo & Finan, 2016; Bardhan & Mookherjee, 2016; Khemani, 2019). A deeper analysis of administrative data can be an essential asset for both of these recommendations. 28 Project designs going forward can use the recommendations from these reviews to strengthen political incentives for service delivery, and thus enable the emergence of state capacity along similar lines to how such capacity emerged in today's rich countries (Besley & Persson, 2009; Fukuyama, 2004; World Bank, 2016c; Khemani, 2019). For example, performance assessments could focus on rigorously measuring service delivery performance (e.g., road connectivity, garbage collection, coverage of drainage and sewage systems) rather than on primarily reviewing documents and protocols as is currently being done through the APAs. Project design could focus on the communication of, and deliberation around, these performance assessments with citizens, especially through mass media, whose role in strengthening institutions has been recently emphasized in research (World Bank, 2016c; La Ferrara, 2016).
Investing in communication and deliberation is not a soft option but rather one that could be applied more scientifically through dedicated projects aimed at building state capacity,
Acemoglu, D., García-Jimeno, C., & Robinson, J. A. (2015)

[Figure: estimates by indicator from the government officials' survey, covering planning, budget, audit, grievance handling, taxation and infrastructure indicators.]
LGAs, and the difference between the means. P-value is reported for difference of means, clustered at LGA-level.
Note: This table reports regressions using each of the described indices as dependent variable. Non-reported controls in the second panel include welfare index, household size, age, gender, marital status and education levels. Standard errors clustered at the LGA level are reported in parentheses (* p<0.1, ** p<0.05, *** p<0.01), while 95% confidence intervals constructed using wild-bootstrapping are reported in brackets.

A Selection of the Comparison Group
Because of the targeted nature of the program (i.e., the program was designed to focus on the 18 LGAs that were among those with faster rates of urbanization at the time the program started), finding a comparable set of LGAs was expected to be a difficult task. Additionally, administrative data prior to the program's inception were incomplete for many potential comparison LGAs; some potential comparison LGAs were newly formed and so no historical administrative data were available at all. Against this backdrop, the research team developed a protocol to select
3. If Step 2 did not hold, then if the top comparison LGA from either list was already selected, we chose the top one from the other list. Otherwise we chose the top comparison LGA from the survey list.
Using the comparison group selected as described above, we find that respondents in project and comparison LGAs are not statistically identical in the main indicators at round 1 (Table 2 and Table 3). This is an unsurprising result, given the targeted nature of the program. Our identification strategy does not rely on project and comparison LGAs being identical but rather on their developing at similar rates in the absence of the project (often called the "parallel trends" assumption). This is discussed in detail in subsection 3.5.

improvements in project LGAs
We use information from the Annual Performance Assessment (APA) of the 18 local governments receiving the Urban Local Government Strengthening Program (ULGSP), compiled in the Quality Assurance Reviews (QAR).
We explore two main indicators available throughout the period. Disbursement Linked Indicator (DLI) 2 refers to whether "ULGAs have strengthened institutional performance as scored in the annual performance assessment" and is comprised of five sub-indicators (as described in Section 2 of the paper): improved urban planning system; increased revenues from property taxes; efficient fiduciary system; improved infrastructure, implementation and O&M; and strengthened accountability and oversight systems. DLI 3 refers to whether "Local infrastructure targets as set out in the annual action plans are met by ULGAs using program funds." In Figure B1 below we present the mean score for four of the sub-components of DLI2; we exclude the sub-component on efficient fiduciary system since maximum scores changed over time. As referenced in the timeline presented above, our first round survey was implemented around the same time as the 4th APA (early 2016), and the follow-up around the 6th APA (early 2018). The evolution of mean scores is uneven over time. With the exception of the accountability index, which shows a remarkable increase between the 1st and 2nd APAs and then flattens out, the other sub-components are broadly stable over time, with ups and downs and no indication that all improvement happened in the very beginning of the ULGSP program. 29
29 We also present the overall changes in APA subcomponents in Table
Note: This figure presents average scores on each DLI2 component, including 95% CI for the mean. It excludes the "efficient fiduciary system" component since in the 2nd and 3rd APAs the maximum score was lower due to non-inclusion of certain sub-components. The score for "increased revenues from property taxes" was not computed in the 6th APA.
In Figure B2 we present the evolution of average DLI2 and DLI3 components. Focusing first on DLI3, we observe a substantial increase between APAs 4 (early 2016) and 5 (late 2016), with slight improvements in between other APAs. Trends in the DLI2 component are harder to interpret since the maximum score was lower in the 2nd and 3rd APAs (due to a lower maximum score on the "efficient fiduciary system" sub-component) and in the 6th APA (due to the absence of the "increased revenue from property taxes" sub-component). We use weights to adjust for those, but the composition of the index is not strictly comparable over time.
Note: This figure presents average scores on DLI2 and DLI3 components, including 95% CI for the mean. It should be noted that DLI2 is not strictly comparable over time: the maximum score for the "efficient fiduciary system" sub-component is lower in the 2nd and 3rd APAs; and the "increased revenue from property taxes" sub-component is absent in the 6th APA, reducing the maximum aggregate score. We use "adjusted scores" provided by the QAR for the 2nd and 3rd APAs, which multiply the final score by 10/9 to expand the maximum score from 90 to 100; and apply the same adjustment ourselves to the 6th APA, multiplying scores by 100/75 to take into account the maximum score of 75 points.
LGAs receiving the ULGSP program. Column (1) reports the total change in score between the 2nd and 6th APAs, while columns (2) and (3) report changes between the 2nd and 4th APAs (before the first round survey) and between the 4th and 6th APAs (between the first and second round surveys), respectively. Columns (5) and (6) express the change in each period as a share of total change. We do not report results for the "Efficient Fiduciary systems" component since the underlying indicators change throughout the period. The subcomponent "Increased Revenue" was not computed for the 6th APA, so we use the 5th APA as the final round.
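The score adjustment described in the note to Figure B2 amounts to rescaling each round's aggregate score to a common 100-point maximum. A minimal sketch follows; the function name `rescale_to_100` and the example scores (81 and 60) are hypothetical and chosen only to make the arithmetic easy to follow, while the maxima of 90 and 75 points come from the note.

```python
import math


def rescale_to_100(score, max_score):
    """Express a score out of `max_score` on a 100-point scale."""
    return score * 100.0 / max_score


# 2nd and 3rd APAs: maximum of 90 points, i.e. multiply by 10/9.
adjusted_early = rescale_to_100(81, 90)
# 6th APA: maximum of 75 points, i.e. multiply by 100/75.
adjusted_sixth = rescale_to_100(60, 75)

print(adjusted_early, adjusted_sixth)
```

Rescaling equalizes the maximum attainable score but not the composition of the index, which is why the adjusted series is still not strictly comparable over time.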
(1)) and survey indices between first and second rounds (column (2)), for APA component and similar survey index. Changes are normalized to be interpreted as standard deviations of baseline. Indicator for "Increased revenue" is not presented since it was not collected in the 6th APA.
We present how changes in APA scores and equivalent household survey measures co-vary over the 2016-2018 period in the 18 LGAs receiving ULGSP in Figure B3. For the infrastructure and urban planning dimensions, we observe a positive correlation, although with a fair amount of noise, as should be expected from 18 observations. The linear relationship between the two variables is also positive for accountability, but the scatter plot makes clear that most LGAs do not observe any variation in the DLI scores. Finally, for the fiduciary responsibility dimension we actually observe a negative correlation between changes in DLI and survey measures. Additionally, we present correlations for baseline, endline and changes in subcomponents for household survey measures in ??.
We also present correlations between APA scores and government officials' survey indices in Figure B5.

(d) Accountability
Note: This figure presents a scatter of changes in DLI subcomponent scores (x-axis) and survey indices (y-axis) by LGA. DLI scores are normalized so that changes can be interpreted as standard deviations in the distribution of the 4th APA scores.

(d) Accountability
Note: This figure presents a scatter of changes in DLI subcomponent scores (x-axis) and survey indices (y-axis) by LGA. DLI scores are normalized so that changes can be interpreted as standard deviations in the distribution of the 4th APA scores.
Note: This figure presents correlations across the 18 ULGSP LGAs for baseline, endline and changes in outcomes. Endline and changes are not reported for the taxation subcomponent since it was not collected in the 6th APA.

C Pre-trend checks
The key identifying assumption of our difference-in-differences strategy is that, in the absence of the ULGSP, outcomes in treated and reference units would have evolved similarly. While this parallel trends assumption is not directly testable, in this section we use nighttime light (NTL) data as a proxy for economic activity and assess whether treatment and reference units presented similar trends before our first round of interviews.
In order to cover a longer time period before the intervention, we use the recently harmonized NTL dataset developed by Li et al. (2020), covering the 1992-2018 period. This is a synthetic dataset that uses the original Defense Meteorological Satellite Program (DMSP) data for 1992-2013 and calibrated data from the Visible Infrared Imaging Radiometer Suite (VIIRS) for the period 2012-2018. Following recommendations from the authors, in our main specification we drop any pixels with NTL value lower than 7. 30 We present results in Figure C1. In panel A we present average NTL, with 95% CI, for ULGSP and comparison wards included in our survey sample. Consistent with the fact that ULGSP was targeted at more urban districts, the average nightlights are higher for those wards than for comparison wards. The trends before the first assessment, however, do not suggest differential pre-trends. We test this formally using a difference-in-differences model of the following form:

$$ntl_{wy} = \alpha + \gamma_w + \theta_y + \sum_{y=1999}^{2018} \beta_y (ULGSP \times year)_{wy} + X_{wy} + \epsilon_{wy}$$

where $ntl_{wy}$ are measures of nightlights in ward $w$ in year $y$, $\gamma_w$ and $\theta_y$ are ward and year fixed effects, and the coefficients $\beta_y$ measure the differential nightlights in ULGSP wards in each period. We allow for differential linear trends by region, included in the time-varying vector $X_{wy}$. Since treatment is defined at the LGA (district) level, we cluster standard errors at that level. We use the entire sample period 1992-2018, but pool all years before 2000 in one coefficient.
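The pixel filter mentioned above (dropping pixels with NTL below 7 before aggregating) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' processing code: the function name `ward_mean_ntl`, the flat arrays of pixel values and ward labels, and the choice to exclude filtered pixels from the ward average (instead of, say, setting them to zero) are all assumptions.

```python
import numpy as np


def ward_mean_ntl(pixels, ward_ids, threshold=7.0):
    """Average nighttime-light value per ward after dropping dim pixels.

    pixels    : NTL value of each pixel
    ward_ids  : ward label of each pixel (same length as `pixels`)
    threshold : pixels with NTL below this value are dropped (assumed rule)
    """
    pixels = np.asarray(pixels, dtype=float)
    ward_ids = np.asarray(ward_ids)
    keep = pixels >= threshold
    means = {}
    for ward in np.unique(ward_ids):
        values = pixels[keep & (ward_ids == ward)]
        # A ward with no pixel above the threshold gets NaN.
        means[ward.item()] = float(values.mean()) if values.size else float("nan")
    return means


# Toy example: ward "a" has pixels 3, 8, 10; ward "b" has pixels 5, 20.
result = ward_mean_ntl([3, 8, 10, 5, 20], ["a", "a", "a", "b", "b"])
print(result)
```

Ward-year averages built this way would then form the dependent variable $ntl_{wy}$ in the pre-trend regression.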
We present the $\beta_y$ coefficients in panel B of Figure C1. While results are somewhat noisy, they do not suggest a differential trend between ULGSP and comparison wards before the 1st ULGSP assessment in 2013. Nightlights in ULGSP wards do decrease relative to comparison wards in 2014, but that differential is temporary and disappears in the following years.
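One simple way to read a coefficient plot such as panel B is to check, year by year, whether the 95% confidence interval of each pre-assessment coefficient covers zero. The sketch below illustrates that check. The function names are hypothetical and the (coefficient, standard error) pairs are placeholders for illustration only, not the paper's estimates. It also assumes a normal critical value of 1.96, which may be optimistic when standard errors are clustered on a small number of LGAs.

```python
from typing import Dict, List, Tuple

Z_95 = 1.96              # ASSUMPTION: normal critical value for a two-sided 95% interval
FIRST_ASSESSMENT = 2013  # year of the 1st ULGSP assessment (from the text)


def conf_interval(beta: float, se: float) -> Tuple[float, float]:
    """Two-sided 95% confidence interval around a coefficient."""
    return (beta - Z_95 * se, beta + Z_95 * se)


def significant_pre_years(estimates: Dict[int, Tuple[float, float]]) -> List[int]:
    """Pre-assessment years whose confidence interval excludes zero."""
    flagged = []
    for year in sorted(estimates):
        if year >= FIRST_ASSESSMENT:
            continue  # only pre-period coefficients inform the pre-trend check
        low, high = conf_interval(*estimates[year])
        if low > 0 or high < 0:
            flagged.append(year)
    return flagged


# Placeholder (beta, se) pairs for illustration only; NOT the paper's estimates.
example = {
    2010: (0.10, 0.30),
    2011: (-0.20, 0.25),
    2012: (0.05, 0.20),
    2014: (-0.90, 0.40),
}
print(significant_pre_years(example))
```

An empty list is what the absence of a differential pre-trend would look like under this check; a joint test of all pre-period coefficients would be a stricter alternative.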
In Figure C3 we also present results that use average nightlights across the entire district, rather than only the wards included in our survey. Here we observe some degree of convergence between 2000 and 2013: the difference in mean nightlights between comparison and treatment districts falls by about 2 NTL points over the period (in 2011 the standard deviation (s.d.) of NTL across the approximately 800 wards included in study districts was 6.2, meaning that the estimated convergence in NTL was smaller than 0.3 s.d.). This is reflected in the DiD coefficients in Panel B. A similar pattern is observed in Figure C4, where we plot levels of nightlights and DiD estimates for the ward with the strongest nightlights in 2011, as a proxy for the urban center in each district. We also observe some degree of convergence, particularly in the early 2000s.

(Abadie, 2005). The smaller number of observations compared to the tables in the main text is due to the fact that once we use fixed effects we drop all observations for which either baseline or endline indices are missing. Regressions in panel B use the change in outcome as dependent variable, so the number of observations is half that of the panel data. Selection into treatment is balanced for respondents' age, gender and wealth index. Standard errors clustered at the LGA level are reported in parentheses (* p<0.1, ** p<0.05, *** p<0.01).