Are There Lasting Impacts of Aid to Poor Areas? Evidence from Rural China

The paper revisits the site of a large, World Bank-financed, rural development program in China 10 years after it began and four years after disbursements ended. The program emphasized community participation in multi-sectoral interventions (including farming, animal husbandry, infrastructure and social services). Data were collected on 2,000 households in project and nonproject areas, spanning 10 years. A double-difference estimator of the program's impact (on top of pre-existing governmental programs) reveals sizeable short-term income gains that were mostly saved. Only modest gains to mean consumption emerged in the longer term, in rough accord with the gain to permanent income. Certain types of households gained more than others. The educated poor were under-covered by the community-based selection process, greatly reducing overall impact. The main results are robust to corrections for various sources of selection bias, including village targeting and interference due to spillover effects generated by the response of local governments to the external aid.


Introduction
Publicly-supported grants and loans to poor areas have long been an important vehicle for development assistance. For example, China's anti-poverty policies have emphasized such poor-area programs since the mid-1980s, 2 motivated by the observation that the country's success against poverty over the last 25 years has been geographically uneven, with marked disparities in living standards emerging. 3 Advocates of such programs claim that credit constraints in poor areas perpetuate their poverty and that targeted aid can relieve those constraints. By this view, capital-market failures in poor areas entail that the investments made under such a program would be infeasible otherwise, implying both efficiency and equity gains.
It remains an open question how much impact can be expected. While not perfect, capital markets may still work well enough to assure that marginal products of capital come into rough parity between poor and non-poor areas in steady state. Then the problem of lagging poor areas is not so much lack of capital as low productivity of capital, such as due to poor natural conditions, lack of complementary knowledge or skills, or poor policies.
And even with credit constraints, some people are clearly more constrained than others. If those selected are not credit constrained, their participation is voluntary, and the interest rate is no different from other credit sources, then there will be no net gain from the extra availability of credit. Heterogeneity in the impacts of such programs can also arise from inequalities in the complementary skills or knowledge needed to derive benefits from the extra investment. Beneficiary selection will then be crucial to the outcomes. However, it is not obvious that the selection procedures found in practice would "pick the winners." Beneficiary selection for local development programs has come to rely heavily on local community groups. This practice may well achieve greater equality in access to the aid within villages, but possibly at the expense of assuring that the aid goes to those who would benefit the most.
2 See (inter alia) Leading Group (1988), World Bank (1992), Jalan and Ravallion (1998) and Park et al. (2002).
This paper provides the first rigorous assessment of the longer-term impacts of a poor-area program. The program is the World Bank's Southwest China Poverty Reduction Project, or the Southwest Program (SWP) for short. This comprised a package of multi-sectoral interventions targeted to poor villages using community-based participant and activity selection. The aim was to achieve a large and sustainable reduction in poverty. The paper reports results from an intensive survey data collection effort over 10 years, initiated by two of the authors and done in close collaboration with the Rural Survey Organization of China's National Bureau of Statistics.
Assessing development aid effectiveness at the project level raises a number of challenges. A long-term commitment to collecting high-quality survey data is crucial, but it is not sufficient. Impact can only be meaningfully assessed relative to a counterfactual; our counterfactual is the absence of the SWP, which means that we assess the incremental impacts, on top of pre-existing governmental spending. As in any observational study, there are concerns about selection bias, i.e., differences in counterfactual outcomes between SWP participants and non-participants. Our data collection effort allows us to "difference out" the time-invariant component of the selection bias (arising from non-random placement). However, it is not obvious on a priori grounds that the bias would be constant over time, given that the initial village characteristics that attract the program (such as poor infrastructure) may also influence the growth rate under the counterfactual. We use both propensity-score weighted regression and kernel-matching methods to balance the observable covariates between sampled SWP and non-SWP villages.
A further problem is that aid-financed poor-area development projects are likely to violate the common assumption in impact evaluations (both experimental and nonexperimental) of no interference with the comparison units. 4 A plausible source of interference in this setting is through local public-spending spillover effects to non-SWP villages. The local government cuts its own development spending in the villages targeted for external aid, and the spending is diverted in part at least to the non-participants used to form the comparison group. We propose and implement a test for spillover effects and we construct a bound for the bias.
4 This assumption is often implicit in impact evaluations but it was made explicit by Rubin (1980), who dubbed it the stable unit treatment value assumption (SUTVA). SUTVA is known to be implausible in certain bio-medical evaluations.
The paper's principal finding is that there were sizeable income gains from the SWP during its disbursement period, but these gains did not survive four years later. The longer-term impact on mean income is neither large nor statistically significant. However, we do find significant gains for some sub-groups, notably those among the poor with better schooling. Our results point to substantial losses from the community-based beneficiary selection process.
The following section describes the SWP while Sections 3 and 4 describe our data and methods. Section 5 presents the main results while Section 6 draws some lessons for future evaluations.

Background on the program
In 1986, the Government of China designated about 15% of the country's 2,200 counties as "poor counties," which would receive extra assistance, mainly in the form of credit for development projects. Past research has suggested that the designated poor counties are in fact poor (by a range of defensible criteria) and that they have seen higher growth rates than one would have otherwise expected (Jalan and Ravallion, 1998; Park et al., 2002). The gains have not been sufficient to reverse the underlying tendency for growth divergence (whereby poorer counties tend to have lower growth rates) and there is evidence that the impacts on economic growth may have declined in the 1990s (Park et al., 2002). Within these designated poor counties, geographic pockets of extreme poverty have persisted to the present day, mainly in upland areas.
The SWP was introduced in 1995 with the aim of reversing the fortunes of selected poor villages in the designated poor counties of Guangxi, Guizhou and Yunnan.
About one-quarter of the villages were selected for the SWP (1,800 out of 7,600 villages). The aim was to choose relatively poor villages within these counties, with selection based on objective criteria, although not formulaic. The selection was done by the county government's project office in consultation with provincial and central authorities and the World Bank.
The total outlay on the SWP was US$464 million, which was financed by World Bank loans and counterpart funding from China's central and provincial governments.
The total investment per capita under the SWP was only slightly lower than mean annual income per capita of the project villages.
As in other World Bank projects, there were numerous appraisal and supervision missions by Bank staff and consultants, and these missions often probed quite deeply into the project's local operations, including numerous visits to participating counties and villages. Two of the authors (Chen and Ravallion) participated in some of these missions and also revisited a number of the sampled villages over two weeks in May 2005 (including some that they had visited 10 years earlier) and had informal discussions about the SWP with numerous ex-participants.
Within the selected villages, virtually all households were expected to benefit from the infrastructure investments, such as improved rural roads, power lines and piped water supply. Widespread benefits were also expected from the improved social services, including upgrading village schools and health clinics, and training of teachers and village health-care workers. Those with school-aged children also received tuition subsidies (conditional cash transfers). Over half of the households in SWP villages also received individual loans (accounting for about 60% of disbursements). The interest rate was set at the same level as for loans from the government's poor-area programs and the Agricultural Development Bank of China, although this is a lower rate than for commercial sources of credit. The loans financed various activities including initiatives for raising farm yields, animal husbandry and tree planting. There was also a component for off-farm employment, including voluntary labor mobility to urban areas and support for village enterprises. The selection of project activities aimed to take account of local conditions and the expressed preferences of participants, although it is unclear how well this worked in practice; there have been reports that farmers' preferences were sometimes over-ruled by local cadres (World Bank, 2003).
Household selection into the SWP was a less transparent process than village selection, which could be based on data and field observations. The household selection was typically done by the pre-existing "farmers' committee" in each village and was not subject to rigorous monitoring. From our discussions in field work, it appears that credit-worthiness criteria and successful past experience with similar project activities played an important role. No doubt local level connections also played a role.
In common with other development projects, the SWP provided the capital and technical assistance, but it did not provide insurance, and many of the activities are likely to entail non-negligible risk; the income gains will depend on a number of contingencies, including the vagaries of the weather, uncertain demand for the new products and risks associated with out-migration.
The ex ante expectation was that the SWP would virtually eliminate poverty in the selected villages over the longer term. The World Bank's Implementation Completion Report (ICR), the final document giving the ex post "self-assessment" of a lending operation by the relevant operational unit, claimed that the SWP had a substantial impact on poverty, citing survey data indicating that the poverty rate had been more than halved in the project areas over -2001 (World Bank, 2003). 5 However, the attribution of these gains to the SWP is questionable. The evaluative claims in the ICR are reflexive comparisons, which only reveal the true impact under the assumption that there would have been no progress against poverty in the absence of the project. That assumption must be deemed highly implausible in this setting.
Ravallion and Chen (2005) studied the impacts of the SWP over the disbursement period, 1995-2000, using survey data for 2,000 randomly sampled households in both SWP and observationally similar non-SWP villages that had first been surveyed in 1995 (at the beginning of the project) and then annually until project completion. On comparing income changes in SWP villages with those in the matched non-SWP villages, they found an average income gain over five years of around 10% of baseline mean income, representing an average rate of return of 9%. The gains are not as dramatic as suggested by the reflexive comparisons in the ICR, but they are still sizeable.
However, Ravallion and Chen found that a large share of the income gain was saved. On comparing the final year of disbursement with the first, Ravallion and Chen found only a modest impact on mean consumption or consumption poverty. The savings rate from the project's income gains was well above the pre-intervention savings rate.

5 This was confirmed by researchers at the Chinese Academy of Social Sciences, who also pointed to a substantial increase in primary school completion rates and a decline in the infant mortality rate which they attributed to the SWP (Guobao et al., 2004).
Why was there such a high savings rate from the initial income gains? A number of explanations can be suggested, carrying rather different implications for the long-term impact of the SWP. Possibly households saved more to assure they could repay the loans. That depends on the extent to which repayment was enforced. While the World Bank's loan is made to the (central) Government of China, and repayment is virtually certain, that is not the case for the loans made at local level, where enforcement problems are common. Indeed, local repayment rates on loans for poverty reduction under the government's own program were less than 25% in the three provinces covered under SWP. 6 However, it may be that the necessity of the center repaying the World Bank "trickled down" in the form of greater local enforcement of SWP repayments than for the loans made under the government's own poor-area programs.
Another possibility is that the high initial savings rate reflected a perception on the part of participants that the longer-term income gains from SWP would be modest or uncertain at best, raising concerns about the sustainability of the program's impacts. 7 When interpreted in terms of the Permanent Income Hypothesis, the Ravallion-Chen findings imply that participants felt that a large share of the income gain was transient, and (hence) it was saved. While this would happen even without uncertainty about the future income gains, such uncertainty is likely, and would probably lead to precautionary saving in response to the project. 8 In this regard, it is instructive that Ravallion and Chen found large year-to-year differences in impact, which were primarily due to variability in the annual returns to the program's investments rather than the level of investment. This variability in the returns suggests that participants would have had a hard time assessing the program's impact on permanent income.
The transient-income explanation suggests that the income impacts of SWP would diminish appreciably after disbursement. Precautionary saving would also start to fall as participants learn more about the impacts. Consumption gains should become evident in due course, consistent with the project's underlying impact on permanent income.
6 The repayment rates on loans for poverty reduction in 1997 ranged from 8% in Yunnan to 23% in Guizhou. Repayment rates were somewhat higher for other types of loans but the overall average was still only 30% (Government of China, 1998).
7 The ICR rated "sustainability" as "highly likely." The Bank's internal evaluation of SWP by its Operations Evaluation Department pointed to the need for further evidence on the longer-term sustainability and impact of SWP.
8 There is evidence of precautionary savings in response to uninsured risk in the same region of rural China; see Jalan and Ravallion (2001).
There is another explanation for the high savings rate from the short-term income gains. This postulates that the SWP systematically alters the returns-to-saving in the participating villages. By this view, the project provided local public goods that increased the marginal product of private capital, and so stimulated higher savings to support the desired private investment, which would yield longer-term income gains beyond the life of the project. 9 This assumes that there are capital-market imperfections, which entail that investment depends on own-savings and that the marginal products of private capital are not equalized across locations. With the poor facing severe constraints on access to credit and yet having higher marginal products of capital in their own (farm and non-farm) enterprises (given low capital stocks and concave production functions) one might expect to see a sizeable (and pro-poor) investment response. Clearly, this explanation offers a more positive view of the prospects of a sustained impact on poverty from the SWP, in that it suggests that income gains will persist well beyond the disbursement period (as the returns to investment start to be realized) and that sustainable consumption gains would emerge.
By re-surveying in 2004/05 the same sample studied by Ravallion and Chen (2005) we hope to throw light on which of these explanations is most plausible.

Data
The original plan for the impact evaluation of SWP was to do a baseline survey in 1995 and to only do follow-up surveys during the Bank's disbursement period, up to 2000. However, we decided to re-survey the original sampled households in 2004/05, to try to resolve the issues about longer-term impact raised by Ravallion and Chen (2005) and discussed in the previous section. SWP. All villages were in counties covered under the government's poor-area program, to assure that we will identify the impact of the SWP, on top of the government's program. There are 112 SWP villages and 86 non-SWP villages. 10 The SWP villages were a random sample from all project villages, while the non-SWP villages were a random sample from all other villages in the designated poor counties. Ten randomly sampled households were interviewed in each village. We follow Ravallion and Chen (2005) in using 1996 as the baseline. There are serious comparability problems between the 1995 survey and later surveys. 12 As a baseline, the 1996 data are not free of contamination; 17% of the program's total disbursement on household projects had been made by the end of 1996. We check robustness to using 1995 as the baseline.
Relative to other household surveys, unusual effort went into obtaining accurate estimates of consumption and income from the 1996-2000 and 2004/05 surveys. While the community, individual and project activity surveys used conventional one-time interviews, the household survey was quite different. The household surveys from 1996 onwards were closely modeled on NBS's Rural Household Survey (RHS) (which is described in detail in Chen and Ravallion, 1996). This is a good quality budget and income survey, notable in the care that goes into reducing both sampling and non-sampling errors. Similarly to the RHS, sampled households maintain a daily record on all transactions plus log books on production. Local interviewing assistants visited each household at two- to three-week intervals, at which time inconsistencies found at the local (county-level) NBS office are checked. Other trained interviewers also visited at regular intervals to collect additional data. This intensive interviewing method is a marked contrast to most surveys in which the respondent is visited only once or twice.
10 In the 2004/05 survey, two villages (one SWP and one not) were inadvertently replaced by two different villages in the same township.
11 A project activity survey in 1998 also gathered information about the scale and the starting year of each SWP sub-project at village and household level, as well as data on other funding these villages and households received from the government and other sources.
12 Because of delays in NBS being told the locations of SWP villages, the first survey in December 1995 had to use a one-time interview method, asking recall over the full year. The use of this long recall period is likely to lead to underestimation of income and consumption (though this is of less concern for village-level characteristics). The subsequent surveys used the daily-diary method over the full year, allowing more accurate income and consumption data.
The consumption aggregate is built up from very detailed data on cash spending on all commodities and imputed values of in-kind spending, which is mainly consumption from household production, valued at local selling prices. Living expenditures exclude spending on production inputs (which are accounted for in net income from own-production activities). They also exclude transfer payments, though these only account for a small share of total spending (3.7% over the whole sample in 1996). The income aggregate includes cash income from all sources and imputed values for in-kind income. Income is measured net of all production costs, including interest on debt (including loans from the SWP). The migrant workers were not tracked, although the income aggregate includes remittances received from family members who migrated, including those supported by the SWP. Remittances are expected to be the main means by which the out-migration component reduced poverty in the short run.
Given the unusual effort that went into collecting and checking the consumption and income data, we expect that subtracting consumption from income will give reasonably accurate estimates of savings. We also look into what forms the savings took. There are many forms of saving in this setting, including money balances and investment in own-production activities. The survey was not designed to allow a complete independent accounting of all forms of saving. Some data were collected on assets and liabilities, although the reliability of the reported values is questionable. We also study impacts on holdings of specific assets.
For the 2004/05 follow-up survey we used exactly the same survey instrument as for the 1996-2000 surveys, augmented with a module to elicit perceptions of both welfare and the project's impacts. The module asked respondents to assess whether various aspects of their lives had improved over the preceding 10 years. (The questions in this module were asked in 2005.) These involved a long list of aspects of well-being and in each case the respondent was asked whether this item had improved or not over the last 10 years, on a 10-point scale (from "extremely worse off" to "extremely better off"). 13 (The sample was restricted to adults who were at least 28 years of age at the time of the interview.) Our idea here is to see whether such a rapid appraisal tool, which does not require any prior surveys (including a baseline), gives similar results to our more costly longitudinal survey-based method.
Over 1996-2005, the attrition rate was 12% (6% over . Using a probit model for attrition over 1996-2005 we found a number of significant predictors, including age of head, share of children, landholding and some geographic variables. 14 (Being an SWP village was not a significant predictor of attrition.) NBS survey teams were instructed to find replacement households as similar as possible to those that dropped out. We also tested how well this replacement worked, using a regression for the probability of being a replacement household estimated on the pooled sample of replacement and "drop-out" households. 15 Among the same set of covariates for attrition, no regressor was a significant predictor for replacement and the regression has very low overall explanatory power. It appears that the sample with replacements can be considered representative of the population.
We checked the robustness of our results to several potential data problems. One problem concerns the aggregation of total living expenditures. It appears that in processing the 2004/05 survey data, living expenditure in one county may have failed to include in-kind consumption. 16 The data for three other households whose in-kind income was more than six times larger than their total living expenditure seem to have a similar problem. We re-estimated the impacts on consumption and income, dropping this one county and the three households. The results reported below were robust to this change (details available from the authors).
13 The Chinese and local language versions of the module were refined over time on the basis of field tests in poor villages in a number of locations.
14 A statistical addendum is available from the authors giving full details.
15 Note that we have baseline data for the "drop-outs" and the current year's data for the replacements. To deal with the time difference we did a pro-rata adjustment of the data on dropouts to 2004 values according to the ratios of the means over time for each variable, based on the balanced panel. In calculating the ratios, we also weighted by the attrition probability.
16 We suspect there is a problem because the total living expenditure of 68% of the sample in that county is equal to cash expenditure, whereas net in-kind income is about half of the overall total.
Another potential data problem is related to the coding of SWP projects. We find in the village-level project data base that all ten villages in one county claim to have a project funded by the SWP, even though six of them were officially designated as non-SWP villages. 17 It may well be that there was significant SWP participation by villages that had not been selected for the project in this particular county (although we cannot rule out coding errors). On deleting this county we found that our main results were robust (details are available from the authors).

Estimation methods and sources of bias
Our aim is to estimate average treatment effects on the treated. The double-difference ("difference-in-difference") method identifies a project's impact under the assumption that the selection bias (the counterfactual difference in outcomes) is constant over time and additive in its effect on outcomes. In the present context, we point to two sources of time-varying selection bias: (i) outcome changes are correlated with initial differences between the participating and non-participating areas, and (ii) spillover effects, whereby the project itself alters the subsequent path of outcomes for the nonparticipants.

Biases due to targeting
Let us begin with the classic evaluation problem. We have data on an outcome measure $Y_{it}$ for the i'th unit observed at dates t=0,1. Each unit is observed to be either a participant ($D_i = 1$) or a non-participant ($D_i = 0$). We can write the outcome measure as:
$$Y^T_{it} = Y^C_{it} + G_{it} \tag{1}$$
where $G_{it}$ is the gain ("impact"), $Y^T_{it}$ is the outcome under treatment and $Y^C_{it}$ is the counterfactual outcome. $G_{it}$ is not directly observable for any i (or in expectation) since we do not know $Y^C_{it}$ for $D_i = 1$. The selection bias is the mean difference in counterfactual outcomes (dropping the i subscripts):
$$B_t = E(Y^C_t \mid D = 1) - E(Y^C_t \mid D = 0) \tag{2}$$
We call this the unconditional bias, given that we have not yet allowed for control variables. Given the purposive targeting of the SWP it must be presumed that $B_t \neq 0$.
The standard double-difference estimator assumes that $B_0 = B_1$, implying that the change in mean gains for period 1 participants is consistently estimated by:
$$DD = E(Y^T_1 - Y^T_0 \mid D = 1) - E(Y^C_1 - Y^C_0 \mid D = 0) \tag{3}$$
If $G_{i0} = 0$ for all i (by definition), then $Y^T_{i0} = Y^C_{i0}$ for all i, and so $DD = E(G_1 \mid D = 1)$, i.e., mean impact on the treated units.
17 There are scattered minor reports of SWP activity in non-SWP villages elsewhere, but these appear to be random, and are probably coding errors.
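The double-difference calculation can be illustrated with a minimal sketch. The six-unit data set is hypothetical and is constructed so that the selection bias is constant ($B_0 = B_1 = -50$), every unit gains 20 in the absence of the project, and participants gain a further 30; the function name `double_difference` is likewise an illustrative assumption.

```python
from statistics import mean

def double_difference(y0, y1, d):
    """Mean outcome change of participants minus that of non-participants.

    y0, y1: outcomes at t=0 and t=1; d: 1 for participants, 0 otherwise.
    """
    change_treated = [b - a for a, b, di in zip(y0, y1, d) if di == 1]
    change_control = [b - a for a, b, di in zip(y0, y1, d) if di == 0]
    return mean(change_treated) - mean(change_control)

# Hypothetical data (not from the survey): participants start 50 below the
# comparison units, all units gain 20 anyway, participants gain 30 more.
y0 = [100, 110, 90, 150, 160, 140]
y1 = [150, 160, 140, 170, 180, 160]
d = [1, 1, 1, 0, 0, 0]

# Single-difference comparison at t=1, contaminated by the selection bias.
naive = mean(y for y, di in zip(y1, d) if di == 1) - mean(
    y for y, di in zip(y1, d) if di == 0
)
dd = double_difference(y0, y1, d)
print(naive, dd)
```

In this constructed case the single difference at t=1 has the wrong sign, because it mixes the impact with the selection bias, while the double difference recovers the assumed gain of 30.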
The assumption $B_0 = B_1$ is implausible for poor-area development programs. The targeted poor areas typically lack infrastructure and other initial endowments, which could (in turn) affect the subsequent growth rates. DD will then be a biased estimator, since the subsequent outcome changes are a function of initial conditions that also influenced the assignment of the sample between the two groups. In other words, the selection bias will not be constant over time. 18 The direction of bias in DD depends on whether the underlying growth process is convergent or divergent. For the government's poor-area programs in southwest China in the 1980s, Jalan and Ravallion (1998) found that failure to control for the initial heterogeneity between the targeted counties and non-participating counties yields a downward bias in a DD estimator, consistent with growth divergence. 19 However, it is unclear whether this also holds across villages within the same (poor) counties; indeed, the results of Jalan and Ravallion (2002) (also for southwest China) suggest that inter-county divergence can occur side-by-side with intra-county convergence.
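The direction of the bias can be seen in a stylized sketch. It assumes, purely for illustration, that the counterfactual change in the outcome is linear in the initial level (`slope * y0`), with a positive slope standing in for divergence and a negative slope for convergence; the function `dd_bias` and all numbers are hypothetical. In this setup DD equals the true impact plus the drift in the selection bias, $B_1 - B_0$.

```python
def dd_bias(slope, y0_treated=100.0, y0_control=150.0, impact=30.0):
    """Return (DD estimate, B_1 - B_0) when the counterfactual change is slope * y0.

    The linear growth rule and the default values are illustrative assumptions.
    """
    change_c_treated = slope * y0_treated   # counterfactual change, participants
    change_c_control = slope * y0_control   # observed change, comparison units
    dd = (change_c_treated + impact) - change_c_control
    b0 = y0_treated - y0_control
    b1 = (y0_treated + change_c_treated) - (y0_control + change_c_control)
    return dd, b1 - b0

for slope in (0.2, 0.0, -0.2):
    dd, drift = dd_bias(slope)
    # DD minus the bias drift recovers the assumed impact of 30 in every case.
    print(slope, dd, drift, dd - drift)
```

With the poorer group treated, the divergent case (positive slope) pushes DD below the assumed impact and the convergent case pushes it above, which is the pattern described in the text.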
We address this issue by balancing treatment and comparison units in terms of the initial conditions that may have influenced program placement. These variables are represented by the vector X. Our key identifying assumption is that the selection bias is time-invariant conditional on X, i.e., that:
$$E(Y^C_1 - Y^C_0 \mid X, D = 1) = E(Y^C_1 - Y^C_0 \mid X, D = 0) \tag{4}$$
18 This echoes more general concerns about the importance of correcting for selection bias based on observables (Rosenbaum and Rubin, 1983; Heckman et al., 1998).
19 Also see Jalan and Ravallion (2002) who find evidence of divergence at the county level.
On applying a result due to Rosenbaum and Rubin (1983), if outcome changes are independent of participation given X, then they are also independent of participation given the propensity score: $P(X) = \Pr(D = 1 \mid X)$. This justifies balancing on P(X) to remove selection bias based on X. Note that this only addresses time-varying selection bias based on observables; a bias will remain if there are any latent (time-varying) factors correlated with the changes in counterfactual outcomes. As discussed later, a remaining bias due to unobservables appears to be more likely for household selection than village selection.
We use various methods for assuring balance on P(X). One method is to limit comparisons to a trimmed sub-sample with sufficient overlap in propensity scores. For our data, the region of common support (minimum score for treated, maximum score for untreated) is (0.11, 0.95). For our "trimmed sample" we chose a slightly tighter interval (0.1, 0.9), which are also the efficiency bounds recommended by Crump et al. (2006) for estimating average treatment effects with minimum variance. 20 We also use the weighted-regression method proposed by Hirano, Imbens and Ridder (2003). Thus we estimate the DD from the following regression:
$$Y_{it} = \alpha + \beta D_i + \delta t + DD \cdot D_i t + \varepsilon_{it} \tag{5}$$
This is estimated with weights of unity for treated units and $\hat{P}(X)/(1 - \hat{P}(X))$ for controls, where $\hat{P}(X)$ is a consistent estimate of P(X). Hirano et al. show that weighting the controls this way yields an efficient estimator.
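A minimal sketch of the trimmed, weighted double difference follows. It takes the estimated propensity scores as given and assumes the control weight is the estimated odds, $\hat{P}(X)/(1-\hat{P}(X))$, the form used for effects on the treated. In a balanced two-period panel the weighted least squares coefficient on the interaction term in a saturated regression such as (5) reduces to the difference in weighted mean outcome changes, which is what the code computes. The function name `weighted_dd` and the data are hypothetical.

```python
def weighted_dd(dy, d, pscore, lo=0.1, hi=0.9):
    """Propensity-score weighted double difference on a trimmed sample.

    dy: outcome change (t=1 minus t=0) per unit; d: 1 for treated units;
    pscore: estimated propensity scores, taken as given here.
    Units with scores outside [lo, hi] are dropped before weighting.
    """
    keep = [i for i, p in enumerate(pscore) if lo <= p <= hi]
    treated = [dy[i] for i in keep if d[i] == 1]
    controls = [i for i in keep if d[i] == 0]
    # Treated units get weight one; controls get the estimated odds.
    weights = [pscore[j] / (1.0 - pscore[j]) for j in controls]
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(w * dy[j] for w, j in zip(weights, controls)) / sum(weights)
    return mean_treated - mean_control

# Hypothetical data: the last control falls outside the trimming interval.
dy = [30, 50, 40, 10, 20, 30, 5]
d = [1, 1, 1, 0, 0, 0, 0]
pscore = [0.8, 0.6, 0.5, 0.5, 0.2, 0.75, 0.05]
estimate = weighted_dd(dy, d, pscore)
print(estimate)
```

Controls that resemble the treated units (high scores) receive more weight, so the comparison group's mean change is pulled toward what similar untreated villages experienced.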
To interpret (5) note that, in a balanced panel, we could instead estimate the equivalent regression in the more familiar "fixed-effects" form:
$$Y_{it} = \eta_i + \delta t + DD \cdot D_i t + \nu_{it} \tag{6}$$
Here the fixed effect is $\eta_i$; the coefficient on the participation dummy in (5) picks up differences in the mean of the latent individual effects, such as would arise from initial selection into the program. The advantage of (5) is that it does not require a balanced panel, and hence it gives estimates that are robust to selective attrition (recalling that the replacements appear to have preserved the sample's ability to represent the population).
As a robustness check, we compare these estimates with matching on the propensity score. Note first that the sample estimate of mean impact can be written as:
$$DD = \frac{1}{N_T} \sum_{i=1}^{N_T} \Big( \Delta Y^T_i - \sum_{j=1}^{N_C} W_{ij} \, \Delta Y^C_j \Big) \tag{7}$$
where $\Delta Y$ denotes the change in the outcome between dates 0 and 1, $N_T$ is the number of SWP participants, $N_C$ is the number of control observations, and $W_{ij}$ is the propensity score-based weight given to the j'th non-participant in making a comparison with the i'th participant. How many non-participants to include in the control group and how to assign weights to each non-participant are practical questions in implementing PSM. One option is to use the popular method of nearest-neighbor matching. However, because of the non-smoothness of nearest neighbor matching, the conventional bootstrapping method is inappropriate for estimating the standard errors (Abadie and . In order to assure valid bootstrapped standard errors, we choose to apply nonparametric kernel matching in which all the non-participants are used as controls and weights are assigned according to a kernel function of the predicted propensity score (following Heckman et al., 1997, 1998), which requires choosing a kernel function and a bandwidth parameter. We use the normal density function as the kernel and match on the odds ratio (rather than the propensity score) because SWP villages are over-sampled relative to their frequency in the population eligible for the project.
The conditional independence assumption motivates a specification test of whether there are differences in observables between the project and non-project villages after conditioning on P(X) through matching or re-weighting. Following Rosenbaum and Rubin (1985) and Abadie and Imbens (2006) we test for covariate balancing using differences in standardized means between the SWP villages and matched or re-weighted non-SWP villages. To achieve a better balance of covariates and to allow for a more flexible estimate of propensity scores, we also include polynomial terms for the initial income levels (see, for example, Smith and Todd, 2005).
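A kernel-matching estimate of this kind can be sketched as follows. The sketch uses a normal kernel on the odds ratio and normalizes the kernel values so that the weights for each participant sum to one; the normalization, the default bandwidth of 0.5, the function name `kernel_matching_dd` and the data are all illustrative assumptions rather than the settings used in the paper.

```python
import math

def kernel_matching_dd(dy, d, pscore, bandwidth=0.5):
    """Kernel-matched double difference with a normal kernel on the odds ratio.

    dy: outcome change per unit; d: 1 for participants; pscore: estimated
    propensity scores. Every non-participant serves as a control for every
    participant, with weights that fall off as the odds ratios move apart.
    """
    odds = [p / (1.0 - p) for p in pscore]
    treated = [i for i, di in enumerate(d) if di == 1]
    controls = [j for j, dj in enumerate(d) if dj == 0]
    gains = []
    for i in treated:
        # The normal density's constant cancels once the weights are normalized.
        k = [math.exp(-0.5 * ((odds[i] - odds[j]) / bandwidth) ** 2) for j in controls]
        counterfactual = sum(kj * dy[j] for kj, j in zip(k, controls)) / sum(k)
        gains.append(dy[i] - counterfactual)
    return sum(gains) / len(gains)

# Hypothetical data for two participants and three non-participants.
dy = [30, 50, 10, 20, 30]
d = [1, 1, 0, 0, 0]
pscore = [0.6, 0.5, 0.5, 0.2, 0.4]
print(kernel_matching_dd(dy, d, pscore))
```

As the bandwidth grows, the weights become equal and the estimate approaches the unmatched double difference; a small bandwidth concentrates weight on the closest controls. Standard errors would be obtained by re-running the whole procedure, including propensity-score estimation, on bootstrap resamples.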
We will show that the matching and re-weighting procedures produce a satisfactory balancing of the observables between SWP and comparison villages.
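A balancing test of this kind can be illustrated with a standardized difference in means, in the spirit of Rosenbaum and Rubin (1985). The sketch below is an assumption-laden example: the covariate, the scores, and the helper names are hypothetical, and comparison villages are re-weighted by the odds ratio of their propensity scores.

```python
import math

def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def standardized_difference(x_treated, x_control, control_weights=None):
    """100 * (treated mean - weighted control mean) / pooled unweighted SD."""
    if control_weights is None:
        control_weights = [1.0] * len(x_control)
    mean_t = sum(x_treated) / len(x_treated)
    mean_c = weighted_mean(x_control, control_weights)
    pooled_sd = math.sqrt(
        (sample_variance(x_treated) + sample_variance(x_control)) / 2.0)
    return 100.0 * (mean_t - mean_c) / pooled_sd

# Hypothetical covariate: share of village land that is hilly.
x_swp = [0.8, 0.7, 0.9, 0.6]
x_non_swp = [0.3, 0.5, 0.7, 0.8]
scores_non_swp = [0.2, 0.4, 0.6, 0.7]          # hypothetical propensity scores
odds_weights = [p / (1.0 - p) for p in scores_non_swp]

raw_gap = standardized_difference(x_swp, x_non_swp)
weighted_gap = standardized_difference(x_swp, x_non_swp, odds_weights)
print(round(raw_gap, 1), round(weighted_gap, 1))
```

In this made-up example the re-weighting pulls the comparison mean toward the SWP mean, so the standardized difference shrinks; a satisfactory balance corresponds to small differences for all covariates.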
Biases due to spillover effects
All the methods described above assume that an observationally similar comparison group pre-intervention reveals the counterfactual of what would have happened over time to mean outcomes for the treatment group in the absence of the intervention. This will clearly not be the case if there are any spillover effects, whereby the intervention changes outcomes for non-participants.
Spillover effects due to residential mobility between villages are unlikely in this setting given the village-level administrative land allocation. Under China's rural land laws, a migrating household would have little prospect of getting a share of the land available (and almost certainly cultivated) at the destination and would also risk losing their land at the origin.
Another source of spillover effects is inter-village trade (possibly via urban hubs).
To the extent that the project has an impact on local incomes and prices, trade-induced general equilibrium effects will entail spillover effects to the non-SWP villages used to infer the counterfactual. We will test for impacts on prices as well as incomes, distinguishing cash-incomes (as derived from inter-village trade) and incomes-in-kind.
Local public spending responses to project aid can also be confounding. Recall . The local government has a preference ordering over its spending allocation across the two sets of villages and its spending on all other activities, denoted Z. The preference ordering is represented by: and this function is strictly increasing in all three elements, and strictly concave in all three; it simplifies the analytics if we also assume that the function is additively separable, though this can be weakened. The local government maximizes W subject to its local revenue constraint, which creates an upper bound on GOV+Z. Under these assumptions we have the following result (that is proved in Appendix 1):
Proposition: The external aid will displace local government spending in the project villages, increase spending in the comparison villages, but decrease total local government spending across both sets of villages.
The implication for our evaluation is plain: Comparing outcome changes over time between SWP and (matched) non-SWP villages in the same counties will under-estimate the project's true impact.
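The displacement logic of the proposition can be illustrated with a worked special case. The functional form below is an assumption made purely for illustration and is not taken from the paper: additively separable log utility, with aid entering as a perfect substitute for local spending in the project villages. The function and variable names are hypothetical.

```python
def optimal_allocation(revenue, aid):
    """Closed-form optimum for W = ln(g_sw + aid) + ln(g_nsw) + ln(z)
    subject to g_sw + g_nsw + z = revenue (interior solution only).

    The first-order conditions equalize the three arguments of the logs,
    so each equals (revenue + aid) / 3.
    """
    if revenue <= 2.0 * aid:
        raise ValueError("interior solution needs revenue > 2 * aid")
    level = (revenue + aid) / 3.0
    return {"g_sw": level - aid, "g_nsw": level, "z": level}

before = optimal_allocation(revenue=300.0, aid=0.0)
after = optimal_allocation(revenue=300.0, aid=60.0)

displaced_in_project_villages = before["g_sw"] - after["g_sw"]
gain_in_comparison_villages = after["g_nsw"] - before["g_nsw"]
change_in_total_gov = (after["g_sw"] + after["g_nsw"]) - (before["g_sw"] + before["g_nsw"])
print(displaced_in_project_villages, gain_in_comparison_villages, change_in_total_gov)
```

With these hypothetical numbers, local spending in project villages falls, spending in comparison villages rises by less than the amount displaced, and total local government spending across both sets of villages declines, which is the pattern the proposition describes.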
We will test for spillover effects. The presence of non-SWP development projects in the SWP villages provides the clue. We use the same evaluation methods described above, but the "outcome variable" becomes the extent of non-SWP project activity in the SWP villages. The theoretical result in the above proposition will be exploited in determining an upper bound to the bias induced by spillover effects,  Chen, 2005, at 1993 purchasing power parity) as well as poverty lines above and below this figure. We see that the income gains in SWP villages between 1996 and 2000 were larger than among non-SWP villages, but that this reverses between 2000 and 2004/05. Ten years after its commencement, the SWP does not appear to have allowed the selected poor villages to catch up with the rest of these (poor) counties. Table 1 suggests that SWP had little or no impact on income and consumption.

Estimated impacts
However, before accepting that conclusion we need to probe more deeply into the potential sources of bias described in the previous section. We begin with selection bias due to non-random placement of the SWP. At the end of this section we test for bias due to contaminating spillover effects. Table 2 gives probits for whether a village was selected for SWP, as used to estimate the propensity scores. The variables were chosen to reflect the selection criteria used by the project staff (based on our interviews at the time).

Probits for selection into the SWP
We find that project villages tend to be in more hilly/mountainous areas, are less likely to have electricity, less likely to have a school in the village or nearby, though more likely to have a health clinic within the village relative to nearby. 22 The SWP villages also tend to have larger populations, with lower mean income in 1995 (from the village-level data), lower mean consumption in 1995 (from the household survey) and more land per capita. The latter characteristic probably reflects lower population density and lower land quality in the project villages. In most respects, the results of Table 2 suggest that the SWP villages tend to be poorer than other villages within the project counties, consistently with Table 1.
Using the propensity scores based on Table 2 to re-weight the data we were able to obtain a close balancing of the characteristics of the two samples (including in the means of the initial outcome variables), particularly after trimming the samples, as discussed in the previous section. Appendix 2 provides details on the balancing tests, which pass comfortably; this was also the case for a full set of covariates in Table 2, for which the balancing tests are reported in the Addendum available from the authors.
22 Remote villages are more likely to have a very basic health clinic, to compensate for the inaccessibility to more comprehensive township facilities.

Double-difference estimates of average impacts
In assessing impacts on mean consumption and income, we begin with the simple DD estimates of the mean impacts for income, consumption and saving, as given in Table 3. We give estimates for both 2000 (at the end of disbursements) and 2004/05 and for both the levels and the logs; the latter gives higher weight to the gains to poorer households. The baseline is 1996 in both cases.
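As an illustration of the simple DD calculation in levels and logs, the following sketch uses made-up per capita incomes; the function name and all numbers are hypothetical. It works on group means, so the two survey rounds need not contain the same households.

```python
import math

def mean(values):
    return sum(values) / len(values)

def simple_dd(treat_base, treat_end, comp_base, comp_end, transform=lambda v: v):
    """Change in the project-group mean minus change in the comparison-group mean."""
    treat_change = (mean([transform(v) for v in treat_end])
                    - mean([transform(v) for v in treat_base]))
    comp_change = (mean([transform(v) for v in comp_end])
                   - mean([transform(v) for v in comp_base]))
    return treat_change - comp_change

# Hypothetical incomes per person (yuan) at baseline and at the end date.
swp_base, swp_end = [500.0, 800.0, 1200.0], [800.0, 1100.0, 1400.0]
non_base, non_end = [600.0, 900.0, 1300.0], [750.0, 1050.0, 1500.0]

dd_levels = simple_dd(swp_base, swp_end, non_base, non_end)
dd_logs = simple_dd(swp_base, swp_end, non_base, non_end, transform=math.log)
print(round(dd_levels, 2), round(dd_logs, 4))
```

The log version measures proportionate gains, so a given yuan gain counts for more when it accrues to a household with a lower initial income.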
Focusing first on the disbursement period, we see a sizeable and statistically significant impact on income but not consumption; the bulk of the income gain was saved. (The same pattern was found using 1995 as the baseline.) On decomposing income (as wage income, farming, animal husbandry, fishery, forestry, non-farm enterprises, transfers and asset income), the only component that showed a statistically significant impact was animal husbandry, for which the simple DD impact on net income was 90.85 yuan (t=2.92), which rose to 117.26 (t=3.37) and 136.15 (t=3.55) using weighting and matching (respectively) to correct for selection bias (Table 4). 23
Another way of disaggregating income is into cash or kind (which will be relevant when we consider trade spillovers in section 5.4). We found that the bulk of the short-term income impact was income in-kind from animal husbandry, as is evident from Table 4. This is puzzling, as a sizeable share of income-in-kind from husbandry in a rural economy is also consumed directly, and should then show up in consumption. However, the income in-kind that is being affected by the project appears to be small nonproductive animals and new litters of productive animals, which are counted as income in kind but are held over for consumption or sale at a later date rather than consumed. 24 We will return to this point when we discuss the longer-term impacts.
23 We only report the results for husbandry, and summarize those for other components; a statistical addendum is available with full details.
We can also disaggregate consumption expenditure. On separating food staples (rice, wheat etc) from non-staples and other foods we found significant impacts in 2000 for non-staple foods (meat, vegetables etc); the simple DD for this category was 26.26 yuan (t=1.68) though rising to 40.64 yuan (t=2.69) and 42.58 yuan (t=2.70) for the PS weighted and kernel matched estimators respectively. This is likely to entail nutritional gains through higher protein and more micro-nutrients.
The results change dramatically when we track the impacts through to 2004/05, as is evident when we return to Table 3. We find no significant impacts on mean income or consumption over the longer observation period. 25 (This was also true for staples and non-staples separately.) Table 3 also gives the DD estimates for mean income using the propensity scores to balance project and non-project villages; we give results using both weighting and matching, for both end dates, and for both the trimmed sample and total sample. The basic pattern in the simple DD estimates is still evident. The results are robust to using kernel matching instead of the re-weighted regression method. 26 While there is clearly some sensitivity to the choice of estimation method, the pattern is still reasonably robust, indicating significant and sizeable income gains during the disbursement period but much less in the longer term. The estimated income gains in 2000 tend to be larger when we correct for purposive selection of SWP villages; this is consistent with a divergent growth process between villages. However, no such pattern is evident for the 2004/05 impacts.
We did find significant longer-term impacts on income in-kind. On breaking up income in-kind by source, we found that both farming and husbandry accounted for almost all these long-run impacts, though only husbandry was significant.
25 The same pattern was evident using 1995 as the baseline, although impacts were somewhat lower.
26 The results were also robust to deleting the troublesome county and the observations with problematic data (section 3).
In contrast to the period up to 2000, we find consumption gains in the post-disbursement period. The impact on total consumption in 2004/05 is not statistically significant (Table 3). However, when we break this up according to cash or kind, we do find signs of larger impacts on consumption in kind. The simple DD estimate for consumption in kind in 2004/05 is 118.40 yuan (t=2.54), although this drops appreciably when we correct for selection bias; using PS weighting the impact is 74.46 (t=1.50). The longer-term impacts on consumption in kind probably include consumption of the income in-kind from animal husbandry that we observed in the SWP disbursement period. Table 3, although this is not true for the kernel-matched DD for which the post-disbursement consumption gains equal the increment to permanent income at a rate of interest of about 10%. Statistically, however, we cannot reject the null hypothesis that the post-disbursement consumption gain equals the increment to permanent income (at reasonable interest rates) treating the SWP income gain as transient.
The PIH interpretation begs the question as to why we saw no consumption gains in the disbursement period. If SWP participants knew at the outset that the project would entail only a transient income gain then consumption would have immediately reflected the implied gain to permanent income. However, from what we know about the SWP, it is unlikely that participants could have formed a reliable estimate of the gain to permanent income due to SWP until at least project completion. As noted in section 2, there was considerable uncertainty about the income gains, and high initial savings may have been a short-term precautionary response.
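The permanent-income reasoning can be made concrete with a small calculation. The sketch assumes the simplest possible case, in which a saved one-off gain is held indefinitely and only the interest is consumed; the function names and the figures are hypothetical and are not taken from the paper's tables.

```python
def permanent_income_increment(transient_gain, interest_rate):
    """Annual consumption that a saved one-off gain can sustain indefinitely
    (interest only, principal untouched)."""
    return interest_rate * transient_gain

def implied_interest_rate(consumption_gain, transient_gain):
    """Interest rate at which a consumption gain equals the permanent-income increment."""
    return consumption_gain / transient_gain

# Hypothetical saved income gain per person (yuan) during disbursement.
saved_gain = 500.0
for rate in (0.05, 0.10):
    print(rate, permanent_income_increment(saved_gain, rate))

# A hypothetical later consumption gain of 50 yuan implies a 10% rate.
print(implied_interest_rate(50.0, saved_gain))
```

Under this simplification, comparing the observed longer-term consumption gain with the earlier saved income gain yields the interest rate at which the two are mutually consistent.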
We found no evidence of impacts on interest payments on loans or the proportion of households paying interest or paying back loans, for either 2000 or 2004/05. 27 So we find no support for the idea that either the high savings from the short-term gains or the lower longer-term impacts on incomes stem from greater enforcement of interest or repayment requirements under the SWP, compared to other credit sources.
27 Again we only summarize the results here; the addendum gives full details.
With weak enforcement of the SWP loan repayments, it might be conjectured that taxes on SWP areas would increase, to help local authorities pay back the SWP loans to higher levels of government. However, we did not find any evidence of impacts on taxes or fees paid per capita, in either 2000 or 2004. It appears that higher levels of government treated the SWP as, in large part, a transfer payment to lower levels.
In testing for impacts on agricultural productivity, we used total farm income per unit area. 28 We found no evidence of impacts. Nor did we find much evidence of impacts on holdings of productive assets and wealth (including housing). This was true for both the disbursement period and the longer-term. An exception is that the village data base revealed a significant impact on livestock holdings, notably cows and goats. 29
There is some sign of a demographic impact. Household size fell in both SWP and non-SWP villages over 1996-2000, but more so in the former. The simple DD for household size is -0.13 persons (t=-1.75) and it is slightly larger with the corrections for selection bias (the PS-weighted estimate is -0.16, t=-1.64, and it was similar for kernel matching). The demographic effect was associated with slightly fewer children.
However, the demographic impact was not evident in 2004.
Nor did we find any evidence of impacts on remittances received from family members migrating out, or on the probability of a family member migrating. 30
We did find significant impacts on school enrolment rates during the disbursement period; our PS-weighted DD estimate was 0.074 (with a t-ratio of 2.20), i.e., a 7.4 percentage point increase in the school enrollment rate of children aged 6-14 by the year 2000. By 2004/05 the corresponding DD estimate fell to 0.032 (t=1.00). The transient schooling impact probably reflects the fact that the tuition subsidies ended with other SWP disbursements.
28 Ideally we would use physical output for a given crop per unit area under its cultivation. However, only total land area under cultivation was collected. Instead we used an overall farm productivity measure, obtained by dividing total net income from farming by total cultivated area; this can be interpreted as a mean of crop-specific yields weighted by both prices and shares of land.
29 The simple DD for cows per person in 2000 was 0.05 (t-ratio=2.47); with score-weighting it rose to 0.07 (t=3.54) and it was the same with kernel matching (t=4.33). By 2004 the impacts were slightly higher and equally significant statistically; the simple DD estimate was 0.07 (t=3.69) while with score-weighting the impact was 0.09 (t=4.05) and with kernel matching it was 0.10 (t=3.92). Significant impacts were also evident for sheep, although with lower t-ratios.
30 Out migration in the previous year is only measured for those present in the village at the time of the interview, although NBS made an effort to ask the individual questions at times of the year when migrants are more likely to be present. Remittances may well be the better indicator.
Of course, even though the non-SWP villages caught up substantially with the SWP villages in schooling by 2004/05, there were children in SWP villages who entered school earlier than they would have without the SWP, and this will probably yield future income gains.
There was almost no sign of impacts on the prices of agricultural outputs and purchase prices for inputs for 13 items. 32 We found positive impacts during the disbursement period for a number of types of infrastructure, although they are generally not statistically significant. We found little sign of impacts in the 2004/05 data. The exception was TV reception, which showed significant impacts in the longer-term as well as during the disbursement period.
Again we give estimates using the poverty line of 808 yuan per person per year as well as selected poverty lines above and below this figure. 33 The poverty impacts in the SWP disbursement period are broadly consistent with our findings for the impacts on the mean income and consumption in Table 3. 34 In Figure 1 we also give the results graphically, by plotting the DD estimate of the impact on the headcount index of poverty (for income and consumption poverty in panels (a) and (b) respectively) against the poverty line, which we vary over virtually the whole distribution. Impacts on the income poverty rate are largest just below the 808 poverty line, for both end dates. The impacts on consumption poverty echo our results for mean consumption around the middle of the range of poverty lines, where 2004/05 consumption-poverty impacts exceed those for 2000; the results imply a sizeable nine percentage point drop in the consumption poverty rate at poverty lines around 600 yuan. However, this is not true at lower and higher lines, where impacts over the two time periods agree fairly closely.
31 The uncorrected DD was 0.046 (t=1.41) and the kernel matched DD was 0.072 (t=2.40).
32 The only exceptions were that diesel oil had a significantly higher price in the SWP villages by 2004/05 and edible oil crop had a slightly lower price.
33 The table only gives results for the trimmed sample, which is better balanced. However, although the precise estimates differ between the two samples, the basic pattern was the same, and our main conclusions do not depend on this choice.
34 The results were also robust to deleting the county in which some SWP activity was recorded in non-SWP villages. We found an impact on extreme consumption poverty in 2004 after deleting the consumption outliers. (The weighted DD at the 500 consumption poverty line is -8.06 with a t-ratio of -1.72; the weighted DD at 600 is -9.20 with a t-ratio of -1.67.)
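The poverty-impact calculation, a DD on the headcount index evaluated at a range of poverty lines, can be sketched as follows. The data and function names are hypothetical, the estimate is unweighted, and no correction for selection bias is applied.

```python
def headcount(values, line):
    """Percentage of observations strictly below the poverty line."""
    return 100.0 * sum(1 for v in values if v < line) / len(values)

def dd_headcount(t_base, t_end, c_base, c_end, line):
    """Change in the project-village poverty rate minus the change in the
    comparison-village poverty rate, in percentage points."""
    return ((headcount(t_end, line) - headcount(t_base, line))
            - (headcount(c_end, line) - headcount(c_base, line)))

# Hypothetical consumption per person (yuan per year).
swp_base = [400.0, 550.0, 700.0, 900.0, 1100.0]
swp_end = [520.0, 650.0, 850.0, 1000.0, 1300.0]
non_base = [450.0, 580.0, 750.0, 950.0, 1200.0]
non_end = [480.0, 640.0, 800.0, 1050.0, 1300.0]

for line in (500.0, 600.0, 808.0, 1000.0):
    print(line, dd_headcount(swp_base, swp_end, non_base, non_end, line))
```

Tracing the estimate across many poverty lines, as in this loop, is what produces a curve of impacts against the poverty line; a negative value indicates a larger fall in the poverty rate in project villages.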
For all of the above impact estimates, the counterfactual is the absence of the SWP. There is an alternative counterfactual of interest, namely the absence of direct participation in any anti-poverty program, including the government's programs. For identifying this counterfactual we can use those households in non-SWP villages who did not participate in any other program; this applied to 69% of the households in non-SWP villages. So we repeated the above calculations dropping those who recorded any direct participation in other programs. (The balancing tests passed comfortably.) The impacts for 2000 were similar to those above. However, the long-run impacts on mean income and consumption were larger. For example, the simple DD estimate of the impact on mean income in 2004 rose to 125 yuan per person (as compared to 45 yuan in Table 5) although this fell to 99 yuan when we corrected for selection bias using PS weighting.
Nonetheless, the impacts relative to this alternative counterfactual were still not significantly different from zero; for example, the t-ratio on the simple DD for mean income was 1.47, which dropped to 1.13 with PS weighting.

Heterogeneity in impacts
We tested for differences in impacts according to the initial values of income, education and ethnicity. 35 The score-weighted DD's were not significantly different for any of our outcome variables when we stratified by education or ethnicity. However, we found a notable difference when stratified by initial income (above or below the median), with significant longer term gains for the low-income group. When we interacted income with education we found that the longer-term gains were strongest for the relatively well educated (at least junior high school) amongst the low-income households, as can be seen in Table 6.
The heterogeneity in returns suggests that a different assignment of the loans would have increased overall impact. The household participation rate was slightly higher for the group of relatively poor but well educated households; 61.1% of this group in SWP villages participated, as compared to 58.8% of those with above median income and higher education, 50.0% of those with high income but low schooling, and 47.8% of those with both low income and schooling. (The program slightly favored better educated households both above and below median income.) Suppose that beneficiary selection had focused solely on the relatively well-educated poor, and saturated this group, with no change to conditional mean impacts by subgroup, which were zero for other groups (consistently with Table 6). Then the impact of the program as a whole would have risen substantially, from a mean impact of about 40 yuan per person to about 150 yuan. 36 To achieve this outcome, the program would have had to over-ride the community-based selection process, which evidently put too little weight on reaching the educated poor, even though this group was already favored in the selection process.
35 We distinguish Han Chinese from all other ethnic minorities. The ICR points to concerns about how well ethnic minorities were reached by the SWP (World Bank, 2003).
While we found no impacts on average remittances and out-migration, significant positive impacts were evident when we stratified by initial income and education; the impacts were significant for those who were initially above median income and (among those with above-median income) were larger for those with more schooling.

5.4 Are we underestimating the impacts due to spillover effects?
Biases in long-term impact estimates can arise from interference due to spillover effects, as discussed in section 4.2. Our results do not offer much support to the idea of trade-induced spillover effects. We have seen that there were no significant impacts on prices, although it might be argued that arbitrage eliminated any price differentials. More damaging to the notion that there were significant trade-spillovers across villages is the fact that we did not find significant impacts on cash income, even during the disbursement period; the short-term income gains were in kind, and mainly from animal husbandry. Since inter-village trade is likely to involve cash, there must be a presumption that such trade was affected rather little by SWP.
also give counts of the total number of beneficiary households. However, we cannot tell what happened in the post-disbursement period since it was only possible to collect the project data we use for these calculations during the SWP disbursement period. Table 7 gives the results for various project activities. 37 Large displacement effects are evident for virtually all non-SWP activities. 38 For most categories, the mean in SWP villages is half or less that in non-SWP villages, implying that 40% or more of the non-SWP spending allocation to SWP villages was cut, and re-allocated to non-SWP villages. 39 Such large displacement effects would imply that the benefits of the SWP are likely to have spilled over to our comparison villages, leading us to under-estimate the impacts of SWP.
(Table 5), scaled down by 25% to reflect the number of households in this group, which would then represent 75% of the total number of SWP participants.
How large is the bias in our estimates of the impact on income due to these spillover effects? We shall assume that the displacement is entirely within the same county; that is plausible given that the county government is the key decision maker in the sub-county allocation. Invoking the theoretical result in section 4.2, we expect that total government spending (in both project and comparison villages) will also fall. In other words spending is expected to rise in the comparison villages by less than the amount that had been displaced in the project villages. To determine an upper bound to the bias we can assume that the increase in spending in the comparison villages exactly equaled the displaced spending in the project villages. In this case we will be overestimating the bias due to spillover effects.
To help throw light on the likely magnitude of bias due to spillover effects, let GOV denote the spending done under the government's own program, expressed as spending per capita of the total population. Some of this spending is done in SWP villages and some is in the non-SWP villages:
$$GOV = w\,GOV^{SW} + (1-w)\,GOV^{NSW}$$
where $w$ is the population share of the SWP villages while $GOV^{SW}$ and $GOV^{NSW}$ denote the observed (post-SWP) levels of government spending in SWP and non-SWP villages respectively (per capita of the relevant population). We assume that in the absence of the SWP there would be no difference in the level of the government's spending between these two types of villages. The amount of displacement of non-SWP spending in SWP villages that is attributed to the SWP is then .
37 The main activities excluded are minor infrastructure projects none of which showed any significant displacement. When there is no response from a village for a specific activity we treat it as a zero; this is plausible, although we test robustness to treating it as a missing value.
38 We repeated these tests using the total samples and treating all cases in which no entry was made as missing values. The results in Table 9 were reasonably robust. (The effects tended to be stronger under the alternative treatment of "no response" entries.)
39 Recall that about one quarter of villages in SWP counties received the aid project, so that a non-SWP village will receive, on average, one third of the displaced spending.
The bias in the double-difference estimate is where is the income rate of return to the government's projects.
The true impact is thus: On noting that where is the true rate of return to the SWP and the external aid-financed investment is per capita in the SWP villages, we can then derive the following formula for the proportionate bias: and where and There will be no bias if there is no displacement (k=1), or the SWP is negligible in size (w=0) or the rate of return to the displaced government investment is zero ($R^{NSW} = 0$). However, this is still not a usable formula for determining an upper bound for the bias since the measured rate of return to SWP spending will also be contaminated by the spillover effect. (We assume that the bias due to the local-spending spillover effects induced by the external aid only contaminates estimates of the rate of return to that aid.) The true rate of return is . Substituting into (9) and solving we have: Note that if is the per capita government spending displaced from SWP villages then is the corresponding gain (per capita) in the non-SWP villages. and Note that and are the measured income gains where the * denotes the values without the spillovers. Also note that and . The following result is then easily derived.
What are seemingly plausible values for the parameters of (10)? Jalan and Ravallion (1998) estimated an average rate of return of 12% for the Government's poor area development program in the same region of China over 1985-90. Using different methods, Park et al. (2002) also estimate a rate of return to the Government's national poor-area program of 12% in the period 1992-95. Using the same data, and similar methods to the present study, Ravallion and Chen (2005) estimated that the rate of return to the SWP spending during the disbursement period was . So we set . One-quarter of villages in the poor counties participated in SWP, so w=0.25. Based on Table 7 we can take k=1/3 to be a reasonable lower-bound (noting that is strictly increasing in k). yuan per person, from 130 yuan. In principle, the consumption gains could also be biased, although, given that we find virtually zero (indeed negative) consumption impacts in the disbursement period, our conclusion that the income gains were fully saved remains unaffected.
The more interesting question concerns the post-disbursement period. Recall that the tests for displacement in Table 7 do not cover the post-disbursement period. It might be expected that the local spending balance between the treatment and comparison villages would be restored once the external aid ceased. Although the data used in Table 7 are not available for 2004/05, we can at least test for long-term impacts on new loan activity from non-SWP sources, as an indication of whether the SWP displaced other sources of finance in the post-disbursement period. (In 1995 we know who had received SWP loans so we can net this out of total loans received. Of course, in 2004/05 there were no new SWP loans.) By these calculations, we found no significant impacts on non-SWP loans in 2004/05. This does not suggest there was long-term displacement of other sources of finance.
42 Using the project data base to compare average loan amounts for non-SWP in SWP villages with those in non-SWP villages gives k=0.58.
While the displacement effect is presumably greater in the disbursement period, it cannot be ruled out post-disbursement. If there are in fact longer-term gains from the SWP and this is known locally then continuing positive displacement will be expected, making it harder to identify those gains. However, even the upper bound to the bias derived above of is well short of being sufficient to imply a significant long-term impact on mean income; assuming that the standard error is not biased by the spillover effect, one would need to quadruple the income gain in 2004 before it could be deemed statistically significant.

Conclusions
The longer-term impacts of aid to poor areas depend crucially on why these areas are poor in the first place. If persistently poor areas arise from generalized capital-market failures then external aid can relieve the credit constraints and so enhance long-run growth. If instead the credit market failures are specific to certain (liquidity-constrained) subgroups of the population then the aid will need to be targeted to those groups.
However, persistently poor areas can arise from other causes, such as governance failures or (possibly policy-induced) distortions in other markets (including labor, such as due to restrictions on migration). Heterogeneity in impacts can also interact with the beneficiary selection process in a way that attenuates the aggregate impact.
So the benefits from extra aid to poor areas may well be modest. Unfortunately, the absence of rigorous studies of the long-term impacts of aid to poor areas has left a gap in our knowledge about both the causes of geographically concentrated poverty and aid effectiveness.
To help fill this gap in knowledge, we have used a specially designed set of high-quality surveys collected over a 10 year period to study the impacts of a World Bank-financed poor-area development program in southwest China. We find a sizeable and statistically significant impact on mean household income in the participating villages during the disbursement period. However, there was a much smaller impact on consumption during that period; the short-term income gains were largely saved (although with some improvements in diet quality). Four years after disbursements had ended, both project and non-project villages had seen sizeable economic gains, with only modest net gain to mean income attributed to the project. Indeed, we cannot reject the null hypothesis that the longer-term average impact was in fact zero, although we do find evidence of longer-term impacts on income in-kind from animal husbandry.
The most plausible interpretation of our findings appears to be as follows. The We highlight three findings that raise broader issues for development programs.
First, heterogeneity in impacts can play an important role in explaining poor overall outcomes. We find that there were significant and lasting income gains among the subset of households who were initially poor and relatively well educated. Presumably these households had more productive investment options, which could not be financed otherwise given the liquidity constraints facing the poorest. The program's community-based selection process favored the better educated, but expanded coverage of those who were also poor could have greatly enhanced the program's overall impact. Given the heterogeneity in returns, the implied (ex-post) deficiencies of the community-based selection process help explain the program's disappointing overall impact. While the program performed well in selecting poor villages, overall impacts were greatly attenuated by inadequate coverage of the (educated) poor within poor villages.
This finding points to a potentially serious trade-off facing such programs. The desirability of more participatory processes of local beneficiary selection may well come at a large cost to overall impacts, including on poverty. To assure larger impacts one would need to over-ride this process by dictating the types of households that should be targeted, based on the likely benefits to them. (In the program studied here, it appears that the presence of complementary skills and knowledge, as proxied by education, was crucial to the impact.) Whether that is feasible or not in practice is a moot point.
Second, our results point to the importance of taking account of the participants' inter-temporal behavior, such as in response to the uninsured risks often associated with a development project. Those responses can cloud impacts in both experimental and non-experimental evaluations. An evaluation that focused solely on the income or consumption gains during the disbursement period (as is commonly the case) can give a deceptive picture of the true impacts.
Third, our findings illustrate how the responses of local development agents can cloud identification of the long-term impacts of geographically-placed projects (whether randomly placed or targeted). We found evidence of positive spillover effects on the comparison villages through the displacement of other development spending during the program's disbursement period. Such interference suggests that the classic impact evaluation methods will systematically underestimate the impact. In our case, the biases could well be substantial, although it is unlikely that these effects are imparting a sufficiently large bias on our impact estimates (under seemingly plausible assumptions) to overturn our main qualitative results. But this may well be a bigger problem in other settings.

Appendix 1: Proof of the proposition in Section 4.2
The problem is to maximize ) , , , where R is the local government's revenue. The first-order conditions for an optimum require that: