World Bank Employment Policy Primer December 2002 No. 2 Impact Evaluation Techniques for Evaluating Active Labor Market Programs * Background There are many different types of evaluations: Over the past 40 years, "active" labor market pro- grams (ALMPs) have emerged as an important employ- process evaluations focus on how a program oper- ment policy tool. Their objective is primarily economic ates and on activities undertaken in delivery; ­ to increase the probability that the unemployed will performance monitoring provides information on find jobs or that the underemployed will increase their the extent to which specific program objectives productivity and earnings. ALMPs include job search are achieved (e.g. number of unemployed assistance, training and retraining, and job creation pro- trained); and grams (public works, micro-enterprise development, impact evaluations focus on the issue of causality and wage subsidies). With economic reform, increasing to see whether a program has its intended impact liberalization of markets and growing concerns about (e.g. percent increase in employment and wages the problems of unemployment, ALMPs have increas- attributable to program) and which characteris- ingly become an attractive option for policymakers. tics of the program led to the impact. Expenditure on these programs has, however, not increased substantially over the 1990s, remaining fairly This note will focus on impact evaluations of constant at around 0.7% of GDP. This reflects to some ALMPs. It will discuss the objectives and importance of extent the ambivalence of policymakers about the effec- rigorous evaluations, highlight commonly used impact- tiveness of ALMPs. A frequently asked question is, "Are evaluation techniques, and discuss who should conduct these programs effective?" Attempts have been made in evaluations. OECD countries to answer this question through rigor- ous evaluations that compare outcomes for individuals Uses of Impact Evaluations who participate in the program (treatment group) with The purpose of evaluations of ALMPs is to examine those of a similar group of individuals who did not the effectiveness of programs against their stated objec- receive the program (control group). However, such tives. On the basis of this, evaluation can then be used to: analysis has been mostly lacking in developing coun- tries. Part of the problem lies in the lack of an evaluation Help design new programs culture in many countries, often due to low capacity for Refine program design evaluation. Policymakers may not be conversant with Improve program targeting the importance of conducting evaluations and the tech- Identify ineffective programs niques used to conduct such evaluations. *This note was prepared by Amit Dar and edited by Tim Whitehead. The World Bank Employment Policy Primer aims to provide a comprehensive, up-to-date resource on labor market policy issues. The series includes two products: short notes, such as this one, with concise summaries of best practice on various topics and longer papers with new research results or assessments of the literature and recent experience. Primer papers and notes are available on the labor markets website at or by contacting the Social Protection Advisory Service at (202) 458-5267 or by email at . I m p a c t E v a l u a t i o n Improve Program Targeting. Evaluations can enable BOX 1: EVALUATION OF PILOT PROGRAMS policymakers to make informed decisions about which target groups benefit most from programs, resulting in Pilot demonstrations have proven extremely effective targeted programs and enhanced program performance. in the U.S. for testing new programs. Many policies and programs were first tested in a small number of sites For example, in the Czech Republic and in Turkey, an before being promoted by policymakers for national evaluation was designed to test the efficiency of voca- implementation. Evidence from these pilot tests (usu- tional training for the unemployed. Evidence from the ally based on experimental design evaluations) is often evaluation indicated that vocational training was more used to convince legislators to approve the national implementation of new programs. effective for women than for men ­ especially in relation to earnings. This led to the program being more tightly One example of the successful use of experimental pilot demonstration in developing a new program is the U.S. targeted towards women. It is worth noting that political Department of Labor's Self-Employment Demonstra- and other considerations may dictate the ultimate deci- tions. An experiment was conducted at a number of sion on targeting of ALMPs. However, the responsibility pilot sites in two states. Unemployed individuals were assigned to a treatment group (that received new self- of the evaluator is to carry out rigorous evaluations and employment services) or to a control group (that did present accurate findings to policymakers. not receive such services). The quantitative evalua- Identify Ineffective Programs. Some programs are tion results were so convincing that the U.S. Congress ineffective and should be eliminated or changed; rigor- approved legislation to authorize the national imple- mentation of a self-employment program. ous evaluations will help policymakers to identify them and allow resources to be redirected to programs that are more cost-effective. The evaluation of the Job Train- Help design new programs. Ideally, policymakers ing Partnership Program Act (JTPA) program (Box 2) is should assess effectiveness of programs by first imple- one example of the use of quantitative evaluations to menting and evaluating pilot projects. Evaluators might adjust budget allocations. design a demonstration with one group participating in a program and a similar group of non-participants. Impact Evaluation Techniques Comparing the performance of the two groups over Impact evaluations attempt to measure whether ­ time would reveal the effectiveness of the program and by how much ­ participants benefit from an ALMP. (Box 1). Based on these evaluations, policymakers can Outcome measures can vary depending on the evalua- design and target programs more effectively. tor's choice. The most common are earnings and Obviously, program efficiency can be (and is) regu- employment rates, but evaluations have also been used larly evaluated through the life of the program in order to: to measure other employment-related outcomes.1 While Refine Program Design. In many countries, govern- information on program participants is usually avail- ments undertake rigorous evaluations of ALMPs to able, the challenge for a good evaluation is how to accu- learn what works best so they can implement the most rately represent the counterfactual ­ that is, how to con- effective program design. For example, in Poland in the struct an appropriate control group. mid 1990s, public works were considered a costly inter- In many countries, the most commonly used evalu- vention with few program participants going on to get ation technique is that which does not use a control regular wage employment. An impact evaluation found group. These techniques rely instead on statistics com- that a much higher rate of re-employment in non-sub- piled by program administrators (e.g. number of gradu- sidized jobs was achieved if public works were managed by private companies. This led authorities to change the design of the program ­ regulations were altered to favor private companies running public works projects ­ lead- 1In many OECD countries, where these programs are offered as a ing, over time, to improved cost-effectiveness of the pro- substitute to (or even complement) welfare payments, outcomes are also measured in terms of savings in welfare payments and the like- gram. While this particular effect is not generalizable lihood of being on welfare. Some evaluations also attempt to meas- across countries, it demonstrates the importance of con- ure social outcomes, e.g. changes in criminal behavior, drug use and ducting such evaluations. teenage pregnancy. 2 I m p a c t E v a l u a t i o n Techniques that use a control group BOX 2: ELIMINATING INEFFECTIVE PROGRAMS These techniques are of two types: experimental and quasi-experimental. Experimental evaluations In 1986, USDOL initiated the National JTPA Study, a multi-year experimental evaluation of the effectiveness require selection of treatment and control groups prior of programs funded by the Job Training Partnership Act. to the intervention. In quasi-experimental studies, The study used a randomized experiment to estimate treatment and control groups are selected after the program impacts on earnings, employment and welfare intervention. To compute program effectiveness, statis- receipt of individuals served by the program. This study produced one of the richest databases for the evalua- tical techniques correct for differences in characteristics tion of training program impacts. between the two groups. A rigorous evaluation of this experiment indicated Experimental Evaluation Techniques. This technique that the program had very different results for adults is based on the principle that, if large samples are ran- and for youth. For adults, the program was successful domly assigned to treatment and control groups, in raising earnings by 7-11% and providing benefits of observable and unobservable characteristics of the two about $1.50 for every dollar invested. For youth, how- ever, the program was not successful: there was no sta- groups should not differ on average, and so any differ- tistically significant impact on earnings, and costs ence in outcomes can be attributed to program partici- exceeded benefits to society. These results clearly sig- pation. The main appeal here lies in the simplicity of nalled that training services provided to youth were ineffective. interpreting results ­ the program impact is the differ- ence in the means of the variable of interest between the Following the release of the results in 1994, Congress cut the budget for the youth component of JTPA by sample of program participants and control group. (For more than $500 million (80%); the budget for the adult component was increased by 11%. By adjusting the JTPA budget among these components, Congress shifted funds from an ineffective program to an effec- BOX 3: THE IMPORTANCE OF CONTROL GROUPS -- tive program. A HYPOTHETICAL EXAMPLE In the town of Abca, 1,000 mineworkers were laid off as a result of the closure of the ABC Mining Company. Based on random selection, 500 were given a sever- ates, employment rate of graduates) or on the benefici- ance package while the other 500 were put through an aries' assessment of programs. These evaluations are of intensive retraining program in computer skills. All little use. Without a control group, it is difficult to attrib- 1,000 individuals were monitored over time. ute success or failure of participants to the intervention, Three months after the completion of the training, it as these effects are contaminated by other factors, such was observed that 400 trainees were employed. This employment rate of 80 percent for the "treatment" as worker-specific attributes. Moreover, they don't con- group was touted by many as the impact of the train- trol for how well the participants would have done in the ing program. absence of the intervention (Box 3). However, Abcan evaluators cautioned against using In some cases, these evaluations provide informa- only this figure to judge the success of the program. tion on deadweight loss, as well as substitution and dis- They wanted to compare this employment percentage placement effects.2 This may be useful in targeting pro- to that of the "control" group ­ those who did not go through training. It was found that 375 of the control grams towards certain areas/groups. Nonetheless, it is group of 500 were also employed three months after difficult to judge the robustness of the results, as this the "treatment" group completed its training ­ an depends on how the sample was chosen and how employment rate of 75 percent. Hence, Abcan evalua- tors judged that the true impact of the training pro- respondents were interviewed. Hence it is more appro- gram was five percent and not 80 percent. priate to conduct impact evaluations using techniques While this example makes many generalizations ­ no that use a control group. selection or randomization bias, those who got a severance package did not enroll in any training or other related labor programs, etc. ­ it serves to illus- trate the importance of using control groups when 2Annex 1 provides a glossary of commonly used terms in the evaluating the impact of labor programs. impact evaluation literature. 3 I m p a c t E v a l u a t i o n example, if the mean employment rate for participants Quasi-experimental evaluations are of three differ- in a training program is 60% and that for non-partici- ent types: pants is 50%, then the program impact is 10%.) The random selection of participants is likely to lead (i) Regression-adjusted for observables. When the to the absence of (or significant reduction in) selection observable characteristics (e.g. age, education) of the bias among participants. However, it is often difficult to participant and the control or comparison groups dif- design and implement an experimental evaluation fer, regression techniques can be used to compute esti- because of the following problems: mates of a program impact. This is appropriate when the difference between the participant and compari- Failure to assign randomly. This could simply be son samples can be entirely explained by observable due to nepotism or could involve the exclusion of characteristics. high-risk groups in order for the program (ii) Regression-adjusted for observed and unobserv- administrators to show better results; able variables (selectivity-corrected). Simple regression Ethical questions about excluding some people techniques cannot, for obvious reasons, correct for from the intervention. This is somewhat related unobservable differences between the participant and to the issue above. Program administrators may control groups. When selection into programs is not ran- resist implementing the programs on the grounds dom ­ that is, when participation is due to both observ- that services are denied to the control group; able and unobservable characteristics ­ impact estimates Changed behavior upon learning of assignment. derived from the technique in (i) above are likely to be This could happen because individuals in an biased. The problem is that the unobservable differences experiment know that they are part of a treatment between the two groups might have caused the non-par- group and act differently as a consequence; and ticipants to have different responses to the program if Extensive data requirements. Other than being they had participated. Econometric techniques have been very costly, this can often be impractical as in developed to try to control for these differences (for many countries ­ particularly developing coun- details see Benus and Orr, O' Leary et al.). tries ­ rigorous evaluations are usually designed (iii) Matching techniques. The control and treat- after a program is in place. Furthermore, there ment groups are likely to have different success rates in may be a significant time lag between participa- finding employment, even in the absence of ALMPs, tion in a program and follow-up surveys after the because of differences in their observable characteris- completion of the program. Many developing tics. To control for these spurious differences, synthetic countries do not have the data-collection infra- control groups are constructed. The synthetic control structure required to follow individuals over such group, a subset of the entire control group, is composed a lengthy time. of individuals whose observable characteristics most closely match the treatment group (there are different Econometric techniques can control for some of types matching techniques ­ for details see Baker). these problems, but they could also bias the results. The main appeals of quasi-experimental techniques Quasi-Experimental Techniques. In these techniques, are that they use existing data sources and are hence rel- the treatment and control groups are selected after the atively low cost, and that these evaluations can be done intervention. In order to get unbiased estimates of pro- at any time after the program has begun. gram impact, the comparison group must be similar to However, there are disadvantages. Statistical com- the treatment group in characteristics that affect the plexity is a key one: adjusting for differences in observ- outcomes of interest. While some of these characteristics able attributes (e.g. gender, education) is relatively (such as age, gender and level of education) are observ- straightforward but subject to specification errors; able, others (such as innate ability and motivation) are adjusting for unobservable characteristics (e.g., moti- not. To isolate the effect of the program, econometric vation, innate ability) requires procedures that can techniques are used to correct for differences in the yield different results depending upon specification characteristics of the two groups. (Box 4). 4 I m p a c t E v a l u a t i o n BOX 4: IMPACT ESTIMATES FOR PARTICIPATION IN RETRAINING PROGRAMS, HUNGARY In response to rising unemployment following the transition to a market-oriented economy, the Hungarian government instituted a wide range of labor market programs in 1990. One of these programs involved retraining. Quasi-experi- mental techniques were used to analyze the impact of the training for 1992 graduates of training institutions. Using different methodologies, significantly different estimates of the impact were computed. Estimation methodology Employment Earnings Gain Probability (%) ($/month) Simple difference in means 19.2* 14.9 Quasi-Experimental Techniques Matched Pairs 1.2 20.5 Correcting for Observables 6.3* 4.9 Correcting for Obs. and Unobservables 32.0* na (*­Statistically significant) On trying different specifications, the evaluators concluded that the high estimates obtained using the correcting-for- unobservables technique were extremely sensitive to the empirical specification used. They felt that these estimates were unreliable and that the true employment impact of the program lay between the 1.2% and 6.3% generated by the matched-pairs and the correcting-for-observables techniques respectively Relative Strengths of Techniques Regression-adjusted techniques are relatively simple It is clear from above that the absence of a control to perform. While they do not control for unobservable group results in the least reliable evidence of program characteristics, they can be applied in cases where the impacts. Such techniques give no explicit estimate of treatment and control groups are roughly similar in what would have happened in the absence of the pro- terms of observable characteristics. Matching tech- gram, and so they provide little indication of the effects niques aim to mimic experimental evaluations by of the program. While these techniques can produce removing observations in the control group that do not some indication of the gross outcomes of programs (e.g. closely "match" with the treatment group; however these number of unemployed served), policymakers should techniques are not able to control for unobserved char- not rely on them to make comparisons across programs acteristics. Still, regression-adjusted and matching tech- or decisions relating to the allocation of resources. niques are possibly preferable to experimental tech- Experimental techniques may be the most appropri- niques in many developing countries owing to their ate in terms of rigor and relevance and are now more relatively low cost and higher feasibility. regularly implemented in many OECD countries. How- ever, in many countries they may not be feasible owing The Importance of Costs to their high costs, excessive data requirements and the For the purposes of informing policy decisions, an practical constraint that evaluations must be designed evaluation is not complete until one considers the costs before the programs are underway. of both the ALMP and its alternatives. Cost-benefit Among quasi-experimental evaluations, selectivity- analysis is the standard method of aggregating benefits controlled techniques which aim to control for unob- and costs across outcome categories and across time. A servable characteristics, may be the least appropriate. program may be effective in the sense of creating bene- Analysis has shown that these techniques are very sensi- fits for participants (e.g. higher earnings and employ- tive to the empirical specification chosen, rendering the ment) but not be worthwhile if the benefits are less than estimates suspect. 5 I m p a c t E v a l u a t i o n the costs involved. Unfortunately, costs appear to be the ing for a training program) and local labor-market char- least analyzed aspect of active labor market programs. acteristics (e.g. regional unemployment rates). These There are two types of costs associated with pro- data would also be useful in the case of experimental grams ­ private and social costs. Private costs are those evaluations if the evaluators intend to do some sub- incurred by the individual. They include his/her fore- group analysis. gone earnings while participating in the program, plus Usually the data on participants is drawn through any fees or incidental expenses the individual incurs special baseline and follow-up surveys which track par- during the program. Social costs, on the other hand, ticipants over time since their entry into the program. In include society-at-large's spending on the program. the case of experimental evaluations, non-participants Hence, a societal cost computation would include the are also similarly tracked over time. Quasi-experimental private costs as well as, for example, the rental of build- evaluations may rely on other sources to collect infor- ings, equipment costs and teacher salaries. In most stud- mation on control groups ­ e.g. household or labor- ies that policymakers undertake, the social costs are used force surveys ­ though usually these will not provide the to evaluate cost-effectiveness. same type of information as a survey focused on partic- The main steps involved in estimating costs include: ular programs will. Several data are critical to successful evaluations (for Identifying all costs, whether or not they will be details see O' Leary): charged to the program. (For example, even if (a) sample selection: Participants and control- premises for a training project are provided free group members should have similar labor market status of charge by the government, a cost for rent and eligibility for participation in programs. As the eli- should be imputed for these premises.) gibility for participation in programs is usually condi- Estimating the accounting costs. This is the actu- tioned by registration at an employment service (espe- al amount paid for the goods and services (e.g. cially in most developed and transition economies), the salaries and benefits for administrative staff, cost register of unemployed job-seekers can be used as the of equipment and buildings). sampling frame. Including the private costs. (These include fore- (b) sample size: Samples should be large enough to gone earnings and any costs incurred by the indi- allow precise estimates. Larger sample sizes will permit vidual on training.) the detection of sub-group program impacts that may be of interest to policymakers. Cost-benefit/effectiveness evaluations may also help (c) site selection: Practical consideration should be determine whether ALMPs reduce government spend- given to the region to be canvassed. This can have sig- ing. A program might, for example, succeed in moving nificant cost implications, particularly when dealing people into productive employment and off unemploy- with remote and hard to reach areas. ment benefits (a saving for the government). At the same (d) follow up: To ensure the impact of the program time, the costs of the program might exceed those sav- is appropriately measured, it may be necessary to con- ings and so, on balance, actually increase government duct follow-up surveys for one or two years after pro- spending. A proper cost-benefit evaluation would esti- gram completion for both treatment and control groups. mate the net cost to the government. Cost data can be collected from the institutions that administer and implement labor market programs. Data Requirements Informational constraints can play a significant role Who Should Conduct Evaluations? in the type of evaluation conducted. All evaluations of An issue facing policymakers is whether evaluations ALMPs require data on earnings and employment (out- should be conducted by government agencies or by come measures). For regression analysis, quasi-experi- institutions outside government. The answer to this mental techniques require data on socio-economic char- question is critical as it will determine the development acteristics (e.g. age, education, gender, region) as well as of a country's evaluation capacity. Policymakers must some data on the program (e.g. length and type of train- consider a number of factors (see Benus and Orr). 6 I m p a c t E v a l u a t i o n Policymakers must recognize that quantitative eval- ensuring objectives are clear, indicators are uations require highly skilled practitioners. It takes con- agreed upon and baselines known; and siderable expertise to develop the capacity to perform strengthening government's ability to dissemi- competent quantitative evaluations. In many OECD nate the results. countries, these skills have been developed over the past three to four decades and the most accomplished quan- Conclusions titative evaluators are most likely to be found in the pri- The effectiveness of ALMPs is substantially vate sector. improved if impact evaluations are rigorous and the Another factor to consider is the objectivity of the results fed back into program design. Although carrying evaluations. As government officials are often involved out rigorous evaluations may be a time-consuming and, in the design and implementation of ALMPs, govern- at times, costly exercise, the long-term benefits and pay- ment researchers may not be completely objective in offs are substantial. evaluations. Political pressures to report positive results may put into question their objectivity. To reduce this Annotated Bibliography type of pressure, governments sometimes establish inde- Baker, J. (2000). Evaluating the Impact of Development Projects on Poverty: A Handbook for Practitioners. A World Bank pendent units to perform evaluations. While this Publication. Aimed at providing policymakers and proj- approach may have benefits, it does not completely ect managers with the tools needed to evaluate project eliminate the problem, especially with respect to public impacts. Provides extensive case studies of a wide range of perceptions. evaluations. Public perception is perhaps the most difficult issue Benus, J. and L. Orr (2000). Study of Alternative Quantitative to resolve. Policy makers must recognize that the gener- Evaluation Methodologies. Working Paper. ABT Associates, Washington D.C. Provides an overview of the importance al public and the legislature may not accept the results of of conducting evaluations, evaluation techniques and evaluations that are conducted by a governmental who should conduct evaluations. agency. This is especially true if the evaluation unit is in Dar, A. and Z. Tzannatos (1999). Active Labor Market Pro- the ministry that is responsible for designing and imple- grams: A Review of the Evidence from Evaluations. Social menting the program. Protection Discussion Paper No. 9901. Provides a brief In many OECD countries, governments have found overview of ALMPs and evaluation techniques and pres- ents cross-country evidence on the impacts of different that it is less expensive to encourage the private sector to ALMPs. develop the necessary capacity to perform quantitative Grubb, W. and P. Ryan (2000). The Roles of Evaluation for evaluations. In most cases, governments use this capaci- Vocational Education and Training. ILO, Geneva. Focused ty for the evaluation of specific government programs. A on vocational training but provides an overview of eval- second reason for using outside evaluators is that the uation techniques and methodologies. results of the evaluations are likely to be objective and O' Leary, C., A. Nesporova and A. Samorodov (2001). Manual more readily accepted by the general public. on Evaluation of Labor Market Policies in Transition Economies. International Labour Office. Discusses vari- Irrespective of who conducts the evaluation, it is ous labor market programs in transition countries, eval- crucial that developing countries should place great uation methodology and how to use evaluation results. emphasis on: Schmid, G., J. O' Reilly and K. Schomann (1996). Internation- al Handbook of Labor Market Policy and Evaluation. training in evaluation approaches and methods; Edward Elgar Books. Outlines the various methodologi- developing quality evaluation standards; cal approaches adopted in evaluation research, presents cross-country evaluation findings, and presents an strengthening monitoring systems for data on insight into institutional frameworks and monitoring program inputs, outputs and results; and evaluation systems. 7 I m p a c t E v a l u a t i o n Annex 1 Selection Bias: Program outcomes are influenced by unobservables not controlled for in an evaluation Some Commonly Used Terms in the Impact-Evalua- process (e.g. individual ability). Such factors can arise as tion Literature a by-product of the selection process into programs where individuals "most likely to succeed" are selected Additionality: This is the net increase in jobs creat- into the program. ed. It is the total number of subsidized jobs less dead- weight, substitution and displacement effects. Substitution Effect: A worker hired in a subsidized job is substituted for an unsubsidized worker who oth- Deadweight Loss: Program outcomes are no differ- erwise would have been hired. The net employment ent from what would have happened in the absence of effect is thus zero. the program. For example, wage subsidies place a work- er in a firm that would have hired the worker in the Treatment and Control Group: Program benefici- absence of the subsidy. aries are the "treatment" group. In a scientific evalua- tion, their outcomes are compared with a "control" Displacement Effect: This usually refers to displace- group of non-participants. ment in the product market. A firm with subsidized workers increases output but displaces output among firms without subsidized workers. Randomization Bias: This refers to bias in random- assignment experiments. In essence, this says that the behavior of individuals in an experiment will be differ- ent because of the experiment itself and not because of the goal of the experiment. Individuals in an experiment know that they are part of a treatment group and may act differently, as could individuals in the control group. 8 I m p a c t E v a l u a t i o n 9