Reducing Crime and Violence: Experimental Evidence from Cognitive Behavioral Therapy in Liberia

We show that a number of “noncognitive” skills and preferences, including patience and identity, are malleable in adults, and that investments in them reduce crime and violence. We recruited criminally-engaged men and randomized half to eight weeks of cognitive behavioral therapy designed to foster self-regulation, patience, and a noncriminal identity and lifestyle. We also randomized $200 grants. Cash alone and therapy alone initially reduced crime and violence, but effects dissipated over time. When cash followed therapy, crime and violence decreased dramatically for at least a year. We hypothesize that cash reinforced therapy’s impacts by prolonging learning-by-doing, lifestyle changes, and self-investment.


Introduction
In many countries, poor young men exhibit high rates of violence, crime, and other "antisocial" behaviors. In addition to their direct costs, crime and instability hinder economic growth by reducing investment or diverting productive resources to security. In fragile states, such men are also targets for mobilization into election intimidation, rioting, and rebellion. 1 Two of the most common government responses are policing and job creation. Both take the person as they are and try to change their incentives or simply incarcerate them (Becker, 1968; Draca and Machin, 2015). This paper investigates an alternative: rehabilitation, or changing behavior by shaping people's underlying skills and preferences.
A large literature has shown that a broad set of so-called "noncognitive" skills, especially self control, predict long-run economic performance and criminal activity. 2 These skills respond to investment, especially in childhood (Cunha et al., 2010). They are fostered by family, schools, and communities. There is little evidence, however, on the returns to late-stage noncognitive investments, and so it is unclear whether by adulthood self-investment or interventions can shape noncognitive skills and hence behavior (Heckman and Kautz, 2014; Hill et al., 2011). It is also unclear what specific skills are both important and malleable.
To investigate, we recruited 999 of the highest-risk men in Liberia's capital, generally aged 18 to 35. Most were engaged in part-time theft and drug dealing, and regularly had violent confrontations with each other, community members, and police.
We experimentally evaluated two interventions. One was an 8-week program of group cognitive behavioral therapy (CBT) called the STYL program, for Sustainable Transformation of Youth in Liberia. We assigned offers by lottery. Following this, we held a second lottery for a $200 grant, about three months' wages. The cash was partly a measurement tool, to see if therapy affected economic decisions. The cash was also a treatment, in the sense that it could stimulate legal self-employment, and we included it to compare therapy to a rise in the returns to legal work. 3 Experimentally, subjects received offers of therapy alone, cash alone, therapy then cash, or neither. Delivering both treatments cost about $530 per person.
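The two-lottery design described above can be sketched in a few lines of code. This is an illustrative sketch only, not the study's actual randomization procedure: the function name `assign_arms`, the 50/50 split in each lottery, and the choice to draw the cash lottery separately within each therapy group are assumptions, and any blocking used in the real assignment is omitted.

```python
import random
from collections import Counter

def assign_arms(subject_ids, seed=0):
    """Assign each subject to one of four arms via two successive lotteries.

    Assumptions (not taken from the study): each lottery offers treatment to
    half of the eligible men, and the cash lottery is drawn within therapy status.
    """
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)                      # lottery 1: therapy offer
    half = len(ids) // 2
    groups = ((ids[:half], True), (ids[half:], False))

    arms = {}
    for group, has_therapy in groups:
        group = list(group)
        rng.shuffle(group)                # lottery 2: cash offer
        cut = len(group) // 2
        for position, sid in enumerate(group):
            has_cash = position < cut
            if has_therapy and has_cash:
                arms[sid] = "therapy then cash"
            elif has_therapy:
                arms[sid] = "therapy only"
            elif has_cash:
                arms[sid] = "cash only"
            else:
                arms[sid] = "control"
    return arms

# Hypothetical usage with a sample the size of the study's (999 men).
arms = assign_arms(range(999))
arm_sizes = Counter(arms.values())
```

Crossing the two lotteries in this way yields the four groups named in the text and lets the effect of each treatment, and of their combination, be compared against the same control group.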
CBT is a therapeutic approach that can be used to treat a range of harmful beliefs and behaviors, including depression, anger, and impulsivity. First, CBT tries to make people aware of and challenge harmful, automatic patterns of thinking or behavior. Second, it tries to disrupt these patterns of thinking and to foster better ones by having people practice new skills and behaviors, a form of learning by doing. A Liberian non-profit, the Network for Empowerment and Progressive Initiatives (NEPI), designed and ran STYL. NEPI facilitators were themselves ex-combatants or ex-criminals who graduated from prior NEPI programs.
Among "noncognitive skills," NEPI designed STYL to focus on forward-looking behavior and self control. By "self control", psychologists and criminologists typically mean one's short term abilities to regulate emotions and to be resistant to impulse, as well as more sustained abilities to be planful, persevering, and patient. This concept has parallels to economic time preferences, and we measure them in the manner of both fields. Becoming more self controlled and forward-looking are central components of many programs, from preschool to rehabilitation therapy. 4 The curriculum focused on helping men foster skills of planning, goal-setting, reflection, deliberate decision-making, and controlling emotions and impulses.
The therapy also encouraged nonviolent, noncriminal behavior and lifestyles by fostering a change in the men's social identity. A premise of STYL was that the men self-identified as outcasts and did not hold themselves to the standards of mainstream society. The therapy tried to persuade the men that they could change who they were and how they were perceived. NEPI facilitators modeled this identity change. They walked the men through basic steps, such as changing their appearance, engaging in normal social interactions, and behaving more cooperatively. They discouraged drug use and association with bad peers. Therapy also required men to practice going to supermarkets, banks, and other "normal" places.
Research in both psychology and economics supports the idea that social identity and associated values influence behavior, and that both can change. This literature treats values as direct utility benefits or penalties from acting in accordance with or against a set of preferences (Bénabou and Tirole, 2004; Almlund et al., 2011). Akerlof and Kranton (2000) and Jolls et al. (1998) both argue that these preferences or values are tied to a person's social identity, and that to some extent people can change their perceived social category and with it values that reward and penalize certain behaviors.
There are striking parallels between STYL and socialization into militaries, street culture, gangs and armed groups. Such groups use similar techniques (appearance change, practice, modeling) to shape young men's social identity and behavior (Vigil, 2003; Wood, 2008; Maruna and Roy, 2007). NEPI designed STYL to reverse this process.
We surveyed the men beforehand, a few weeks after the interventions, and a year later. Most had no fixed address, phone, or even name. Despite this mobility, we re-interviewed 93%. We rely on self-reported data since, as in most poor and fragile states, there are no administrative data. We did not necessarily trust self-reports, and so we attempted to re-check and validate behaviors such as stealing through in-depth interviews with a subsample.
We approached roughly 1500 high-risk men, and 999 agreed to enter the study. Of those assigned to therapy, nearly all attended at least a day, and two thirds completed it. The higher risk men were the most likely to finish.
Men who received therapy reduced their antisocial behaviors dramatically, roughly 0.2 standard deviations compared to the control group. Within a few weeks of therapy, for example, we observe large reductions in an index of behaviors, including stealing and drug selling. With therapy alone, these effects diminished after a year. When therapy was followed by cash, however, the reductions in an index of all antisocial behaviors were lasting.
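Effects stated in standard deviations relative to the control group are typically computed on a standardized index. A minimal sketch of one common construction follows: z-score each component against the control group, then average the components for each person. The exact index construction used in the study is not given in this section, so the function `standardized_index` and the toy data are assumptions for illustration only.

```python
from statistics import mean, stdev

def standardized_index(rows, is_control):
    """Average of component z-scores, scaled by the control group.

    rows: one list of component scores per person (same component order).
    is_control: one boolean per person, True for control-group members.
    """
    n_components = len(rows[0])
    index = [0.0] * len(rows)
    for j in range(n_components):
        control_vals = [r[j] for r, c in zip(rows, is_control) if c]
        mu, sd = mean(control_vals), stdev(control_vals)
        for i, r in enumerate(rows):
            # Each component contributes its z-score, equally weighted.
            index[i] += (r[j] - mu) / sd / n_components
    return index

# Toy data (made up): two components, two control and two treated men.
toy_rows = [[0, 10], [2, 30], [4, 20], [6, 40]]
toy_control = [True, True, False, False]
toy_index = standardized_index(toy_rows, toy_control)
```

By construction the control group's index averages zero, so a treatment-control difference in the index can be read directly in control-group standard deviation units of the components.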
The therapy probably worked through many channels, and we see evidence of improvement in two of the hypothesized channels: time preferences and noncriminal identity/values, with time preference changes most persistent. There is also some evidence of improvements in positive self-regard, plus temporary changes in drug abuse and noncriminal social networks. With therapy alone, changes in a broad index of all these intermediary outcomes diminished after a year, just as we saw with antisocial behavior itself. When therapy was followed by cash, however, the overall change in these intermediary outcomes was lasting and fairly large, at least 0.25 standard deviations. If we account for multiple comparisons, it becomes difficult to single out any one mechanism, but the largest and most statistically significant change is in time preferences.
How was cash used? Regardless of therapy, little of the grant was spent on drugs or "wasteful" things. Most funds were invested in business or saved. Cash led to a short-term increase in an index of economic performance (including income, savings, employment, and investment), due largely to increased petty trading. After a year, however, these gains disappeared, partly because most men were robbed regularly, irrespective of treatment.
The fact that the grant was crucial to sustaining therapy's effects is our most unexpected and important finding. As we find no sustained effect of cash on earnings, cash clearly did not raise the opportunity cost of antisocial behavior after a year. Thus economic performance does not explain the sustained effect of therapy plus cash on crime and aggression. Drawing on qualitative interviews and psychological theory, we suggest that the short term increase in income and legal employment helped to solidify therapy's impact on noncognitive skills and preferences. Specifically, for a few months after therapy, cash allowed men to project a new self, to stave off homelessness and stealing, and practice the self control and future orientation started by therapy. This hypothesis will be important to test in future research.
An obvious concern is our reliance on self-reported data. We argue that misreporting is unlikely to drive our results for two reasons. The first is the pattern of treatment effects: 12-13-month impacts from therapy plus cash, but not from cash or therapy only. Systematic measurement error would need to be correlated with the "both" treatment arm only. This seems feasible but unlikely, especially given the magnitudes of the impacts. To check further, we attempted to validate a subset of questions using intensive qualitative observation. The patterns suggest that, if anything, the control group underreported sensitive behaviors such as stealing. Hence the treatment effects may actually underestimate therapy's impacts. We also learn that the control group reports fewer expenditures in the survey versus the validation exercise, suggesting that some of the short-term economic gains from cash may be illusory. These insights come with the caveat that they assume that data collected through in-depth interviews on a small subset of questions, with a focus on trust-building, are more accurate than survey measures. It is better to think of the validation as a confidence-building exercise rather than hard proof.
In addition to evaluating the pairing of an economic intervention with CBT, this study addresses several gaps in the literature. One is the absence of evidence on behavior change outside the U.S., especially in fragile states. Even within the U.S., however, there is limited evidence on adult behavior change. Most evaluations of U.S.-based crime and violence reduction programs focus on education and employment interventions. 5 Studies of CBT tend to be small-sample and non-experimental (Wilson et al., 2005). 6 But STYL's impacts on adult antisocial behavior are consistent with evidence from U.S. adolescents and children showing that CBT programs in schools and correctional institutes reduce criminal recidivism. 7 Finally, few studies have measured noncognitive skill and preference changes directly, and so our study strengthens arguments that they are malleable and contribute to antisocial behavior. If we adjust our p-values conservatively for multiple comparisons, it is difficult to single out any one skill or preference change, though there is suggestive evidence that time preferences, identity, social networks, and mental health all improve. The malleability of social identity is consistent with evidence from stigmatized Indian sex workers, where short courses of non-CBT psychological therapy increased self-worth, reduced shame, and increased savings and health-seeking behavior (Ghosal et al., 2015). The majority of this evidence, however, comes from small, observational, unpublished studies, which, because of a reliance on administrative data, seldom measure mechanisms directly. 8 But three recent randomized control trials among at-risk Chicago adolescents show that CBT can help adolescents reduce automatic behaviors (such as violent retaliations to a slight) by learning to override "fast" decision-making with conscious "slow" reflection (Heller et al., 2013, 2015).
The parallels between that program and STYL, in both the curriculum and impacts, are striking. It remains to be seen if STYL is replicable, but it is promising that it was adapted from foreign therapies and developed its own facilitators from prior graduates, enhancing scalability. Future work should test the approach in new contexts, compare CBT to other therapies (or a placebo), and reduce the reliance on self-reported data.

Intervention and experiment
Liberia's capital, Monrovia, is home to a third of the country's 4.3 million people. There are few formal jobs. Most men aged 18 to 35 have limited employment and earn money through a mix of agriculture, casual labor, or petty business. A few turn to crime, which is becoming more violent and commonplace. From 1989-96 and 1999-2003, two civil wars wracked Liberia. They killed 10% of the population, displaced a majority, and recruited tens of thousands into combat. Since 2003, however, Liberia has been at peace with the help of a United Nations (UN) peacekeeping force. During our study period, 2009-12, the economy was growing 6% per year (Republic of Liberia, 2012). Nonetheless, in 2009, people aged 18 to 35 would have spent up to 15 years of their childhood or adolescence in conflict, many robbed of the institutions that normally fostered planfulness, emotional stability, and other noncognitive skills.
Marginalized young men were one of the government's main concerns, especially poorly integrated ex-combatants and other men involved in drugs and crime. Drug and criminal networks are disorganized, but the government worried they could consolidate. They also worried about political violence. High-risk men had joined riots and election violence in the past, and they were targets for mercenary recruitment into the 2010-11 war in Côte d'Ivoire.

Target population and recruitment
We set out to recruit 1,000 high-risk men: men actively involved in crime, interpersonal violence, and drugs, or who were poor and at risk of engaging in these activities. With no administrative data on such men, we recruited them directly. We selected five neighborhoods in Monrovia known for high rates of crime. These were generally mixed-income residential areas with large markets, with populations of roughly 100,000.
Recruiters were NEPI affiliates who were not involved in the interventions. NEPI had extensive knowledge of these neighborhoods and connections to local leaders, as well as a strong reputation that target men could verify. Recruiters had worked closely with high risk men before, and were themselves past graduates of a NEPI program.
We charged the recruiters with finding men that were homeless, drug-using, disreputable in appearance, or present in locations known for crime, armed recruitment, and violence. Community members could easily identify these spots and their denizens. Similarly, certain professions had strong reputations for crime. 9 Appearance was also a useful guide. For instance, recruiters looked for men with a dirty or unkempt appearance, long hair, apparent intoxication, or a "tough" style of dress.
To minimize correlated outcomes and spillovers, we avoided recruiting close associates. We instructed NEPI to approach just one out of every 7-10 high-risk men they visually identified. Recruiters then described the therapy, the allocation by lottery, and the baseline survey. They never mentioned cash grants. Over several weeks, recruiters identified roughly 10,000 potentially high-risk men and approached 1,500. Of these, about one third declined to take part in the therapy and survey. 10 In the end, 999 men agreed to enter the sample. We estimate they represent 0.6% of all adult males in the neighborhoods, and about 12% of men aged 18-35 and in the bottom decile of income (Appendix A.2). Column 1 of Table 1 describes this sample at baseline. On average the men were 25, had nearly eight years of schooling, earned about $68 in the past month working 46 hours per week (mainly in low skill labor and illicit work), and had $34 informally saved. 41% were former members of an armed group.

Cash
A nonprofit organization, Global Communities (GC), distributed the cash. They ran a lottery, where winners received $200 cash and losers received a consolation prize of $10. There was minimal framing. 11 GC held cash lotteries a week after the end of therapy.
9 Location was especially important. Within each of the neighborhoods there were pockets of insecurity where high-risk men were known to live or congregate: abandoned buildings, garbage dumps, drug dealing spots, parking lots, and homes for rootless young men run by ex-military commanders. Common professions included "car loaders" who have reputations for pickpocketing, or wheelbarrow and motorbike parking areas with reputations for drug selling and crime. Recruiters avoided recruiting men known to be "bosses": men who run homes or drug dens that cater to petty criminals and low-level drug dealers.
10 We do not have systematic data on refusers, but recruiters reported two main types: men who were poor but were "low-risk" in that they did not appear to be involved in crime and violence; and high-risk men who said they were too busy to take part in therapy because they had legal or illegal business to attend to.
11 See Appendix B.4 for implementation details. Prior to the lottery, subjects were given about 15 minutes of information on how to keep the money safe (e.g. depositing it with a bank) and examples of what they could use it for (e.g. starting a small business or home improvement). But GC explicitly emphasized to subjects that the grant was unconditional and they were free to do what they wished.
Table 1 notes: Column (1) reports the sample mean. A small number of missing values are imputed at the median. Columns (2)-(7) report the coefficients and p-values from ordinary least squares regressions of each baseline covariate on three indicators, one for assignment to each treatment arm, controlling for block fixed effects. Column (8) reports the p-value from a joint test of statistical significance of all three treatment indicators.
process of change. The facilitators were an integral part of this intervention, because they modeled the change in skills and values. All were graduates of a prior STYL-like program run by NEPI, and three-quarters were former street youth or combatants.
There are parallels to interventions which show that aspirations (forward-looking goals or targets) influence behavior and respond to intervention (Bernard et al., 2014). There are also parallels to switching social identity. 15

STYL curriculum and approach
The sessions employed a variety of techniques, from lectures and group discussions, to various forms of practice, including: role playing in class, homework that required practicing tasks, exposure to real situations, and in-class processing of experiences of executing these tasks. Like many CBT programs, these tasks began simply and got more difficult over time. 16 In the first three weeks, facilitators encouraged men to try to maintain some new, simple behaviors. This included getting a haircut and removing facial hair, wearing shoes and pants instead of sandals and shorts, improving personal hygiene and the cleanliness of their living area, and reducing substance abuse. These simple exercises in goal-setting and self control also helped men start to operate within mainstream social norms.
In the middle weeks, facilitators encouraged men to engage with society in planned and unaccustomed ways, akin to exposure therapy. 17 For instance, homework included reintroducing themselves to their family, joining community sports, and visiting banks, supermarkets, shops, and so forth. Men also studied successful people in their community, and reached out to one as a mentor. Men then processed their attempts as a group. Often homework was independent, but facilitators might accompany the more troubled men.
Men also learned to manage emotion: practicing nonaggressive responses to angry confrontations in class, and recognizing signs of angry reactions and learning to distract or calm oneself (walking away, doing other activities, or breathing techniques).
In the last weeks, facilitators taught planning and goal setting. These lessons included training on breaking down large goals into smaller accomplishable sub-goals, and then creating plans to accomplish them via concrete steps. For example, men would list subgoals of [...]
[...] in part a result of the smaller sample in these subgroups. To reduce concerns that these imbalances cumulate and could influence endline results, in Appendix X we create indexes of baseline data much like the endline indexes (where these endline behaviors are available) and look for treatment effects. We find that no baseline effect is significant at the 95% confidence level, and none of those that are marginally significant (p < 0.10) are in the therapy plus cash treatment arm. Therefore, baseline imbalances are unlikely to be driving our results. However, we still control for baseline covariates in our main empirical specification.
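A balance check of this kind (regressing a baseline covariate on the three treatment-arm indicators with block fixed effects, then testing the indicators jointly) can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function names and the simulated data are hypothetical, block fixed effects are absorbed by demeaning within block, and the F statistic assumes homoskedastic errors.

```python
from collections import defaultdict

def demean_by_block(values, block):
    """Subtract the block mean from each value (absorbs block fixed effects)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, b in zip(values, block):
        sums[b] += v
        counts[b] += 1
    return [v - sums[b] / counts[b] for v, b in zip(values, block)]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def balance_regression(y, arm, block, arms=(1, 2, 3)):
    """OLS of y on arm indicators (arm 0 = control) with block fixed effects.

    Returns the arm coefficients and the F statistic for the joint null
    that all arm coefficients are zero.
    """
    y_d = demean_by_block(y, block)
    x_d = [demean_by_block([1.0 if a == k else 0.0 for a in arm], block)
           for k in arms]
    xtx = [[sum(p * q for p, q in zip(xi, xj)) for xj in x_d] for xi in x_d]
    xty = [sum(p * q for p, q in zip(xi, y_d)) for xi in x_d]
    beta = solve(xtx, xty)
    resid = [yv - sum(b * xi[i] for b, xi in zip(beta, x_d))
             for i, yv in enumerate(y_d)]
    rss_full = sum(e * e for e in resid)
    rss_restricted = sum(v * v for v in y_d)   # model with block effects only
    dof = len(y) - len(arms) - len(set(block))
    f_stat = ((rss_restricted - rss_full) / len(arms)) / (rss_full / dof)
    return beta, f_stat
```

Under successful randomization the arm coefficients should be close to zero and the joint F statistic small; a p-value would come from comparing the statistic to an F distribution with 3 and `dof` degrees of freedom.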

Compliance
Both interventions had high compliance, in part due to NEPI's persuasive efforts and street credibility. Of men assigned to the grant, 98% received it. Of men assigned to therapy, 5% attended none, another 5% dropped out within the first 3 weeks, and two thirds attended most sessions (>80%) (Appendix A.4). Those who dropped out early had less schooling, less self control, and were less likely to exhibit antisocial behaviors like substance abuse or stealing (Appendix A.3). Thus the highest-risk men seem more likely to attend than poorer, noncriminal men.

Phased implementation
For logistical reasons we recruited, treated, and studied the men in three phases. A pilot phase recruited 100 men, to ensure that the therapy and cash grant caused no harm, to assess statistical power, and to allow us to refine experimental protocols. The pilot showed no indication of harm, and so we scaled to a further 900 men in two phases, with only minor changes to the interventions and protocols. 21

Conceptual framework
The interventions were designed to affect two main outcomes: economic performance and anti-social behaviors (including crime and aggression). To structure this argument, we start with a model of occupational choice between legal and illegal work (such as crime, mercenary work, or election thuggery). 22 Later we consider how such a model could be used to understand other forms of anti-social behavior. We develop the formal model in Appendix C and outline the structure and results here.
CBT could affect these outcomes in many ways but, as outlined in the next section, we focused on and prespecified three intermediary outcomes: time preferences, self control, and the values associated with a mainstream social identity. Our simple model examines comparative statics in all three. A change in time preferences is the simplest to examine in a standard model, and we consider the effects of changes in both the discount rate and time inconsistency. More broadly, forms of self control such as improved emotional regulation, planning, and conscientiousness could be considered a form of human capital that affects productivity. Hence we also consider what happens when we model self control as a parameter of the individual production function. Finally, we introduce a change in criminal identity/values as a change in intrinsic preferences over criminal occupations. 23 Of course, the therapy is a multifaceted treatment that likely operates through a number of other mechanisms (changed peers or family circumstances, mental health, prosocial preferences) and affects other outcomes and behaviors that themselves are associated with crime (drug abuse or prosocial behavior). We examine these empirically below, but focused the model on the mechanisms that were most in line with NEPI's design principles and theories, as well as the psychological theory and evidence cited above.

Setup
Suppose people can allocate their time between leisure $l$, legal work $L^b$ such as petty business or labor, and illegal occupations $L^c$ such as crime, mercenary work, or election thuggery. We refer to these simply as "business" and "crime".
We assume crime uses labor alone and pays a wage w, which may be uncertain. This resembles the returns we observe to illegal work of the type available to our population in Liberia. 24 In the budget function, crime also carries a punishment f with probability ρ, and we assume this risk increases with the time devoted to crime. Punishment could mean prosecution, mob justice, or social sanctions.
Business uses capital, yielding output $F(\theta, L^b_t, K_t)$ where θ is individual ability and $K_t$ is capital inputs. People start with wealth in the form of a riskless asset, $a_0$, and save or borrow at interest rate r. Self control skills are one element of θ, and output increases in θ. For simplicity, we focus on the case where self control skills are inputs into business but not crime. This is the interesting and relevant case, since otherwise investments in self control skills will not affect occupational choice. We did not assume this from the outset, recognizing that in principle STYL could teach men to be more effective criminals. The pilot phase, however, suggested the opposite was true, so this is the most useful case to discuss.
People choose consumption, labor supply in each sector, and the amount of wealth to invest in business (versus the safe asset) in order to maximize their utility subject to the constraint that consumption plus wealth are equal to total income from business, crime, and the interest on investment. We allow people to be present-biased in the sense that they have a general inter-temporal discount factor δ, but can also be time-inconsistent with an extra factor denoted β < 1 that multiplies all future periods relative to the present (the standard form of quasi-hyperbolic time preferences).
23 Typically models treat such preferences as fixed, or ignore them. We outline how exogenous changes in noncognitive abilities or preferences affect the comparative statics in an otherwise standard model.
24 Petty crime requires little capital; drug dealers typically work for a "boss" who owns the supply; and those who leave town to work in illicit mining work as "mining boys" for capital-owning "miners" on short-term renewable contracts that pay a daily wage plus a payment tied to output. This is also why we assume below that self-control skills are less important for success in criminal activities.
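For reference, the quasi-hyperbolic preferences described in the setup can be written compactly. The expression below is the textbook β-δ form, not a transcription of the formal model in Appendix C; the per-period utility $u$, the horizon $T$, and the restriction of its arguments to consumption and leisure are notational assumptions made here for illustration.

```latex
U_0 \;=\; u(c_0, l_0) \;+\; \beta \sum_{t=1}^{T} \delta^{t}\, u(c_t, l_t),
\qquad 0 < \beta \le 1, \quad 0 < \delta \le 1 .
```

With β = 1 this reduces to standard exponential discounting and choices are time-consistent. With β < 1 every future period is down-weighted relative to the present, so a plan made today for tomorrow may be revised once tomorrow arrives.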
Finally, people value consumption and leisure, but we also allow for a consumption value from conforming to one's identity and values (Akerlof and Kranton, 2000; Bénabou and Tirole, 2004). 25 In this case, a person's identity and associated values can penalize criminal acts. We use σ to indicate a preference against crime, and we put it in the utility function, $U(c, l, \sigma L^c)$, to distinguish these internal preferences from external punishments f.
We are interested in the effect of the interventions on criminal versus legal labor. Therapy can potentially influence this occupational choice through noncognitive skills θ, time preferences (δ or β), anticriminal values σ, or all of the above. Cash, meanwhile, can influence occupational choice by increasing the assets available for capital inputs into legal business.

Occupational choice in the absence of interventions
Where financial markets work well and where people are time consistent (β = 1), businesses are at their optimal scale: they have borrowed until the marginal return to capital is equal to r. Of course, the poor are typically credit-constrained. In this case poor people are forced to invest in capital over time until they reach the same optimal scale. The young and those who have experienced bad shocks will be the furthest behind. As a result, crime is more likely to be chosen by men with low business ability θ, the poor and credit-constrained, those with low disutility of crime, and the time-inconsistent. People may also choose both crime and business. Credit-constrained people with partial capital for business may still spend some time in crime. Also, risk averse people may do both activities when returns are uncertain.

Impacts of cash
If there are no credit constraints, cash windfalls will not affect occupational choice. But if people are poor and credit-constrained, windfalls will be partly invested in business. People involved in crime will shift to business, especially those with high business ability. Cash infusions will lead to a smaller increase in business for time-inconsistent individuals, however, since they will choose to consume more today.

Impacts of therapy
Therapy could increase σ, θ, β or δ. These channels have distinguishing predictions. Interventions that increase the disutility from crime, σ, will reduce time devoted to it, but will have no effect on returns to business. Interventions that increase noncognitive ability θ will induce more time and investment in business, and also reduce crime. With the presence of risk in both sectors (and assuming risk aversion), interventions in θ will have relatively greater effects in terms of pushing individuals away from crime, because an increase in θ now also makes business relatively less risky. A rise in σ will also have a bigger effect than without uncertainty, because risk aversion will reinforce the rise in crime aversion and further reduce hours in crime.
25 We ignore the possibility, proposed by Bénabou and Tirole (2004), that ability is imperfectly known and correlated with perceived self-image.
What if an intervention increases time consistency, β? This will increase business investment and earnings among the credit-constrained. If people become more time-consistent, they will be more strongly influenced by the consequences of their actions in terms of punishments, and will therefore reduce criminal labor (and increase business labor) as well. Similar comparative statics come from an increase in patience.

Cash and therapy in combination
The model implies that both interventions together should lead to a larger decline in crime than either alone, simply because the effects are cumulative. Moreover, when people are credit-constrained and also receive cash, this simple model predicts that the effects of a change in σ or θ will be greater with cash than without it. Thus the interventions may be complementary and the total effect could be greater than the sum of the parts. Note that this simple model does not allow cash to have direct behavioral effects through practice of new behaviors or reinforcement of therapy's lessons.
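The distinction between cumulative and complementary effects can be made concrete with the four arm means of a 2x2 design. The sketch below is illustrative only: the function name and the numbers are hypothetical and are not results from the study. It computes how far the combined effect departs from the sum of the two separate effects.

```python
def interaction_effect(control, therapy_only, cash_only, both):
    """Combined effect minus the sum of the two stand-alone effects.

    All arguments are arm means of the same outcome. Zero means the effects
    are purely additive (cumulative); a nonzero value means the combination
    does more (or less) than the sum of the parts.
    """
    effect_therapy = therapy_only - control
    effect_cash = cash_only - control
    effect_both = both - control
    return effect_both - (effect_therapy + effect_cash)

# Made-up arm means of an antisocial behavior index (lower is better).
example = interaction_effect(control=0.0, therapy_only=-0.10,
                             cash_only=-0.05, both=-0.30)
```

For an outcome where lower values are better, a negative interaction indicates complementarity: the combined treatment reduces the outcome by more than the two separate reductions added together.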

Relevance of the model for aggression
This model is most useful for thinking about illegal acts that carry material rewards. Other violence does not earn a wage, or does not have an opportunity cost of time. Nonetheless, we can cautiously use the model to think about acts such as aggression. For instance, we can think of some acts as having consumption value that is fleeting (the expressive pleasure of anger or revenge) or persistent (deterring future slights). In this case, σ < 0. Like crime, these acts carry a risk of punishment. If the criminal wage is zero, there is still a tradeoff between the consumption value today and the risk of punishment tomorrow, and the main comparative statics of therapy are similar to the case of crime: instilling values against violence (increasing σ) will reduce aggression; and increasing time consistency, β, also reduces aggression. Cash, however, will have little deterrent effect on aggression. 26
26 In this simple case, there is no role for self control skills, θ, in aggression. This is a drawback of adapting the pecuniary crime model, since STYL explicitly teaches men skills to regulate their emotions in charged, automatic situations. In some sense, then, STYL may not only change the underlying value of σ (the extent of one's desire not to engage in criminal activity) but also one's ability to ensure that expressed actions conform to the underlying preferences rather than succumbing to immediate temptation or anger. This is functionally equivalent to predictions associated with a larger σ.
We tried to survey each subject five times: (i) at baseline prior to the intervention; (ii and iii) at "short-run" endline surveys 2 and 5 weeks after the grants; and (iv and v) at two endline surveys 12 and 13 months after grants. 27 We ran pairs of surveys to reduce noise in outcomes with potentially low autocorrelation such as earnings or criminal activity. To measure baseline time preferences and abilities (such as executive function), following each survey the respondents also completed 45 minutes of incentivized games and tests. 28 This sample was mobile and difficult to track. Roughly 40% changed locations between surveys, many changing sleeping places every few weeks or nights. Just 30% had mobile phones. Most went by several aliases, and may have been on the run. To minimize attrition, we collected extensive contact information (all known addresses, plus at least five close contacts), and went to extreme effort to locate each person, wherever they had moved, averaging three to four days of searching per respondent per survey.
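The noise-reduction rationale for paired survey rounds can be made explicit with a short calculation. This is a sketch under assumptions that are ours, not the study's: the two rounds, $Y_1$ and $Y_2$, measure the same outcome with equal variance $v$ and correlation $\rho$ (all four symbols are introduced here for illustration only).

```latex
\operatorname{Var}\!\left(\frac{Y_1 + Y_2}{2}\right)
  = \frac{1}{4}\left(v + v + 2\rho v\right)
  = \frac{v\,(1+\rho)}{2}
```

Under these assumptions, averaging halves the variance when $\rho = 0$ and yields no gain when $\rho = 1$, which is why paired rounds help most for outcomes with low autocorrelation.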
Across all endline surveys, our response rate was 92.4%. Attrition is relatively unsystematic: treatment arms had similar response rates (within 0.4% of the control group) while a test of joint significance of all baseline covariates yields p = 0.328. 29 We collected longitudinal qualitative data to better understand the context, intervention, and mechanisms. First, a Liberian research assistant acted as a participant-observer during the Phase 1 therapy. Second, we interviewed facilitators for their impressions of the intervention and participants. Third, three Liberian research assistants conducted semi-scripted interviews, 14 pre-treatment and 130 post-treatment, with 66 men in the sample. 30 Interviews covered job satisfaction, investments, economic challenges, plans, antisocial behaviors, and perceptions of the interventions.

Key outcomes and multiple comparisons
After observing the pilot results, we decided to focus on five primary outcomes: two "ultimate" outcomes (antisocial behavior and economic performance) and three intermediary outcomes (economic time preferences, self control skills, and anticriminal identity/values).
27 The exception is the 100 men in the pilot, which had a single "short run" survey 3 weeks after grants. Actual survey times were, on average, 2.2, 5.7, 55.4 and 61.1 weeks after grants. Surveys were 90 minutes long and delivered verbally by enumerators in Liberian English on handheld computers.
28 See Appendix D for measurement details. Average winnings equalled about a half day's wages.
29 See Appendix A.3 for tracking techniques, response rates by survey wave and treatment group, and correlates of attrition. Of the 298 non-responses (of 3,896), we had no location information in 75% of cases; in the remainder, the men were mentally incapacitated (1%), had died (8%, or 9 men), were in prison (12%), or refused (3%). Covariates associated with higher attrition include better mental health and income.
30 19 in control, 16 in therapy, 15 in cash, and 16 in therapy then cash. Sampling was purposeful, based on variation in key baseline measures: economic success, crime, drug use, and present bias.
The study began before the advent of the social science registry, but we outlined these core hypotheses in a 2012 National Science Foundation (NSF) proposal 1225697. 31 The proposal does not have the level of detail or precision of a pre-analysis plan, but it does describe our main hypotheses and approach to measurement in some detail. The results we present hew closely to the proposal, with only a small number of exceptions. 32 Naturally, CBT could influence anti-social behavior through other mechanisms, such as drug abuse, changed social networks, or mental health. While plausible, these were not the primary focus of the therapy's design, and as such we did not specify them as core hypotheses in the NSF proposal. These "secondary" intermediary outcomes are important and relevant, however, and we measured and report on them for this reason.
To organize and reduce the number of hypothesis tests, we combine related measures into mean effects summary indexes. 33 We do so for the two ultimate outcomes of interest plus an index of all intermediary outcomes (primary and secondary). We classify intermediary outcomes into six families: time preferences, self control skills, identity/values, mental health, substance abuse, and social networks. The first three families were prespecified, while the latter three families (the secondary outcomes) were determined ex post, based on perceived conceptual similarity. Appendix D describes these measurement decisions.
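The construction of a mean effects summary index described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, standardizing against the control group is an assumption (the text does not say which reference group is used), and missing values are ignored.

```python
import numpy as np

def standardize(x, control_mask):
    """Z-score a 1-D array using the control group's mean and SD (an assumption)."""
    mu = x[control_mask].mean()
    sd = x[control_mask].std(ddof=1)
    return (x - mu) / sd

def mean_effects_index(components, signs, control_mask):
    """Average sign-aligned, standardized components, then re-standardize.

    components:   (n, k) array of composite outcome measures.
    signs:        length-k sequence of +1/-1 so all components point the same way.
    control_mask: (n,) boolean array marking control-group observations.
    """
    z = np.column_stack([
        standardize(s * components[:, j], control_mask)
        for j, s in enumerate(signs)
    ])
    return standardize(z.mean(axis=1), control_mask)

# Simulated illustration (made-up data, not study data).
rng = np.random.default_rng(0)
n = 200
control = np.arange(n) < 100          # first half plays the role of the control group
comps = rng.normal(size=(n, 3))
index = mean_effects_index(comps, signs=[1, -1, 1], control_mask=control)
```

Because each component enters with equal weight after standardization, a composite built from many survey questions counts no more than one built from a few, which matches the equal-weighting rationale given in the text.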
Moreover, the tables present unadjusted standard errors as well as p-values adjusted for multiple comparisons. We use the Westfall and Young (1993) free step-down resampling method for the family-wise error rate (FWER), the probability that at least one of the true null hypotheses will be falsely rejected, using randomization inference. 34 Rather than simply adjust for comparisons across the major family outcomes, we also adjust for the fact that we are estimating three treatment effects (one for each arm). Thus for our three main family indexes (economic behavior, antisocial behavior, and intermediary mechanisms) we report p-values adjusted for nine comparisons in total. Because we report both the adjusted and unadjusted statistical significance, readers can use the threshold appropriate to their question and preferences. If interested in the specific hypothesis of the effect of CBT on antisocial behavior, the unadjusted p-value is appropriate. If asking about the effects of different treatment combinations on mechanisms, as in our NSF proposal, the conservative adjusted p-values are more appropriate.
31 See http://chrisblattman.com/documents/research/2012.01.13_STYL_NSF_proposal.pdf, where the core hypotheses (and division into ultimate and intermediary outcomes) are outlined in Sections 1 and 4.1, and the operationalization (and measurement) of key outcomes in Section 4.4 of the proposal.
32 These decisions are described in Appendix D in detail. First, our final measure of antisocial behavior (which we called "crime and violence" in the proposal) does not include political violence, because none occurred before endline. Second, after the proposal but before data collection, we excluded executive function from our measure of self control, since we determined it was unlikely to be affected by the therapy. Third, after the analysis, we expanded our measure of identity and value change to include prosocial behaviors and appearance change, at the suggestion of referees. These changes have only a modest effect on the results, as outlined in Appendix E.1.
33 We take averages of our outcome measures, coded to point in the same direction, akin to the approach by Kling et al. (2007). Note that the outcomes used to create the summary index may themselves be composites of many survey questions, such as consumption (a composite of many goods) or an aggressive behavior index (a composite of many types of aggressive behavior, a standard way that psychologists measure aggression). We do so because it is typically the composite itself, rather than its component survey questions, in which we have theoretical interest or about which we have priors. In most cases this is reflected in the survey design, where the survey questions in each composite measure comprise a separate survey section. Also, creating an index by averaging all the component survey questions would give more weight to outcomes that are typically measured with many different questions (such as aggressive behavior) than to ones that can be precisely measured with a small number of variables (such as drug selling), which we find inappropriate. Nonetheless, Appendix E.2 shows robustness to an index that averages all survey questions rather than composite measures, or uses covariance weighting rather than mean effects.
34 Other papers taking this approach include Kling et al. (2007); Casey et al. (2012); Anderson (2008).
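The step-down logic behind family-wise adjusted p-values can be illustrated with a short sketch. It is not the authors' implementation: the original Westfall and Young procedure works with resampled p-values, whereas this version substitutes the closely related max-statistic variant for brevity, and the function name and inputs are hypothetical. It assumes the test statistics have already been recomputed under many re-randomizations of treatment assignment.

```python
import numpy as np

def westfall_young_stepdown(obs_stats, perm_stats):
    """Step-down max-statistic adjusted p-values (family-wise error rate).

    obs_stats:  (K,) observed test statistics, one per hypothesis.
    perm_stats: (B, K) the same statistics recomputed under B re-randomizations.
    """
    obs = np.abs(np.asarray(obs_stats, dtype=float))
    perm = np.abs(np.asarray(perm_stats, dtype=float))
    order = np.argsort(-obs)                      # most significant hypothesis first
    perm_sorted = perm[:, order]
    # Successive maxima, built up from the least significant hypothesis.
    succ_max = np.maximum.accumulate(perm_sorted[:, ::-1], axis=1)[:, ::-1]
    p_sorted = (succ_max >= obs[order]).mean(axis=0)
    p_sorted = np.maximum.accumulate(p_sorted)    # enforce monotone adjusted p-values
    adjusted = np.empty_like(p_sorted)
    adjusted[order] = p_sorted                    # map back to the input ordering
    return adjusted

# Tiny hand-checkable example: two hypotheses, four re-randomizations.
obs = [3.0, 1.0]
perm = [[0.5, 2.0], [3.5, 0.2], [1.0, 1.0], [0.1, 0.1]]
adj_p = westfall_young_stepdown(obs, perm)        # → array([0.25, 0.5])
```

For the nine comparisons described in the text (three arms by three family indexes), `perm_stats` would have nine columns. Some implementations add one to both the numerator and denominator of the proportion; the plain proportion is used here for simplicity.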

Empirical strategy and estimation
We estimate the intent-to-treat (ITT) effect on outcomes, Y, via the OLS regression:
$$Y_{ij} = \tau_1\,\mathit{TherapyOnly}_i + \tau_2\,\mathit{CashOnly}_i + \tau_3\,\mathit{Cash\&Therapy}_i + X_i'\lambda + \gamma_j + \varepsilon_{ij} \qquad (1)$$
where TherapyOnly, CashOnly, and Cash&Therapy are indicators for random assignment to treatment arms: therapy only, cash only, or both therapy and cash. We control for a vector of baseline characteristics, X, and fixed effects for each of the j randomization blocks, γ j . Y ij is the average of the two proximate survey rounds (e.g. the 2- and 5-week surveys for short term effects). To reduce sensitivity to outliers, we top-code continuous variables at the 99th percentile. We test sensitivity to alternative approaches in Appendix E.2.
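The estimation steps described above (top-coding, arm indicators, baseline covariates, block fixed effects, robust standard errors) can be sketched as follows. This is an illustrative sketch on simulated data, not the authors' code: the function names, the integer coding of arms, and the choice of the HC1 robust variance estimator are all assumptions made here.

```python
import numpy as np

def top_code(y, pct=99):
    """Cap values above the pct-th percentile (top-coding)."""
    return np.minimum(y, np.percentile(y, pct))

def itt_ols(y, arms, X, blocks):
    """OLS of y on arm indicators, covariates, and block fixed effects.

    arms:   (n,) integer codes (hypothetical coding): 0 = control,
            1 = therapy only, 2 = cash only, 3 = therapy then cash.
    X:      (n, p) baseline covariates.
    blocks: (n,) randomization-block labels.
    Returns the three arm coefficients and their HC1 robust standard errors.
    """
    T = np.column_stack([(arms == a).astype(float) for a in (1, 2, 3)])
    block_ids = np.unique(blocks)
    # One dummy per block; the dummies absorb the intercept.
    D = (blocks[:, None] == block_ids[None, :]).astype(float)
    Z = np.column_stack([T, X, D])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    n, k = Z.shape
    bread = np.linalg.inv(Z.T @ Z)
    meat = Z.T @ (Z * resid[:, None] ** 2)
    cov = bread @ meat @ bread * n / (n - k)      # HC1 small-sample scaling
    se = np.sqrt(np.diag(cov))
    return coef[:3], se[:3]

# Simulated illustration (all numbers are made up, not study data).
rng = np.random.default_rng(1)
n = 2000
arms = rng.integers(0, 4, size=n)
blocks = rng.integers(0, 20, size=n)
X = rng.normal(size=(n, 2))
y = (-0.25 * (arms == 1) + 0.0 * (arms == 2) - 0.30 * (arms == 3)
     + X @ np.array([0.1, 0.2]) + 0.05 * blocks
     + rng.normal(scale=0.1, size=n))
effects, robust_se = itt_ols(y, arms, X, blocks)
```

In an application along the lines of the text, `y` would be the average of the two proximate survey rounds, with `top_code` applied to continuous outcomes before estimation.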
Self-reported data One threat to identification comes from systematic measurement error in self-reported data, especially measurement error correlated with treatment status. In the absence of administrative data such as arrest records, we developed a technique to validate select survey variables through intensive observation. Blattman et al. (2015) reports the approach in detail, and we summarize in Section 7 and Appendix F.
Spillovers Another threat to identification comes from spillovers. Our recruiting strategy-working in large neighborhoods, recruiting less than 1% of adult men in those areas, and less than 15% of high-risk men we could identify on the street-was designed to reduce equilibrium effects such as a change in the returns to illicit work. We do not have the data or research design, however, to confirm that these equilibrium effects were minimized. Another potential spillover involves interactions within and between treatment arms, especially therapy. For example, because of peer effects and the emphasis on social norms, there could be positive spillovers from treating groups of friends. If so, the coefficients on therapy would overestimate the effect of therapy in isolation. Alternatively, to the extent that control subjects interact with and learn from treatment subjects, they may acquire some of the lessons, leading us to underestimate therapy's impact.
Using the Westfall-Young bootstrap and the Holm-adjusted Bonferroni methods yields similar results.
We designed recruitment to minimize such interaction bias, but could not eliminate it. We do not have detailed social network data for the full sample, but we did trace social networks within the first two therapy groups. On average, each subject was acquainted with 6 of the 43 others in therapy, and 30% reported one close associate in therapy. Given randomization, we can assume similar relationships in the other arms. Without systematic data on networks we cannot estimate spillovers, and this is a weakness of our design. The two effects should cancel each other out somewhat, but the extent is unknown.
Interpretation and generalizability Another point is that our sample is not drawn from a well-defined population. This is a function of the setting-there is no administrative record of high-risk men in Liberia (or in any low-income or fragile state). We recruited men in a relatively transparent, replicable fashion, but a third declined to enter the study for reasons we cannot observe. Thus the treatment effects we estimate cannot be generalized to a defined population. This is not only a constraint of the setting, but also the nature of a proof-of-concept trial, where we have two promising but highly uncertain treatments-unconditional cash and CBT. Thus our study is akin to a medical efficacy trial, to determine whether the intervention produces the expected result under favorable circumstances.

Results
Figure 1 reports ITT estimates using equation 1 on the two ultimate outcomes of interest and an index of all intermediary outcomes. Figure 2 reports ITT estimates for the six intermediary outcome families. Both display regular 95% confidence intervals as well as p-values unadjusted and adjusted for multiple comparisons, as outlined in Section 4.1. Figure 1 adjusts for 9 comparisons (3 arms × 3 outcomes) and Figure 2 adjusts for 18. 35
35 These are calculated separately for the 2-5 week and 12-13 month surveys. Appendix E.1 also displays results if we ignore the ultimate/intermediary and primary/secondary distinctions, and simply adjust p-values for 3 × 8 = 24 comparisons. In this case, the short term conclusions are unaffected, but over 12-13 months, the effect of cash plus therapy on antisocial behaviors has a p-value of 0.106.
As seen in Appendix E.2, these effects are robust to a variety of specifications and attrition scenarios. We obtain similar results if we: pool the endlines rather than averaging them; construct summary indexes of all underlying survey questions rather than indexes of the composite measures; or covariance weight rather than weight index components equally. We also show that the results are robust to conservative attrition scenarios by substituting extreme values for missing outcomes.
Figure 1: Program impacts on the two ultimate outcomes and an index of all intermediary outcomes (z-scores) with 95% confidence intervals and unadjusted/adjusted p-values
Notes: The figure reports the effect of each treatment arm after 2-5 weeks and 12-13 months with 95% confidence intervals and two p-values, one unadjusted and one adjusted for 9 comparisons (3 arms and 3 outcomes) using the Westfall-Young method. Treatment effects are estimated via OLS controlling for baseline covariates and block fixed effects. Each summary index is the standardized mean of composite outcomes. Standard errors are heteroskedasticity-robust.
Broadly speaking, cash did not lead to a statistically significant or sustained reduction in overall antisocial behaviors, but therapy did. In the short run, therapy led to large reductions, by 0.25 standard deviations with therapy alone and 0.31 standard deviations with therapy plus cash. This reduction in antisocial behaviors persisted, however, only when therapy was followed by cash: one year after therapy, therapy alone led to a 0.08 standard deviation fall in antisocial behaviors (not statistically significant) compared to a 0.25 standard deviation fall with therapy plus cash (significant at the 1% level with unadjusted p-values, and at the 5% level if adjusted). This difference between therapy and therapy plus cash after 12-13 months is significant at the 5% level using unadjusted p-values. 36 We see a change of similar proportions in the intermediary outcomes in aggregate, and this too is only persistent in the group that received therapy plus cash. 37 Individually, most of the intermediary outcomes shift in the expected direction, and moderate over time. In the short run, only time preferences, mental health, and social networks are statistically significant using the more conservative adjusted p-values. After a year, no individual intermediary outcome is significant using the adjusted p-values, although all (except substance abuse) are pointing in the expected direction.

Antisocial behaviors
We defined anti-social behaviors as disruptive or harmful acts towards others, such as crime or aggression. Thus the family excludes self-harm (e.g. drug abuse) or acts by peers. Table 2 reports impacts on the components of the anti-social behavior index, for illustrative purposes. The table reports both unadjusted and adjusted p-values. The adjusted p-values on the antisocial behaviors family index come directly from Figure 1 (adjusted for the 9 comparisons of three arms and three outcomes). The adjusted p-values on the seven components reflect the 3 × 7 = 21 comparisons across all arms and components.
We constructed the seven component measures from sets of related survey questions, each typically from its own survey module. All are self-reported. In general, the coefficients on therapy only or therapy plus cash are negative, and a majority are statistically significant using unadjusted standard errors. We must interpret the individual point estimates with caution, since almost none of the individual components are significant when we adjust for 21 comparisons, save aggressive behaviors.
36 See Appendix E.3 for formal tests of the difference between therapy plus cash and therapy or cash alone. Appendix E.4 tests whether 2-5-week and 12-13-month impacts are equal. We cannot reject the hypothesis that the effects of both therapy and cash are equal over time, but we can reject the equality of effects for therapy alone.
37 A natural question is whether the therapy is impactful for the most or least antisocial men. Appendix E.7 reports ITT regressions where we add an interaction between treatment arms and a standardized index for baseline antisocial behavior, as well as initial future orientation and self control. The therapy was impactful for the average participant, but the greatest decline in antisocial behavior was among those with the highest antisocial behaviors and the lowest levels of self control and future orientation. These estimates must be taken with caution, since the heterogeneity analysis was not prespecified. But these were the only heterogeneity analyses run on anti-social behavior.
Notes to Table 2: The table reports intent to treat estimates of the effect of each treatment arm after 2-5 weeks and 12-13 months, controlling for baseline covariates and block fixed effects. We focus on pre-defined composite measures, typically defined by survey module. For instance, thefts/robberies is the sum of 8 kinds of crimes; disputes/fights is the standardized mean of 9 kinds of physical or verbal altercations with peers, community, and authorities; aggressive behaviors is the standardized mean of 19 possible types of aggression and hostility; and verbal and physical abuse of partners is the standardized mean of 3 forms of verbal abuse of intimate partners plus one form of physical abuse. (For the latter two cases, we report standardized indexes since the incidents are measured on a 0-3 frequency scale, and the absolute sum itself has no interpretation.) The overall summary index is the mean of these seven composite outcomes, itself standardized. Heteroskedasticity-robust standard errors are reported in brackets. Adjusted p-values use the Westfall-Young method to correct for multiple comparisons, as described in Section 4.1. The overall family index is adjusted for 9 comparisons, as in Figure 1. Within each endline round, the component indexes are adjusted for 21 comparisons, for 3 arms and 7 outcomes. P-values less than 0.05 are bolded. † These variables were not collected during every phase/round, so their regressions have a smaller sample size.
• Drug selling and other crime. In the short run, 17% of the control group said they sold drugs, and they admitted to 2.6 acts of theft in the past two weeks. A year later, 13.5% sold drugs and they reported 1.9 acts of theft. Crime rates may fall because we are recruiting people in hard times, and there is regression to the mean. With therapy, however, crime rates fell by almost 50% in the short run, and this fall persisted a year with therapy plus cash. Appendix D describes specific crime measures. To give a crude sense of magnitude, if we extrapolate results to the full year since baseline, therapy plus cash led men to go from 66 to 40 crimes per person per year (Appendix E.6). Given the $530 intervention cost, this is roughly $21 per crime in the first year, ignoring any ongoing impact on crime or other program benefits.
• Fights. We also asked about 9 types of verbal and physical altercations in the past two weeks, including the frequency and severity of disputes with peers, neighbors, leaders, or police. Here, as with all summary indexes in the paper, we use the standardized mean effects of all nine survey questions. 38 On average, men reported about one dispute in the past two weeks. None of the effects are distinguishable from zero, and only the point estimates on therapy and cash are negative.
• Weapons. We asked men if they carried a weapon on their body for protection. This was typically a knife, as guns were rare. After a year, 15% were carrying a weapon, and this fell by about half with either therapy alone or therapy plus cash.
• Arrests. 14% of the control group reported an arrest in the two weeks before the short-run endline, and 12% after a year. We did not see a statistically significant decline in arrests, though after one year the coefficient on therapy plus cash represented a decline of almost a third, or about one arrest per year.
• Aggressive and hostile behaviors. We asked 19 questions about reactive and proactive aggression, such as the frequency with which they yell, curse, bully others, cheat, or lose their tempers. 39 After a year, the index of all 19 questions fell .15 standard deviations (not significant) with therapy alone and .34 with both (significant at the 5% level with multiple comparison adjustment).
38 A main reason is that the measurement scales differ across component survey variables and the absolute values of the scales themselves are not meaningful (e.g. a frequency scale of 0-3, from never to often). We standardize individual survey questions, average them, and standardize this composite to have mean zero and unit standard deviation. Results are robust to alternate weighting and indexing approaches.
39 We used nine questions from a standard scale, adapted to Liberian English (Raine et al., 2006), and added 10 more locally-relevant acts based on our qualitative interviews.
• Intimate partner abuse. We have a crude measure of intimate partner abuse-3 questions on verbal abuse (e.g. cursing and yelling) and one on physical abuse in the past two weeks. A standardized index of these measures fell little in the short run with therapy, and after a year the coefficients are actually positive (the only instance where therapy is positively correlated with violence).
• Political violence. Given Monrovia's history of mercenary recruitment, riots, and election violence, we predicted the men would have opportunities for political violence. Indeed, shortly after the Phase 1 men received therapy, there was a minor riot in the city. 40 From then, however, Liberia entered one of the most politically quiescent periods in recent history, and so we had no political violence to measure. This is the only pre-specified outcome that we could not test directly.
Table 3 reports program impacts on an index of measures of economic performance: income, homelessness, savings, investment, and employment levels. In the month after grants, general economic activity increased among those receiving cash alone (.66 standard deviations) or cash following therapy (.58 standard deviations). But after a year the effects in all three arms have approached zero. The same patterns hold if we look at income alone. 41 We measured income in three ways: (i) consumption in the past two weeks; (ii) estimated earnings in all activities in the past two weeks; and (iii) an index of durable assets. 42 The short term rise in consumption is significant at the 1% level, and the rise in assets at the 10% level, after adjusting for multiple comparisons. An overall index of all three income measures is significant at the 1% level (not shown). Homelessness also falls significantly as income rises, but there is no decline after a year. Consumption and assets could rise simply from spending the grant. But this doesn't explain the temporary earnings boost. Overall, the cash seems to have been invested in petty business, and this accounts for the rise in short run earnings. But bad shocks, especially theft, meant these gains were fleeting. To see this, we assessed grant spending in two ways. Using pictures of different types of spending and plastic chips, we first asked grant recipients to indicate how they used the grant. Table 4 reports self-reported allocations of the grant by treatment arm. We see little effect of the recent therapy on allocations. Little of the grant seems to have been spent on drugs, alcohol, gambling and prostitution. Even if men underreport these expenses, we see no difference between cash recipients who did and did not receive therapy.
Notes to Table 3: The table reports intent to treat estimates of the effect of each treatment arm controlling for baseline covariates and block fixed effects. The income summary index is the standardized mean of three composite outcomes (themselves first standardized). Heteroskedasticity-robust standard errors are reported in brackets. Adjusted p-values use the Westfall-Young method to correct for multiple comparisons, as described in Section 4.1. The overall family index is adjusted for 9 comparisons, as in Figure 1. Within each endline round, the component indexes are adjusted for 21 comparisons, for 3 arms and 7 outcomes. P-values less than 0.05 are bolded. † These variables were not collected during every phase/round, so their regressions have a smaller sample size.
(4) report the coefficients and p-values from an OLS regression of the proportion spent on an indicator for assignment to therapy then cash controlling for block fixed effects and baseline covariates.
We can also look at expenditure data, which included a range of business investments in the two weeks prior to the 2- and 5-week surveys. As reported in Table 3, those who received only cash reported $56 more investment in each 2-week period. Thus the total 5-week investment treatment effect is at least $112-just over half of the grant (significant at the 1% level using adjusted p-values). Meanwhile, the therapy only group resembled the control group in terms of investment.
These short run investments did not last. In the cash only group, the stock of business assets after a year is only $19 greater than in the control group, not statistically significant.
We also see no one-year difference in total work hours. 43 What happened? From qualitative interviews, insecure property rights were a major barrier to capital accumulation. A large number of men reported the theft of all their assets, or all their wares, on a regular basis, by criminals or (for market wares) the police. 44 We added this question to the survey (though not as part of the performance index). At each survey round, about 70% of the men reported a house robbery and belongings stolen in the past month. 45 This implies a robbery every other month, at least. There is little difference by treatment status, suggesting that men were not more likely to be targeted if they received cash. But they would have had more to lose.

Indications of noncognitive skill and preference change
Tables 5 and 6 report treatment effects on our six intermediary outcome families and their components. For the six family indexes, we report regular standard errors as well as p-values adjusted for 18 comparisons (three arms and six families), as in Figure 2 above. (Correcting for just the 3 primary outcomes yields qualitatively similar conclusions.) We also report and discuss the components of each index mainly for illustrative purposes. We add p-values adjusted for multiple comparisons within each family, so that a family with four components is adjusted for twelve comparisons.

Time preferences
We report a summary index of 4 measures of patience and 4 of time inconsistency--akin to δ and β in our model. Specifically, we measured: a set of incentivized tradeoffs between modest amounts of money now versus in two weeks, and again in two versus four weeks, that allow us to place men in seven ordered bins of patience and time-inconsistency (for an average payout of $3, about a day's wages); a hypothetical (non-incentivized) version of the same tradeoffs, with higher stakes tradeoffs; and self-reported assessments of time preferences. All are described in Appendix D.3.
In the short term, time preferences become more forward-looking for all treatment arms, though the result is largest (.32 standard deviations) and statistically significant only for therapy plus cash (at the 1% level even accounting for multiple comparisons). After a year, the point estimates from therapy are positive: 0.15 standard deviations for therapy alone and 0.21 standard deviations for therapy plus cash. The latter is statistically significant using regular standard errors but, like all the family indexes, is not significant after a year once we account for multiple comparisons. Looking within the family index, point estimates are larger and more precise for patience than time inconsistency.
44 In some cases this was theft by a friend, family member, or stranger. Also common was confiscation of wares by the police. Some forms of market selling contravene official rules, often unenforced, but nonetheless giving police opportunities to confiscate. Some confiscation is legitimate, some not.
45 We do not include this outcome in the economic performance index as it's not a measure of economic performance. Rather we report it in the table mainly for descriptive purposes.
Notes: The table reports intent to treat estimates of the effect of each treatment arm controlling for baseline covariates and block fixed effects. We focus on pre-defined composite measures, typically defined by survey module. Each overall summary index is the mean of its composite outcomes, itself standardized. Heteroskedasticity-robust standard errors are reported in brackets. Adjusted p-values use the Westfall-Young method to correct for multiple comparisons, as described in Section 4.1. The overall family index is adjusted for 18 comparisons, as in Figure 2. Within each endline round, the component indexes are adjusted for 3 arms and the number of components within the family. P-values less than 0.05 are bolded.

Self control skills
We measured self control skills using standard psychometric questionnaires for four constructs that psychologists associate with less impulsive and more planful behavior. 46 First, we looked at 9 questions from the Barratt Impulsiveness Scale (Spinella, 2007), which assesses one's inability to control thoughts and actions. 47 Second, we used 8 questions from the NEO-five factor personality inventory to assess conscientiousness (Costa and McCrae, 1997). Topics included following societal rules, and controlled, careful behavior. Third, we took 7 questions on perseverance from the GRIT scale (Duckworth and Quinn, 2009), which captures the ability to press on in the face of difficulty. Finally, we selected 8 questions on reward responsiveness-whether they are motivated by immediate, typically emotional rewards-from the Behavioral Inhibition/Behavioral Activation Scale. 48 We adapted the scales and questions to the context and Liberian English. Appendix D lists all questions.
In the short term, respondents reported little change in self control skills. In fact, self control was the only one of the six families that did not show a statistically significant increase in the short run, using unadjusted p-values. In contrast, after a year, both therapy and therapy plus cash are associated with increased self control of 0.16 and 0.24 standard deviations. The latter is statistically significant at the 5% level with regular standard errors but not with the adjustment for multiple comparisons. Looking at the components of the self control index, all point in the expected direction, though the magnitudes and precision are greatest with impulsivity and reward responsiveness.
Time preferences enter into our theoretical model differently than self control, but an obvious question is whether they are distinct. The correlation between the self control and time preference summary indices is 0.33, significant at the 1% level. If we combine the time preference and self control measures into a single summary index, therapy alone and therapy plus cash both have statistically significant positive impacts of roughly 0.22 standard deviations after 2-5 weeks and .26 after 12-13 months (Appendix D.8).
We must be cautious because all self control scales are self-reported, and treated men could simply be repeating back their lessons. There is some evidence this is not so. We divide the 32 self control questions into two indexes: questions with high (44%) and low (56%) emphasis in the curriculum. 49 Table 7 reports the ITT estimates after a year. The effect of cash and therapy is at least as large for low emphasis items.
46 In addition to these four psychological scales, we also conducted tests of executive function (cognitive processes associated with inhibitory control, working memory, self regulation, and planning), such as digit recall (see Appendix D.6). We did not hypothesize a change in executive function because these are thought to be abilities that solidify in early childhood (Appendix D.8). As expected, there is no statistically significant change from treatment. This is one instance where the NSF proposal says otherwise, but our assessment of the literature changed between that proposal and data collection. Hence we exclude executive function from the index. Including it would not materially change our conclusions, since the self control measure performs weakest of all the noncognitive skills. Appendix D.8 reports these robustness checks.
47 Examples include "I buy things on impulse" or "I say things without thinking".
49 We rated each index component on a scale of 0 (not emphasized) to 4 (very emphasized). We then defined low-emphasis components as those rated 0 or 1 and high emphasis components as those rated 2 or above. These results are unchanged using 1.5 or 2 as the emphasis cutoff.
(1) Summary index of self-control skills
Notes: The table reports intent to treat estimates of the effect of each treatment arm controlling for baseline covariates and block fixed effects. We have subdivided the summary indexes reported in Table 5 by their coverage of the specific topics in the STYL curriculum. With unadjusted standard errors, *** p<0.01, ** p<0.05, * p<0.1. † These variables were not collected during every phase/round, so their regressions have a smaller sample size.

Anticriminal identity/values
Social identity and values are not straightforward to measure, and we are not aware of existing models. Based on our qualitative work, we developed three main measures, which we assemble into a single family index. First, we attempted to measure values directly, using a set of 33 self-reported attitudes towards the appropriate use of crime and violence in the men's own lives, as indicators of the degree to which they had internalized mainstream social norms. 50 Second, at the one-year endline we measured an index of prosocial behaviors, including group memberships, group and community leadership, and contributions to local public goods. (These are more of a behavior than a skill or preference, but we hypothesized that it would be a reasonable proxy, allowing us to infer preferences from behaviors.)
Finally, the therapy encouraged men to change their appearance as part of the identity change, and we asked survey enumerators to record their subjective impressions: quality of dress, shoes, cleanliness, and smell. 51 In the short run, the family index improves by roughly 0.2 standard deviations in all treatment arms, significant at the 5% level with unadjusted p-values but not with adjusted p-values. With therapy followed by cash, the identity/values index rises 0.27 standard deviations, with p = 0.067 with our conservative adjustment. After a year, these treatment effects attenuate somewhat, particularly the change in appearance (which reverses in sign), and so the change is not statistically significant at conventional levels. Among the three components, however, the largest change appears to be in self-reported anticriminal and anti-violent values, and the impacts are sustained after a year in magnitude (though not statistically significant). 52
50 Our approach drew on the way social psychologists distinguish between social norms and attitudes, but we focus on the attitudes alone. Social norms capture what people think others do or should do in a particular situation, and surveys sometimes then ask the same respondents what they believe is appropriate behavior (i.e. a norm deviation). Our questions are akin to these attitudinal questions. (We did not measure perceived norms because we did not have the space, nor did we deem the measure a priority.) Specifically, we asked 11 questions on attitudes to the use of violence to solve community or personal problems, such as mob killings of suspected thieves, or attacking their unfaithful wife's lover. We also asked 12 questions about their attitude toward participating in crime, including whether they would feel fine taking unwatched goods or stealing $100 from someone's pocket. We also asked about 6 hypothetical forms of political violence, including whether they discuss protesting with friends or making trouble or conflict with the authorities.
51 Unlike the other two measures, we did not prespecify appearance as a reflection of identity and values in the NSF proposal. But it is difficult to see where else it belongs, and at the suggestion of referees we include it in this family.
52 As with self control, we divide the 29 value questions into two indexes by high and low emphasis in the curriculum. Table 7 reports the ITT estimates after a year. The effect of cash and therapy is at least as large for the low emphasis components.

Positive self-regard / mental health
Half our mental health family index is positive self-regard. Poor self-regard has been linked with many aspects of negative behavior and counterproductive or extreme risk-seeking behavior (Coopersmith, 1967). Some research (e.g. Judge et al. 2002) suggests self-regard is captured by an interrelated set of psychological scales, including: (i) neuroticism, the tendency to experience emotional instability or anxiety, assessed with 8 questions from the NEO-5 factor personality inventory (Costa and McCrae, 1997); (ii) self esteem, assessed with 8 questions such as, "I am able to do things as well as most other people" or "I take a positive attitude toward myself"; (iii) locus of control, the extent to which individuals believe that they, rather than fate, control events affecting them, measured using 8 questions from a standard questionnaire (Sapp and Harrod, 1993). Arguably related to positive self-regard, we also collected a classic happiness measure, asking men to rank their subjective well being in absolute terms and relative to others in their community. 53
A second element of the mental health index is depression and distress. We assessed 6 symptoms of depression and 12 symptoms of post traumatic stress, based on a locally adapted instrument used previously with ex-combatant populations in Liberia (Annan et al., 2015;Blattman and Annan, 2015). We group this with positive self-regard as a mental health family in the interests of minimizing the number of families.
We treat positive self-regard/mental health as distinct from the anticriminal identity and values because, in principle, a positive self-image and criminal/outcast social identity are compatible. This simply happens to be uncommon in Liberia, where there is typically little social esteem associated with the outcast/criminal social category. Moreover, the main purpose of the anticriminal measure was not to capture the quality of the men's self image, but rather their change in values, as illustrated in the theoretical model.
In the short run, a family index of these measures rises 0.34 standard deviations from therapy plus cash (significant at the 1% level after adjusting for multiple comparisons). After a year, these treatment effects attenuate and are significant with unadjusted p-values only. Examining components suggests that the effects are driven most of all by the subjective well being measure and self-esteem, although this does not stand up to adjusted p-values.
53 We asked about well being, health, wealth, and power in absolute terms. We asked about wealth, respect, power, and access to services in relative terms. Each used a picture of a ladder with 10 rungs. The summary index is the average of each ladder. Patterns are broadly similar across all ladders.

Substance abuse
Therapy tried to equip participants with strategies to cut back substance abuse. While this is an obvious outcome of interest, we did not prespecify it as being of primary interest, for two reasons: the program's overwhelming focus on antisocial behaviors, with drug use seen by NEPI as only a modest factor; and the fact that systematic reviews of CBT do not find support for its effectiveness in treating substance use disorder (Hofmann et al., 2012). Nonetheless, if the therapy decreased substance use, we could see both economic and antisocial behavior change.
In the short run, reports of daily use in the control group are 67% for alcohol, 50% for marijuana, and 21% for hard drugs. An index of all three indicators (0-3) fell 0.20, significant at the 10% level using adjusted p-values. At the one year endline, reports of daily use in the control group are roughly similar. An index of all three indicators (0-3) fell only 0.06 after a year as a result of therapy and cash (not statistically significant).

Quality of social networks
Finally, we also assessed risky social networks. 54 We did not prespecify a change here, but over the course of the qualitative interviews, respondents repeatedly talked about changing peer groups to avoid crime, violence and drugs. We thus measured the traits, positive and negative, of men's five closest peers. 55 We also asked about closeness to and support received from family members, former rebel commanders, and "big men" (intended to connote a criminal boss). A summary index of positive social networks increased in the short run by 0.15 standard deviations with therapy and 0.33 standard deviations with therapy plus cash (the latter significant at the 1% level with adjusted p-values). After a year, the point estimates remain positive, but are about half as large and not statistically significant.

Insights from qualitative interviews and observation
One of the strongest impressions we gained from interviews was the importance men attached to identity change, or what NEPI called "transformation". Nearly all the subjects described feeling ostracized at baseline, and many reported that the therapy pushed them to believe they could be someone better for the first time. The facilitators played an important role here. The participants we interviewed unanimously had admiration and praise for the facilitators, highlighting that their backgrounds demanded respect and provided credibility, while their personal stories of change were encouraging. Beyond modeling the change in social identity, men reported the facilitators were also sometimes the first people to treat them with seriousness and respect, and this built their confidence to reintroduce themselves to community members, or to expose themselves to banks and shops.
Attempts to behave normally, especially the exposure to new social situations, seemed to reinforce skill and identity change. Many of the men failed in their plans, or experienced stigma in their shop or bank visits. In group sessions, men discussed what went wrong and why (such as poor decisions, or choice of dress). Men with setbacks learned from and were encouraged by the positive experiences of others. And facilitators sometimes observed men's homework attempts and coached them through difficulties.
Men's appearance also transformed during therapy. The first day men arrived with long or messy hair, facial hair, dirty or ripped clothing, wearing t-shirts with shorts and sandals.
Their demeanor was tough, and their appearance signaled outcast status. Haircuts were offered in week two, and many men took advantage, symbolizing the change. Others showed up beforehand having gotten a haircut on their own. Similarly, before the unit on hygiene, some men began arriving in pants, shoes, and collared shirts. Typically a few men in each group resisted these changes. But seeing the positive experiences of others, they too began to arrive more clean cut, trying out the new identity. The survey results confirm a short-term change in appearance. The absence of 12-13-month change is puzzling.
A year later, therapy participants also described applying skills of self-regulation in their lives. To avoid fights, they used new tactics: removing themselves from emotionally-charged situations, allowing space to process their feelings, and ignoring negative automatic thoughts in favor of more controlled thinking. Related were improved social and communication skills. Interviewees described how such skills allowed them to engage with community members or in disputes and express themselves without anger or violence.
Not only did the community regard them differently, many said, but troubled young men began coming to them for advice and lessons learned from the therapy once they saw the sudden and sustained change. This was another important source of reinforcement, and perhaps one reason we do not see a change in peer quality in the data.

Can we believe our self-reported data?
Self-reported data raise several worries, the most serious being measurement error correlated with treatment. For instance, men who receive an anti-violence intervention might be more likely to tell us they are non-violent, leading us to overestimate the treatment effect of therapy.
This kind of bias is hard to square with the patterns of effects we observed. Therapy followed by cash would have to induce systematic errors where therapy or cash alone did not. Nonetheless, this is possible, for example if the largest misreporting were associated with larger past benefits. Thus, concerned that our survey measure, $y^s$, may be biased, we set out to intensively validate some measures, $y^v$. If $y^v$ is closer to the true behavior, $y^*$, this allows us to estimate the degree and direction of bias. We summarize the approach, empirical strategy, and results here, with details in Appendix F and Blattman et al. (2015).

Approach to validation
Of more than 4,000 endline surveys, we randomly selected 7.3% and re-tested answers to six survey-based measures with two-week recall periods. We chose four potentially sensitive behaviors-marijuana use, thievery, gambling, and homelessness. We also chose two everyday expenditures that could be subject to recall bias or other error-paying to watch television in a video club, and paying to charge a mobile phone. We chose these six because we wanted a diverse set of behaviors with similar recall periods. We also wanted very specific behaviors (e.g. stealing rather than any crime, or marijuana rather than substance abuse). Finally, we wanted outcomes that were a primary focus of the treatment (e.g. stealing) and others that were not (gambling or expenditures).
We used intense qualitative work (in-depth participant observation, open-ended questioning, and efforts to build relationships and trust) to try to elicit more truthful answers. Over several days of trust-building and conversation, plus direct observation, we tried to elicit a direct admission or discussion of the behavior.
We selected and trained eight of the study's most talented qualitative research staff as validators, all Liberians. In the ten days following the survey, a validator visited the respondent over four days, spending several hours each day in conversation and observation. Validators shadowed respondents as they went about their day, rather than conducting formal interviews. They raised target topics through indirect questions while chatting.
Validators developed techniques to foster trusting relationships and to build rapport: becoming close to street leaders; eating meals with subjects; sharing personal information (including similar acts they or their friends engaged in); and mirroring participants' appearance and vernacular as appropriate. Validators would also observe the respondent's behavior from afar, as well as converse with peers and family. The goal was to attain insider status, and thus reduce the chance of misreporting. The premise was that time, a focus on a small number of behaviors, and trust/rapport building would mean that respondents were less willing, or felt less able, to deceive a more familiar person who also knew them better. Validators also had the opportunity to clear up misunderstandings and get a more accurate assessment of the behaviors. By discussing sensitive behaviors openly, and relating their own experiences and those of friends, validators sought to dispel any notion that certain answers are more desirable, or would result in any strategic gains.
Without knowing the respondent's survey response, $y^s$, the validators coded an indicator of whether or not the respondent engaged in the behaviors in the two weeks prior to the survey, $y^v$. The authors reviewed the evidence and the coding for every case. In general, we used a relatively high standard of evidence, only coding $y^v = 1$ for a direct admission of the behavior or persuasive statements that they did not engage in the behavior. 56 If this technique simply reproduced the errors in the survey data, then the validation is of little help. The key assumption is that four days of building trust and gathering extensive information, regarding just six behaviors, reduced experimenter demand and other biases correlated with treatment compared to responses during a 300-question, 90-minute questionnaire.
Nonetheless, $y^v$ is not free from error. Appendix F.1 reviews our approach and its limitations in more detail. Many of these limitations (the requirement of a direct admission, the disruption in people's lives, errors in recall periods, or increased social desirability bias from scrutiny) undoubtedly led to systematic errors in $y^v$. These errors, however, are not necessarily correlated with treatment. Such correlation remains possible, for example because validators could have learned men's treatment status in conversation, and this could have biased their coding. We designed the trust-building and evidentiary standards to minimize this risk.

Survey-validation differences
Of the 297 men we selected for validation, we found and validated 240 (81%). 57 Table 8 reports the means of $y^s$ and $y^v$ in the full sample and each treatment arm, as well as the percentage of times the two measures agree. $y^s$ and $y^v$ are identical about 80% of the time for sensitive measures and about 70% of the time for expenditures. As expected, however, $\bar{y}^s < \bar{y}^v$: the average person reported 1.21 sensitive behaviors and 1.09 expenditures in validation, and 1.12 sensitive behaviors and 0.82 expenditures in the survey.
With this sample, only the underreporting of expenditures is statistically significant. We report t tests of the simple difference, $y^s_i - y^v_i$, in Appendix F.2.1, as well as a discussion of patterns of under- and over-reporting. Expenditure underreporting appears to be largest in the control group, possibly because they are trying to appear more needy. Among sensitive behaviors, underreporting is generally less than 10% of the survey means, and is only statistically significant in the case of gambling. This is mainly driven by the cash only arm, who may have been reluctant to report spending the grant this way.
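As a minimal sketch of the kind of comparison summarized in Table 8, the snippet below computes the share of respondents whose survey and validated indicators agree, and the mean gap $y^s_i - y^v_i$. The indicator values are hypothetical numbers made up for illustration, and the function name is our own; this is not the code used to produce the table.

```python
def agreement_and_gap(y_s, y_v):
    """Return (share of cases where y_s == y_v, mean of y_s - y_v).

    y_s: survey indicators (0/1), y_v: validated indicators (0/1).
    """
    if len(y_s) != len(y_v) or len(y_s) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_s)
    # Share of respondents whose two measures coincide
    agree = sum(1 for s, v in zip(y_s, y_v) if s == v) / n
    # Mean survey-minus-validation gap; negative means survey underreporting
    gap = sum(s - v for s, v in zip(y_s, y_v)) / n
    return agree, gap


# Hypothetical data for ten respondents (illustration only)
survey = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
validated = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

agree, gap = agreement_and_gap(survey, validated)
print(f"agreement = {agree:.2f}, mean gap = {gap:.2f}")
```

In this made-up example the two measures differ for two of ten respondents, both in the direction of the survey missing a behavior that validation found, so the mean gap is negative.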

Is measurement error correlated with treatment?
Empirical strategy If we believe that the validation measure is closer to the true behavior, then one way to test for bias in the survey-based treatment effects is to take the difference $y^s_i - y^v_i$, our proxy of measurement error for person $i$, and regress it on treatment (equation 2), where $\beta_1$ denotes the coefficient on treatment. If $\beta_1 < 0$ for sensitive measures, then treated men were more likely to under-report bad behaviors, and our survey-based treatment effects may overestimate the decline in antisocial behaviors. And if $\beta_1 > 0$ for expenditures, then treated men may have over-reported their expenditures more than the control group, and our survey-based treatment effects may overestimate the short-run increase in income.
57 Attrition was higher than in the survey as we could not validate the behaviors of men who migrated across the country. Attrition was not correlated with treatment or baseline covariates (Blattman et al., 2015).
Notes: The table reports the means (standard deviations) of the survey and the qualitatively validated measures for the full sample and by treatment arm. "% in agreement" is the percentage of respondents for whom the survey indicator equals the qualitatively validated indicator.
With a sample of 240, we estimate we are powered to detect average under- or overreporting of at least 14%, and error correlated with treatment of 28%. 58 Because of power concerns, we pay close attention to the sign, magnitude, and confidence interval for $\beta_1$.
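For readers who want to see how a minimum detectable effect relates to a standard error, the sketch below uses the textbook normal-approximation formula for a two-sided test at the 5% level with 80% power. The formula is a standard one that we assume here for illustration, the standard error is a hypothetical value, and the function name is our own; it does not reproduce the paper's calculation, which also conditions on baseline and block controls.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Normal-approximation MDE for a two-sided test:
    (z_{1 - alpha/2} + z_{power}) * standard error."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return (z_alpha + z_power) * se


# With se = 1 the function returns the multiplier itself (about 2.80)
multiplier = minimum_detectable_effect(1.0)

# Hypothetical standard error, for illustration only
hypothetical_se = 0.05
mde_at_n = minimum_detectable_effect(hypothetical_se)
# Standard errors scale with 1/sqrt(n), so doubling n divides se by sqrt(2)
mde_at_2n = minimum_detectable_effect(hypothetical_se / sqrt(2))
shrink = 1 - mde_at_2n / mde_at_n

print(f"multiplier = {multiplier:.2f}, MDE shrinks by {shrink:.0%} when n doubles")
```

Under this approximation, doubling the sample reduces the minimum detectable effect by a little under 30%, whatever the starting standard error.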
Of course, the crucial assumption is that $y^v$ is closer to the true behavior. This parallels the "no liars" and "no design effects" assumptions in list experiments. The assumption cannot be tested directly, but can only be argued on context and the quality of the approach.
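For reference, the regression of the survey-validation gap on treatment (equation 2) can be sketched as follows. This is a reconstruction from the surrounding description rather than the typeset equation: $T_i$ (an indicator for a given treatment arm, with one such term per arm in practice), $\alpha_{b(i)}$ (block fixed effects), and $\varepsilon_i$ are notation assumed here, and any baseline controls are omitted for brevity. Appendix F.1 gives the exact specification.

```latex
\begin{equation}
y^s_i - y^v_i = \beta_0 + \beta_1 T_i + \alpha_{b(i)} + \varepsilon_i
\tag{2}
\end{equation}
```

Under this form, $\beta_1$ is the difference between treated and control men in average survey-minus-validation reporting, which is why its sign indicates whether treatment shifted reporting up or down relative to the control group.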
We can also let misreporting vary by whether validation confirmed the behavior (equation 3). Equation 2 is simply the special case where $\tilde\beta_2 = 1$ and $\tilde\beta_3 = 0$. 59 We are mainly interested in whether $\tilde\beta_1 = 0$ and $\tilde\beta_1 + \tilde\beta_3 = 0$. The disadvantage of this more flexible form is statistical power, especially with three treatment arms. 60 We are also interested in correcting for the average bias in survey-based treatment effects, which we get from $\beta_1$ from equation 2. But the more flexible form provides insight into the patterns of measurement error. For instance, if underreporting is concentrated among men who commit crimes and were treated, then $\tilde\beta_1 + \tilde\beta_3 < 0$. 61
58 Our target sample of 297 was the maximum number of interviews we felt qualified validators could manage logistically. We calculated minimum detectable effects (MDEs) using a two-sided hypothesis test with 80% power at a 0.05 significance level, using baseline and block controls when calculating the R-squared statistic. We calculated an MDE for both the 0-2 expenditures index and the 0-4 sensitive behaviors index. The expenditures index had a mean of 0.82 in the survey and an MDE of 0.13 for general over- and underreporting and 0.29 for a treatment effect on misreporting. The sensitive behaviors index had a mean of 1.12 in the survey and an MDE of 0.2 for general over- and under-reporting and 0.36 for any treatment effect on misreporting. We estimate that doubling the sample size would have increased power by about a third.
59 Appendix F.1 derives and interprets these regressions in more detail.
60 With 240 observations in total, each parameter is estimated off of roughly 30 observations, putting us on a steep part of the power curve.
61 If we exclude the block fixed effects used for estimating ITT effect, as in equation 1, then $\tilde\beta_0$ also contains information: if men honestly report crime in the survey then $\tilde\beta_0$ should be close to zero and $\tilde\beta_2$ should be close to 1, while if there is a general desirability bias in the survey, then $\tilde\beta_0 + \tilde\beta_2 < 1$. See Appendix F.2.2 for this analysis. In general, estimates of $\tilde\beta_0$ suggest that sensitive behaviors are 12-15% more likely to be reported in the survey, possibly because of the validator's fidelity to the 2-week recall period and specific definitions, or a general conservatism, but there is no evidence this "survey overreporting" is correlated with treatment, which is the main purpose of the analysis.
Notes: The table reports the degree and direction of bias in our treatment effects. In Figure A, we assume that our measurement error does not vary by whether or not the individual engages in the behavior, which allows for a simple way to use $\beta_2$ to adjust our ITT estimates. In Figure B, we relax this assumption and let the measurement error vary by behavior and treatment arm at the cost of reduced statistical power.
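The more flexible specification (equation 3) can be sketched as below. As with equation 2, this is a reconstruction inferred from the stated special case and the parameter restrictions discussed in the text, not the typeset equation; $T_i$, $\alpha_{b(i)}$, and $\tilde\varepsilon_i$ are notation assumed here.

```latex
\begin{equation}
y^s_i = \tilde\beta_0 + \tilde\beta_1 T_i + \tilde\beta_2\, y^v_i
      + \tilde\beta_3 \,(T_i \times y^v_i) + \alpha_{b(i)} + \tilde\varepsilon_i
\tag{3}
\end{equation}
% Special case: setting \tilde\beta_2 = 1 and \tilde\beta_3 = 0 and moving y^v_i
% to the left-hand side gives
%   y^s_i - y^v_i = \tilde\beta_0 + \tilde\beta_1 T_i + \alpha_{b(i)} + \tilde\varepsilon_i ,
% which has the same form as equation 2.
```

In this form, $\tilde\beta_1$ is the treated-control difference in survey reports among men for whom validation did not confirm the behavior ($y^v_i = 0$), and $\tilde\beta_1 + \tilde\beta_3$ is the corresponding difference among men for whom it did ($y^v_i = 1$).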
Results for sensitive behaviors We estimate equations 2 and 3 in Table 9, including block fixed effects. 62 For sensitive behaviors, almost none of the coefficients on treatment indicators or interactions are statistically significant. We see little evidence of the therapy inducing a desirability bias; if anything, the effects run in the opposite direction.
Indeed, looking at the index of four sensitive measures (Panel (a), Column 5), $\beta_1$ is actually greater than zero for therapy plus cash, implying that the impacts of therapy plus cash are, if anything, larger than the survey data imply. Appendix F.3 displays these updated treatment effects. For example, using survey data alone, the treatment effect (standard error) of therapy and cash on the sensitive behaviors index is -0.4 (0.09), a 36% decrease. The results from Panel (a), Column 5 suggest that the adjusted treatment effect should be -0.516 (0.194), significant at the 1% level.
The results of the more flexible regression in Panel (b), Column 5 show that these averages conceal important heterogeneity. Treated men who we think did not engage in the sensitive behaviors tend to over-report them ($\tilde\beta_1^{Both} > 0$), and treated men engaged in the sensitive behaviors seem to under-report them ($\tilde\beta_1^{Both} + \tilde\beta_3^{Both} < 0$).
Results for expenditures All treatment arms are associated with a roughly 0.3 increase in our proxy for measurement error (Panel (a), Column 8). There is underreporting across all arms, but it is greatest in the control group. This could have implications for one of our main findings, on income. Using survey data, the treatment effect of cash only on the 2-item expenditure index is 0.08 (0.052), which is consistent with the short run increase in consumption we observed among cash recipients. But adjusting for observed measurement error, the adjusted treatment effect is -0.205 (0.143).
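To make the adjustment concrete: if the gap $y^s_i - y^v_i$ differs between treated and control men by $\beta_1$, then the treatment effect on the validated measure equals the survey-based effect minus $\beta_1$. The sketch below assumes that simple subtraction (the exact procedure is in Appendix F.3) and back-computes the $\beta_1$ implied by the point estimates quoted in the text. The implied values are derived here for illustration and are not read from Table 9; the function names are our own.

```python
def adjusted_effect(survey_effect, beta1):
    """Effect on the validated measure implied by the gap regression:
    effect on y_v = effect on y_s - beta1."""
    return survey_effect - beta1


def implied_beta1(survey_effect, adjusted):
    """Invert the relation above to recover the implied beta1."""
    return survey_effect - adjusted


# Point estimates quoted in the text
sensitive_survey, sensitive_adjusted = -0.4, -0.516      # therapy plus cash
expenditure_survey, expenditure_adjusted = 0.08, -0.205  # cash only

sensitive_beta1 = implied_beta1(sensitive_survey, sensitive_adjusted)
expenditure_beta1 = implied_beta1(expenditure_survey, expenditure_adjusted)

# Survey effect relative to the survey mean of 1.12 sensitive behaviors
share_decline = abs(sensitive_survey) / 1.12

print(f"implied beta1 (sensitive)   = {sensitive_beta1:.3f}")
print(f"implied beta1 (expenditure) = {expenditure_beta1:.3f}")
print(f"relative decline            = {share_decline:.0%}")
```

The implied expenditure coefficient of about 0.285 is in line with the roughly 0.3 increase reported across arms, and the survey effect of 0.4 relative to a survey mean of 1.12 appears to correspond to the 36% decrease quoted above.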
Interpretation Our qualitative work suggests two explanations. The men have been members of a subculture where drugs, crime, and gambling are commonplace, and admitting to the behaviors in a survey carries little stigma. Speculatively, therapy may have accustomed men to talking about these behaviors or reduced stigma. As for expenditures, control men may have acted strategically, trying to appear poorer in the hopes they would be eligible for assistance. We discuss implications for our conclusions in the following section.

Lessons from the cash transfer
One lesson is that these supposedly undisciplined men largely invested and saved a grant. Even accounting for the underreporting we see in gambling and other expenditures, little of the grant money seems to have been spent on temptation goods. While Evans and Popova (2015) see the same result in 19 other cash transfer programs, it's striking to see the same with this extreme group. Caution is also warranted, because of the evidence that the control group underreported expenditures. But in the short run, men seem to have used the cash for petty trade, earning returns to capital of at least 26%. 63 There is only weak evidence, however, that criminal activities fell as business income rose. Those who received cash reduced antisocial behaviors only 0.08 standard deviations (not significant) but reduced stealing by about 30% (significant at the 10% level with unadjusted p-values only). The direction of effects, however, is consistent with rural ex-combatants in Liberia, who shifted away from illicit activities when a much more intensive employment program raised their farm productivity (Blattman and Annan, 2015).
Any investments and income gains disappeared within a year, however, in part due to poor property rights protections. The men's homes and neighborhoods were highly insecure. Extrapolating from reports of burglary and theft at each endline (from Table 3), men in our sample experienced a theft or robbery roughly eight times in the year after the grant. While treated men were no more likely to experience theft, they had more to lose, especially their savings and investment in nascent businesses.
Nonetheless, the fact that cash was well-used is important, since concerns about temptation spending restrain political support for cash-based welfare programs. The men received a few months' worth of income, and basic consumption, especially basic shelter and food, improved for about that length of time.
Future research should study how to sustain the economic effects of cash. It may be that helping people relocate to better quality neighborhoods or enhance personal security, or providing the information and means to gain necessary licenses or protection from security forces would reduce expropriation. Alternatively, programs can try to provide crude insurance. It is possible that regular cash transfers would stimulate enterprise development more than the one-time transfer we study (Bianchi and Bobba, 2013;Karlan et al., 2015).

Lessons from behavior change
The interventions have extremely similar impacts on both antisocial behaviors and an index of all noncognitive skill and preference measures, suggesting that noncognitive change, broadly-speaking, was a major source of behavior change. The impact of cash plus therapy on antisocial behaviors (0.31 standard deviations in 2-5 weeks and 0.25 after a year) is mirrored by a change of similar magnitude in noncognitive skills and preferences (0.43 standard deviations in 2-5 weeks and 0.25 after a year, see Figure 1). Likewise, therapy alone significantly affects antisocial behaviors and noncognitive changes within 2-5 weeks but not after a year, and cash alone does not have a significant effect on either outcome in any period.
Among the six noncognitive families we defined, all but self control skills showed a large and (with unadjusted standard errors) statistically significant increase after 2-5 weeks as a result of therapy and cash. Forward-looking time preferences, mental health (particularly positive self-regard), social networks (particularly peer quality and family relationships), and identity/values all show large and robust changes after adjusting for multiple comparisons.
After a year, it is difficult to single out any one noncognitive skill or preference change as robust. Individually, the largest and most precise changes are in forward-looking time preferences and self control skills.
Nonetheless, if we were to merge time preferences and self control skills into a single index of "future orientation", we see some evidence of sustained impacts as a result of therapy plus cash. The combined index increased by .22 standard deviations after 2-5 weeks (adjusted p = .029), and by .26 standard deviations after one year (adjusted p = .068). We must take this result with some caution, partly because we prespecified them as distinct measures, and partly because we do not see robust changes in short term self control. Nonetheless, a change in future orientation would echo the effects of adolescent CBT programs in Chicago that target similar automatic behaviors (Heller et al., 2015).
We see less conclusive evidence on the least standard aspect of the therapy: the focus on changing social identity and values. This family index increased 0.27 standard deviations after 2-5 weeks, significant at the 1% level with conventional standard errors but with an adjusted p = .067. This index change is driven by significant changes in both appearance and anticriminal values. But effects moderate after a year, particularly for appearance (and with the addition of prosocial behaviors), and become less precise.
These estimates contrast with men's qualitative personal accounts, where identity change was paramount. Qualitatively, the changes in appearance, in community regard, and the exposure to new places and situations seem to have been particularly important. So was the identity of the NEPI facilitators, and the fact that they modeled this identity change. This change has a basis in the theory underlying CBT: positive interactions challenged respondents' negative beliefs about themselves, and reinforced their identity as more responsible, mainstream members of society. Possibly identity and values are difficult to measure, and so this remains an important area for further innovation and testing.
In psychology, efficacy trials such as this one are typically followed by further trials that try to identify the "active ingredients", by varying modules and methods. This, plus more investment in measurement, seems like a fruitful area for research.

Understanding the cash-therapy interaction
We did not expect that the effects of therapy would persist only when cash was received as well. Our theory predicted that the two interventions should have a larger effect only if cash raises earnings permanently, which was not the case.
Our qualitative evidence and psychological theory, however, suggest a hypothesis for testing in future trials: that receiving cash was akin to an extension of therapy, in that it provided more time for the men to practice independently and to reinforce their changed skills, identity, and behaviors. The therapy was brief, just eight weeks long. It helped men change their intentions, identity and behavior, and provided almost daily commitment and reinforcement. After eight weeks the men who received therapy alone had to contend with their usual economic and peer pressures. The grant, however, provided some men with the cash they needed to maintain their new identity: to avoid homelessness, to feed themselves, and to continue to dress well. They had no immediate financial need to return to crime.
The men could also do something consistent with their new identity and skills: execute plans for a business. This was a source of practice and reinforcement of their newfound skills and identity. It was also a form of performance, to themselves as well as their family and neighbors, who could see the men engage in legitimate business. Our qualitative interviews also suggested that the cash helped men to survive shocks. In this way, the grant may have parallels to "booster sessions" commonly used in therapy. A small body of experimental research on CBT for aggression or substance abuse indicates that follow-up therapy sessions weeks or months after the intervention improve 12-13-month outcomes (e.g. Lochman, 1992).
Caution is warranted. We cannot reject the hypothesis, for instance, that positive reinforcement from winning a grant was enough to reinforce therapy. In future research, a comparison of extended therapy to shorter therapy plus cash would offer a more direct test.
Nonetheless, high short-run returns to capital and sustained social spillovers suggest that the combination of cash and therapy had promising returns. Since the private returns to the grant were temporary, however, the cost effectiveness rides mostly on the social benefits from roughly one fewer crime per week per person. These social returns are unknown. If they are greater than $20 or $25 per crime, however, the STYL program is a promising investment on the basis of crime reduction alone.

Generalizability
For several reasons this approach has promise beyond Liberia. First, the therapy was adapted from U.S.-based CBT programs, suggesting that adaptability to other contexts is feasible. Second, we kept the intervention low-cost and created a publicly-available manual, curriculum, and training guidelines to ease adaptation and replication. Third, with time it should be possible to develop qualified and effective facilitators in other countries, not least because there are established methods for training counselors in CBT; general levels of education (and the number of social workers) are greater in most other countries; and new facilitators should emerge among graduates of the program, as with STYL.
The theory and results are also strikingly consistent with comparable U.S. programs and best practice. The attention to noncognitive skill change and social identity, the targeting of the highest-risk men, as well as the non-residential nature of the therapy, correspond closely to best practice in criminal rehabilitation in U.S. correctional institutions (Andrews et al., 1990; Lipsey, 2009). The 40-50% falls in antisocial behaviors we observe are similar in proportion to the falls in arrests documented in Tennessee and Chicago (Little et al., 1994; Heller et al., 2015). Moreover, as in Chicago, the effects of therapy alone were temporary.
Other U.S. work suggests that employment can be complementary to social and emotional counseling (Heller, 2014). In low-income countries, however, where most employment programs will involve self-employment, property security and risk are important scope considerations. Cash transfers in other poor countries have generally led to higher and more persistent incomes, in part because the gains are not stolen. So the STYL program could arguably work even better in places with more secure property rights.
There are limits to generalizability, of course. For instance, there were no gangs or armed groups vying for men in our sample. CBT-based approaches may be most effective against disorganized, impulsive crime and violence rather than organized crime. There is also selection onto the street, and a country which has experienced many negative shocks (such as Liberia) might have more high-potential young men who need only a little help to regress to the mean. On the other hand, our evidence from dropouts suggests that the most antisocial men stay, and the program is most effective with them. These limits are speculative without further testing, however, and replication and experimentation seem more than warranted given the results of these efficacy trials in Liberia, Chicago, and elsewhere.

Overall, therefore, there is minor imbalance. We control for all baseline covariates in all treatment effects regressions in the paper to account for this.

A.3 Tracking and attrition
We achieved tracking rates of roughly 93% over a year. 2 Given that this was such a transient population, we took special measures to minimize attrition.
Tracking to reduce attrition. At baseline we were clear about our desire to stay in touch. We took photos and signature samples, and collected as many as ten different ways to contact each respondent. We documented contact information for each respondent, including all the places they said they sometimes stay, plus contact information for the network of people around them who have a more stable location. Respondents were often on the run from the police or other people, and so their contacts might be uncomfortable speaking to enumerators and revealing the respondent's location. Thus, after the baseline survey, we asked respondents to use the enumerator's phone to call their most stable contact and introduce the enumerator and study and give permission.
At each endline, enumerators would typically start with the phone numbers of the respondent or their various contacts and try to arrange an appointment. Contacts received no financial incentive. If calls failed, enumerators would begin visiting the various locations listed. A slight majority of respondents were found within a few hours. In other cases, all leads were cold, and more extensive sleuthing and asking around the neighborhood were required. If someone had traveled or moved far away, enumerators either waited until they returned or traveled across the country to find them in person.
On the upper tail, it could take three to four days of physical searching to find the hardest-to-locate people. Enumerators only stopped searching when all possible leads had been exhausted.
Response rates. Table A.4 lists survey response rates by treatment group and survey wave (pooling the 2- and 5-week surveys, and pooling the 11- and 13-month surveys). It also reports the p-value from a t-test of the difference between the response rate in each treatment group and the control group. None of the differences are statistically significant, and all are within about a percentage point of the control group response rate. The control group response rate is a tiny bit lower in the 12-13-month surveys and a tiny bit higher in the short-run ones. None of these comparisons, however, controls for covariates or strata fixed effects, as the next table does.
Correlates of attrition and compliance. We analyze the correlates of attrition in Columns 1 and 2 of Table A.5, which reports an OLS regression of an indicator for attrition on selected baseline covariates. 3 There are no significant differences in attrition by treatment group, substantively or statistically. Those who attrit are slightly wealthier and have slightly poorer mental health. In all, the treatment indicators and covariates are jointly significant at p = 0.047, so attrition is not ignorable. This is one reason we control for covariates in all treatment effects regressions.

Similarly, in the US, researchers were able to reach 98% of the Perry Pre-school children at age 19 and 95% at age 27. One reason is that a small sample is easier to track intensively. Another reason is that enumerator wages are lower in Liberia than in the U.S., which makes intensive sleuthing and tracking affordable.

3 We do so to reduce collinearity and thus ease interpretation. Results with full covariates draw similar conclusions.
Notes: Survey response rates are calculated as the difference between the total number of respondents at baseline and the number of respondents "unfound" at each endline, all divided by the number of respondents at baseline. Here, "unfound" refers to both respondents we could not locate and those we did locate but who chose not to participate in the survey.
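The response-rate definition in these notes, and the comparison of each treatment group against the control group, can be sketched as follows. This is a minimal illustration, not the authors' code: the counts are hypothetical, the function names are ours, and the test shown is a large-sample normal approximation for a difference in two proportions rather than the exact t-test reported in Table A.4.

```python
from math import sqrt
from statistics import NormalDist


def response_rate(n_baseline: int, n_unfound: int) -> float:
    """(baseline respondents - unfound respondents) / baseline respondents."""
    return (n_baseline - n_unfound) / n_baseline


def diff_in_rates_pvalue(found_a: int, n_a: int, found_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in two response rates.

    Assumption: large-sample z approximation with unpooled variance.
    It is undefined if both groups have a 0% or 100% response rate.
    """
    p_a, p_b = found_a / n_a, found_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts, chosen only to resemble a ~93% tracking rate.
control_rate = response_rate(n_baseline=250, n_unfound=18)
treated_rate = response_rate(n_baseline=250, n_unfound=16)
p_value = diff_in_rates_pvalue(250 - 16, 250, 250 - 18, 250)
print(f"control={control_rate:.3f} treated={treated_rate:.3f} p={p_value:.2f}")
```

As the text notes, a raw comparison of this kind does not adjust for covariates or strata fixed effects.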
A.4 Treatment compliance
Figure A.1 displays the distribution of class attendance for those assigned to therapy. NEPI did not collect attendance data during the first week (three sessions), so for simplicity we assume that all participants who attended at least one session after week one also attended the first three sessions.
We use two definitions of compliance. Our first measure is defined as "attending at least 8 days of therapy", or about three of the eight weeks. Our second measure is defined as attending at least 80% of sessions (16 classes plus the 3 in the first week).
We analyze the correlates of compliance in Columns 3 through 8 of Table A.5. Being assigned to cash in addition to therapy did not affect the likelihood of attending therapy, which is to be expected since the cash grants were not known to participants until after therapy. The main correlates of compliance in the first three weeks are higher education, higher initial antisocial behaviors, and higher self-control skills. The main correlates of attending at least 80% of the sessions are higher education, better mental health, and patience in game play. Higher initial antisocial behaviors and higher self-control skills are no longer so relevant.
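The two compliance definitions, together with the first-week imputation rule, can be written down as a short sketch. Two points are assumptions of ours rather than statements in the text: that the 8-day threshold is applied to attendance including the three imputed first-week sessions, and that men with no recorded attendance after week one are counted as attending zero sessions. All names are hypothetical.

```python
FIRST_WEEK_SESSIONS = 3                      # attendance not recorded in week one
MIN_DAYS = 8                                 # definition 1: at least 8 days of therapy
MIN_SESSIONS_80 = 16 + FIRST_WEEK_SESSIONS   # definition 2: 16 classes plus the 3 in week one


def total_attendance(recorded_after_week_one: int) -> int:
    """Impute the first three sessions for anyone seen at least once later."""
    if recorded_after_week_one > 0:
        return recorded_after_week_one + FIRST_WEEK_SESSIONS
    return 0  # assumption: no later attendance means no imputed attendance


def compliance(recorded_after_week_one: int) -> tuple[bool, bool]:
    """Return (attended at least 8 days, attended at least 80% of sessions)."""
    total = total_attendance(recorded_after_week_one)
    return total >= MIN_DAYS, total >= MIN_SESSIONS_80


for recorded in (0, 5, 15, 16):
    print(recorded, compliance(recorded))
```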

B.1 Power calculations
After completing the pilot, we decided on a target sample of 1,000. This target was based on maximum program capacity and financial constraints. Based on the pilot, we estimated that the Minimum Detectable Effect for the full 1,000 (with a quarter for each treatment) would be a 0.12 standard deviation change in a standardized dependent variable for a two-tailed hypothesis test with statistical significance of 0.05, statistical power of 0.80, an intra-cluster correlation of 0.25, and the proportion of individual variance explained by covariates as 0.10.
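For readers who want to see how these inputs enter a minimum detectable effect, the sketch below uses a standard textbook formula for a clustered design with individual-level covariates. It is an illustration under assumptions, not the authors' calculation: the text does not report the number or size of clusters, so the values in the example are hypothetical, and the sketch does not attempt to reproduce the 0.12 figure.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_clusters: int, cluster_size: int, icc: float,
                              r2_individual: float = 0.0,
                              share_treated: float = 0.5,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """MDE in standard deviation units for a two-tailed test.

    Assumed formula: multiplier * sqrt(between-cluster + within-cluster variance),
    where covariates reduce only the individual-level (within) component.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    allocation = share_treated * (1 - share_treated) * n_clusters
    between = icc / allocation
    within = (1 - icc) * (1 - r2_individual) / (allocation * cluster_size)
    return multiplier * sqrt(between + within)


# Hypothetical layout: 100 clusters of 10 people, with the ICC and R-squared from the text.
print(round(minimum_detectable_effect(100, 10, icc=0.25, r2_individual=0.10), 3))
```

The function makes the qualitative point explicit: a higher intra-cluster correlation raises the detectable effect, while covariates that explain individual variance lower it.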

B.2 Randomization protocols
For the therapy and cash randomization, men in each block took turns drawing colored chips from an opaque fabric bag. In general, the bag was shaken and then the subject was instructed to turn away and to place one arm into the bag and to draw out a single chip. The color was confirmed and recorded.
In the cash instance, men were randomized in roughly equal sized blocks of about 50 people. Each man was invited into a private room to draw to ensure privacy and safety. This procedure was explained to the entire group, and all chips were placed into the bag in front of everyone. Then the bag was taken into a private room, and participants were called into the room individually. If they wished, they could inspect the bag to confirm that there were still chips of both colors inside. After everyone present had drawn, staff drew the remaining chips for the no-shows.
In the case of therapy, men were randomized each day, according to how many were recruited and surveyed in that neighborhood. This led to blocks ranging in size from 1 to 20, though the vast majority of blocks contained roughly 7 to 15 people. The draw was not as private as the cash draw, and men observed the outcomes of others drawing at the same time. Those who lost in the therapy randomization were offered a free meal along with the opportunity to discuss their situation with someone, and they were transported to a location of their choosing. A small percentage of the men were visibly upset and refused to engage at this point.
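The chip draw described above is a form of block randomization without replacement, which can be simulated in a few lines. This is a sketch for intuition only: the block size, the number of treatment chips, the seed, and the identifiers are hypothetical, and the text does not state the exact chip counts used in each block.

```python
import random


def draw_chips(block: list[str], n_treatment: int, rng: random.Random) -> dict[str, str]:
    """Assign everyone in one block by drawing chips from a shaken bag.

    Each person draws one chip without replacement, so the number of
    treatment assignments in the block is fixed in advance.
    """
    if not 0 <= n_treatment <= len(block):
        raise ValueError("n_treatment must be between 0 and the block size")
    bag = ["treatment"] * n_treatment + ["control"] * (len(block) - n_treatment)
    rng.shuffle(bag)  # shaking the bag
    return {person: bag.pop() for person in block}


rng = random.Random(2015)  # fixed seed so the illustration is repeatable
present = [f"man_{i:02d}" for i in range(1, 9)]
no_shows = ["man_09", "man_10"]  # staff draw the remaining chips for these men
assignment = draw_chips(present + no_shows, n_treatment=5, rng=rng)
print(assignment)
```

Because the bag holds a fixed number of chips of each color, the share treated within a block is set by design; only who receives which chip is random.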

B.3 Therapy
NEPI's standard curriculum tended to be longer and broader than the two noncognitive skill and value changes that we study. For the purposes of this study, we worked with NEPI to streamline and focus the traditional STYL curriculum in two ways. First, we further grounded the approach in terms of CBT, emphasizing more practice over lectures. In general these modifications were quite modest, since the program already incorporated these techniques. Second, we asked NEPI to exclude modules not relevant to their theories of change: interpersonal skills; conflict resolution skills; dealing with war trauma and PTSD; career counseling; and community leadership.
To clarify and validate NEPI's curriculum, a Liberian qualitative researcher acted as a participant observer throughout one of the two Phase 1 pilot classes. Based on NEPI's training materials, our analysis of the theoretical grounding of the therapy, and this participant observation, we and NEPI developed a full program manual for the intervention. 4 The manual details the history and theory of the interventions, guidelines for recruitment of trainers and participants, training suggestions, the full curriculum, and guidelines for out-of-classroom engagement.

Curriculum
The curriculum has eleven main modules, which we present here with some examples of goals and activities:
1. Transformation. A tenet of CBT is that the therapist explicitly sets goals with participants and lays out the therapeutic strategy. This module introduces the concept of transformation, its significance, and the processes involved in transforming oneself.
• The men are introduced to the techniques that will be used (role playing, lectures, storytelling, etc.), homework assignments, home visits, and the reasons for each.
• The module also introduces ground rules for behavior, in terms of being respectful, practicing listening, waiting your turn, etc. The men do not necessarily have these skills, or haven't exercised them in some time, and learning to abide by these behavioral rules is an important part of the therapy.
• Facilitators also begin to teach the songs, slogans, and call-and-response that will be used repeatedly throughout the course. These songs and slogans serve as important reminders of rules of behavior for the men to follow. They also can be used to bring order to a disorderly or inattentive group.
• There are symbolic rituals to indicate a break in their lives. For example, the men write their "street names" and aliases on sheets of paper and they are burned together.
2. Substance Abuse. This module defines substance abuse and discusses its ill effects, as well as steps for moving past it. It explicitly encourages participants to reduce their consumption of drugs, alcohol, and tobacco. They are cautioned against cutting drugs entirely, to avoid withdrawal problems.

4 Available at http://chrisblattman.com/documents/policy/2015.STYL.Program.Manual.pdf.

• Men talk through and list reasons that they use drugs. The idea is to make them consciously aware of the reasons for their own behavior and risk factors in their lives. They also talk through the ill effects. Men talk publicly about ways in which drugs have adversely impacted their own lives, sharing experiences.
• Men role play situations where they could be pressured to use drugs and practice strategies for saying no.
• An outside speaker comes to the classroom, often a former graduate of the therapy, to talk about their experiences with drugs and what it did to their lives, as well as what strategies they used to emerge. Men discuss strategies they can use in their own lives. They practice some of these as homework and come back to discuss their experiences with the class.
3. Body Cleanliness. The module explores the health, psychological, and social benefits of maintaining body cleanliness. Participants are encouraged to change behaviors that alienate them, and to present a public image (such as hair and dress) that promotes positive social interactions with community members.
• Body uncleanliness is defined and highlighted as a problem mainly by getting men to discuss and volunteer their own opinions and experiences in a group.
• The facilitators bring in a hair cutter, an electric shaver, and a set of nail clippers for men to clean up if they like.

4. Garbage/Dirt Control. An extension of the previous module, this module highlights the importance of cleanliness in participants' environments, and the ill effects of living in a dirty environment. It aims to help them maintain clean, healthy, and orderly living spaces.
• Facilitators present the men with pictures of dirty and clean homes, businesses, and streets, and men point out different risks and unclean elements, and discuss the consequences.
• Men identify ways they can improve cleanliness where they live (e.g. get a garbage can) and set and execute these plans as homework, to be followed up with home visits.
5. Anger Management. This module discusses the causes and effects of anger and the problems with acting out in ways participants may later regret. It also provides participants with tools to manage their anger.
• Men discuss the signs and indications of anger, in themselves and others, through discussion and role playing. Facilitators show pictures of angry faces and situations, and men interpret them. The aim is to make them cognizant of these signs.
• Men discuss the causes of anger, and learn to link some of their actions to other people's anger.
• Men discuss and role play the negative consequences of aggression and violence, or share experiences from their own life.
• Men practice nonaggressive responses to angry confrontations in class, such as learning to distract or calm oneself (walking away, doing other activities, starting discussions and de-escalating, or practicing breathing techniques). Men practice these techniques as homework.
6. Self-Esteem. This module emphasizes the need for participants to discover themselves in order to begin the path to recovery. This module links their behavioral changes to respect, pride, and confidence.
• The facilitators try to link poor self-image directly to many of the behaviors they have discouraged in previous modules, both as a cause and consequence.
• Men discuss ways they can build self esteem, make plans, and execute them as homework.
• Facilitators work with men to identify worthwhile skills and characteristics they hold that are worthy of others' respect.
• Men practice shopping for goods in a supermarket or shop as one of the first exposure activities. They work through successes and failures as a group and try again, sometimes with the help of a facilitator.

7. Planning. This module reviews the steps and components necessary for planning and implementation. The goal of this module is to build participants' capacity to develop short- and long-term plans and understand the processes involved in executing these plans.
• Planning skills are commonly taught in CBT programs as a method to build new skills. At its most basic, this involves helping the men break down larger plans into smaller steps and helping them work through ways to accomplish those steps, positively reinforcing successes and helping them process challenges and setbacks, often as a group. Men give examples and discuss them together. Another example: Small groups of men are tasked with organizing activities, such as a football match. The larger group listens to the different plans and critiques them.
• As homework assignments, initially men are tasked with simple tasks (create a short term survival plan for feeding yourself or your family), and then more complex tasks (such as a business plan or home garden).
• Men are also tasked with identifying a successful friend or family member and determining what steps led to their success. A motivational speaker (usually a past graduate) is also invited to talk about the steps involved in their success and their learnings and setbacks.
8. Goal Setting. The module outlines tools participants can use to develop goals, objectives, and indicators for measuring success in their own lives.
• Participants are taught what short and long term goals are (through discussion and examples) and how to set reasonable short-and long-term goals (such as feeding their family, or starting a garden).
• First participants practice setting goals and making plans, and then the larger group discusses and critiques them. Participants then set their own small, short term goals (e.g. changing a behavior, reconciling with a family member, or saving a certain amount this week) and execute these as homework, processing successes and failures as a group.
• Participants discuss the characteristics of good goals (e.g. achievable, measurable, timebound) and revise goals and plans. They are given poor goals as a group and practice turning them into better goals. Another motivational speaker is used to discuss the role of goal setting in their own life.
9. Money Business. This module stresses the importance of engaging in positive spending habits and appropriately managing money. Impulsive spending habits are emphasized. Participants are taught to make plans and prioritize their needs and wants prior to spending their money.
• Men engage in exercises to track their own recent spending to see where their money has gone. They discuss the use and misuse of their own money. As a group they discuss regrets and bad decisions and work through the negative consequences. These are illustrated dramatically through role-playing and skits, followed by discussion.
• Later discussion, role playing and skits focus on techniques for resisting peer pressure and temptation. There is also testimony from a motivational speaker, usually a past graduate of the program.
10. Money Saving. The module introduces participants to various saving options and encourages them to reflect on the most suitable saving method for their lives. They practice interactions in informal and formal financial institutions.
• Men discuss the reasons for and advantages of saving and it is explicitly linked to positive self image and esteem in the community. There is another motivational speaker.
• Men learn techniques for saving safely at home without formal institutions. They learn to set and execute saving plans, using their goal setting and planning skills.
• Homework assignments involve saving money they would have otherwise used on things they regret (identified in the previous module). Homework also involves trips to the bank and informal lenders. Prior to these assignments they meet and role play in groups, and their strategies are discussed and critiqued by the larger group. There is also a focus on appropriate presentation and image in these outings.
11. Challenges and Setbacks. The module explores potential challenges and setbacks they will face and has them practice positive coping mechanisms needed to effectively overcome them. Challenges and setbacks are framed as a test of one's maturity, potential, and abilities, and an opportunity for improvement.

A note on the approach
Note that in the United States, cognitive behavioral approaches to reducing violence are conscious of the fact that the values and behaviors they encourage could be maladaptive in some situations, since being violent can also protect people. As a result, these therapies teach people to judge when and where to use aggression. 5 NEPI, in designing the STYL therapy, did not consider the need for educating men on such contingent, adaptive behavior. Rather, their philosophy was that fighting back or retaliating in this context would lead to cycles of violence and an escalation of future risk, not a decrease. NEPI also emphasized that it was important for the men who passed through STYL to demonstrate to the community that they were not aggressors or violent, in order to maintain their new image, and that retaliation could be counterproductive there.

B.4 Cash grants
We contracted the international non-profit Global Communities (GC) to conduct the registration and cash distribution, as well as oversee NEPI's financial management and implementation schedule. We did so for several reasons: 1. To keep the therapy and the research teams distinct from cash distribution; 2. To coordinate registration and implementation of the two activities; 3. To relieve the research team of project and financial management of the interventions; and 4. To make the intervention as close as possible to a real-world, replicable intervention by other non-profit or state organizations.
For safety, GC developed a highly structured system of cash distribution. GC staff held cash in a car that moved around the neighborhood, to avoid theft. A lottery team with the men gave grant winners a voucher, and put them on a motorbike taxi that was then directed to the street corner where the car with the cash awaited. They were told to approach the car (which had an identifying mark such as a red bag on the dash), hand over their voucher, and receive their cash. The car would then move to a new corner, whose location would be relayed by mobile phone, and the process would repeat.
Anyone who was assigned to the cash treatment but was not present on the day of disbursal was still eligible for the grant. GC attempted to locate them for up to three weeks afterward, and generally succeeded.

C Formal theoretical model
Our model is rooted in previous models of occupational choice with self-employment (Fafchamps et al., 2014; Udry, 2010; Blattman et al., 2014), but adapted to have a criminal sector as in the broad class of models described by Draca and Machin (2015). We employed a similar model in Blattman and Annan (2015).

C.1 Setup
We model an individual's choice between legitimate business and illicit activities under different conditions (with and without time inconsistency, and with and without financial market imperfections) and assess the predictions for a number of common labor market and crime-reducing interventions: greater punishment, increasing productivity in legitimate business (e.g. through technology or skills improvement), cash or capital transfers, and interventions that shape preferences, either time preferences or personal preferences against illegal behavior.
We use $L^b$ and $L^c$ to denote time spent in legitimate activities (such as petty business) and illegitimate activities (such as crime). Legitimate business produces revenue according to production function $F(\theta, L^b_t, K_t)$, where $\theta$ is productivity or individual ability and $K$ is accumulated capital used in business. A person's decision to participate in illegal activity is motivated by the potential gains and costs from such activity. Gains include the expected illegitimate payoff per hour spent in illegal activities, $w$. Costs include the possibility of apprehension and conviction, which occurs with probability $\rho$ and implies a penalty $f L^c_{t-1}$. Thus the penalty for criminal behavior is a linear function of hours spent in criminal activities in the previous period. 6 The individual's total expected earnings from legitimate and illegitimate activities are $y_t \equiv F(\theta, L^b_t, K_t) + w_t L^c_t - \rho f L^c_{t-1}$. In addition to investing in business, the individual can also invest or borrow through a riskless asset with constant returns $1 + r$. In each period $t$, the individual decides how much to invest for the next period, $a_{t+1}$, and earns interest $r a_t$ on the previous period's investments.
Individuals have utility function $U(c, l, \sigma L^c)$, where $c$ denotes consumption and $l$ denotes time for leisure. We also allow for individuals to have direct disutility from engaging in crime, as measured by $\sigma L^c$, where $\sigma > 0$ implies that illicit work induces some internal penalty such as shame, though in principle it could also reflect social penalties such as a loss of esteem or exclusion from peers and other social networks. We make the standard assumption that $U_c \geq 0$ and $U_l \geq 0$. We allow for the individual to have quasi-hyperbolic $(\beta, \delta)$ preferences.
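As a reminder of what quasi-hyperbolic $(\beta, \delta)$ preferences imply, the sketch below computes the weights a person places on utility in each future period, viewed from today. It uses the standard textbook form (weight 1 on the present and $\beta \delta^t$ on period $t \geq 1$); the parameter values and the function name are hypothetical.

```python
def quasi_hyperbolic_weights(beta: float, delta: float, horizon: int) -> list[float]:
    """Weights on utility in periods 0..horizon-1, as seen from period 0."""
    return [1.0 if t == 0 else beta * delta ** t for t in range(horizon)]


consistent = quasi_hyperbolic_weights(beta=1.0, delta=0.9, horizon=4)
present_biased = quasi_hyperbolic_weights(beta=0.5, delta=0.9, horizon=4)

# With beta = 1 the trade-off between any two adjacent periods is always delta.
# With beta < 1 the trade-off between today and tomorrow (beta * delta) is
# harsher than between any two later periods (delta), which is what makes
# plans time inconsistent.
print(consistent)
print(present_biased)
print(present_biased[1] / present_biased[0], present_biased[2] / present_biased[1])
```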
We first consider the case without any uncertainty. The individual's problem is:

Without credit constraints
Without time inconsistency ($\beta = 1$) or credit constraints, the optimality conditions are:

6 One reason for this modeling choice is that we want to explore the role that quasi-hyperbolic preferences play in the decision to commit crimes when the punishment is in the future, not the present.
7 For ease of analysis, we also assume that the marginal return to capital is infinite for the first unit of capital invested in business, and that as long as there is positive capital input, the marginal product of labor for the first unit of labor is infinite, i.e. $\lim_{K \to 0} F_K(\theta, L^b, K) = +\infty$, and $\lim_{L^b \to 0} F_{L^b}(\theta, L^b, K) = +\infty$ as long as $K > 0$. This assumption guarantees that investments and hours in business will always be positive.
where for ease of notation, we use $U(t)$ to denote $U(c_t, l_t, \sigma L^c_t)$ and $F(t)$ to denote $F(\theta, L^b_t, K_t)$. Since we modeled crime punishment as a potential reduction in future wages, the risk-neutral individual will view crime as an occupation with a discounted wage $w_t - \frac{\rho f}{1+r}$. To find the marginal conditions for engaging in each sector, we first consider the case where illicit activity is not feasible. This would arise naturally if the probability of apprehension is high enough and punishment is heavy enough that $w \leq \frac{\rho f}{1+r}$. In this case the decision to engage in business depends on productivity $\theta$, wealth level and the returns on other financial assets $r$. We use $c^{ba}$, $L^{ba}$ and $K^{ba}$ to denote consumption, labor and capital level in this scenario. Each period $t$, the individual chooses $L^{ba}_t$ to satisfy

which says expected returns from crime are higher than the highest possible marginal rate of substitution between leisure and consumption the individual can achieve without engaging in crime.
Since $-U_{\sigma L^c}/U_c > 0$, a rise in $\sigma$ means more people will drop out of crime.
If condition (6) is satisfied and if $K_t > 0$, the individual then chooses $L^b_t$ and $L^c_t$ such that the marginal product of labor in business equals his expected marginal gains from crime, which also equals his marginal rate of substitution between leisure and consumption: i.e. conditions (1) and (2) will be satisfied. Notice that $L^c_t$ may not always be positive. The individual will not engage in crime if any of the following three conditions holds: $w_t$ is very low relative to the probability of apprehension $\rho$ and punishment $f$; productivity in business $\theta$ is very high; or the degree of aversion to crime $\sigma$ is very high.
Capital investment and hours in business will satisfy condition (3). Notice that $w$, $\rho$ and $f$ will not affect returns to investment in business.
Interventions that increase the disutility of crime or the size or probability of punishment will reduce time devoted to crime, but will have no effect on returns in business. 8 However, interventions that increase business productivity $\theta$ will not only induce more investment in business, but also reduce involvement in crime. In other words, $\partial L^c/\partial\sigma < 0$, $\partial L^b/\partial\sigma$ is ambiguous, $\partial L^c/\partial\theta < 0$ and $\partial L^b/\partial\theta > 0$. Finally, interventions that provide capital or liquid financial assets, such as a cash windfall, will not affect occupational choice at all, since the individual will already be working at his optimal level in both sectors. The windfall will simply be consumed and saved.
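The discounted crime wage used in this subsection, the hourly payoff $w$ net of the expected penalty $\rho f$ discounted one period at rate $r$, can be illustrated numerically. The parameter values below are hypothetical and serve only to show how a higher probability of apprehension or a heavier penalty can push the net wage to zero or below, the case in which illicit activity is "not feasible".

```python
def discounted_crime_wage(w: float, rho: float, f: float, r: float) -> float:
    """Net hourly return to crime: w - rho * f / (1 + r).

    The penalty f is paid next period with probability rho, so it is
    discounted by the riskless return 1 + r.
    """
    return w - rho * f / (1 + r)


def crime_is_feasible(w: float, rho: float, f: float, r: float) -> bool:
    """Crime is worth considering only if the discounted wage is positive."""
    return discounted_crime_wage(w, rho, f, r) > 0


# Hypothetical values: the same penalty and detection risk, two different payoffs.
print(discounted_crime_wage(w=5.0, rho=0.2, f=12.0, r=0.1))
print(crime_is_feasible(w=5.0, rho=0.2, f=12.0, r=0.1))
print(crime_is_feasible(w=1.0, rho=0.2, f=12.0, r=0.1))
```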

With credit constraints
In this section we consider the model with a simple credit constraint in the form of $a_t \ge 0$: individuals are unable to borrow in any period. We focus our attention on individuals whose initial $a_0$ is low enough that, at some point in their lives, the credit constraint is binding. Credit constraints will affect optimal conditions (2) and (3), the conditions for hours in crime and for capital investment. The level of investment in business may change depending on the shape of the utility and production functions, but the returns to investment will not change.
For the impatient individuals with $\frac{1}{\delta} > 1+r$, the optimal level of capital investment will be lower than in the baseline case because of the credit constraint. They also have higher expected returns from crime than in the baseline case, because the low level of business investment also forces them to put a higher discount rate on potential future punishment from crime.
Critical condition (6) changes as well. Credit constraints induce more individuals who would otherwise not engage in crime to commit crime. For the impatient individuals, credit constraints increase their hours in crime and reduce their capital investments and hours in business activities.
Interventions that ease the credit constraint, including cash windfalls, will induce more investment in business and reduce involvement in crime. As in the baseline case, $\partial L^c/\partial\sigma < 0$, $\partial L^b/\partial\sigma$ is ambiguous, $\partial L^c/\partial\theta < 0$ and $\partial L^b/\partial\theta > 0$; however, the magnitude of the effects of a change in $\sigma$ or $\theta$ will be greater than in the baseline case. The magnitudes also increase with the degree of impatience: $\frac{d\,|\partial L^c/\partial\sigma|}{d\delta} < 0$, $\frac{d\,|\partial L^c/\partial\theta|}{d\delta} < 0$ and $\frac{d\,|\partial L^b/\partial\theta|}{d\delta} < 0$ (notice that the lower the value of $\delta$, the more impatient the individual).

Without credit constraints
Time-inconsistent individuals ($\beta < 1$) will be more reckless in the present. Intuitively, the smaller $\beta$ is, the more individuals want to enjoy higher consumption today at the expense of future consumption, which means they will borrow more, save less, invest less in business and/or engage more in criminal activities. However, as long as there is a perfect financial market, no one will change their business or criminal activities in order to consume more today; they will simply borrow more (or save less) today through the financial market.
In terms of optimal conditions, in the absence of any credit constraint, the only condition that changes is equation (4), which becomes where $W_t$ denotes total wealth at time $t$ and $c^P_{t+1}$ denotes the individual's predicted future decision about $c_{t+1}$ at time $t$. For the sophisticates $c^P_{t+1} = c_{t+1}$, while for the naifs $c^P_{t+1} > c_{t+1}$. Compared with the baseline case, the discount factor $\delta$ is replaced by the effective discount factor $\frac{\partial c_{t+1}}{\partial W_{t+1}}\beta\delta + \left(1 - \frac{\partial c_{t+1}}{\partial W_{t+1}}\right)\delta$, a weighted average of the short-run and long-run discount factors $\beta\delta$ and $\delta$ where the weights are the next period marginal propensity to consume out of total wealth. Notice that neither condition (2) nor condition (3) changes, as long as we have no credit constraints. Compared with the baseline, time inconsistency alone will not affect criminal activities or business investment. It would only change the level of savings or debts.
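The weighted average in the effective discount factor can be illustrated numerically. The sketch below is only an illustration: the function name and the parameter values are hypothetical, and it assumes the marginal propensity to consume lies between 0 and 1.

```python
def effective_discount_factor(beta: float, delta: float, mpc_next: float) -> float:
    """Weighted average of the short-run (beta * delta) and long-run (delta) factors.

    mpc_next is next period's marginal propensity to consume out of total
    wealth; it is the weight placed on the short-run factor.
    """
    if not 0.0 <= mpc_next <= 1.0:
        raise ValueError("mpc_next is assumed to lie in [0, 1]")
    return mpc_next * beta * delta + (1.0 - mpc_next) * delta


# Hypothetical parameter values, chosen only for illustration.
time_consistent = effective_discount_factor(beta=1.0, delta=0.95, mpc_next=0.4)
present_biased = effective_discount_factor(beta=0.7, delta=0.95, mpc_next=0.4)
print(time_consistent, present_biased)
```

With $\beta = 1$ the expression collapses to $\delta$, and any $\beta < 1$ lowers the effective factor by more the larger the marginal propensity to consume.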
In this case, interventions that aim to correct time inconsistency will have no effect on either business investment or criminal activities, but will have an effect on consumption, savings and income.

With credit constraints
Compared with the baseline case, $\tau > 1 + r$ as long as an individual is credit constrained (i.e. has no savings). The level of $\tau$ will be higher for the sophisticates than for the naifs. However, regardless of their level of sophistication (i.e. the way individuals set their expectations for their future behavior), we know for sure that $\tau > \frac{1}{\delta}$, and the smaller $\beta$ is (i.e. the more time inconsistent), the higher $\tau$ will be. Compared to the time-consistent credit constrained case, fewer individuals will invest in business, more individuals will engage in crime, business investment levels will be lower, and hours in crime will be higher for everyone. The difference increases with the level of inconsistency (i.e. decreases with $\beta$).
Interventions that improve time consistency will shift people away from crime towards business. So will increasing the disutility of crime (though, as in the case without time inconsistency, while $\partial L^c/\partial\sigma < 0$, $\partial L^b/\partial\sigma$ is ambiguous). Increasing business productivity will have similar effects as before: $\partial L^c/\partial\theta < 0$ and $\partial L^b/\partial\theta > 0$. In all of these cases, however, the magnitudes of the effects of a change in $\sigma$ or $\theta$ will be greater than under time consistency, and the magnitudes also increase with both the degree of impatience and the degree of time inconsistency. Notice that the lower the value of $\beta$, the more time inconsistent the individual is, and similarly, the lower the value of $\delta$, the more impatient the individual is.

C.4 Introducing uncertainty and risk aversion
Three potential sources of risk are uncertainties in business productivity θ, wages from criminal activities w, and the potential punishment after apprehension f . We assume that decisions on business investment and hours in both sectors are made before risks are realized, and that θ, w and f follow independent stochastic processes.
With uncertainties in both the business and illicit sector, business investment and hours in both sectors depend on the variance of returns in both sectors and the level of initial wealth $a_0$. If both sectors are sufficiently risky, then those with high levels of wealth $a_0$ will turn away from both activities by reducing $K$, $L^b$ and $L^c$ and investing instead in other riskless assets. $K$, $L^b$ and $L^c$ will all be lower than in the cases without risk. Those with low levels of initial wealth will not be able to live off savings alone, so they will have to invest more in either or both sectors, depending on the relative riskiness of the two sectors. As long as both sectors are similarly risky, $K$, $L^b$ and $L^c$ will all be higher; otherwise, if one of the sectors is less risky than the other, individuals will invest more time in that sector. The share of hours devoted to crime, $\frac{L^c}{L^b + L^c}$, will be lower than in the case without uncertainty if returns to crime are more volatile than business returns. One special case would be if individuals face a significantly positive chance of death after committing any crime. This is the equivalent of saying $f = +\infty$ with strictly positive probability. In this case hours in crime will be reduced to zero as long as the probability of apprehension is positive, $\rho > 0$.
In the presence of risk, interventions in $\theta$ will have greater effects, because an increase in $\theta$ now also makes business relatively less risky. A rise in $\sigma$ will also have a bigger effect than without uncertainty, because risk aversion will reinforce the rise in aversion and further reduce hours in crime.

D Measurement
In this section, we discuss measurement decisions in more detail, including what was and was not specified in the 2012 National Science Foundation (NSF) proposal 1225697 that substitutes for the absence of a pre-analysis plan. 9 Section 4 of the proposal provides a numbered list of hypotheses and primary outcomes, and (roughly) how we planned to operationalize them, especially Sections 4.1 and 4.4. Section 5 expands on measurement approaches, both for these primary outcomes, as well as for control variables and other outcomes of interest. These are the key sections to examine now. Section 4 in particular is the basis for our organization of the current paper. That section and the introduction (Section 1) not only emphasize particular primary outcomes, but also the division into ultimate and intermediary outcomes.
We also report control group means and treatment effects on all of the survey questions that enter an index in the main tables. A note of caution: the standard errors have not been adjusted for multiple hypothesis testing, and so patterns across treatment effects within an index are suggestive only.

9 See http://chrisblattman.com/documents/research/2012.01.13_STYL_NSF_proposal.pdf, where the core hypotheses (and division into ultimate and intermediary outcomes) are outlined in Sections 1 and 4.1, and the operationalization (and measurement) of key outcomes in Section 4.4 of the proposal.

Table D.1 displays treatment effects for all components of our antisocial behaviors index. 10 These are purely illustrative, and we do not adjust standard errors for multiple comparisons.

D.1 Antisocial behaviors
Sections 1 through 4 of the NSF proposal make the primary, ultimate outcomes fairly clear: "poverty" and "violence", where Section 4.1.C defined "violence" as "crime, aggression, and political violence". As discussed in the main paper, only political violence was later dropped because none occurred before endline. We renamed this collection of outcomes "antisocial behaviors," for generality and clarity.
We are not aware of existing scales or measurement tools for Liberia, or even similar populations in sub-Saharan Africa or other low income countries. Thus, in general, our variables grew out of months of field work, qualitative interviews, and survey pre-testing by the authors and their research assistants, in order to understand common offenses and behaviors. Liberians speak a pidgin English and street youth have a slang of their own, and so even where we began with common scales (such as aggressive behaviors) the wording had to undergo extensive translation and testing to make sense. We also added new aggressive behaviors common to the study population and Liberian culture.

Table 3 of the main paper reports all measures of economic performance, and we do not replicate it here. The NSF proposal emphasized "poverty" as one of the two ultimate outcomes of interest, and in Section 4.1 expanded on this to discuss the expected impacts of the therapy on "economic decisionmaking and outcomes", including "levels of business investment and expenditures, savings, income and assets/consumption". The table in that proposal focused on business investments and our three measures of income (consumption, earnings, and asset stock). We look at all these measures of economic performance in a single family index, but would draw the same conclusions if we took a narrower definition of poverty and focused on the income measures alone, or even consumption alone.

Table D.2 displays control means and 12-13-month treatment effects for all subcomponents of our forward-looking time preferences index. The summary index consists of eight equally-weighted components: four measures for patience (δ) and four measures for time inconsistency (β). Components come from incentivized game play, hypothetical trade-offs over time, and survey measures.
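An equally-weighted summary index of this kind can be sketched as follows. This is a minimal illustration, not the code used for the paper: it assumes each component is standardized against the control group, that components are already oriented so that higher values mean the same thing, and that the averaged index is standardized again. The function names and toy data are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscore(values: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Standardize values using the mean and SD of a reference group."""
    mu, sd = mean(reference), stdev(reference)
    return [(v - mu) / sd for v in values]


def summary_index(components: Sequence[Sequence[float]],
                  is_control: Sequence[bool]) -> List[float]:
    """Equally-weighted mean of standardized components, re-standardized.

    Assumption: the control group is the reference for every standardization.
    """
    standardized = []
    for comp in components:
        control_vals = [v for v, c in zip(comp, is_control) if c]
        standardized.append(zscore(comp, control_vals))
    # Equal weights: a simple average across components for each respondent.
    raw = [mean(vals) for vals in zip(*standardized)]
    control_raw = [v for v, c in zip(raw, is_control) if c]
    return zscore(raw, control_raw)


# Toy data: two components, three control and three treated respondents.
toy_components = [[1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 8]]
toy_control = [True, True, True, False, False, False]
toy_index = summary_index(toy_components, toy_control)
print([round(x, 2) for x in toy_index])
```

Under these assumptions the control group has mean 0 and standard deviation 1 on the index, so a treatment coefficient reads directly in control-group standard deviations.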

D.3 Time preferences
The NSF proposal outlined these measures fairly specifically. Section 4.1.A specified our interest in the malleability of present versus future orientation and time inconsistency, and Section 4.4.A operationalized these measures as incentivized intertemporal choice games; hypothetical intertemporal choice games; and self-reported preferences. The main source of ambiguity was that the proposal referred to these measures variously as "discount rates", "present bias" or "forward-looking behavior".
In the end, the survey and incentivized games collected four types of measures, and each one yields a proxy of patience and time inconsistency.

10 To save space, we only display 12-13 month results.
Following the survey, subjects were asked to play a set of "real money games" where they had to make a series of intertemporal choices between money at one point in time versus more money later in time, with some probability of a payout. The average payout was about $3, roughly a day's wages. 11 The first choice was between money now and more money in two weeks; the second was between money in two weeks and more money in four weeks; and finally there was one more question for each of these pairs of delays, with the amounts modified depending on the first answer (i.e. if they chose to wait, then they were asked again but with a lower reward in the future). This bifurcating design allowed us to glean as much information as possible about their preferences with as few questions as possible, and we pretested the potential payouts to maximize the variance in responses.
Based on game play, we assigned present and future patience scores for each respondent, ranging from 0 (less patient) to 3 (more patient). 12 We then used the sum of patience scores from the games to put people into 7 increasingly patient bins (0-6), and the difference of scores to put people into 7 increasingly time inconsistent bins.
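The scoring described above can be sketched in a few lines. This is an illustrative reading rather than the authors' code: it assumes the 0-3 score is built from the two bifurcating answers for each pair of delays (consistent with the example in footnote 12), and it takes time inconsistency as future minus present, a sign convention that the text does not state. Function names are hypothetical.

```python
from typing import Tuple


def patience_score(waits_first: bool, waits_second: bool) -> int:
    """Score one pair of delays from 0 (least patient) to 3 (most patient).

    The second question depends on the first answer: respondents who wait
    are offered a lower delayed reward, those who do not are offered a
    higher one. Waiting on the first question therefore counts double.
    """
    return 2 * int(waits_first) + int(waits_second)


def time_preference_bins(present_score: int, future_score: int) -> Tuple[int, int]:
    """Return (patience bin, time-inconsistency bin).

    Patience is the sum of the two scores (0 to 6, seven bins). Time
    inconsistency is future minus present (-3 to 3, seven bins); the sign
    convention is an assumption of this sketch.
    """
    return present_score + future_score, future_score - present_score


# Example patterned on footnote 12: waits twice in the "now" pair,
# never waits in the "later" pair.
present = patience_score(waits_first=True, waits_second=True)
future = patience_score(waits_first=False, waits_second=False)
print(present, future, time_preference_bins(present, future))
```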

Hypothetical trade-offs
During the survey questionnaire, well before the incentivized games, we asked respondents to make the exact same series of tradeoffs as above, but in a purely hypothetical setting. We constructed the patience and time inconsistency proxies in exactly the same manner. Our aim was largely methodological, as we were interested in whether people responded differently when games were incentivized rather than hypothetical. This analysis, which compares the consistency and comparability of time preferences over different measures and over time, will be the subject of future methodological work, based on similar data we have collected across several countries and populations. 13 In the meantime, we merely use all available time preference measures in our summary index, in the interest of reporting all survey measures used from each family.

Hypothetical discount rate
We also attempted to measure the discount rate in a second way (again, mainly for the methodological study mentioned above). As in Holt and Laury (2002), we asked respondents a series of hypothetical inter-temporal choices for larger amounts of money (on the order of US$10-30, about a week's wages). This was organized as two lists of 11 binary decisions, with a fixed amount right now versus a varying amount in two weeks (or two weeks versus four weeks for the second list). The delayed amount started as strictly less than the sooner amount (e.g. 1000 LD now or 900 LD in the future), then equal to, and then larger and larger until it was four times as big (1000 LD now or 4000 LD in the future).
We calculated discount rates based on each respondent's first switch from a present preference to a future preference. 14 Those who preferred 900 LD in the future over 1000 LD in the present received a discount rate of 0.9, while those who always preferred money earlier received a discount rate of 4. We then took the average of the inverse of the present (now versus 2 weeks) and future (in 2 weeks versus 4 weeks) discount rate as our measure of patience, and the difference between future and present as our measure of time inconsistency.

11 Subjects were told that one of the questions across the next few activities would be picked for payout, and their choice implemented, so that they should pay careful attention to their decisions. We told subjects that if one of the inter-temporal tasks was chosen for payout, and if their individual choice involved a delayed reward, we would come back and find them at the appointed time, in their own environment, to pay them.
Since we were typically returning in a few weeks to interview them again, and had interviewed them several times before, this was a reasonably credible commitment. Nonetheless, it could lead us to conflate patience with trust that the survey team would return. By the endline stage (their fifth survey with us), respondents knew us fairly well and knew that we were able to track them (and that we had paid them everything we had promised them in the past).
In fact, for logistical reasons, we also made one of the games a choice between a certain payout now and a lottery between a high and low payout (i.e. a risk preference question) and we selected this risk game for payout with very high probability, such that the intertemporal games were almost never paid out. Although we did not technically lie at any point (since we did not mention the probabilities that each task would be paid out) this could be construed as minor deception. None of the respondents brought this up, even after having gone through the process five times.

12 For example, if a respondent preferred 150 Liberian dollars (or LD, where 1 USD = 60 LD at the time) in a week over 50 LD now, and 100 LD in a week over 50 LD now, they received a 3 for their present patience score. If they preferred 50 LD in two weeks over 150 LD in three weeks, and 50 LD in two weeks over 300 LD in three weeks, they received a 0 for their future patience score.

13 In the meantime, we can see that the means are similar (3.96 for the incentivized game versus 3.35 for the hypothetical), but this 15% difference is statistically significant at the 99% level.
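The switch-point calculation can be sketched as below. This is a hedged illustration, not the authors' procedure: the intermediate amounts in the list are hypothetical (only the 900 LD and 4000 LD endpoints come from the text), and "the average of the inverse" is read here as the mean of the two inverse rates. Function and variable names are invented for the example.

```python
from typing import Sequence, Tuple


def first_switch_rate(prefers_later: Sequence[bool],
                      later_amounts: Sequence[float],
                      sooner_amount: float,
                      never_switch_rate: float = 4.0) -> float:
    """Discount rate implied by the first switch to the later payment.

    Rows run from the smallest to the largest later amount. Only the first
    row at which the respondent waits is used, so later switches back are
    ignored. Respondents who never wait receive never_switch_rate.
    """
    for waits, later in zip(prefers_later, later_amounts):
        if waits:
            return later / sooner_amount
    return never_switch_rate


def patience_and_inconsistency(present_rate: float,
                               future_rate: float) -> Tuple[float, float]:
    """Patience: mean of the inverse rates. Inconsistency: future minus present."""
    patience = (1.0 / present_rate + 1.0 / future_rate) / 2.0
    return patience, future_rate - present_rate


# Hypothetical 11-row list; only the first and last amounts appear in the text.
LATER_AMOUNTS = [900, 1000, 1250, 1500, 1750, 2000, 2500, 3000, 3250, 3500, 4000]
answers_now = [False] * 3 + [True] * 8      # first waits at 1500 LD
answers_later = [False] * 2 + [True] * 9    # first waits at 1250 LD
present_rate = first_switch_rate(answers_now, LATER_AMOUNTS, 1000)
future_rate = first_switch_rate(answers_later, LATER_AMOUNTS, 1000)
print(patience_and_inconsistency(present_rate, future_rate))
```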

Self-reported survey questions
We asked respondents six qualitative questions to gauge their self-reported levels of patience and time inconsistency. 15 For example, respondents were asked to place themselves on a ladder from 0 (least patient) to 5 (most patient) as one measure of self-reported patience, and how much they agree with statements such as "When I get money, I spend it quickly" as a proxy of time inconsistency. Specific questions are displayed in Table D.2. By reporting all measures collected in the endline survey, three-quarters of our time preference measures are hypothetical rather than based on incentivized games. For robustness purposes, in Table D.2 we also report a summary index of the incentivized games only. Table D.3 displays control means and 12-13-month treatment effects for all subcomponents and survey questions in our self-control index. Again, these treatment effects are for illustrative purposes only.

D.4 Self control skills
The NSF proposal highlighted "noncognitive skills of self control" as one of the three primary intermediate outcomes of interest in Section 4.1, though the proposal sometimes used "impulse control" or "self-discipline" synonymously. Sections 4.1.A and B gave examples such as "inhibition control, executive function, and perseverance". Section 4.4.A added that we will measure this using "standard psychological skills such as conscientiousness, locus of control, working memory, and inhibition." This ex ante description included executive function and locus of control, neither of which we presently include in the family of self control measures in the paper. We decided to exclude these two measures after the NSF proposal was written but before final data collection. While this decision is not formally documented, the psychological and neurological principles are clear.
14 Enumerators continued down the list, and (oddly) a nontrivial fraction switched multiple times. We use the first switch only. Furthermore, about 17% of respondents preferred less money in the future as a commitment device, especially if they were expecting a large purchase coming soon.
15 Dohmen et al. (2011) and Jamison and Karlan (2011) show that basic self-reported attitudes on risk and time preferences can be externally valid.
• While economists often refer to executive function as a "noncognitive" skill, one associated with self control, it is technically a cognitive ability in that psychologists and neuroscientists view it as a measure of mental performance established at a young age. Psychological and neuroscientific research suggests that executive function responds to childhood but not adult investments, and that investments result in very task-specific changes. Thus both theory and evidence suggest that this neurological capability should not be affected by CBT. Appendix D.7 below describes this research in more detail and illustrates the consequences of including or excluding executive function from the self control family.
• Locus of control or self-efficacy should never have been mentioned in the NSF proposal as linked to self control, and indeed the one mention of it was an aberration and an error. The concept of self control is intended to measure the degree to which you feel you are able to control your own emotions and behavior versus the degree to which you find you act impulsively or act as your emotions dictate without restraint. It is a skill. Locus of control, meanwhile, is a perception: a measure of the degree to which a person feels that events can be influenced by their own behavior versus luck or fate. It is possible that self control skills could affect a person's perceptions of self-efficacy, but the two concepts are considered distinct, even unrelated, by psychologists. Psychologists view locus of control as a measure of self-regard and often combine it with measures such as neuroticism and self-esteem (Judge et al., 2002; Judge and Bono, 2001).
Instead, the survey included four psychological scales: impulsiveness, conscientiousness, GRIT, and reward responsiveness.
These existing scales typically have many more questions than we could use in the survey (or than are commonly used in any assessment). These questions are typically organized into sub-scales to capture subcategories of behavior. We selected questions to use based mainly on whether they were easily understood and familiar to pre-test respondents, but we took care to ensure roughly equal proportions of questions from each sub-scale remained.
Because all personality questions were selected from questionnaires used in the United States, they were first translated into Liberian English by the enumerators. The authors and their research assistants then pre-tested the questions with young men from the same population as the youth in our study (but not members of the study sample).
To ensure that the questions continued to assess the original underlying constructs, we performed two checks. First, within the pre-test data we ensured that groups of questions were correlated or anti-correlated as one would expect given the underlying personality measure (e.g., impulsivity was negatively correlated with conscientiousness). Second, we performed confirmatory factor analyses to ensure that within scales, questions were answered similarly.

D.5 Anti-criminal and anti-violent self-image/values
We are not aware of prior attempts to conceptualize or measure anti-criminal and anti-violent values and self-image. We developed three measures: self-reported values, prosocial behaviors, and appearance. The first two of these were prespecified in the NSF proposal.
Values and self-image were operationalized in Section 4.4.B of the NSF proposal, as the main "threat" to identifying the operation of the therapy through time preferences and self control skills. We recognized that the therapy could "change norms or beliefs about violence and its acceptability and risks and thereby reduce violence not through an effect on time preferences and self-control but through norms and the intrinsic utility or disutility of violent action." As a result, the proposal suggested that "we should observe a treatment effect of the [therapy] on self-reported norms towards violence, criminality and other antisocial behaviors, and possibly an increase in forms of collective actions, such as contributions to public goods or political participation." As we finalized the intervention, saw the pilot results, and designed the endline survey, we realized this was not a "threat" but interesting in itself, and simply another intermediary mechanism. In the paper, we chose to focus on the direct measure of preferences as opposed to the prosocial behaviors (which were actions, not preferences, and where by using the word "possibly" we indicated that we did not have strong priors).

Table item (1): Most of the time, I will do things for no other reason than that I will enjoy them.

Table notes: The table reports 12-13 month intent-to-treat estimates of self control outcomes. N=943 because 4 respondents did not answer all questions. We calculate the impact of each treatment arm controlling for baseline covariates and block fixed effects. We focus on pre-defined composite measures, typically defined by survey module. Each overall summary index is the standardized mean of its composite outcomes, which are themselves standardized. Heteroskedasticity-robust standard errors are reported in brackets. *** p<0.01, ** p<0.05, * p<0.1
As an afterthought, we also happened to measure the enumerator's impression of the respondent's appearance. We didn't conceive of these as measures of an actual skill or preference change, as they are choices or actions. Hence we treated them as "other" outcomes in the last version of the paper. One of the referees argued persuasively that our measure of self-image should be changed ex post to include this measure of appearance. We agree and have made that change in the revised paper.

Self-reported values
The closest parallel to our measure of values is the measurement of social norms, where social psychologists ask respondents: (1) what the respondent thinks other people do (descriptive norms); (2) what the respondent thinks other people believe is appropriate (prescriptive norms); and, in some cases, (3) what the respondent him or herself believes is appropriate (an attitude) (Paluck, 2009). We used social norm surveys on behaviors such as bullying and conflict resolution as models for our approach, but had to develop our own original measures suitable to the context and treatment.

Prosocial behavior and appearance
From qualitative interviews (and prior surveys in the country) we also developed a number of locally-relevant prosocial behaviors and appearance measures.

D.6 Additional intermediary outcomes of interest: Mental health, substance abuse, and social networks
Sections 1 to 4 of the NSF proposal made no mention of mental health, substance abuse, or social network change as major hypotheses or intermediary outcomes. Nonetheless, many of these were highlighted in Section 5 of the proposal as control or other outcomes of interest, as it is conceivable that they could be affected by the interventions, and that any change in these could also affect poverty or antisocial behavior. Table D.5 displays the control group means and 12-13-month impacts of all the survey questions that comprise these three families. We describe the composition of each of these measures in Section 6.3 of the main paper.
Table item (1): If your best friend has some counterfeit money that need to be washed with mercury, will you join him in the search of the mercury to get lots of money?

Table D.6 displays the control group means and 12-13-month impacts of all the components of executive function. We decided to measure executive function over time out of other research interests in measurement and consistency of executive function. We did not hypothesize a change in executive function, for reasons noted above, and so these results are not reported in the paper. This is the only surveyed outcome not reported in the paper's main tables. Below, we show the robustness of results to its inclusion.

D.7 Executive function
In order to measure executive function, our behavioral protocol included three interactive activities drawn from economics and psychology. 16

Planning behaviors
We used a series of mazes to test planning behavior. Mazes were unknown to nearly all respondents. Subjects were shown an example maze on paper and then given 2, 2, and 3 minutes respectively to complete increasingly difficult mazes. Each had two entry points, one of which almost immediately led to a dead end. The main outcome of the mazes was the subject's ability to pause and plan their approach before completing the maze (i.e. did they plan their approach before choosing a starting point). As outcomes, we measure "time to first touch", or the amount of time spent planning prior to engaging in the maze; and number of mistakes (or "backtracks") in Maze 3, the hardest maze, which required the most planning and by which time participants had learned the concept of the maze. On average subjects took 18 seconds to plan for Maze 3 (SD = 23 seconds).

16 Across all behavioral tests administration was standardized. First, a clinical psychologist and economist trained enumerators in test administration. Next, in collaboration with experienced enumerators and research assistants, a comprehensive protocol was developed and used by all future enumerators. Enumerators were also instructed to answer clarifying questions and were taught the over-arching concept within each game so they could address questions/alleviate concerns without straying from the central concepts of the tests. This tight control over the testing situation allowed us to collect relatively sophisticated measures of cognitive function and behavioral responses to rewards in a constrained and otherwise under-resourced testing environment.

Behavioral inhibition and cognitive flexibility
We developed the "arrows game", a modified directional Stroop task, a class of tasks that assess inhibitory control. Here subjects were shown a sequence of large black or white arrows that pointed either up or down and were first told to respond "up" or "down" to each arrow ("arrows baseline"). In the second version they were again shown the arrows but now were told to state the opposite direction; this constitutes producing the less common response while suppressing the more common response and is an assessment of inhibition ("arrows inhibition"). Finally, in a third version subjects were told to switch between two approaches: if the arrow was white they were to state the actual direction, but the opposite direction if the arrow was black. This is commonly called "switching" and is an assessment of cognitive flexibility, the ability to move rapidly between two goals as the situation demands ("arrows switching"). For each version, the outcome data included total time to completion and the number of correct/incorrect responses out of 32 arrows. On average subjects made 0.33 errors (SD = 1.5) on arrows baseline, 2.4 errors (SD = 3.5) on arrows inhibition, and 3.9 errors (SD = 3.9) on arrows switching. Arrows took on average 25 seconds (SD = 17.7), 38 seconds (SD = 45.8), and 46 seconds (SD = 28.7) for baseline, inhibition, and switching respectively.
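The three response rules of the arrows game can be written out as a small scoring sketch. It is illustrative only: the function names and the string encoding of directions, colors, and versions are assumptions made for the example, not part of the field protocol.

```python
from typing import Sequence, Tuple

OPPOSITE = {"up": "down", "down": "up"}


def expected_response(direction: str, color: str, version: str) -> str:
    """Correct answer for one arrow under each version of the game."""
    if version == "baseline":
        return direction                  # say the direction shown
    if version == "inhibition":
        return OPPOSITE[direction]        # always say the opposite
    if version == "switching":
        # White arrows: actual direction. Black arrows: opposite direction.
        return direction if color == "white" else OPPOSITE[direction]
    raise ValueError(f"unknown version: {version}")


def count_errors(arrows: Sequence[Tuple[str, str]],
                 responses: Sequence[str],
                 version: str) -> int:
    """Number of responses that differ from the expected answer."""
    return sum(
        expected_response(direction, color, version) != said
        for (direction, color), said in zip(arrows, responses)
    )


# Hypothetical four-arrow sequence (the actual task used 32 arrows).
arrows = [("up", "white"), ("down", "black"), ("up", "black"), ("down", "white")]
said = ["up", "up", "up", "down"]
print(count_errors(arrows, said, "switching"))
```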

Working memory
Working memory is the ability to hold something in mind when it is no longer present in the environment and then manipulate it. The digit span task is an assessment of working memory. The digit span tasks involved the enumerator saying a random sequence of digits (1-9) out loud with a short pause between each digit, followed by the respondent repeating them back either in the same (forward-digits) or the reverse (backwards-digits) order. The enumerator began by giving two 2-digit numbers (one at a time) and recording the responses. If the subject correctly reported either of the numbers back, the enumerator would do the same with 3-digit numbers, and so on up to a maximum of 9 digits. As soon as the subject incorrectly reported both examples at a given level or span the enumerator moved on to the next activity (backwards-digits). The reverse digit span was done the same way, except that the subject was instructed to repeat the digits in the opposite order that the enumerator gave them (e.g., "three, zero, one"). On average subjects were able to remember 5.5 digits forward (SD = 1.23) and 3.33 digits backwards (SD = 1.03).

Each activity existed as two slight variants (e.g. changing the numbers in the gambles). These activities were alternated in the 2 versus 5-week endlines and the 12 versus 13-month endlines, so that participants were never asked identical questions too close together in time.
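The stopping rule of the digit span task can be sketched as follows. This is a minimal illustration under stated assumptions: the span is taken to be the longest length at which at least one of the two sequences was repeated correctly, and a respondent who fails both 2-digit sequences is given a span of 0. Neither scoring detail is spelled out in the text, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def digit_span(results: Dict[int, Tuple[bool, bool]],
               start: int = 2,
               maximum: int = 9) -> int:
    """Longest sequence length passed before the first double failure.

    results maps a sequence length to whether each of the two sequences at
    that length was repeated correctly. Testing stops at the first length
    where both attempts fail, so later entries are ignored.
    """
    span = 0  # assumption: no credit if both 2-digit sequences are missed
    for length in range(start, maximum + 1):
        first_ok, second_ok = results.get(length, (False, False))
        if not (first_ok or second_ok):
            break
        span = length
    return span


# Hypothetical respondent: passes lengths 2 to 4, fails both 5-digit sequences.
example = {2: (True, True), 3: (True, False), 4: (False, True),
           5: (False, False), 6: (True, True)}
print(digit_span(example))
```

The same function applies to the backwards task, with correctness judged against the reversed sequence.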

D.8 Distinguishing between different measures of "self-control"
Our summary indexes distinguish between self-control skills (assessed by various psychological scales), economic time preferences (using incentivized and hypothetical games), and (as an "other" outcome) executive function. Here we discuss the decision to separate these measures and what happens when we relax that assumption.
First, we treat the difference between time preferences and self control skills as an empirical question. As reported in Section 6.4, they are positively and significantly correlated, but with a correlation of 0.33 it is unclear whether they are distinct. As we report in Table D.7, combining both into an equally-weighted index leads to large increases in the measure for both the therapy-only group (0.17 SD after 2-5 weeks, 0.18 SD after a year) and the therapy and cash group (0.22 SD after 2-5 weeks, 0.26 SD after a year).
Second, we separate executive function from self control as well.
A main reason is that these abilities mature over the lifespan. Psychologists and neuroscientists have emphasized the importance of early-stage over late-stage investments because of the neuroscientific principle of developmental plasticity, and data from randomizing young children into different early investments suggest that early, but not later, investments shape cognitive function (Nelson, 2007).
This is not to say that they are not highly correlated or have common roots early in life. A large literature documents that in some extreme populations (e.g., individuals with substance abuse disorder, kids with ADHD) many of these indices of "self control" co-vary. That is, kids with ADHD have deficits in performance on inhibition tasks (e.g. Barkley, 1997). These same children, by definition, behave impulsively and appear to be more sensation or risk seeking. Many have taken this covariance as evidence that these traits are interdependent. There is even a small neuro-imaging literature which suggests that these different forms of impulsivity are subserved by the same neural areas (Aron, 2007).
Nonetheless, there are many hints in the psychology and neuroscience literature that this is an oversimplification. For example, even within extreme populations, sensation seeking and impulsivity, measured similarly, may be differentially linked with behavior (Ersche et al., 2010). In typically developing children, successfully resisting temptation on delay of gratification tasks is not predicted by performance on inhibitory control tasks, but the strategies employed in attempting to resist temptation are (Eigsti et al., 2006).
In fact, the best test is to do what we have done here: randomly assign individuals to an intervention which shifts one of these indices and observe whether they all move together. The fact that we see no improvement in executive function is consistent with the skills being different. In Table D.7 we test the combined measures formally, and we do not observe significant increases in a measure combining self control with executive function. Furthermore, their correlation is only 0.15, less than half of the correlation between self control and time preferences.

E Additional treatment effects analysis

E.1 Ignoring the ultimate/intermediary distinction

E.2 Robustness of treatment effects to alternate models
Our robustness tests focus on the five main summary outcomes. First, in Table E.2, we show robustness to alternative ways of constructing the indexes and pooling or averaging of endlines. Columns 2-4 report results from the main paper for comparison. Recall that in this main specification we averaged endline surveys (at 2 and 5 weeks, and 12 and 13 months), took an index of composite measures rather than individual survey questions, and used equal weights. In columns 5-7, we do the same except use randomization inference to assess statistical significance. In columns 8-10, we pool our composite measures from both endline surveys and cluster our standard errors by individual.
In columns 11-13, we do the same except weight each survey question equally. In columns 14-16, we use covariance-weighted indexes from Anderson (2008) and average both endlines. 17 The conclusions from these three specifications are quantitatively similar to those from the main specification. Exceptions are as follows:
• The impact of cash and therapy on the covariance-weighted antisocial behaviors index is not significant at conventional levels after a year. This is because half of this index's weights come from domestic violence and number of arrests, two components that were unaffected by treatment. If we exclude domestic violence from the index and recalculate covariance weights, cash and therapy lead to a 0.26 standard deviation decline in antisocial behaviors after a year (column 19, significant at the 99% level).
• Cash increases antisocial behaviors after a year in some specifications. In column 15 we see that after a year the men assigned to cash only increased their antisocial behaviors by 0.17 standard deviations. In the other specifications, the coefficients are positive as well but smaller and not statistically significant. One possibility is that receiving a cash grant and failing, or having the money stolen, reinforces men's participation in crime. This is largely speculative, however.
17 For this index, each component is weighted by the inverse of the covariance matrix of all index components. Outcomes that are highly correlated with each other receive less weight while outcomes that are uncorrelated receive more weight as they represent new information. We cannot covariance weight the pooled endlines, since they are unbalanced in the sense that some outcome measures appear in only one endline while others appear in both.
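The randomization inference used for columns 5-7 can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the procedure behind Table E.2: the test statistic here is a raw difference in means rather than a regression coefficient with controls, treatment labels are reshuffled within randomization blocks, and all function and variable names are hypothetical.

```python
import random


def diff_in_means(outcomes, treated):
    """Mean outcome of treated units minus mean outcome of control units."""
    t = [y for y, d in zip(outcomes, treated) if d]
    c = [y for y, d in zip(outcomes, treated) if not d]
    return sum(t) / len(t) - sum(c) / len(c)


def randomization_p_value(outcomes, treated, blocks, draws=2000, seed=0):
    """Two-sided p-value from re-drawing treatment labels within blocks."""
    rng = random.Random(seed)
    observed = diff_in_means(outcomes, treated)

    # Group unit positions by block so that placebo assignments respect blocking.
    by_block = {}
    for i, b in enumerate(blocks):
        by_block.setdefault(b, []).append(i)

    extreme = 0
    for _ in range(draws):
        placebo = list(treated)
        for idx in by_block.values():
            labels = [treated[i] for i in idx]
            rng.shuffle(labels)  # placebo assignment within the block
            for i, lab in zip(idx, labels):
                placebo[i] = lab
        # Count placebo statistics at least as large in magnitude as the observed one.
        if abs(diff_in_means(outcomes, placebo)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / draws
```

The p-value is the share of placebo assignments that produce an effect at least as large in absolute value as the one observed, so it does not rely on large-sample approximations.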

Next, we check for robustness to alternative attrition scenarios by bounding treatment effects. We impute outcome values for unfound individuals at different points of the observed outcome distribution. The most extreme bound, from Manski (1990), imputes the minimum value for unfound treated members and the maximum for unfound controls. Following Karlan et al. (2015), we also calculate less extreme bounds by imputing relatively high values of the dependent variables for missing control group members, and relatively low values for missing treatment group members. 18 Specifically, we impute missing dependent variables for the treatment (control) group as the found treatment (control) mean minus (plus) X SD of the found treatment (control) distribution, for X = 0.10, 0.25, or 1. Note that these imply large and systematic differences between missing treatment and control members: columns 8-10 assume unfound control group member outcomes are roughly 2 SD greater than unfound treatment group member outcomes. Table E.3 reports ITT estimates under these attrition scenarios. Our results are generally robust to these alternate specifications. When X = 0.25 SD, we still observe large and statistically significant changes in antisocial behaviors and our index of mechanisms after a few weeks and also a year. When X = 1 SD, our estimates of treatment effects lose significance but generally point in the correct direction. Meanwhile, the Manski bound brings us closer to having no treatment effects in the medium term.
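The imputation rules above can be sketched as follows. This is an illustration under simplifying assumptions: it compares raw means rather than running the ITT regression with controls, it assumes a higher value of the outcome is the favorable direction, and the function names and toy data are hypothetical.

```python
from statistics import mean, stdev


def bounded_effect(found_treat, found_ctrl, n_missing_treat, n_missing_ctrl, x_sd):
    """Difference in means after imputing unfound respondents pessimistically.

    Unfound treated members get the found treatment mean minus x_sd SDs;
    unfound controls get the found control mean plus x_sd SDs.
    """
    t_fill = mean(found_treat) - x_sd * stdev(found_treat)
    c_fill = mean(found_ctrl) + x_sd * stdev(found_ctrl)
    treat = list(found_treat) + [t_fill] * n_missing_treat
    ctrl = list(found_ctrl) + [c_fill] * n_missing_ctrl
    return mean(treat) - mean(ctrl)


def manski_effect(found_treat, found_ctrl, n_missing_treat, n_missing_ctrl):
    """Most extreme bound: observed minimum for unfound treated, maximum for unfound controls."""
    observed = list(found_treat) + list(found_ctrl)
    treat = list(found_treat) + [min(observed)] * n_missing_treat
    ctrl = list(found_ctrl) + [max(observed)] * n_missing_ctrl
    return mean(treat) - mean(ctrl)


# Toy data: three found respondents per arm, one unfound respondent per arm.
treat_found, ctrl_found = [1.0, 2.0, 3.0], [0.0, 1.0, 2.0]
for x in (0.10, 0.25, 1.0):
    print(x, bounded_effect(treat_found, ctrl_found, 1, 1, x))
print("Manski", manski_effect(treat_found, ctrl_found, 1, 1))
```

For outcomes where a decrease is the favorable direction, the signs of the imputed shifts would be reversed.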

E.3 Both versus just one treatment
In this section, we compare the effects of receiving one treatment versus receiving both therapy and cash. Specifically, we test whether the coefficients on either therapy only or cash only in Section 6 are statistically different from the coefficients on therapy and cash. Table E.4 displays the mean difference between treatment effects and corresponding p-value for each of our three main outcome variables.
Our results indicate that cash and therapy complement each other in reducing antisocial behaviors in the medium run, while therapy complements cash in the medium-run mechanisms.

E.4 2-5-week versus 12-13-month treatment effects
In discussing our results, we emphasize differences between outcomes 2-5 weeks after the intervention and outcomes 12-13 months after the intervention. In this section, we test whether the 2-5-week and 12-13-month impacts are the same. We pool our short-term results with our longer-term results and run an OLS regression of each outcome on T, an indicator for treatment group assignment, on ShortTerm, an indicator for outcomes measured in weeks 2 or 5, and on their interaction, whose coefficient we denote β3. In our application, we have three treatment groups (therapy only, cash only, and therapy and cash), include baseline controls and block fixed effects, and cluster our standard errors at the individual level i. The size and direction of β3 determine whether the treatment effects we observe after 2-5 weeks are the same as those observed after a year. Table E.5 reports these estimates for our three main family indexes. For many outcomes, we cannot reject that β3 is zero. In particular, the short- versus longer-term effects of both therapy and cash are not statistically distinguishable for antisocial behaviors and all mechanisms. However, there are two exceptions worth noting. First, while the cash-only group experienced the largest increase in economic performance 2-5 weeks after the intervention, these effects diminished a year later. Second, while all three treatment groups saw decreases in antisocial behaviors in the short term, the effects of cash alone and therapy alone subsided 12-13 months later.
18 This assumes the dependent variable points in the positive direction. If treatment leads to a decrease in the outcome variable, as is the case for antisocial behaviors and antiviolent and anticriminal values, we impute in the opposite direction (i.e., smaller values for control, larger values for treatment).
Notes: ... our main specification, where we average composite measures and do not cluster standard errors. In columns 5-7, we do the same but use randomization inference to get our standard errors. In columns 8-10, we pool our endline surveys, weight composite measures equally, and cluster standard errors by individual. In columns 11-13, we pool our endline surveys, weight each survey question equally, and cluster standard errors by individual. In columns 14-16, we weight components using a covariance weight from Anderson (2008) and average both endlines. In columns 17-19, we remove domestic violence from our antisocial behaviors index, weight survey questions using a covariance weight from Anderson (2008), and average both endlines. *** p<0.01, ** p<0.05, * p<0.1

E.5 Crime: Disaggregated and annualized impacts
Table E.6 reports the incidence of specific crimes reported in the two weeks prior to the 12-13-month survey, breaking down the total number of crimes into the type of crime reported. For consistency, we shift from the incidence of drug selling reported in Table 3 to the frequency: the number of times men reported selling drugs in the past two weeks.
Control men committed 2.54 crimes in the previous two weeks, and this fell by almost one crime with therapy plus cash. All types of crime decreased by 20 to 100% with cash and therapy, but the statistically significant (and proportionally largest) reductions are in burglary, muggings, and scams (e.g. the sale of non-existent goods, or down-payments for a hidden fortune). We do not adjust p-values for multiple hypothesis testing and so these comparisons across crimes should be taken with caution.
Notes: Columns (1) to (4) report the same ITT regression as in Table 3, with robust standard errors in brackets. Columns (5) and (6) ...

E.6 Heterogeneity analysis on antisocial behaviors
Table E.7 reports impact heterogeneity from an OLS regression of the antisocial behaviors summary index on baseline level of either antisocial behaviors or self control and time preferences, treatment indicators, and interactions between treatment and baseline antisocial behaviors or an index of self control and time preferences, controlling for baseline covariates and block fixed effects. (Recall that our measure of antisocial behaviors is a standardized index with mean zero. Therefore, the coefficient on the treatment indicator represents the treatment effect for an individual with the mean level of antisocial behavior at baseline, while the coefficient on the interaction term is the additional effect for individuals whose baseline level of antisocial behaviors was 1 standard deviation higher than average.)
We did not prespecify any heterogeneity analysis with antisocial behaviors, and so these estimates must be taken with caution. But these were the only heterogeneity analyses we conducted.
Therapy decreased the incidence of antisocial behaviors for the average participant, but men exhibiting more antisocial behavior at baseline saw larger declines. For example, men with average levels of antisocial behaviors at baseline who were assigned to both therapy and cash experienced a 0.25 standard deviation decline in their level of antisocial behaviors 12-13 months later, but men whose initial level of antisocial behaviors was a standard deviation higher than average experienced about double the decline. Our results also indicate that after a year, men with high levels of initial antisocial behavior who received a cash grant actually increased their antisocial acts. This is especially interesting given that the effects of cash on occupational choice and income disappeared after a year. One possibility is that this increase in antisocial behavior is a reaction to the failed attempt at legitimate livelihoods, but these results are more speculative than anything else.
Our results also indicate that therapy and cash decreased the incidence of antisocial behaviors by 0.25 SD for participants with average self control and time preferences, but the effects were smaller for men who were more patient at baseline. These conclusions remain when we adjust for two comparisons within the "both" treatment arm. 19
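To make the reading of the interaction coefficients concrete, the short Python sketch below computes the effect implied for men at different baseline levels. The treatment coefficient of -0.25 SD comes from the text; the interaction coefficient of -0.25 is a hypothetical value chosen only to stand in for "about double the decline" and is not an estimate from Table E.7.

```python
def implied_effect(beta_treatment, beta_interaction, baseline_z):
    """Effect implied for a man whose baseline index is baseline_z SDs from the mean.

    Because the baseline index is standardized with mean zero, beta_treatment is
    the effect at the mean and beta_interaction is the extra effect per SD.
    """
    return beta_treatment + beta_interaction * baseline_z


BETA_BOTH = -0.25         # therapy plus cash, man with mean baseline level (from the text)
BETA_INTERACTION = -0.25  # hypothetical: stands in for "about double the decline"

effects = {z: implied_effect(BETA_BOTH, BETA_INTERACTION, z) for z in (-1, 0, 1)}
for z, effect in effects.items():
    print(f"baseline {z:+d} SD: implied effect {effect:+.2f} SD")
```

Under these illustrative values, a man one standard deviation above the mean at baseline sees twice the decline of the average man, while a man one standard deviation below the mean sees none.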

E.7 Program impacts on occupational choice
To measure changes in occupational choice, we asked respondents at each endline whether they had engaged in 22 occupations, from farming to petty business, trades, and formal jobs. For each occupation, we collected self-reported earnings and hours in both the last week and the week prior. We use these to calculate the total earnings and hours variables. With two endline surveys, we have four weeks of employment data per person in both the 2-5-week and 12-13-month surveys.
We can also calculate hours by occupation each week, aggregating our 22 occupations into 5 mutually exclusive categories.
Table E.8 reports ITT estimates on the average of the two weeks of data. While we generally observe no changes in overall average hours worked per week (the one exception is that those assigned to cash only work approximately 15% more hours per week in the short term), treatment affects how participants allocate their time. In the short term, all three treatments cause participants to shift from illicit work to non-agricultural low-skill business. Those assigned to both therapy and cash experience the largest decline in illicit work. Time spent in illicit work falls 38% 2-5 weeks after implementation relative to the control group, and is 17% less than the control group one year later (although the latter is not statistically significant). Although the cash-only group more than doubles its weekly hours spent in non-agricultural low-skill business in the short term, these effects phase out 12-13 months later.
19 The adjusted p-values for the interaction terms on "assigned to both" are 0.011 in the short term and 0.014 in the long term for antisocial behaviors, and 0.177 in the short term and 0.037 in the long term for self control and time preferences.

E.8 Program impacts on baseline data
There is a threat that post-randomization outcomes could be due to baseline imbalance. Table E.9 investigates this by displaying program impacts on families of baseline variables aggregated to be as similar as possible to our endline outcomes. The time preferences, self control, mental health, and substance abuse indices are the same as at endline. The antisocial behavior index is missing (i) carries a weapon on body, (ii) arrested in past 2 weeks, and (iii) verbal/physical abuse of partner, because we collected data on these outcomes only at endline. The economic performance index is missing the investment and non-durable consumption components. We collected none of the components of the identity and social network indices at baseline, so they are excluded from the table.
If baseline imbalance is driving our results, we should see treatment effects on baseline data. However, Table E.9 shows this is not the case. No treatment effect is significant at the 95% level. Only four comparisons out of 36 (11.1%) have p < 0.10, and none of these are in the therapy plus cash treatment arm. Therefore post-randomization outcomes are not due to baseline imbalance.

F Survey data validation details

Variable selection
We selected six variables for validation, all with recall periods of two weeks. We chose outcomes with varying degrees of salience (or memorability) and potential social stigma and experimenter bias. We wanted very specific behaviors (e.g. stealing rather than any crime, or marijuana rather than substance abuse). Finally, we wanted sensitive outcomes that were a primary focus of the treatment (stealing) and others that were less so (gambling or expenditures). The variables we selected in the end were:
1. Stealing. The survey asked how many times in the last two weeks the respondent stole someone's belongings or deceived or conned someone of money or goods. 20 Based on our fieldwork, we hypothesized that stealing would be the most salient and least socially desirable of all six measures.
2. Gambling. The survey asked how many times in the last two weeks the respondent gambled or bet on sports. Beforehand, we hypothesized gambling had a lower level of salience and sensitivity than stealing, but was still somewhat stigmatized.
In part, our use of many non-primary outcomes was deliberate. But, to be frank, our choices were driven more by the practicalities of validation, and in retrospect it would have been useful to focus on more primary outcomes.

Validator staff
Eight local staff performed validations over the two years of data collection. We selected validators from the study's qualitative research staff. These people typically began as survey enumerators, but displayed such skill and rapport with the subjects that we hired and trained them to conduct a separate qualitative research component: longitudinal, formal, open-ended interviews with a different subsample of subjects. All conducted the qualitative validation when they were not working on the formal open-ended interviews. 21 Each validator received at least 10 days of training on the methods, including both classroom learning and extensive field training. We trained more qualitative researchers than were needed for the exercise. Those who exhibited superior performance during the trainings were selected as validators. The aim of the training was to develop and refine trainees' skills in acquiring informed consent, building rapport with respondents, collecting and recording data, and analytical reasoning. Trainings were held for eight hours each day and, over the course of 10 days, transitioned gradually from exclusive classroom learning to field trainings with short debriefing sessions. Field trainings provided trainees with opportunities to practice the skills and techniques they had learned.
As in any qualitative study, we believe staff recruitment and training were among the most important tasks, and also the largest start-up cost, of this method.

Approach
For each respondent, validators tried to determine whether the respondent had engaged in any of the measured behaviors, even once, in the two weeks preceding the respondent's survey date, as the survey asked about behaviors occurring during the two weeks prior to the survey. We found it optimal for validators to visit each respondent four times, on four separate days, with each visit or "hangout session" lasting approximately three hours. The validator aimed to begin hanging out the day after subjects completed their quantitative surveys and to conduct all four visits in the days following the respondent's endline survey date.
Validators deliberately avoided the feeling of a formal interview and would typically accompany respondents as they went about their business. 22 Validators sometimes took notes during visits, but only in isolated areas out of sight of the respondent. 23 The idea follows from basic principles of ethnography, which seeks to study subjects in their natural settings, similar to those the researcher hopes to generalize about. The intent is to reduce the sense of being in an experimental situation, which ethnographers perceive as creating bias.
21 All but one were men, and all had a high school education. Two of the men completed roughly half the validations with the remainder doing roughly 10 to 20% each. To find these validators, we trained roughly two to three times the number of people needed from the pool of research staff, selecting only those with the most natural questioning and rapport-building skills for the validation exercise.
22 On the first visit validators would obtain verbal consent. We designed the consent script to be informal, and explained that the goal of hanging out with the respondent was to talk about some of the same things they discussed in the survey. In addition to this verbal consent, the formal consent form that preceded the recent survey said that qualitative staff may come and visit them again to gather more information.
23 E.g., in a toilet stall or teashop. If validators were unable to find a secluded area in which to take notes, they sometimes recorded information in their cell phones, pretending to send a text message.
The main approach was to engage in casual conversation on a wide range of topics, including the six target topics/measures. The target topics were raised mainly through indirect questions while informally chatting. For example, validators typically started conversations with discussions of family. This was both customary among peers in Liberia and a sign of respect and interest in respondents' lives. It was also a stepping stone for discussing the target behaviors, either because the validator could discuss an issue in their own family (someone engaging in one of the activities) or because they could ask how the respondent's family felt about his current lifestyle and circumstances.
In general, validators found it helpful to tell respondents stories or scenarios about another person or themselves, related to the target measures, then steer the conversation to get information about how respondents had behaved in similar situations, eventually discussing the past two weeks. Validators were careful to present these behaviors and incidents in a non-stigmatized light, for instance by discussing a friend who stole in order to get enough to eat, or how they themselves had periods of homelessness or used drugs and alcohol. Validators found these personal stories (all of which were truthful) and genuineness were essential to building rapport and trust.
Validators might hold these conversations once or twice over the three hours, spending perhaps twenty or thirty minutes in conversation each time, to avoid unnaturally long or awkward conversations. The validator spent the remainder of the three hours in the general vicinity, observing respondents engaging in their daily activities. This could involve taking a rest in the shade or in a tea shop (as is common) or engaging others in conversation. Validators would also try to talk casually with the respondent's friends, relatives, or neighbors to learn about him (we treated information from these second-hand sources as merely supporting, and as insufficient on its own to support a conclusion about the respondent's behaviors).
We found that building a rapport with participants in a short space of time was crucial. To develop trusting and open relationships, validators used several techniques, including becoming close to respected local community and street leaders, eating meals together, sharing personal information about themselves, assisting subjects with daily activities, and mirroring participants' appearances and vernacular, as appropriate. In addition, validators tried to maintain neutrality and openness while discussing potentially sensitive topics. For instance, conveying (through stories or otherwise) that illicit behaviors were not perceived negatively allowed respondents to feel comfortable sharing their involvement in such activities. Validators did not lie to or deceive respondents, however.
Overall, this approach was designed to counter the observer bias and selective recall that concern participant observation. 24 It combined trust-building, spending time together over the course of several days, assuming the role of an "insider," attempting to obtain admission or discussion of the behavior, clandestine but fairly immediate note-taking, and (as discussed below) close examination of the evidence for each respondent with the investigators. Developing a rapport with respondents, spending time to develop a relationship, and obtaining insider status are considered central to obtaining more honest and valid responses (Baruch, 1981; Bryman, 2003; Fox, 2004). We are not aware of any study, however, that has quantitatively tested this proposition.

Validation sampling and non-response
In each endline survey round we randomly selected study respondents to be validated, stratified by treatment group. 25 Table F.1 describes the samples selected for validation in each survey round over the course of the study. In total, we randomly selected 7.4% of all surveys, 297 in total, for validation.
We found 240 (81%) of the 297. 26 This attrition is an identification concern, but there is little evidence of biased attrition. Excess validation attrition (those who were surveyed but not validated) was not robustly associated with baseline characteristics (see Appendix A.3).
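A stratified draw of this kind can be sketched in Python as below. This is a hedged illustration rather than the study's selection code: the function name, the arm labels, and the rule of rounding the target share within each stratum are assumptions, and the sketch omits the survey blocks used in the field.

```python
import random


def select_for_validation(respondents, share, seed=0):
    """Select a share of respondents within each treatment stratum, without replacement.

    respondents: list of (respondent_id, stratum) pairs.
    Each respondent receives a uniform random draw; within a stratum, those
    with the smallest draws are selected.
    """
    rng = random.Random(seed)
    by_stratum = {}
    for rid, stratum in respondents:
        by_stratum.setdefault(stratum, []).append(rid)

    chosen = []
    for stratum in sorted(by_stratum):
        ids = by_stratum[stratum]
        draws = sorted((rng.random(), rid) for rid in ids)
        k = round(share * len(ids))  # assumed rounding rule
        chosen.extend(rid for _, rid in draws[:k])
    return chosen


# Hypothetical roster: 100 surveys in each of four arms, 7.4% selected per arm.
arms = ["control", "therapy", "cash", "both"]
roster = [(f"{arm}-{i}", arm) for arm in arms for i in range(100)]
sample = select_for_validation(roster, share=0.074)
```

Stratifying by treatment group keeps the validated share the same in every arm, which matters because the object of interest is whether misreporting differs by arm.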

Statistical power
In order to minimize the confidence intervals surrounding any treatment-measurement error correlation, we chose the sample size that maximized the number of interviews we felt qualified validators could manage logistically. 27 Post hoc calculations of statistical power confirm the estimates we made at the design stage. With a sample of 240, we can detect general over- or under-reporting greater than 17% of the survey mean (14% of the "true" validated mean). 28 Because each treatment arm is a subsample, however, we cannot precisely measure the effect of treatment on misreporting: it is difficult to detect effects smaller than 33% of the survey mean (28% of the validated mean). Thus we are principally interested in the sign and magnitude of the treatment effect on misreporting by treatment group.
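The logic of a minimum detectable effect calculation can be illustrated with a generic textbook formula for comparing two groups. The Python sketch below is not the calculation behind the figures above: the function name, the equal split between groups, and the unit standard deviation are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(sd, n, share_treated=0.5, r_squared=0.0,
                              alpha=0.05, power=0.80):
    """Textbook MDE for a difference in means with a two-sided test.

    sd: outcome standard deviation; n: total sample size;
    r_squared: share of outcome variance explained by controls.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    standard_error = sd * sqrt((1 - r_squared) / (n * share_treated * (1 - share_treated)))
    return multiplier * standard_error


mde_240 = minimum_detectable_effect(sd=1.0, n=240)
mde_480 = minimum_detectable_effect(sd=1.0, n=480)
shrinkage = 1 - mde_480 / mde_240  # proportional fall in the MDE from doubling n
```

Because the standard error falls with the square root of the sample size, doubling the sample shrinks the minimum detectable effect by a factor of one over the square root of two, or a little under 30 percent.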

Coding validated data
Validators were unaware of the respondents' survey responses, and formed their own opinions (based on the evidence collected) about whether respondents engaged in the six activities during the time period captured by the quantitative survey. Every coding recommendation was then discussed with and vetted by one of the authors.
25 For each pair of survey rounds, study participants were randomly divided into blocks (e.g. 1, 2, 3, 4), and block 1 study participants were surveyed before block 2, and block 2 before block 3, etc. Within each block we randomly selected validation subjects using a computer-generated uniform random variable. The selection was performed without replacement in a given pair of survey rounds (e.g. the short-term endline surveys in a given phase), but sampling was performed with replacement across survey rounds. Twenty subjects were validated in more than one round.
26 We could not find 15 even for the endline survey. We could not validate a further 42 because they were difficult to find even immediately after the survey or (more commonly) because they lived a long distance away. In general, we surveyed respondents who had moved far out of Monrovia, but we were unlikely to validate them because of the time, expense, and opportunity cost.
27 In general, the validation sample was a balanced subsample of the full sample. Power calculations, based on roughly the first 60 validator interviews, indicated that there was a modest degree of underreporting of all behaviors, sensitive and non-sensitive, but that the correlation between treatment status and measurement error was uncertain: across outcomes it varied in sign and magnitude, but was about zero on average. Thus the chief advantage of maximizing the sample conditional on time available was to shrink the confidence interval to build confidence in our method and the main outcomes of interest. Further validation was mainly limited by the number of validators we felt could be trained and supervised.
28 We calculated this minimum detectable effect (MDE) using a two-sided hypothesis test with 80% power at a 0.05 significance level, using baseline and block controls when calculating the R-squared statistic. We calculated an MDE for both the 0-2 expenditures index and the 0-4 sensitive behaviors index. The expenditures index had a mean of 0.82 in the survey and an MDE of 0.13 for general over- and under-reporting and 0.29 for a treatment effect on misreporting. The sensitive behaviors index had a mean of 1.12 in the survey and an MDE of 0.2 for general over- and under-reporting and 0.36 for any treatment effect on misreporting. We estimate that doubling the sample size would have increased power by about a third.
Notes: The proportion selected in each round was principally a function of logistical feasibility (e.g. number of available staff), and in some none were selected. As procedures became more familiar and staff more experienced, more could be done over time. The percentage validated in the treatment group includes any treatment (cash, CBT, or both).
A core part of the validator training included logical reasoning, supporting reasoning with evidence, and writing this down in a clear and structured manner. After each visit, validators made written notes about the relevant data collected, including evidence to support their conclusions, on a standardized form. At the conclusion of the four visits, the validator coded six indicators, one for each behavior, where "1" meant that he had relatively direct evidence that the respondent engaged in the behavior during the recall period, and "0" otherwise. 29 Validators recorded an average of 1.35 "major" pieces of evidence per respondent per behavior to support their coding decision sheets. This was typically the most persuasive piece or pieces of evidence rather than all evidence collected. 30 Table F.2 reports evidentiary methods by behavior.
In general, the validators used some form of direct or indirect questioning: a direct admission of the behavior, or persuasive statements that the respondent did not engage in the behavior. The validators witnessed or found direct evidence of the behavior in only a fifth of cases, and had third party verification in about 6% of cases. In any event, witnessing or third party verification were not sufficient evidence for a final coding. For instance, witnessing had to be followed by questions confirming that the respondent also engaged in the behavior in the two weeks prior to the survey. This accounts for most of the cases where there was more than one piece of evidence highlighted.
In general, the patterns of evidence are fairly commonsensical. Witnessing is limited to observable behaviors such as marijuana, gambling, homelessness, and phone charging. Stories and scenarios where the respondent is invited to comment or discuss are especially common for the most sensitive subject, stealing. Indirect questioning is most common for everyday topics such as homelessness ("Is this your house?") and phone charging ("I need to charge my phone. Where do you usually charge yours?").
Notes: Direct questions imply the validator asked the respondent directly about his engagement in the activity. Indirect questions imply the validator brought up the subject in general conversation (Where do you live? What do you do to make money?). Stories and scenarios are a form of indirect questioning where the respondent is invited to comment. Witnessing or found evidence implies the validator saw the respondent engaging in the activity in question or found physical evidence that the respondent recently engaged in the activity. Third party accounts imply the validator asked the family and friends of the respondent whether or not he engaged in the activity. Other or unclear methods include a handful of cases of unprompted information from the respondent, and also cases where the behavior could be inferred from other knowledge. Mainly it implies that coding was inconclusive or incomplete but is likely a form of questioning.
29 Over the course of the exercise, different measures offered different experiences and lessons. Because of its relative frequency and visibility, we suspect marijuana use was the easiest to directly observe. But validators found other behaviors straightforward to discuss in conversation. In the survey and (especially) the validation, phone battery charging led to the most confusion: in particular, did simply charging one's phone count, or did only paying to charge one's phone count? Paid charging was the focus of the survey question (it appeared in an expenditure survey module), but we were concerned that the validators would use a more expansive definition. We attempted to mitigate such differences through trainings and regular discussions on the coding.
Homelessness also proved somewhat challenging to measure and validate, as we discovered its definition is subjective. Circumstances arose that were somewhat ambiguous, such as having no home of one's own but regularly sleeping on a friend's floor or in an acquaintance's market stall. To account for the potential variability in perceptions of homelessness, validators were instructed to include as much information as possible about respondents' living situations in their summary reports. The authors then worked with validators to code a somewhat broad definition of homelessness that included any ambiguous circumstances. Prior to analysis, it was not clear whether survey respondents applied the same definition, and hence we err on the side of finding underreporting in the survey.
30 We do not have complete paper records of all evidence collected, and so the 1.35 pieces of evidence is probably an understatement of the full amount of evidence.

Limitations of the approach
While we think, based on our experiences, that this validation exercise gave enough time to gather detailed, accurate information and fostered trust and frankness, there are nonetheless limitations to this approach.
1. Potential disruption. The validators' presence, interactions, and conversations may be intrusive and might disrupt respondents' daily activities, thereby altering the findings. To mitigate this risk, validators wore clothes that would blend in with their respondent's environment, and typically accompanied and assisted respondents in their activities as appropriate (e.g. helping a scrap metal collector scavenge).
2. Differences in recall periods. The validation occurred after the time period about which the survey questions had asked, and validators or respondents could have made errors about the relevant window of time (e.g. homelessness could have been observed the week after the survey and incorrectly attributed to the survey's recall period). This is most likely a source of random measurement error.
3. Inconsistent questions. The survey and validation questions might have been interpreted differently, making it difficult to compare results. As discussed above, phone charging and homelessness proved somewhat difficult to measure consistently. We used close consultations and reviews of the data, and focus groups with survey and validation staff, to maximize consistency.
4. Reverse Hawthorne effect. Training validators to look for certain behaviors could lead them to overreport those behaviors (akin to the problem of "when you have a hammer everything looks like a nail"). This reverse Hawthorne effect would probably be more of a risk if the validation method relied on passive observation. Rather, validation involved active discussion and (usually) a direct admission of the behavior. Also, one of the authors reviewed and discussed the evidence for every subject with the validator.
5. Increasing social desirability bias. In principle the participant observation method, by building rapport, could lead to a different source of measurement error by (for example) increasing social desirability bias. Our strong sense is that the opposite is true, that trust and rapport reduced the bias, but this is a subjective interpretation and not independently verifiable.
6. Consistency bias. In principle, respondents could recall their survey response and try to remain consistent despite trust-building. This could motivate randomizing the order of validation and survey in the future.
7. Non-blinded validators. The researcher is not immune from bias in qualitative research (LeCompte and Goetz, 1982;LeCompte, 1987). We are especially concerned with any bias correlated with treatment. While validators weren't given the subject's treatment status, it's possible and even likely that this could come up during the extended conversations. Thus there is a danger that the validators' biases will be correlated with treatment. The trust-building and the preference for direct admission of the behavior were intended to mitigate this risk, but it still remains.
Most importantly, it seems unlikely that validators would commit most of these errors differentially across study arms. Misreporting correlated with treatment is still a risk under the consistency bias and non-blinded limitations, but the in-depth focus on a handful of questions, time invested, and trust-building is designed to counteract these biases as much as possible. If so, the qualitative validation method may be most useful at building confidence in estimated treatment effects.
Finally, like any qualitative work, this is not an off-the-shelf tool. To select and refine the variables, recruit and train validators, and monitor quality of the data requires the researcher to have some familiarity with the context and population and at least basic experience in qualitative data collection.

Replicability of the approach
There are three reasons to think that this method could be replicated in other developing country field experiments and observational analysis using surveys. First, the expertise needed to implement the method effectively exists in most countries. Indeed, it should be considerably simpler to implement outside than inside Liberia. After fourteen years of civil war, and with one of the lowest human development indices in the world, Liberia has very low local research capacity, even compared to other poor and post-conflict states.
Second, most social scientists are nearly as well prepared to design and implement the approach as they are a new survey instrument or measure. Like any measure or method, it takes local knowledge, care, and extensive pretesting to develop a credible approach, and can benefit from someone with expertise in the subject area. In our case, one of the field research managers had some background in qualitative work and quality assurance, which we believe improved the quality of training and selection of the validator staff.
Third, the cost of the data collection is not necessarily large relative to many field experiments or large-scale panel surveys. In this instance, the fixed cost of startup was primarily in the recruitment and training of the small number of validators, approximately 2 to 3 weeks of work. We estimate the marginal cost of validation was roughly $80 per respondent, mainly in wages and transport. By comparison, the marginal cost of surveying a respondent was roughly $70. 31 While this method is considerably more expensive than survey experiments, it is more in line with the depth and cost of commonplace efforts to improve consumption measurement through the use of diaries or physical measurement. 32 For crucial measures in large program evaluations, or for statistics informing major policies, the cost is small relative to the intervention, larger study, or larger purpose. For instance, as a proportion of total expenditures on the study, this validation exercise cost under 3% of all research-related costs, and less than 1-2% of program plus research costs.

F.2 Further analysis
Misreporting levels
Table F.3 reports our proxy of survey over-reporting: the simple survey-validation differences, with p-values from a t-test of the difference from zero. Negative values indicate survey under-reporting, assuming, of course, that the validator measure is more accurate. As noted above, we have the statistical power to detect differences greater than about 17% of the survey mean.
Overall, gambling seems to be slightly underreported in every treatment arm, and highly underreported by men in the control and cash only groups. For instance, 33% of the cash only group admitted to gambling during validation, compared to 13% during the survey. Some of this underreporting could be due to ambiguous behaviors being coded as gambling in validation interviews but not in the survey. But the fact that underreporting is smaller in the therapy arms suggests that the underreporting is not an artifact of different definitions, but rather reflects a strategic response to treatment status.
31 Both figures were driven by the fact that it typically took one to two days of searching to find each respondent for surveying, plus the time to survey itself. Both surveying and validating in Liberia were expensive by the standards of household surveys, largely because of the cost of operating in a fragile, post-conflict state and the great difficulties in tracking such an unstable population.
32 In one extreme example, in the India NSS consumption survey, enumerators physically measure the volume of all food consumption Group (2003
Notes: Columns 1 to 8 report the simple mean differences in the survey and validation measures for the full sample and for each treatment arm, along with p-values for a t-test of whether the mean is different from zero. We bold p-values ≤0.05.
If we look at stealing, marijuana use, and homelessness, however, none of the survey-validation differences are statistically significant. There is possibly some slight underreporting of drug use and slight over-reporting of stealing, but the magnitudes are generally small in the sense that they are less than 10% of the survey means reported in Table 9. The sample size is small, however, and so many of these differences are not precisely estimated.
We see much stronger evidence of underreporting of expenditures in the survey. The difference for both expenditures is -0.27 in the full sample (Table F.3, Column 6). This difference is large, about a third of the survey mean reported in Table 9. Expenditure underreporting is largest for the video club measure, but both expenditures appear to be underreported. Interestingly, the mean differences appear to be smaller and less statistically significant if the men received one of the treatments. We return to these differences across treatment arms below.
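
The survey-validation differences described in this subsection can be illustrated with a minimal sketch. It assumes paired 0/1 indicators for the same respondents; the function name and the data are hypothetical, and the p-value uses a normal approximation rather than the exact t distribution, so it is only indicative in small samples and is not the procedure used for Table F.3.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_difference_test(survey, validation):
    """Mean survey-minus-validation difference and a test of whether it is zero.

    Negative mean differences indicate survey under-reporting, treating the
    validation measure as the more accurate one.
    """
    diffs = [s - v for s, v in zip(survey, validation)]
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    if se == 0:
        raise ValueError("differences have zero variance; test is undefined")
    t_stat = d_bar / se
    # Assumption: normal approximation to the two-sided p-value.
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return d_bar, t_stat, p_approx


# Hypothetical data: 1 = behavior reported (survey) or coded (validation).
survey = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
validation = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]

d_bar, t_stat, p_approx = paired_difference_test(survey, validation)
print(f"mean difference = {d_bar:.2f}, t = {t_stat:.2f}, approx p = {p_approx:.3f}")
```

With real data one would use the exact t distribution and, for comparisons across arms, run the same calculation within each treatment group.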

Patterns of survey under- and over-reporting
In our validation exercise, there may be cases where the validation technique did not report a behavior that was reported in the survey. Table F.4 reports the number of cases where the survey and validation measures do not agree, divided into cases of survey over- and under-reporting relative to the validation measure. Over-reporting is driven by stealing, gambling, homelessness and going to the video club. Over-reporting is limited for marijuana use and phone charging, which are some of the least ambiguous and most habitual activities.
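
As a rough illustration of the tabulation described above, the following sketch classifies each respondent-behavior pair as agreement, survey over-reporting, or survey under-reporting relative to the validation measure. The labels, function names, and data are hypothetical and are not taken from Table F.4.

```python
from collections import Counter


def classify(survey, validation):
    """Classify one pair of 0/1 indicators relative to the validation measure."""
    if survey == validation:
        return "agree"
    # Survey says 1 but validation says 0: survey over-reporting.
    return "survey_over" if survey > validation else "survey_under"


def tabulate(pairs):
    """Count agreement and the two kinds of disagreement."""
    return Counter(classify(s, v) for s, v in pairs)


# Hypothetical (survey, validation) pairs for one behavior.
pairs = [(1, 0), (0, 1), (1, 1), (0, 0), (0, 1)]
counts = tabulate(pairs)
print(dict(counts))
```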
Another way to understand this point is to rerun equation 3 in the paper but omit block fixed effects and restrict β1 = 0 and β3 = 0. In this case, β0 is an estimate of survey over-reporting. Table F.5 reports these results, Panel (a) with the restrictions β1 = 0 and β3 = 0 and Panel (b) without, for comparison. Looking at the sensitive behaviors in Panel (a), we see evidence of survey over-reporting ranging from roughly 12 to 15%. Moreover, β0 and β2 are relatively similar in both Panels (a) and (b), suggesting that treatment has little effect on this survey over-reporting.
We do not know for certain why up to 15% of people would report a sensitive behavior in the survey but not in the validation exercise, but there are several plausible explanations. First, survey respondents may not have considered the "last two weeks" recall period carefully, and reported behavior over a wider range. Validators were trained to be stricter with the recall window. Second, although we tried our best to maintain consistent definitions across the survey and the validation exercise, validators might have used more restrictive definitions of the behavior in question. Finally, validators may simply have been more conservative in their coding of these behaviors, or set too high a bar for certainty.
The key, however, is that there is no evidence that misreporting is associated with treatment status, which itself is the core finding from the general analysis of the validation exercise.

Treatment effects in the overall sample versus the validation sample
In this section we investigate whether the treatment effects observed in the validation sample are similar to those observed in the full sample. Panel (a) of Table F.6 takes the survey measures of our six validated outcomes, and reports ITT estimates in the validated sample (N=238) and the full sample. Panel (b) takes the validator measures of our six outcomes, and reports ITT estimates in the validated sample (N=238). Although the validation sample only has 238 observations, and so standard errors are large, the estimated treatment effects are qualitatively similar across all three sets of regressions.

F.3 Adjusted treatment effects
We estimate the effect of each treatment on survey over-reporting, in Table F.7. These estimates effectively take the simple survey-validation differences in Panel A of Table 10 and estimate the difference across treatment arms, adjusting for baseline covariates as well as block fixed effects. We use these to calculate an adjusted treatment effect.
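
One way to read this adjustment is sketched below. The notation is ours and is an assumption, not the paper's: $y_i^{S}$ is the survey measure, $y_i^{V}$ the validation measure, $\theta^{S}$ the intent-to-treat effect estimated on the survey measure, and $\delta$ the estimated effect of treatment on the survey-validation difference.

```latex
% Assumed notation (not from the paper): the survey measure is the
% validation measure plus a reporting error e_i.
\begin{align*}
  y_i^{S} &= y_i^{V} + e_i, \\
  \theta^{S} &= \text{ITT effect on } y_i^{S}, \\
  \delta &= \text{ITT effect on } \left( y_i^{S} - y_i^{V} \right) = \text{ITT effect on } e_i, \\
  \theta^{\mathrm{adj}} &= \theta^{S} - \delta .
\end{align*}
```

Under this reading, if treatment makes respondents under-report a behavior more than the control group does, $\delta$ is negative and the adjusted effect $\theta^{\mathrm{adj}}$ is less negative than the survey-based estimate. When both effects are estimated on the same sample with the same specification, $\theta^{\mathrm{adj}}$ coincides with the treatment effect on the validation measure, by linearity of least squares.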
First, the results imply that the adjusted treatment effect of therapy and cash on sensitive behaviors overall is no lower than what we estimate with self-reported survey data, and may even be larger (Column 1). This holds true for each of the individual sensitive behaviors, save marijuana use. Despite the large standard errors introduced by the small validation sample, the adjusted treatment effect on all sensitive behaviors is larger and significant at the 1% level.
Meanwhile, the underreporting of gambling does not have a statistically significant association with treatment. Those who received cash alone underreported gambling to the surveyors more often than control group members, and so the measurement error in gambling is probably a combination of a general desirability bias as well as one correlated with treatments. A larger sample size would be needed to separate these more precisely.
In contrast, the underreporting of expenditure behaviors in the survey (seen in Table F.3 above) implies that the short term increase in survey-based expenditures due to cash could be due to measurement error correlated with treatment. The adjusted treatment effect of therapy plus cash is generally negative but not statistically significant (Column 6). We see a similar pattern with another expenditure-related item, homelessness, in Table F.7: the survey-reported decline in homelessness tends to disappear with adjustment.