Policy Research Working Paper 11143

Luck of the Draw
The Causal Effect of Physicians on Birth Outcomes

Christian Posso
Jorge Tamayo
Arlen Guarin
Estefania Saravia

Development Economics
Development Impact Group
June 2025

A verified reproducibility package for this paper is available at http://reproducibility.worldbank.org.

Abstract

This paper estimates the effect on birth outcomes of a mother’s being treated by more-skilled versus less-skilled physicians, by exploiting a Colombian government program that randomly assigned newly graduated physicians to local health centers. It estimates the impact on 255,089 children whose mothers received care in the local health centers using administrative data from the program, local health centers’ vital statistics records, and records from physicians’ mandatory graduation exams. The findings show that mothers treated at local health centers with more-skilled physicians were 9.14 percent less likely to give birth to an unhealthy baby, potentially because the more-skilled physicians better targeted care toward more-vulnerable mothers.

This paper is a product of the Development Impact Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at aguarin@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues.
An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Luck of the Draw: The Causal Effect of Physicians on Birth Outcomes

Christian Posso ⓡ, Jorge Tamayo ⓡ, Arlen Guarin ⓡ, Estefania Saravia ⓡ ∗

Keywords: Physicians’ skills, birth outcomes, experimental evidence

JEL Codes: H51, I14, I15, I18

∗ The authors’ names are listed in random order. Posso: Banco de la República de Colombia (email: cpossosu@banrep.gov.co); Tamayo: Harvard University, Harvard Business School, Digital Reskilling Lab (email: jtamayo@hbs.edu); Guarin: Development Impact (DECDI), World Bank (email: aguarin@worldbank.org); Saravia: University of California, Los Angeles (email: esaravia@ucla.edu). We are grateful to Achyuta Adhvaryu, Maria Aristizabal, Carolina Arteaga, Guadalupe Bedoya, Francesco Bogliacino, Leonardo Bonilla, David Card, Juan Esteban Carranza, Maíra Coube, Janet Currie, Kaveh Danesh, Margarita Gafaro, Robert Gonzalez, Marcus Holmlund, Hilary Hoynes, Raymond Kluender, Rem Koning, Juliana Londoño-Vélez, Edward Miguel, Anant Nyshadham, Paul Rodriguez, Daniel Rogger, Emmanuel Saez, Molly Schnell, Benjamin Scuderi, Jesse Shapiro, Mauricio Villamizar, Christopher Walters, Danny Yagan, the editor and coeditor, four referees, and numerous seminar participants for helpful comments and advice.
We thank Manuela Cardona, Leidy Gomez, Silvia Granados, Nicolas Mancera, Brayan Pineda, Daniel Marquez, Gabriel Suarez, Santiago Velasquez, and Carolina Velez for excellent research assistance. Arlen gratefully acknowledges financial support from the University of California, Berkeley, Opportunity Lab. We also thank the Colombian Ministry of Health, the Departamento Administrativo Nacional de Estadística (DANE) and the Instituto Colombiano para la Evaluación de la Educación (ICFES) for providing access to the data and for insightful discussions. The findings, interpretations, and conclusions expressed in this paper are solely those of the authors and do not necessarily reflect the views of the World Bank and its affiliated organizations, the Executive Directors of the World Bank, the governments they represent, or of the Banco de la República and its Board of Directors.

1 Introduction

Inequality can originate as early as the prenatal period. These critical months shape children’s health at birth, which has been shown to predict future abilities and health trajectories beyond what genetics alone can explain (Almond et al., 2005; Black et al., 2007; Currie, 2011; Currie and Almond, 2011). While much research on the determinants of birth outcomes has focused on maternal health and families’ socioeconomic conditions (Currie, 2011; Currie and Schwandt, 2016b), recent evidence suggests that health care providers may also play a significant role (Okeke, 2023). This evidence shows that physicians have differential effects on babies compared to other practitioners with substantially less training. In turn, this evidence raises the question of whether physicians with similar training may nevertheless have differential effects on children’s birth outcomes because of differences in their level of medical skill.

In this paper, we provide causal evidence of the role that skilled physicians play in children’s health at birth.
Prior research has shown that physicians significantly impact patients’ health (Chan et al., 2022; Chen, 2021; Currie and MacLeod, 2017, 2020; Das and Hammer, 2005) and that poor health at birth has long-lasting adverse impacts on an individual’s future outcomes (and the outcomes of the next generation), including earnings, education, and disability (Almond et al., 2018; Currie, 2011; Persson and Rossin-Slater, 2018; ?). If more-skilled physicians have differential impacts on children’s health compared to less-skilled physicians, the health of babies and their future outcomes could both be boosted by policies aimed at better assigning or targeting these more-skilled physicians to populations with greater needs.

The lack of causal evidence regarding the impact of skilled physicians on birth outcomes is not surprising because answering this question poses substantial empirical challenges: it requires both accounting for the selection bias associated with the match between physicians and hospitals (Doyle Jr et al., 2010) and overcoming the difficulty of obtaining reliable measures of physicians’ medical skills.1 We overcome these challenges by exploiting a policy experiment conducted in Colombia. In this national-level program, 2,126 recently graduated physicians were randomly assigned to 618 local health centers (LHCs) through a within-state random lottery. This feature of the program’s design enables us to bypass the selection bias issue and isolate the impact of physicians on children’s health at birth.2 As a proxy for their medical skills, we match these physicians with their scores on the health-specific modules of the mandatory exams they took just before graduating from college.

1 There is an extensive literature on positive assortative matching (PAM) that shows that companies and high-productivity workers match together (for example, Abowd et al., 1999; Becker, 1973; Kremer, 1993; Roy, 1951; Shimer and Smith, 2000; Woodcock, 2008).

2 These LHCs are equipped to provide primary care, emergency care, and outpatient and inpatient care, including for childbirth. They are typically referred to as “hospitals” despite being smaller and having less capacity than hospitals in more developed urban settings.

Several features of our context are conducive to accomplishing our study’s goals. First, Colombian regulations mandate that medical school graduates dedicate the first year of their careers to the national Mandatory Social Service (Servicio Social Obligatorio, or SSO) program. This program randomly assigns new physicians to LHCs in the state where they apply. Because LHCs and physicians are assigned to one another without regard to their characteristics, physicians with different levels of skill encounter similar facilities, administrative resources, and health staff. By comparing birth outcomes across LHCs, we can estimate the causal effect of a mother’s being treated at an LHC that was randomly assigned a more-skilled cohort of SSO physicians on children’s health at birth.

Second, we combine several rich and granular administrative records in Colombia, which allow us to observe the LHCs where physicians were assigned, obtain a proxy for their skills, and measure LHC outcomes as performance measures. Specifically, we collect data from the reports published by Colombia’s Ministry of Health after the SSO lottery draws that took place between 2013 and the third quarter of 2014. Further, we use individual records from mandatory college graduation exams to identify more-skilled physicians. Finally, we link the LHC to which physicians were randomly assigned to the national vital statistics records (VSRs), from which we obtain birth outcomes and maternal sociodemographic characteristics.
The random assignment of physicians to LHCs allows us to satisfy the identification assumption that the cohorts of SSO physicians assigned to LHCs are mean independent of unobservable variables associated with the LHCs. In our setting, some mothers were exposed to multiple cohorts of physicians during their pregnancies. To isolate the causal variation associated with the random assignment, we estimate an instrumental variable (IV) model. In this model, we use the skill level of the first SSO cohort to which a mother was exposed during her pregnancy as an instrument for the average skill level of all SSO cohorts she was exposed to over the course of her pregnancy. The key identifying assumption behind our IV approach is that, conditional on the design fixed effects of the first cohort, the average graduation exam score of the first physicians’ cohort predicts the average exam score of all physician cohorts to which the mother was exposed and affects birth outcomes only through this channel. To make the interpretation straightforward, all the results are expressed in standard deviations of the skill measure. Our local average treatment effect (LATE) estimates indicate that more-skilled physicians improve birth outcomes. We find that mothers who were treated at an LHC that had been randomly assigned a cohort of SSO physicians whose graduation exam scores were one standard deviation higher were 9.14 percent less likely to give birth to an unhealthy baby. 
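The IV logic described above can be illustrated with a minimal sketch. It is not the authors' estimation code: it assumes a single instrument and no covariates other than group fixed effects, and all variable names and the toy data are hypothetical. Here `z` stands in for the skill of the first cohort a mother was exposed to, `x` for the average skill of all cohorts she was exposed to, `y` for the birth outcome, and `groups` for the design fixed effects, which are absorbed by within-group demeaning.

```python
from collections import defaultdict


def demean_within(values, groups):
    """Subtract each group's mean, which absorbs group-level fixed effects."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def iv_estimate(y, x, z, groups):
    """Just-identified IV estimate of the effect of x on y, instrumenting x with z."""
    y_t, x_t, z_t = (demean_within(v, groups) for v in (y, x, z))
    cross_zx = sum(zi * xi for zi, xi in zip(z_t, x_t))  # first-stage variation
    if cross_zx == 0:
        raise ValueError("instrument is uncorrelated with the endogenous variable")
    cross_zy = sum(zi * yi for zi, yi in zip(z_t, y_t))  # reduced-form variation
    return cross_zy / cross_zx


def ols_estimate(y, x, groups):
    """Within-group OLS slope, shown only for comparison with the IV estimate."""
    y_t, x_t = demean_within(y, groups), demean_within(x, groups)
    return sum(xi * yi for xi, yi in zip(x_t, y_t)) / sum(xi * xi for xi in x_t)


# Hypothetical toy data: u is an unobserved confounder that moves both x and y
# but is orthogonal to the instrument z within each group.
groups = ["A"] * 4 + ["B"] * 4
z = [-1.0, 1.0, -1.0, 1.0] * 2
u = [-1.0, -1.0, 1.0, 1.0] * 2
fixed_effect = {"A": 10.0, "B": -5.0}
x = [zi + ui for zi, ui in zip(z, u)]
y = [2.0 * xi + 3.0 * ui + fixed_effect[g] for xi, ui, g in zip(x, u, groups)]

iv = iv_estimate(y, x, z, groups)   # recovers the true slope of 2.0
ols = ols_estimate(y, x, groups)    # biased by the confounder (3.5)
```

In this constructed example the instrument is valid by design, so the IV ratio recovers the true coefficient while OLS does not. In real data the estimate would also need standard errors and the covariates of the full specification.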
We define a baby as unhealthy if at least one of the following three conditions is satisfied: its birth weight is low (below 2,500 grams), it is born prematurely (before 37 weeks of gestation), or its Apgar score is low (below 7).3 The effect of treatment by more-skilled physicians is consistent across each of these measures of health at birth: we find a 9.57 percent decrease in the probability that an infant has a low birth weight, a 10.99 percent decrease in the probability that an infant is born prematurely, and an 11.56 percent decrease in the probability that an infant has a low Apgar score.4 Our findings are consistent with evidence from related studies showing that variations in the quality or availability of health care providers significantly impact patient outcomes. For example, Chen (2021) finds that shared work experience among physicians reduces mortality rates, and Currie and Gruber (1996) show that increased access to Medicaid for pregnant women improves infant health outcomes.

To assess the internal validity of our identification strategy, we implement two tests. First, we assign a placebo treatment to babies born before the arrival of the SSO cohorts in our sample. The random assignments that we use in our main specification took place in 2013 and 2014. We run placebo tests similar to our main specification but using outcomes for children born in the same LHCs from 2009–2012, the four years prior to the random assignment. We find that the treatment generates precisely estimated zeros. Second, we show evidence of the actual randomness of the assignment by testing for any correlation between physicians’ skill levels and LHC, municipal, and demographic characteristics using balance tests on pretreatment (2010–2012) and concurrent predefined characteristics during the SSO cohort’s assignment.
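The composite outcome defined at the start of this passage can be written as a short function. This is a sketch of the definition only: the argument names are hypothetical, and the text does not specify which Apgar measurement is used or how missing values are handled, so neither is modeled here.

```python
def is_unhealthy(birth_weight_g, gestation_weeks, apgar_score):
    """Return True if at least one of the three adverse conditions holds."""
    low_birth_weight = birth_weight_g < 2500   # below 2,500 grams
    premature = gestation_weeks < 37           # before 37 weeks of gestation
    low_apgar = apgar_score < 7                # Apgar score below 7
    return low_birth_weight or premature or low_apgar
```

Because the thresholds are strict inequalities, a baby weighing exactly 2,500 grams, born at exactly 37 weeks, with an Apgar score of exactly 7 is classified as healthy.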
3 Low birth weight is one of the key measures of health at birth studied in the literature (Currie, 2011). Prematurity is highly correlated with low birth weight, mortality, and several health complications (Butler et al., 2007; Currie and Walker, 2011; Taylor et al., 2001; Veddovi et al., 2001). The Apgar score is also frequently used in the literature as an indicator of health at birth (Almond et al., 2010; Ehrenstein, 2009; Lin, 2009; Moore et al., 2014).

4 Unfortunately, during our analysis period, we could not test the impact of physicians on mortality due to data issues. First, the variable indicating the number of weeks of gestation is missing from birth records for a significant portion of fetal and neonatal deaths. This omission prevents us from determining the gestation period’s start for these deaths, thereby hindering our ability to precisely identify exposure to physician cohorts, as we can for births. Additionally, fetal and neonatal records frequently lack information about the LHC and about mothers’ and children’s covariates. Given these limitations, we conduct a cohort-level rather than a child-level exercise. We quantify the number of fetal and combined fetal plus neonatal deaths during the time a cohort was assigned to an LHC. This quantification disregards how long the gestation period was exposed to the cohort and is based on the data with all aforementioned limitations. While these results are expected to be subject to measurement error attenuation bias, we still observe a negative, albeit not statistically significant, point estimate, which aligns with our main results.

We recognize that focusing solely on graduation exam scores as a proxy for physicians’ level of medical skills may overstate their importance while understating the relevance of other correlated characteristics. We therefore take advantage of the random assignment to obtain an estimate of physicians’ relative value-added (Angrist et al., 2017; Chetty et al., 2014; Kane and Staiger, 2008). Following Jackson (2018) and Fletcher et al. (2014), our shrunken value-added result implies that assigning an LHC a cohort of physicians at the 75th percentile of the quality distribution, versus the 25th percentile, would decrease the likelihood of a baby being unhealthy by approximately 0.08 standard deviations. Using our unbiased value-added estimates, we study the relationship between the physicians’ value-added and several observable characteristics, such as their scores on the health-specific modules of the mandatory graduation exam (our skill measure), proxies for the quality of the medical program they attended, family socioeconomic characteristics, and gender. The results suggest that the health-specific graduation exam scores are the variable with the highest power for predicting a physician’s skill level measured as relative value-added. In contrast, the other characteristics have no significant relationship to their value-added.

How might more-skilled physicians contribute to improved birth outcomes? To shed light on potential mechanisms, we first analyze several heterogeneous effects across groups of mothers with different characteristics. Although the effects of being treated by more-skilled physicians are slightly more pronounced among first-time mothers, teenage mothers, mothers with low education, and single mothers, the differences between groups are not statistically significant. Furthermore, we examine heterogeneity across infants and LHCs. First, we estimate effects separately for male and female infants. It is commonly observed in the literature that male fetuses, as well as male infants, tend to be more susceptible to health shocks than females (Eriksson et al., 2010; Kraemer, 2000; Naeye et al., 1971; Pongou et al., 2017).
To the extent that LHCs with more-skilled physicians improve children’s health at birth, they may help mitigate adverse shocks in utero. We find that the reduction in the probability of being born unhealthy is particularly pronounced for male infants, but the difference is not statistically significant between male and female infants. Second, we explore heterogeneity related to the proportion of SSO physicians within the LHC. We split the sample into LHCs with a high and low share of SSO physicians. While the point estimate is larger for LHCs with a higher share of SSO physicians, the difference between the two groups is not statistically significant.

Having analyzed potential heterogeneous effects, we explore a mechanism through which physicians may improve health at birth: prenatal checkups. According to WHO (2016) and the Colombian government (Gomez et al., 2013), better and more frequent prenatal care can improve the health of mothers and their children.5 We follow the standard recommendations of the WHO (2016) in 2013, the first year of the records that we use, and define “adequate prenatal care” as having at least four checkups during pregnancy.6 We find that more-skilled physicians, on average, do not schedule more prenatal checkups than less-skilled physicians.7 This means that more-skilled physicians do not improve health at birth by increasing the number of prenatal checkups they offer.

5 Better and more frequent prenatal care improves maternal health because, during a prenatal checkup, pregnant women are screened and treated to avoid complications, preterm births, and other problems. Additionally, pregnant women are given critical information on nutrition, diet, and general safety practices, which has been shown to play a crucial role in in utero infant growth (Amarante et al., 2016; Kramer, 1987). Furthermore, in Colombia, the Ministry of Health requires that physicians carry out prenatal checkups (Gomez et al., 2013). As a result, physicians are responsible for prenatal care, and they are the professionals who attend 98 percent of deliveries.

6 The data we have access to only record the number of prenatal checkups within ranges, preventing a more flexible use of this variable. In our sample, 87 percent of mothers have at least four checkups.

7 Carrillo and Feres (2019) find no evidence of an increase in prenatal care when physicians were replaced by nurses.

Without increasing the number of prenatal checkups, more-skilled physicians might improve birth outcomes by better targeting these checkups. We therefore test whether more-skilled physicians target prenatal checkups toward more-vulnerable mothers (measured as those predicted to be more likely to give birth to an unhealthy baby) without compromising the care of lower-risk mothers. We use several machine learning techniques to generate predictions of the probability that a mother will give birth to an unhealthy baby on the basis of a set of LHC and mother characteristics, such as indicators of first-time mothers or teenage mothers, that are usually salient to physicians at the time of prenatal care. Regardless of the predictive technique we use, the results show that lower-risk mothers are not significantly more likely to have at least the suggested number of prenatal checkups if they see more-skilled physicians. This is consistent with the idea that more-skilled physicians do not compromise the care of lower-risk mothers. However, physicians do seem to target more prenatal checkups toward more-vulnerable mothers. We likewise show that the effects on birth outcomes of being treated at an LHC with more-skilled physicians are particularly pronounced among mothers with an ex ante high predicted probability of giving birth to an unhealthy baby.
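To make the idea of an ex ante predicted risk concrete, the sketch below fits a plain logistic regression by gradient descent on one binary characteristic. This is a deliberately simple stand-in, not one of the machine learning techniques the paper uses (the text does not name them), and the data, feature, and function names are all hypothetical.

```python
import math


def predict_risk(weights, row):
    """Predicted probability of an unhealthy birth for one mother."""
    score = weights[0] + sum(w * v for w, v in zip(weights[1:], row))
    return 1.0 / (1.0 + math.exp(-score))


def fit_logistic(features, labels, lr=0.5, n_iter=5000):
    """Fit a logistic regression with intercept by full-batch gradient descent."""
    n, k = len(features), len(features[0])
    weights = [0.0] * (k + 1)  # weights[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, label in zip(features, labels):
            err = predict_risk(weights, row) - label
            grad[0] += err
            for j, value in enumerate(row):
                grad[j + 1] += err * value
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


# Hypothetical toy sample: one feature, an indicator for teenage mothers.
# 4 of 10 teenage mothers and 1 of 10 other mothers have an unhealthy baby.
features = [[1]] * 10 + [[0]] * 10
labels = [1] * 4 + [0] * 6 + [1] * 1 + [0] * 9

weights = fit_logistic(features, labels)
risk_teen = predict_risk(weights, [1])
risk_other = predict_risk(weights, [0])
```

With a single binary feature the fitted probabilities approach the group frequencies in the toy sample, so the teenage-mother group is flagged as higher risk. A targeting analysis would then compare outcomes between mothers above and below some predicted-risk cutoff.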
Taken together, these results are consistent with the account that physicians are time constrained and cannot increase the average number of prenatal checkups for all mothers but do improve the targeting of care toward more-vulnerable mothers without compromising the care of lower-risk mothers. This paper contributes to the literature in several ways. First, our study contributes to the experimental evidence on the effects of more-skilled physicians on health outcomes (Chan and Chen, 2022; Currie and Zhang, 2023; Dahlstrand, 2021; Fadlon and Van Parys, 2020; Stoye, 2022). Our identification strategy and the availability of granular administrative records allow us to measure the causal impact on health outcomes of being treated at an LHC that was randomly assigned more-skilled physicians. Previous studies have documented the relationships between health outcomes and physicians’ diagnostic skills (Currie and MacLeod, 2020), physicians’ teams (Chen, 2021), health care access (Almond et al., 2010; Anderson et al., 2014; Aron-Dine et al., 2015; Bardach et al., 2013; Finkelstein et al., 2012; Michalopoulos et al., 2012), health care costs (Alsan et al., 2019; Clemens and Gottlieb, 2014; Molitor, 2018), the quality of physicians’ academic institutions (Doyle Jr et al., 2010), physicians’ performance on qualifying examinations (Carrera et al., 2018; Tamblyn et al., 2002; Wenghofer et al., 2009), physicians’ competence (Das and Hammer, 2005, 2007; Das et al., 2008, 2016; Das and Sohnesen, 2007; Leonard and Masatu, 2007; Leonard et al., 2007), physicians’ ability to facilitate adherence to prescription medications (Iizuka, 2012; Simeonova et al., 2020), physicians’ fees and payment for performance (Basinga et al., 2011; Ho and Pakes, 2014a,b), general practitioners and specialists (Baicker and Chandra, 2004), and physicians’ 6 communication (Curtis et al., 2013). 
communication (Curtis et al., 2013).

Second, our paper contributes to the broader literature on service providers’ value-added, extending the framework typically applied to education into health care. Studies have shown that effective service providers, such as teachers, can significantly impact outcomes in their respective fields (Araujo et al., 2016; Chetty et al., 2011; Rivkin et al., 2005; Rockoff, 2004). Similarly, we find substantial heterogeneity in physicians’ value-added, highlighting the crucial role of physician quality in health outcomes. Furthermore, consistent with Davis et al. (1995) and Schnell and Currie (2018), who provide evidence on the significant link between physicians’ education and their professional performance, our results show that physicians’ test scores on their graduation exams are strong predictors of their value-added. These observable scores serve as practical tools with high predictive power for unobservable features like physician value-added and can be effectively used to identify higher-performing physicians.

Finally, we contribute to the literature on the differential effects of variation in the expertise of health care personnel. Previous papers have found wide variations in treatment rates across LHCs due to allocative inefficiencies and variations in treatment expertise (Abaluck et al., 2016; Chandra and Staiger, 2020; Currie and MacLeod, 2017). We benefit from recent advances in machine learning techniques to show that more-skilled physicians target prenatal consultations toward mothers with the highest risk of giving birth to an unhealthy baby. Our results suggest that observable risk factors receive more attention from more-skilled physicians. This implies that taking physicians’ skills into account when assigning and matching them to areas or populations of greatest need could yield positive social value by improving health outcomes among vulnerable populations.
The remainder of this paper is organized as follows: In section 2, we describe the Colombian health system and the SSO program, the setting we exploit to identify parameters of interest. Section 3 describes the rich administrative data we derive from physicians’ graduation exams and patients’ birth outcomes. In section 4, we introduce our empirical strategy, show evidence for the randomness of physicians’ LHC assignments, and present our main estimated effects. Section 5 presents our robustness checks. In section 6, we discuss the frequency of prescribed prenatal checkups as a potential mechanism through which more-skilled physicians impact health outcomes. We conclude in section 7.

2 Institutional Background and Experimental Setting

2.1 Institutional Background

According to the Political Constitution of Colombia of 1991, access to health services is an individual basic right. The system is structured to promote equity in the distribution of subsidies and access to health services (Law 100, Congress of Colombia, 1993). Law 100 of 1993 introduced two types of health insurance: subsidized and contributive. The contributive regime covers formal employees (and their families) who contribute a fixed share of their employment income to the system. The subsidized regime covers poor household members who lack formal employment.8 By 2011, access to health care was close to universal; indeed, even among the poorest population, insurance coverage was at 87 percent, while in rural areas it was at about 88 percent (Páez et al., 2007). High levels of health care access are associated with greater use of reproductive health services, which is essential to reducing the risks associated with pregnancy and childbirth, as well as infant mortality (WHO, 2016). During our period of analysis, 87.7 percent of Colombian women received adequate prenatal care, defined by the WHO (2016) as having at least four prenatal checkups.
Likewise, 8.8 percent of infants had a low birth weight, and 9.3 percent were born prematurely. Still, the system faces important challenges. In 2017, according to the United Nations Statistics Division database, the neonatal mortality rate (deaths per 1,000 live births) was 7.8 and the infant mortality rate (infant deaths per 1,000 live births) was 12.2.9 To become a physician in Colombia, one must be accepted into an undergraduate health program in medicine.10 Medical students earn a BA after five to six years of education. According to Colombian law, all professionals who graduate from health programs are social servants; as such, directly after graduation, they must work in urban and rural areas with limited access to health services for one year before practicing as professionals. This service is provided under the SSO program. The current SSO program was created by Law 1164/2007 (Congress of Colombia, 2007), but it was only adopted in 2010 when its implementation was legislated by Resolution 1058/2010 (Ministry of Health, 2010). The main objective of the SSO program is to improve the quality of health services in depressed urban and rural areas, to increase access to health services in those areas, and to better distribute human talent in health throughout the country. The SSO program also promotes spaces for the personal and professional development of those beginning their careers in the health sector.11 Physicians play a key role in maternal medicine in the Colombian health system. The Ministry of Health (2013), in Resolution 1441 of 2013, states that any physician in Colombia can perform low-complexity surgeries and procedures, including child delivery, cesarean sections, providing medical care to infants, and offering early detection activities like prenatal checkups. An important characteristic of the Colombian health system is that physicians always carry out prenatal checkups. 
According to the Colombian Ministry of Health’s practical guide for preventing, detecting, and treating pregnancy complications (Gomez et al., 2013), prenatal checkups can be carried out by nurses specializing in maternal-perinatal care instead, but calculations from the VSRs show that physicians are responsible for all prenatal checkups and attend 98 percent of deliveries.12

8 Eligibility for the subsidized regime is defined by the household’s wealth score in the System of Identification of Potential Social Program Beneficiaries (Sistema de Identificación de Potenciales Beneficiarios de Programas Sociales, or SISBEN), which is used to target public program beneficiaries in Colombia.

9 https://data.un.org/, consulted in May 2020.

10 Other health programs include nursing, bacteriology, and dentistry.

11 See Resolution 1058/2010 (Ministry of Health, 2010).

2.2 Experimental Setting: The SSO Program

By 2007, as the number of people getting medical training in Colombia increased, there were fewer available positions for SSO physicians than there were applicants. Therefore, how applicants would be chosen and assigned to LHCs became one of the program’s most critical decisions. Law 1164/2007 (Congress of Colombia, 2007) required that LHC assignments be “guided by the principles of transparency and equal conditions for all applicants.” Accordingly, Resolution 1058/2010 established that applicants must be selected and assigned to LHCs through state-level lottery draws.

At the end of 2012, a more organized approach was introduced. The first two years of the new program had shown that the directions in Resolution 1058/2010 were not robust enough to guarantee that the assignment of physicians to LHCs would be transparent and organized.
Consequently, Resolution 566/2012 (Ministry of Health, 2012b) mandated that there would be four state-level SSO lottery draws each year, starting in January 2013.13 Applicants would choose the state of their assignment but would be randomly assigned to available positions in that state. Resolution 4503/2012 (Ministry of Health, 2012a) also provided clearer and more organized guidance on how the lottery draws should be conducted. To prevent strategic application behavior and to take advantage of the fact that the number of newly graduated physicians was about twice the number of available positions, Resolution 4503/2012 established that physicians could apply only to one state and only when the number of applicants for that state was lower than twice the number of available positions. This rule guaranteed an excess of demand for spots in each state and cohort.

After the application process closed, each state publicly and randomly assigned its available spots according to the following steps: First, an oversight board consisting of one civil servant from the state secretariat of health and four health professionals was chosen. The civil servant then publicly announced the number of positions available and who had registered for each profession. At this point, she also stated the rules for the lotteries, which typically used ballots. If an applicant received a white ballot, they were exempt from the SSO program and received a certificate allowing them to work in Colombia as a professional (i.e., their medical license). Otherwise, they received a red ballot with the randomly assigned code of the LHC where they would work. If there were fewer applicants than positions available, all the applicants who had registered were assigned to an LHC, but the specific LHC was still assigned through the lotteries. Finally, the civil servant of the secretariat of health prepared a report listing the SSO physicians and their assigned LHCs, as well as the applicants who were exempt from the SSO program.

12 Nurses who have just graduated from college cannot perform prenatal examinations in Colombia.

13 The lottery draws took place in January, April, July, and October in each of Colombia’s 32 states.

A physician’s social service at their assigned LHC typically began between one and three months after the lottery draw and lasted for 12 months. The starting date was defined before the random assignment and, therefore, was orthogonal to the physicians’ characteristics as well. If a physician refused to work in the LHC to which they were assigned or unilaterally quit before the official end of their service, they were given a six-month sanction, during which time they could not work as a health professional. After that period, they had to apply to the SSO program again. This sanction imposed strong costs on quitters and proved to be a good deterrent against dropping out of the program. This system for randomly assigning applicants to LHCs lasted for seven lottery draws.14 Since October 2014, a new centralized system that gives more weight to applicants’ stated preferences and a prioritized list has replaced the random assignment process.
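The ballot procedure described above can be mimicked with a short simulation. This is a sketch under simplifying assumptions, not a reproduction of the official procedure: the function name, identifiers, and LHC codes are hypothetical, and the draw is modeled as a uniform random shuffle of applicants and positions within one state.

```python
import random


def run_state_lottery(applicants, lhc_positions, seed=None):
    """Randomly assign applicants to LHC positions within one state.

    Returns (assigned, exempt): a dict mapping applicant -> LHC code (a "red
    ballot") and a list of applicants left without a position ("white ballot").
    """
    rng = random.Random(seed)
    shuffled_applicants = list(applicants)
    shuffled_positions = list(lhc_positions)
    rng.shuffle(shuffled_applicants)
    rng.shuffle(shuffled_positions)
    # zip stops at the shorter list, so when there are fewer applicants than
    # positions everyone is assigned and some positions stay unfilled.
    assigned = dict(zip(shuffled_applicants, shuffled_positions))
    exempt = shuffled_applicants[len(shuffled_positions):]
    return assigned, exempt


# Hypothetical state with six applicants and four positions in three LHCs.
applicants = [f"physician_{i}" for i in range(6)]
positions = ["LHC_A", "LHC_A", "LHC_B", "LHC_C"]
assigned, exempt = run_state_lottery(applicants, positions, seed=2013)
```

Because neither the applicants' characteristics nor the LHCs' characteristics enter the function, the skill level of the physicians an LHC receives is independent of the LHC by construction, which is the property the identification strategy relies on.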
The Ministry of Health (1990, 2001) specifies that the responsibilities of physicians during their SSO service include the following:
• Developing health prevention programs (such as vaccinations, family planning programs, prenatal controls, chronic diseases controls, and buccal and visual health programs)
• Providing primary care and diagnosis
• Assigning treatment and therapies
• Creating and improving medical records
• Making a health plan and epidemiological profile for the local community
• Performing any other duty stated in their contract

Moreover, LHCs explicitly mention attending and performing surgical procedures, including cesarean sections and child delivery, as part of the functions and activities of SSO physicians.15

The period of time during which physicians were randomly assigned to LHCs is a convenient setting to estimate causal relationships that would otherwise be difficult to identify. The SSO assignment has implications for both the physicians who were selected randomly and the communities that were assigned physicians with different qualities. The latter set of implications is the focus of the present paper; the implications for physicians are studied in Guarin et al. (2023). In this paper, we use the exogenous rule of assignment to compare the birth outcomes of patients in LHCs that were assigned physicians with different levels of medical skill but who are otherwise comparable.16

14 This lottery system covered all four of the 2013 cohorts and the first three cohorts of 2014.
15 We reviewed the manual of functions for five LHCs included in our sample. The reviewed institutions were LHC Salazar de Villeta, LHC Francisco Valderrama, Subred de Servicios de Salud sur, Red de servicios del primer nivel, and Guaviare.

3 Data

We use five main sets of administrative data.
The primary data set comes from the reports written and published by the Ministry of Health for each of the state-level SSO lottery draws, which were conducted in January, April, July, and October 2013 and January, April, and July 2014 (Ministry of Health, 2014). From these data, we obtain individual identifications, the lottery draw date, the state to which each physician applied, whether the physician was selected by the lottery or not, and, importantly, the LHC to which each physician was randomly assigned and the proposed start date. For our period of analysis, 45 percent of the LHCs in the SSO program show up in only one lottery draw, while 29 percent of the LHCs appear in two lottery draws and 26 percent of the LHCs appear in three to five lottery draws. The second administrative data set comes from the Colombian Institute for Educational Evaluation (Instituto Colombiano para la Evaluación de la Educación, or ICFES). ICFES is the institution that administers SABER PRO, the exam that all professionals, including physicians, must take before college graduation (Colombian Institute for Educational Evaluation, 2014). Using national ID numbers, we are able to link the physicians who participated in the SSO program to the ICFES records and recover information on their performance in SABER PRO. From SABER PRO, we glean data on physicians’ individual performance in two health-related modules, one that tests their knowledge of health care and another that tests their knowledge of disease prevention, as well as detailed sociodemographic information about each physician.17 Our estimations use the scores in the two health-specific modules as proxies for physicians’ medical skills before the SSO program.18 The objective of the health-specific modules in SABER PRO is to measure the skills and knowledge of medical professionals. 
16 While service in the SSO is mandatory for health graduates in nursing, bacteriology, and dentistry as well as medicine, in this paper we focus on physicians for three reasons. First, the excess demand for the state-level lottery draws was mandatory for physician positions, creating suitable conditions for lotteries. Second, as previously mentioned, prenatal checkups in Colombia must be carried out by physicians (Gomez et al., 2013). Finally, physicians arguably make the greatest contribution to the health of the patient (Das and Hammer, 2005) and to birth outcomes in particular.
17 We also recover data on physicians’ individual performance in two other modules: one that tests reading comprehension and another that tests quantitative reasoning. Graduation exam scores are only available for the newly appointed physicians (i.e., we do not have the exam scores of physicians who graduated before 2009).
18 The correlation between a physician’s medical skills and their test performance has previously been documented in the literature. For example, Norcini et al. (2002) and Norcini et al. (2014) show a strong correlation between mortality and a physician’s certifying examination performance. Similarly, Tamblyn et al. (2002) find a relationship between examination scores and the primary care practice of doctors in Quebec. Wenghofer et al. (2009) find an association between medical examination scores and the quality of health care in Canada, while Tamblyn et al. (2007) find a relationship between physicians’ exam scores and patients’ complaints to the medical regulatory authorities.

According to ICFES, the health care module assesses whether the physician has the competence to provide care that integrates both disease prevention and proper diagnosis with medical treatment and patient rehabilitation at all levels of complexity.
The disease prevention module evaluates the physician’s competence to apply basic concepts of health promotion and disease prevention to prioritize actions according to individuals’ health conditions. ICFES ranks physicians according to four levels of quality. Physicians who score in the lowest level of the health care module only understand basic concepts and elements of epidemiology and public health. On the other hand, physicians who score in the highest level understand public health concepts (actions aimed at mitigating the health problems of communities), can assess patients’ health conditions, and can analyze social, cultural, and economic factors that may influence differences across patients’ health. Similarly, for the disease prevention module, physicians who score in the lowest level understand basic concepts of biosafety and occupational risk. Those who score in the highest level can analyze complex health situations in a given context and select appropriate actions following current regulations and standards in medicine. Because the SSO program is the physicians’ first real work experience, and because SABER PRO is taken just before graduation, we consider their scores a good measure of their medical skills at the time they start their SSO service and their professional career.19 In Colombia, as in many other developing countries, there is high heterogeneity in the quality of medical education. In 2009, only 30 percent of medicine programs in Colombia had been accredited as high-quality programs by the Ministry of Education (Fernández Ávila et al., 2011). Figure 1 shows high heterogeneity in average scores on the health-specific SABER PRO modules between and within universities for the physicians in our sample.20 The figure shows the mean score for each university and an interval of one standard deviation to each side of the mean. Note that there is a difference of almost two standard deviations between the averages of the best and the worst programs. 
This high heterogeneity plays in our favor because it allows us to compare the outcomes of patients who were randomly exposed to physicians with very different baseline levels of knowledge and skills.21 Using the scores and demographic characteristics from SABER PRO, Guarin et al. (2023) have shown that the SSO lotteries in our sample are well balanced between SSO physicians and those who were randomly exempted from participation in the SSO program. They use individual regressions correlating physicians’ characteristics and lottery status as well as machine learning techniques and a classification permutation test to provide evidence of the equality of multivariate distributions between the treatment and control groups (Gagnon-Bartsch et al., 2019) and the randomness of selection into the program in general.

19 Schnell and Currie (2018) provide evidence on the important link between physicians’ education and their professional performance.
20 In Colombia, each university has no more than one medicine program.
21 Similarly, figure A.4 shows substantial heterogeneity in scores on the quantitative and reading modules for the universities the physicians in our sample attended.

The third administrative data set comes from VSRs collected by the Administrative Department of Statistics (Departamento Administrativo Nacional de Estadística, or DANE) (Administrative Department of Statistics, 2018b). The VSRs contain rich information for all birth certificates filed in LHCs within Colombia’s 1,120 municipalities (subdivisions of the 32 states) from 1998 to 2018. Using LHCs’ identification codes, we are able to link physicians and the birth records of the LHCs to which they were assigned. Using the birth date and number of gestation weeks from the VSRs, we are able to identify children born between 2013 and 2016 who were exposed to each team of physicians.
We also use the VSR data from 2009 to 2012 to create mother- and LHC-level controls to provide evidence of covariate balance at the LHC level and to run placebo tests.

The fourth administrative data set comes from the 2005 National Census, also collected by DANE (Administrative Department of Statistics, 2005). From the census, we get the population and other variables at the municipality level that we use to test the randomization of the program and as controls in the robustness checks.

Finally, we collect information from the National Registry of Human Resources in Health (Registro Único Nacional del Talento Humano en Salud, or ReTHUS). The Ministry of Health designed ReTHUS through Law 1164 of 2007 (Congress of Colombia, 2007). ReTHUS registers all individuals authorized to practice a health profession or occupation. These data contain detailed information on the date of degrees, the date on which the medical license was granted, and postgraduate degrees. We also collect additional data at the LHC level from the Colombian Ministry of Health.

Figure 1: Heterogeneity in SABER PRO Scores between and within Medicine Programs

Note: This figure reports the health care and disease prevention module scores on SABER PRO for the universities (Ministry of Education, 2019) that the physicians in our sample attended. The data account for 44 different universities. The figure shows the mean score for each university and an interval of one standard deviation. The dashed horizontal line represents the overall median. The figure shows substantial heterogeneity both within and between programs. For all the fields reported, there is a difference of almost two standard deviations between the averages of the best and the worst programs and a difference of almost one standard deviation between the averages of the worst and the median program and the averages of the median program and the best program.
3.1 Main Sample

As noted above, our cohorts of SSO physicians were chosen at random in state-level lottery draws conducted in January, April, July, and October 2013 and January, April, and July 2014. We exclude physicians assigned to metropolitan areas (MA) because the presence of larger hospitals and other LHCs may introduce selection biases that we do not expect in smaller municipalities.22 Additionally, SSO physicians play a less pivotal role in metropolitan areas.23 Our sample of 598 municipalities covers about 58 percent of the Colombian population.

22 To determine which municipalities are not part of a metropolitan area, we restrict our sample to those outside the 23 metropolitan areas defined by DANE, which bases its definition on population size and the degree of integration of urban centers with surrounding municipalities.
23 In Colombia, patients are assigned to a nearby Level 1 LHC as their primary facility for basic care. Level 1 LHCs, which are typically staffed by SSO practitioners and offer basic health care services with low-complexity technology, often are the only facilities in smaller municipalities. In contrast, metropolitan areas have health centers and hospitals of all levels, allowing mothers to easily substitute among multiple providers. While national regulations specify that SSO physicians are responsible for maternal care, including family planning and prenatal checkups (Ministry of Health, 1990, 2001), SSO physicians may be less likely to perform prenatal care in metropolitan areas due to the presence of more experienced and specialized doctors. The SSO program’s objective is to provide professional services in mostly rural areas with limited access to health services (Ministry of Health, 2010, Resolution 1058/2010); accordingly, between 2013 and 2014, 77.3 percent of the available positions for assigned physicians were in small cities outside metropolitan areas.

The main sample consists of all babies whose mothers were exposed to our randomly assigned cohorts of SSO physicians in non-metropolitan areas. Although regulations stipulate that SSO physicians should be the ones treating pregnant mothers in their assigned hospitals, we do not observe which physicians actually treated the mothers. Instead, we consider a mother to be exposed to an SSO cohort if her gestation period overlaps with the time of a cohort’s assignment to the LHC where she gave birth. This implies that a mother can be exposed to multiple cohorts; in fact, 50 percent of the babies in our sample are exposed to more than one cohort. As detailed in the empirical strategy section, our main variable of interest (which serves as a proxy for the level of skill of the physicians to which a baby was exposed) is calculated based on the graduation exam scores of the SSO cohorts to which mothers were exposed. For the 50 percent of cases in which a mother was exposed to only one cohort, we use the average of that cohort’s exam scores. For the other 50 percent of cases, in which a mother was exposed to more than one cohort, we compute a weighted average of the exam scores of the different cohorts, where the weight for each cohort is the number of overlapping days between her gestation period and the time of the cohort’s assignment to her LHC. Since there is usually only one LHC per municipality in our non-metropolitan areas main sample, mothers are not expected to be exposed to more than one LHC. We exclude from the analysis 53 babies for whom gestational age information is missing. Our main sample contains 255,089 babies and 2,126 physicians.
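As an illustration only, the exposure-weighted skill measure described above can be sketched in a few lines of Python. This is not the code used for the paper: the function names, the tuple layout for cohorts, and the choice to treat both the gestation period and the service period as closed intervals (inclusive of both endpoints) are assumptions made for the example.

```python
from datetime import date


def overlap_days(start_a, end_a, start_b, end_b):
    """Days shared by two closed date intervals (0 if they do not overlap)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max(0, (earliest_end - latest_start).days + 1)


def weighted_skill(gestation_start, birth_date, cohorts):
    """Exposure-weighted average of cohort mean exam scores.

    cohorts: list of (service_start, service_end, mean_score) tuples for the
    SSO cohorts assigned to the LHC where the mother gave birth (hypothetical
    layout). Returns None if no cohort overlaps the gestation period.
    """
    total_days = 0
    weighted_sum = 0.0
    for service_start, service_end, mean_score in cohorts:
        days = overlap_days(gestation_start, birth_date, service_start, service_end)
        total_days += days
        weighted_sum += days * mean_score
    if total_days == 0:
        return None
    return weighted_sum / total_days


# Toy example: 4 days of exposure to a cohort scoring 10.0 and
# 6 days of exposure to a cohort scoring 11.0.
example = weighted_skill(
    date(2014, 1, 1),
    date(2014, 1, 10),
    [
        (date(2013, 6, 1), date(2014, 1, 4), 10.0),
        (date(2014, 1, 5), date(2014, 12, 31), 11.0),
    ],
)
```

With a single overlapping cohort the function returns that cohort’s average score, which matches the single-cohort case described in the text.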
For each baby, we observe the birth certificate, which includes information on low birth weight, Apgar score, weeks of gestation, prenatal checkups, and demographic information for the mother and the child. For each physician, we observe their scores on the SABER PRO health care, disease prevention, reading comprehension, and quantitative reasoning modules, as well as sociodemographic information they provided at the time of the graduation exam.

Table 1 provides basic descriptive statistics for the main health outcomes we measure using data from the VSRs.24 It also shows how our main sample compares to the full sample of mothers and babies exposed to SSO physicians. The binary variable unhealthy takes a value of 1 if the infant has a birth weight below 2,500 grams, is born before 37 weeks of gestation, or has an Apgar score below 7. We use the variable unhealthy as our main measure of a newborn infant’s health at birth, while also analyzing birth weight, prematurity, and Apgar score individually. Columns 1 and 2 show the mean and standard deviation, respectively, for babies in LHCs to which at least one SSO physician was assigned (the full SSO sample); columns 3 and 4 show the same statistics when we constrain the sample to municipalities outside of the main metropolitan areas (the rural SSO sample, which is our main sample). In our main sample, 4.27 percent of infants had a low birth weight, 4.11 percent were born prematurely, 3.75 percent had an Apgar score below 7, and 9.52 percent of newborn infants experienced at least one of these three medical conditions, meaning they were classified as unhealthy. The share of female infants is 48.84 percent. Moreover, 16.3 percent of the mothers had insufficient prenatal care, which is an indicator variable that takes the value of 1 if the mother received fewer than four prenatal checkups. Teenage pregnancy accounts for 28.46 percent of total births in the main sample. Finally, the average number of LHCs by municipality is around 1.2.

24 Unfortunately, we do not have continuous measures for birth weight or Apgar scores. However, we do have data on gestational weeks. To keep consistency across analyses, we have opted to use binary outcome variables throughout. That said, we also conducted analyses using gestational weeks as a continuous variable and explored alternative definitions of the binary variable for prematurity. The results are consistent with the results reported in the main analysis.

Table 1: Descriptive Statistics for Mothers and Babies Exposed to SSO Physicians, 2013–2016

| Covariate | Description | Full SSO sample: Mean (1) | Full SSO sample: SD (2) | No MA SSO sample: Mean (3) | No MA SSO sample: SD (4) |
|---|---|---|---|---|---|
| Low birth weight | 1(Weight < 2500 g) | 0.0594 | 0.2364 | 0.0427 | 0.2022 |
| Prematurity | 1(Gestational weeks < 37) | 0.0615 | 0.2402 | 0.0411 | 0.1985 |
| Low Apgar score | 1(Apgar score < 7) | 0.0379 | 0.1911 | 0.0375 | 0.1900 |
| Unhealthy | max(LBW, Premature, APGAR) | 0.1175 | 0.3220 | 0.0952 | 0.2935 |
| Insufficient prenatal care | 1(Prenatal checkups < 4) | 0.1755 | 0.3804 | 0.1630 | 0.3693 |
| Number of observations | | 363,744 | | 255,089 | |

Note: This table presents the mean and standard deviation (SD) for the main birth statistics of the mothers and babies affected by the SSO program. The data come from the 2013–2016 DANE VSRs, which collect information about all births and deaths in Colombia. The full SSO sample covers all the LHCs that had an SSO physician assigned to them in our sample, while the rural SSO sample, our main sample, is restricted to municipalities outside metropolitan areas. Low birth weight is the proportion of newborn infants whose birth weight was less than 2,500 grams. Prematurity is the proportion of newborn infants who were born after fewer than 37 weeks of gestation. Low Apgar score is the proportion of newborn infants whose Apgar score was lower than 7.
Unhealthy, our main measure of health at birth, is the proportion of newborn infants with at least one of the three previous conditions. Female infants is the proportion of female infants. Insufficient prenatal care is the proportion of mothers who had fewer than four prenatal checkups. Teenage mothers is the proportion of mothers who were 19 years old or younger at the time they gave birth. Number of LHCs per municipality is the count of LHCs in the birthplace municipality.

3.1.1 Municipalities

As previously mentioned, we restrict our sample to municipalities in rural areas (outside of the main 23 Colombian metropolitan areas), where we expect fewer physicians per municipality. There are 598 municipalities included in our sample (see figure A.2). The median number of people living in each municipality is 14,049 (the mean is 22,042). The average share of people living with unsatisfied basic needs (UBN) is almost 50 percent, including some municipalities where the whole population lives with UBN.25 These figures indicate that SSO physicians provide their services in LHCs located in underserved areas.

25 As a reference, the average share of people living with UBN for the 23 and 7 largest cities and their metropolitan areas is 21.5 percent and 17.4 percent, respectively.

We obtain the total number of physicians per municipality from ReTHUS.26 Of the 598 municipalities included in our sample, only 16 have more than one LHC. The median number of physicians per LHC is three, and around 94 percent of the LHCs have fewer than 20 physicians.27 Most deliveries are attended by general practitioners and SSO physicians. In fact, approximately 90 percent (527 out of 582) of the municipalities with one LHC and available data on specialist availability do not have an obstetrician or gynecologist working in their LHCs at any time during our sample period.
While this highlights the limited access to specialists in these areas, it is significant for our study, as SSO physicians play a crucial role in providing maternal health care in their LHCs.

3.1.2 SSO Physicians

As noted above, our main sample includes 2,126 physicians who were selected in one of seven lottery draws between 2013 and 2014. Table A.1 presents baseline summary statistics for the physicians in our sample. Nearly 56 percent of the physicians are women. While 29 percent of physicians lived in lower socioeconomic neighborhoods (strata 1 and 2), 36 percent lived in stratum 3, and 35 percent resided in higher-income neighborhoods (strata 4–6). Given that less than 10 percent of Colombians live in strata 4–6, this indicates that physicians in our sample generally come from households with significantly better economic conditions than the median Colombian. Physicians’ average household size is four people. Looking at the parents of SSO physicians, 64.4 percent (63.4 percent) of fathers (mothers) have completed tertiary education. Almost 45 percent of these households have a monthly income of less than three times the monthly minimum wage (22.9 percent earn less than two). Finally, the average score on the health care module for the physicians in our sample is 10.4, with a maximum of 13.9 and a standard deviation of 1, and the average score on the disease prevention module for the physicians in our sample is 10.4, with a maximum of 13.4 and a standard deviation of 1.

3.1.3 Compliance

We use the ID numbers of all the physicians in the SSO program between 2013 and 2014 and merge them with ReTHUS to get the dates on which the physicians graduated and obtained their medical licenses.28 We define as compliers those physicians who obtained their licenses more than three months but less than two years after their graduation date. The share of compliers is 94 percent.

26 Unfortunately, ReTHUS provides information at the municipality level, so we can only match SSO physicians, not every physician, to the LHC at which they work.
27 Figure A.3 shows the distribution of physicians per municipality for the sample of 582 municipalities with one LHC per municipality.
28 In addition to requiring physicians to receive their license between three months and two years after graduation, we limit the definition to those who do not appear in subsequent lottery draws within the same time frame.

4 Empirical Analysis

The aim of our empirical analysis is to identify the impact of more-skilled physicians on birth outcomes. We estimate this impact on the 255,089 children whose mothers received care from 2,126 physicians randomly allocated to 616 LHCs in rural areas in 2013 and 2014. As previously noted, the principal outcome that our empirical approach focuses on is an aggregate measure of health at birth that incorporates the three main indicators commonly studied in the literature: low birth weight, prematurity, and Apgar score. To proxy physicians’ level of medical skills, we focus on the average score of the two health-specific SABER PRO exam modules. We also provide robustness checks using the first principal component of the two health-specific scores (which statistically combines them into a single index capturing the largest shared variation), and we also consider each score individually.

We first test the internal validity of our identification strategy. Next, we present our main results on birth outcomes. We also explore whether physicians’ effects are more pronounced among different subgroups. Finally, we compute a relative measure of value-added and regress it on several physician characteristics, including our measure of physicians’ skill levels.

4.1 Empirical Strategy

Our empirical strategy examines a health production function linking birth outcomes to physicians’ skills.
Specifically, we consider the following linear model:

\[ Y_i = \alpha + \beta Z_{i,h,t} + \epsilon_i, \tag{1} \]

where $Y_i$ is the birth outcome of child $i$, $Z_{i,h,t}$ represents the weighted average skill level of the physicians’ cohort working at LHC $h$ during the gestation period $t$ of child $i$, and $\epsilon_i$ is the error term. Note that the analysis is conducted at the child level and that, while we denote the LHC by $h$ and the gestation period by $t$, these depend on child $i$.29 In our setting, some mothers were exposed to more than one SSO cohort during their pregnancy. Estimating equation (1) directly using ordinary least squares (OLS) may result in biased estimates of $\beta$ due to potential correlation between the assignment of physicians and unobserved characteristics of patients. To address this endogeneity, we leverage the random assignment of physicians to LHCs and employ an IV approach. To isolate the causal variation associated with the random assignment, we use the skill level of the first SSO cohort a mother was exposed to during her pregnancy, $Z^{1}_{i,h,t}$, as an instrument for the weighted average skill level of all SSO cohorts she was exposed to over the course of her pregnancy, $Z_{i,h,t}$.

29 For simplicity, we denote the LHC by $h$, the gestation period by $t$, and the draw-by-state fixed effects by $\gamma_{d^1}$, omitting their dependence on child $i$. Formally, we would write $h(i)$, $t(i)$, and $\gamma_{d^1(i)}$, but we use simplified notation to enhance readability. Note also that the gestation period $t$ represents the time interval corresponding to the pregnancy period of child $i$. Additionally, the draw-by-state fixed effect, $\gamma_{d^1}$, depends on $i$ through the timing of the first SSO cohort the mother was exposed to during her pregnancy.

The first-stage equation is

\[ Z_{i,h,t} = \eta + \pi Z^{1}_{i,h,t} + \gamma_{d^1} + \nu_{i,h,t}, \tag{2} \]

where $Z_{i,h,t}$ is calculated as a weighted average of the graduation exam scores of the different cohorts, with weights given by the number of overlapping days between the gestation period and the period of each cohort’s assignment to the mother’s LHC, $Z^{1}_{i,h,t}$ is the skill level of the first cohort of physicians the mother is exposed to, proxied by the average of their scores, $\gamma_{d^1}$ are draw-by-state fixed effects corresponding to the first cohort, and $\nu_{i,h,t}$ is the error term. The reduced-form equation is

\[ Y_i = \theta + \rho Z^{1}_{i,h,t} + \gamma_{d^1} + \varepsilon_i, \tag{3} \]

and the second-stage equation is

\[ Y_i = \alpha + \beta \hat{Z}_{i,h,t} + \gamma_{d^1} + \epsilon_i. \tag{4} \]

In equation (4), $\beta$ identifies the impact on the child’s health at birth of a mother’s being treated at an LHC $h$ that has been randomly assigned a more-skilled SSO cohort. Similarly, $\rho$ in the reduced-form equation (3) captures the overall effect of the skill level of the first SSO cohort on birth outcomes. This parameter reflects the total impact of the mother’s initial exposure to more-skilled physicians on birth outcomes, combining both the effect through the average physician skill level during gestation and any direct effects mediated by the instrument.

A key identifying assumption behind our IV approach is that conditional on the draw-by-state fixed effects of the first SSO cohort, the skill level of the first cohort predicts the average skill level of the physicians the mother was exposed to and affects birth outcomes only through this channel.

4.2 Internal Validity

Our identification relies on the assumption that conditional on the design fixed effects, the allocation of physicians to LHCs, $h$, is independent of potential outcomes, $Y_i$. To assess the internal validity of our identification strategy, we conduct two tests.
First, we examine whether any characteristics of the LHCs, municipalities, mothers, or children (including pre-treatment LHC birth outcomes) are correlated with the skill level of the physicians who were randomly assigned in 2013 and 2014. Second, we implement placebo tests by assigning a “placebo treatment” to births recorded in the VSRs during the four years prior to the program (2009–2012) instead of the years used in our main estimation sample (2013–2016).

4.2.1 Balance Tests on Pretreatment and Concurrent Covariates

To test for any correlation between physicians’ skill levels and LHC, municipal, and demographic characteristics, we conduct balance tests on two separate sets of variables: pretreatment characteristics measured from 2010 to 2012 and concurrent characteristics during the period of each SSO cohort’s assignment to the LHC. The pretreatment variables include LHC-level covariates, such as municipality population, number of LHCs in the municipality, average birth outcomes, and predetermined demographics of mothers (e.g., education, age, and marital status) and children (e.g., sex), for births occurring from 2010 to 2012 in those LHCs. The concurrent variables include municipality population, number of LHCs in the municipality, and predetermined demographics of mothers and children born during the period when the SSO cohorts were assigned to each LHC. For each set of variables, we estimate the following equation using OLS:

\[ X_{h(j),\tau} = \mu + \phi Z_j + \gamma_{d(j)} + \zeta_{h(j),\tau}, \tag{5} \]

where $X_{h(j),\tau}$ represents the LHC, municipal, or demographic characteristic for LHC $h$ during the relevant variable-specific time interval $\tau$ (i.e., pretreatment or concurrent); $Z_j$ is the proxy for the medical skills of cohort $j$ assigned to LHC $h$ in lottery draw $d$, measured as the average of their health-specific graduation exam scores; $\gamma_{d(j)}$ are draw-by-state fixed effects; and $\zeta_{h(j),\tau}$ is the error term.
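For readers who want to see the mechanics, a balance regression of the form in equation (5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper’s estimation code: it recovers only the point estimate of the coefficient on the skill proxy by demeaning both variables within draw-by-state cells (the Frisch-Waugh-Lovell result), and it omits the LHC-level clustered standard errors. The function and argument names are hypothetical.

```python
from collections import defaultdict


def demean_by_group(values, groups):
    """Subtract each group's mean, which absorbs group fixed effects."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        totals[group] += value
        counts[group] += 1
    return [value - totals[group] / counts[group]
            for value, group in zip(values, groups)]


def balance_coefficient(covariate, skill, draw_state):
    """OLS coefficient on `skill` in a regression of `covariate` on `skill`
    plus draw-by-state fixed effects (point estimate only)."""
    x_tilde = demean_by_group(covariate, draw_state)
    z_tilde = demean_by_group(skill, draw_state)
    denominator = sum(z * z for z in z_tilde)
    if denominator == 0:
        raise ValueError("skill has no variation within draw-by-state cells")
    return sum(x * z for x, z in zip(x_tilde, z_tilde)) / denominator


# Toy LHC-by-cohort data: the covariate differs across draw-by-state cells
# but is unrelated to skill within each cell, so the coefficient is zero.
cells = ["2013-01|A", "2013-01|A", "2013-04|B", "2013-04|B"]
skill_scores = [9.0, 11.0, 10.0, 12.0]
teen_share = [0.10, 0.10, 0.30, 0.30]
phi_hat = balance_coefficient(teen_share, skill_scores, cells)
```

Under random assignment, estimates of this coefficient should be close to zero for every predetermined covariate, which is the pattern the balance tests look for.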
Under our identification assumption, we expect that there should be no significant correlation between the proxy for physicians’ skill level and the baseline characteristics of LHCs, municipalities, and demographics. A lack of significant relationships in these balance tests would suggest that the random assignment of physicians is indeed independent of the pre-existing characteristics of the LHCs and the populations they serve. Table 2 presents the coefficients (ϕ) from estimating equation (5) using OLS, with standard errors clustered at the LHC level. The results show no significant correlation between physicians’ skill level and either the pretreatment or concurrent LHC, municipal, or demographic characteristics, supporting the assumption that physician assignment is independent of these baseline characteristics.30

30 In table A.2, we replicate the analysis using the average of all four modules of the SABER PRO exam as a proxy for physicians’ skill levels, and we find consistent results.

Table 2: Covariate Balance at the LHC Level

| Covariate | Coefficient | Standard error |
|---|---|---|
| a. Pretreatment variables (2010–2012) | | |
| Unhealthy | 0.00744 | 0.00700 |
| Low birth weight | 0.00071 | 0.00198 |
| Prematurity | 0.00017 | 0.00330 |
| Low Apgar score | −0.00246 | 0.0037 |
| Insufficient prenatal care | −0.00255 | 0.00574 |
| Female infants | −0.00201 | 0.00315 |
| Mothers with basic education | −0.00220 | 0.00631 |
| Married mothers | 0.00004 | 0.00510 |
| Teenage mothers | 0.00607 | 0.00415 |
| Number of LHCs per municipality | −0.01687 | 0.01677 |
| Municipality population | −888.07 | 1,848.66 |
| b. Concurrent variables | | |
| Female infants | 0.00075 | 0.00266 |
| Mothers with basic education | 0.00181 | 0.00699 |
| Married mothers | 0.00095 | 0.00498 |
| Teenage mothers | −0.00047 | 0.00358 |
| Number of LHCs per municipality | −0.00777 | 0.02058 |
| Municipality population | −93.30 | 2,411.90 |

Note: This table presents the results of different LHC-by-cohort level regressions (equation 5) of the LHC-level variables, listed in the first column, on the measure of physicians’ skill level and the draw-by-state fixed effects. The coefficient and the standard error of the physicians’ skill variable are reported in the second and third columns, respectively. Standard errors are clustered at the LHC level. LHCs’ characteristics in panel a come from the 2010–2012 DANE VSRs, using a total of 1,837 LHC-by-cohort observations. LHCs’ characteristics in panel b come from the 2013–2015 DANE VSRs, using a total of 1,714 LHC-by-cohort observations. Unhealthy, our main measure of health at birth, is the proportion of newborn infants with at least one of the three following conditions: low birth weight, prematurity, or low Apgar score. Low birth weight is the proportion of newborn infants whose birth weight was less than 2,500 grams. Prematurity is the proportion of newborn infants who were born after fewer than 37 weeks of gestation. Low Apgar score is the proportion of newborn infants whose Apgar score was lower than 7. Insufficient prenatal care is the proportion of mothers who had fewer than four prenatal checkups. Female infants is the proportion of female infants. Mothers with basic education is the proportion of mothers with at least secondary education at the time they gave birth. Married mothers is the proportion of mothers that were married at the time they gave birth. Teenage mothers is the proportion of mothers who were 19 years old or younger at the time they gave birth. Number of LHCs per municipality is the count of LHCs in the birthplace municipality.
We interpret the non-significance of these estimates as evidence in favor of the randomness of the assignment of physicians.

4.2.2 Placebo Tests

To further assess internal validity, we conduct placebo tests by applying our estimation strategy to data from the four years preceding the SSO program (2009–2012) rather than to data from 2013–2016, the actual years of SSO physician assignments. Specifically, we shift the physicians’ arrival times four years earlier, simulating the same lottery draw dates, proposed start dates, and LHC assignments as in the main analysis, but for the period before the program began. We then estimate equation (4) (LATE) and (3) (reduced form) using the same outcomes and fixed effects as in our main analysis. Since physicians in our sample did not treat children born in 2009, 2010, 2011, and 2012, we would expect null effects. Table 3 shows that the point estimates are not statistically significant for our main outcome measure, unhealthy, and for each of the other birth outcome measures (low birth weight, prematurity, and low Apgar score).31 Our results are robust to the use of the first principal component as a proxy for skill, as well as to the inclusion of a set of controls, such as ex ante LHC and mother characteristics as well as a vector of mother-child sociodemographic information (figure A.5 and table A.4).

Table 3: Placebo Test

                              Unhealthy    LBW          Prematurity    Low Apgar
Average health scores         (1)          (2)          (3)            (4)
a. Reduced-form estimates
  Coefficient                 −0.0015      −0.0011      −0.0019        < 0.0001
  SE                          (0.0020)     (0.0010)     (0.0013)       (0.0014)
  Relative effect             −1.28%       −2.38%       −3.71%         0.04%
b. 2SLS estimates
  Coefficient                 −0.0019      −0.0014      −0.0024        < 0.0001
  SE                          (0.0024)     (0.0013)     (0.0016)       (0.0017)
  Relative effect             −1.58%       −2.93%       −4.59%         0.05%
Average dependent variable    0.118        0.046        0.052          0.046
Number of observations        261,216

Note: This table presents a placebo test in which we estimate equations (3) and (4) but move the arrival date of the physician back four years (2009–2012). The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7 and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects. Numbers in parentheses are LHC-level clustered standard errors. We read the results of this placebo test as additional evidence in favor of the randomness of the assignment of the physicians to LHCs. 2SLS = two-stage least squares. * p < 0.1, ** p < 0.05, *** p < 0.01

31 In table A.3, we repeat the same exercise and present the results for windows 3.5, 3, 2.5, and 2 years before the start of the SSO program.

4.3 Impacts on Birth Outcomes

In this section, we present our main results on the impact of physicians’ level of medical skill on birth outcomes. Figure 2 shows the first-stage estimate of π in equation (2). The average graduation exam score of the first SSO cohort is a relevant instrument for the weighted average graduation exam score of all SSO cohorts to which the mother was exposed during pregnancy. There is a positive and statistically significant relationship, with an estimated coefficient of 0.823 (standard error of 0.0013).
This strong correlation is expected because most mothers were exposed to only one or two cohorts (50 percent were exposed to only one), so the correlation is often close to one.

Table 4 presents the reduced-form (panel a) and LATE (panel b) estimates, while figure 3 displays the reduced-form estimates graphically. We find a substantial improvement in children’s health at birth when mothers are treated at an LHC randomly assigned a more-skilled SSO cohort. In particular, our main skill measure has a negative and significant effect on the unhealthy outcome measure and on each health outcome measure individually. In table 4, both panel a and panel b present the coefficients with standard errors in parentheses. Below the standard errors, the relative (percent) effect is computed by dividing each coefficient by the mean of the dependent variable.

Figure 2: Average Exam Score of All SSO Cohorts by Average Exam Score of First SSO Cohort
Note: This figure presents a binned scatter plot of the average graduation exam score of all SSO cohorts a mother was exposed to during pregnancy against the first cohort’s average exam score for the physicians in our sample. The regression fit corresponds to the first-stage estimate (π) presented in equation (2). The regression controls for draw-by-state fixed effects. The number in parentheses is the LHC-level-clustered standard error.

In the IV estimates in column (1) of panel b in table 4, we observe a significant negative relationship between the skill level of SSO physicians and the probability that a baby is born unhealthy: a decrease of 0.87 percentage points. That is, if a mother is assigned a cohort of SSO physicians whose scores in the health modules of the graduation exam were one standard deviation higher, the probability that her baby is born unhealthy decreases by 9.14 percent.
Notably, in our context, an increase of one standard deviation is almost equivalent to moving from having a physician from the bottom-ranked program to one from a median-ranked program, or from having a physician from a median-ranked program to one from the top-ranked program (see figure 1). In the education context, the teacher value-added literature (Chetty et al., 2014; Rothstein, 2017) has found that an increase in teacher quality of one standard deviation corresponded to an increase in students’ test scores of 0.19 standard deviations in math and 0.14 standard deviations in reading.

Columns (2) to (4) of panel b examine each birth outcome measure individually. The point estimates indicate a decrease in the probability of low birth weight of 0.41 percentage points (9.57 percent), a decrease in the probability of premature birth of 0.45 percentage points (10.99 percent),32 and a decrease in the probability of a low Apgar score of 0.43 percentage points (11.56 percent).33 Our results align with Amarante et al. (2016), who explore in utero exposure to a social assistance program in Uruguay to estimate its effects on birth outcomes. They find that participation in the program led to a “sizeable” (19–25 percent) reduction in the incidence of low birth weight. Similarly, Currie and Schwandt (2016a) find that fetal exposure to the toxic dust release during the collapse of the World Trade Center in New York City on 9/11 negatively affected gestation length, prematurity, birth weight, and low birth weight. Barber and Gertler (2010) evaluate the impact of a cash transfer program in Mexico on birth weight and find a very large reduction in the incidence of low birth weight (44.5 percent lower among beneficiary mothers).

32 These results are consistent with prior findings in the literature and in the Colombian context that prematurity is an important determinant of birth weight (Almond et al., 2005). We find a strong correlation between prematurity and low birth weight in Colombia. Figure A.1 shows a monotonic negative correlation between the probability of low birth weight and the number of gestational weeks for all births in Colombia between 2009 and 2012. The figure presents the local polynomial regression fit of the probability of low birth weight over the number of gestational weeks using all birth records in Colombia from 2009 to 2012.

33 Colombia’s infant mortality rate was 6.7 percent in 2022, lower than the average for middle-income countries and Latin America, but slightly higher than that of upper-middle-income countries. Due to substantial data limitations in mortality records, including over 30 percent of records that are missing information on the number of weeks of gestation as well as incomplete LHC data, we are compelled to conduct a cohort-level analysis instead of our preferred birth-level estimates. We compute cohort-level estimates of mortality in table A.6. As expected, the estimates are subject to measurement error attenuation. The results indicate that more-skilled physicians have a negative effect on mortality, though it is not statistically significant, consistent with our main findings.

Table 4: Main Estimates of the Effect of Physicians’ Skill Level on Birth Outcomes

                              Unhealthy      LBW          Prematurity    Low Apgar
Average exam scores           (1)            (2)          (3)            (4)
a. Reduced-form estimates
  Coefficient                 −0.0072***     −0.0034*     −0.0037***     −0.0036**
  SE                          (0.0022)       (0.0017)     (0.0014)       (0.0015)
  Relative effect             −7.52%         −7.88%       −9.05%         −9.52%
b. LATE estimates
  Coefficient                 −0.0087***     −0.0041**    −0.0045***     −0.0043**
  SE                          (0.0026)       (0.0021)     (0.0017)       (0.0019)
  Relative effect             −9.14%         −9.57%       −10.99%        −11.56%
Average dependent variable    0.095          0.043        0.041          0.038
Number of observations        255,089

Notes: This table presents our main estimates from equations (3) and (4). The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. The first-stage coefficient and standard error are shown in figure 2. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7 and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects. The numbers in parentheses are LHC-level-clustered standard errors. We interpret the high significance and consistency of these results across the different measures of birth outcomes as evidence of the important role that skilled physicians play in determining an infant’s health at birth. LATE = local average treatment effect. * p < 0.1, ** p < 0.05, *** p < 0.01

4.4 Physicians’ Impacts across Subgroups

In this section, we explore whether the effects of skilled physicians on birth outcomes, presented in the previous sections, are more pronounced among some subgroups. We focus solely on the LATE estimates from equation (4), although the reduced-form equation yields similar (rescaled) conclusions.
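The rescaling between the reduced-form and LATE estimates can be checked with simple arithmetic. With one instrument and one endogenous regressor, the IV coefficient equals the reduced-form coefficient divided by the first-stage coefficient, and the relative effect is the coefficient divided by the mean of the dependent variable. The sketch below applies both formulas to the rounded values printed in table 4 and in figure 2. The function and variable names are illustrative and not taken from the paper's replication package, and because the inputs are rounded, the outputs agree with the published numbers only approximately.

```python
# Rounded values copied from table 4 and from the first-stage estimate
# reported with figure 2. Inputs are rounded, so outputs match the
# published numbers only up to rounding error.
FIRST_STAGE = 0.823

REDUCED_FORM = {"unhealthy": -0.0072, "lbw": -0.0034,
                "prematurity": -0.0037, "low_apgar": -0.0036}
LATE_REPORTED = {"unhealthy": -0.0087, "lbw": -0.0041,
                 "prematurity": -0.0045, "low_apgar": -0.0043}
MEAN_OUTCOME = {"unhealthy": 0.095, "lbw": 0.043,
                "prematurity": 0.041, "low_apgar": 0.038}


def wald_ratio(reduced_form, first_stage):
    """Just-identified IV estimate: reduced form divided by first stage."""
    return reduced_form / first_stage


def relative_effect(coefficient, mean_outcome):
    """Coefficient expressed as a percentage of the outcome mean."""
    return 100.0 * coefficient / mean_outcome


for outcome, rf in REDUCED_FORM.items():
    implied = wald_ratio(rf, FIRST_STAGE)
    pct = relative_effect(LATE_REPORTED[outcome], MEAN_OUTCOME[outcome])
    print(f"{outcome:12s} implied LATE {implied:+.4f} "
          f"(reported {LATE_REPORTED[outcome]:+.4f}), "
          f"relative effect {pct:+.2f}%")
```

For the unhealthy outcome, the rounded inputs give a relative effect of roughly −9.2 percent rather than the reported −9.14 percent; the gap is consistent with the table having been computed from unrounded coefficients and means.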
The economics literature has extensively explored heterogeneous effects across different socioeconomic groups, using measures such as mother’s education, age, marital status, and the sex of the infant (Almond and Mazumder, 2011; Amarante et al., 2016; Currie and Schwandt, 2016a; Dinkelman, 2017; Eriksson et al., 2010; Hoynes et al., 2011; Okeke and Abubakar, 2020; Persson and Rossin-Slater, 2018). Consistent with these studies, our data include information from the VSRs on the infant’s sex and the mother’s education, age, and marital status, as well as whether the mother is a first-time mother.

We find that the effect of being assigned to a more-skilled physician on our main birth outcome measure, unhealthy, is slightly more pronounced among first-time mothers, teenage mothers, mothers with low education, and single mothers (see table 5). While this pattern suggests that more vulnerable mothers may benefit somewhat more from more-skilled physicians, none of these differences across mothers’ characteristics were statistically significant. In section 6, we will revisit this heterogeneity across mothers’ characteristics while discussing potential mechanisms.

Figure 3: Reduced-Form Estimates of the Effect of Physicians’ Skill Level on Birth Outcomes
Note: This figure presents a binned scatter plot of our main birth outcome measures against the first-cohort average graduation exam score. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7 and zero otherwise. Low birth weight is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. The regression fit corresponds to the reduced-form estimates (ρ) presented in equation (3). All regressions control for draw-by-state fixed effects. The numbers in parentheses are LHC-level-clustered standard errors. Results are robust to the exclusion of outliers.

Table 5: Heterogeneity of the Effects on Birth Outcomes across Subgroups of Mothers and Babies

Dependent variable: unhealthy

                            First-time      Non-first-time  Teenage         Non-teenage     Mothers with    Mothers with    Married         Single
                            mothers         mothers         mothers         mothers         low education   high education  mothers         mothers
                            (1)             (2)             (3)             (4)             (5)             (6)             (7)             (8)
Coefficient                 −0.0110***      −0.0077***      −0.0127***      −0.0079***      −0.0101***      −0.0075**       −0.0077***      −0.0106***
SE                          (0.0037)        (0.0023)        (0.0035)        (0.0027)        (0.0028)        (0.0032)        (0.0027)        (0.0031)
Relative effect             −10.05%         −9.02%          −11.19%         −8.93%          −10.14%         −8.42%          −8.90%          −9.81%
Average dependent variable  0.109           0.086           0.113           0.088           0.100           0.089           0.087           0.109
Number of observations      103,557         151,531         72,608          182,478         151,513         103,574         154,288         100,801
Difference test (p-value)   0.46 (1 vs. 2)                  0.28 (3 vs. 4)                  0.54 (5 vs. 6)                  0.47 (7 vs. 8)

Note: This table presents the heterogeneity of our estimated results from equation (4) when we divide the sample by mothers’ characteristics and infants’ gender. The coefficients represent the effect of being assigned a physician whose skill level is higher by one standard deviation for each subgroup. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7 and zero otherwise. First-time refers to the group of mothers who are giving birth to their first child, and non-first-time refers to the complementary group.
A mother is a teenage mother if she is giving birth at age 19 or younger and a non-teenage mother otherwise. A mother is a married mother if she is married at the moment of giving birth and a single mother otherwise. All regressions control for draw-by-state fixed effects. The numbers in parentheses are LHC-level-clustered standard errors. * p < 0.1, ** p < 0.05, *** p < 0.01

We also look at heterogeneity across infants’ and LHC characteristics. In table 6, columns 1 and 2, we examine whether the treatment effects vary by the infant’s sex. It has been established that male fetuses are more vulnerable to health shocks than female fetuses (Almond and Mazumder, 2011; Currie and Schwandt, 2016a; Eriksson et al., 2010; Kraemer, 2000; Naeye et al., 1971).34 It is possible that skilled physicians play an important role in mitigating negative shocks on more-vulnerable fetuses. Although the reduction in our measure of the number of unhealthy babies is particularly pronounced among male infants, we do not find any statistical difference between males and females.

Finally, we examine heterogeneity associated with the share of physicians from the SSO program relative to a proxy for the entire physician workforce in their local settings.35 Although the Ministry of Health (1990, 2001) specifies that SSO physicians are responsible for maternal care, including family planning and prenatal checkups, if the randomly assigned physicians do not constitute the entire workforce at the LHCs, the coefficient in our main regression (equation 4) may show larger effects in LHCs with a greater share of physicians from the SSO program (and hence greater exposure to the random assignment). To quantitatively test this idea, we implement two exercises. First, we estimate equation (4) separately for the subsets of LHCs with a high and a low share of physicians from the SSO program, where high (low) is defined as those LHCs above (below) the 75th percentile of the distribution of the shares. Table 6, columns 3 and 4, shows that, while the point estimate for LHCs with a higher share of SSO physicians is larger, there is not a significant difference between the two groups. In a second exercise, we re-estimate table 4 but add as a separate control the share of physicians from the SSO program. Table A.7 shows that the results are quantitatively the same. The point estimate’s lack of strong dependence on the share of SSO physicians may suggest that LHCs in our sample adhere closely to the regulation recommending that SSO physicians take primary responsibility for conducting prenatal care.

34 In medicine and epidemiology, this phenomenon is known as “fragile males” (Cameron, 2004; Eriksson et al., 2010; Kraemer, 2000; Mathews et al., 2008; Mizuno, 2000).

35 To calculate the share of physicians from the SSO program, we obtain the total number of physicians for each municipality using ReTHUS and PILA data (see section 3). While this share is calculated at the municipality level, it is equivalent to calculating at the LHC level for 97.3 percent of municipalities, as only 2.7 percent of municipalities in our sample have more than one LHC.

Table 6: Heterogeneity of the Effects on Birth Outcomes across Subgroups of LHCs

Dependent variable: unhealthy

                            Female          Male            Higher share of   Lower share of
                            infants         infants         SSO physicians    SSO physicians
                            (1)             (2)             (3)               (4)
Coefficient                 −0.0072***      −0.0106***      −0.0145**         −0.0087***
SE                          (0.0028)        (0.0029)        (0.0067)          (0.0027)
Relative effect             −7.73%          −10.92%         −13.74%           −9.19%
Average dependent variable  0.093           0.097           0.106             0.095
Number of observations      124,577         130,508         14,894            240,191
Difference test (p-value)   0.40 (1 vs. 2)                  0.42 (3 vs. 4)

Note: This table presents the heterogeneity of our estimated results from equation (4) when we divide the sample by LHC. The coefficients represent the effect of being assigned a physician whose skill level is higher by one standard deviation for each subgroup. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7 and zero otherwise. An LHC with a higher incidence of unhealthy (lower incidence of unhealthy) is an LHC above (below) the 75th percentile of the ex ante unhealthy proportion distribution. An LHC with a higher share of SSO physicians (lower share of SSO physicians) is an LHC above (below) the 75th percentile of the ex ante share of the SSO physicians proportion distribution in the SSO sample. All regressions control for draw-by-state fixed effects. We interpret these results as evidence of a (weak) significant difference between the effect of physicians in LHCs with a high and low incidence of poor health. The numbers in parentheses are LHC-level-clustered standard errors. * p < 0.1, ** p < 0.05, *** p < 0.01

4.5 Physicians’ Value-Added

While our empirical setting links birth outcomes to physicians’ skill levels, the estimated coefficients should not be interpreted as the effect of exogenously increasing physicians’ skills while keeping all else constant. Instead, we identify the effect of being treated at an LHC that is randomly assigned a more-skilled cohort of SSO physicians compared to a less-skilled cohort, including all the characteristics that may differ between these two groups of physicians. Our results can be informative for policy makers, as test scores are often an observable proxy for skills.
However, it is important to note that test scores may not capture all the factors that influence clinical competence, so this measure may understate the role of physicians’ skill levels in determining patient health outcomes.

To estimate physicians’ broader contribution to children’s health at birth, we take advantage of the random assignment of physicians to LHCs and compute a relative measure of value-added. Consider the model where child $i$’s potential birth outcome when assigned to a cohort of SSO physicians $j$, denoted by $Y_{j,i}$, can be written as the sum of two components:

$$Y_{j,i} = v_j + \alpha_i, \tag{6}$$

where $v_j$ is the average potential effect of physician $j$ on child’s outcomes at birth and $\alpha_i$ is the child’s latent health at birth. Let $D_{j,i}$ be a dummy variable indicating whether child $i$’s mother was assigned to cohort of SSO physicians $j$. The observed birth outcome for child $i$ can then be expressed as

$$Y_i = Y_{s,i} + \sum_{j=1}^{J} (Y_{j,i} - Y_{s,i}) D_{j,i} = v_s + \sum_{j=1}^{J} \beta_j D_{j,i} + \alpha_i, \tag{7}$$

where $v_s$ represents the average potential outcome associated with a reference cohort of SSO physicians, indexed by $s$, and the parameter $\beta_j$ measures the value-added of cohort $j$ relative to this reference cohort. In most settings, the match between physicians and patients is at risk of being correlated with other patients’ unobserved characteristics, implying that the estimation of equation (7) using OLS would result in potentially biased estimates of the physician’s value-added. Now, consider the projection version of equation (7) but controlling for the draw-by-state fixed effects $\gamma_d$:

$$Y_i = \gamma_d + \sum_{j=1}^{J} \beta_j D_{j,i} + \varepsilon_i. \tag{8}$$

Since, in our setting, several cohorts of SSO physicians applied to a specific state and were randomly assigned to LHCs in that state, we have $E[D_{j,i}\,\varepsilon_i \mid \gamma_d] = 0$ for all $j = 1, \ldots, J$, and OLS estimates can identify the causal effect of being randomly assigned a more-skilled cohort of SSO physicians on children’s outcomes.
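The step from equation (6) to the second line of equation (7) can be spelled out by direct substitution. The derivation below is a sketch of our reading of the model: it assumes each child is linked to at most one non-reference cohort, and the resulting expression for $\beta_j$ is implied by equations (6) and (7) rather than stated explicitly in the text.

```latex
\begin{align*}
Y_i &= Y_{s,i} + \sum_{j=1}^{J} (Y_{j,i} - Y_{s,i})\, D_{j,i} \\
    &= (v_s + \alpha_i)
       + \sum_{j=1}^{J} \bigl[(v_j + \alpha_i) - (v_s + \alpha_i)\bigr] D_{j,i} \\
    &= v_s + \sum_{j=1}^{J} (v_j - v_s)\, D_{j,i} + \alpha_i ,
\end{align*}
so that $\beta_j = v_j - v_s$: the child-specific term $\alpha_i$ cancels inside
the sum, and each coefficient is a difference relative to the reference cohort $s$.
```

Under this reading, only differences in value-added across cohorts are identified, which is why the estimates are described as relative value-added.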
Due to the random nature of physicians’ assignment to LHCs at the draw-by-state level, we estimate physicians’ relative value-added by first running a regression of the unhealthy indicator on the draw-by-state fixed effects:

$$Y_i = \gamma_d + r_i. \tag{9}$$

We then compute residuals $\hat{r}_i$ from equation (9) and regress them on the $J$ different assignment indicator dummies to recover the estimated physician effect:

$$\hat{r}_i = \sum_{j=1}^{J} \beta_j D_{j,i} + \epsilon_i, \tag{10}$$

where the $\hat{\beta}_j$, estimated using OLS, is an unbiased estimate of physician $j$’s effect on children’s health relative to the draw-by-state average. Since the outcome in equation (9) is the probability of being born unhealthy, a “smaller” value-added has a positive connotation.

The empirical value-added literature typically shrinks the value-added estimates toward a common Bayesian prior (Herrmann et al., 2016). The benefit of the shrinkage procedure is to produce estimates of value-added for which the estimation error variance is reduced through the dependence on the stable prior. In practice, the prior is specified as the average value-added (Chetty et al., 2014; Kane et al., 2008).36 The weight applied to the prior for a cohort of physicians is an increasing function of the variance with which that value-added is estimated. The following formula describes the empirical shrinkage procedure estimated:

$$\hat{\beta}_j^{EB} = a_j \hat{\beta}_j + (1 - a_j)\,\bar{\beta}, \qquad a_j = \frac{\hat{\sigma}^2}{\hat{\sigma}^2 + \hat{\lambda}_j},$$

where $\hat{\sigma}^2$ is the estimated variance of value-added and $\hat{\lambda}_j$ is the squared standard error of $\hat{\beta}_j$. Our shrunken value-added result implies that assigning a team of physicians at the 25th percentile of the skill distribution, compared to the 75th percentile, would increase the likelihood of a child’s being unhealthy by approximately 0.08 standard deviations.
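The two-step procedure in equations (9) and (10) and the shrinkage formula can be illustrated on toy data. The sketch below is not the authors' implementation and makes several simplifying assumptions that are not in the paper: each birth is linked to exactly one cohort, the squared standard error of each cohort effect is the naive within-cohort variance divided by the cohort size (the paper clusters at the LHC level), the variance of value-added is estimated by a simple method-of-moments rule, and the prior is the unweighted mean of the cohort effects. All names and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def make_births(group, cohort, births, unhealthy):
    """Hypothetical helper: expand counts into (group, cohort, outcome) rows."""
    return ([(group, cohort, 1.0)] * unhealthy
            + [(group, cohort, 0.0)] * (births - unhealthy))


# Toy data (not from the paper): two draw-by-state cells, two cohorts each.
records = (make_births("d1", "A", 40, 2) + make_births("d1", "B", 10, 3)
           + make_births("d2", "C", 30, 3) + make_births("d2", "D", 30, 6))


def residualize(rows):
    """Equation (9): subtract the draw-by-state mean from each outcome."""
    by_group = defaultdict(list)
    for group, _cohort, y in rows:
        by_group[group].append(y)
    group_mean = {g: mean(ys) for g, ys in by_group.items()}
    return [(cohort, y - group_mean[group]) for group, cohort, y in rows]


def cohort_effects(residuals):
    """Equation (10): OLS on mutually exclusive cohort dummies without an
    intercept reduces to the mean residual within each cohort."""
    by_cohort = defaultdict(list)
    for cohort, r in residuals:
        by_cohort[cohort].append(r)
    beta = {c: mean(rs) for c, rs in by_cohort.items()}
    # Assumption: naive squared standard error, ignoring clustering.
    lam = {c: pvariance(rs) / len(rs) for c, rs in by_cohort.items()}
    return beta, lam


def shrink(beta, lam):
    """Shrink each estimate toward the mean with weight a_j on the estimate."""
    prior = mean(list(beta.values()))
    # Assumption: method-of-moments estimate of the value-added variance.
    sigma2 = max(0.0, pvariance(list(beta.values())) - mean(list(lam.values())))
    shrunk = {}
    for c in beta:
        denom = sigma2 + lam[c]
        a = sigma2 / denom if denom > 0 else 1.0
        shrunk[c] = a * beta[c] + (1.0 - a) * prior
    return shrunk


beta_hat, lambda_hat = cohort_effects(residualize(records))
beta_eb = shrink(beta_hat, lambda_hat)
for c in sorted(beta_hat):
    print(f"cohort {c}: raw {beta_hat[c]:+.4f}, shrunken {beta_eb[c]:+.4f}")
```

In this toy example cohort B has the fewest births, so its estimate is the noisiest and is pulled furthest toward the prior, which is the behavior the weight $a_j$ is designed to produce.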
We then regress the unbiased value-added estimates on multiple physician characteristics, including average performance on the health modules of the graduation exam, to study which SSO physician characteristics correlate more with the estimated value-added effects.37 Columns (1) and (2) of table 7 show the results of regressing these estimated physician effects on different sets of physician characteristics. Columns (3) and (4) control for the additional observable characteristics (children and LHC) in equation (9) (e.g., LHC health indicators) to account for the quality of other, potentially longer-term-appointed physicians at the LHC. Table 7 shows that results are similar to the ones presented in columns (1) and (2).

We interpret the results from table 7 as evidence of the relevance of the health-specific graduation exam scores for predicting physicians’ skills. Column (1) shows that the graduation exam score is negatively and significantly correlated with the physicians’ relative value-added. An increase of one standard deviation in the exam score is associated with a 0.0124 percentage point improvement in value-added.38 However, the significant relationship between the scores and the relative value-added could be the result of the graduation exam score’s correlation with other physician characteristics, which could be more relevant and closely associated with the physicians’ performance. To test this hypothesis, we regress the estimated relative value-added on the physicians’ exam scores and other characteristics that were observed at the same time as exam scores, including gender, family socioeconomic characteristics, and some proxies for the quality of the medicine program they attended. The results in column (2) show that, not only does the coefficient on test scores remain significant and statistically similar to the one in column (1), but also, once we account for the exam score, none of the other observed physician characteristics have a significant correlation with the physicians’ performance. These two results highlight the relevance of the graduation exam scores as both a practical, observable tool and as an indicator with high predictive power. Finally, as expected, columns (3) and (4) indicate that the random assignment allows us to obtain similar results even when controlling for other physician characteristics in the value-added estimation.

36 This benefit is particularly valuable in applications where we want an estimator that performs well on average (Angrist et al., 2017; Chetty et al., 2014; Harris and Sass, 2014; Kane et al., 2008), reducing mean squared error.

37 Since we are working with cohorts of SSO physicians, these characteristics will be calculated as averages.

38 Note that this coefficient should be similar to the one estimated in table 4 but does not have to be the same; the regression in table 4 is at the child level, whereas the regression in table 7 is at the cohort level.
Table 7: Physicians’ Observable Characteristics and Their Relative Value-Added

Dependent variable:               Value-added without controls    Value-added with controls
                                  (1)           (2)               (3)           (4)
Average exam scores               −0.0141**     −0.014**          −0.0152**     −0.0159***
                                  (0.0063)      (0.0063)          (0.0061)      (0.0061)
Female                                          −0.0016                         −0.0036
                                                (0.009)                         (0.0084)
Father with tertiary education                  −0.0007                         0.0035
                                                (0.008)                         (0.0077)
Mother with tertiary education                  −0.0082                         −0.0122
                                                (0.0103)                        (0.0097)
Father or the mother has a job                  0.0025                          0.0075
                                                (0.0101)                        (0.0094)
Top program                                     −0.0119                         −0.0106
                                                (0.0154)                        (0.0139)
Top income                                      0.0059                          0.0116
                                                (0.0115)                        (0.0101)
Public school                                   0.0019                          0.0078
                                                (0.0141)                        (0.0132)
Accredited program                              0.0123                          0.0101
                                                (0.0093)                        (0.0088)

Note: This table reports the results of regressing physicians’ estimated relative value-added on observable characteristics across 1,248 cohorts of physicians. Each column from (1) to (4) refers to a different regression. The regressors, listed in the first column, are expressed in relative terms with respect to the by-draw and by-state average. Column (1) includes only the average graduation exam score as a regressor. Column (2) includes other physician characteristics as well. Columns (3) and (4) present the results of analogous exercises where relative value-added is estimated as in equation (8) but also using the following observed child and mother characteristics as controls: an indicator variable for the sex of the infant; an indicator variable that takes a value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes a value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise. The numbers in parentheses are LHC-level-clustered standard errors. We interpret the results from this table as evidence of the distinctive relevance of the health-specific graduation exam scores in predicting physicians’ performance. * p < 0.1, ** p < 0.05, *** p < 0.01

5 Robustness Checks and Additional Exercises

5.1 Additional Controls

For robustness, we run additional specifications, adding different sets of controls. We show that our results are robust to including ex ante LHC characteristics and a vector of sociodemographic information about the mother and child. The estimated coefficients are stable with the inclusion of controls. We report the results with and without controls in table A.9.

5.2 Alternative Definitions of Physicians’ Skill Levels

First, we use a (standardized) principal component instead of the standardized average health-specific graduation exam scores as a proxy for physicians’ skill levels.
In addition, we use the average of the four SABER PRO modules (health care, disease prevention, reading comprehension, and quantitative reasoning) and each individual test score as proxies for physicians’ skill levels before the SSO program. Figure A.8 (and table A.9) compares the estimated relative coefficient (dividing by the dependent variable mean), β, in equation (4) using the average (main specification) and the principal component of the graduation exam scores both with and without controls, while table A.5 presents the results using each individual test score. Our conclusions are similar across the different definitions of physicians’ skill levels and with the inclusion of controls.

5.3 Alternative Definitions of the Main Outcome

We standardize, center, and aggregate the three main health outcomes (low birth weight, prematurity, and low Apgar score) using the inverse covariance index suggested by Anderson (2008) and repeat our main empirical analysis using the index as the dependent variable. In table A.10, we present the results using the covariance index and our main outcome, unhealthy (standardized), as dependent variables.39 As before, we see that our conclusions are similar regardless of the definition of the main outcome.

5.4 Nonlinear Estimations

The average prevalence of the outcomes considered is relatively low and around 4 percent. One concern is that a linear regression may not fit the data well. To address this concern, we estimate an analogous logit model based on equation (3) and compute the average marginal effect associated with being treated in an LHC assigned an SSO cohort whose skill level is one standard deviation higher. Table A.11 shows that the marginal effects (signs and magnitudes) are very similar to those estimated using a linear regression model.
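To illustrate why a logit average marginal effect can land close to the linear-probability coefficient even when the outcome is rare, the sketch below fits a one-regressor logit by Newton-Raphson on made-up grouped data and compares its average marginal effect with the OLS slope. The data, function names, and the omission of draw-by-state fixed effects are all assumptions made for illustration; this is not the specification behind table A.11.

```python
import math

# Hypothetical grouped data (not from the paper):
# (skill z-score, number of births, number of unhealthy births).
CELLS = [(-1.0, 1000, 50), (0.0, 1000, 40), (1.0, 1000, 30)]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logit(cells, iterations=50):
    """Newton-Raphson for a logit with an intercept and one regressor."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n, k in cells:
            p = sigmoid(a + b * x)
            w = n * p * (1.0 - p)      # weight in the information matrix
            g0 += k - n * p            # score for the intercept
            g1 += x * (k - n * p)      # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def average_marginal_effect(cells, a, b):
    """Mean over births of the derivative b * p * (1 - p)."""
    total = sum(n for _, n, _ in cells)
    effect = 0.0
    for x, n, _ in cells:
        p = sigmoid(a + b * x)
        effect += n * b * p * (1.0 - p)
    return effect / total


def lpm_slope(cells):
    """OLS slope of the binary outcome on the regressor (linear probability)."""
    total = sum(n for _, n, _ in cells)
    x_bar = sum(n * x for x, n, _ in cells) / total
    y_bar = sum(k for _, _, k in cells) / total
    s_xy = sum((x - x_bar) * (k - n * y_bar) for x, n, k in cells)
    s_xx = sum(n * (x - x_bar) ** 2 for x, n, _ in cells)
    return s_xy / s_xx


a_hat, b_hat = fit_logit(CELLS)
print(f"logit AME: {average_marginal_effect(CELLS, a_hat, b_hat):+.5f}")
print(f"LPM slope: {lpm_slope(CELLS):+.5f}")
```

In this toy setting the two numbers are nearly identical because the fitted probabilities vary over a narrow range, so the logistic curve is close to linear there.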
5.5 Impact across Distribution of Skills

Finally, while OLS allows us to compute the average effect of physicians’ skill levels, it does not tell us much about the magnitude of this effect across the distribution of physicians’ skills. We divide the score into quartiles and estimate equation (4) using a set of dummy variables indicating the quartile of the score distribution to which physicians belong. The results are presented in table A.12. Columns (1) and (2) present the coefficients associated with the effect of belonging to the second, third, and fourth quartiles of the distribution of the average of the graduation exam scores and the first principal component, respectively, on our main birth outcome measure, unhealthy, relative to the first quartile. Although we lack the power to find statistically significant differences, we see that the point estimates are negative and monotonically decreasing with respect to the quartile. This suggests potential gains are associated with being assigned to more-skilled physicians across the whole distribution of skills. In table A.14, we also interact the average score with the university’s (program’s) average score to test whether top universities drive the estimated effect. We do not find evidence that top-ranked universities drive the effects presented earlier.

39 Note that the adjusted standardized coefficients (in standard deviations) are very similar for both specifications.

6 Potential Mechanisms

Physicians differ systematically in the decisions they make when faced with similar cases (Chan et al., 2022). Likewise, the previous literature has found differences in practice patterns and identified how these practices affect health outcomes (Tsugawa et al., 2017).
Some dimensions of these practices, such as the quality of the medical advice doctors provide, are unobservable (Das et al., 2008; Leonard and Masatu, 2007; Mullainathan and Obermeyer, 2022), whereas others, such as the number of prenatal checkups they offer, are observable. In this section, we study prenatal checkups as a potential mechanism for observed differences between more-skilled and less-skilled physicians.

6.1 Prenatal Checkups

We first explore whether more-skilled physicians increase the number of prenatal checkups that mothers have, as a mechanism to improve the quality of health care and birth outcomes. Although most of the evidence from economics and medicine shows an important association between prenatal care and both birth weight and prematurity, some disagreements persist (Alexander and Korenbrot, 1995; Amarante et al., 2016; Barber and Gertler, 2010; Carrillo and Feres, 2019; Conway and Deb, 2005; Currie and Grogger, 2002; Grossman and Joyce, 1990; Kramer, 1987; McCormick and Siegel, 2001). According to the WHO (2016) and the Colombian government (Gomez et al., 2013), prenatal care improves the health status of both mother and child. As noted above, in Colombia, the Ministry of Health requires physicians to carry out prenatal monitoring (Gomez et al., 2013). We follow the standard recommended by the WHO (2016) for our period of analysis and measure “adequate prenatal care” as having at least four checkups during pregnancy. We do not find evidence that more-skilled doctors reduce the probability that mothers are scheduled for fewer than four prenatal checkups (see table A.8).
We expect that SSO physicians assigned to rural areas are time constrained, as they are usually the only physicians in those areas.40 Anecdotal evidence supports this argument: in various reports from Colombian medical associations, physicians describe their experience during the SSO year as characterized by an overwhelming workload and long working hours.41 In this setting, in which physicians are time constrained, it comes as no surprise that the overall likelihood that a mother has a sufficient number of prenatal checkups is not significantly affected by the skill level of the physicians. However, we might expect that more-skilled physicians could better target care, allocating resources more effectively to more-vulnerable mothers without compromising the care of lower-risk mothers. Therefore, using the graduation exam scores, we analyze whether more-skilled physicians target their prenatal checkups toward more-vulnerable mothers, that is, those who are more likely to give birth to unhealthy babies. Supporting this argument, one of the health-specific exam modules directly evaluates the physician’s skill to “analyze the personal, social, economic, and environmental determinants that influence the health status of the individual, family, and community, in order to prioritize actions to be taken.”

Recent studies have focused on applying machine learning techniques to analyze physicians’ decision-making in diagnoses (Mullainathan and Obermeyer, 2022; Stern and Trajtenberg, 1998). Taking a similar approach, we conceptualize the likelihood that a baby is born unhealthy as a predictive problem, leveraging recent advancements 42 in these techniques to generate two groups of predictions about the probability that a mother gives birth to an unhealthy baby, using a set of mother-LHC characteristics that are available to the physician at the time of prenatal care. Specifically, we incorporate in the prediction all the characteristics listed in tables 5 and 6.
We apply algorithms that are commonly used in the machine learning literature: random forest and logistic regression models.43 The sample is split into training and testing subsets of randomly selected LHCs using a K-means algorithm. We repeat this procedure, splitting the main sample using K-means, 1,000 times. We run logit and random forest models on the training sets and use the models to predict the probability of giving birth to an unhealthy child for each testing subset.44 We then divide the test sample into two groups: low and high predicted probability, defined as mothers with a probability of giving birth to an unhealthy child below and above the 75th percentile, respectively, for each of the two model predictions.45 We estimate the reduced form equation (3) using a dummy equal to 1 if the number of prenatal checkups is fewer than four as our main outcome in each of the previously defined groups (i.e., low and high predicted probability of giving birth to an unhealthy child). Table 8 presents the average coefficient and standard error for the 1,000 repetitions.46 Columns (1) and (2) present the results for the sample of mothers with a low predicted probability of giving birth to an unhealthy child, and columns (3) and (4) for the sample of mothers with a high predicted probability of giving birth to an unhealthy child. We include the results both with and without controls.

Table 8 shows that regardless of the method we use, more-skilled doctors do not seem to change the recommended number of prenatal checkups for mothers with a low predicted probability of giving birth to an unhealthy child. Instead, they target prenatal checkups toward the more-vulnerable mothers, measured as mothers with a high predicted probability of giving birth to an unhealthy baby, but without compromising the care of lower-risk mothers. Consistent with our suggested mechanism (that more-skilled physicians are better able to target care toward more-vulnerable mothers), we find stronger effects of physicians’ skill levels when we focus on mothers with a higher predicted probability than when we focus on those with a lower predicted probability. While the point estimate for the effect of physicians’ skill levels on the likelihood of having fewer than four prenatal checkups in the lower predicted probability sample is between −0.0011 and 0.0003, depending on the prediction used to divide the data, the point estimate for the higher predicted probability group is between −0.016 and −0.024 (that is, between 1.6 and 2.4 percentage points).

40 The median number of physicians per LHC in these rural areas is three.
41 See, for example, reports from the Colegio Médico Colombiano (2018) and the Universidad del Rosario (2015).
42 Supervised machine learning seeks to solve the problem of prediction (Kleinberg et al., 2015). Athey and Imbens (2017) and Mullainathan and Spiess (2017) emphasize that machine learning is significantly better at making predictions, in part because it can use very flexible functional forms and fit complex data structures without imposing any specific restrictions in advance. According to Mullainathan and Spiess (2017), machine learning algorithms can do significantly better than traditional methods, even with moderate sample sizes and few covariates.
43 These methods are able to handle many covariates, and they provide natural estimators of parameters when these are highly complex. The focus in the machine learning literature is often on working properties of algorithms in specific settings. See Mullainathan and Spiess (2017) for a review of the literature and Breiman (2001) for a description of the methods.
44 We follow Chernozhukov et al. (2018) and rescale the outcomes and covariates to be between 0 and 1 before training.
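The sample-splitting logic described above can be sketched as follows. This is a simplified illustration, not the authors' code: it assigns whole LHCs to training and testing subsets with a plain random shuffle (an assumption standing in for the K-means step), uses within-cell outcome shares as a stand-in for the logit and random forest predictors, and flags test-set mothers whose predicted risk lies above the 75th percentile. The record fields (`lhc`, `cell`, `unhealthy`) and all function names are hypothetical.

```python
import random
import statistics


def split_lhcs(lhc_ids, train_share=0.5, seed=0):
    """Assign whole LHCs (not individual births) to train or test."""
    ids = sorted(set(lhc_ids))
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_share)
    return set(ids[:cut]), set(ids[cut:])


def fit_cell_shares(rows):
    """Stand-in predictor: share of unhealthy births per covariate cell."""
    totals, counts = {}, {}
    for r in rows:
        totals[r["cell"]] = totals.get(r["cell"], 0) + r["unhealthy"]
        counts[r["cell"]] = counts.get(r["cell"], 0) + 1
    overall = sum(totals.values()) / sum(counts.values())
    shares = {cell: totals[cell] / counts[cell] for cell in totals}
    return shares, overall


def predict(rows, shares, overall):
    """Out-of-sample predictions; unseen cells get the overall share."""
    return [shares.get(r["cell"], overall) for r in rows]


def flag_high_risk(predictions):
    """True where the prediction lies above the 75th percentile."""
    cutoff = statistics.quantiles(predictions, n=4)[2]
    return [p > cutoff for p in predictions]


def relative_effect(coefficient, dep_var_mean):
    """Coefficient as a percentage of the dependent-variable mean."""
    return 100.0 * coefficient / dep_var_mean


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical births: 40 LHCs, risk cell = (teen mother, low education).
    rows = []
    for _ in range(4000):
        cell = (rng.random() < 0.2, rng.random() < 0.4)
        risk = 0.05 + 0.05 * cell[0] + 0.03 * cell[1]
        rows.append({"lhc": rng.randrange(40), "cell": cell,
                     "unhealthy": int(rng.random() < risk)})

    train_ids, test_ids = split_lhcs(r["lhc"] for r in rows)
    train = [r for r in rows if r["lhc"] in train_ids]
    test = [r for r in rows if r["lhc"] in test_ids]

    shares, overall = fit_cell_shares(train)
    high = flag_high_risk(predict(test, shares, overall))
    print("test births:", len(test), "flagged high risk:", sum(high))
    # Illustrative numbers only, not taken from the paper's tables.
    print("relative effect (%):", relative_effect(-0.02, 0.20))
```

Because the stand-in predictor takes only a few distinct values, ties at the cutoff mean the flagged group need not be exactly one quarter of the test sample. In the procedure described in the text, the split is repeated 1,000 times and the regression estimates are averaged across repetitions; that outer loop and the regression step are omitted here.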
The estimates in table 8 suggest that an increase of one standard deviation in physicians’ average graduation exam score decreases the probability that mothers are scheduled for fewer than four prenatal checkups by between 9.49 and 13.17 percent for mothers with a high predicted probability of giving birth to an unhealthy child. Taken together, the results from this section are consistent with a story of time-constrained physicians who are not able to increase the average time they spend in prenatal checkups but who are better at targeting care toward more-vulnerable mothers.

45 Liberman et al. (2018) and ? follow a similar strategy when studying the effects of information deletion and usury rates on consumer credit markets.
46 Figure A.6 shows the distribution of the estimated coefficients for the 1,000 repetitions.

Table 8: Number of Prenatal Checkups by Predicted Probability of an Unhealthy Child

Dependent variable: prenatal checkups < 4

                     Low predicted probability     High predicted probability
                     of an unhealthy child         of an unhealthy child
                     Without        With           Without        With
                     controls       controls       controls       controls
                     (1)            (2)            (3)            (4)
a. Logit
  Coefficient        0.0015         0.0001         −0.0203**      −0.0236**
  SE                 (0.0097)       (0.0098)       (0.009)        (0.0094)
  Relative effect    0.94%          0.05%          −11.29%        −13.17%
b. Random forest
  Coefficient        0.0003         −0.0011        −0.0163*       −0.0194**
  SE                 (0.0099)       (0.01)         (0.0088)       (0.009)
  Relative effect    0.20%          −0.67%         −9.49%         −11.27%

Note: This table reports the differential effects of physicians on the number of prenatal checkups a mother has by her predicted probability of giving birth to an unhealthy child. To predict the probability of an unhealthy child, we divide our data into training and testing subsets of randomly selected LHCs using a K-means algorithm. On the training sets, we run logit and random forest models of the probability of being born unhealthy on our usual set of mother and LHC ex ante covariates, and we use the estimations to predict the probability of giving birth to an unhealthy child on each testing subset. Using the prediction on the testing sample, we divide each subset into groups with a low and high predicted probability of giving birth to an unhealthy child, defined as mothers with a probability of an unhealthy child below and above the median, respectively. The coefficients (β) represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation on the probability of having insufficient (fewer than four) prenatal checkups. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. All regressions control for draw-by-state fixed effects. The numbers in parentheses are LHC-level-clustered standard errors. We interpret the non-significant effect for the low predicted probability of an unhealthy child group and the significant effect for the high predicted probability of an unhealthy child group as evidence consistent with the idea that more-skilled physicians are better at targeting care toward more-vulnerable mothers. * p < 0.1, ** p < 0.05, *** p < 0.01

6.1.1 Effect on Probability of Giving Birth to an Unhealthy Child

We next ask, consistent with the idea that more-skilled physicians are better at targeting care toward more-vulnerable mothers without compromising the care of lower-risk mothers, whether being assigned to a more-skilled cohort of SSO physicians reduces the probability that a mother gives birth to an unhealthy child, particularly among the most vulnerable mothers. Table 9 shows that being assigned to more-skilled doctors seems to improve the health at birth of children for all mothers (i.e., whether they have a low or high predicted probability of having an unhealthy child). However, the effect is more pronounced, regardless of the method we use to split the sample, for mothers with an (ex ante) high predicted probability of having an unhealthy child.
In particular, for the more-vulnerable mothers, being assigned to a physician whose graduation exam scores were one standard deviation higher decreases the probability of an unhealthy baby by around 9.45 percent, while for mothers with an (ex ante) low predicted probability of an unhealthy baby, the effects are smaller in magnitude, close to 8.71 percent.47

47 Figure A.7 presents the distribution of the estimated coefficients for the 1,000 repetitions for the four outcomes studied.

Table 9: Main Outcomes by Predicted Unhealthiness

Dependent variable: unhealthy

                     Low predicted probability     High predicted probability
                     of an unhealthy child         of an unhealthy child
                     Without        With           Without        With
                     controls       controls       controls       controls
                     (1)            (2)            (3)            (4)
a. Logit
  Coefficient        −0.0079***     −0.0076***     −0.0111**      −0.0115**
  SE                 (0.0026)       (0.0026)       (0.0045)       (0.0046)
  Relative effect    −9.11%         −8.73%         −9.20%         −9.49%
b. Random forest
  Coefficient        −0.0079***     −0.0076***     −0.0111**      −0.0115**
  SE                 (0.0026)       (0.0026)       (0.0045)       (0.0046)
  Relative effect    −9.07%         −8.71%         −9.41%         −9.45%

Note: This table reports the differential effects of physicians on the probability of a child’s being born unhealthy, by the mother’s predicted probability of giving birth to an unhealthy child. To predict the probability of an unhealthy child, we divide our data into training and testing subsets of randomly selected LHCs using a K-means algorithm. On the training sets, we run logit and random forest models of the probability of being born unhealthy on our usual set of mother and LHC ex ante covariates, and we use the estimations to predict the probability of giving birth to an unhealthy child on each testing subset. Using the prediction on the testing sample, we divide each subset into groups with a low and high predicted probability of giving birth to an unhealthy child, defined as mothers with a probability of an unhealthy child below and above the median, respectively.
Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. The coefficients (β) represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation on unhealthy. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. All regressions control for draw-by-state fixed effects. The numbers in parentheses are LHC-level-clustered standard errors. The results show how, consistent with the idea that more-skilled physicians are better at targeting care toward more-vulnerable mothers, the negative effects on the probability of having an unhealthy child are particularly pronounced among more-vulnerable mothers. * p < 0.1, ** p < 0.05, *** p < 0.01

6.2 Other Mechanisms

In addition to more-skilled physicians’ ability to target care, we discuss two alternative mechanisms through which assigning more-skilled SSO physicians to an LHC may impact birth outcomes: the potential sorting of patients to LHCs and the potential impact of practice styles on LHC outcomes.

6.2.1 Patient Assignment

It is possible that the presence of more-skilled physicians in an LHC could influence the demographics of the mothers seeking care, for instance by attracting more-vulnerable mothers from the local municipality and nearby areas. However, the evidence from our study suggests that this kind of sorting of patients by physicians’ skill level is limited. Moreover, because we focus on non-metropolitan areas that generally have no more than one LHC, there are few practical alternatives for mothers to seek care elsewhere, further reducing the likelihood of sorting.
In particular, there are no significant correlations between the skill level of physicians in an LHC and various demographic characteristics of the mothers they treat, such as education level, marital status, and age (table 2). This absence of significant correlations suggests that the arrival of more-skilled physicians at an LHC does not systematically attract mothers with specific demographic profiles. Furthermore, the temporary nature of the government’s SSO program, which places physicians in an LHC for only 12 months, further constrains the potential for long-term patient sorting across LHCs. Given this limited duration, the opportunity for mothers to switch their care preferences based on physicians’ skill level is limited, suggesting that the observed improvements in birth outcomes are primarily attributable to the direct effects of physician competence rather than changes in the patient mix.

6.2.2 Practice Style

The influence of physician practice styles and practice environments on treatment outcomes has been extensively documented in the literature. This body of work highlights how physician-specific factors such as personal preferences, training backgrounds, and accumulated experience can lead to significant variations in treatment approaches within the same local health care markets, resulting in persistent style differences among physicians (Epstein and Nicholson, 2009; Grytten and Sørensen, 2003; Molitor, 2018; Phelps, 1992). While these individual tendencies contribute to disparities in medical practices, environment-specific factors such as hospital resources, staff productivity, and financial incentives also shape practice styles, suggesting a complex interplay between individual and environmental influences (Molitor, 2018).

Our analysis suggests that the potential impact of matching individual physician backgrounds with the specific environments of LHCs is limited in this context.
First, the random assignment of physicians through a within-state lottery controls for selection biases, distributing more- and less-skilled physicians across LHCs without preference. Thus, differences in outcomes are less likely to be due to the systematic matching of physician and LHC characteristics; rather, they likely reflect the intrinsic skill differences among physicians. Second, our study focuses on rural municipalities, where health care facilities are scarcer (only 16 out of 598 municipalities have more than one LHC). Facilities in these areas are likely to have similar practice environments, processes, and systems. Third, our examination of educational backgrounds (table A.14) indicates that differences in training institutions among top-ranked universities do not significantly drive the observed outcomes in our study. This finding suggests that, even if educational institutions impart distinct practice styles to their graduates, such differences are not the primary determinants of the variations in patient outcomes in our setting. Finally, the results presented in section 4.5 suggest that there is no relationship between the physicians’ value-added and several observable characteristics.

7 Conclusions

Physicians are a key input in the production function of health at birth. Yet there is little evidence on the effect they can have on birth outcomes. The lack of causal evidence on this topic is related to the selection bias associated with the match between physicians and LHCs (Doyle Jr et al., 2010). In the present study, we provide experimental evidence to answer the difficult question of whether and how physicians’ skill level affects birth outcomes for the mothers and children they treat. In Colombia, medical school graduates must spend the first year of their careers working in the SSO. The SSO program randomly assigns physicians to their first jobs, providing a test for the effects of being treated at an LHC with more-skilled physicians.
In this paper, we combine administrative records to match physicians in the SSO program, LHCs, VSRs, physician characteristics, and scores from mandatory health-specific college graduation exams to measure the skills of the physicians assigned to each LHC and the main birth outcomes. Using these data sets, we provide evidence of the covariate balance between LHCs and the skill level of physicians. Finally, we provide evidence of the causal relationship between more-skilled physicians and health at birth.

We find that being treated at an LHC that is randomly assigned more-skilled SSO physicians has a negative and significant effect on the probability that a mother gives birth to an unhealthy child. We estimate that being assigned to a physician whose graduation exam score was one standard deviation higher reduces the probability that a mother gives birth to an unhealthy child by 9.14 percent. Although we use an aggregate measure of health at birth as our main measure, the results are robust to other measures, such as low birth weight, prematurity and low Apgar score.

Furthermore, we explore whether being assigned to more-skilled physicians increases the number of prenatal checkups a mother has, serving as a mechanism to improve the quality of health care and birth outcomes. According to WHO (2016) and the Colombian government, better and more frequent prenatal care improves a child’s health at birth. We find that more-skilled doctors do not schedule mothers for more prenatal checkups. Nonetheless, we provide evidence that these physicians target their prenatal checkups toward more-vulnerable mothers, measured as those with a higher predicted likelihood of giving birth to an unhealthy baby. Finally, we present several meaningful placebo tests. The results show the internal validity of our exercise.
We conclude that more-skilled physicians play a crucial role in overall health at birth and that governments should consider these findings in developing policies to assign physicians optimally.

References

Abaluck, J., L. Agha, C. Kabrhel, A. Raja, and A. Venkatesh (2016). The determinants of productivity in medical testing: Intensity and allocation of care. American Economic Review 106(12), 3730–64. Abowd, J. M., F. Kramarz, and D. N. Margolis (1999). High wage workers and high wage firms. Econometrica 67(2), 251–333. Administrative Department of Statistics (2005). National census 2005. www.dane.gov.co. Administrative Department of Statistics (2018a). National geostatistical framework 2018. www.dane.gov.co. Administrative Department of Statistics (2018b). Vital statistics records. www.dane.gov.co. Alexander, G. R. and C. C. Korenbrot (1995). The role of prenatal care in preventing low birth weight. The Future of Children, 103–120. Almond, D., K. Y. Chay, and D. S. Lee (2005). The costs of low birth weight. The Quarterly Journal of Economics 120(3), 1031–1083. Almond, D., J. Currie, and V. Duque (2018). Childhood circumstances and adult outcomes: Act ii. Journal of Economic Literature 56(4), 1360–1446. Almond, D., J. J. Doyle Jr, A. E. Kowalski, and H. Williams (2010). Estimating marginal returns to medical care: Evidence from at-risk newborns. The Quarterly Journal of Economics 125(2), 591–634. Almond, D. and B. Mazumder (2011). Health capital and the prenatal environment: the effect of ramadan observance during pregnancy. American Economic Journal: Applied Economics 3(4), 56–85. Alsan, M., O. Garrick, and G. Graziani (2019). Does diversity matter for health? experimental evidence from oakland. American Economic Review 109(12), 4071–4111. Amarante, V., M. Manacorda, E. Miguel, and A. Vigorito (2016). Do cash transfers improve birth outcomes? evidence from matched vital statistics, program, and social security data.
American Economic Journal: Economic Policy 8(2), 1–43. Anderson, M. L. (2008). Multiple inference and gender differences in the effects of early intervention: A reevaluation of the abecedarian, perry preschool, and early training projects. Journal of the American statistical Association 103(484), 1481–1495. Anderson, M. L., C. Dobkin, and T. Gross (2014). The effect of health insurance on emergency department visits: Evidence from an age-based eligibility threshold. Review of Economics and Statistics 96(1), 189–195. Angrist, J. D., P. D. Hull, P. A. Pathak, and C. R. Walters (2017). Leveraging lotteries for school value-added: Testing and estimation. The Quarterly Journal of Economics 132(2), 871–919. Araujo, M. C., P. Carneiro, Y. Cruz-Aguayo, and N. Schady (2016). Teacher quality and learning outcomes in kindergarten. The Quarterly Journal of Economics 131(3), 1415–1453. Aron-Dine, A., L. Einav, A. Finkelstein, and M. Cullen (2015). Moral hazard in health insurance: do dynamic incentives matter? Review of Economics and Statistics 97(4), 725–741. Athey, S. and G. W. Imbens (2017). The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives 31(2), 3–32. Baicker, K. and A. Chandra (2004). The productivity of physician specialization: evidence from the medicare program. American Economic Review 94(2), 357–361. Barber, S. L. and P. J. Gertler (2010). Empowering women: how mexico’s conditional cash transfer programme raised prenatal care quality and birth weight. Journal of Development Effectiveness 2(1), 51–73. Bardach, N. S., J. J. Wang, S. F. De Leon, S. C. Shih, W. J. Boscardin, L. E. Goldman, and R. A. Dudley (2013). Effect of pay-for-performance incentives on quality of care in small practices with electronic health records: a randomized trial. Jama 310(10), 1051–1059. Basinga, P., P. J. Gertler, A. Binagwaho, A. L. Soucat, J. Sturdy, and C. M. Vermeersch (2011). 
Effect on maternal and child health services in rwanda of payment to primary health-care providers for performance: an impact evaluation. The Lancet 377(9775), 1421–1428. Becker, G. S. (1973). A theory of marriage: Part i. Journal of Political Economy 81(4), 813–846. Black, S. E., P. J. Devereux, and K. G. Salvanes (2007). From the cradle to the labor market? the effect of birth weight on adult outcomes. The Quarterly Journal of Economics 122(1), 409–439. Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32. Butler, A. S., R. E. Behrman, et al. (2007). Preterm birth: causes, consequences, and prevention. National Academies Press. Cameron, E. Z. (2004). Facultative adjustment of mammalian sex ratios in support of the trivers–willard hypothesis: evidence for a mechanism. Proceedings of the Royal Society of London. Series B: Biological Sciences 271(1549), 1723–1728. Carrera, M., D. P. Goldman, G. Joyce, and N. Sood (2018). Do physicians respond to the costs and cost-sensitivity of their patients? American Economic Journal: Economic Policy 10(1), 113–52. Carrillo, B. and J. Feres (2019). Provider supply, utilization, and infant health: evidence from a physician distribution policy. American Economic Journal: Economic Policy 11(3), 156–96. Chan, D. C. and Y. Chen (2022). The productivity of professions: evidence from the emergency department. Technical report, National bureau of economic research. Chan, D. C., M. Gentzkow, and C. Yu (2022). Selection with variation in diagnostic skill: Evidence from radiologists. The Quarterly Journal of Economics 137(2), 729–783. Chandra, A. and D. Staiger (2020). Identifying sources of inefficiency in healthcare. The Quarterly Journal of Economics 135(2), 785–843. Chen, Y. (2021). Team-specific human capital and team performance: evidence from doctors. American economic review 111(12), 3923–3962. Chernozhukov, V., M. Demirer, E. Duflo, and I. Fernandez-Val (2018).
Generic machine learning inference on heterogenous treatment effects in randomized experiments. Technical report, National Bureau of Economic Research. Chetty, R., J. N. Friedman, N. Hilger, E. Saez, D. W. Schanzenbach, and D. Yagan (2011). How does your kindergarten classroom affect your earnings? evidence from project star. The Quarterly Journal of Economics 126(4), 1593–1660. Chetty, R., J. N. Friedman, and J. E. Rockoff (2014). Measuring the impacts of teachers ii: Teacher value-added and student outcomes in adulthood. American economic review 104(9), 2633–2679. Clemens, J. and J. D. Gottlieb (2014). Do physicians’ financial incentives affect medical treatment and patient health? American Economic Review 104(4), 1320–49. Colegio Médico Colombiano (2018). Historia del servicio social obligatorio. Retrieved from: https://www.colegiomedicocolombiano.org/web_cmc/upload/docs/ Epicrisis-7_web.pdf. Colombian Institute for Educational Evaluation (2014). Quality evaluation of higher education. Congress of Colombia (1993, December). Law 100 of 1993. por la cual se crea el sistema de seguridad social integral y se dictan otras disposiciones. Congress of Colombia (2007, October). Law 1164 of 2007. por la cual se dictan disposiciones en materia del talento humano en salud. Conway, K. S. and P. Deb (2005). Is prenatal care really ineffective? or, is the ‘devil’in the distribution? Journal of Health Economics 24(3), 489–513. Currie, J. (2011). Inequality at birth: Some causes and consequences. American Economic Review 101(3), 1–22. Currie, J. and D. Almond (2011). Human capital development before age five. In Handbook of Labor Economics, Volume 4, pp. 1315–1486. Elsevier. Currie, J. and J. Grogger (2002). Medicaid expansions and welfare contractions: offsetting effects on prenatal care and infant health? Journal of Health Economics 21(2), 313–335. Currie, J. and J. Gruber (1996). 
Saving babies: The efficacy and cost of recent changes in the medicaid eligibility of pregnant women. Journal of Political Economy 104(6), 1263–1296. Currie, J. and W. B. MacLeod (2017). Diagnosing expertise: Human capital, decision making, and performance among physicians. Journal of Labor Economics 35(1), 1–43. Currie, J. and W. B. MacLeod (2020). Understanding doctor decision making: The case of depression treatment. Econometrica 88(3), 847–878. Currie, J. and H. Schwandt (2016a). The 9/11 dust cloud and pregnancy outcomes: a reconsideration. Journal of Human Resources 51(4), 805–831. Currie, J. and H. Schwandt (2016b). Mortality inequality: The good news from a county-level approach. Journal of Economic Perspectives 30(2), 29–52. Currie, J. and R. Walker (2011). Traffic congestion and infant health: Evidence from e-zpass. American Economic Journal: Applied Economics 3(1), 65–90. Currie, J. and J. Zhang (2023). Doing more with less: Predicting primary care provider effectiveness. Review of Economics and Statistics, 1–45. Curtis, J. R., Q. Cai, S. W. Wade, B. S. Stolshek, J. L. Adams, A. Balasubramanian, H. N. Viswanathan, and J. D. Kallich (2013). Osteoporosis medication adherence: physician perceptions vs. patients’ utilization. Bone 55(1), 1–6. Dahlstrand, A. (2021). Defying distance? the provision of services in the digital age. Job Market Paper, London School of Economics and Political Science. Das, J. and J. Hammer (2005). Which doctor? combining vignettes and item response to measure clinical competence. Journal of Development Economics 78(2), 348–383. Das, J. and J. Hammer (2007). Money for nothing: the dire straits of medical practice in delhi, india. Journal of Development Economics 83(1), 1–36. Das, J., J. Hammer, and K. Leonard (2008). The quality of medical advice in low-income countries. Journal of Economic Perspectives 22(2), 93–114. Das, J., A. Holla, A. Mohpal, and K. Muralidharan (2016).
Quality and accountability in health care delivery: audit-study evidence from primary care in India. American Economic Review 106(12), 3765–99.
Das, J. and T. P. Sohnesen (2007). Variations in doctor effort: Evidence from Paraguay: Doctors in Paraguay who expended less effort appear to have been paid more than doctors who expended more. Health Affairs 26(Suppl2), w324–w337.
Davis, D. A., M. A. Thomson, A. D. Oxman, and R. B. Haynes (1995). Changing physician performance: a systematic review of the effect of continuing medical education strategies. JAMA 274(9), 700–705.
Dinkelman, T. (2017). Long-run health repercussions of drought shocks: Evidence from South African homelands. The Economic Journal 127(604), 1906–1939.
Doyle Jr, J. J., S. M. Ewer, and T. H. Wagner (2010). Returns to physician human capital: Evidence from patients randomized to physician teams. Journal of Health Economics 29(6), 866–882.
Ehrenstein, V. (2009). Association of Apgar scores with death and neurologic disability. Clinical Epidemiology 1, 45.
Epstein, A. J. and S. Nicholson (2009). The formation and evolution of physician treatment styles: an application to cesarean sections. Journal of Health Economics 28(6), 1126–1140.
Eriksson, J. G., E. Kajantie, C. Osmond, K. Thornburg, and D. J. Barker (2010). Boys live dangerously in the womb. American Journal of Human Biology 22(3), 330–335.
Fadlon, I. and J. Van Parys (2020). Primary care physician practice styles and patient care: Evidence from physician exits in Medicare. Journal of Health Economics 71, 102304.
Fernández Ávila, D. G., L. C. Mancipe García, D. C. Fernández Ávila, E. Reyes Sanmiguel, M. C. Díaz, and J. M. Gutiérrez (2011). Analysis of the supply of medicine undergraduate programs in Colombia, during the past 30 years. Revista Colombiana de Reumatología 18(2), 109–120.
Finkelstein, A., S. Taubman, B. Wright, M. Bernstein, J. Gruber, J. P. Neuse, H. Allen, K. Baicker, and O. H. S. Group (2012).
The Oregon health insurance experiment: evidence from the first year. The Quarterly Journal of Economics 127(3), 1057–1106.
Fletcher, J. M., L. I. Horwitz, and E. Bradley (2014). Estimating the value added of attending physicians on patient outcomes. Technical report, National Bureau of Economic Research.
Gagnon-Bartsch, J., Y. Shem-Tov, et al. (2019). The classification permutation test: A flexible approach to testing for covariate imbalance in observational studies. The Annals of Applied Statistics 13(3), 1464–1483.
Gomez, P., I. Arevalo, et al. (2013). Guías de práctica clínica para la prevención, detección temprana y tratamiento de las complicaciones del embarazo, parto y puerperio. Ministerio de Salud y protección social Colombia 84, 74–82.
Grossman, M. and T. J. Joyce (1990). Unobservables, pregnancy resolutions, and birth weight production functions in New York City. Journal of Political Economy 98(5, Part 1), 983–1007.
Grytten, J. and R. Sørensen (2003). Practice variation and physician-specific effects. Journal of Health Economics 22(3), 403–418.
Guarin, A., C. Posso, E. Saravia, and J. Tamayo (2023). Healing the gender gap: The impacts of randomized first-job on female physicians.
Harris, D. N. and T. R. Sass (2014). Skills, productivity and the evaluation of teacher performance. Economics of Education Review 40, 183–204.
Herrmann, M., E. Walsh, and E. Isenberg (2016). Shrinkage of value-added estimates and characteristics of students with hard-to-predict achievement levels. Statistics and Public Policy 3(1), 1–10.
Ho, K. and A. Pakes (2014a). Hospital choices, hospital prices, and financial incentives to physicians. American Economic Review 104(12), 3841–84.
Ho, K. and A. Pakes (2014b). Physician payment reform and hospital referrals. American Economic Review 104(5), 200–205.
Hoynes, H., M. Page, and A. H. Stevens (2011). Can targeted transfers improve birth outcomes?: Evidence from the introduction of the WIC program.
Journal of Public Economics 95(7-8), 813–827.
Iizuka, T. (2012). Physician agency and adoption of generic pharmaceuticals. American Economic Review 102(6), 2826–58.
Jackson, C. K. (2018). What do test scores miss? The importance of teacher effects on non–test score outcomes. Journal of Political Economy 126(5), 2072–2107.
Kane, T. J., J. E. Rockoff, and D. O. Staiger (2008). What does certification tell us about teacher effectiveness? Evidence from New York City. Economics of Education Review 27(6), 615–631.
Kane, T. J. and D. O. Staiger (2008). Estimating teacher impacts on student achievement: An experimental evaluation. Technical report, National Bureau of Economic Research.
Kleinberg, J., J. Ludwig, S. Mullainathan, and Z. Obermeyer (2015). Prediction policy problems. American Economic Review 105(5), 491–95.
Kraemer, S. (2000). The fragile male. BMJ 321(7276), 1609–1612.
Kramer, M. S. (1987). Determinants of low birth weight: methodological assessment and meta-analysis. Bulletin of the World Health Organization 65(5), 663.
Kremer, M. (1993). The O-ring theory of economic development. The Quarterly Journal of Economics 108(3), 551–575.
Leonard, K. L. and M. C. Masatu (2007). Variations in the quality of care accessible to rural communities in Tanzania: Some quality disparities might be amenable to policies that do not necessarily relate to funding levels. Health Affairs 26(Suppl2), w380–w392.
Leonard, K. L., M. C. Masatu, and A. Vialou (2007). Getting doctors to do their best: The roles of ability and motivation in health care quality. Journal of Human Resources 42(3), 682–700.
Liberman, A., C. Neilson, L. Opazo, and S. Zimmerman (2018). The equilibrium effects of information deletion: Evidence from consumer credit markets. Technical report, National Bureau of Economic Research.
Lin, W. (2009). Why has the health inequality among infants in the US declined? Accounting for the shrinking gap. Health Economics 18(7), 823–841.
Mathews, F., P. J. Johnson, and A.
Neil (2008). You are what your mother eats: evidence for maternal preconception diet influencing foetal sex in humans. Proceedings of the Royal Society B: Biological Sciences 275(1643), 1661–1668.
McCormick, M. C. and J. E. Siegel (2001). Recent evidence on the effectiveness of prenatal care. Ambulatory Pediatrics 1(6), 321–325.
Michalopoulos, C., D. Wittenburg, D. A. Israel, and A. Warren (2012). The effects of health care benefits on health care use and health: a randomized trial for disability insurance beneficiaries. Medical Care, 764–771.
Ministry of Education (2019). National higher education information system.
Ministry of Health (1990, June). Decree 1335 of 1990. Por el cual se expide parcialmente el manual general de funciones y requisitos del subsector oficial del sector salud.
Ministry of Health (2001). Reglamento del año de servicio de salud rural.
Ministry of Health (2010, March). Resolution 1058 of 2010. Por medio de la cual se reglamenta el servicio social obligatorio para los egresados de los programas de educación superior del área de la salud y se dictan otras disposiciones.
Ministry of Health (2012a, December). Resolution 4503 of 2012. Por la cual se modifica el artículo 6 de la resolución 274 de 2011 modificado por el artículo 2 de la resolución 566 de 2012.
Ministry of Health (2012b, March). Resolution 566 of 2012. Por la cual se modifica parcialmente la resolución 274 de 2011.
Ministry of Health (2013, May). Resolution 1441 of 2013. Por la cual se definen los procedimientos y condiciones que deben cumplir los prestadores de servicios de salud.
Ministry of Health (2014). Reports of professionals registered and assigned to the process of assigning places in the mandatory social service.
Mizuno, R. (2000). The male/female ratio of fetal deaths and births in Japan. The Lancet 356(9231), 738–739.
Molitor, D. (2018). The evolution of physician practice styles: evidence from cardiologist migration.
American Economic Journal: Economic Policy 10(1), 326–56.
Moore, E. A., F. Harris, K. R. Laurens, M. J. Green, S. Brinkman, R. K. Lenroot, and V. J. Carr (2014). Birth outcomes and academic achievement in childhood: A population record linkage study. Journal of Early Childhood Research 12(3), 234–250.
Mullainathan, S. and Z. Obermeyer (2022). Diagnosing physician error: A machine learning approach to low-value health care. The Quarterly Journal of Economics 137(2), 679–727.
Mullainathan, S. and J. Spiess (2017). Machine learning: an applied econometric approach. Journal of Economic Perspectives 31(2), 87–106.
Naeye, R. L., L. S. Burt, D. L. Wright, W. A. Blanc, and D. Tatter (1971). Neonatal mortality, the male disadvantage. Pediatrics 48(6), 902–906.
Norcini, J. J., J. R. Boulet, A. Opalek, and W. D. Dauphinee (2014). The relationship between licensing examination performance and the outcomes of care by international medical school graduates. Academic Medicine 89(8), 1157–1162.
Norcini, J. J., R. S. Lipner, and H. R. Kimball (2002). Certifying examination performance and patient outcomes following acute myocardial infarction. Medical Education 36(9), 853–859.
Okeke, E. N. (2023). When a doctor falls from the sky: The impact of easing doctor supply constraints on mortality. American Economic Review 113(3), 585–627.
Okeke, E. N. and I. S. Abubakar (2020). Healthcare at the beginning of life and child survival: Evidence from a cash transfer experiment in Nigeria. Journal of Development Economics 143, 102426.
Páez, G., L. Jaramillo, C. Franco, and L. Arregoces (2007). Estudio sobre el modo de gestionar la salud en Colombia.
Persson, P. and M. Rossin-Slater (2018). Family ruptures, stress, and the mental health of the next generation. American Economic Review 108(4-5), 1214–52.
Phelps, C. E. (1992). Diffusion of information in medical care. Journal of Economic Perspectives 6(3), 23–42.
Pongou, R., B. Kuate Defo, and Z. Tsala Dimbuene (2017).
Excess male infant mortality: The gene-institution interactions. American Economic Review 107(5), 541–45.
Rivkin, S. G., E. A. Hanushek, and J. F. Kain (2005). Teachers, schools, and academic achievement. Econometrica 73(2), 417–458.
Rockoff, J. E. (2004). The impact of individual teachers on student achievement: Evidence from panel data. American Economic Review 94(2), 247–252.
Rothstein, J. (2017). Measuring the impacts of teachers: Comment. American Economic Review 107(6), 1656–84.
Roy, A. D. (1951). Some thoughts on the distribution of earnings. Oxford Economic Papers 3(2), 135–146.
Schnell, M. and J. Currie (2018). Addressing the opioid epidemic: is there a role for physician education? American Journal of Health Economics 4(3), 383–410.
Shimer, R. and L. Smith (2000). Assortative matching and search. Econometrica 68(2), 343–369.
Simeonova, E., N. Skipper, and P. R. Thingholm (2020). Physician health management skills and patient outcomes. Technical report, National Bureau of Economic Research.
Stern, S. and M. Trajtenberg (1998). Empirical implications of physician authority in pharmaceutical decisionmaking.
Stoye, G. (2022). The distribution of doctor quality: Evidence from cardiologists in England. Technical report, IFS Working Papers.
Tamblyn, R., M. Abrahamowicz, D. Dauphinee, E. Wenghofer, A. Jacques, D. Klass, S. Smee, D. Blackmore, N. Winslade, N. Girard, et al. (2007). Physician scores on a national clinical skills examination as predictors of complaints to medical regulatory authorities. JAMA 298(9), 993–1001.
Tamblyn, R., M. Abrahamowicz, W. D. Dauphinee, J. A. Hanley, J. Norcini, N. Girard, P. Grand’Maison, and C. Brailovsky (2002). Association between licensure examination scores and practice in primary care. JAMA 288(23), 3019–3026.
Taylor, H. G., N. Klein, N. M. Minich, and M. Hack (2001). Long-term family outcomes for children with very low birth weights. Archives of Pediatrics & Adolescent Medicine 155(2), 155–161.
Tsugawa, Y., A. B.
Jena, J. F. Figueroa, E. J. Orav, D. M. Blumenthal, and A. K. Jha (2017). Comparison of hospital mortality and readmission rates for Medicare patients treated by male vs female physicians. JAMA Internal Medicine 177(2), 206–213.
Universidad del Rosario (2015). El año rural: Realidad agridulce para los médicos recién graduados. Un relato de quien lo vivió. Retrieved from: https://www.urosario.edu.co/Revista-Nova-Et-Vetera/Vol-1-Ed-2/Cultura/El-ano-rural-Realidad-agridulce-para-los-medicos-r.pdf.
Veddovi, M., D. T. Kenny, F. Gibson, J. Bowen, and D. Starte (2001). The relationship between depressive symptoms following premature birth, mothers’ coping style, and knowledge of infant development. Journal of Reproductive and Infant Psychology 19(4), 313–323.
Wenghofer, E., D. Klass, M. Abrahamowicz, D. Dauphinee, A. Jacques, S. Smee, D. Blackmore, N. Winslade, K. Reidel, I. Bartman, et al. (2009). Doctor scores on national qualifying examinations predict quality of care in future practice. Medical Education 43(12), 1166–1173.
WHO (2016). Pregnant women must be able to access the right care at the right time, says WHO. Retrieved from: https://www.who.int/news/item/07-11-2016-pregnant-women-must-be-able-to-access-the-right-care-at-the-right-time-says-who.
Woodcock, S. D. (2008). Wage differentials in the presence of unobserved worker, firm, and match heterogeneity. Labour Economics 15(4), 771–793.

Online Appendix Not for Publication

A Appendix

Figure A.1: Probability of low birth weight vs. gestational weeks, 2009-2012
Notes: This figure presents the local polynomial regression fit of the probability of having low birth weight over the number of gestational weeks using all birth records for Colombia from 2009 to 2012.
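To make the smoothing behind Figure A.1 concrete, the following is a minimal sketch of a local linear (degree-one local polynomial) regression. The Gaussian kernel, the bandwidth of 1.5 weeks, the function name `local_linear_fit`, and the simulated births are all assumptions made for illustration; the figure notes do not report the kernel, bandwidth, or polynomial degree that was actually used.

```python
import math
import random


def local_linear_fit(x, y, grid, bandwidth):
    """Local linear regression of y on x with a Gaussian kernel.

    Returns the fitted value at each evaluation point in `grid`.
    """
    fitted = []
    for g in grid:
        s0 = s1 = s2 = t0 = t1 = 0.0
        for xi, yi in zip(x, y):
            d = xi - g
            w = math.exp(-0.5 * (d / bandwidth) ** 2)  # kernel weight
            s0 += w
            s1 += w * d
            s2 += w * d * d
            t0 += w * yi
            t1 += w * d * yi
        # Intercept of the weighted least-squares line of y on (1, x - g),
        # which is the fitted value at the evaluation point g.
        fitted.append((s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1))
    return fitted


# Simulated births (hypothetical, NOT the vital statistics records): the
# probability of low birth weight is assumed to fall with gestational weeks.
random.seed(0)
weeks = [random.randint(28, 42) for _ in range(5000)]
lbw = [
    1 if random.random() < 1.0 / (1.0 + math.exp(0.8 * (wk - 35))) else 0
    for wk in weeks
]

grid = list(range(30, 42))
curve = local_linear_fit(weeks, lbw, grid, bandwidth=1.5)
for wk, p in zip(grid, curve):
    print(f"week {wk}: estimated P(low birth weight) = {p:.3f}")
```

Because the outcome is binary, each fitted value can be read as a smoothed share of low-birth-weight births near that gestational age. A local linear fit is not constrained to lie in [0, 1], so values slightly outside that range can appear where the data are sparse.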
Figure A.2: Population (per 100,000) for municipalities included in our main sample
Notes: This figure presents the map (Administrative Department of Statistics, 2018a) of the population per 100,000 people for the municipalities included in our main sample in 2005. The municipalities in orange are not included in our sample or do not have SSO.
Figure A.3: Distribution of physicians per municipality
Notes: This figure shows the distribution of physicians per municipality for the sample of 582 municipalities with only one LHC. The data span January 2012 to December 2012.
Figure A.4: Heterogeneity in quantitative and reading SABER PRO scores
Notes: This figure reports the quantitative and reading test scores for the universities (Ministry of Education, 2019) that the physicians in our sample attended. The data cover 44 different universities. The figure shows the mean score for each university/program and an interval of one standard deviation on each side of the mean. The dashed horizontal line represents the overall 50th percentile. The figure shows substantial heterogeneity both within and between programs. For all the fields reported, there is a difference of almost two standard deviations between the averages of the best and the worst programs.
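The per-program summary plotted in Figure A.4 (a mean with a band of one standard deviation on each side) can be sketched as follows. This is only an illustration: the function name `program_summary`, the program labels, and the scores are hypothetical, and the use of the sample standard deviation is an assumption, since the notes do not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def program_summary(records):
    """Map each program to (mean, mean - sd, mean + sd) of its scores.

    `records` is an iterable of (program, score) pairs.
    """
    by_program = defaultdict(list)
    for program, score in records:
        by_program[program].append(score)
    summary = {}
    for program, scores in by_program.items():
        m = mean(scores)
        sd = stdev(scores)  # sample standard deviation (assumed)
        summary[program] = (m, m - sd, m + sd)
    return summary


# Hypothetical scores for two made-up programs (not the SABER PRO data).
records = [
    ("Program A", 9.0), ("Program A", 10.0), ("Program A", 11.0),
    ("Program B", 11.0), ("Program B", 12.0), ("Program B", 13.0),
]
summary = program_summary(records)
for program, (m, low, high) in summary.items():
    print(f"{program}: mean {m:.2f}, band [{low:.2f}, {high:.2f}]")
```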
Table A.1: Summary statistics - physicians in the main sample

Covariate                                           Mean     Standard error
Sex (female)                                        0.558    0.497
The household has a private car                     0.483    0.500
Number of people in the household                   4.025    1.659
Father with tertiary education                      0.644    0.479
Mother with tertiary education                      0.634    0.482
Socioeconomic strata: 1 or 2 or rural areas         0.292    0.455
Socioeconomic strata: 4, 5 or 6                     0.349    0.477
The household has internet                          0.831    0.375
Monthly household income: Less than 2 MW            0.229    0.420
Monthly household income: between 2 and 3 MW        0.220    0.414
The father or the mother has a job                  0.872    0.335
The household has a washing machine                 0.854    0.353
The household has a television                      0.859    0.348
The household has a cellphone                       0.963    0.188
The house has proper flooring                       0.908    0.289
The household has an oven                           0.671    0.470
Physician’s score on the Health care test           10.426   1.059
Physician’s score on the Disease prevention test    10.431   1.010
Physician’s score on the Reading test               10.624   1.007
Physician’s score on the Math test                  10.572   1.123
Physician’s average score on SABER PRO              10.513   0.854
Observations                                        2,126

Notes: This table reports the summary statistics for the physicians included in our main sample. These characteristics are obtained at the time physicians took their SABER PRO exam (before the SSO).
Sex is a binary variable that takes the value of 1 if the physician is female and zero otherwise; The household has a private car takes the value of 1 if the physician’s household owned a private car at the time the physician took the SABER PRO test and zero otherwise; Number of people in the household counts the number of individuals living in the same house as the physician; Father with tertiary education is a binary variable that takes the value of 1 if the physician’s father has at least tertiary education and zero otherwise; Mother with tertiary education is a binary variable that takes the value of 1 if the physician’s mother has at least tertiary education and zero otherwise; Socioeconomic strata: 1 or 2 or rural areas takes the value of 1 if the physician’s household’s socioeconomic strata at the time of the SABER PRO test was 1, 2 or rural and zero otherwise; Socioeconomic strata: 4, 5 or 6 takes the value of 1 if the physician’s household’s socioeconomic strata at the time of the SABER PRO test was 4, 5 or 6 and zero otherwise; The household has internet takes the value of 1 if the physician had internet service at home at the time of the test and zero otherwise; Monthly household income: Less than 2 MW takes the value of 1 if the physician’s household had an income lower than 2 times the national monthly minimum wage and zero otherwise; Monthly household income: between 2 and 3 MW takes the value of 1 if the physician’s household had an income between 2 and 3 times the national monthly minimum wage and zero otherwise; The father or the mother has a job takes the value of 1 if either of the physician’s parents has a job and zero otherwise; The household has a washing machine, television, cellphone, proper flooring or oven each take the value of 1 if the household has the characteristic described and zero otherwise; physician’s scores are continuous variables of the score obtained on each SABER PRO test subgroup; physician’s average score on SABER PRO is the average of the four main
components of the test: health care, disease prevention, reading, and math.

Table A.2: Covariate balance at LHC level using all the areas tested in the SABER PRO

Covariate                             Coefficient    Standard Error
a. Pretreatment variables (2010-2012)
Unhealthy                             0.0060         0.0094
Low birth weight                      −0.0027        0.0026
Prematurity                           0.0044         0.0043
Low Apgar score                       −0.0020        0.0048
Insufficient prenatal care (Prop.)    −0.0054        0.0078
Female infants                        −0.0011        0.0046
Mothers with basic education          −0.0035        0.0082
Married mothers                       −0.0040        0.0070
Teenage mothers                       0.0055         0.0053
Number of LHCs per municipality       −0.0247        0.0203
Municipality population               1,196.04       2,767.35
b. Concurrent variables
Female newborn                        −0.0008        0.0033
Mother with basic education           0.0053         0.0077
Married mother                        −0.0034        0.0059
Teenage mother                        0.0006         0.0039
Number of LHCs by municipalities      −0.0043        0.0209
Municipality population               1,260.14       2,807.26

Notes: This table presents the results of different LHC-by-cohort level regressions (equation 5) of the LHC-level variables, listed in the first column, on the measure of physicians’ skill level and the draw-by-state fixed effects. The coefficient and the standard error of the physicians’ skill variable are reported in the second and third columns, respectively. Standard errors are clustered at the LHC level. LHCs’ characteristics in panel a come from the 2010–2012 DANE VSRs, using a total of 1,837 LHC-by-cohort observations. LHCs’ characteristics in panel b come from the 2013–2015 DANE VSRs, using a total of 1,714 LHC-by-cohort observations. Unhealthy, our main measure of health at birth, is the proportion of newborn infants with at least one of the three following conditions: low birth weight, prematurity, or low Apgar score. Low birth weight is the proportion of newborn infants whose birth weight was less than 2,500 grams. Prematurity is the proportion of newborn infants who were born after fewer than 37 weeks of gestation.
Low Apgar score is the proportion of newborn infants whose Apgar score was lower than 7. Insufficient prenatal care is the proportion of mothers who had fewer than four prenatal checkups. Female infants is the proportion of female infants. Mothers with basic education is the proportion of mothers with at least secondary education at the time they gave birth. Married mothers is the proportion of mothers that were married at the time they gave birth. Teenage mothers is the proportion of mothers who were 19 years old or younger at the time they gave birth. Number of LHCs per municipality is the count of LHCs in the birthplace municipality. We interpret the non-significance of these estimates as evidence in favor of the randomness of the assignment of physicians.

Table A.3: Placebo other years

                  Unhealthy           LBW                 Prematurity         Apgar < 7
                  Average   PCA       Average   PCA       Average   PCA       Average   PCA
                  Health    Health    Health    Health    Health    Health    Health    Health
                  Scores    Scores    Scores    Scores    Scores    Scores    Scores    Scores
                  (1)       (2)       (1)       (2)       (1)       (2)       (1)       (2)
a. 2 years
Coefficient       −0.0029   −0.0027   −0.003*   −0.0029*  −0.0005   −0.0004   −0.0022   −0.0021
SE                (0.0026)  (0.0026)  (0.0016)  (0.0016)  (0.0016)  (0.0016)  (0.0019)  (0.0019)
Relative effect   −2.90%    −2.72%    −6.57%    −6.39%    −1.09%    −0.98%    −5.44%    −5.23%
b. 2.5 years
Coefficient       −0.0022   −0.0021   −0.0009   −0.0008   −0.0006   −0.0006   −0.0022   −0.0022
SE                (0.0023)  (0.0023)  (0.0013)  (0.0013)  (0.0015)  (0.0015)  (0.0019)  (0.0019)
Relative effect   −2.18%    −2.07%    −1.95%    −1.82%    −1.41%    −1.35%    −5.58%    −5.42%
c. 3 years
Coefficient       −0.0035   −0.0034   −0.0007   −0.0007   −0.0016   −0.0016   −0.0026   −0.0026
SE                (0.0022)  (0.0022)  (0.0012)  (0.0013)  (0.0014)  (0.0014)  (0.0018)  (0.0018)
Relative effect   −3.33%    −3.30%    −1.40%    −1.45%    −3.85%    −3.92%    −6.27%    −6.12%
d. 3.5 years
Coefficient       −0.0014   −0.0014   −0.0007   −0.0007   −0.0012   −0.0013   −0.0006   −0.0006
SE                (0.0023)  (0.0023)  (0.0011)  (0.0012)  (0.0016)  (0.0016)  (0.0018)  (0.0018)
Relative effect   −1.26%    −1.29%    −1.41%    −1.58%    −2.80%    −2.87%    −1.47%    −1.37%

Notes: This table presents placebo tests in which we estimate equation (4) after moving the arrival date to 3.5, 3, 2.5, and 2 years before the start of the SSO program. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. The first-stage coefficient and standard error are shown in Figure 2. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7 and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects. The numbers in parentheses are LHC-level-clustered standard errors. The results show that, regardless of the time window that we use for the calculation of the placebo test, the estimated coefficients are always precisely estimated zeros, which we interpret as evidence of the randomness of the assignment of the physicians to the LHCs.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.4: Placebo robustness checks

                            Unhealthy           LBW                 Prematurity         Apgar < 7
                            Average   PCA       Average   PCA       Average   PCA       Average   PCA
                            Health    Health    Health    Health    Health    Health    Health    Health
                            Scores    Scores    Scores    Scores    Scores    Scores    Scores    Scores
                            (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)
a. Without controls
Coefficient                 −0.0019   −0.0019   −0.0014   −0.0014   −0.0024   −0.0024   <0.0001   <0.0001
SE                          (0.0024)  (0.0025)  (0.0013)  (0.0013)  (0.0016)  (0.0016)  (0.0017)  (0.0017)
Relative effect             −1.58%    −1.59%    −2.93%    −3.05%    −4.59%    −4.68%    0.05%     0.13%
b. With controls
Coefficient                 0.0013    0.0013    <0.0001   < −0.0001 −0.0017   −0.0017   0.002     0.002
SE                          (0.0018)  (0.0018)  (0.0008)  (0.0008)  (0.0011)  (0.0011)  (0.0016)  (0.0016)
Relative effect             1.09%     1.06%     0.09%     −0.04%    −3.23%    −3.28%    4.42%     4.42%
Average dependent variable  0.118               0.046               0.052               0.046
Number of observations      261,616

Notes: This table presents our placebo estimates from equation (4) with and without controls. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. The first-stage coefficient and standard error are shown in Figure 2. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7 and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise.
All regressions control for draw-by-state fixed effects. Regressions for the coefficients labeled as With controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise. Note that the results are robust to the inclusion/exclusion of controls and how we measure skills. Numbers in parentheses are LHC-level clustered standard errors.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.5: Main estimates using all the areas tested in the SABER PRO

Dependent variable: Unhealthy
                            Average       Average       Health Care   Prevention    Average       Reading       Quantitative
                            All           Health        Score         Disease       Academic      Score         Score
                            Scores        Scores                      Score         Scores
                            (1)           (2)           (3)           (4)           (5)           (6)           (7)
a. Without controls
Coefficient                 -0.0109***    -0.0087***    -0.0089***    -0.0054*      -0.0105***    -0.0065**     -0.0106***
Stand. Err.                 (0.0026)      (0.0026)      (0.0026)      (0.0027)      (0.0027)      (0.0027)      (0.0024)
Relative effect             -11.44%       -9.14%        -9.33%        -5.64%        -11.01%       -6.78%        -11.12%
b. With controls
Coefficient                 -0.0096***    -0.0076***    -0.0075***    -0.0051**     -0.0094***    -0.0066**     -0.0091***
Stand. Err.                 (0.0022)      (0.0023)      (0.0023)      (0.0025)      (0.0023)      (0.0026)      (0.0022)
Relative effect             -10.11%       -7.94%        -7.90%        -5.33%        -9.86%        -6.89%        -9.56%
Average Dependent Variable  0.095
Number of Observations      255,089

Notes: This table presents our main estimates from equation (4) using all areas tested in the SABER PRO. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7 and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise.
Regressions for the coefficients labeled as With controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; area; an indicator variable that takes the value of 1 if the LHC is above the 75th percentile of the distribution of low birth weight measured in 2010-2012 and zero otherwise; an indicator variable that takes the value of 1 if the LHC is above the 75th percentile of the distribution of prematurity measured in 2010-2012 and zero otherwise; and an indicator variable that takes the value of 1 if the LHC is above the 75th percentile of the distribution of the Apgar score measured in 2010-2012 and zero otherwise. These results show that the estimated effects are robust to using the average of the four areas tested in the SABER PRO (health management, public health, reading, quantitative) as well as each individual score (except for reading) as proxies for the physician’s skills before the SSO program. The results are also robust to the inclusion/exclusion of controls and how we measure skills. Numbers in parentheses are LHC-level clustered standard errors.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.6: Cohort-level mortality estimates

                            Fetal deaths            Fetal and neonatal      Infant Mortality
                                                    deaths                  Ratio
                            Average     PCA         Average     PCA         Average     PCA
                            Health      Health      Health      Health      Health      Health
                            Scores      Scores      Scores      Scores      Scores      Scores
                            (1)         (2)         (3)         (4)         (5)         (6)
Coefficient                 -0.9408     -0.6315     -0.9582     -0.6421     -0.0005     -0.0003
Stand. Err.                 (2.4024)    (1.5996)    (2.4228)    (1.613)     (0.0015)    (0.0010)
Relative effect             -6.39%      -4.29%      -6.10%      -4.09%      -2.18%      -1.43%
Average Dependent Variable  14.728                  15.705                  0.024
Number of Observations      1,073                   1,073                   1,073

Notes: This table presents our cohort-level estimates on mortality following equation (4). Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Fetal deaths is the total number of fetal deaths registered at the LHC during the timeframe when the cohort was assigned. Fetal and neonatal deaths is the total number of fetal deaths and deaths of children under one year old registered at an LHC during the cohort’s assignment period (ideally, we would have preferred to focus on shorter-term mortality, but under one year was the most granular definition of infant mortality available in our data). Infant Mortality Ratio is the number of fetal and neonatal deaths divided by the total number of births during the cohort’s assignment period. These variables are regressed on either the cohort’s average health score (columns 1, 3, 5) or the cohort’s PCA for the health scores (columns 2, 4, 6). We restrict the sample to cohorts assigned to LHCs with at least 5 births during their assignment period, but the results are similar when this threshold is increased, decreased, or ignored. While these results are expected to be subject to attenuation bias from high measurement error, we still observe negative, albeit not statistically significant, point estimates, which aligns with our main results.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.7: Controlling for the Share of SSOs at the LHC

                            Average Health Scores
                            Unhealthy     LBW           Prematurity   Apgar < 7
                            (1)           (2)           (3)           (4)
Coefficient                 -0.0087***    -0.004**      -0.0044***    -0.0044**
Stand. Err.                 (0.0026)      (0.002)       (0.0016)      (0.0019)
Relative effect             -9.11%        -9.45%        -10.82%       -11.68%
Average Dependent Variable  0.095         0.043         0.041         0.038
Number of Observations      255,089

Note: This table presents our main estimates from equation (4) controlling for the share of SSO physicians at the LHC. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. The first-stage coefficient and standard error are shown in Figure 2. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects. Numbers in parentheses are LHC-level clustered standard errors. We interpret the high significance and consistency of these results across the different measures of health at birth as evidence of the important role that skilled physicians play in determining an infant’s health at birth.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.8: Antenatal consultations < 4

Dependent variable: Antenatal consultations < 4

                            Average     PCA
                            Health      Health
                            Scores      Scores
                            (1)         (2)
a. Without controls
Coefficient                 -0.0029     -0.0031
Stand. Err.                 (0.0093)    (0.0094)
Relative effect             -1.77%      -1.93%
b. With controls
Coefficient                 -0.0055     -0.0058
Stand. Err.                 (0.0092)    (0.0093)
Relative effect             -3.39%      -3.53%
Average Dependent Variable  0.163
Number of Observations      255,089

Notes: This table presents our main estimates from equation (4). The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation, on the probability that mothers are scheduled for fewer than four prenatal checkups (Insufficient antenatal consultations). Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. The first-stage coefficient and standard error are shown in Figure 2. Antenatal consultations < 4 takes the value of 1 if the mother attended fewer than 4 consultations while pregnant, and zero otherwise. All regressions control for draw-by-state fixed effects.
Regressions for the coefficients labeled as With controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise. Note that the results are robust to the inclusion/exclusion of controls and how we measure skills. Numbers in parentheses are LHC-level clustered standard errors. The results show no significant average effect of more-skilled doctors on the probability that mothers are scheduled for fewer than four prenatal checkups.
* p < 0.1, ** p < 0.05, *** p < 0.01

Figure A.5: Placebo using all samples and average scores

Notes: This figure presents our placebo estimates from equation (4). The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. The first-stage coefficient and standard error are shown in Figure 2.
Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects. Regressions for the coefficients labeled as With controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise. These results show that the estimated effects are robust to the inclusion/exclusion of controls and the way we measure skills. These results support the ones presented in Table 3 on the robustness of the estimated zero effect for the placebo tests.
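The outcome definitions repeated in these notes can be written out explicitly. The following is a minimal Python sketch, not the authors' code: the function name `birth_outcomes` and its argument names are hypothetical, and only the thresholds (2,500 grams, 37 weeks, Apgar score of 7) come from the notes.

```python
def birth_outcomes(birth_weight_g, gestation_weeks, apgar):
    """Return the four binary outcomes described in the table notes.

    All names here are illustrative; the thresholds follow the notes:
    LBW below 2,500 grams, prematurity below 37 weeks, low Apgar below 7.
    """
    lbw = int(birth_weight_g < 2500)
    prematurity = int(gestation_weeks < 37)
    low_apgar = int(apgar < 7)
    # Unhealthy equals 1 if at least one of the three conditions holds.
    unhealthy = int(bool(lbw or prematurity or low_apgar))
    return {
        "unhealthy": unhealthy,
        "lbw": lbw,
        "prematurity": prematurity,
        "low_apgar": low_apgar,
    }
```

Because Unhealthy is the union of the three conditions, its mean can be no smaller than the largest of the three component means and no larger than their sum.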
Figure A.6: Distribution of Logit simulations on antenatal consultations by predicted probability of unhealthy newborn

Notes: This figure plots the distribution of the estimated effects of physicians on antenatal consultations by mother’s predicted probability of giving birth to an unhealthy child from 1,000 different random repetitions. In each of the 1,000 repetitions, to predict the probability of an unhealthy child, we divided our data into training and testing subsets of randomly selected LHCs using a K-mean algorithm. On the training sets, we run a Logit model of the probability of being born unhealthy on our usual set of mother and LHC ex-ante covariates, and use the estimations to predict the probability of giving birth to an unhealthy child on each testing subset. Using the prediction on the testing sample, we divide each subset into high and low predicted probability of giving birth to an unhealthy child, defined as mothers with a probability of an unhealthy child above and below the 75th percentile, respectively. Unhealthy is a binary variable that takes the value of 1 if the newborn has low birth weight or if the newborn is premature (fewer than 37 weeks of gestation) or if the Apgar score of the newborn is lower than 7, and zero otherwise. The plotted coefficients represent the effect of being assigned a physician with one standard deviation higher quality (proxied by the average score) on the probability of having insufficient (fewer than four) antenatal consultations. All regressions control for draw-by-state fixed effects. The figure shows that there is almost no overlap between the distributions and that most of the mass of the distribution for the coefficient associated with the low predicted Unhealthy is around zero. This is consistent with the idea that more skilled physicians are better at targeting the care towards the more vulnerable mothers.
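Two steps of the simulation described in the notes to Figure A.6 can be illustrated in code: assigning whole LHCs (rather than individual births) to training and testing subsets, and splitting the testing sample at the 75th percentile of predicted probability. The Python sketch below is only an illustration under stated assumptions. It uses a plain random assignment of LHCs to folds, which is an assumption and not the "K-mean algorithm" named in the notes; it does not fit the Logit model and instead takes predicted probabilities as given; and the names `assign_lhc_folds` and `flag_high_risk` are hypothetical.

```python
import random
from statistics import quantiles


def assign_lhc_folds(lhc_ids, k=2, seed=0):
    """Randomly assign each distinct LHC to one of k folds.

    Assumption: a simple shuffled round-robin assignment. All births
    from the same LHC land in the same fold by construction.
    """
    rng = random.Random(seed)
    unique_ids = sorted(set(lhc_ids))
    rng.shuffle(unique_ids)
    return {lhc: position % k for position, lhc in enumerate(unique_ids)}


def flag_high_risk(predicted_probs, percentile=75):
    """Flag mothers whose predicted probability exceeds the given percentile.

    Returns the list of flags and the cutoff. The predicted probabilities
    are assumed to come from a model fitted on the training LHCs only.
    """
    cutoffs = quantiles(predicted_probs, n=100, method="inclusive")
    cutoff = cutoffs[percentile - 1]
    return [p > cutoff for p in predicted_probs], cutoff


# Hypothetical usage: five births observed in four LHCs.
folds = assign_lhc_folds(["a", "b", "a", "c", "d"], k=2, seed=1)
flags, cutoff = flag_high_risk(
    [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
)
```

Splitting at the LHC level keeps every birth from a given center on one side of the split, so the predicted probabilities in the testing subset do not use information from that center's own outcomes.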
Table A.9: Main estimates without and with controls

                            Unhealthy               LBW                     Prematurity             Apgar < 7
                            Average     PCA         Average     PCA         Average     PCA         Average     PCA
                            Health      Health      Health      Health      Health      Health      Health      Health
                            Scores      Scores      Scores      Scores      Scores      Scores      Scores      Scores
                            (1)         (2)         (3)         (4)         (5)         (6)         (7)         (8)
a. Without controls
Coefficient                 -0.0087***  -0.0086***  -0.0041**   -0.0040*    -0.0045***  -0.0045***  -0.0043**   -0.0043**
Stand. Err.                 (0.0026)    (0.0026)    (0.0021)    (0.0021)    (0.0017)    (0.0016)    (0.0019)    (0.0019)
Relative effect             -9.14%      -9.02%      -9.57%      -9.38%      -10.99%     -10.89%     -11.56%     -11.46%
b. With controls
Coefficient                 -0.0076***  -0.0075***  -0.0045**   -0.0045**   -0.0050***  -0.0050***  -0.0024     -0.0023
Stand. Err.                 (0.0023)    (0.0023)    (0.0018)    (0.0018)    (0.0015)    (0.0015)    (0.0018)    (0.0018)
Relative effect             -7.94%      -7.85%      -10.60%     -10.53%     -12.17%     -12.22%     -6.39%      -6.20%
Average Dependent Variable  0.095                   0.043                   0.041                   0.038
Number of Observations      255,089

Notes: This table presents our main estimates from equation (4) with and without controls. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. The first-stage coefficient and standard error are shown in Figure 2. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise.
Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects. Regressions for the coefficients labeled as With controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise. These results show that the estimated effects are robust to the inclusion/exclusion of controls and the way we measure quality. Numbers in parentheses are LHC-level clustered standard errors.
* p < 0.1, ** p < 0.05, *** p < 0.01

Figure A.7: Distribution of the coefficient logit simulations on the probability of being born unhealthy by the (ex-ante) predicted probability of an unhealthy newborn

Notes: This figure plots the distribution of the estimated effects of physicians on the probability of being born unhealthy by mother’s predicted probability of giving birth to an unhealthy child from 1,000 different random repetitions. In each of the 1,000 repetitions, to predict the probability of an unhealthy child, we divided our data into training and testing subsets of randomly selected LHCs using a K-mean algorithm. On the training sets, we run a Logit model of the probability of being born unhealthy on our usual set of mother and LHC ex-ante covariates, and use the estimations to predict the probability of giving birth to an unhealthy child on each testing subset. Using the prediction on the testing sample, we divide each subset into high and low predicted probability of giving birth to an unhealthy child, defined as mothers with a probability of an unhealthy child above and below the 75th percentile, respectively. Unhealthy is a binary variable that takes the value of 1 if the newborn has low birth weight or if the newborn is premature (fewer than 37 weeks of gestation) or if the Apgar score of the newborn is lower than 7, and zero otherwise. The plotted coefficients represent the effect of being assigned a physician with one standard deviation higher quality (proxied by the average score) on the probability of being born unhealthy. All regressions control for draw-by-state fixed effects. The figure shows that there is almost no overlap between the distributions and that the estimated effects of the more skilled physicians are consistently stronger for the population with higher predicted probability of being born unhealthy.

Figure A.8: Main estimates using all sample

Notes: This figure presents our main estimates from equation (4). The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. The first-stage coefficient and standard error are shown in Figure 2. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise.
LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects. Regressions for the coefficients labeled as With controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise. These results show that the estimated effects are robust to the inclusion/exclusion of controls and the way we measure physicians’ skills (averages vs. principal components). Standard errors are clustered at the LHC level. Confidence intervals are at the 95% level.

Table A.10: Main results using covariance index (Anderson, 2008)

                            Unhealthy Cov index             Unhealthy standardized
                            Average Scores  PCA Scores      Average Scores  PCA Scores
                            (1)             (2)             (3)             (4)
a. Without controls
Coefficient                 -0.0220***      -0.0218***      -0.0297***      -0.0292***
Stand. Err.                 (0.0065)        (0.0065)        (0.0088)        (0.0088)
Relative effect             -3.23%          -3.19%          -2.97%          -2.92%
b. With controls
Coefficient                 -0.0192***      -0.0190***      -0.0257***      -0.0255***
Stand. Err.                 (0.0060)        (0.0060)        (0.0078)        (0.0079)
Relative effect             -2.82%          -2.79%          -2.57%          -2.55%
Number of Observations      255,089

Notes: This table presents our main estimates from equation (4) using the Anderson (2008) covariance index. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. All regressions control for draw-by-state fixed effects. Regressions for the coefficients labeled as With controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise.
These results show that the estimated effects are robust to using the covariance index as an outcome instead of unhealthy. The results are also robust to the inclusion/exclusion of controls and how we measure quality. Numbers in parentheses are LHC-level clustered standard errors.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.11: Main estimates using a Logit model

                            Unhealthy               LBW                     Prematurity             Apgar < 7
                            Average     PCA         Average     PCA         Average     PCA         Average     PCA
                            Health      Health      Health      Health      Health      Health      Health      Health
                            Scores      Scores      Scores      Scores      Scores      Scores      Scores      Scores
                            (1)         (2)         (3)         (4)         (5)         (6)         (7)         (8)
a. Without controls
Coefficient                 -0.0070***  -0.0069***  -0.0032**   -0.0031**   -0.0039***  -0.0039***  -0.0034**   -0.0033**
Stand. Err.                 (0.0021)    (0.0021)    (0.0015)    (0.0015)    (0.0014)    (0.0014)    (0.0015)    (0.0015)
Relative effect             -7.34%      -7.22%      -7.45%      -7.27%      -9.54%      -9.41%      -9.03%      -8.92%
b. With controls
Coefficient                 -0.0059***  -0.0058***  -0.0036***  -0.0035***  -0.0046***  -0.0046***  -0.0020     -0.0019
Stand. Err.                 (0.0018)    (0.0018)    (0.0013)    (0.0013)    (0.0012)    (0.0012)    (0.0012)    (0.0012)
Relative effect             -6.20%      -6.13%      -8.37%      -8.29%      -11.17%     -11.18%     -5.22%      -5.08%
Average Dependent Variable  0.095                   0.043                   0.041                   0.038
Number of Observations      255,079

Notes: This table presents our main estimates from equation (3) using a logit model. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise.
Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects. Regressions for the coefficients labeled as With controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise. These results show that the estimated effects are robust to using an analogous Logit model, for which we compute the average marginal effect associated with a one-standard-deviation increase in the skill measure. The results are also robust to the inclusion/exclusion of controls and how we measure quality. Numbers in parentheses are LHC-level clustered standard errors.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.12: Main estimates linearity

Dependent variable: Unhealthy

                            Average     PCA
                            Health      Health
                            Scores      Scores
                            (1)         (2)
Quartile 2
Coefficient                 -0.0150     -0.0147
Stand. Err.                 (0.010)     (0.0105)
Relative effect             -15.79%     -15.45%
Quartile 3
Coefficient                 -0.0217**   -0.0227
Stand. Err.                 (0.0104)    (0.0145)
Relative effect             -22.74%     -23.82%
Quartile 4
Coefficient                 -0.0196**   -0.0199**
Stand. Err.                 (0.0089)    (0.0079)
Relative effect             -20.62%     -20.94%

Notes: This table presents our main estimates from equation (4) using the quartiles of the quality distribution. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level was at the 2nd, 3rd, or 4th quartile of the physicians’ quality distribution compared to being treated at an LHC that was randomly assigned SSO physicians whose skill level was at the 1st quartile. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. All regressions control for draw-by-state fixed effects. Numbers in parentheses are LHC-level clustered standard errors. While not all the coefficients are statistically different from each other, we do observe increases in the point estimates associated with higher quartiles and cannot rule out linearity of the effects.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.13: Main estimates without controls, with dummy controls, and with continuous controls

                            Unhealthy               LBW                     Prematurity             Apgar < 7
                            Average     PCA         Average     PCA         Average     PCA         Average     PCA
                            Health      Health      Health      Health      Health      Health      Health      Health
                            Scores      Scores      Scores      Scores      Scores      Scores      Scores      Scores
                            (1)         (2)         (3)         (4)         (5)         (6)         (7)         (8)
a. Without controls
Coefficient                 -0.0087***  -0.0086***  -0.0041**   -0.0040*    -0.0045***  -0.0045***  -0.0043**   -0.0043**
Stand. Err.                 (0.0026)    (0.0026)    (0.0021)    (0.0021)    (0.0017)    (0.0016)    (0.0019)    (0.0019)
Relative effect             -9.14%      -9.02%      -9.57%      -9.38%      -10.99%     -10.89%     -11.56%     -11.46%
b. With dummy controls
Coefficient                 -0.0076***  -0.0075***  -0.0045**   -0.0045**   -0.0050***  -0.0050***  -0.0024     -0.0023
Stand. Err.                 (0.0023)    (0.0023)    (0.0018)    (0.0018)    (0.0015)    (0.0015)    (0.0018)    (0.0018)
Relative effect             -7.94%      -7.85%      -10.60%     -10.53%     -12.17%     -12.22%     -6.39%      -6.20%
c. With continuous controls
Coefficient                 -0.0077***  -0.0077***  -0.0036**   -0.0036**   -0.0041***  -0.0042***  -0.0037**   -0.0037**
Stand. Err.                 (0.0021)    (0.0021)    (0.0017)    (0.0017)    (0.0012)    (0.0012)    (0.0017)    (0.0017)
Relative effect             -8.11%      -8.06%      -8.53%      -8.52%      -9.95%      -10.12%     -9.95%      -9.73%
Average Dependent Variable  0.095                   0.043                   0.041                   0.038
Number of Observations      255,089

Notes: This table presents our main estimates from equation (4) without controls, and with dummy and continuous controls. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. The first-stage coefficient and standard error are shown in Figure 2. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects.
Regressions for the coefficients labeled as With dummy controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise. Regressions for the coefficients labeled as With continuous controls include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is adolescent and zero otherwise; marital status; the LHC’s low birth weight average measured in 2010-2012; the LHC’s prematurity percentage measured in 2010-2012; and the LHC’s Apgar average measured in 2010-2012. These results show that the estimated effects are robust to the inclusion/exclusion of controls and the way we measure skills. Numbers in parentheses are LHC-level clustered standard errors.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.14: Interaction between cohort scores and program scores

                            Unhealthy     LBW           Prematurity   Apgar < 7
                            (1)           (2)           (3)           (4)
Average Health Score
Coefficient                 -0.0090**     -0.0046**     -0.0047**     -0.0035
Stand. Err.                 (0.0035)      (0.0019)      (0.0021)      (0.0029)
Relative effect             -9.40%        -10.83%       -11.47%       -9.36%
Program Average
Coefficient                 0.0023        0.0028**      0.0023        -0.0011
Stand. Err.                 (0.0028)      (0.0014)      (0.0016)      (0.0026)
Relative effect             2.39%         6.45%         5.53%         -2.83%
Av. Health Sc. x Prog. Av.
Coefficient                 0.0013        0.0012        0.0014        0.0005
Stand. Err.                 (0.0017)      (0.0015)      (0.0011)      (0.0013)
Relative effect             1.35%         2.75%         3.32%         1.42%
Average Dependent Variable  0.095         0.043         0.041         0.038
Number of Observations      255,089

Notes: This table presents our main estimates from equation (4) using the interaction between cohort and program scores. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects. These results show that the effects presented in Table 4 are not driven by top-ranked universities. Numbers in parentheses are LHC-level clustered standard errors.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.15: Main results using municipalities with one LHC

                            Unhealthy               LBW                     Prematurity             Apgar < 7
                            Average     PCA         Average     PCA         Average     PCA         Average     PCA
                            Health      Health      Health      Health      Health      Health      Health      Health
                            Scores      Scores      Scores      Scores      Scores      Scores      Scores      Scores
                            (1)         (2)         (3)         (4)         (5)         (6)         (7)         (8)
a. Without controls
Coefficient                 -0.0077***  -0.0075***  -0.0040*    -0.0039*    -0.0045***  -0.0044***  -0.0033*    -0.0033*
Stand. Err.                 (0.0025)    (0.0026)    (0.0021)    (0.0021)    (0.0017)    (0.0017)    (0.0017)    (0.0017)
Relative effect             -7.97%      -7.84%      -9.21%      -8.95%      -10.69%     -10.57%     -8.71%      -8.72%
b. With controls
Coefficient                 -0.0068***  -0.0068***  -0.0038**   -0.0038**   -0.0046***  -0.0046***  -0.0023     -0.0023
Stand. Err.                 (0.0024)    (0.0024)    (0.0019)    (0.0019)    (0.0015)    (0.0015)    (0.0018)    (0.0018)
Relative effect             -7.07%      -7.02%      -8.91%      -8.88%      -10.89%     -11.02%     -6.14%      -6.01%
Average Dependent Variable  0.096                   0.043                   0.042                   0.038
Number of Observations      238,296

Notes: This table presents our main estimates from equation (4) using municipalities with only one LHC. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise.
All regressions control for draw-by-state fixed effects. Regressions for the coefficients labeled as With controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise. The table shows that the results presented in Table 4 are almost identical if we exclude from our main sample the ten municipalities with more than two LHCs per municipality. The results are also robust to the inclusion/exclusion of controls and how we measure skills. Numbers in parentheses are LHC-level clustered standard errors.
* p < 0.1, ** p < 0.05, *** p < 0.01

Table A.16: Main results using the weighted score without and with controls

                            Average Health Scores
                            Unhealthy     LBW           Prematurity   Apgar < 7
                            (1)           (2)           (3)           (4)
a. Without controls
Coefficient                 -0.0084***    -0.0041*      -0.0047***    -0.0041**
Stand. Err.                 (0.0026)      (0.0021)      (0.0017)      (0.0019)
Relative effect             -8.82%        -9.54%        -11.35%       -10.92%
b. With controls
Coefficient                 -0.0070***    -0.0044**     -0.0049***    -0.0021
Stand. Err.                 (0.0023)      (0.0018)      (0.0015)      (0.0018)
Relative effect             -7.39%        -10.18%       -12.00%       -5.70%
Average Dependent Variable  0.095         0.043         0.041         0.037
Number of Observations      252,159

Notes: This table presents our main estimates from equation (4) using the weighted score. The coefficients represent the effect of being treated at an LHC that was randomly assigned SSO physicians whose skill level is higher by one standard deviation. Relative (percent) effects are computed as the coefficient divided by the average of the dependent variable. Unhealthy is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams, if the newborn infant is born after fewer than 37 weeks of gestation, or if the Apgar score of the newborn infant is lower than 7, and zero otherwise. LBW is a binary variable that takes a value of 1 if the newborn infant has a birth weight below 2,500 grams and zero otherwise. Prematurity is a binary variable that takes a value of 1 if the newborn infant is born after fewer than 37 weeks of gestation and zero otherwise. Low Apgar is a binary variable that takes a value of 1 if the Apgar score of the newborn infant is lower than 7 and zero otherwise. All regressions control for draw-by-state fixed effects.
Regressions for the coefficients labeled as With controls also include the following controls: an indicator variable for the sex of the newborn; an indicator variable that takes the value of 1 if the mother has at least secondary education and zero otherwise; an indicator variable that takes the value of 1 if the mother is 19 years old or younger and zero otherwise; marital status; number of inhabitants in the municipality; number of LHCs per municipality; an indicator variable that equals 1 if the LHC is above the 75th percentile of the low birth weight distribution for the country in 2010–2012, and 0 otherwise; an indicator variable that equals 1 if the LHC is above the 75th percentile of the prematurity distribution for the country in 2010–2012, and 0 otherwise; and an indicator variable that equals 1 if the LHC is above the 75th percentile of the Apgar score distribution for the country in 2010–2012, and 0 otherwise. The table shows that the results are very similar when the weighted score is used as a proxy of physicians’ skills. The results are also robust to the inclusion/exclusion of controls and how we measure skills. Numbers in parentheses are LHC-level clustered standard errors. * p < 0.1, ** p < 0.05, *** p < 0.01
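The outcome definitions and the relative-effect formula stated in the notes to Tables A.15 and A.16 can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and argument names (`outcome_indicators`, `relative_effect`, `birth_weight_g`, `gestation_weeks`, `apgar`) are hypothetical, and the sketch only restates the thresholds given in the notes (2,500 grams, 37 weeks, Apgar score of 7).

```python
def outcome_indicators(birth_weight_g, gestation_weeks, apgar):
    """Build the four binary outcomes described in the table notes.

    Argument names are hypothetical; thresholds follow the notes.
    """
    lbw = int(birth_weight_g < 2500)       # low birth weight: below 2,500 grams
    prematurity = int(gestation_weeks < 37)  # fewer than 37 weeks of gestation
    low_apgar = int(apgar < 7)             # Apgar score lower than 7
    # Unhealthy equals 1 if any of the three conditions holds, 0 otherwise.
    unhealthy = int(lbw or prematurity or low_apgar)
    return {
        "unhealthy": unhealthy,
        "lbw": lbw,
        "prematurity": prematurity,
        "low_apgar": low_apgar,
    }


def relative_effect(coefficient, mean_dependent_variable):
    """Relative (percent) effect: coefficient divided by the outcome mean."""
    return 100.0 * coefficient / mean_dependent_variable


# A newborn weighing 2,400 grams at 39 weeks with an Apgar score of 9
# is classified as LBW and therefore as Unhealthy.
example = outcome_indicators(2400, 39, 9)

# Relative effect from the rounded values displayed in Table A.16,
# column (1), panel a.
effect = relative_effect(-0.0084, 0.095)
```

With the rounded values displayed in Table A.16, column (1), panel a, `relative_effect(-0.0084, 0.095)` gives roughly -8.84 percent, close to the reported -8.82 percent. The small gap presumably arises because the table displays rounded coefficients and means.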