Policy Research Working Paper 10931

Using Post-Double Selection Lasso in Field Experiments

Jacobus Cilliers
Nour Elashmawy
David McKenzie

Development Economics, Development Research Group
September 2024

Abstract

The post-double selection Lasso estimator has become a popular way of selecting control variables when analyzing randomized experiments. This is done to try to improve precision, and reduce bias from attrition or chance imbalances. This paper re-estimates 780 treatment effects from published papers to examine how much difference this approach makes in practice. PDS Lasso is found to reduce standard errors by less than one percent compared to standard Ancova on average and does not select variables to model treatment in over half the cases. The authors discuss and provide evidence on the key practical decisions researchers face in using this method.

This paper is a product of the Development Research Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at dmckenzie@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Using Post-Double Selection Lasso in Field Experiments*

Jacobus Cilliers, Georgetown University
Nour Elashmawy, Development Research Group, World Bank
David McKenzie, Development Research Group, World Bank

Keywords: Treatment effect; Randomized experiment; Post-double selection Lasso; Attrition; Statistical power.
JEL classification codes: C93; C21; O12.

* We thank authors of the different papers we replicated for answering queries on their code; Christian Hansen, Kaspar Wüthrich, Carolina Lopez, Anja Sautmann, Greg Lane, Erin Kelley, and participants in the World Bank half-baked seminar for useful comments. Elashmawy was funded by the Robert S. McNamara Fellowships Program.

1 Introduction

In a simple randomized experiment, regressing the outcome of interest on an indicator for treatment will give the difference-in-means estimator, which provides an unbiased estimate of the average impact of being assigned to treatment. However, although this is true in expectation, in any given random draw the means of the treatment and control groups can differ from one another for many covariates, with the probability that these differences are large falling with sample size. One approach to improve balance and estimation precision has been through ex-ante design choices, such as the use of stratification or pairwise matching (Bai, 2022; Bruhn and McKenzie, 2009). A complementary approach is ex-post adjustment of the difference in means, by controlling for different covariates in a regression.
For example, adding the lagged outcome as a covariate in the regression for Ancova estimation can greatly improve power when the outcome is highly autocorrelated (McKenzie, 2012). However, once one moves beyond controlling for randomization strata and the lagged dependent variable to including other baseline variables as controls, the question arises of how many other covariates should be considered, and how they should be selected. Often the concern is raised that such covariate adjustments are ad hoc, involve substantial researcher degrees of freedom, and can give rise to p-hacking (Simmons et al., 2011). The post-double selection lasso (PDS Lasso) estimator of Belloni et al. (2014b) has rapidly grown in popularity as a principled way of selecting control variables in field experiments. At first this may seem surprising, since PDS Lasso was originally developed for causal inference in observational studies. The method selects controls by using the least absolute shrinkage and selection operator (Lasso) twice: once to select covariates that help predict the outcome of interest, and once to select covariates that help predict treatment status. It then takes the union of these two sets as controls in the treatment regression. But if the goal is just to improve the efficiency of treatment estimation, then it is unclear why modeling treatment status is necessary, and why field experiment researchers do not instead use modern machine-learning approaches that just select covariates based on their ability to predict the outcome (Bloniarz et al., 2016; Guo et al., 2021; List et al., 2022; Wager et al., 2016; Wu and Gagnon-Bartsch, 2018).

This paper examines why and how PDS Lasso is being used and should be used in randomized field experiments. Two features that help distinguish field experiments from the large A/B online platform experiments studied in Guo et al. (2021) and List et al. (2022) are sample size and attrition. Field experiment researchers often work with relatively small sample sizes, often in the range of 100 to 1,000 observations, where adding sample size is expensive and statistical power is a key concern. Large chance imbalances by treatment status may arise in small samples, leading to a desire to adjust differences in means for these baseline differences. Moreover, small sample sizes increase the importance of being able to improve precision by adding covariates that help reduce the variance in the outcome of interest. But smaller samples limit the ability to benefit from non-linear machine learning approaches, and also raise questions about how well the asymptotic results used to justify the choice of regularization parameters in Lasso work with the sample sizes common in field experiment applications. Wüthrich and Zhu (2023) recently showed that PDS Lasso can underselect variables in finite samples when the number of variables is reasonably large, potentially leading to omitted variable bias when these variables have moderate relationships with treatment. Second, the majority of field experiments in developing countries rely on survey data for many outcomes, which can be subject to attrition. In a survey of 96 field experiments, Ghanem et al. (2023) find an average attrition rate of 15 percent. This raises the concern that if the determinants of attrition differ with treatment status, then the sample for which data are available may no longer be comparable across treatment groups.
This offers a further potential rationale for double selection methods, which can select covariates that predict treatment status in the attrited data. But the question remains as to whether this makes much difference in practice.

We replicate and re-analyze field experiment papers published in three economics journals between 2017 and 2022 which used PDS Lasso, resulting in 780 treatment estimates. We use this analysis to identify the key practical issues and performance of PDS Lasso, and compare the estimates and their standard errors to those that would be obtained using simple Ancova. We find that while authors typically include a long list of variables to be inputted as potential controls in PDS Lasso (a median of 182 controls), PDS Lasso typically ends up selecting very few control variables. The median is three controls, and in over half the cases, no variables at all are selected in the treatment regression step. When variables are selected for treatment, they are almost never those that are also selected for predicting the outcome of interest. As a result, PDS Lasso leads to minimal changes to treatment estimates and standard errors on average, with a median change in the coefficient of 0.01 standard deviations, and a median standard error that is 99.2% of that with Ancova. In over a quarter of the cases, standard errors are actually slightly larger than they would be just using Ancova. Researchers should therefore not anticipate significant power gains on average from using this method.

We combine this re-analysis with simulations to look at when PDS Lasso performs better or worse, and to help answer practical questions facing applied researchers. In our re-analysis, we do find the treatment regression step to be more likely to select control variables when there is attrition, but even then, typical changes in coefficients are small, reflecting that attrition in field experiments is often due to reasons uncorrelated with the outcome. We show that PDS Lasso sometimes ends up being less precise than Ancova (and having a higher mean-squared error in some simulations), since by inputting very many control variables, there is a risk that the Lasso penalization results in key variables such as the lagged dependent variable not being selected. We recommend inserting such a variable in the amelioration set of variables that must always be included. We examine whether performance improves by using a different penalty parameter, such as one selected by cross-validation. Although it gives slightly smaller standard errors on average, we find that it can overfit, sometimes resulting in substantially larger standard errors. Given these limited average gains, and the risk of poor performance, using the standard plug-in penalty seems preferable. We suggest that researchers need to be considerably more judicious in their choice of control variables to input into this procedure, especially with the small samples typical in practice. We then conclude by discussing applications of this method with multiple outcomes, multiple treatments, and treatment interactions. We find that two common mistakes arise in applied work, concerning how missing values in potential controls are handled, and how the interacting variable is entered, and provide recommendations to avoid these errors. We conclude with a checklist for applying this approach in practice.
2 The Post-Double Selection Lasso Method

We summarize the most common existing approaches used to estimate treatment effects in field experiments (difference-in-means and Ancova), and then compare them to the post-double selection lasso method of Belloni et al. (2014b). Consider the following partially linear model for outcome y, for observations i = 1, 2, ..., n:

\[ y_i = \alpha + \gamma T_i + g(z_i) + \epsilon_i \tag{1} \]

where T_i is a dummy variable which takes value one if unit i was assigned to treatment, and zero otherwise; z_i are a set of control variables, and ε_i is an unobservable that satisfies E(ε_i | T_i, z_i) = 0.

2.1 Difference-in-means estimator

With pure random assignment, we have E(T_i g(z_i)) = 0, and so we can obtain an unbiased estimate of the average impact of being assigned to treatment, γ, through a simple difference-in-means equation:

\[ y_i = \alpha + \gamma T_i + \omega_i \tag{2} \]

The variance of this difference-in-means estimator will then depend on the variance of the residual term ω_i. With more complicated random assignment designs, treatment assignment may depend on control variables that are used to define randomization strata or matched pairs. Then one can add controls for these strata or pairs to equation 2.

2.2 Ancova

While the difference-in-means estimator is unbiased, efficiency can be improved through the inclusion of control variables that help explain the outcome of interest y. One such variable that takes a special place in applied work is the baseline value of the outcome of interest, y_0. Approximating g(z_i) with y_0 gives the Ancova estimator:

\[ y_i = \alpha + \gamma T_i + \delta y_{i0} + \lambda' S_i + \upsilon_i \tag{3} \]

where the S_i are a set of dummy variables for any randomization strata used. McKenzie (2012) shows this estimator improves power over the difference-in-means estimator, with the gain in power larger the more autocorrelated the outcome of interest is. This basic Ancova specification has become the default specification for many randomized field experiments. In addition to improving power, it adjusts the estimate for any baseline differences in the key outcome of interest that may have arisen from chance imbalances in the random draw, or from attrition, with the amount of adjustment data-driven and depending on how much this baseline difference predicts the future outcome. In many applications we expect the baseline value of the outcome to have the strongest predictive power for future outcome values out of any baseline covariates. However, baseline measurement of the outcome may not always be available, or in some cases may be the same for everyone in the sample. For example, an experiment on young job-seekers may have employment and income as the main outcomes of interest, but at baseline everyone may be unemployed and earning zero income.

2.3 PDS Lasso in Theory

The baseline outcome is only one of many potential control variables that researchers could use. Belloni et al. (2014b) consider the problem of selecting controls from a set of p potential regressors x_i = P(z_i), which can consist of z_i and different transformations and interactions of z_i that aim to approximate the function g(z_i). They note that it is possible that p > n, that is, that the number of potential controls is high-dimensional and may even exceed the number of observations in the dataset. There are three reasons this could occur in a randomized field experiment. First, it is common for baseline surveys or application forms for programs to collect data on many characteristics of individuals or firms applying for a program.
This may be augmented further by community or geographical characteristics. Second, since the functional form g(.) is unknown, one may wish to consider interactions, polynomials, or other non-linear transformations of the baseline variables. Third, as List et al. (2022) note, in some settings there may be many rounds of pre-treatment data, and then the question arises of how best to control for multiple lags of the variables.

Since their focus is on causal inference in observational studies, Belloni et al. (2014b) also use a partially linear model to model treatment:

\[ T_i = m(z_i) + v_i \tag{4} \]

where E(v_i | z_i) = 0. The functions g(.) and m(.) are unknown. The key assumption Belloni et al. (2014b) make is that these models are approximately sparse, which means that there are linear approximations x_i'β_g0 to g(z_i) and x_i'β_m0 to m(z_i) that require only a small number of non-zero coefficients to approximate these functions up to a small approximation error. The vector x_i here includes a constant, as well as all the covariates to be considered. The PDS Lasso method then follows a three-step procedure:

Step 1: Select control variables that predict the outcome y through a Lasso regression of y on the covariates x, not including an indicator for treatment:

\[ y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_j x_{i,j} + \cdots + \beta_p x_{i,p} + \varepsilon_i \tag{5} \]

The Lasso estimator solves the following problem:

\[ \min_{\beta} \; \mathbb{E}_n\big[(y_i - x_i'\beta)^2\big] + \frac{\lambda}{n}\lVert \Psi\beta \rVert_1 \tag{6} \]

where Ψ is a diagonal matrix of penalty loadings,1 and the key Lasso tuning parameter is given by λ. λ determines the penalty applied to selecting additional covariates: a small value of λ does not penalize additional covariates very much, and will result in more controls being selected, while a large value of λ penalizes each additional control a lot, and will result in fewer controls being selected. We discuss how to choose λ in section 2.4. Let I1 denote the set of control variables chosen in this step.

Footnote 1: The penalty loadings incorporate the standard deviations of the regressors, which is equivalent to standardizing the regressors to have unit variance. The penalty parameters developed by Belloni et al. (2012) can also adjust to account for heteroscedasticity and clustering.

Step 2: Also select control variables that predict the treatment T through a Lasso regression of T on the covariates x, using the same λ as in Step 1:

\[ T_i = \alpha_1 x_{i,1} + \alpha_2 x_{i,2} + \cdots + \alpha_j x_{i,j} + \cdots + \alpha_p x_{i,p} + \varepsilon_i \tag{7} \]

Let I2 denote the set of control variables chosen in this step.

Step 3: The PDS Lasso estimator of the treatment effect then comes from regressing y on the treatment indicator, along with the union of the sets of controls selected in the first two steps, that is, the union of I1 and I2. Belloni et al. (2014b) note that researchers may also wish to force the model to include additional variables from a set I3 that they have theoretical reasons for wishing to include for robustness, even if these are not selected in either of the first two steps. They call this third set of variables the amelioration set I3. The full set of controls selected is then the union of I1, I2 and I3, which is denoted I. The treatment effect is estimated by the following least squares regression:

\[ y_i = \alpha + \gamma T_i + x_i'\beta + \varepsilon_i \quad \text{with } \beta_j = 0 \;\; \forall j \notin I \tag{8} \]

Inference then uses the standard heteroscedasticity-robust standard errors from this linear regression.
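To make the procedure concrete, the three steps can be carried out by hand using the rlasso command from the Stata package lassopack, which accompanies pdslasso. The following is a minimal sketch only, with y, T, and x1-x100 as placeholder names:

```stata
* Step 1: Lasso of the outcome on the candidate controls (plug-in penalty);
* rlasso returns the selected controls in e(selected)
rlasso y x1-x100
local I1 `e(selected)'

* Step 2: Lasso of the treatment indicator on the same candidate controls
rlasso T x1-x100
local I2 `e(selected)'

* Step 3: OLS of the outcome on treatment plus the union of the two sets
local I : list I1 | I2
regress y T `I', robust
```

In practice, the single command pdslasso y T (x1-x100), robust carries out all three steps at once.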
Belloni et al. (2014b) show that the resulting estimator allows for imperfect selection of the controls (it does not assume that the approximation is exact), provides confidence intervals that are valid uniformly across a large class of models, and achieves the semi-parametric efficiency bound under homoscedasticity.

The most common way of implementing this estimation in practice is through the Stata program pdslasso developed by Ahrens et al. (2018). Stata versions 17 and up also include the built-in command dsregress, which likewise implements double-selection Lasso. The two packages will give slightly different results when using their defaults, for two reasons. First, the standard errors reported in pdslasso are slightly smaller because they do not adjust for the degrees of freedom from the selected controls.2 Second, the default tuning parameter λ also differs.3

Footnote 2: To make these comparable, one needs to multiply the standard errors reported in pdslasso by \(\sqrt{\frac{G}{G-1}\cdot\frac{N}{N-k-1}}\), where G is the number of clusters, N the number of observations, and k the number of variables included in the regression.

Footnote 3: Both use a default plug-in λ described by equation (9), except that dsregress uses a version that allows for heteroskedastic errors.

It is worth noting the similarities and differences with a common, but criticized, approach of deciding which control variables to include in a randomized experiment on the basis of a balance test. Here researchers do a t-test for differences in means, and then include as controls variables which are found to have a significant difference in means between the treatment and control groups. Permutt (1990) notes that this pre-testing can affect the size of subsequent tests, and can be worse for power than simply randomly choosing covariates to control for. He instead argues for controlling for the covariates most highly correlated with the outcome. Step 2 of the PDS Lasso method is akin to a penalized balance test, while Step 1 does aim to find the variables most highly predictive of the outcome, irrespective of how balanced they are.

2.4 How should the penalty parameter λ be chosen?

Belloni et al. (2012) provide a theoretically motivated choice of the tuning parameter by deriving an asymptotically optimal "plug-in" estimator for λ, which sets:

\[ \lambda = 2c\sqrt{n}\,\Phi^{-1}\!\left(1 - \frac{\alpha}{2p}\right) \tag{9} \]

where Φ^{-1}(.) is the inverse standard normal distribution, and the authors propose setting the constant c = 1.1, and α = 0.1/log(max(p, n)).4

Footnote 4: In practice, pdslasso appears to use α = 0.1/log(n), whereas dsregress uses α = 0.1/log(max(p, n)).

Remark 1: We see from equation 9 that the larger is p, the number of covariates being considered, the larger is λ. That is, the penalty increases with the number of potential covariates. λ also increases with the sample size n, but λ/n in equation 6 will decrease with n, so that the larger the sample, the lower the penalty Lasso will place on adding more covariates.

Remark 2: With clustered randomized experiments, the effective amount of independent information in the data is reduced, and so more regularization is needed. The command pdslasso uses the cluster-lasso with default α = 0.1/log(n_clusters) in this case, where n_clusters is the number of clusters. All else equal, this means fewer controls will get selected in a clustered experiment than in an unclustered experiment of the same sample size.
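As a worked illustration of equation (9), take illustrative values of n = 1,000 observations and p = 200 candidate controls, with c = 1.1 and the α = 0.1/log(n) convention of footnote 4:

```stata
* Plug-in penalty from equation (9): alpha = 0.1/log(1000) is about 0.0145,
* and lambda = 2 * 1.1 * sqrt(1000) * invnormal(1 - alpha/(2*200))
display 2*1.1*sqrt(1000)*invnormal(1 - (0.1/log(1000))/(2*200))
```

This evaluates to a λ of roughly 276; doubling p to 400 only raises it to roughly 287, illustrating how slowly the penalty in Remark 1 grows with the number of potential covariates.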
Although this formula for λ ensures optimal asymptotic properties, it is less clear how well it performs in small samples. Wüthrich and Zhu (2023) provide simulation evidence and finite sample theory to show that a finite-sample omitted variables bias can arise from the Lasso underselecting variables that have moderate explanatory power in cases where p is relatively large, and they suggest examining the robustness of results to a 50% change in the plug-in λ. Note that the issue of omitted variable bias may be less of a concern in most experiments, and instead underselection may just result in less of the residual variance in the outcome being soaked up by selected regressors than is optimal.

An alternative is to choose the tuning parameter by cross-validation. Wager et al. (2016) suggest using cross-validation for single post-Lasso estimation in randomized experiments. Belloni et al. (2014a) note that while cross-validation is commonly used when the goal is prediction, it may not result in good performance when the goal is instead causal inference. Another practical issue with the use of cross-validation is that it can introduce further researcher discretion and complicate pre-specified replicability: the choice of cross-validated λ, and thus which controls are chosen, can depend on how many folds are used to split the data, the choice of random seed, and the stopping criteria used for selecting the optimal λ within a grid search.

2.5 Why might researchers want to use double- rather than single-selection with an RCT?

PDS Lasso was developed for causal inference in observational studies, where the assumption is that treatment is only exogenous after conditioning on a set of controls. Equation 7 helps select which controls to condition treatment on. But if treatment is randomly assigned, then in expectation treatment is orthogonal to the controls. This raises the question of why we need double-selection, rather than just using Lasso (or other machine-learning approaches) for single-selection by choosing which covariates are most predictive of the outcome.

There are two main reasons why researchers might worry that treatment assignment is still correlated with baseline variables in a randomized field experiment. The first is the possibility of chance imbalances from random assignment with relatively small sample sizes. The second is attrition causing imbalances. However, even then, our intuition is that if either of these leads treatment to be correlated with a variable x_H, we would expect that this should not cause omitted variables bias unless x_H is also correlated with y, and so would be selected in equation 5 anyway. For example, suppose in a job training program we find that treatment assignment is correlated with hair color, but that hair color is unrelated to employment outcomes. Then a double-selection step which leads us to also add hair color to the regression could end up inflating standard errors and reducing precision by conditioning on what is effectively noise. However, while it is true that we need not correct for variables which have no correlation with the error term in equation 1, Belloni et al. (2014a) explain why this logic can break down with regularization. The Lasso in step 1 will tend to select x variables that have large coefficients, but because of regularization, not select variables that have moderate associations with y. However, if these variables that are moderately associated with y are strong predictors of treatment status, then not including them can still lead to substantial omitted variables bias.
The second step regression provides a second chance to select such variables, providing additional robustness against chance imbalances and selective attrition.

3 Empirical approach to examining how PDS Lasso performs in practice

How does PDS Lasso perform in practice when applied to the types of settings common in field experiments? We use two complementary approaches to answer this question and to provide guidance on a number of practical and implementation questions which arise when using this method. The first is to replicate and re-examine a set of published papers which have used this method. The second is to use simulations.

3.1 Replications and Re-analysis of Published PDS Lasso papers

Our replication and re-analysis focuses on field experiment papers published from 2017 to 2022 in the following three journals: the American Economic Journal: Applied Economics, the American Economic Review, and the Journal of Development Economics. We chose these journals because the AEA journals have had data replication packages in place for this whole time, and because field experiments are commonly published in all three journals. Out of the 90 field experiment papers identified, 19 papers (22 percent) use the post-double selection approach to select their control variables. The remainder of the papers primarily used one or more other approaches, such as difference-in-means, Ancova, and/or selection of covariates on an ad hoc basis or based on a balance test.5 A replication package was not available for one of these papers, giving a total of 18 papers for our analysis.

Footnote 5: In particular, none of the papers used the double-debiased machine learning approach of Chernozhukov et al. (2018), nor did we find papers using other machine learning approaches.

Table 1 provides some descriptive statistics of the selected papers. The median paper estimates treatment effects using PDS Lasso on 36 different outcomes, with the number of outcomes ranging from 3 to 189.6 These include an extremely broad array of outcomes, including employment, income, consumption, mental health, physical health, agriculture, environment, business, aspirations, soft skills, education, voting, livestock, and many others. They therefore encompass many of the key outcomes that appear in field experiments, and cover outcomes which we might expect to differ substantially in their ability to be predicted by baseline variables. In addition, 56% of the papers involve multiple treatments. This yields a total of 780 treatment estimates across the papers.

Footnote 6: We include only the outcomes for which the authors use a PDS Lasso specification: some papers only do this for a subset of their outcomes as a robustness test.

The median sample size is 1,913, and ranges from 418 to 13,987 observations. However, 15 of the 18 papers are clustered randomizations, with a median of 206 clusters and a minimum of 14 clusters. These sample sizes are further reduced for some outcomes by attrition. While the median attrition rate is 10 percent, with median differential attrition of 2 percent, there can be large variation in attrition across outcomes within the same study. This can occur due to item non-response, to some outcomes only being available conditional on another outcome (e.g., income may only be measured for those who work), or to a survey or administrative data source only being present for a subset of observations. As a consequence, one-quarter of the treatment estimates involve at least 24% attrition and at least 7% differential attrition, and attrition rates exceed 50 percent in some cases. This provides us with many treatment estimates in which there may be concerns about potential imbalances between treatment and control due to small samples or attrition.
Authors of these studies tend to include a long list of variables to be inputted as potential control variables in PDS Lasso. The mean is 405 controls, and the median is 182 controls. Notably, these are both larger than the median number of clusters in the study. The number of controls also shows substantial variation, ranging from six to 3,340 variables. We also look at two ratios: first, the ratio of inputted controls to the sample size of the experiment, which ranges from 0.003 to 1.73 with a median equal to 0.16. It is worth noting that the ratio exceeds 1 in fewer than 1 percent of the cases. Second, the ratio of inputted controls to the number of clusters, which shows massive disparity, with a minimum of 0.04 and a maximum of 17.58. We find that the mean is 2.27 and the median is 0.86.

For each of these 18 papers and 780 treatment effects, we re-estimate the treatment effect using three different specifications: (i) using PDS Lasso, with the control variable input set specified by the authors, using the pdslasso command and its default plug-in estimator for λ, but then adjusting the standard errors for a degrees of freedom correction; (ii) using Ancova, where we control for the lagged dependent variable (if available) and strata fixed effects; and (iii) using PDS Lasso, but with cross-validation to select the tuning parameter λ.7

Footnote 7: 10 folds, selecting the λ that gives the minimum of the CV-function.

This enables us to examine how authors use PDS Lasso with experiments in practice, and what additional complications may arise that are not emphasized in theoretical work. We see how frequently PDS Lasso selects variables at each step of the estimation process, and how many variables are typically selected. We then compare the PDS Lasso and Ancova estimates to see how much of a change in standard errors (precision) is achieved by using this method, and how much of a change in the estimated treatment effect coefficients occurs. We then see how much these differences vary with features of the studies, such as the attrition rate.

3.2 Simulations

We supplement our re-analysis of real field experiments with additional evidence from simulations. In analyzing real experiments, we do not know the true treatment effect, and so while we can examine how much the coefficients and standard errors change, we are not able to directly estimate bias or mean-squared error. The simulations also enable us to vary some design features such as the sample size, number of controls inputted, and degree of correlation between covariates and the outcome. In our simulations, we use the following data generating process:

\[ y_i = \gamma T_i + \sum_{j=1}^{p} \beta_j x_{i,j} + \epsilon_i \tag{10} \]

where \(x_i \overset{iid}{\sim} N(0_p, I_p)\), \(\epsilon_i \overset{iid}{\sim} (0, 1)\), and each observation has an equal probability of being randomly assigned to treatment, T_i. Each new draw of the data randomly varies x_{i,j}, ε_i, and T_i. The object of interest is γ̂, the coefficient on treatment, which is set equal to 0.5 in expectation. Standard errors are adjusted to account for degrees of freedom. In all the simulations we allow for one variable, x_{i,1}, to be moderately or highly correlated with y_i.
This variable can be thought of as the lagged dependent variable, which the researcher expects ex ante to be correlated with y_i, and would include in an Ancova estimation. In addition, four variables have a low level of correlation with y_i: β_j = 0.05 for 2 ≤ j ≤ 5. The remaining x_i variables in the model are only correlated with the dependent variable by chance, and should not be included when estimating the treatment effect. This DGP thus satisfies the exact sparsity assumption, since only a small number of the coefficients are non-zero, a condition under which the PDS Lasso method is expected to perform well. In Section 4 we compare the performance of PDS Lasso with the Ancova specification, defined as:

\[ y_i = \gamma T_i + \beta_1 x_{i,1} + \epsilon_i \tag{11} \]

Note that this is not the oracle estimator, since it does not include x_{i,2} to x_{i,5}. PDS Lasso can thus out-perform a model using researcher-determined control variables, but it might also under-perform if x_{i,1} is not chosen. We then consider the performance of PDS Lasso under different features of the data generating process, such as the number of observations, n, and the correlation between the regressors and the dependent variable, β_1. In Section 5 we vary different design decisions made when implementing the PDS method, such as the number of potential regressors to include, p, and how to calculate the penalty parameter, λ.
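A minimal sketch of one draw of this DGP, and of the Ancova and PDS Lasso estimators applied to it, is as follows (n = 500 is illustrative, with β_1 = 0.3 as in several of our later simulations):

```stata
* One draw of the DGP in equation (10): p = 20 candidate controls,
* beta_1 = 0.3, beta_2 = ... = beta_5 = 0.05, all other betas zero,
* and a true treatment effect of 0.5
clear
set seed 20240901
set obs 500
generate T = (runiform() < 0.5)
forvalues j = 1/20 {
    generate x`j' = rnormal()
}
generate y = 0.5*T + 0.3*x1 + 0.05*(x2 + x3 + x4 + x5) + rnormal()

* Ancova-style benchmark (equation 11) versus PDS Lasso
regress y T x1, robust
pdslasso y T (x1-x20), robust
```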
4 How much difference does PDS Lasso make in practice?

PDS Lasso will improve on the base specification of Ancova and strata controls in a field experiment setting if the outcome of interest can be predicted well by other baseline covariates (improving precision), or if there are imbalances in treatment assignment related to controls that are also correlated with the outcome (reducing bias). It is unclear in the typical practical application how much of a gain this provides, and therefore how much researchers should plan on benefiting from the use of PDS Lasso when designing studies and doing power calculations. We therefore examine how the method operates in practice, using both our replication analysis and simulations.

4.1 How often are variables selected for Y and T in field experiments?

Recall that the field experiments in our sample included a mean (median) of 405 (182) potential control variables for PDS Lasso to select from. Figure 1 and Table 1 summarize how frequently the method selects variables from these input sets that help predict the outcome (step 1) and/or treatment status (step 2). We see, despite the large number of potential control variables inputted, that PDS Lasso typically ends up selecting very few control variables: the mean is 3.62 and the median is 2 variables, and the 75th percentile is only 5 variables. When we consider the two steps, we see that at least one variable is selected in the outcome regression (step 1) in 71% of cases, but the mean is for only 2.74 variables, and the median for only 1 variable, to be selected. In over half (57%) of the cases, no variables are selected at all in the treatment regression (step 2), and when variables are selected, it is typically only one or two variables. Moreover, there is almost no overlap at all between the variables selected in step 1 to predict the outcome, and those selected in step 2 to predict the treatment.

One interpretation is that any imbalances that occur in field experiments due to attrition or unlucky draws tend to be for idiosyncratic reasons, and do not result in selection into treatment on the basis of baseline variables that are strongly associated with future outcomes. Table A1 shows the results are not being driven by one or two papers with many outcomes, but are similar if we reweight the data at the paper level. Given these results, we might expect to see PDS Lasso result in quite modest changes in precision and in coefficient estimates compared to Ancova, since in practice the method is not selecting many variables to control for, and the second step is not selecting variables that are very strongly correlated with the outcome. We examine this next.

4.2 How much does PDS Lasso change standard errors and treatment estimates relative to Ancova?

The bottom panel of Table 2 compares the estimated treatment effects and standard errors from PDS Lasso to those using Ancova.8 Since we have a wide variety of outcomes measured in different units, for each outcome we calculate the absolute change in the treatment coefficient, relative to the control group standard deviation. We see very minimal changes in treatment estimates on average, with a mean of 0.05 S.D. and a median of 0.01 S.D. The 75th percentile is 0.04 S.D. The upper tail is larger, with a 95th percentile of 0.15 S.D., but this and the maximum can reflect some outcomes where the control group standard deviation is close to zero, so that even a large standard deviation increase can be small in absolute terms.

Footnote 8: Recall that we are using the term Ancova here to describe a regression which just controls for randomization strata and the lagged dependent variable if available; for outcomes where no lagged dependent variable is available, the comparison is just to a regression with strata fixed effects.

To compare standard errors, we construct a ratio by dividing the standard error using PDS Lasso by the standard error using Ancova. A value smaller than one indicates a smaller standard error, and hence an improvement in statistical power, from using PDS Lasso. The pdslasso command can also result in smaller standard errors even when the same variables as Ancova are used, due to not adjusting for degrees of freedom. We therefore show this ratio with and without adjusting for degrees of freedom. Table 2 shows that the standard error ratio is typically very close to unity, with a median of 0.992. While on average PDS Lasso yields marginally smaller standard errors, in about 25 percent of cases it results in larger standard errors than Ancova. Figure 2 plots the ratio of PDS Lasso to Ancova coefficients against the ratio of their standard errors. We see the mass is concentrated around 1 on both axes, suggesting that the majority of the time the two methods will give similar results.

We examine the change in the statistical significance of outcomes when using PDS Lasso instead of Ancova and vice versa. Table A2 shows that using PDS Lasso instead of Ancova results in 49 (6.3%) and 42 (5.4%) insignificant outcomes becoming significant at the 5 and 10 percent levels, respectively. There are a smaller number of treatment effects that are significant using Ancova that become insignificant when using PDS Lasso (15 (2%) at the 5 percent level, and 16 (2%) at the 10 percent level).

A key reason applied researchers use PDS Lasso is the hope that it will enable them to improve statistical power through selecting additional variables to control for in their regressions. In pre-analysis plans and grant applications they may then write that they plan on using this method and expect it to yield a reduction in the minimum detectable effect (MDE) size, or an improvement in power. However, these results suggest that such improvements will be incredibly modest, and are not guaranteed, relative to just using Ancova. Since the MDE with 80 percent power is 2.8 times the standard error, the implied MDE with PDS Lasso will only be 0.9% smaller than with Ancova. This will likewise result in very small changes in the sample size required to detect a given effect size. As an example, a sample size of 786 (393 treatment, 393 control) is needed to have 80 percent power to detect a 20 percent increase in an outcome like profits or income, in which the mean and standard deviation are the same size. Using PDS Lasso reduces this by 14 units, to 772, at the median ratio.
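The sample size of 786 can be reproduced with Stata's built-in power command; this is a sketch, assuming a two-sided test at the 5 percent level with the control mean and standard deviation both normalized to one:

```stata
* 80 percent power to detect a 20 percent increase when the mean and
* standard deviation are both 1: about 393 per arm, 786 in total
power twomeans 1 1.2, sd(1) power(0.8)
```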
4.3 When does PDS Lasso make more of a difference?

Although PDS Lasso does not select control variables for treatment in the majority of our cases, we might expect this to be more common when there is attrition. Panels (a) and (b) of Figure 3 show this is the case. Panel (a) shows that the treatment regression selects at least one control variable only 12% of the time when there is zero attrition, and more often when there is attrition. However, the relationship is non-monotonic, which could indicate that many cases of high overall attrition still reflect balance on baseline variables by treatment status. In panel (b) we see a monotonic relationship between the differential attrition rate and the likelihood of selecting control variables in the treatment regression. With no differential attrition, this occurs only 11% of the time, compared to 36% of the time with less than 5 percent differential attrition, and 67% of the time with more than 5 percent differential attrition.

Panels (c) and (d) of Figure 3 examine how much the estimated treatment coefficient changes using PDS Lasso compared to Ancova as the attrition and differential attrition rates vary. We see a significantly higher mean percentage change in the treatment coefficient when differential attrition is greater than 5 percent, but even so, the mean change is only 0.08 standard deviations.

We would also expect PDS Lasso to make more of a difference to standard errors over Ancova when there is no lagged dependent variable available, and PDS Lasso is then potentially able to predict the outcome with other controls. We compared the standard error ratio (PDS Lasso/Ancova) for cases where no lagged dependent variable was available to those where it was available and also included in PDS Lasso. There is not much difference at the median (0.996 when there is a lagged variable, compared to 0.99 with no lagged variable available). However, at the 25th percentile the ratio is smaller (meaning more improvement) at 0.938 with no lag, compared to 0.974 with a lagged dependent variable available.

Finally, using the replication data, we also looked in more detail at the outliers, where PDS Lasso had resulted in the biggest change in standard errors compared to Ancova. Qualitatively, we see these arise from cases where there was no lagged dependent variable available, and where just a few9 controls are chosen by PDS Lasso that help predict the outcome well.

Footnote 9: Of the 25 largest reductions in standard error, PDS Lasso selected a median of 3 controls to add.
For example, the largest reductions in standard errors from using PDS Lasso occur in Garbiras-Diaz and Montenegro (2022). They conduct an experiment to look at the effects of encouraging citizen oversight in Colombia. No lagged outcome is available, and the standard error after using PDS Lasso is 50-60% of that using just strata fixed effects. However, almost all of this improvement comes from the inclusion of a single control: the number of candidates in the election strongly predicts the vote share (a regression of the outcome on this single regressor in the control group has an R2 of 0.17). Hence in many cases, the gains from PDS Lasso come from adding a couple of controls that researchers might otherwise have used contextual knowledge to add.

4.4 Why is Ancova sometimes more precise than PDS Lasso?

We saw that in one quarter of the empirical cases, PDS Lasso resulted in larger standard errors than Ancova. There are two potential reasons for this. The first is that by including many potential control variables for PDS Lasso to select among, it may fail to select the lagged dependent variable, or indeed any variables at all. In 23% of the replications in which a lagged dependent variable was available, but was not included by researchers in the amelioration set, the lagged variable was not selected by PDS Lasso.10 Second, it could end up adding noise and reducing precision by selecting variables that are irrelevant.

Footnote 10: The lagged dependent variable was not selected in 21 out of 93 such cases.

We examine this in simulations reported in Figure 4. In these simulations we set p = 20, vary the autocorrelation of the lagged dependent variable β_1 ∈ {0.2, 0.4}, have four weakly relevant controls, β_j = 0.05 for 2 ≤ j ≤ 5, and 15 completely irrelevant controls, β_j = 0 for j > 5, and then vary n. Figure 4(a) then shows that PDS Lasso fails to select the lagged dependent variable in moderately sized samples that are typical of many field experiments. When the correlation between x_{i,1} and y is relatively strong (β_1 = 0.4), the variable always gets selected in samples of 300 or larger. But when the correlation is relatively weak (β_1 = 0.2), the variable only gets selected with 100 percent probability once the sample reaches 1,000 observations or more, and gets selected less than half the time with samples of 300. This problem becomes even more acute in clustered randomized trials, especially if the intra-cluster correlation is high.11

Footnote 11: For example, Figure A1 in the appendix shows that with 50 clusters per treatment arm, 20 observations per cluster, an intra-cluster correlation (ICC) of 0.4, and β_1 = 0.3, the variable has about an 80 percent chance of being selected.

In contrast, we see in Figure 4(b) that PDS Lasso rarely selects the completely irrelevant variables included in the control set, and only starts selecting the weakly relevant controls (β_j = 0.05) with sample sizes of 1,000 or larger. So the main issue for precision is failure to select the most important controls in small and moderate sample sizes. As a result, Figure 4(c) shows that the standard errors are larger compared to the Ancova specification in small samples, especially if the correlation between the lagged dependent variable and the outcome is relatively low. However, it is in small samples where statistical power is typically of most concern, and adding control variables to improve precision is of most interest. One solution is then to include the lagged dependent variable in the amelioration set and partial it out before PDS Lasso then selects other controls.
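A sketch of this solution in pdslasso syntax, assuming the lagged outcome is stored as y0 (all names illustrative):

```stata
* Partial out the lagged dependent variable so it is always controlled for,
* with PDS Lasso selecting only among the remaining candidate controls
pdslasso y T (x1-x200), partial(y0) robust
```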
Figure 4(d) shows that when this is done, PDS Lasso does at least as well as Ancova at all but the smallest sample sizes (where it occasionally selects an irrelevant variable), and starts to be more precise for samples of 300 or more in this simulation.

5 Practical Issues in Implementation

There are several questions that arise in practice when implementing PDS Lasso with field experiments. These include the appropriate choice of penalty parameter, how many and which controls to include, what to place in the amelioration set, whether the double step is needed at all, and how to deal with multiple treatments and treatment interactions. We discuss the theory and evidence for how researchers should decide on these questions.

5.1 Should researchers use a penalty parameter other than the plug-in default?

As discussed above, the "plug-in" penalty parameter behaves well asymptotically, but Wüthrich and Zhu (2023) show that it can underselect variables in finite samples when the number of control variables is reasonably large. The alternative is cross-validation. Chernozhukov et al. (2024) suggest that cross-validation instead raises the risk of overfitting, and can perform poorly in moderately sized samples as a result. We examine how cross-validation performs in practice using both our replication data and simulations.

Figure 5 shows simulation results, varying the number of observations between 100 and 10,000.12 Panels (a) and (b) show that the cross-validation method tends to select far more variables than the plug-in approach, in both steps 1 and 2. For example, when n = 500 the median number of variables selected that are predictive of the dependent variable using the plug-in penalty parameter is one, compared to five for cross-validation. Since the oracle estimator includes five control variables, this suggests that there might be some power advantage of using cross-validation to select the penalty parameter in smaller samples. However, panels (a) and (b) also show that cross-validation often selects too many variables. The method is also unstable, with high variation in the number of variables selected. This could reduce precision, for the reasons discussed above. Overall, panel (c) shows that the median standard error tends to be smaller for cross-validation in smaller samples, and equal to that of the plug-in parameter in very large samples. However, the variation in standard errors is larger, especially in smaller samples. That is, using cross-validation risks overfitting.13

Footnote 12: We further set p = 20, β_1 = 0.3, and β_j = 0.05 for 2 ≤ j ≤ 5.

Footnote 13: Two other approaches to reduce the risk of over-fitting from cross-validation are (i) the adaptive Lasso (Zou, 2006), or (ii) setting the penalty parameter such that the cross-validated error is one standard error larger than the minimum. They can be specified in the dsregress command using the "selection(adaptive)" and "selection(cv, serule)" options, respectively. Our simulations showed that the adaptive lasso performed similarly to simple cross-validation, although it still tended to over-select in small samples. The 1 SE rule selected fewer variables than using the plug-in penalty parameter.

Table 3 examines the performance of cross-validation in real field experiments by re-estimating each treatment effect using cross-validation instead of the plug-in penalty parameter. Comparing the number of variables selected to those presented in Table 1, we confirm that cross-validation tends to select substantially more variables: a mean (median) of 26.96 (13) controls, compared to 3.7 (2) controls with the plug-in penalty parameter. The risk of overfitting is seen with a maximum of 264 controls selected. The median standard error is marginally smaller with cross-validation, at 98.4% of the plug-in standard error. However, there are some extreme cases where the standard error is substantially larger.
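For reference, a sketch of how the cross-validated variant can be run with Stata's built-in command, using illustrative names; we fix the seed because, as noted in section 2.4, the selected λ depends on the fold split:

```stata
* Double-selection Lasso with a 10-fold cross-validated lambda; rseed()
* fixes the fold assignment so that the selection is replicable
dsregress y T, controls(x1-x200) selection(cv, folds(10)) rseed(20240901)
```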
Overall, given that cross-validation does not offer clear improvements, lacks theoretical justification, and carries this risk of overfitting, there does not seem to be a compelling case for applied researchers to use cross-validation instead of the plug-in penalty.

Table A2 shows the change in significance when using cross-validation as compared to Ancova and PDS Lasso. Compared to the plug-in penalty parameter, the cross-validation approach leads to just as many losses in significance as gains. In contrast, using cross-validation instead of Ancova is much more likely to result in a gain of statistical significance than a loss: 31 (4%) of the treatment estimates lose significance at the 5 percent level, and 59 (7.8%) would gain significance by using cross-validation instead of Ancova.

5.2 How many and which covariates should researchers include in the control set?

The theoretical treatments of PDS Lasso take the vector of potential covariates x, and its dimension p, as given, and provide a method for choosing which subset of controls from this larger set should be included. However, in practice, researchers typically have considerable choice over which variables they consider as potential covariates. In order to further reduce the p-hacking concerns of Simmons et al. (2011), this set should ideally be pre-specified. But then researchers need guidance on how many and which covariates to choose.

One approach is the "kitchen sink" approach of choosing a very large number of baseline covariates, potentially with p even larger than n. This approach seems particularly important in observational studies, where the assumption of exogeneity of treatment conditional on observables may be deemed more plausible if a very large set of observables is considered. Applied researchers then sometimes see this as showing robustness with no downside. For example, in introducing the Lasso for inference, StataCorp gives an example with 104 covariates and writes "we do not worry about overfitting the model, because the control variables that we specify are potential control variables. Lasso will select the relevant ones".14 This kitchen sink approach describes a lot of the papers we replicate, given a mean (median) of 405 (182) inputted controls. However, as equation 9 shows, the penalty parameter λ does increase with p, and so by starting with a very large set of potential controls, it is possible that none get selected, whereas some might get selected if PDS Lasso were run on a smaller subset. Belloni et al. (2014a, p. 40) note that distinguishing true predictive power from spurious associations becomes more difficult as more variables get considered, and so advise selection over a collection of variables that is "not overly extensive" and "where some persuasive economic intuition exists" for their inclusion.

Footnote 14: https://www.stata.com/features/overview/lasso-inferential-methods Accessed February 21, 2023.
However, at the other extreme, if the number of controls being considered is not very large at all (p small relative to n), then there may be no advantage at all to running PDS Lasso. Belloni et al. (2014b) note that post-double-selection is first-order equivalent to a regression that just includes the full set of controls when p is small relative to n.

We illustrate this via simulations which vary the sample size and the number of inputted controls. Figure 6 shows the mean and median number of controls selected by PDS Lasso when we include 5, 40, and 1,600 control variables as inputs. Fewer variables get selected when more variables are included in the input set. This is particularly concerning with smaller samples, where by putting in many potential controls, we risk having none at all selected. But even with a sample size of 10,000 observations, inputting a large number of variables (1,600 in this example) leads to too few variables getting selected.

We have three additional pieces of advice for empirical practice. A first and important note is that researchers should ensure that their sample size n available for estimation does not change as one changes the set of variables in x. This means ensuring that there are no missing values in any of the variables included in x, which can be accomplished by dummying out missing values and including missing value indicator variables as additional covariates (see the code sketch below). We discovered that this issue affects multiple published papers in our replications. In one extreme case, the variables in one paper were selected using a subsample that was a tenth the size of the overall sample due to this issue. One common issue that can then arise is that when a variable x1 and a dummy variable indicating that x1 is missing are both included in the set of controls inputted, PDS Lasso may only end up selecting one of the two. If researchers wish to include both, they can of course re-estimate the model by adding in the unselected variable. However, our view is that researchers should be comfortable including only one of the two. For example, age may not have any predictive power for the outcome nor be associated with treatment in an RCT, but refusing to tell your age (and hence having it missing) could be a strong predictor of differential attrition or of some outcomes of interest. Since we are not trying to interpret the coefficient on x1, but just use it to improve precision or reduce imbalance, it does not matter that it may include some zeros for missings, or just be an indicator for missing responses on a variable not included in the regression.
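A minimal sketch of this missing-value handling, with illustrative variable names:

```stata
* For each candidate control, add a missing-value indicator and recode
* missing entries to zero, so the estimation sample does not shrink
foreach v of varlist x1-x200 {
    generate byte `v'_miss = missing(`v')
    replace `v' = 0 if missing(`v')
}
```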
A second practical recommendation is that, especially with small samples, researchers might want to discretize some continuous control variables that have highly skewed distributions or outliers (e.g., as a dummy variable for being above or below the median), rather than include them as continuous variables. Otherwise, there is a risk of overfitting based on just a few observations.

A third practical issue that comes up for applied researchers is how they should handle a set of fixed effects. For example, they may have fixed effects for each survey enumerator, or for different geographic regions, or different industries. They then wonder whether they should take an "all or nothing" approach, whereby either they include a full set of dummies for these different categories, or none at all. Inputting these fixed effects as control variables into PDS Lasso may instead select only one or two enumerator, industry, or region fixed effects to include. This may seem sensible if perhaps only one or two enumerators, or one or two regions, may be predictive of the outcome or treatment status. However, recent work by Kolesár et al. (2024) notes that the approximate sparsity assumption can be problematic with categorical variables, and results are quite sensitive to the normalization used (e.g., to which category is set as the base category). Unless researchers have domain-specific knowledge to help prune and/or combine many categories into a smaller subset, and to suggest sparseness, we therefore urge caution in using PDS Lasso to select only a subset of fixed effects, and instead suggest partialling out the full set of any fixed effects the authors wish to include.

5.3 What should go into the amelioration set?

Recall that the amelioration set I3 is the set of additional variables that researchers want included in the regression, even if they do not get selected in either of the two double-selection steps for predicting y or T. Wüthrich and Zhu (2023) recommend using variables suggested by theory and prior knowledge in this set to help mitigate concerns about finite sample omitted variable bias. This seems less of a concern in the experimental setting, if the goal is mostly to reduce residual variance rather than to explain treatment, and it seems rare that researchers would have strong priors on variables that predict treatment given random assignment. Belloni et al. (2014b) note that one of the regularity conditions required for the approximate sparsity condition underlying their results is that the size of the amelioration set should not be substantially larger than the size of the set of variables selected by Lasso.

Our view is that the most likely variables to be chosen by researchers are those in the Ancova specification in equation (3): that is, the lagged dependent variable, and any randomization strata fixed effects. This guards against the underselection of the lagged dependent variable shown in section 4.4. As discussed by Bruhn and McKenzie (2009), it is also prudent to always include strata fixed effects in the regression. Belloni et al. (2016) consider the case of panel data and individual fixed effects, and note that approximate sparseness is likely to be inappropriate for dealing with individual-specific heterogeneity, and that one should instead partial out individual fixed effects and assume approximate sparseness on demeaned data in the panel setting. Likewise, with strata fixed effects, one will usually want to include them all, and then require approximate sparseness conditional on the randomization strata. And as discussed in the prior section, given the recent work by Kolesár et al. (2024), any other fixed effects such as enumerator or regional fixed effects may also want to be included in this amelioration set rather than having PDS Lasso select only a subset of them.

There are then two ways of including this amelioration set. The first is to include the variables, but penalize the selection of other variables accordingly, so that p in equation (9) includes the variables in the amelioration set. This is implemented by the aset() option of the pdslasso command. The alternative is to apply zero penalty weight to them, which can be done by partialling out these variables before applying Lasso, in which case they will not be included as part of p. The partial() option does this.
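The two options can be sketched in pdslasso syntax as follows, assuming a lagged outcome y0 and strata dummies s1-s10 (names illustrative):

```stata
* Option 1 (amelioration set): y0 and the strata dummies are always
* included, and still count towards p in the penalty of equation (9)
pdslasso y T (x1-x200), aset(y0 s1-s10) robust

* Option 2 (partialling-out): y0 and the strata dummies receive zero
* penalty and are swept out before the two Lasso steps
pdslasso y T (x1-x200), partial(y0 s1-s10) robust
```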
5.3 What should go into the amelioration set?

Recall that the amelioration set I3 is the set of additional variables that researchers want included in the regression, even if they do not get selected in either of the two double-selection steps for predicting y or T. Wüthrich and Zhu (2023) recommend using variables suggested by theory and prior knowledge in this set to help mitigate concerns about finite-sample omitted variable bias. This seems less of a concern in the experimental setting, if the goal is mostly to reduce residual variance rather than to explain treatment, and it seems rare that researchers would have strong priors on variables that predict treatment given random assignment. Belloni et al. (2014b) note that one of the regularity conditions required for the approximate sparsity condition underlying their results is that the size of the amelioration set should not be substantially larger than the size of the set of variables selected by Lasso.

Our view is that the most likely variables to be chosen by researchers are those in the Ancova specification in equation (3): that is, the lagged dependent variable and any randomization strata fixed effects. This guards against the underselection of the lagged dependent variable shown in section 4.4. As discussed by Bruhn and McKenzie (2009), it is also prudent to always include strata fixed effects in the regression. Belloni et al. (2016) consider the case of panel data and individual fixed effects, and note that approximate sparseness is likely to be inappropriate for dealing with individual-specific heterogeneity, so that one should instead partial out individual fixed effects and assume approximate sparseness on the demeaned data in the panel setting. Likewise, with strata fixed effects, one will usually want to include them all, and then require approximate sparseness conditional on the randomization strata. And as discussed in the prior section, given the recent work by Kolesár et al. (2024), any other fixed effects, such as enumerator or regional fixed effects, may also be better placed in this amelioration set rather than having PDS Lasso select only a subset of them.

There are then two ways of including this amelioration set. The first is to include the variables, but penalize the selection of other variables accordingly, so that p in equation (9) includes the variables in the amelioration set. This is implemented by the aset() option of the pdslasso command. The alternative is to apply zero penalty weight to them, which can be done by partialling out these variables before applying Lasso, in which case they will not be included as part of p; the partial() option does this.
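In pdslasso syntax, the two routes look as follows, where lag_y is a hypothetical name for the lagged dependent variable:

    * Option 1: amelioration set - lag_y is always included, and the
    * penalty on the remaining variables accounts for it (p includes lag_y)
    pdslasso y treatment ($controls lag_y), aset(lag_y)

    * Option 2: zero penalty - lag_y is partialled out before the
    * Lasso steps, so it is not part of p
    pdslasso y treatment ($controls lag_y), partial(lag_y)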
In practice, we find that out of the 18 replicated papers, only one paper used the aset() option and 12 papers used the partial() option. The 5 remaining papers used neither option; for these, we partialled out the constant in order to get the correct number of controls selected by PDS Lasso.15 Table A3 in the appendix compares the total number of controls selected using the partial() option with that using the aset() option. The overall number of variables selected is almost the same under either option; the main distinction is in the maximum number of controls selected, which is 68 variables using partial() and 78 variables using aset(). We also check the difference in the number of variables selected between the two options: while in rare cases there can be a large difference in the number of control variables selected between the two approaches, the mean difference is only 1 variable, and the median is no difference.

15 The dsregress program also allows one to specify a set of variables that must be included in the final regression, but there is no function to partial out these variables in advance of performing variable selection.

It is not clear ex ante which method is more appropriate. On the one hand, since the plug-in penalty parameter can underselect in small samples, it seems appropriate not to further penalize the variables, especially in smaller samples. On the other hand, the increase in standard errors due to the degrees of freedom adjustment is particularly costly in small samples with many control variables. It could also depend on whether the strata dummies explain any variation in y_i, or on how correlated they are with other potential control variables. To answer this question, we ran simulations for different numbers of strata (10 or 100) and different numbers of observations (200 or 600), and varied whether or not the strata are correlated with the dependent variable, as shown in table A4 in the appendix. Overall, we find almost no difference in standard errors. The decision whether to penalize the other variables or not therefore does not seem to matter much, at least for the DGP used in our simulations.

5.4 Is the double step even necessary?

We have seen that in over half the empirical cases, no variables at all are selected in the second step (predicting treatment). When variables are selected in this second step, Table 2 shows that they are almost entirely variables that are not selected in step one, and so are not strongly correlated with the outcome. This raises the question of whether it is worth having this second step at all in a randomized experiment.

We examine this through simulations, reported in Figure 7. We hold the sample size constant and introduce selective imbalance by setting x_{i,1} = \alpha_1 T_i + e_i before constructing equation 10, where e_i ~ iid N(0, 1) and \alpha_1 \in (0.1, 0.2, 0.3). Such imbalance would occur if the attrition rates were different between the treatment and control groups and the probability of attrition is correlated with x_{i,1}.16 These are the conditions under which the double selection would potentially be valuable.

16 For example, \alpha_1 \approx 0.2 if 11 percent of observations in the treatment group (those with the highest values of x_{i,1}) attrite.

The main take-away from Figure 7 is that including Step 2 reduces the mean square error (MSE) when there is a moderate degree of correlation between x_{i,1} and y, but increases the MSE if the correlation is very low; it makes no difference when \beta_1 is high. When \beta_1 is very low, x_{i,1} is not a source of omitted variable bias, so including the variable does not change the extent of bias (panel (b)), but it increases the standard errors (panel (c)). In contrast, when \beta_1 is very high, the imbalanced variable gets selected in Step 1, so the double selection does not make any difference (panel (a)). Taken together, it is only in the intermediate cases, where \beta_1 is sufficiently high that x_{i,1} is a source of omitted variable bias, but sufficiently small that it is not selected in the first step, that the second step reduces the MSE. The exact range of \beta_1 over which Step 2 reduces the MSE depends on the sample size and the extent of imbalance.

This suggests there can be value in the double-selection step, but that researchers should again be judicious in the choice of which variables to include in the control variables input set: including a whole host of variables that might be linked to attrition, but that are at most very weakly correlated with the key outcome, can increase mean squared error.
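To fix ideas, here is a minimal Stata sketch of one draw from a DGP of this form; the specific coefficient values (\gamma = 0.2, \beta_1 = 0.1, \alpha_1 = 0.2) are illustrative assumptions, not the exact values used in our simulations:

    * one simulation draw (all parameter values are assumptions)
    clear
    set seed 12345
    set obs 1000
    gen byte T = runiform() < 0.5          // random assignment
    gen x1 = 0.2*T + rnormal()             // alpha_1 = 0.2: selective imbalance
    forvalues j = 2/20 {
        gen x`j' = rnormal()               // balanced controls
    }
    gen y = 0.2*T + 0.1*x1 + rnormal()     // beta_1 = 0.1: moderate link to y
    pdslasso y T (x1-x20)                  // double selection
    rlasso y x1-x20                        // outcome (Step 1) Lasso only; regress y on T
                                           // plus its selected controls for single selection

Repeating such draws and comparing the double- and single-selection estimates is how one would trace out the bias and variance trade-off shown in Figure 7.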
5.5 What if there are multiple outcomes and/or multiple treatments?

Our discussion so far has been limited to the canonical case of a single outcome y and a single treatment. But all of the studies have multiple outcomes, and, as Table 1 shows, 56 percent of the studies employ multiple treatments.

With multiple outcomes, PDS Lasso can simply be run separately for each outcome. However, two implications of this for empirical work are worth noting. First, a different set of control variables can potentially be chosen for each outcome, or for the same outcome measured at different points in time. For example, in one paper where the authors input 115 controls, PDS Lasso selects no controls for one outcome but 13 controls for another. In another paper, with 3,340 inputted controls, the range gets bigger: PDS Lasso selects no controls for one of the outcomes but 51 controls for another. This is not a problem, and is indeed sensible, since, for example, the variables that best predict labor force participation may differ from those that predict earnings or fertility. But it does differ from the ad hoc "robustness check" approach to including covariates, which tends to add the same set of covariates for each outcome. Second, if the sample size differs across outcomes (as is often the case with item non-response in surveys), then the treatment selection regression in equation (7) might also select different variables for different outcomes. As a result, researchers should not simply run the treatment selection regression once and see whether any variables are selected, but should ensure it is run for each sample being used for analysis.

With multiple treatments, we can extend equation 1 to include D treatments:

    y_i = \alpha + \sum_{d=1}^{D} \gamma_d T_{d,i} + g(z_i) + \varepsilon_i    (12)

Urminsky et al. (2016) note that the method easily extends to incorporate these additional treatments by simply repeating step 2 (the Lasso treatment selection regression) for each treatment. Note here that each step 2 regression will be comparing a particular treatment to the combined group of the control and all other treatments. The post-double-selection regression will then include the union of all covariates chosen to predict y, as well as those chosen to predict any one of the different treatments. In the typical field experiment this may still select few, if any, variables, given that random assignment sets each treatment orthogonal to these controls in expectation.

5.6 How should PDS Lasso be used with treatment interactions?

In addition to estimating the average effect of being assigned to treatment, it is common for researchers to examine treatment heterogeneity by interacting treatment with one or more covariates. For example, they may augment equation 1 to allow the treatment effect to vary with variable x_{i,1}, giving the following partially linear model:

    y_i = \alpha + \gamma T_i + \delta T_i x_{i,1} + \mu x_{i,1} + g(z_i) + \varepsilon_i    (13)

where x_{i,1} could also be part of the vector x used to approximate g(z_i). Lasso can then be used to select variables that predict y and T as before, but now there is the additional term T x_1 to also consider. Running step 2 (equation 7) to predict T x_1 can be done to select additional covariates to add as controls; x_1 should then be placed in the amelioration set, since we would always want to include the level of a variable when including its interaction with treatment.17 If this is not done, and x_1 is correlated with many of the other covariates, this step could end up selecting a large number of variables.

17 We recommend partialling it out, so that the inclusion of this interacting variable is not penalized by Lasso.

In practice it is common to see two different types of interactions. The first occurs when multiple follow-up periods are used, and treatment is interacted with time period to allow the effect of treatment to vary over time. We should then only expect to see different variables chosen for the interaction terms if attrition varies across rounds. The second type of interaction is when authors interact treatment with baseline variables like gender, education, or region to examine treatment heterogeneity. Then it is possible that the sample is balanced on observables overall, but imbalanced once one looks within subgroups, or vice versa. We examine whether PDS Lasso selects different controls to predict T than to predict T x_1 in three papers, with a total of 8 regressions and 9 treatments. Results vary across the three papers: in one, PDS Lasso selects more controls to predict T x_1 than to predict T; in another, we find the opposite; and in the third, there is no difference between the sets of controls chosen, as PDS Lasso did not select any controls except in a few cases.

In the replications, one-third of the papers use interactions, resulting in 15 percent of the estimated coefficients in our replication sample being on an interaction. However, one common issue we encountered when examining replications of papers with interactions is that authors include the interacted variable in their pdslasso command as a treatment, instead of partialling it out or adding it to the amelioration set. This led to the selection of more controls. For instance, including the interacted variable as a treatment in one paper resulted in between 38 and 41 controls being selected for several outcomes; after partialling out the interacted variable instead, PDS Lasso selected between 8 and 11 controls. Appendix 6 shows an example of the incorrect syntax used, and the correct way to include interactions in this case.
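In the notation of that appendix, a hedged sketch with a hypothetical baseline dummy female as the interacting variable, and lag_y (also hypothetical) denoting the lagged dependent variable included in $controls:

    * interacting treatment with a baseline dummy (hypothetical names)
    gen femXtreat = treatment * female

    * incorrect: female entered as if it were another treatment
    * pdslasso y treatment femXtreat female ($controls), partial(lag_y)

    * correct: female is a control, included and partialled out so its
    * level always enters the regression unpenalized
    pdslasso y treatment femXtreat ($controls female), partial(lag_y female)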
6 A checklist for best empirical practice

Based on the above analysis, we offer the following guidelines for empirical researchers considering using PDS Lasso in their experimental analysis.

1. Be realistic as to how much power one is likely to gain. In the typical RCT we find very few variables get selected, and standard errors are less than one percent smaller than simply using Ancova. Power calculations should therefore not anticipate large improvements in power from using PDS Lasso.

2. Include the lagged dependent variable and any randomization strata in the amelioration set. This will help prevent underselection, in which the lagged variable does not get selected.

3. Be judicious in the choice of the number of control variables inputted into PDS Lasso, and do not use a kitchen-sink approach. Including very many variables can make it less likely that the more important ones get chosen, and may worsen mean squared error. For example, chance imbalances or attrition related to variables that are not correlated at all with the outcome need not be controlled for.

4. Be very careful about missing values, and ensure that all control variables being inputted have no missing values (by dummying out if necessary). Eight of the 18 papers we examined had a final sample using PDS Lasso that was smaller than the sample used for OLS because of this issue.

5. When including treatment interactions, ensure the interacting variable is included in the amelioration set, and not accidentally modelled as if it were another treatment. Multiple papers made coding syntax errors of this sort.

6. When using the pdslasso command, be aware of the lack of a degrees of freedom adjustment. Otherwise one can end up with smaller standard errors after using this command compared with using OLS even when no additional controls are selected. More generally, the implementation details can differ across software packages, and so it is useful to pre-specify exactly how the method will be implemented.
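Pulling items 2 through 5 together, one hedged template of what such a specification might look like is sketched below; all variable names are hypothetical, the controls in $controls are assumed to have had missing values dummied out, and the cluster() choice depends on the randomization design:

    * strata fixed effects and the lagged outcome always included with
    * zero penalty (partialled out); SEs clustered at the assignment unit
    tabulate strata, generate(strata_)
    pdslasso y treatment ($controls lag_y strata_*), ///
        partial(lag_y strata_*) cluster(village)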
After following these steps, we see PDS Lasso as a useful robustness check for many field experiments, providing a less ad hoc way of selecting additional control variables on top of any lagged variable and randomization strata. Most of the time it should make very little difference to the estimated coefficients and standard errors. When it does make a sizeable change in the coefficients, this can provide a useful warning for researchers that they can no longer rely simply on random assignment to justify their results.

References

Ahrens, Achim, Christian Hansen, and Mark Schaffer, "pdslasso and ivlasso: Programs for post-selection and post-regularization OLS or IV estimation and inference," http://ideas.repec.org/c/boc/bocode/s458459.html, 2018. Updated 2020.

Bai, Yuehao, "Optimality of Matched-Pair Designs in Randomized Controlled Trials," American Economic Review, 2022, 112 (12), 3911-3940.

Belloni, Alexandre and Victor Chernozhukov, "High Dimensional Sparse Econometric Models: An Introduction," 2011. arXiv:1106.5242v2.

Belloni, Alexandre, Daniel Chen, Victor Chernozhukov, and Christian Hansen, "Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain," Econometrica, 2012, 80, 2369-2429.

Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen, "High-Dimensional Methods and Inference on Structural and Treatment Effects," Journal of Economic Perspectives, 2014a, 28 (2), 29-50.

Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen, "Inference on Treatment Effects after Selection among High-Dimensional Controls," Review of Economic Studies, 2014b, 81, 608-650.

Belloni, Alexandre, Victor Chernozhukov, Christian Hansen, and Damian Kozbur, "Inference in High-Dimensional Panel Models With an Application to Gun Control," Journal of Business and Economic Statistics, 2016, 34 (4), 590-605.

Bloniarz, Adam, Hanzhong Liu, Cun-Hui Zhang, Jasjeet Sekhon, and Bin Yu, "Lasso adjustments of treatment effect estimates in randomized experiments," PNAS, 2016, 113 (27), 7383-7390.

Bruhn, Miriam and David McKenzie, "In pursuit of balance: Randomization in practice in development field experiments," American Economic Journal: Applied Economics, 2009, 1 (4), 200-232.

Chernozhukov, Victor, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis, "Applied Causal Inference Powered by ML and AI," 2024.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins, "Double/Debiased Machine Learning for Treatment and Structural Parameters," Econometrics Journal, 2018, 21, C1-C68.

Garbiras-Díaz, Natalia and Mateo Montenegro, "All Eyes on Them: A Field Experiment on Citizen Oversight and Electoral Integrity," American Economic Review, 2022, 112 (8), 2631-2668.

Ghanem, Dalia, Sarojini Hirshleifer, and Karen Ortiz-Becerra, "Testing Attrition Bias in Field Experiments," Journal of Human Resources, forthcoming, 2023.

Guo, Yongyi, Dominic Coey, Mikael Konutgan, Wenting Li, Chris Schoener, and Matt Goldman, "Machine Learning for Variance Reduction in Online Experiments," in "35th Conference on Neural Information Processing Systems," 2021.

Kolesár, Michal, Ulrich Müller, and Sebastian Roelsgaard, "The Fragility of Sparsity," 2024. arXiv:2311.02299v2.

List, John, Ian Muir, and Gregory Sun, "Using Machine Learning for Efficient Flexible Regression Adjustment in Economic Experiments," 2022. Working Paper.

McKenzie, David, "Beyond Baseline and Follow-up: The Case for More T in Experiments," Journal of Development Economics, 2012, 99 (2), 210-221.

Permutt, Thomas, "Testing for Imbalance of Covariates in Controlled Experiments," Statistics in Medicine, 1990, 9 (12), 1455-1462.

Simmons, Joseph, Leif Nelson, and Uri Simonsohn, "False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant," Psychological Science, 2011, 22, 1359-1366.

Urminsky, Oleg, Christian Hansen, and Victor Chernozhukov, "Using Double-Lasso Regression for Principled Variable Selection," SSRN, 2016.

Wager, Stefan, Wenfei Du, Jonathan Taylor, and Robert Tibshirani, "High-dimensional regression adjustments in randomized experiments," PNAS, 2016, 113 (45), 12673-12678.

Wu, Edward and Johann Gagnon-Bartsch, "The LOOP estimator: Adjusting for covariates in randomized experiments," Evaluation Review, 2018, 42 (4), 458-488.

Wüthrich, Kaspar and Ying Zhu, "Omitted Variable Bias of Lasso-Based Inference Methods: A Finite Sample Analysis," Review of Economics and Statistics, 2023, 105 (4), 982-997.

Zou, Hui, "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, 2006, 101 (476), 1418-1429.
7 Tables

Table 1: Descriptive statistics

                                      Mean     SD    Min    P25  Median    P75     P95     Max    N
Study level
  Sample Size                        3,374  3,800    418  1,038   1,913  4,079  13,987  13,987   18
  Clusters                             296    413     14     40     206    314   1,637   1,637   15
  Number of Outcomes                    43     46      3      9      37     54     189     189   18
  Multiple treatments                 0.56   0.51   0.00   0.00    1.00   1.00    1.00    1.00   18
Outcome × treatment level
  Inputted Controls                    405    814      6     39     182    331   3,340   3,340  780
  Inputted Controls to Sample Size    0.17   0.19   0.00   0.06    0.16   0.19    0.69    1.73  780
  Inputted Controls to Clusters       2.27   4.40   0.04   0.20    0.86   2.00   16.06   17.58  694
  Overall Attrition                   0.14   0.16   0.00   0.04    0.10   0.23    0.34    0.93  780
  Treatment Attrition                 0.14   0.17   0.00   0.03    0.09   0.24    0.39    0.94  780
  Control Attrition                   0.16   0.16   0.00   0.04    0.11   0.24    0.35    0.92  780
  Differential Attrition              0.05   0.08   0.00   0.00    0.02   0.07    0.17    0.56  780

Notes. Based on 780 outcome by treatment combinations from 18 studies.

Table 2: Empirical performance of PDS Lasso and comparison with ANCOVA

                                  Mean     SD    Min    P25  Median    P75    P95    Max    N
Number of variables selected
  Overall                         3.62   4.72      0      1       2      5     11     51  780
  Selected for y                  2.74   4.57      0      0       1      3      9     51  780
  Selected for T                  0.90   1.59      0      0       0      1      4     17  780
  Selected for T, but not y       0.88   1.59      0      0       0      1      4     17  780
At least one variable selected
  Overall                         0.83   0.37      0      1       1      1      1      1  780
  Selected for y                  0.71   0.45      0      0       1      1      1      1  780
  Selected for T                  0.43   0.50      0      0       0      1      1      1  780
  Selected for T, but not y       0.43   0.50      0      0       0      1      1      1  780
Performance (vs ANCOVA)
  Change in stand. coef.(i)       0.05   0.16   0.00   0.00    0.01   0.04   0.15   2.20  779
  SE ratio(ii)                   0.970  0.103  0.502  0.949   0.992  1.004  1.045  2.438  776
  SE ratio (no d.f. adj.)(iii)   0.950  0.104  0.501  0.932   0.964  0.989  1.021  2.426  776

Notes. Data is at an outcome × treatment level. The penalty parameter is calculated using the method proposed by Belloni and Chernozhukov (2011). The final panel compares the coefficient estimates and standard errors (SEs) with estimates using ANCOVA (equation 11). (i) Absolute value of the difference in coefficient estimates, divided by the control group standard deviation. (ii) SE using PDS divided by the SE using ANCOVA. (iii) No adjustment to degrees of freedom is made to the SEs when using the PDS method.

Table 3: Comparing cross-validation with ANCOVA and with the plug-in penalty

                                  Mean     SD    Min    P25  Median    P75    P95    Max    N
Number of variables selected
  Overall                        26.96  47.81      0      4      13     27    150    264  760
  Selected for y                 14.93  18.65      0      2       9     22     46    167  760
  Selected for T                 13.94  46.15      0      0       0      5    133    238  760
  Selected for T, but not y      12.03  43.58      0      0       0      2    128    229  760
At least one variable selected
  Overall                         0.90   0.30      0      1       1      1      1      1  780
  Selected for y                  0.84   0.37      0      1       1      1      1      1  780
  Selected for T                  0.43   0.50      0      0       0      1      1      1  780
  Selected for T, but not y       0.34   0.48      0      0       0      1      1      1  780
Performance (vs ANCOVA)
  Change in stand. coef.          0.12   0.56   0.00   0.01    0.03   0.06   0.39  10.27  759
  SE ratio(i)                    0.956  0.142  0.500  0.906   0.965  1.000  1.180  2.182  760
Performance (vs plug-in)
  Change in stand. coef.          0.11   0.52   0.00   0.01    0.02   0.04   0.32   8.07  759
  SE ratio(ii)                   0.988  0.112  0.528  0.952   0.984  1.003  1.189  2.022  760

Notes. Performance of the PDS Lasso procedure when the penalty parameter is calculated using cross-validation (CV). Panels A and B show the number of variables selected. Panels C and D compare performance with ANCOVA, and with the penalty parameter calculated using the method proposed by Belloni and Chernozhukov (2011) (referred to as the plug-in). Data is at an outcome × treatment level. (i) SE using PDS Lasso with CV divided by SE using Ancova. (ii) SE using PDS Lasso with CV, divided by SE using PDS Lasso with the plug-in penalty parameter.
8 Figures

Figure 1: Number of variables selected in the first and second steps
(a) Selected on dependent variable (Step I); (b) Selected on treatment (Step II)
Notes: Distributions are top-coded at 10. Panel (a) shows a histogram of the number of variables selected that are predictive of the dependent variable; data is at a study-by-outcome level (n = 271). Panel (b) shows a histogram of the number of variables selected that are predictive of treatment assignment; data is at a study-by-outcome-by-treatment level (n = 780).

Figure 2: Ratio of coefficients and standard errors: ANCOVA vs post-double selection method
Notes. The ratio is equal to the estimated coefficient (or standard error) using the post-double selection method, divided by the estimated coefficient (or standard error) using ANCOVA. Each color represents a different paper. Distributions of the two variables are shown in the histograms above and to the right of the graph. 13 observations with coefficient ratios smaller than -10 or larger than 10 are omitted.

Figure 3: The probability of selecting a variable predictive of treatment, and the change in effect size, vary with the attrition rate
(a) Select controls, by attrition; (b) Select controls, by differential attrition; (c) Change in standardized \beta, by attrition; (d) Change in standardized \beta, by differential attrition
Notes. In panels (a) and (b), the y-axis is the probability that a variable is selected in Step II. In panels (c) and (d), the y-axis is the absolute value of the change in the standardized effect size, comparing PDS Lasso with ANCOVA. In panels (a) and (c), data is split by the overall attrition rate. In panels (b) and (d), data is split by the absolute value of the difference in attrition rates between the treatment and control groups. The length of each bar designates the mean in each sub-group.

Figure 4: PDS Lasso is sometimes less precise than Ancova due to failure to select the lagged dependent variable
(a) Probability of selecting x_{i,1}; (b) No. of other variables selected; (c) Standard error, relative to ANCOVA; (d) Standard error (partialling out x_{i,1}), relative to ANCOVA
Notes: Data is generated using equation 10, where x_{i,j} ~ iid N(0_p, I_p) and \epsilon_i ~ iid N(0, 1). p = 20. 1,000 simulation replications. Results are shown for different values of n and \beta_1. In each panel the x-axis is the natural log of n. Panel (a) shows the probability of selecting the lagged dependent variable, x_{i,1}; different lines indicate different values of \beta_1 \in (0.2, 0.4). Panel (b) shows the number of other variables selected, where \beta_j = 0.05 for 2 \le j \le 5, and \beta_j = 0 for j > 5. Panel (c) shows the percentage change in the standard error, relative to ANCOVA: y_i = \gamma T_i + \beta_0 + \beta_1 x_{i,1} + \epsilon_i. Panel (d) shows the percentage change in the standard error, relative to ANCOVA, when the lagged dependent variable is partialled out in the PDS Lasso procedure.

Figure 5: Cross-validation selects more variables than the plug-in penalty parameter and gives more volatile standard errors
(a) Variables selected (predict y_i); (b) Additional variables selected (predict T_i); (c) Change in standard errors
Notes: Number of variables selected using the post-double selection (PDS) procedure, comparing two approaches to calculating the penalty parameter: the method proposed by Belloni et al. (2012), and 10-fold cross-validation. See Section 3.2 for the data generating process. p = 20, n \in (100; 500; 1,000; 10,000). \beta_1 = 0.3, \beta_j = 0.05 for j \in (2, 5).
The remaining variables are only correlated with y_i by chance. 1,000 simulation replications. Panel (a) shows box plots for the distributions of the number of variables selected in Step 1; panel (b) shows box plots for the number of additional variables selected in Step 2; panel (c) shows the SE ratio: the SE using PDS Lasso, divided by the SE estimated using Ancova (i.e., only controlling for x_{i,1}).

Figure 6: Inputting many controls can result in fewer controls being selected by PDS Lasso
(a) Mean; (b) Median
Notes: The mean and median number of variables selected, depending on the number of predictors. Data is generated using equations (5) and (7), where x_{i,j} ~ iid N(0_p, I_p) and \epsilon_i ~ iid N(0, 1). \beta_1 = 0.3; \beta_j = 0.05 for 2 \le j \le 4; and \beta_j = 0 for 5 \le j \le 40. p \in (5, 40, 1600). In the first case (red line), variables x_{i,1} to x_{i,5} are inputted; in the second case (blue line), variables x_{i,1} to x_{i,40} are inputted; in the final case, all 40 variables are interacted with each other and 1,600 are inputted. Variables are selected using the post-double selection method proposed by Belloni et al. (2014b).

Figure 7: When does double-selection improve upon single-selection Lasso?
(a) Imbalanced variable selected in Step 2, but not Step 1; (b) Change in bias; (c) Change in standard errors; (d) Change in mean square error
Notes: Data is generated using equation 10, where x_{i,j \ne 1} ~ iid N(0_p, I_p) and \epsilon_i ~ iid N(0, 1). We introduce imbalance by setting x_{i,1} = \alpha_1 T_i + e_i, where e_i ~ iid N(0, 1). Different lines in each panel show results for different levels of imbalance: \alpha_1 \in (0.1, 0.2, 0.3). \beta_1 (on the x-axis) ranges from 0.01 to 0.2. (n, p) = (1,000, 20). All panels show results of Lasso regressions using the penalty parameter of Belloni et al. (2012). Double selection = the post-double selection method of Belloni et al. (2014b); single selection only includes as controls x_{i,j} \in I_1. Panel (a) indicates the probability that x_{i,1} is selected in Step 2 but not in Step 1. Panels (b) to (d) show the difference between double and single selection in terms of bias, standard errors, and mean square error, respectively. Bias = (1/N) \sum_{s=1}^{N} (\hat{\beta}_1^{(s)} - \beta_1) and MSE = (1/N) \sum_{s=1}^{N} (\hat{\beta}_1^{(s)} - \beta_1)^2, where s indicates each random draw from the simulations. N = 1,000.

A Appendix Tables and Figures

Figure A1: Variables selected with a clustered randomized control trial
(a) Varies at an individual level; (b) Varies at a cluster level
Notes: The DGP is:

    y_{i,c} = \gamma T_c + \sum_{j=1}^{10} \beta_j^I x_{i,c}^j + \sum_{j=1}^{10} \beta_j^C x_c^j + \mu_c + \epsilon_{i,c},    (14)

We set 20 individuals per cluster and vary the number of clusters and the intra-cluster correlation, \rho \in (0.2, 0.4, 0.6), where \rho = \sigma_\mu^2 / (\sigma_\mu^2 + \sigma_\epsilon^2), and \sigma_\mu^2 and \sigma_\epsilon^2 are the variances of \mu_c and \epsilon_{i,c}, respectively. p = 20, \beta_1^I = \beta_1^C = 0.3; the other variables are correlated with y_{i,c} only by chance. Panel (a): probability that x_{i,c}^1 gets selected; Panel (b): probability that x_c^1 gets selected.
Table A1: Comparing post-double selection (PDS) method with ANCOVA: Weighted and Unweighted

                                  Unweighted       Weighted(i)
                                  Mean     SD      Mean     SD
Number of variables selected
  Overall                         3.62   4.72      3.97   4.71
  Selected for y                  2.74   4.57      3.26   4.68
  Selected for T                  0.90   1.59      0.73   1.36
  Selected for T, but not y       0.88   1.59      0.71   1.35
At least one variable selected
  Overall                         0.83   0.37      0.84   0.37
  Selected for y                  0.71   0.45      0.74   0.44
  Selected for T                  0.43   0.50      0.38   0.48
  Selected for T, but not y       0.43   0.50      0.37   0.48
Performance (vs ANCOVA)
  Change in stand. coef.(ii)      0.05   0.16      0.05   0.15
  SE ratio(iii)                  0.970  0.103     0.975  0.153
  SE ratio (no d.f. adj.)(iv)    0.950  0.104     0.952  0.160

Notes. Data is at an outcome × treatment level. Unweighted statistics give all 780 outcome by treatment estimates equal weight. Weighted estimates reweight at the paper level. The penalty parameter is calculated using the method proposed by Belloni and Chernozhukov (2011). The final panel compares the coefficient estimates and standard errors (SEs) with estimates using ANCOVA (equation 11). (i) The weight of each paper is calculated by dividing 1 by the number of outcome by treatment estimates per paper. (ii) Absolute value of the difference in coefficient estimates, divided by the control group standard deviation. (iii) SE using PDS divided by the SE using ANCOVA. (iv) No adjustment to degrees of freedom is made to the SEs when using the PDS method.

Table A2: Change of Significance when Using Different Approaches

                            Ancova vs Pdslasso      Ancova vs Cvlasso      Pdslasso vs Cvlasso
                            insig to  sig to        insig to  sig to       insig to  sig to
                            sig       insig         sig       insig        sig       insig
Panel A. Change of significance at the 5 percent level
  Number of Observations    49        15            59        31           28        30
  Percent                   6.28      1.92          7.76      4.08         3.68      3.95
Panel B. Change of significance at the 10 percent level
  Number of Observations    42        16            52        31           28        28
  Percent                   5.38      2.05          6.84      4.08         3.68      3.68

Notes. In the first column, we show the number and percent of observations that are insignificant when using Ancova but significant when using PDS Lasso. The second column indicates the number and percent of observations that are significant when using Ancova but insignificant when using PDS Lasso. The same logic applies to the subsequent columns.

Table A3: Total Selected Controls using PDS Lasso When Using Partial vs Amelioration Set

                                      Mean   SD   Min   P5   P25   Median   P75   P95   Max    N
Overall Number of Variables Selected
  Using Partial                         19   14     1    2     9       17    24    45    68  604
  Using Aset                            20   15     1    2     8       18    26    48    78  604
Difference
  Overall Difference                    -1    4   -46   -7    -2        0     0     3     9  604
  Conditional on not being zero         -2    4   -46   -8    -4       -1     1     5     9  379

Notes. Some observations were dropped due to the difference in using the partial or aset options, to ensure comparability.

Table A4: Standard errors are almost identical partialling out variables compared to including them in the amelioration set

No. observations                  200              600
No. of strata                  10      100      10      100
Panel A. Strata not correlated with y
  Amelioration set           0.135   0.135    0.077   0.077
  Partial                    0.136   0.136    0.077   0.078
Panel B. Strata correlated with y
  Amelioration set           0.135   0.135    0.078   0.078
  Partial                    0.135   0.135    0.078   0.078

Notes. Data is generated using equation 10, where x_{i,j} ~ iid N(0_p, I_p) and \epsilon_i ~ iid N(0, 1). p = 20. \beta_j = 0.05 for 2 \le j \le 5, \beta_j = 0 for j > 5. n \in (200, 600). In Panel A, strata fixed effects explain no variation in y_i. In Panel B, strata fixed effects are quantiles of x_{i,1}, so explain some variation in y_i.
Table A5: List of Replicated Papers

Columns: Authors; Title; Journal; Year; Clustered trial; No. of Outcomes; No. of Treatments; No. controls inputted; No. of Estimates.

1. Wheeler et al., "LinkedIn (to) Job Opportunities: Experimental Evidence from Job Readiness Training," AEJ: Applied, 2022. Clustered: Yes; Outcomes: 3; Treatments: 1; Controls inputted: 16; Estimates: 3.
2. Lopez et al., "Does Patient Demand Contribute to the Overuse of Prescription Drugs?," AEJ: Applied, 2020. Clustered: Yes; Outcomes: 11; Treatments: 2; Controls inputted: 331; Estimates: 42.
3. Abel et al., "The Value of Reference Letters: Experimental Evidence from South Africa," AEJ: Applied, 2020. Clustered: Yes; Outcomes: 3; Treatments: 1; Controls inputted: 28; Estimates: 9.
4. Bertrand & Crépon, "Teaching Labor Laws: Evidence from a Randomized Control Trial in South Africa," AEJ: Applied, 2021. Clustered: No; Outcomes: 12; Treatments: 1; Controls inputted: 115; Estimates: 12.
5. Hussam et al. (a), "The Psychosocial Value of Employment: Evidence from a Refugee Camp," AER, 2022. Clustered: Yes; Outcomes: 14; Treatments: 2; Controls inputted: 6; Estimates: 35.
6. Hussam et al. (b), "Targeting High Ability Entrepreneurs Using Community Information: Mechanism Design in the Field," AER, 2022. Clustered: Yes; Outcomes: 4; Treatments: 1; Controls inputted: 91; Estimates: 8.
7. Armand et al., "Does Information Break the Political Resource Curse? Experimental Evidence from Mozambique," AER, 2020. Clustered: Yes; Outcomes: 19; Treatments: 2; Controls inputted: 171; Estimates: 38.
8. Andrabi et al., "Upping the Ante: The Equilibrium Effects of Unconditional Grants to Private Schools," AER, 2020. Clustered: Yes; Outcomes: 13; Treatments: 1; Controls inputted: 284; Estimates: 48.
9. Alsan et al., "Does Diversity Matter for Health? Experimental Evidence from Oakland," AER, 2019. Clustered: Yes; Outcomes: 21; Treatments: 3; Controls inputted: 12; Estimates: 63.
10. Carneiro et al., "The Impacts of a Multifaceted Prenatal Intervention on Human Capital Accumulation in Early Life," AER, 2021. Clustered: Yes; Outcomes: 27; Treatments: 1; Controls inputted: 3,340; Estimates: 54.
11. Garbiras-Díaz & Montenegro, "All Eyes on Them: A Field Experiment on Citizen Oversight and Electoral Integrity," AER, 2022. Clustered: Yes; Outcomes: 9; Treatments: 3; Controls inputted: 39; Estimates: 96.
12. Dhar et al., "Reshaping Adolescents' Gender Attitudes: Evidence from a School-Based Experiment in India," AER, 2022. Clustered: Yes; Outcomes: 3; Treatments: 1; Controls inputted: 61; Estimates: 10.
13. Augsburg et al., "When nature calls back: Sustaining behavioral change in rural Pakistan," JDE, 2022. Clustered: Yes; Outcomes: 2; Treatments: 1; Controls inputted: 183; Estimates: 6.
14. Heß et al., "Environmental effects of development programs: Experimental evidence from West African dryland forests," JDE, 2021. Clustered: Yes; Outcomes: 3; Treatments: 1; Controls inputted: 648; Estimates: 6.
15. Islam et al., "The Effects of Chess Instruction on Academic and Non-cognitive Outcomes: Field Experimental Evidence from a Developing Country," JDE, 2021. Clustered: Yes; Outcomes: 30; Treatments: 1; Controls inputted: 14; Estimates: 30.
16. Magnan et al., "Information, technology, and market rewards: Incentivizing aflatoxin control in Ghana," JDE, 2021. Clustered: Yes; Outcomes: 24; Treatments: 3; Controls inputted: 80; Estimates: 84.
17. McIntosh & Zeitlin, "Using household grants to benchmark the cost effectiveness of a USAID workforce readiness program," JDE, 2022. Clustered: Yes; Outcomes: 18; Treatments: 2; Controls inputted: 332; Estimates: 189.
18. Fernando, "Seeking the treated: The impact of mobile extension on farmer information exchange in India," JDE, 2021. Clustered: No; Outcomes: 21; Treatments: 2; Controls inputted: 409; Estimates: 47.

Appendix 6: An example of pdslasso syntax to use when including interactions

Here treatment is the dummy variable for treatment, x is the interacting variable (e.g., female), and interaction is the interaction of treatment and x.

Commonly used incorrect syntax:

    pdslasso y treatment interaction x ($controls), partial(fixedeffects laggedvariable)

Correct syntax:

    pdslasso y treatment interaction ($controls x), partial(fixedeffects laggedvariable x)