Policy Research Working Paper 11059 Design of Partial Population Experiments with an Application to Spillovers in Tax Compliance Guillermo Cruces Dario Tortarolo Gonzalo Vazquez-Bare Development Economics A verified reproducibility package for this paper is Development Research Group available at http://reproducibility.worldbank.org, February 2025 click here for direct access. Policy Research Working Paper 11059 Abstract This paper develops a framework to analyze partial popula- effects, and optimal cluster assignment probabilities. All tion experiments, a generalization of the cluster experimental the results apply to cluster experiments, a particular case design where clusters are assigned to different treatment of the framework. The paper sets up a potential outcomes intensities. The framework allows for heterogeneity in clus- framework to interpret the OLS estimands as causal effects. ter sizes and outcome distributions. The paper studies the It implements the methods in a large-scale experiment to large-sample behavior of OLS estimators and cluster-ro- estimate the direct and spillover effects of a communication bust variance estimators and shows that (i) ignoring cluster campaign on property tax compliance. The analysis reveals heterogeneity may result in severely underpowered exper- an increase in tax compliance among individuals directly iments and (ii) the cluster-robust variance estimator may targeted with the mailing, as well as compliance spillovers be upward-biased when clusters are heterogeneous. The on untreated individuals in clusters with a high proportion paper derives formulas for power, minimum detectable of treated taxpayers. This paper is a product of the Development Research Group, Development Economic. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at dtortarolo@worldbank.org. A verified reproducibility package for this paper is available at http:// reproducibility.worldbank.org, click here for direct access. RESEA CY LI R CH PO TRANSPARENT ANALYSIS S W R R E O KI P NG PA The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Produced by the Research Support Team Design of Partial Population Experiments with an Application to Spillovers in Tax Compliance* Guillermo Cruces, U. of Nottingham & CONICET-CEDLAS-UNLP Dario Tortarolo, World Bank DECRG Gonzalo Vazquez-Bare, UC Santa Barbara JEL CODES: C01, C93, H71 , H26 , H21 , O23. KEYWORDS: partial population experiments, spillovers, randomized controlled trials, cluster ex- periments, two-stage designs, property tax, tax compliance. * We thank Yuehao Bai, Youssef Benzarti, Augustin Bergeron, Javier Birchenall, Matias Cattaneo, Max Farrell, Kelsey Jack, Heather Royer, Doug Steigerwald and Alisa Tazhitdinova for valuable discussions and suggestions, and seminar partici- pants at the 2021 National Tax Association conference, IFS, CEDLAS-UNLP, and the 2022 Advances with Field Experiments conference. We thank Julian Amendolaggine and Juan Luis Schiavoni for their invaluable support throughout the project. We thank Bruno Cr´ epon and Roland Rathelot for their help in obtaining their data. The views expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Develop- ment/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Corresponding author: Dario Tortarolo, E-mail: dtortarolo@worldbank.org. This project was reviewed and approved in advance by the Institutional Review Board at the University of Nottingham. The design for this experiment was preregistered in the AEA RCT Registry (RCT ID: AEARCTR-0006569). All remaining errors are our own. 1 Introduction Randomized controlled trials (RCTs) are extensively used in economics. A large fraction of these experi- ments are based on the assumption that the treatment assignment of one unit or subject does not influence the outcomes of others. The assumption of no interference, however, may be violated in many settings. In such cases, identifying and measuring spillovers between units is crucial for understanding the nature and magnitude of interactions between subjects, as well as for accurately assessing the direct impact of the treatment. While the early experimental literature considered the impact on untreated units in an ex-post manner (e.g. Miguel and Kremer, 2004), field experiments incorporating spillover effects into their design have gained traction in applied research. In settings where units are grouped into independent clusters, such as schools, villages, or firms, a common design is the partial population design. Partial population designs are a generalization of the clustered design wherein clusters assigned to different treatment intensities or saturations are compared to pure control clusters with no treated units (Moffit, 2001; Duflo and Saez, 2003; Hudgens and Halloran, 2008; Hirano and Hahn, 2010; Baird et al., 2018). The variation in treatment intensity allows researchers to disentangle the direct and indirect effects of a treatment. In this paper, we provide a framework to analyze this type of experiment when clusters are heterogeneous. We consider two dimensions of cluster heterogeneity that have important practical implications: het- erogeneity in cluster sizes and heterogeneity in outcome distributions across clusters (distributional het- erogeneity)1 . When analyzing an experiment with heterogeneous clusters, correctly accounting for this heterogeneity is crucial for several reasons. On the one hand, variance formulas have to be adjusted ac- cordingly, and failing to do so may result in severely underpowered experiments. On the other hand, cluster heterogeneity can affect the accuracy of the large sample normal approximation, and inference based on this approximation can be misleading when clusters are very heterogeneous (Carter, Schnepel and Steigerwald, 2017; Djogbenou, MacKinnon and Ørregaard Nielsen, 2019; Hansen and Lee, 2019; Sasaki and Wang, 2022; Chiang, Sasaki and Wang, 2023). With these challenges in mind, our paper provides five contributions. First, in Theorem 1, we derive an asymptotic distributional approximation for OLS regression estimators in a setting with between-cluster heterogeneity. We consider a double-array asymptotic setting where cluster sizes are allowed, but not re- quired, to grow with the sample size. We provide conditions under which OLS estimators are consistent for cluster-size-weighted averages of within-cluster differences in means, and are asymptotically normal. We also show that, in the presence of distributional heterogeneity, the usual cluster-robust variance estima- tor is generally upward-biased, and hence inference based on this estimator is conservative (Proposition 1). While similar results have been obtained in design-based settings with non-random potential outcomes (see e.g. Hudgens and Halloran, 2008; Basse and Feller, 2018; Abadie et al., 2022; Jiang, Imai and Malani, 1 We note that our framework allows for general forms of between-cluster heterogeneity, but assumes that outcomes are iden- tically distributed within each cluster. The generalization of our results to the case where outcome distributions are heteroge- neous within a cluster is left for future research. 2 2023), to our knowledge we are the first to show this result in a superpopulation setting under distributional heterogeneity. Our second contribution is to derive explicit, closed-form formulas to conduct power and minimum detectable effect (MDE) calculations under the two aforementioned sources of cluster heterogeneity. We then consider an intermediate setting where clusters differ in size but not in their outcome distributions, which simplifies power and minimum detectable effects calculations and can be applied more easily when baseline outcome data is not available. We show how our formulas generalize those available in the existing methodological literature on experimental design (Duflo, Glennerster and Kremer, 2007; Hirano and Hahn, 2010; Baird et al., 2018) by allowing for multiple treatment intensities, cluster heterogeneity, heteroskedasticity and general forms of intracluster correlation in outcomes and treatments. Our third contribution is to derive optimal assignment probabilities determining the proportion of clus- ters to be assigned to each treatment saturation (Theorem 2). We provide a tractable, closed-form solution to the optimal choice problem of minimizing a weighted average of estimators’ variances. We also discuss how alternative optimality criteria may be used in combination with our variance formulas using numerical methods. Our fourth contribution is to set up a potential outcomes framework with within-cluster spillovers, heterogeneous treatment effects, and heterogeneous clusters. We use this framework to provide sufficient conditions for OLS estimands to recover causal direct and spillover effects. Fifth, based on our framework, we designed and conducted a large-scale field experiment to estimate direct and spillover effects of a randomized communication campaign on property tax compliance in Ar- gentina. Our experiment sent personalized letters to randomly selected dwellings with reminders about taxes due, information about the status of the account, due dates, past due debt, and payment methods. While there is ample evidence on the effect of tax reminders on compliance and collection (Antinyan and Asatryan, 2024), our goal was to find evidence on relatively elusive spillover effects from information campaigns on tax collection. We designed the experiment based on our methodological results to capture spillover effects of our mailings on neighbors who live in the same street blocks of treated individuals but who did not receive a letter. Our results reveal higher payment rates for treated individuals, but also for their untreated neighbors in the same street block, compared to accounts in pure control blocks where no one received the letter. Spillover effects are lower in magnitude but still substantial and precisely estimated in high-saturation street blocks, especially when accounting for expected (pre-registered) heterogeneity in past compliance: payment rates of untreated accounts in high saturation blocks with above median past compliance increased by 2.6 percentage points, compared to direct effects of about 5.1 percentage points. Comparison with current literature. Our paper contributes to a growing literature on experimental design (Duflo, Glennerster and Kremer, 2007; Bruhn and McKenzie, 2009; Bugni, Canay and Shaikh, 2018, 2019; Bai, 2022) and in particular to the literature on design and analysis of experiments under 3 spillovers or interference (Hirano and Hahn, 2010; Athey, Eckles and Imbens, 2018; Baird et al., 2018; Basse, Feller and Toulis, 2019; Jiang, Imai and Malani, 2023; Puelz et al., 2022; Viviano, 2024; Leung, 2022; Liu, 2023). More specifically, our results generalize those of Hirano and Hahn (2010), Hudgens and Halloran (2008) and Baird et al. (2018) by allowing for cluster heterogeneity, heteroskedasticity, general treatment assignment mechanisms and within-group correlation structures and alternative criteria for optimal treatment assignment. In related work, Athey, Eckles and Imbens (2018), Basse, Feller and Toulis (2019) and Puelz et al. (2022) derive randomization inference tests for a general class of null hypotheses under interference. A closely related study is Jiang, Imai and Malani (2023), who analyze two-stage completely randomized experiments and provide randomization-based variance estimators and sample size formulas. Our re- sults complement this literature by considering different estimands, different assignment mechanisms and by conducting super-population-based large-sample (instead of design-based) inference in a double ar- ray asymptotic framework. Our approach allows us to determine the role of cluster heterogeneity in the asymptotic behavior of the treatment effect estimators. Our paper is also related to the literature on inference in clustered experiments, which are a particular case of partial population experiments with only two saturations and no within-cluster treatment variation. Bugni et al. (2023) study inference in clustered experiments with non-ignorable cluster sizes and derive variance estimators and valid inference procedures in a setup with random cluster sizes. We further discuss the relationship between our results and that paper in Section 3.5. We also contribute to a large empirical literature on property taxes and a small but growing empirical literature on spillover effects in tax compliance. On property taxes, recent contributions include Brock- meyer et al. (2020) study of Mexico City, Bergeron, Tourek and Weigel (2024) and Weigel (2020) for the Democratic Republic of Congo, and Krause (2020) for Haiti, among others. The latter two are ran- domized controlled trials, and in both cases, the authors address the presence of spillovers, but in ex-post analysis rather than in the experimental designs. The effect of social interactions in tax compliance inter- ventions has remained a relatively elusive issue in the broader experimental compliance literature. Some notable exceptions are Pomeranz (2015), who detects enforcement spillovers up the VAT chain in Chilean firms, Drago, Mengel and Traxler (2020) who study enforcement spillovers of TV licensing inspections on untreated households in Austria, and Boning et al. (2020) who analyze direct and network effects from in-person visits by revenue officers on visited and non-visited firms in the United States (see the review in Pomeranz and Vila-Belda, 2019, for more studies covering spillover effects). In Argentina, a recent study by Carrillo, Castro and Scartascini (2021) finds neighborhood spillover effects from a program that randomly awarded 400 taxpayers with the repair of a sidewalk. Whereas these papers find spillover effects in tax compliance, their original experiments were not designed to capture these effects. We build on these pioneering works with an intervention designed with the purpose of capturing spillovers. The paper is organized as follows. Section 2 illustrates the practical importance of cluster heterogeneity when conducting power calculations. In Section 3, we set up our framework and derive the main results. In 4 Section 4, we implement our methods in a large-scale randomized communication campaign, we describe the administrative data used in the analysis, the empirical strategy, and evidence of direct and spillover effects. Section 5 provides some practical recommendations for designing and analyzing partial population experiments. Section 6 concludes. 2 Why is Cluster Heterogeneity Important? We consider a population where units are grouped into mutually exclusive and independent clusters. Com- mon examples of this type of clustering are students in schools (Miguel and Kremer, 2004; Beuermann et al., 2015), family members in households (Barrera-Osorio et al., 2011; Foos and de Rooij, 2017), job seekers in local labor markets (Cr´epon et al., 2013), employees in firms or organizations (Duflo and Saez, 2003), or households in neighborhoods, villages or other geographic administrative units (Angelucci and De Giorgi, 2009; Ichino and Sch¨ undeln, 2012; Haushofer and Shapiro, 2016; Gin´ e and Mansuri, 2018). In our application, a local property tax reminder information campaign, the population of interest consists of taxpayers in residential blocks. Within this population, we study an experimental design where treatment assignments can vary both between and within clusters. Figure 1 shows the distribution of cluster sizes in six partial population experiments, including our analysis sample and five published papers (Cr´ epon et al., 2013; Gin´ e and Mansuri, 2018; Haushofer and Shapiro, 2016; Ichino and Sch¨ undeln, 2012; Imai, Jiang and Malani, 2021). The figure reveals substantial variation in cluster sizes. When cluster sizes are heterogeneous, it is likely that the distribution of outcomes will vary across clusters as well. For instance, one may expect the mean and the variance of the outcome to be different in large clusters compared to small clusters. We refer to the variation in outcome distributions across clusters as distributional heterogeneity. ˆ, such as a difference Intuitively, with heterogeneous clusters, the variance of an estimator of interest β in means between units in treated and untreated clusters (we define the estimators of interest precisely in the next section), can be decomposed into four parts: ˆ] ≈ variance under uncorrelated observations V[β (1) + clustering with equally-sized clusters (2) + cluster size heterogeneity (3) + cluster distributional heterogeneity (4) The first term is the variance that would be obtained if observations were uncorrelated within clusters. The second term is an adjustment factor that accounts for the within-cluster correlation, often known as the “design effect” or the “Moulton factor” (after Moulton, 1986) that depends on the average cluster size. The term in the third line represents the additional variation due to the heterogeneity in cluster sizes, 5 which intuitively accounts for the variance of cluster sizes (Moulton, 1986, also derives this adjustment for a random effects model). Finally, the last component accounts for the between-cluster heterogeneity in outcome distributions. While the need to account for within-cluster correlations (lines (1) and (2)) is well-understood for designing and analyzing clustered experiments, the adjustment terms that account for cluster heterogeneity are typically assumed away by the literature on experimental design (e.g. Bloom, 2005; Duflo, Glennerster and Kremer, 2007; Hirano and Hahn, 2010; Baird et al., 2018). To numerically illustrate the importance of appropriately accounting for cluster heterogeneity in this design, we consider the simple setting of a cluster RCT (which is a particular case of a partial population experiment) where “a few” clusters are “large”. Specifically, we consider a sample of 200 clusters, indexed by g = 1, . . . , 200, each having size ng . The first 10 clusters contain 100 units, ng = 100, and the remaining 190 clusters contain 25 units each, ng = 25 (these values are chosen to match the median values in the literature in Figure 1). We assume the treatment has no effect, and the outcome of unit iid i = 1, . . . , ng in cluster g is given by a random effects model: Yig = αg + νg + ωig , νg ∼ N (0, 1/2), iid ωig ∼ N (0, 1/2) with νg independent of ωig and where αg is a (non-random) intercept with αg = 0 if ng = 25 and αg = 1 if ng = 100. This model implies that the average outcome is E[Yig ] = 1 in large clusters and E[Yig ] = 0 in small clusters. In addition, V[Yig ] = 1 and the within-cluster correlation between outcomes is cor(Yig , Yjg ) = 0.5. Figure 2 plots three power functions for the difference in means between treated and untreated clusters that a researcher may consider when designing this experiment. The short-dashed curve represents the power function that is obtained when ignoring both sources of heterogeneity, that is, considering only the terms in lines (1) and (2) of the variance formula. Using this formula, the MDE at 80% power, given this sample size, is 0.29 standard deviations. However, when accounting for the variation in cluster sizes, the corresponding power function is represented by the long-dashed curve. According to this curve, the power to detect an effect of 0.29 is not 80% but 69%, so the experiment is underpowered. Furthermore, the true power function that accounts for both sources of heterogeneity (sizes and outcome distributions) is represented by the solid curve. This curve shows that the true power to detect an effect of 0.29 in this setting with heterogeneous clusters is 48%, significantly below the desired power of 80%. This numerical exercise shows how ignoring heterogeneity may result in severely underpowered experiments. We provide further examples of the importance of accounting for heterogeneity in Section 4. 3 Analysis of Partial Population Experiments 3.1 Setup We consider a sample of observations (units) that are divided into mutually independent clusters g = 1, . . . , G, where each cluster g contains ng observations i = 1, . . . , ng and the total sample size is n = 6 G ng . We view cluster sizes as non-random (see Bugni et al., 2023; Sasaki and Wang, 2022, for g =1 an alternative sampling approach where cluster sizes are random). In a partial population experiment, clusters are randomly divided into categories or saturations denoted by Tg ∈ {0, 1, 2, . . . , M }, where by convention Tg = 0 denotes a pure control cluster (i.e. a cluster where no unit is treated). Let P[Tg = t] = qt ∈ (0, 1) denote the probability that cluster g is assigned to saturation t. Within each cluster, a binary treatment Dig is assigned to units with probability P[Dig = 1|Tg = t] where P[Dig = 0|Tg = 0] = 1.2 We let Dg = (D1g , D2g , . . . , Dng g ) be the vector of unit-level treatment assignments in cluster g , D = (D1 , . . . , DG ) and T = (T1 , . . . , TG ) . Figure A.3 provides an example of a partial population design with four saturations. Notice that both standard RCTs with independent observations and cluster RCTs are particular cases of partial population experiments, as we further illustrate in Section 3.5. The observed outcome of interest for unit i in cluster g is denoted by Yig and we let Yg = (Y1g , . . . , Yng g ) be the vector of observed outcomes in cluster g . In partial population experiments, the estimands of in- terest are typically comparisons of average outcomes between treated or untreated units in treated clusters to pure control units, E[Yig |Dig = d, Tg = t] − E[Yig |Tg = 0], pooled across clusters. In the first part of the paper, we take these estimands as given since they are the most commonly analyzed estimands in the empirical literature. In Section 3.6, we set up a potential outcomes framework to rigorously justify the causal interpretation of these estimands. Let µg (d, t) = E[Yig |Dig = d, Tg = t] be the conditional expectation of the outcome in cluster g given assignment (d, t). We consider the following sample means estimators: G 1 g =1 (Tg = t) ng 1 i=1 Yig (Dig = d) g 1t d¯d g Ng Yg ˆ(d, t) = µ = (5) 1 1 g 1g Ng G ng t d g =1 (Tg = t) i=1 (Dig = d) where 1t g = 1(Tg = t), Ng = d i 1(Dig = d) and Yg = ¯d i Yig 1(Dig = d)/Ng , defined whenever d d Ng > 0. These estimators are commonly computed by running an OLS regression of the outcome on a full set of indicators (1(Dig = d, Tg = t))(d,t) , without an intercept. Thus, in what follows, we refer to these estimators as OLS estimators. Our parameter of interest is the vector of cluster-size-weighted average of cluster-specific differences in means: G ng βn (d, t) = (µg (d, t) − µg (0, 0)) . (6) g =1 n We note that our framework can easily accommodate other parameters with different weighting schemes, such as the simple average across clusters G g =1 (µg (d, t) − µg (0, 0)) /G. 2 In practice, some desired saturations may not coincide with the observed proportion of treated units for some cluster sizes. For instance, if P[Dig = 1|Tg = t]=0.5 but ng is odd, the observed proportion of treated cannot be exactly 0.5. Appendix A.4 proposes an assignment mechanism that ensures that the expected proportion of treated coincides with P[Dig = 1|Tg = t]. 7 3.2 Asymptotic Behavior of OLS Estimators We now study the asymptotic distribution of the OLS estimators defined in Equation (5) and functions thereof. We consider a double-array asymptotic setting where the cluster sizes are allowed, but not re- quired, to grow with the sample size. This type of approximation is more appropriate than the bounded cluster size approach when groups can be large and heterogeneous in size, but we note that the settings with bounded cluster sizes and/or equally-sized clusters are nested as particular cases of our analysis.3 We consider the following sampling scheme. Assumption 1 (Sampling) (i) (Yg , Dg , Tg )G g =1 are mutually independent across g . (ii) For each g and for all i = 1, . . . , ng , E[Yig |Dig = d, Tg = t] = µg (d, t) for all (d, t) and for all such that E[|Yig | |Dig = d, Tg = t] < ∞. (iii) For each g and for all i = 1, . . . , ng , P[Dig = d|Tg = t] = pg (d|t) and P[Dig = d, Djg = d |Tg = t] = pg (d, d |t) for all d, d and t. Part (i) states that clusters are mutually independent, a standard assumption in the clustering literature. Notice that we do not require clusters to be identically distributed, so outcome distributions can be het- erogeneous across clusters. Part (ii) states that average conditional outcomes are the same for all units in the same cluster. In what follows we define µg (d, t) = µg (d, t) for = 1 to reduce notation. Part (iii) states that the unit-level treatment probabilities are the same within a cluster. Note that within-cluster assignments may be correlated. Next, let D(i)g = (Djg )j =i denote the vector of treatments excluding unit i and D(ij )g = (Dkg )k=(i,j ) denote the vector of treatments excluding units i and j . We introduce the following restriction on the conditional moments of the outcome. Assumption 2 (Exchangeability) For all i, j and g , (i) E[Yig |Dig = d, Tg = t, D(i)g ] = E[Yig |Dig = d, Tg = t] (ii) E[Yig Yjg |Dig = d, Djg = d , Tg = t, D(ij )g ] = E[Yig Yjg |Dig = d, Djg = d , Tg = t]. This assumption is a high-level condition stating that, conditional on own treatment assignment and the cluster-level assignment Tg , the first and second moments of Yig do not vary with the peers’ treatment indicators.4 As shown in Theorem 1 below, these conditions guarantee that the OLS estimator is consistent 3 The number of parameters remains fixed in our setup. See Vazquez-Bare (2023) for an alternative approach in which the number of parameters is allowed to grow with the sample size. 4 We refer to unit i’s “peers” as all the units other than i in the same cluster. 8 for a weighted average of cluster-specific conditional means and that the outcome variance only depends on (d, t), the variation in treatment assignment that is controlled by the experimental design. Assumption 2 can be interpreted as a requirement that the assignment (Dig , Tg ) contains all the relevant variation in the outcome moments, so that the spillovers model is “correctly specified”. To further justify this assumption, in Section 3.6 we show that this condition is guaranteed when peers are assumed to be exchangeable, so that potential outcomes only depend on the proportion of treated peers and not on their identities. This exchangeability assumption is very common in the spillovers literature. This requirement may be violated, for example, in networks where units have different degrees of network centrality, and thus both the proportion of and the identities of the treated units matter. Finally, we restrict cluster heterogeneity in the following way. Assumption 3 (Cluster heterogeneity and bounded moments) 2/r (i) For some 2 ≤ r < ∞, as n → ∞, maxg n2 g /n → 0 and g nr g /n ≤ C < ∞. ˜ < ∞. (ii) For some > r, supi,g,d,t E[|Yig | |Dig = d, Tg = t] ≤ C Condition (i) is taken from Hansen and Lee (2019). The first part ensures that the largest cluster is small relative to the total sample size, so no cluster dominates the sample. The second part of condition (i) is a regularity condition that rules out unbounded r-th moments in the distribution of cluster sizes. As an example, setting r = 4 restricts the fourth moment of the cluster size distribution, which rules out heavy tails.5 Condition (ii) is a standard regularity condition that ensures that the -th conditional moment of the outcome is bounded. In what follows, we use “→P ” to denote convergence in probability “plimn→∞ ” to denote probability limits, “→D ” to denote convergence in distribution and · to denote the Euclidean norm. We define any generic (2M + 1)-dimensional vector v as: v = (v (d, t))(d,t) = (v (0, 0), v (0, 1), . . . , v (0, M ), v (1, 1), . . . , v (1, M )) ˆ n = (ˆ Consider the vector of estimators µ µ(d, t))(d,t) from (5) and define the vector: g ng pg (d|t)µg (d, t) µp p n = (µn (d, t))(d,t) , µp n (d, t) = . g ng pg (d|t) Define the (2M + 1) × (2M + 1) covariance matrix Ωn with elements: 1 g 1g Ng Ng Cov Yg , Yg |Tg , Dg E 1t t d d ¯d ¯d Ωn ((d, t), (d , t )) = n g pn (d |t ) ¯n (d|t)¯ qt qt p 1 g Ng , 1g Ng ) (µg (d, t) − µn (d, t))(µg (d , t ) − µn (d , t ))Cov(1t d t d + n g pn (d |t ) ¯n (d|t)¯ qt q t p 5 Notice that condition (i) holds automatically when group sizes are seen as fixed or bounded in the asymptotic analysis. 9 where p¯n (d|t) := g ng pg (d|t)/n. In what follows we use Ωn (d, t) to refer to the diagonal elements of Ωn . We introduce the following technical conditions to guarantee invertibility of the covariance matrix and to ensure the denominators of the estimators are bounded below. Assumption 4 (Invertibility conditions) (i) The minimum eigenvalue of Ωn is bounded away from 0. ¯n (d|t) := (ii) For any (d, t) such that pg (d|t) > 0 for some g , p g ng pg (d|t)/n ≥ c > 0. The following theorem characterizes the asymptotic distribution and variance of the OLS estimators in (5). Let I2M +1 is a (2M + 1)-dimensional identity matrix. −1/2 √ ˆ n − µp Theorem 1 If Assumptions 1 to 4 hold, µ n →P 0 and Ωn ˆ n − µp n (µ n ) →D N (0, I2M +1 ). All the proofs can be found in Appendix D. Because the estimator µ ˆ can be obtained through a saturated OLS regression including one regressor per distinct treatment assignment, Theorem 1 can be thought of as generalizing the results in Hansen and Lee (2019) to a specific type of nonparametric regression where coefficients are heterogeneous across clusters. 3.3 Estimation and Inference for Differences in Means Theorem 1 has two main implications. First, each µ ˆ(d, t) estimates a weighted average of cluster-specific means µg (d, t), where the weights depend on the cluster size ng and the within-cluster probability of a treatment pg (d|t). Second, the distribution of µ ˆ n can be approximated as µ ˆ n ∼ N (µp n , Ωn /n) where the variance matrix Ωn allows for heterogeneity in cluster sizes and outcomes distributions, heteroskedasticity, different treatment assignment probabilities across clusters and intracluster correlation in both outcomes and unit-level treatment assignments. This result can be applied to obtain an asymptotic distributional approximation and variance formulas for functions of µ ˆ n , such as subvectors, linear combinations (like the pooled and slope effects proposed by Baird et al., 2018) or nonlinear functions thereof, applying the delta method when needed. Theorem 1 implies that the difference-in-means estimators β ˆ(d, t) = µ ˆ(d, t) − µ ˆ(0, 0) consistently estimate: ng pg (d|t)µg (d, t) ng µg (0, 0) βnp (d, t) = g − g g ng pg (d|t) n which is different from our parameter of interest (6) because treatment probabilities may differ across p clusters. When the treatment probabilities are equal across clusters, pg (d|t) = p(d|t) for all g , βn (d, t) = βn (d, t) so the parameter of interest can be consistently estimated by OLS. Thus, in settings with hetero- geneous clusters, the experimenter may prefer designs in which the within-cluster treatment probabilities 10 do not vary across clusters with the same assignment Tg = t, or to reweight the estimators by the inverse of pg (d|t). When conducting inference and hypothesis testing, the variance of the estimators of interest is com- monly estimated using a cluster-robust variance estimator. In this setting, and ignoring finite-sample degrees-of-freedom adjustments, the cluster-robust variance estimator of Ωn is: Ç å−1 Ç å−1 ˆ cr = n Ω 1g 1g ˆ )(Yg − 1g µ 1g (Yg − 1g µ ˆ ) 1g 1g 1g (7) g g g where 1g = (11g , . . . , 1ng g ) is an ng × (2M + 1) matrix and 1ig = (1(Dig = d, Tg = t))(d,t) is an ng -dimensional column vector. Based on this matrix estimator, the cluster-robust variance estimator for the difference in means β ˆ(d, t) is V ˆ cr (0, 0) using that Ω ˆ cr (d, t) + Ω ˆcr (d, t) = Ω ˆ cr ((d, t), (d , t )) = 0 for t = t . The following result shows that, in a setting with distributional heterogeneity, the cluster-robust variance estimator for the difference in means can be conservative. Proposition 1 Let Vn (d, t) = Ωn (d, t) + Ωn (0, 0) − 2Ωn ((d, t), (0, 0)) denote the true asymptotic variance Ä ä ˆ(d, t). Under Assumptions 1 to 4, plim V of β ˆcr (d, t)/Vn (d, t) ≥ 1. n→∞ The reason why the cluster-robust variance estimator can be conservative is that the true asymptotic variance can be approximated as: ng pg (d|t) 2 pg (d, d|t) ß ™ 1 Vn (d, t) ≈ σ (d, t) 1 + ρg (d, d, t) (ng − 1) qt g np¯n (d|t)2 g pg (d|t) 1 ng 2 + σ (0, 0) {1 + ρg (0, 0, 0)(ng − 1)} q0 g n g ng pg (d|t) pg (d, d|t) ß ™ 1 p 2 + (µg (d, t) − µn (d, t)) 1 + (ng − 1) (8) qt g np¯n (d|t)2 pg (d|t) 1 n2g + (µg (0, 0) − µp n (0, 0)) 2 q0 g n ò2 n2 g pg (d|t) ï p p − (µg (d, t) − µn (d, t)) − (µg (0, 0) − µn (0, 0)) . g n p ¯n (d|t) The first two lines in Equation (8) represent the average within-cluster variation in outcomes for the units in treated and pure control clusters, respectively. The third and fourth lines represent the between-cluster variation in average outcomes for treated and control clusters, respectively. Finally, the fifth line can be interpreted as a the between-cluster variance of the difference in means, weighted by the relative probabilities of treatment in each cluster. Note that when pg (d|t) = p ¯n (d|t), this last term becomes 2 2 g ng (βg (d, t) − βn (d, t)) /n. The last three lines in this formula equal zero when outcome distributions are homogeneous between clusters, as we discuss below. The last term in Equation (8) is not estimable because it depends on the within-cluster difference in 11 means between assignments, which is never observed. The cluster-robust variance estimator is, asymptot- ically: ò2 n2 g pg (d|t) ï ˆcr (d, t) ≈ Vn (d, t) + V p p (µg (d, t) − µn (d, t)) − (µg (0, 0) − µn (0, 0)) (9) g n p ¯n (d|t) ˆcr (d, t) can be asymptotically upward-biased, and thus which is Equation (8) without the last term. Thus, V inference based on this variance estimator can be conservative. Similar results have also been obtained in design-based causal inference settings with non-random potential outcomes, see for example Hudgens and Halloran (2008); Basse and Feller (2018); Abadie et al. (2022) and Jiang, Imai and Malani (2023). Proposition 1 shows that an analogous result holds in a superpopulation setting when clusters exhibit dis- tributional heterogeneity. In particular, when outcome distributions are homogeneous across clusters, this additional term disappears and inference based on the cluster-robust variance estimator is asymptotically exact, as we discuss further in Section 3.5. 3.4 Power Calculations and Optimal Design p By Theorem 1, the power of a two-sided hypothesis test of βn (d, t) = 0, can be approximated by: Å√ p ã Å√ p ã p nβn (d, t) nβn (d, t) Γ(βn (d, t)) ≈1−Φ √ + z1−α/2 + Φ √ − z1−α/2 (10) V V for some appropriately chosen asymptotic variance V , where z1−α/2 is the (1 − α/2)-quantile from the standard normal distribution. To use the true variance formula, the researcher may replace V by Equation (8). This variance depends on the within-cluster variances, intra-cluster correlation and the between- cluster variation in outcomes, which can be imputed using baseline data, the cluster size distribution, which is observable, and the cluster- and unit-level assignment probabilities which are chosen by the researcher. One issue with this choice of variance formula is that, as shown in Proposition 1, the variance estimator that is actually used when conducting inference may be upward biased, which may result in an underpowered study. To avoid this issue, the researcher may instead conduct power calculations using the variance formula in Equation (9). The number of saturations M and the within-cluster treatment probabilities pg (d|t) and pg (d, d|t) play a crucial role in identification, as they determine the type of comparisons that can be made between treated and control units. The choice of these parameters can be guided by previous knowledge or assumptions on how the conditional average outcome varies as a function of the treatment saturation. For instance, if this function is assumed to be linear or close to linear, two saturations would be enough to identify the shape of this function, whereas if the function can be approximated by a quadratic function, one would need three saturations, and so on. In turn, the choice of within-cluster treatment probabilities pg (d|t) depends on the slope of the conditional average outcome as a function of the treatment saturation. For instance, 12 with three saturations M = {0, 1, 2}, if average outcomes are expected to jump around some value p ¯ but ¯ and pg (1|2) > p ¯, the researcher can choose pg (1|1) < p to be relatively flat below or above p ¯ to increase the chance of detecting these changes. Without knowledge of how this function is expected to change, the researcher may spread these probabilities approximately uniformly, choosing some “low”, “intermediate” and “high” treatment probabilities. While we do not provide formal guidance on choosing M and the within-cluster treatment probabilities, our results in Theorem 1 and power function (10) can be used to compare the power and MDEs of competing designs. We now propose a method to optimally choose the cluster-level assignment probabilities {qt }M t=0 . M Given M and the within-group treatment probabilities, optimally choosing {qt }t=0 requires defining an optimality criterion that determines how the variances of all the estimators of interest are aggregated. The literature on optimal design of experiments has proposed several criteria (see e.g. Silvey, 1980; Melas, 2006; Berger and Wong, 2009). We consider A-optimality, which minimizes the trace of the variance- covariance matrix of the difference in means estimators (β ˆ(d, t))(d,t>0) (or equivalently, the average of the asymptotic variances).6 The justification of this criterion is that the trace of the variance-covariance matrix can be seen as a measure of the size of the confidence ellipsoid (i.e. the multidimensional confi- dence interval) for the vector of parameters of interest. One advantage of A-optimality is its tractability, as the optimal choice has a simple closed-form solution in this setting. In the theorem below, we consider a generalized version of A-optimality that allows the researcher to assign different weights to different variances. Theorem 2 Let ω = (ωdt )(d,t>0) be a known vector of weights with ωdt ≥ 0, ω1t + ω0t > 0, t>0 (ω0t + ω1t ) = 1. Consider the optimal design problem: M ¶ © min ˆ(1, t)] ˆ(0, t)] + ω1t V[β ω0t V[β q0 ,q ,...,q 1 M t=1 with qt > 0, M t=0 qt = 1 using the variance formula in Equation (8) or (9). The optimal assignment probabilities are given by: √ ∗ B0 ∗ Bt (ω ) q0 (ω ) =√ , qt (ω ) = √ √ , t > 0, B0 + Bt (ω ) B0 + Bt (ω ) t>0 t>0 where 2 B0 = ng σg (0, 0) {1 + ρg (0, 0, 0)(ng − 1)} + ng (µg (0, 0) − µn (0, 0))2 g 6 Notice that this criterion is different from the one in Baird et al. (2018), who minimize the average standard error. We propose this alternative method as it is in line with the theoretical literature on optimal design, while also allowing for a simple, closed-form solution to the optimal design problem. 13 and for t > 0, ng pg (1|t) 2 pg (1, 1|t) ï ß ™ Bt (ω ) = ω1t σg (1, t) 1 + ρg (1, t) (ng − 1) g p ¯n (1|t)2 pg (1|t) pg (1, 1|t) ß ™ò 2 + (µg (1, t) − µn (1, t)) 1 + (ng − 1) pg (1|t) ng pg (0|t) 2 pg (0, 0|t) ï ß ™ +ω0t σg (0, t) 1 + ρg (0, t) (ng − 1) g p ¯n (0|t)2 pg (0|t) pg (0, 0|t) ß ™ò 2 + (µg (0, t) − µn (0, t)) 1 + (ng − 1) . pg (0|t) Theorem 2 provides the formula for the optimal cluster assignment probabilities that minimize a weighted average of estimators variances. By choosing the vector (ωdt )(d,t>0) , the researcher can assign lower (or zero) weights to some parameters that are not of interest, and larger weights to parameters that are deemed more important. For instance, to focus on comparisons between untreated units in treated clusters and pure controls, the researcher can set ω1t = 0 for all t. While A-optimality has the advantage of a simple closed form solution, there are other optimality criteria that may be desirable in different settings. Optimization problems based on these alternative criteria do not have closed form solutions in general, but can be solved numerically using our variance formulas. See Silvey (1980), Melas (2006) and Berger and Wong (2009) for further details and discussions. It should be noted that researchers may often need to incorporate different sets of constraints (such as logistical, budgetary, political or administrative constraints) when choosing assignment probabilities. These restrictions can be incorporated when choosing qt , either directly into the optimization problem in Theorem 2 or on a case-specific basis. For example, in the experiment we describe in the next section, the total number of treated units was set by the government agency. We set up a system of equations incor- porating this restriction to control the variance of the smallest treatment cells (i.e. the noisiest estimators). See Section 4.3 for details. Finally, we note that our optimality criterion does not incorporate baseline covariates. A strand of the literature on experiments has considered alternative designs that include observed covariates in the treatment assignment mechanism as a way to improve balance and increase precision. A common design in these settings is the matched pairs design (Imai, King and Nall, 2009; Bai, 2022; Liu, 2023), where each cluster is paired with another one with similar covariates and then treatment is randomized within each pair. Our results provide a different and complementary approach that does not require covariates, and instead allows the researcher to assign different weights to the different variances of the estimators of interest. 14 3.5 Power Calculations under Distributional Homogeneity The formulas for Ωn from Theorem 1 and Equations (8) and (9) can be difficult to implement when the researcher does not have access to baseline outcome data. We now introduce an additional assumption that simplifies the variance formulas and makes them easier to implement in the absence of this information. Specifically, the following assumption rules out between-cluster heterogeneity in conditional outcome moments. Assumption 5 (Between-Cluster Moment Homogeneity) E[Yig |Dig = d, Tg = t] = µ (d, t) and E[Yig Yjg |Dig = ˜ (d, d , t) for all g , (d, d , t) and for any for which the moments exist. d, Djg = d , Tg = t] = c As before, we write µ1 (d, t) = µ(d, t) to reduce notation. Under this additional assumption we obtain the following result. Corollary 1 Suppose Assumptions 1 to 5 hold. Then µp n (d, t) = µ(d, t) for all (d, t), Theorem 1 holds and the variance Ωn takes the following form: ® ´ nσ 2 (d, t) g ng (ng − 1)pg (d, d|t) Ωn (d, t) = 1 + ρ(d, t) , t > 0, qt g ng pg (d|t) g ng pg (d|t) ® Ç 2 å´ σ 2 (0, 0) g ng Ωn (0, 0) = 1 + ρ(0, 0) −1 , q0 n ng (ng − 1)pg (0, 1|t) Ωn ((0, t), (1, t)) = nσ (0, t)σ (1, t)ρ(0, 1, t) g , t > 0, g ng pg (0|t) g ng pg (1|t) Ωn ((d, t), (d , t )) = 0, t=t and where σ 2 (d, t) = V[Yig |Dig = d, Tg = t], ρ(d, t) = cor(Yig , Yig |Dig = d, Djg = d, Tg = t), pg (d, d |t) = P[Dig = d, Djg = 1|Tg = t], and ρ(0, 1, t) = Cov(Yig , Yig |Dig = 0, Djg = 1, Tg = t). In Ä ä addition, plim V ˆcr (d, t)/Vn (d, t) = 1. n→∞ Corollary 1 has three main implications. First, under between-cluster homogeneity, the difference-in- means estimators are consistent for the population differences in means β (d, t) = µ(d, t)−µ(0, 0). Second, it shows that under this additional assumption, the cluster-robust variance estimator is consistent and thus inference based on this estimator is asymptotically exact. Third, it provides a simplified variance formula that allows for heterogeneity in cluster sizes and within-cluster probabilities, conditional heteroskedasticity and intracluster correlation in outcomes and treatments, but does not depend on cluster-specific average outcomes like the variance formula in Theorem 1. This simplified formula can be readily used to conduct 15 power and MDE calculations for the parameters of interest. Specifically, ® ´ 2 ˆ(d, t)] ≈ σ ( d, t) g n g (n g − 1) p g (d, d|t) V[ β 1 + ρ(d, t) qt g ng pg (d|t) g ng pg (d|t) ® Ç 2 å´ (11) σ 2 (0, 0) g ng + 1 + ρ(0, 0) −1 nq0 n which only depends on the variance and conditional intracluster correlation in outcomes (as in any standard power calculation), the assignment probabilities, which are chosen by the experimenter, and the sample distribution of cluster sizes, which is observable. This variance can be fed into the power formula (10) to calculate power and MDEs. We discuss practical implementation issues in more detail in Sections 4 and 5. As a word of caution, we note that, just like ignoring cluster size heterogeneity, incorrectly imposing Assumption 5 when conducting power calculations can result in variances and MDEs that are too small because they ignore between-cluster variability in outcomes. While this assumption may be strong in some settings, most of the formulas for experimental design available in the literature rely on it. To illustrate this point, the following examples show how our general formulas simplify to the ones proposed in the literature under further assumptions. Example 1 (Standard RCT with a binary treatment) Suppose that each cluster has one unit (ng = 1), and there are two saturations so that each (single-unit) cluster is assigned to treatment or control with probability q and 1 − q respectively. In this case, qt = q , q0 = 1 − q , g ng pg (1|1) = n and under Assumptions 1 to 5, V[β ˆ(1, 1)] ≈ σ 2 (1, 1)/nq + σ 2 (0, 0)/n(1 − q ). In addition, under the homoskedas- ˆ(1, 1)] ≈ σ 2 /nq (1 − q ) which is Equation (6) in Duflo, ticity assumption σ 2 (1, 1) = σ 2 (0, 0) = σ 2 , V[β Glennerster and Kremer (2007). Example 2 (Cluster RCT) Suppose that clusters are assigned to two saturations Tg ∈ {0, 1} and that all units within the same cluster receive the same treatment. In this case, pg (1|1) = pg (1, 1|1) = 1 and under Assumptions 1 to 4, ¯g0 |Tg = 0] ¯g1 |Tg = 1] V[Y ã2 ® ´ n2 V[Y µg (1) − µp µg (0) − µp Å ˆ(1, 1)] ≈ g n (1) n (0) V[β + + q 0 q1 + g n2 q1 q0 q1 q0 where µg (1) = µg (1, 1), µg (0) = µg (0, 0) and similarly for the remaining terms. This formula is analo- gous to the one derived by Bugni et al. (2023) for what they call the size-weighted cluster-level average treatment effect, up to a term in their formula that accounts for the stratification procedure.7 Furthermore, if Assumption 5 holds, ® Ç å´ ® Ç å´ 2 ˆ(1, 1)] ≈ σ (1, 1) gn2 g σ 2 (0, 0) gn2 g V[β 1 + ρ(1, 1) −1 + 1 + ρ(0, 0) −1 . nq n n(1 − q ) n 7 See also Liu (2023) for an analysis of stratification in two-stage experiments. 16 Finally, suppose clusters are equally-sized, ng = n ¯ , and assume a random effects structure so that σ (1, 1) = σ (0, 0) = σ + τ and ρ(1, 1) = ρ(0, 0) = τ 2 /(σ 2 + τ 2 ). In this case, V[β 2 2 2 2 ˆ(1, 1)] ≈ nτ 2 + σ 2 )/[q (1 − q )Gn (¯ ¯ ] which is Equation (9) in Duflo, Glennerster and Kremer (2007). Example 3 (Homoskedastic case with two treatment saturations) Suppose there are only two satura- tions, so that M ∈ {0, 1}, as in Duflo and Saez (2003). Let q = P[Tg = 1] and p = P[Dig = 1|Tg = 1]. Assume that σ 2 (d, t) = 1 and ρ(d, t) = 0 for all (d, t). In this case, for assignment (d, t) = (0, 1), under Assumptions 1 to 5, V[β ˆ(0, 1)] ≈ (1 − pq )/[(1 − p)q (1 − q )] which corresponds to the variance formula in Hirano and Hahn (2010). Example 4 (Random effects structure with equally-sized clusters) Suppose that all clusters are equally sized, ng = n¯ for all g , and consider a random effects covariance structure so that σ 2 (d, t) = σ 2 + τ 2 , ρ(d, t) = τ 2 for all (d, t). In addition, suppose that the within-cluster assignment given Tg = t sets a fixed number of treated units n ¯ pt in each cluster, which implies that P[Dig = 1, Djg = 1|Tg = npt − 1)/(¯ t] = pt (¯ n − 1). In this case, for assignment (1, t), under Assumptions 1 to 5, V[β ˆ(1, t)] ≈ ¶ Ä ä Ä ä© ¯ ρ q1t + q10 + (1 − ρ) pt1qt + q10 /n which corresponds to Equation (3) in Baird et al. (σ 2 + τ 2 ) n (2018). 3.6 A Potential Outcomes Framework In this section we introduce a potential outcomes framework to study the causal interpretation of the OLS estimands discussed in the previous sections. Let Yig (d, dg , t) denote unit i’s (random) potential outcomes where d denotes own treatment, dg ∈ {0, 1}ng −1 is a vector denoting unit i’s peers’ treatments and t denotes the cluster-level assignment. To be able to compare outcomes across clusters, our first assumption is an exclusion restriction stating that the cluster level assignment t does not directly affect potential outcomes. Assumption 6 (Exclusion restriction) Yig (d, dg , t) = Yig (d, dg ) for all (d, dg , t). While this assumption is required to identify treatment effects using variation across clusters, to our knowledge we are the first to make it explicit. This potential outcome structure allows for within-cluster spillovers, an assumption often known as stratified interference (Hudgens and Halloran, 2008). Specifi- ˜g ) cally, Yig (1, dg ) − Yig (0, dg ) is the direct effect of the treatment on unit i in cluster g , Yig (0, dg ) − Yig (0, d is the spillover effect on an untreated unit and Yig (1, dg ) − Yig (1, d ˜ g ) is the spillover effect on a treated unit. The observed outcome of interest for unit i in cluster g is denoted by Yig = d,dg Yig (d, dg )1(Dg = (d, dg )). Next, we assume that the vector of treatment assignments (Dg , Tg ) is independent of the vector of potential outcomes, which is guaranteed by random assignment of the treatment. ⊥ (Dg , Tg ). Assumption 7 (Independence) (Yig (d, dg ))(d,dg ) ⊥ 17 Finally, we assume that peers are exchangeable. Under this assumption, potential outcomes can depend flexibly on the proportion of treated peers, as long as they do not depend on the peers’ identities. This assumption reduces the dimensionality of potential outcomes and is ubiquitous when analyzing spillovers (see Vazquez-Bare, 2023, and references therein for further discussion). In what follows let 1g be an (ng − 1)-dimensional column vector of ones. Assumption 8 (Exchangeability) For all dg , Yig (d, dg ) = Yig (d, πg ) where πg = 1g dg /(ng − 1) is the proportion of unit i’s treated peers. The following result links moments of observed outcomes to average potential outcomes. Proposition 2 Under Assumptions 6 to 8, letting Sig = j =i Djg , ng −1 ï Å ãò sg E[Yig |Dig = d, Tg = t] = E Yig d, P[Sig = sg |Dig = d, Tg = t] sg =0 ng − 1 for any such that the expectations are well defined. Proposition 2 implies that the conditional mean E[Yig |Dig = d, Tg = t] in cluster g equals an average of average potential outcomes over the proportions of treated peers that are consistent with the assignment mechanism. In particular, if the treatment assignment mechanism exactly determines the proportion of treated units, so that P[Sig = sg |Dig = d, Tg = t] = 1 for some sg , each observed conditional mean point-identifies the average potential outcome. 1 n g Theorem 3 Let Ng = i=1 Dig be the total number of treated units in cluster g and define yg = (Yig (d, dg ))i,d,dg . Suppose that: (i) Assumptions 6 to 8 hold. (ii) (yg , Dg , Tg )G ˜l g =1 are mutually independent across g ; for each g and for all i, E[Yig (d, π )] = µ g (d, π ) for all (d, π ) and for all such that E[Yig (d, π )] < ∞; Assumption 1(iii) holds. ˜ < ∞. (iii) Assumption 3(i) holds and for some > r, maxi,g,d,π E[|Yig (d, π )|] ≤ C 1 1 (iv) Ng is nonrandom conditional on Tg , with P[Ng = ng pg (1|t)|Tg = t] = 1 and pg (d|t) = p(d|t) for all g . Then, Theorem 1 holds and ng p(1|t) − d ï Å ã ò ng ng βn (d, t) := (µg (d, t) − µg (0, 0)) = E Yig d, − Yig (0, 0) . g n g n ng − 1 18 Theorem 3 provides conditions on the potential outcomes and experimental design to guarantee that The- orem 1 holds. By Proposition 2 and Condition (iii) of Theorem 3, moments of observed outcomes for (d, t) can be replaced by moments of potential outcomes for (d, (ng p(1|t) − d)/(ng − 1)) so the formulas in Theorem 1 can be readily applied, as long as the variance matrix is invertible. In addition, Theorem 3 ensures that differences in means have a causal interpretation: each βn (d, t) equals a cluster-size-weighted average of average differences in potential outcomes. By Theorem 1, these parameters can be consistently estimated by OLS. When cluster sizes vary, average potential outcomes may vary across clusters, even within the set of clusters with the same assignment Tg = t and when the observed proportion of treated units is fixed. To see this, consider the following example. Suppose there are two cluster sizes, ng = 16 and ng = 20, and consider clusters with pg (1|t) = 0.5 so that half the units are assigned to treatment. In clusters with ng = 16, the total number of treated units will be 8 and thus the proportion of treated peers for each unit is 8/15 ≈ 0.535 for untreated units and 7/16 = 0.438 for treated units. On the other hand, in clusters with ng = 20 there will be 10 treated units and thus the proportion of treated peers is 10/19 ≈ 0.526 for untreated units and 9/19 ≈ 0.474 for treated units. Hence, an untreated unit in a cluster with treatment intensity pg (1|t) = 0.5 will have a proportion of 0.533 treated peers if the cluster size is 16, and a proportion of 0.526 treated peers if the cluster size is 20, so the proportions are slightly different even though the treatment assignment is the same. As a result, to be able to use the simplified formula in Corollary 1, the outcome homogeneity assumption needs to be strengthened to ensure that average potential outcomes are invariant to small perturbations in the proportion of treated units, as shown below. Theorem 4 Suppose that the conditions for Theorem 3 hold, and that: ˜ (d, π ) and E[Yig (d, π )Yjg (d , π )] = c (i) E[Yig (d, π )] = µ ˜ (d, d , π, π ) for all g and (d, d , π, π ). ˜ (d, πg (d|t)) − µ (ii) For each (d, d , t) there exists a π (d|t) such that for πg (d|t) = (ng pg (1|t)−d)/(ng −1), maxg µ ˜( 0 and ˜ (d, d , π (d|t), π (d |t)) = 0. ˜ (d, d , πg (d|t), πg (d |t)) − c maxg c Then Corollary 1 holds and β (d, t) := µ(d, t) − µ(0, 0) = E [Yig (d, π (d|t)) − Yig (0, 0)]. Condition (i) above states that, for a given (d, π ), potential outcome moments do not vary across clusters, whereas condition (ii) formalizes the requirement that potential outcome moments are invariant to pertur- bations in the proportion of treated peers generated by the variation in cluster sizes, that is, the function is locally flat. Intuitively, in the example from the previous paragraph, this condition implies for instance that E[Yig (0, 0.533)] = E[Yig (0, 0.526)]. While this second condition may be unlikely to hold exactly in practice, it can be a reasonable approximation when πg (d|t) shows little variation across g for each (d, t) (which happens for example when clusters are not very small) and/or the function µ ˜ (d, π ) is relatively flat around relevant values of π . Under these conditions, Corollary 1 can be applied to estimate direct and spillover effects by OLS and to conduct power calculations. 19 4 Estimating Spillovers in Tax Compliance 4.1 Background There is a large literature on nudges and tax compliance (Antinyan and Asatryan, 2024), but there is relatively scant evidence on the social interaction effects behind these interventions. We designed and implemented an intervention based on the framework presented in the previous sections to illustrate its potential to capture social interaction effects in tax compliance. The intervention took place in a large municipality of Argentina where dwellings are billed and re- quired to pay a municipal property tax on a monthly basis (the Tasa por Servicios Generales). The treat- ment consisted of a one-page personalized letter with information on the current billing period, past due debt, and how to pay online or in person.8 The randomized treatment assignment was conducted in two stages—first at the street block level (clusters), and then at the taxpayer account/dwelling level (units). In the first stage, we randomly divided blocks into four categories with different intensity of treatment, as depicted in Figures A.2 and A.3: (1) pure control blocks where no accounts were treated, (2) blocks with 20% of the accounts treated, (3) blocks with 50% of the accounts treated, and (4) blocks with 80% of the accounts treated. These different treat- ment intensities were designed to assess whether spillovers depend on the saturation of our information campaign at the block level (namely, low, medium, and high saturation levels).9 In the second stage, we randomly selected accounts within the latter three groups of blocks according to their treatment saturation to receive the letter. The experiment was run on residential dwellings present in the municipality in 2019. The timeline of the intervention is displayed in Figure A.4. The letters were delivered between September 28th and October 7th, 2020, corresponding to payments due on October 9th, 2020, as well as past due debt (if any). 4.2 Administrative Data We use a combination of administrative databases provided by the revenue agency of the municipality where the experiment took place. The main database is constructed from the monthly bills issued to account holders between January 2018 and December 2020. The unit of observation is an account (cuenta), which coincides with a dwelling unit. The data contain the following billing details and demographic characteristics of the account holder (titular): account number (unique ID), address, block number, name of locality (neighborhood), year and month of the bill (12 bills per year), monthly fee (in pesos), paid fee 8 Figure A.1 in the appendix provides an anonymized example of the intervention letter. Our simple design emphasized action- relevant information, in accordance with De Neve et al. (2021) who show that simplified tax letters are an effective way to increase tax compliance. 9 The choice p1 = 20%, p2 = 50% and p3 = 80% attempts to balance parsimony with the flexibility to detect nonlinearities in direct and spillover effects without having to estimate too many parameters. See Section 3.4 for further discussion. 20 (amount in pesos), due date, date of payment, days overdue, means of payment (cash or electronic), type of account (residential, retail store, factory), gender of the account holder, age of the account holder, linear front meters of the lot/property, assessed value of the property. The municipality authorities required us to target blocks with eight to fifty accounts, neither very sparse nor very dense, which was the target for their mailing campaign. Figure 1 shows the distribution of accounts per block. Table A2 shows some descriptive statistics for the year 2019. Our sample size consists of 68,808 accounts distributed in 3,982 blocks. The frequency of payments is highly polarized. About 45 percent of the accounts paid the twelve 2019 monthly bills, and about 35 percent did not pay any bill at all.10 We call these two core groups always payers and never payers, respectively. The proportion of always payers is relatively low (45 percent) and, therefore, leaves room for potential behavioral responses from non-compliant and partially-compliant neighbors, and this was compounded by the context of the pandemic, during which lockdown measures reduced payments even from highly compliant individuals. For the randomization, power calculations, and simulations, we use baseline data from the year 2019. We rely on three different pre-treatment outcomes: (i) an indicator equal to 1 if the account paid the twelve monthly bills of 2019, (ii) an indicator equal to 1 if the account paid at least one bill in 2019, and (iii) an indicator equal to 1 if the account paid six bills or more in 2019. 4.3 Experimental Design and MDEs Following the notation in Section 3, the block-level treatment indicator is denoted by Tg ∈ {0, 1, 2, 3} with distribution P[Tg = t] = qt for t = 0, 1, 2, 3 where Tg = 0 indicates the pure control blocks, Tg = 1 indicates the blocks with 20% treated, Tg = 2 indicates blocks with 50% treated, and Tg = 3 indicates blocks with 80% treated. The account-level treatment indicator is Dig ∈ {0, 1}. We use an independent within-cluster treatment assignment and constant within-cluster treatment prob- abilities. In the absence of data from a pilot experiment, we assume equal moments across assignments 2 2 σg (d, t) = σg (0, 0), µg (d, t) − µn (d, t) = µg (0, 0) − µn (0, 0) and ρg (d, t) = ρg (0, 0) for all g, d, t. We further assume that the intracluster correlation is constant across clusters. We then impute all these mag- nitudes based on our baseline data. The parameters of interest are the difference in means between treated or untreated units in each treated group and the pure control units, βn (d, t) for d = 0, 1 and t = 1, 2, 3. The municipality authorities requested that the total number of letters sent be set to L = 25, 061. To incorporate this constraint into the choice of the saturation probabilities qt , we set up a system of equations as follows. The expected number of treated units is n1 = n(0.2q1 + 0.5q2 + 0.8q3 ). Since the assignments Tg = 1 and Tg = 3 can be seen as symmetric, we set q1 = q3 . Finally, we add an equation that ensures that the variance of the effect at 50% saturation is equal to the variance for the “small” cells (treated units in 20% clusters and untreated units in 80% clusters), so that V[β ˆ(d, 2)] = V[β ˆ(0, 3)] = V[β ˆ(1, 1)]. This 10 For the full distribution, see Figure B.15. 21 gives a third equation of the form q2 = Rq3 where R is a constant obtained from our variance formulas. Our system of equations consists of four equations: (i) L = n(0.2q1 + 0.5q2 + 0.8q3 ), (ii) q1 = q3 , (iii) q2 = Rq3 and (iv). We use the results in Theorem 1 to approximate the variances and calculate the ratio R. For our MDE calculations, and to illustrate our methods, we consider three scenarios: one with “sub- stantial” heterogeneity (scenario 1), one with “moderate” heterogeneity (scenario 2), and one with “lim- ited” heterogeneity (scenario 3). The first one uses our raw data set that contains all clusters with eight households or more. This raw data contains one cluster with a very large number of units. This is the scenario where cluster heterogeneity is most substantial. The second scenario considers an intermediate case where we drop all clusters with more than 500 units. This second data set still exhibits substantial heterogeneity but eliminates one extreme outlier. Finally, the third scenario considers clusters of size be- tween eight and 50, which is the sample we use in our experiment. While scenarios 1 and 2 are not used in our empirical analysis, we use them to illustrate how ignoring cluster heterogeneity can result in severely underpowered experiments. These three scenarios are described in Table 1. In scenario 1, the average cluster size is around 21 households, but the data set contains one very large outlier with 2,754 units. This large cluster makes the standard deviation of cluster sizes very large, indeed larger than the average size, so our theoretical results indicate that the adjustment for heterogeneity will make a substantial difference when calculating power and MDEs. In scenario 2, we remove this outlier from the data and this reduces the standard deviation of cluster size, slightly below but very close to the average cluster size. Finally, in scenario 3, while cluster sizes are still heterogeneous and range from eight to 50, the standard deviation is about half the average size. Thus, we may expect the power and MDE adjustment for cluster heterogeneity to be less sizeable in this case. In each scenario, we consider the results obtained with our general formula from Theorem 1 (“Het”), the formulas that rule out between-cluster moment heterogeneity from Corollary 1 (“Homog”) and the formulas that assume homogeneous, equally sized clusters (“Equal”). We emphasize that this last case imposes an incorrect assumption in all three scenarios, as cluster sizes are not homogeneous in our sample. Table 1 shows the cluster assignment probabilities and MDEs for the binary outcomes of interest. We refer to the corresponding MDEs for the parameters βn (0, 1), βn (0, 2) and βn (0, 3) as M DE1 , M DE2 and M DE3 , respectively (the MDEs for βn (1, t) are symmetric and therefore not reported). Our calculations reveal that in scenario 1, the MDEs vary dramatically and range from 0.02 to almost 0.15 depending on the assumptions one is willing to make about cluster heterogeneity. In scenario 2, the difference in MDEs is less pronounced but still substantial: the MDEs under full heterogeneity are about twice as large as the case with equally-sized and homogeneous clusters. Reassuringly, in scenario 3, the one that we use in our experiment, the MDEs are much more robust to the different assumptions about cluster heterogeneity, although ignoring heterogeneity may still result in MDEs that are about 30% smaller than the ones that account for it. 22 For comparison, we repeat these MDE calculations using the optimal cluster assignment probabilities from Theorem 2 to assess the extent to which the constraint in the number of treated units affects the power of our experiment. The results are shown in the bottom panels of Table 1. These calculations reveal that our results are very robust: using the optimal cluster assignment probabilities instead of the constrained probabilities would give different proportions of clusters in each saturation, but similar MDEs. The final sample sizes for our experiment are shown in Table A1. We assign 1,102 cluster to pure control, 1,100 clusters to the 20% and 80% saturation and 680 clusters to the 50% saturation. Note that the 20% and 80% saturations are “oversampled” relative to the 50% saturation because they contain small cells (20% treated and 20% untreated, respectively). 4.4 Empirical Results 4.4.1 Direct and Spillover Effects on the Treated Tax Bill We begin the empirical analysis by estimating direct and spillover effects on payments of the October 2020 property tax bill. The due date was October 9th, and the letters were delivered between September 28th and October 7th. We show graphical evidence of the causal effect of the intervention in Figure 3 and summarize the point estimates in Table 2.11 Figure 3 presents the coefficients and 95% confidence intervals from a saturated regression that estimates the day by day difference in payment rates between treated and untreated units relative to accounts in pure control blocks.12 The figure focuses on high-saturation blocks with 80% treated units. The complete analysis with the three saturation groups is presented in Figure B.8 and Table 2. The top left panel of Figure 3 reveals a clear positive effect of the intervention on tax compliance of treated accounts. The payment rate of treated units started to diverge from the pure control group as soon as the intervention began. This treatment effect reached a magnitude of about 4.5 percentage points exactly by the due date of the current billing period, and it stayed relatively constant thereafter.13 The top right panel of Figure 3 shows a clear spillover effect of the intervention on untreated accounts in high-saturation blocks. It is smaller in size than the main direct effect but still substantial. The payment rates increase by about 1.1 percentage points, and the effect is statistically significant in the early days of the intervention, losing significance from the due date onward for the full sample. Table 2 presents the direct and spillover effect coefficients for the three saturation groups. Panels A, B, and C display the effects in blocks where 80%, 50%, and 20% of accounts were treated, respectively. 11 Appendix Section A.3 confirms that our groups are balanced and comparable. Appendix Section B.2 further shows that our tax communication campaign also increased the subscriptions to receive an electronic bill by e-mail and induced taxpayers to pay past-due tax bills. 12 The top panels of Figures B.6 and B.7 display the corresponding payment rates in levels, i.e., the cumulative share of individ- uals paying the October 2020 bill over time for treated, untreated, and pure control units. 13 Appendix Section B.3 shows that pure control blocks are not (indirectly) affected by adjacent treated blocks and thus provide a valid counterfactual for our analysis. 23 The omitted category comprises pure control blocks with untreated accounts only. Columns (1) and (4) show the coefficients and block-clustered standard errors for October 2020 bill payments on two different dates: October 3 (early payments) and October 31 (includes overdue payments). As a benchmark for our treatment effects, the last row reports the average payment rate in pure control blocks at each of these dates (i.e., the constant of each regression). The results in Table 2 indicate that in the early stage of the intervention high-saturation blocks with 80% treated accounts present statistically significant direct and spillover effects of about 1 percentage point. This is relatively large, considering that only 5.2% of neighbors in pure control blocks had paid their October 2020 bill by this date. Naturally, as time passes and more individuals pay their bills (reaching 34.4% in pure control blocks by the end of the month), small effects become harder to detect. Thus, while the spillover effect on untreated units remains unchanged in size, it statistical significance diminishes over time. Conversely, the direct effect on treated units rises to 4.5 percentage points, representing 13.2% of the payment rate in pure control blocks.14 4.4.2 Heterogeneous Effects The results from the full experimental sample presented in Section 4.4.1 revealed modest spillover effects in the high saturation group, primarily in the early days of the intervention. However, as outlined in our pre-analysis plan, treatment effects are likely to vary along a fundamental dimension, namely pre- treatment tax compliance behavior. In this section, we study heterogeneous effects along this dimension by dividing the sample into blocks that exhibited average past compliance (i.e., payments) in 2019 above and below the block median.15 Columns (2)-(3) and (5)-(6) of Table 2 break down the results from columns (1) and (4), respectively, into blocks with below and above median 2019 (i.e., pre-intervention) compliance. The direct effects at the end of the month are generally larger but not substantially different: for blocks with 80%, 50%, and 20% saturation, direct effects are about 5.1, 5.7, and 4.4 percentage points for street blocks above the median average compliance in 2019, compared to about 4.1, 4.8 and 5.4 for those below the median. The differences are relatively small for early payments. The division of the sample into these two groups reveals a much starker contrast for spillover effects. As in the main analysis in column (1) of Table 2, there is a spillover effect in early payments for the 80% saturation group but only for blocks above median compliance in 2019. The middle and bottom panels of Figure 3 make this pattern all the more apparent. This effect is relatively large: 1.58 percentage points, larger in fact than the direct effect of 1.06. The end-of-month spillover effect is much larger: 2.56 percentage points, about half of the direct effect in the high-saturation group (5.09 percentage points). 14 Figure B.9 and Table A4 validate our analysis by showing experimental balance (i.e., no effects) on the pre-intervention bill of September 2020. 15 See Appendix Section B.4 for more details and the intuition behind this exercise. 24 5 Recommendations for Practice In this section, we outline the steps for designing partial population experiments and refer to an example of spillovers on student test scores of an education intervention, the distribution of One Laptop per Child (as in Beuermann et al. 2015) to illustrate these steps. Step 1. First, the researcher needs to select the number of saturations M (i.e., categories with different intensity of treatment) and the within-cluster treatment probabilities {pg (d|t)}(g,d,t) (i.e., the proportion of within-clusters treated units), as discussed in Section 3.4. The choice of M could be guided by previous knowledge or by assumptions on how conditional average outcomes vary as a function of the treatment saturation. Consider, for example, the case of One Laptop per Child (OLPC)-type experiment in which each cluster is a school. Assuming spillovers are linear as a function of saturations, a pure control group of schools with untreated pupils, and two groups of schools with different degrees of intensity of treatment or saturation (low and high) would suffice. To test for non-linear spillovers as a function of saturation, the researcher should specify at least three saturation levels (low, medium, and high). Our framework does not provide specific guidance on these choices, but highlights the trade-off between the level of detail in which this function can be traced and the availability of units in each treatment assignment - i.e., there might not be enough schools or laptops to distribute to test many different saturation levels. Our power formulas quantify these trade-offs in terms of the statistical power for different designs. Step 2. Use baseline data to assess the degree of cluster size heterogeneity. In the OLPC case, cluster size consists of each school’s enrollment which may vary substantially, especially across districts or geo- graphical areas (for instance, if there are large urban and smaller rural schools in the population). In some cases, the researcher may consider excluding clear outliers to satisfy the “no cluster too large” require- ment, Assumption 3(i). In the OLPC context, it may be necessary to exclude one particularly large school from the experiment. It should be kept in mind that excluding outliers generally changes the population of interest and thus affects the external validity of the estimates. In addition to accounting for variation in cluster sizes, the researcher may need to account for distributional heterogeneity, i.e., variation in outcome distributions across clusters (see Theorem 1). The variation in outcome distributions across clusters may be assessed using baseline outcome data and possibly some distributional assumptions on outcomes. Step 3. Select the variance formula for the power and MDE calculations. This step may be based on the general formula in Equation (9) or on the simplified formula in Equation (11) under distributional homo- geneity. These formulas may be further simplified under additional assumptions on the data-generating process or the experimental design. For instance, one may assume equal within-cluster probabilities across g . Another possible simplifying assumption is homoskedasticity, σ 2 (d, t) = σ 2 and ρ(d, t) = ρ for all (d, t), which means, for example, that the variance and intra-school correlation in student test scores are the same across treatment assignments. This assumption may be reasonable when the effects of the treat- ment (e.g., the effects of OLPC on test scores) are approximately constant across units. Step 4. Choose the cluster-level assignment probabilities {qt }M t=0 —that is, what proportion of clusters 25 to assign to each of the saturation levels defined in point 1 above. These probabilities can be chosen using Theorem 2 when the goal is to minimize a weighted average of estimators variances, incorporating ad-hoc constraints as in Section 4.3, or based on another optimization criteria (and possibly numerical methods) as discussed in Section 3.4. One common ad-hoc constraint is having a fixed number of treated units. In this case, researchers may rely on a system of equations as in Section 4.3. For example, in the OLPC RCT example, the government may have mandated the distribution of exactly 10,000 laptops for the experiment. In that case, if we have two saturation groups of 25% and 75% treated pupils within a school (i.e., clusters), and the saturation probabilities are q1 and q2 , respectively, researchers should use an equation that represents the treatment units as 10, 000 = n(0.25q1 + 0.75q2 ), where n is the total number of pupils. Another condition may, for example, equalize the variance of the estimators in the “small” î ó î ó cells (i.e., treated units in 25% and untreated units in 75% class), that is: V β ˆ(1, 1) = V β ˆ(0, 2) . See Section 4.3. Step 5. Use the power formula in Equation (10) together with the variance formula chosen in step 3 and the cluster probabilities in step 4 to calculate power and/or MDEs. It should be noted that our framework encompasses several other common settings that are particular cases of partial population experiments. For example, our formulas can be used for designing clustered RCTs, such as an intervention where all students in treated schools receive an OLPC laptop. Finally, we note that our results allow for general between-cluster heterogeneity but assume that outcome distributions are homogeneous within each cluster. We leave the generalization of our framework to within-cluster heterogeneity for future work. 6 Conclusion We provide a general framework to analyze and design partial population experiments with heterogeneous clusters. We derive an asymptotic approximation and variance formulas for general clustered experimental designs, allowing for multiple treatment intensities, general forms of intracluster correlation, and two sources of cluster heterogeneity: heterogeneity in cluster sizes and distributional heterogeneity. We then apply our results to analyze inference and to conduct power and MDE calculations in partial population experiments, and derive formulas for optimal group-level assignment probabilities. Our formulas are easy to adapt to other experimental designs. We estimate total and neighborhood spillover effects of a randomized communication campaign on property tax compliance in a large municipality of Argentina where neighbors must pay a monthly bill on their real estate. We estimate direct effects on monthly payments and analyze whether the campaign creates spillover effects on neighbors who live nearby within a treated block but who do not receive a letter. We find evidence of direct and spillover effects on property tax payment rates. Our results reveal higher payment rates of treated and untreated accounts relative to neighbors in pure control blocks where nobody 26 received the communication letter. We find that spillover effects are stronger in blocks that exhibited a higher degree of tax compliance in the pre-treatment period. This application showcases the usefulness of our methodological framework for designing partial population experiments. References Abadie, Alberto, Susan Athey, Guido W Imbens, and Jeffrey M Wooldridge. 2022. “When Should You Adjust Standard Errors for Clustering?” Quarterly Journal of Economics, 138(1): 1–35. Angelucci, Manuela, and Giacomo De Giorgi. 2009. “Indirect Effects of an Aid Program: How Do Cash Transfers Affect Ineligibles’ Consumption?” American Economic Review, 99(1): 486–508. Antinyan, Armenak, and Zareh Asatryan. 2024. “Nudging for Tax Compliance: a Meta-Analysis*.” The Economic Journal, ueae088. Athey, Susan, Dean Eckles, and Guido W. Imbens. 2018. “Exact P-values for Network Interference.” Journal of the American Statistical Association, 113(521): 230–240. Baird, Sarah, Aislinn Bohren, Craig McIntosh, and Berk Ozler.¨ 2018. “Optimal Design of Experi- ments in the Presence of Interference.” The Review of Economics and Statistics, 100(5): 844–860. Bai, Yuehao. 2022. “Optimality of Matched-Pair Designs in Randomized Controlled Trials.” American Economic Review, 112(12): 3911–3940. Barrera-Osorio, Felipe, Marianne Bertrand, Leigh L. Linden, and Francisco Perez-Calle. 2011. “Im- proving the Design of Conditional Transfer Programs: Evidence from a Randomized Education Exper- iment in Colombia.” American Economic Journal: Applied Economics, 3(2): 167–195. Basse, Guillaume, and Avi Feller. 2018. “Analyzing two-stage experiments in the presence of interfer- ence.” Journal of the American Statistical Association, 113(521): 41–55. Basse, G W, A Feller, and P Toulis. 2019. “Randomization tests of causal effects under interference.” Biometrika, 106(2): 487–494. Berger, Martijn P.F., and Weng-Kee Wong. 2009. An Introduction to Optimal Designs for Social and Biomedical Research. Wiley. Bergeron, Augustin, Gabriel Tourek, and Jonathan Weigel. 2024. “The State Capacity Ceiling on Tax Rates: Evidence from Randomized Tax Abatements in the DRC.” NBER working paper. Beuermann, Diether W., Julian Cristia, Santiago Cueto, Ofer Malamud, and Yyannu Cruz-Aguayo. 2015. “One Laptop per Child at Home: Short-Term Impacts from a Randomized Experiment in Peru.” American Economic Journal: Applied Economics, 7(2): 53–80. 27 Bloom, Howard S. 2005. “Randomizing Groups to Evaluate Place-Based Programs.” Learning More from Social Experiments: Evolving Analytic Approaches, 115–172. Russell Sage Foundation. Boning, William C., John Guyton, Ronald Hodge, and Joel Slemrod. 2020. “Heard it through the grapevine: The direct and network effects of a tax enforcement field experiment on firms.” Journal of Public Economics, 190(C). Brockmeyer, A, A Estefan, K Ramirez Arras, and J.C. Suarez Serrato. 2020. “Taxing Property in Developing Countries: Theory and Evidence from Mexico.” IFS Working Paper. Bruhn, Miriam, and David McKenzie. 2009. “In Pursuit of Balance: Randomization in Practice in Development Field Experiments.” American Economic Journal: Applied Economics, 1(4): 200–232. Bugni, Federico A., Ivan A. Canay, and Azeem M. Shaikh. 2018. “Inference under Covariate-Adaptive Randomization.” Journal of the American Statistical Association, 113(524): 1784–1796. Bugni, Federico A, Ivan A. Canay, and Azeem M. Shaikh. 2019. “Inference under Covariate-Adaptive Randomization with Multiple Treatments.” Quantitative Economics, 10(4): 1747–1785. Bugni, Federico A., Ivan A. Canay, Azeem M. Shaikh, and Max Tabord-Meehan. 2023. “Inference for Cluster Randomized Experiments with Non-ignorable Cluster Sizes.” arXiv:2204.08356. Carrillo, Paul E., Edgar Castro, and Carlos Scartascini. 2021. “Public good provision and property tax compliance: Evidence from a natural experiment.” Journal of Public Economics, 198: 104422. Carter, Andrew V., Kevin T. Schnepel, and Douglas G. Steigerwald. 2017. “Asymptotic Behavior of a t-Test Robust to Cluster Heterogeneity.” Review of Economics and Statistics, 99(4): 698–709. Chiang, Harold, Yuya Sasaki, and Yulong Wang. 2023. “On the Inconsistency of Cluster-Robust Infer- ence and How Subsampling Can Fix It.” arXiv:2308.10138. epon, Bruno, Esther Duflo, Marc Gurgand, Roland Rathelot, and Philippe Zamora. 2013. “Do La- Cr´ bor Market Policies have Displacement Effects? Evidence from a Clustered Randomized Experiment.” Quarterly Journal of Economics, 128(2): 531–580. De Neve, Jan-Emmanuel, Cl´ ement Imbert, Johannes Spinnewijn, Teodora Tsankova, and Maarten Luts. 2021. “How to Improve Tax Compliance? Evidence from Population-Wide Experiments in Bel- gium.” Journal of Political Economy, 129(5): 1425–1463. Djogbenou, Antoine A., James G. MacKinnon, and Morten Ørregaard Nielsen. 2019. “Asymptotic theory and wild bootstrap inference with clustered errors.” Journal of Econometrics, 212(2): 393–412. Drago, Francesco, Friederike Mengel, and Christian Traxler. 2020. “Compliance Behavior in Net- works: Evidence from a Field Experiment.” American Economic Journal: Applied Economics, 12(2): 96–133. 28 Duflo, Esther, and Emmanuel Saez. 2003. “The Role of Information and Social Interactions in Retire- ment Plan Decisions: Evidence from a Randomized Experiment.” Quarterly Journal of Economics, 118(3): 815–842. Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2007. “Using Randomization in Development Economics Research: A Toolkit.” In Handbook of Development Economics. Vol. 4 of Handbook of Development Economics, , ed. T. Paul Schultz and John A. Strauss, 3895–3962. Elsevier. Foos, Florian, and Eline A. de Rooij. 2017. “All in the Family: Partisan Disagreement and Electoral Mobilization in Intimate Networks—A Spillover Experiment.” American Journal of Political Science, 61(2): 289–304. Gin´e, Xavier, and Ghazala Mansuri. 2018. “Together We Will: Experimental Evidence on Female Vot- ing Behavior in Pakistan.” American Economic Journal: Applied Economics, 10(1): 207–235. Hansen, Bruce E., and Seojeong Lee. 2019. “Asymptotic theory for clustered samples.” Journal of Econometrics, 210(2): 268–290. Haushofer, Johannes, and Jeremy Shapiro. 2016. “The Short-term Impact of Unconditional Cash Trans- fers to the Poor: Experimental Evidence from Kenya.” Quarterly Journal of Economics, 131(4): 1973– 2042. Hirano, Keisuke, and Jinyong Hahn. 2010. “Design of Randomized Experiments to Measure Social Interaction Effects.” Economics Letters, 106(1): 51–53. Hudgens, Michael G., and M. Elizabeth Halloran. 2008. “Toward Causal Inference with Interference.” Journal of the American Statistical Association, 103(482): 832–842. ¨ Ichino, Nahomi, and Matthias Schundeln. 2012. “Deterring or Displacing Electoral Irregularities? Spillover Effects of Observers in a Randomized Field Experiment in Ghana.” Journal of Politics, 74(1): 292–307. Imai, Kosuke, Gary King, and Clayton Nall. 2009. “The Essential Role of Pair Matching in Cluster- Randomized Experiments, with Application to the Mexican Universal Health Insurance Evaluation.” Statistical science, 24(1): 29–53. Imai, Kosuke, Zhichao Jiang, and Anup Malani. 2021. “Causal Inference With Interference and Non- compliance in Two-Stage Randomized Experiments.” Journal of the American Statistical Association, 116(534): 632–644. Jiang, Zhichao, Kosuke Imai, and Anup Malani. 2023. “Statistical Inference and Power Analysis for Direct and Spillover Effects in Two-Stage Randomized Experiments.” Biometrics, 79(3): 2370–2381. 29 Krause, Benjamin. 2020. “Balancing Purse and Peace: Tax Collection, Public Goods and Protests.” Mimeo. Leung, Michael P. 2022. “Rate-optimal cluster-randomized designs for spatial interference.” The Annals of Statistics, 50(5): 3064 – 3087. Liu, Jizhou. 2023. “Inference for Two-stage Experiments under Covariate-Adaptive Randomization.” arXiv:2301.09016. Melas, Viatcheslav B. 2006. Functional Approach to Optimal Experimental Design. Springer New York. Miguel, Edward, and Michael Kremer. 2004. “Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities.” Econometrica, 72(1): 159–217. Moffit, Robert. 2001. “Policy Interventions, Low-level Equilibria and Social Interactions.” In Social Dy- namics. , ed. Stephen N. Durlauf and Peyton Young, 45–82. MIT Press. Moulton, Brent R. 1986. “Random group effects and the precision of regression estimates.” Journal of Econometrics, 32(3): 385–397. Pomeranz, Dina. 2015. “No Taxation without Information : Deterrence and Self-Enforcement in the Value Added Tax.” American Economic Review, 105(8): 2539–2569. Pomeranz, Dina, and Jos´ e Vila-Belda. 2019. “Taking State-Capacity Research to the Field: Insights from Collaborations with Tax Authorities.” Annual Review of Economics, 11(1): 755–781. Puelz, David, Guillaume Basse, Avi Feller, and Panos Toulis. 2022. “A graph-theoretic approach to ran- domization tests of causal effects under general interference.” Journal of the Royal Statistical Society: Series B, 84(1): 174–204. Sasaki, Yuya, and Yulong Wang. 2022. “Non-Robustness of the Cluster-Robust Inference: with a Pro- posal of a New Robust Method.” arXiv:2210.16991. Silvey, Samuel D. 1980. Optimal Design: An Introduction to the Theory for Parameter Estimation. Springer Netherlands. Vazquez-Bare, Gonzalo. 2023. “Identification and Estimation of Spillover Effects in Randomized Exper- iments.” Journal of Econometrics, 237(1): 105237. Viviano, Davide. 2024. “Policy design in experiments with unknown interference.” working paper. Weigel, Jonathan L. 2020. “The Participation Dividend of Taxation: How Citizens in Congo Engage More with the State When it Tries to Tax Them.” Quarterly Journal of Economics, 135(4): 1849–1903. 30 Figures and Tables Figure 1: Distribution of cluster sizes in six partial population experiments 300 50 40 200 Frequency Frequency 30 20 100 10 0 0 10 20 30 40 50 0 100 200 300 Cluster (block) size Cluster (local agency) size (a) This paper; n = 68, 808; G = 3, 982; n/G = epon et al. (2013); n = 21, 431; G = 235; (b) Cr´ 17.3; sd(ng ) = 8.25. n/G = 91.2; sd(ng ) = 42.2. 30 6 20 4 Frequency Frequency 10 2 0 0 0 20 40 60 80 0 20 40 60 80 Cluster (village) size Cluster (neighborhood) size (c) Haushofer and Shapiro (2016); n = 2, 880; G = e and Mansuri (2018); n = 2, 637; G = 67; (d) Gin´ 123; n/G = 23.4; sd(ng ) = 14.8. n/G = 39.4; sd(ng ) = 16.7. 4 80 3 60 Frequency Frequency 2 40 1 20 0 0 0 10 20 30 40 0 20 40 60 80 100 Cluster (constituency) size Cluster (village) size undeln (2012); n = 868; G = 39; (e) Ichino and Sch¨ (f) Imai, Jiang and Malani (2021); n = 11, 089; n/G = 22.3; sd(ng ) = 9.6. G = 437; n/G = 25.4; sd(ng ) = 16.7. Notes: This figure shows the distribution of cluster sizes in six partial population experiments; n denotes the total sample size; G denotes the number of clusters; n/G denotes the average cluster size; sd(ng ) denotes the standard deviation of cluster sizes. Average values across studies are n = 7, 781; G = 180; n/G = 40.4; sd(ng ) = 20. Median values across studies are n = 2, 880; G = 123; n/G = 25.6; sd(ng ) = 16.7. The data source for Cr´ epon et al. (2013) is: DARES (2010) “Enquˆete aupr` ` la prestation d’insertion jeunes diplˆ ´ ligibles a es des jeunes e es,” Progedo-Adisp. doi:10.13144/lil-1596. om´ 31 Figure 2: Power functions - numerical illustration 1.0 power = 0.80 0.8 power = 0.69 0.6 Power power = 0.48 0.4 0.2 MDE = 0.29 0.0 −0.4 −0.2 0.0 0.2 0.4 Difference in means Notes: This figure illustrates how ignoring heterogeneity can result in severely underpowered experiments. We consider the simple setting of a cluster RCT with a few “large” clusters and variation in the distribution of outcomes across clusters. We assume 200 clusters, with 10 clusters containing 100 units each and the remaining 190 clusters containing 25 units each. The figure plots three power functions corresponding to different variance formulas: the short-dashed curve depicts the power function for the variance formula that accounts for clustering assuming equally-sized clusters. The long-dashed curve depicts the power function using a variance formula that accounts for variation in cluster sizes. The solid curve depicts the power function using a variance formula that accounts for heterogeneity in both cluster sizes and outcome distributions. Given this sample size, the MDE at 80% power, ignoring cluster heterogeneity, is 0.29. Accounting for cluster size heterogeneity decreases the power to detect an effect of 0.29 from 80% to 69%. Accounting for both sources of heterogeneity decreases the power further to 48%. 32 Figure 3: Direct and spillover effects on property tax payments in high-saturation blocks Direct Effects Spillover Effects Treated vs. Pure Control Untreated vs. Pure Control 7 Intervention 7 Intervention begins Due date of begins Due date of Oct'20 bill 6 Oct'20 bill 6 5 Treatment effect (p.p.) Treatment effect (p.p.) 5 4 4 3 3 2 2 1 1 0 0 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 Calendar date Calendar date Above Median Compliance 7 Intervention 7 Intervention begins Due date of begins Due date of Oct'20 bill Oct'20 bill 6 6 Treatment effect (p.p.) Treatment effect (p.p.) 5 5 4 4 3 3 2 2 1 1 0 0 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 Calendar date Calendar date Below Median Compliance 7 Intervention 7 Intervention begins Due date of begins Due date of Oct'20 bill 6 Oct'20 bill 6 5 Treatment effect (p.p.) Treatment effect (p.p.) 5 4 3 4 2 3 1 0 2 1 0 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 Calendar date Calendar date Notes: These figures show the coefficients and 95% confidence intervals from a saturated regression that computes, at each calendar day, the payment rate difference between treated and untreated groups relative to the pure control group (i.e., blocks where no accounts were treated). We focus the attention on blocks where 80% of the units were treated. The three left figures exhibit the direct effects on treated accounts, and the three right figures present the spillover effects on untreated accounts. The top panel includes all the observations in high-saturation blocks, the middle panel focuses on blocks with baseline compliance above the median, and the bottom panel focuses on blocks with baseline compliance below the median. We define compliance as the share of bills paid by block in 2019. The median compliance is 0.56 (see Figure B.15). Standard errors are clustered by block. The first vertical bar shows the due date for the September 2020 bill. This corresponds to a bill issued and due for payment before our intervention began, thus serving as a placebo. The second vertical bar indicates the start of the intervention. The letters were delivered between September 28th and October 7th. 33 Table 1: MDEs with constrained and optimal choice of cluster probabilities Scenario 1 Scenario 2 Scenario 3 Het Homog Equal Het Homog Equal Het Homog Equal Restricted qt q0 0.408 0.408 0.408 0.388 0.388 0.388 0.272 0.272 0.272 q1 0.199 0.209 0.230 0.219 0.231 0.239 0.273 0.283 0.286 q2 0.194 0.173 0.131 0.173 0.149 0.135 0.182 0.162 0.157 q3 0.199 0.209 0.230 0.219 0.231 0.239 0.273 0.283 0.286 MDEs M DE1 0.145 0.047 0.022 0.042 0.027 0.022 0.033 0.025 0.023 M DE2 0.146 0.051 0.029 0.047 0.033 0.029 0.039 0.032 0.030 M DE3 0.146 0.051 0.030 0.047 0.034 0.030 0.040 0.033 0.032 Optimal qt q0 0.364 0.352 0.314 0.348 0.328 0.313 0.332 0.315 0.309 q1 0.212 0.219 0.238 0.221 0.231 0.238 0.229 0.237 0.240 q2 0.211 0.211 0.210 0.210 0.210 0.210 0.210 0.210 0.210 q3 0.212 0.219 0.238 0.221 0.231 0.238 0.229 0.237 0.240 MDEs M DE1 0.145 0.047 0.023 0.043 0.028 0.023 0.033 0.026 0.024 M DE2 0.145 0.049 0.025 0.045 0.030 0.025 0.036 0.028 0.027 M DE3 0.146 0.051 0.030 0.047 0.034 0.030 0.040 0.033 0.032 Scenario Statistics n 84,175 81,961 68,808 G 4,139 4,138 3,982 min ng 8 8 8 max ng 2,754 376 50 mean(ng ) 20.5 19.8 17.3 sd(ng ) 45.9 17.5 8.3 Notes: This table shows the cluster assignment probabilities and MDEs for the binary outcomes of interest. The parameters of interest are the difference in means between untreated units in each treated group and the pure control units, βn (0, t), with t = 1, 2, 3 indicating the groups with 20%, 50%, and 80% treated units, respectively. We refer to the corresponding MDEs for the parameters βn (0, 1), βn (0, 2) and βn (0, 3) as M DE1 , M DE2 and M DE3 , respectively. Scenario 1 exhibits “substantial” heterogeneity, scenario 2 has “moderate” heterogeneity, and scenario 3 presents “limited” heterogeneity. n denotes the sample size; G denotes the number of clusters; min ng and max ng show the smallest and largest cluster; mean(ng ) is the average cluster size; sd(ng ) is the standard deviation of cluster sizes. In each scenario, we consider the results obtained with our general formula from Theorem 1 (“Het”), the formulas that rule out between-cluster moment heterogeneity from Corollary 1 (“Homog”) and the formulas that assume homogeneous, equally-sized clusters (“Equal”). Panels 1 and 2 show the constrained and optimal cluster assignment probabilities and their corresponding MDEs. Panels 3 and 4 show the optimal cluster assignment probabilities and their corresponding MDEs. 34 Table 2: Direct and spillover effects on property tax payments Early Payments On-time & Late Payments Below Above Below Above All Median Median All Median Median (1) (2) (3) (4) (5) (6) A. Blocks with 80% treated Treated 0.96*** 0.86** 1.06** 4.55*** 4.12*** 5.09*** (0.28) (0.34) (0.42) (0.74) (0.79) (0.81) Untreated 1.10** 0.55 1.58** 0.79 -1.25 2.56** (0.43) (0.50) (0.67) (1.01) (1.16) (1.27) B. Blocks with 50% treated Treated 1.07*** 1.24** 1.02 4.87*** 4.81*** 5.67*** (0.41) (0.50) (0.62) (0.93) (1.07) (1.08) Untreated -0.02 0.10 -0.03 -0.10 1.34 -0.76 (0.34) (0.43) (0.50) (0.91) (1.00) (1.14) C. Blocks with 20% treated Treated 0.69* 0.85* 0.52 4.97*** 5.41*** 4.40*** (0.42) (0.52) (0.63) (0.99) (1.21) (1.27) Untreated 0.11 0.68** -0.42 -0.18 0.61 -1.09 (0.26) (0.33) (0.38) (0.72) (0.77) (0.82) Payment rate of pure control 5.15 3.63 6.49 34.37 23.53 43.91 Observations 68,806 32,361 36,445 68,806 32,361 36,445 Number of clusters (blocks) 3,981 2,013 1,968 3,981 2,013 1,968 Notes: This table shows the results from saturated OLS regressions. The dependent variable in columns (1)-(3) is an indicator for paying the October 2020 bill by October 3rd (early payments); while in columns (4)-(6) we use an indicator for paying the October 2020 bill by October 31st (includes early, on time, and overdue payments). Columns (2) and (3) break the main result from column (1) into blocks below and above median compliance in 2019. We define compliance as the share of bills paid by block in 2019 with median value of 0.56 (see Figure B.15). Each column corresponds to a separate regression. The omitted category corresponds to blocks where no accounts were treated (pure control). Panel A shows the results for blocks where 80% were treated, panel B for blocks with 50% treated, and panel C for blocks with 20% treated. The letters were delivered between September 28th and October 7th. The due date for the October 2020 bill was October 9th. The row Payment rate of pure control displays the constant of each regression, corresponding to the average payment rate in blocks with no treated units. Standard errors clustered by blocks are reported in parentheses. * p<0.10, ** p<0.05, *** p<0.01 35 Supplementary Materials for: “Design of Partial Population Experiments with an Application to Spillovers in Tax Compliance” A-1 A Further Details on Experimental Design A.1 Additional Material Figure A.1: Example of the intervention letter ID: XXXXX CAP. MADARIAGA N° LOCALIDAD: 11 de Septiembre 1657 XXXXXX/7 XXXXX/7 Cuota 10 vencimiento 10 de octubre 2020: 347,29 Deuda año en curso*: 1.702,58 Deuda años anteriores*: 289,54 * Al 15/09/2020 Notes: This figure shows an anonymized example of the letters sent during the intervention between September 28th and October 7th, 2020. The headline reads: “Your municipal taxes are now available on the electronic bill.” The information below the headline contains the name of the account holder, the address, and the account number. The main text of the letter reads: “We would like to tell you that now in Tres de Febrero your municipal General Service Fee (TSG) bill is 100% digital. In other words, paper is no longer used. You can access it and pay for it from your cell phone or computer. In this way, we take care of each other by reducing circulation and we also take care of the environment. It is a difficult situation and we appreciate the effort you are making to keep up with your taxes, because that translates directly into constructions and services that do not stop in your neighborhood. We inform you of the status of your account and show you how easy it is:” The table below this text shows the account number, the amount due in the October 2020 billing period, the amount of past due debt from previous months of 2020, and the amount of past due date from earlier years. The large box below the table explains: (1) how to sign up for electronic billing, and (2) how to pay the bill and the different means of payment (online or in person). Finally, below the box, the text reads: “For questions, contact us at reclamos.mistasas@tresdefebrero.gov.ar. If this letter arrived by mistake at your address, inform us in that same email. Many thanks!” A-2 Figure A.2: Map of the municipality with the experimental design Notes: This figure shows a map of the municipality where the 2-level randomized communication campaign took place. We highlight the group-level assignment of blocks (cuadras) with different colors: pure control blocks with 0% treated (light green), blocks with 20% treated accounts (green), blocks with 50% treated (blue), and blocks with 80% treated (dark blue). We use gray for blocks that were not part of the experiment (e.g., industrial or commercial blocks). A-3 Figure A.3: A Partial Population Design Sample of clusters q0 q1 q2 q3 Pure control 20% treated 50% treated 80% treated 0.8 0.2 0.5 0.5 0.2 0.8 C C T C T C T Notes: In a partial population design, clusters are first randomly assigned to different treatment intensities or saturations. Within each cluster, units are randomly assigned to treatment with a probability equal to their cluster saturation. The figure above shows an example of a partial population design with four saturations, including pure control clusters with no treated units. Figure A.4: Timeline of the randomized communication campaign 25,000 letters delivered September 28 October 7 October 9 Timeline First day Last day October 2020 2020 of campaign of campaign bill is due A.2 Sample Sizes and Descriptive Statistics Table A1: Sample sizes Blocks Control Obs Treated Obs Tg =0 Pure control 1, 102 19, 103 0 Tg =1 20% treated 1, 100 15, 060 3, 853 Tg =2 50% treated 680 5, 905 5, 897 Tg =3 80% treated 1, 100 3, 677 15, 311 Total 3, 982 43, 745 25, 061 Notes: This table shows the final sample sizes used in our experiment. We limit the analysis to clusters of size ranging between 8 and 50 property tax accounts per street-block. A-4 Table A2: Descriptive statistics in 2019 (baseline year) Blocks Obs Mean SD ICC Paid the twelve bills in 2019 3, 981 68, 808 0.449 0.497 0.062 Paid at least one bill in 2019 3, 981 68, 808 0.650 0.477 0.071 Paid six bills or more in 2019 3, 981 68, 808 0.572 0.495 0.073 Notes: This table shows descriptive statistics about the frequency of payments in 2019. This is the baseline year we used for the randomization, power calculations, and simulations. The data set is restricted to blocks with size between 8 and 50 accounts. Figure 1 shows the distribution of accounts per block. Our sample size consists of 68,808 accounts distributed in 3,982 blocks. The frequency of payments is very polarized. About 45 percent of the accounts paid the twelve bills and about 35 percent did not pay any bill. We call these two core groups always payers and never payers, respectively. The perfect compliance rate of 45 percent is presumably low and, therefore, leaves room for potential behavioral responses from non-compliant and partially-compliant neighbors. A-5 Figure A.5: Distribution of payment date for treated, untreated, and pure control (October 2020 billing period) (a) Treated vs. Pure Control .15 Due date of Treated October bill Pure Control .1 Fraction .05 0 30sep2020 14oct2020 28oct2020 11nov2020 25nov2020 09dec2020 23dec2020 (b) Untreated vs. Pure Control .15 Due date of Untreated October bill Pure Control .1 Fraction .05 0 30sep2020 14oct2020 28oct2020 11nov2020 25nov2020 09dec2020 23dec2020 Notes: These figures show the fraction of individuals paying the October 2020 bill before and after the due date (October 9th, 2020). Panel (a) shows the distribution of payments for treated units (in blue) relative to pure control units (in red). We pool together treated units from Tg = 1, 2, 3. Panel (b) shows the distribution of payments for untreated units (in blue) relative to pure control units (in red). We pool together untreated units from Tg = 1, 2, 3. The area of each histogram integrates to one. A larger bar on a particular date means that the payment frequency of the corresponding group is higher than the other group. A-6 A.3 Balance Checks We ran balance test checks to verify the comparability of the treated, untreated, and pure control groups in terms of demographic and account-related characteristics in 2019. We jointly estimate the parameters of interest through the following saturated OLS regression: 3 3 Xig = α + θt 1(Tg = t)(1 − Dig ) + τt 1(Tg = t)Dig + εig (12) t=1 t=1 where Xig is one of the account holder or dwelling characteristics contained in our baseline data. We allow εig to be correlated within blocks and use a cluster-robust variance estimator. In this regression, θt captures the average difference of Xig of untreated units in groups with Tg = t relative to the pure control group, and τt captures the average difference of Xig of treated units in groups with Tg = t relative to the pure control group. The results are reported in Table A3 and reassuringly confirm that our groups are highly balanced. The null effect on timely payments (i.e., excluding past-due payments) of the September 2020 bill—the bill prior to our intervention— sheds further light on the balance between groups (see Figure B.9). A-7 Table A3: Balance test saturated regressions Property Front House Tenant Tenant Bill N Bills Digital Value Metres type Male Age amount paid 2019 payment (1) (2) (3) (4) (5) (6) (7) (8) A. Blocks with 80% treated: Treated 0.01 −8.27 −0.00 −0.00 −0.14 2.81 0.05 −0.00 (0.02) (17.77) (0.00) (0.01) (0.40) (7.81) (0.09) (0.01) Untreated 0.00 −1.76 0.00 0.00 −0.53 6.27 −0.06 −0.00 (0.02) (20.70) (0.01) (0.01) (0.53) (12.95) (0.12) (0.01) B. Blocks with 50% treated: Treated 0.01 12.65 −0.00 −0.00 −0.47 1.16 0.03 0.00 (0.02) (20.38) (0.01) (0.01) (0.50) (9.21) (0.11) (0.01) Untreated 0.01 25.30 −0.00 −0.00 −0.42 1.88 0.02 0.01 (0.02) (20.66) (0.01) (0.01) (0.48) (9.66) (0.11) (0.01) C. Blocks with 20% treated: Treated 0.02 32.57* −0.01 0.01 0.10 5.94 0.07 −0.01 (0.02) (16.79) (0.01) (0.01) (0.54) (9.55) (0.12) (0.01) Untreated 0.02 19.14 −0.01 −0.01 0.12 1.32 0.00 0.00 (0.02) (14.05) (0.00) (0.01) (0.40) (7.77) (0.09) (0.01) Mean Pure Control 13.64 841.50 0.91 0.62 19.15 368.66 6.71 0.35 Observations 64,932 68,808 68,808 46,419 52,714 68,808 68,808 38,112 Number of clusters 3,979 3,981 3,981 3,973 3,976 3,981 3,981 3,968 Notes: This table shows balance test regressions to formally test for differences in observable characteristics between the treatment and control groups. Each column corresponds to a separate regression (equation (12) in the text). The dependent variables in each column are: (1) the log of assessed property value; (2) the front metres of the property; (3) an indicator for the property being a house versus a house with a store; (4) whether the tenant is male; (5) a proxy for the tenant’s age (first two digits of the ID); (6) the amount paid in the bill corresponding to December 2019 (including zeroes); (7) the number of bills paid in 2019 (the maximum is 12); (8) for those who paid, whether they did so digitally. The row Mean Pure Control displays the constant of each regression, corresponding to the average of the dependent variable for accounts in blocks with no treated units (Tg = 0). Missing/non-missing indicators for the dependent variables with missing observations (columns 1, 4, 5 and 8) are also balanced between groups (results not reported). Standard errors clustered by blocks are reported in parentheses. * p<0.10, ** p<0.05, *** p<0.01 A.4 Further Details on Within-Group Assignment Mechanisms Fixed Margins. The within-group treatment is often assigned by choosing a fixed (i.e. nonrandom) number of treated units within each group. Given Tg = t, suppose the researcher wants to assign a A-8 proportion pt of, or a total of ng pt , units to treatment. Assigning exactly ng pt units to treatment is not possible when ng pt is not an integer. We propose the following procedure to deal with this issue. Define an independent binary random variable ξg and let the number of treated units in cluster g be: 1 Ng = ng pt + ξg 1(ng pt ∈ / N). so that ξg plays the role of an adjusting factor that randomly rounds the number of treated up or down. Given Tg = t, set the probability that ξg = 1 to:  0 if ng pt ∈ N Pg [ξg = 1|Tg = t] = n p − n p if ng pt ∈ / N. g t g t This implies that, given Tg = t, the expected number of treated units in group g is ng pt and that Pg [Dig = 1|Tg = t] = pt . Then, given Tg = t, the expected number of treated units in group g is ng pt and that Pg [Dig = 1|Tg = t] = pt . More precisely, 1 E[Ng |Tg = t] = ng pt + E[ξg |Tg = t]1(ng pt ∈ / N) = ng pt + (ng pt − ng pt )1(ng pt ∈ / N) = ng pt using that ng pt = ng pt when ng pt ∈ N. It follows that: ñ 1 ô Ng E Tg = t = P[Dig = 1|Tg = t] = pt ng 0 1 which doesn’t vary across groups conditional on Tg = t. On the other hand, defining Ng = ng − Ng , we have that: ñ 0 ô Ng E Tg = t = P[Dig = 0|Tg = t] = 1 − pt . ng Next, for this assignment mechanism, ñ 1 Ç 1 å ô Ng Ng −1 P[Dig = 1, Djg = 1|Tg = t] = E Tg = t ng ng − 1 1 2 1 E[(Ng ) |Tg = t] − E[Ng |Tg = t] = ng (ng − 1) A-9 where ) |Tg = t] = E[( ng pt + ξg 1(ng pt ∈ 1 2 E[(Ng / N))2 |Tg = t] = n2g pt 1(ng pt ∈ N) 2 Pg [ξg = 0|Tg = t] 1(ng pt ∈ Ä 2 ä + ( ng pt + 1)2 Pg [ξg = 1|Tg = t] + ng pt / N) = n2g pt 1(ng pt ∈ N) 2 (1 − ng pt − ng pt ) 1(ng pt ∈ Ä 2 ä + ( ng pt + 1)2 (ng pt − ng pt ) + ng pt / N). Similarly, 0 2 0 E[(Ng ) |Tg = t] − E[Ng |Tg = t] P[Dig = 0, Djg = 0|Tg = t] = ng (ng − 1) where 0 2 1 2 E[(Ng ) |Tg = t] = E[(ng − Ng ) |Tg = t] = n2 1 2 2 g + E[(Ng ) |Tg = t] − 2ng pt Notice that even if P[Dig = d|Tg = t] does not change across g , the joint probabilities do. Nevertheless, these terms can be calculated for any sample using the chosen probabilities pt and the cluster sizes {ng }G g =1 . Bernoulli Trials. Alternatively, the within-cluster treatment may be assigned to each unit independently as a “coin flip” with probability pt . Under this mechanism, independence between treatment indicators implies that: P[Dig = 1|Tg = t] = P[Dig = 1|Tg = t] = pt P[Dig = d, Djg = d|Tg = t] = P[Dig = d|Tg = t]2 . which do not vary over g . It follows that: Ç å g ng (ng − 1)P[Dig = d, Djg = d|Tg = t] 1−d g n2 g = pd t (1 − pt ) −1 g ng P [ Dig = d |Tg = t] n Then the variances are approximated by: ® Ç å´ ® Ç å´ 2 ˆ0t ] ≈ σ (0t) gn2 g σ 2 (00) gn2 g V[ β 1 + ρ0t (1 − pt ) −1 + 1 + ρ00 −1 nqt (1 − pt ) n nq0 n and ® Ç å´ ® Ç å´ 2 ˆ1t ] ≈ σ (1t) gn2 g σ 2 (00) gn2 g V[β 1 + ρ1t pt −1 + 1 + ρ00 −1 . nqt pt n nq0 n B-10 B Additional Empirical Results B.1 Further Details and Figures for Main Results B-11 Figure B.6: Payment rates: Treated groups vs Pure control blocks (a) Payment rates in levels Due date of 40 Intervention October 2020 Treated groups begins billing period Cumulative payment rate (%) 30 Pure Control 20 10 0 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 Calendar date (b) Difference relative to pure control group Intervention Due date of T50 5 begins October 2020 billing period T80 T20 4 Treatment effect (p.p.) 3 2 1 0 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 Calendar date Notes: These figures show the effect of the intervention on payments of the October 2020 bill for treated groups. Panel (a) shows the cumulative share of individuals paying the October 2020 bill over time. The brown dashed line shows the payment rate for pure control units. The blue dashed line corresponds to treated units in group Tg = 1 (blocks with 20% treated). The black dashed line corresponds to treated units in group Tg = 2 (blocks with 50% treated). The red solid line corresponds to treated units in group Tg = 3 (blocks with 80% treated). Panel (b) shows, for each calendar date, the difference between each treated group and the pure control group (treatment effect coefficients). The letters were delivered between September 28th and October 7th. The first vertical bar denotes the start of the intervention. The due date was October 9th and is indicated with another vertical bar. B-12 Figure B.7: Payment rates: Untreated groups vs Pure control blocks (a) Payment rates in levels Due date of Intervention 40 begins October 2020 billing period Untreated groups Cumulative payment rate (%) 30 Pure Control 20 10 0 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 Calendar date (b) Difference relative to pure control group Due date of Intervention 5 begins October 2020 billing period Treated (pooled) 4 Treatment effect (p.p.) 3 2 1 T80 T50 0 T20 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 Calendar date Notes: These figures show the effect of the intervention on payments of the October 2020 bill for untreated groups. Panel (a) shows the cumulative share of individuals paying the October 2020 bill over time. The brown dashed line shows the payment rate for pure control units. The blue dashed line corresponds to untreated units in group Tg = 1 (blocks with 20% treated). The black dashed line corresponds to untreated units in group Tg = 2 (blocks with 50% treated). The red solid line corresponds to untreated units in group Tg = 3 (blocks with 80% treated). Panel (b) shows, for each calendar date, the difference between each untreated group and the pure control group (treatment effect coefficients). For comparison, the gray solid line shows the treatment effects for treated units (pooled from Tg = 1, 2, 3). The letters were delivered between September 28th and October 7th. The first vertical bar denotes the start of the intervention. The due date was October 9th and is indicated with another vertical bar. B-13 Figure B.8: Direct effects on treated accounts and spillover effects on untreated accounts Direct Effects Spillover Effects Treated vs. pure controls Untreated vs. pure controls 7 Intervention 7 Intervention begins Due date of begins Due date of Oct'20 bill 6 Oct'20 bill 6 5 Treatment effect (p.p.) Treatment effect (p.p.) 5 4 4 3 3 2 2 1 1 0 0 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 Calendar date Calendar date Blocks with 50% treated Blocks with 50% treated 7 Intervention 7 Intervention begins Due date of begins Due date of Oct'20 bill 6 Oct'20 bill 6 5 Treatment effect (p.p.) Treatment effect (p.p.) 5 4 4 3 3 2 2 1 0 1 0 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 Calendar date Calendar date Blocks with 20% treated Blocks with 20% treated 7 Intervention 7 Intervention begins Due date of begins Due date of Oct'20 bill 6 Oct'20 bill 6 5 Treatment effect (p.p.) Treatment effect (p.p.) 5 4 4 3 3 2 2 1 1 0 0 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 25sep2020 02oct2020 09oct2020 16oct2020 23oct2020 30oct2020 Calendar date Calendar date Notes: These figures show the coefficients and 95% confidence intervals from a saturated regression that computes, at each calendar day, the payment rate difference between each treated and untreated group relative to the pure control group (i.e., blocks where no accounts were treated). The top panel shows the effect on treated (left) and untreated (right) units in blocks with 80% treated (Tg = 3). The middle panel shows the effect on treated (left) and untreated (right) units in blocks with 50% treated (Tg = 2). The bottom panel shows the effect on treated (left) and untreated (right) units in blocks with 20% treated (Tg = 3). These point estimates coincide with those reported in panel (b) of Figures B.6 and B.7. Standard errors are clustered by block. The first vertical bar denotes the start of the intervention. The due date for the October 2020 bill was October 9th and is indicated with another vertical bar. The letters were delivered between September 28th and October 7th. B-14 Figure B.9: Placebo. Direct and spillover effects for the pre-intervention Sep’20 bill Treated groups Untreated groups vs. pure controls vs. pure controls Blocks with 80% treated Blocks with 80% treated Intervention Intervention 7 Due date of begins 7 Due date of begins Sept 2020 bill 6 Sept 2020 bill 6 5 5 Treatment effect (p.p.) Treatment effect (p.p.) 4 4 3 3 2 2 1 1 0 0 01sep2020 08sep2020 15sep2020 22sep2020 29sep2020 01sep2020 08sep2020 15sep2020 22sep2020 29sep2020 Calendar date Calendar date Blocks with 50% treated Blocks with 50% treated Intervention Intervention 7 begins 7 Due date of begins Due date of Sept 2020 bill 6 Sept 2020 bill 6 5 Treatment effect (p.p.) Treatment effect (p.p.) 5 4 4 3 3 2 2 1 1 0 0 01sep2020 08sep2020 15sep2020 22sep2020 29sep2020 01sep2020 08sep2020 15sep2020 22sep2020 29sep2020 Calendar date Calendar date Blocks with 20% treated Blocks with 20% treated Intervention Intervention 7 Due date of begins 7 Due date of begins Sept 2020 bill 6 Sept 2020 bill 6 5 Treatment effect (p.p.) Treatment effect (p.p.) 5 4 4 3 3 2 2 1 1 0 0 01sep2020 08sep2020 15sep2020 22sep2020 29sep2020 01sep2020 08sep2020 15sep2020 22sep2020 29sep2020 Calendar date Calendar date Notes: These figures show the coefficients and 95% confidence intervals from a saturated regression that computes, at each calendar day, the payment rate difference between each treated and untreated group relative to the pure control group (i.e., blocks where no accounts were treated). The top panel shows the effect on treated (left) and untreated (right) units in blocks with 80% treated (Tg = 3). The middle panel shows the effect on treated (left) and untreated (right) units in blocks with 50% treated (Tg = 2). The bottom panel shows the effect on treated (left) and untreated (right) units in blocks with 20% treated (Tg = 3). Standard errors are clustered by block. The first vertical bar shows the due date for the September 2020 bill. This corresponds to a bill issued and due for payment before our intervention began, thus serving as a placebo. The second vertical bar indicates the start of the intervention. The letters were delivered between September 28th and October 7th. B-15 Table A4: Placebo. Direct and spillover effects for the pre-intervention September 2020 bill Dependent variable: Placebo bill (September 2020) Pr(pay the bill) All Below Median Above Median (1) (2) (3) A. Blocks with 80% treated Treated 0.12 0.10 0.28 (0.69) (0.73) (0.81) Untreated -0.30 -1.55 0.78 (0.95) (1.09) (1.24) B. Blocks with 50% treated Treated 0.76 1.54 0.69 (0.88) (0.99) (1.12) Untreated 0.26 0.81 0.36 (0.88) (0.94) (1.15) C. Blocks with 20% treated Treated 0.85 1.32 0.27 (0.93) (1.11) (1.24) Untreated 0.07 0.27 -0.32 (0.68) (0.72) (0.80) Payment rate of pure control 29.70 20.05 38.19 Observations 68,806 32,361 36,445 Number of clusters (blocks) 3,981 2,013 1,968 Notes: Notes: This table shows the results from saturated OLS regressions using as dependent variable an indicator for paying the September 2020 bill by September 15th (pre-intervention). Each column corresponds to a separate regression. The omitted category corresponds to blocks where no accounts were treated (pure control). Panel A shows the results for blocks where 80% were treated, panel B for blocks with 50% treated, and panel C for blocks with 20% treated. Columns (2) and (3) split the sample from column (1) into blocks below and above median compliance in 2019, respectively. We define compliance as the share of bills paid by block in 2019 with median value of 0.56 (see Figure B.15). The estimates reported in column (1) correspond exactly to the numbers shown in Figure (B.9). The row Payment rate of pure control displays the constant of each regression, corresponding to the average payment rate in blocks with no treated units. Standard errors clustered by blocks are reported in parentheses. * p<0.10, ** p<0.05, *** p<0.01 B-16 B.2 Other Margins Subscriptions to electronic billing. We find evidence that our tax communication campaign also in- creases the subscriptions to receive an electronic bill by e-mail.1 These effects are greater in high- saturation blocks, albeit small in absolute value. Appendix Section B.2.1 presents graphical evidence of direct and spillover effects (Figure B.10), which are then summarized in Table A5, although spillover effects in this outcome are much more tenuous. Backward and forward payments. We also find that the effects of our letters are not solely concentrated on the October 2020 billing period (the bill targeted by our intervention). Section B.2.2 presents graphical evidence that the letters also increased the payment rates in subsequent billing periods. Perhaps more strikingly, we also show that some neighbors made backward payments to cancel past-due debt from previous billing periods. This is especially prominent after April 2020 when the COVID-19 lockdown measures were established in Argentina (See Figure B.11). B.2.1 Effects on Subscriptions to Electronic Billing The communication campaign also included information about how to sign up for electronic billing, a system introduced in June 2020. We briefly analyze the effect of our mailing on subscription to this service. We rely on a database that contains the individuals who signed up for the electronic billing option. This database goes through December 2020 and contains the account number, date of subscription, and email address. This source is linked with the main data through the unique account identifier. We analyze the intervention’s effect on subscriptions to electronic billing and present convincing graphical evidence that the tax communication campaign increased subscriptions to receive an electronic bill by e-mail. These effects are greater in high-saturation blocks, albeit small in absolute value. The results are summarized in Figure B.10, which follows a similar structure as Figure B.8 but for e-bill subscriptions. We run dynamic difference-in-differences comparing subscription rates between each treated and each untreated group relative to pure control blocks, day by day (fixing September 27, 2020, as the baseline date). Four important points are worth highlighting: (1) trends are generally parallel, as we estimate no significant differences between the treatment and control groups prior to the intervention; (2) the difference in subscription rates between treated accounts and pure control blocks experiences a noticeable break at the time we started sending letters, which is reassuring and implies that the effects we estimate are indeed caused by our experiment; (3) direct effects are greater in high-saturation blocks with 50% and 80% treated units relative to low-saturation blocks where only 20% received the letter. As happened with payment 1 Note that nudging individuals to sign up for e-billing was an explicit content of the letter (see Figure A.1). B-17 rates, this could be interpreted as a spillover effect, whereby the intervention creates interference between treated units strengthening the effect of the letter; and (4) although less clear than the left-hand-side panels for treated units, the right-hand-side panels of Figure B.10 also suggest the presence of spillover effects in subscriptions to e-billing for untreated accounts in high-saturation blocks. As was the case with payment rates, these effects are harder to detect. They are precisely estimated but only significant at the 5% level at the beginning of the intervention. Lastly, Table A5 summarizes the corresponding diff-in-diffs estimates reported in Figures B.10, with the same structure as Table 2.2 To benchmark our estimates, in the last row we report the share of e-bill subscribers in pure control blocks on September 27 (our baseline date). For treated accounts, the table shows an immediate effect in the three saturation groups that increases over time. This effect is higher in blocks with 80% treated units, consistent with interference that strengthens the effect. In such blocks, the total effect reaches 0.86 percentage points by the end of October. Although this represents about 20% of the baseline 4.25% share of e-bill subscribers, we find it striking that so few individuals switched to the digital bill. In the case of untreated accounts, spillover effects on subscription rates are smaller and, therefore, much harder to detect than in the analysis of payment rates. The clearest effect arises in blocks with 50% treated accounts with a spillover effect of 0.25 percentage points, significant at the 10% level. The somewhat absence of spillovers in this case can be explained by the fact that the outcome of analysis (subscription rate) has very low take up, making it harder for interference between neighbors to emerge. In sum, we find that our tax communication campaign also generates direct effects and spillover effects among neighbors in subscriptions to electronic billing. These effects are greater in high-saturation blocks, albeit small in absolute value. 2 Column (1) validates the experiment by showing a placebo saturated regression that compares subscription rates between each group and the pure control group on September 17, before the intervention began. None of the coefficients are statistically significant or large in magnitude. B-18 Figure B.10: Direct effects on treated accounts and spillover effects on untreated accounts (subscriptions to e-billing). Difference in differences Treated groups Untreated groups vs. pure controls vs. pure controls Blocks with 80% treated Blocks with 80% treated 1 Intervention 1 Intervention begins Due date of begins Due date of Oct'20 bill Oct'20 bill Treatment effect (p.p.) Treatment effect (p.p.) .5 .5 0 0 -.5 -.5 16sep2020 30sep2020 14oct2020 28oct2020 16sep2020 30sep2020 14oct2020 28oct2020 Calendar date Calendar date Blocks with 50% treated Blocks with 50% treated 1 Intervention begins Due date of 1 Intervention Oct'20 bill Due date of begins Oct'20 bill Treatment effect (p.p.) Treatment effect (p.p.) .5 .5 0 0 -.5 -.5 16sep2020 30sep2020 14oct2020 28oct2020 16sep2020 30sep2020 14oct2020 28oct2020 Calendar date Calendar date Blocks with 20% treated Blocks with 20% treated 1 Intervention 1 Intervention begins Due date of begins Due date of Oct'20 bill Oct'20 bill Treatment effect (p.p.) Treatment effect (p.p.) .5 .5 0 0 -.5 -.5 16sep2020 30sep2020 14oct2020 28oct2020 16sep2020 30sep2020 14oct2020 28oct2020 Calendar date Calendar date Notes: These figures show the coefficients and 95% confidence intervals from dynamic difference-in-differences regressions where the outcome of interest is a dummy equal to one if the account is subscribed to an electronic bill. All the coefficients are estimated with respect to September 27th, 2020 (baseline date) and relative to the pure control group (i.e., blocks where no accounts were treated). The top panel shows the effect on treated (left) and untreated (right) units in blocks with 80% treated (Tg = 3). The middle panel shows the effect on treated (left) and untreated (right) units in blocks with 50% treated (Tg = 2). The bottom panel shows the effect on treated (left) and untreated (right) units in blocks with 20% treated (Tg = 1). Standard errors are clustered by block. The first vertical bar denotes the start of the intervention. The due date for the October 2020 bill was October 9th and is indicated with another vertical bar. The letters were delivered between September 28th and October 7th. B-19 Table A5: Total effects and spillover effects for subscriptions to e-billing Dependent variable: Placebo: Intervention: Pr(subscribe to e-bill) By Sep 20 Early By Oct 31 (1) (2) (3) A. Blocks with 80% treated Treated -0.02 0.31*** 0.86*** (0.04) (0.06) (0.12) Untreated 0.04 0.11 0.08 (0.03) (0.08) (0.15) B. Blocks with 50% treated Treated 0.03 0.18** 0.81*** (0.03) (0.08) (0.18) Untreated -0.07 0.10 0.25* (0.05) (0.06) (0.13) C. Blocks with 20% treated Treated -0.04 0.15* 0.57*** (0.05) (0.08) (0.19) Untreated -0.01 0.05 -0.09 (0.03) (0.04) (0.09) Mean of Pure Control at baseline 4.25 4.25 4.25 Observations 137,612 137,612 137,612 Number of clusters (blocks) 3,981 3,981 3,981 Notes: This table shows the results from a saturated dynamic difference-in-differences regression where the dependent variable is an indicator for subscribing to electronic billing. The regression computes the outcome difference between each of the treated and untreated groups relative to the pure control group for each calendar date relative to September 27th, 2020 (baseline date). The estimates correspond exactly to the numbers shown in Figure (B.10). Column (1) shows the results for e-bill subscriptions made before the letters were delivered (placebo); Column (2) shows the results for early subscriptions right after the letters started to be delivered (by October 3); Column (3) shows the results for subscriptions made up to the end of October 2020. The letters were delivered between September 28 and October 7. The due date for the October 2020 bill was October 9th. The row Mean of Pure Control displays the constant of the regression, corresponding to the average susbcription rate for units in blocks with no treated units on September 27, 2020. Standard errors clustered by blocks are reported in parentheses. * p<0.10, ** p<0.05, *** p<0.01 B-20 B.2.2 Timing of Payments and Due Bills For completeness, we analyze the effects of the intervention on backward and forward payments corre- sponding to billing periods before and after month 10, the month of our intervention. These results are summarized in Figure B.11. Intuitively, neighbors can pay their property tax bill at any time before or after the due date, and hence, payments from previous billing periods can also be affected by our intervention.3 To illustrate this, the left panels of Figure B.11 only consider timely payments, defined as bills paid before the 27th of the corresponding month. We set any payment made after the 27th as unpaid in our data. Hence, pre- intervention bills mechanically exclude any past-due payment triggered by our intervention. In contrast, the right panels of Figure B.11 consider timely as well as past-due payments made until December 2020 and, thus, capture backward payments triggered by our intervention (e.g., individuals that decide to pay the October 2020 bill as well as previous unpaid bills after receiving the letter). The top figures show payment rates in levels for treated units (black line) and pure control units (gray line), for 24 consecutive monthly bills between January 2019 and December 2020. Treated units are pooled from groups Tg = 1, 2, 3. The bottom figures report total treatment effects—i.e., the difference between treated and pure control units—and 95% confidence intervals for the 24 billing periods. The first vertical bar denotes the start of the COVID-19 pandemic in Argentina, and the second vertical bar flags the October’20 bill targeted by our intervention. Four important points are worth noting: (1) Overall, payment rate levels are low. The top left panel shows that about 48% of households pay their bill before the 27th of each month. This share is relatively constant until March 2020, when the COVID-19 pandemic hit Argentina and payment rates decreased sharply to 23%; (2) a similar pattern emerges when we consider timely and past-due payments. The reason why levels are higher and decrease over time is that as time goes by, it is more likely that individuals cancel unpaid bills; (3) placebo direct effects (red line), based on payment rates constructed with timely payments only, are precisely estimated and not different from zero for the 21 pre-intervention bills. For the October 2020 bill, however, timely payments are 4.4 p.p. higher in treated units relative to control blocks. This is reassuring and implies that our sample is balanced and that the effects we estimate are indeed caused by our experiment; and (4) when we account for past-due payments, the blue line shows that our intervention nudged some individuals to catch up with unpaid bills. The difference in payment rates between treated and pure control accounts experiences a noticeable increase in the pandemic billing periods from April 2020 onward. Although the October bill, when the intervention took place, presents the highest effect (4.2 p.p.), the letters also had some residual positive effects in November and December. 3 The treatment letter included past due balances and could therefore induce neighbors to make backward payments to cancel debt. B-21 Figure B.11: Direct effects on pre- and post-intervention bills Timely payments Timely and past-due only payments (a) Payment rates in levels % paid the bill % paid the bill 60 COVID-19 Intervention COVID-19 Intervention billing periods bill 60 billing periods bill 55 55 50 50 45 45 Treated (pooled) Treated (pooled) 40 40 Pure Control Pure Control 35 35 30 30 25 25 20 20 2019m1 2019m4 2019m7 2019m10 2020m1 2020m4 2020m7 2020m10 2021m1 2019m1 2019m4 2019m7 2019m10 2020m1 2020m4 2020m7 2020m10 2021m1 Billing Period Billing Period (b) Difference relative to pure control group Treatment Treatment effect (p.p.) effect (p.p.) 6 COVID-19 6 COVID-19 billing periods billing periods 5 5 Oct 2020: Oct 2020: 4.4 p.p. 4.2 p.p. 4 4 3 3 2 2 1 1 0 0 -1 -1 2019m1 2019m4 2019m7 2019m10 2020m1 2020m4 2020m7 2020m10 2021m1 2019m1 2019m4 2019m7 2019m10 2020m1 2020m4 2020m7 2020m10 2021m1 Billing Period Billing Period Notes: These figures show the effect of the communication campaign on payment rates of pre- and post-intervention bills. The left panels only consider timely payments, defined as bills paid before the 27th of the corresponding month (i.e., any payment made after the 27th is considered unpaid). Hence, pre-intervention bills mechanically exclude any past-due payment triggered by our intervention. The right panels consider timely as well as past-due payments made until December 2020 and, thus, capture backward payments triggered by our intervention (e.g., individuals who, after receiving the letter, pay the October 2020 bill as well as previous unpaid bills). The top figures show payment rates in levels for treated units (black line) and pure control units (gray line), for 24 consecutive monthly bills between January 2019 and December 2020. Treated units are pooled from groups Tg = 1, 2, 3. The bottom figures report total treatment effects—i.e., the difference between treated and pure control units—and 95% confidence intervals for the 24 billing periods. The letters were delivered between September 28th and October 7th. The vertical bar denotes the start of the COVID-19 pandemic in Argentina. Each coefficient is estimated in separate regressions. Standard errors are clustered at the block level. The red line shows no difference on timely payments for pre-intervention bills. In contrast, when we account for past-due payments, the blue line shows that our intervention nudged some individuals to catch up with unpaid bills from April 2020 onwards. B-22 B.3 Are Untreated Blocks Affected by the Intervention? A crucial aspect of partial population experiments is the unit within which the experimenter will test the presence of spillovers. In some settings, these are relatively straightforward to establish: electoral precincts for political outcomes, towns for regional policies, and schools or school districts for educational interventions. In our application, we aim to measure information spillovers among taxpayers. Discussions with municipal tax authorities and with taxpayers, as well as the context of our intervention, led us to select city street blocks as the relevant clusters for potential information spillovers about tax reminders and deadlines and their effects on tax compliance. Specifically, the campaign was motivated by the sharp drop in compliance in April 2020 induced by the severe lockdown imposed in the Greater Buenos Aires area in Argentina during the COVID pandemic in a context where most payments were made in person (see Figure B.11). The lockdown was strongly enforced, and as a result, citizens’ mobility was severely limited, which justifies the choice of the city street block—a relatively small cluster—as the relevant unit for information spillover since it reflects the limited physical interactions generated by the lockdown. A further justification is the city’s street layout, which consists mainly of relatively homogeneous straight streets with orthogonal intersections in square/rectangular city blocks (see Figure A.2). A potential concern with this setup is that the city street block may not be the relevant unit to capture information spillovers. The random assignment process and the city’s physical layout imply that taxpayers in pure control street blocks (i.e., blocks where no one received a tax reminder) were still adjacent and/or surrounded by blocks with treated taxpayers, as shown by inspection of the map in Figure A.2. Inter- ference between adjacent blocks is possible, and this would induce a downward bias in our results, since individuals in pure control (untreated) blocks would be affected by the information campaign via spillovers from adjacent (treated) blocks. Our empirical setup allows for an auxiliary test to rule out this concern and establish that units in pure control blocks indeed provide a valid counterfactual in our analysis.4 To test the robustness of untreated blocks as pure controls, we leverage our experimental assignment process, which implies that the “intensity” of treatment in the surrounding blocks is random by definition. Pure control units are by chance surrounded by blocks with varying degrees of treatment intensity (0%, 20%, 50%, or 80%), and thus by a random number of treated taxpayers. If there is interference between treated and untreated blocks, we should observe that pure control payment rates increase with the exposure of untreated blocks to treated blocks. We construct our measure of the potential exposure of a street-block to the intervention in two steps. First, we use GIS software to calculate a buffer of 100, 200, and 300 meters around the centroid of each street-block (see the three figures in the top panel of Figure B.12), given the typical street block length of 100 meters. Second, for each street-block and radius, we calculate the share of properties receiving a letter (treated) relative to all the properties in the buffer zone. The three figures in the middle panel of Figure 4 This is an auxiliary analysis in the sense that while it exploits features of our experimental design, it does not correspond to our pre-registered analysis and only represents an ex-post robustness check. B-23 B.12 display the distribution of the share of treated units around pure control blocks. With this exposure measure at hand, we test whether payment rates of the October 2020 bill in pure control blocks increase with the exposure to the proportion of treated units in surrounding blocks. The figures in the bottom panel of Figure B.12 present parametric and non-parametric evidence of this rela- tionship. Each panel shows a binned scatterplot of payment rates of the October 2020 bill (y-axis) by equally-sized bins of exposure to treated units within the buffer zone (x-axis). Reassuringly, the rela- tionship is flat, and it is robust to increasing the size of the buffer zone to 200 and 300 meters. This is confirmed by the small linear regression coefficients and large p-values reported in these figures. Our main results indicated that we only found spillover effects in our main research design for high saturation blocks with high previous compliance, as illustrated by the results in Figure 3 and Table 2. We conduct a similar analysis with the exposure measure for the 100-meter buffer in Figure B.13. The parametric and non-parametric results presented there confirm a flat gradient for untreated blocks with both high and low compliance in 2019, further confirming that untreated blocks were not affected by the intervention even when considering this heterogeneity. Finally, for completeness, we also study the relationship between payment rates and exposure to ad- jacent treated blocks in blocks where 80% of the units were treated, again for the 100-meter buffer. The results of this exercise are reported in Figure B.14. The left panel corresponds to the October 2020 bill affected by our intervention, whereas the middle and right panels correspond to pre-intervention bills of July and August 2020. In all these cases, the relationship between exposure and payment rates is flat and statistically not significant for both the pure control blocks (with blue dots and blue linear fit) and the 80% saturation blocks (with red triangles and a red linear fit). Interestingly, the vertical distance between the red and blue linear fit in the left panel captures the treatment effect of our experiment, which is clearly uniform in the exposure measure. Taken together, the results from the exercise in this section indicate that pure control blocks were not affected by adjacent treated blocks, and thus provide a valid counterfactual for the analysis. In more general terms, information spillovers do not seem to have happened at a higher degree of aggregation than the city street block. When combined with the presence of information spillovers documented in the main body of the paper, the city street block seems to have been the relevant level of information dissemination for this campaign. B-24 Figure B.12: Robustness of untreated blocks as pure controls 100m buffer 200m buffer 300m buffer (a) Illustration (b) Exposure distribution (share of treated accounts around pure treatment blocks) Share of 100m buffer Share of 200m buffer Share of 300mts buffer accounts accounts accounts .08 .08 .08 .06 .06 .06 .04 .04 .04 .02 .02 .02 0 0 0 0 .1 .2 .3 .4 .5 .6 .7 .8 0 .1 .2 .3 .4 .5 .6 .7 .8 0 .1 .2 .3 .4 .5 .6 .7 .8 Share of treated accounts around pure control blocks Share of treated accounts around pure control blocks Share of treated accounts around pure control blocks (c) Payment rates for pure controls as a function of exposure Payment Payment Payment rate 100m buffer rate 200m buffer rate 300m buffer .5 .5 .5 b=.000 b=.049 b=.008 (p=.979) (p=.271) (p=.881) .45 .45 .45 .4 .4 .4 .35 .35 .35 .3 .3 .3 .25 .25 .25 .2 .2 .2 .05 .15 .25 .35 .45 .55 .05 .15 .25 .35 .45 .55 .05 .15 .25 .35 .45 .55 Exposure: 100mts Exposure: 200mts Exposure: 300m Notes: The top three panels illustrate the way we compute buffer zones around the centroid of each street-block using GIS tools in our data. We consider radiuses of 100 meters (left panel), 200 meters (middle panel), and 300 meters (right panel). The middle panel three figures show the distribution of accounts in pure control street-blocks according to their exposure to treated accounts. The bottom three panels show binned scatterplots of payment rates of the October 2020 bill (y-axis) in pure control blocks and their exposure to treated units within the buffer zone (x-axis). The x-axis is grouped into equally-sized bins. The coefficient and p-value of each regression are also reported in each panel. The regressions flexibly control for a cubic polynomial of the number of properties in the buffer zone. This variable is highly correlated with payment rates, and its omission leads to omitted variable bias. B-25 Figure B.13: Payment rates and exposure of untreated blocks above and below median 2019 compliance, 100 meters buffer Payment All Payment Above Median Payment Below Median rate rate Compliance rate Compliance .55 .55 .55 b=-0.018 (p=0.687) .5 .5 .5 b=.000 (p=.979) .45 .45 .45 .4 .4 .4 .35 .35 .35 b=-0.013 (p=0.745) .3 .3 .3 .25 .25 .25 .2 .2 .2 .15 .15 .15 .05 .15 .25 .35 .45 .55 .05 .15 .25 .35 .45 .55 .05 .15 .25 .35 .45 .55 Exposure: 100mts Exposure: 100mts Exposure: 100mts Notes: This figure shows binned scatterplots of payment rates (y-axis) in pure control blocks by equally-sized bins of exposure to treated units within a buffer zone of 100 meters (x-axis). The left panel replicates the bottom left panel of Figure B.12. The middle and right panels split pure control blocks into blocks with above- and below-median compliance defined in 2019, respectively. The regressions flexibly control for a cubic polynomial of the number of properties in the buffer zone. This variable is highly correlated with payment rates, and its omission leads to omitted variable bias. Figure B.14: Payment rates and exposure of untreated blocks and blocks with 80% treated units, 100 meters buffer Payment Payment July 2020 Payment August 2020 rate October 2020 rate rate Pre intervention Pre intervention .5 .5 .5 .45 .45 .45 Blocks with 80% treated .4 .4 .4 .35 .35 .35 Untreated blocks .3 .3 .3 .25 .25 .25 .05 .15 .25 .35 .45 .55 .05 .15 .25 .35 .45 .55 .05 .15 .25 .35 .45 .55 Exposure: 100mts Exposure: 100mts Exposure: 100mts Notes: This figure shows binned scatterplots of payment rates (y-axis) by equally-sized bins of exposure to treated units within a buffer zone of 100 meters (x-axis). The left panel shows the gradient in both untreated blocks (blue dots) and blocks with 80% treated units (red triangles) for the October 2020 bill (the one affected by the intervention). The middle and right panels correspond to the pre-intervention bills of July and August 2020, respectively. The regressions flexibly control for a cubic polynomial of the number of properties in the buffer zone. This variable is highly correlated with payment rates and its omission leads to omitted variable bias. B-26 B.4 Heterogeneity and Pre-treatment Tax Compliance Section 4.4.2 analyzes heterogeneous effects based on pre-treatment tax compliance behavior in the year 2019. The distribution of the 68,806 accounts by the number of bills paid in 2019 is bi-modal, with a core group of neighbors not paying any bills (35%) and another group paying all of them (45%). Panel (a) of Figure B.15 shows the individual-level distribution. We define past compliance by computing the average number of payments of the twelve monthly bills for 2019 in each block. Panel (b) of Figure B.15 shows the block-level distribution. We use this measure to divide our sample into two groups – those above and those below the median block average payment rate. The logic of this exercise goes as follows. A large fraction of neighbors who typically paid their bills stopped doing so during the pandemic in the first few months of 2020. This decrease in compliance was stronger in blocks that had higher compliance in 2019. Hence, we argue that such a core group of “good compliers” is more likely to be nudged to pay by our intervention, and where spillover effects are more likely to manifest. Figure B.16 suggests that 2018 and 2019 are comparable in terms of compliance, but compliance decreased substantially in 2020 because of the pandemic—the sharp fall corresponds to the lockdown measures put in place. Figure B.17 shows that payment rates in 2020 decreased more in blocks with higher compliance in 2019. In contrast, 2018 and 2019 show similar levels of compliance. This set of figures thus helps us rationalize the reasoning behind the heterogeneity analysis presented in Section 4.4.2. Table 2 confirms that spillover effects are driven by blocks with baseline compliance above the median in high saturation blocks (80% treated). Spillover effects are more muted and insignificant in medium (50% treated) and low (20% treated) saturation blocks, however. Reassuringly, the first two columns also show no effects for the pre-intervention bill of September 2020, either above or below the median. B-27 Figure B.15: Distribution of bill payments in 2019 for individuals and blocks (a) Number of monthly bills paid in 2019 (by individuals) Share of accounts .5 .4 45% .3 35% .2 .1 0 0 1 2 3 4 5 6 7 8 9 10 11 12 Number of bills paid in 2019 (b) Share of bills paid in 2019 (by blocks) N blocks 300 Mean = 0.59 p50 = 0.56 N = 3,981 200 100 0 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1 Share of bills paid in 2019 - by block Notes: Panel (a) shows the distribution of the 68,806 accounts by the number of bills paid in 2019. The distribution is bi-modal with a core group of neighbors not paying any bills (35%) and another group paying all of them (45%). Panel (b) uses the information from panel (a) to compute the share of total bills paid in 2019 for each block. We use this measure of block-level compliance for the heterogeneity analysis, to split our sample into blocks below and above the median of 0.56 (see Table 2). These two figures and values look very similar for the year 2018. B-28 Figure B.16: Compliance in the first nine months of 2018, 2019, and 2020 (a) 2018 vs 2019 N blocks 300 2018 2019 200 100 0 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1 Share of bills paid - by block (b) 2019 vs 2020 N blocks 2020 300 2019 200 100 0 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1 Share of bills paid - by block Notes: These figures show compliance in the first 9 billing periods of the year. For each block, we compute the share of total bills paid out of 9. Panel (a) compares 2018 and 2019, and panel (b) compares 2019 and 2020. We restrict the analysis to the first 9 bills because our intervention takes place in October. To make it comparable, the numerator excludes overdue payments (i.e., payments made after the due date of each month). The figure suggests that 2018 and 2019 are comparable in terms of compliance and that compliance decreased substantially in 2020 because of the pandemic. B-29 Figure B.17: Payment rates in 2020 decreased more in blocks with higher compliance in 2019 1 45° line Share of bills paid in 2018 or 2020 .9 .8 2018 .7 .6 2020 .5 .4 .3 .2 .1 0 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1 Share of bills paid in 2019 Notes: This figure compares compliance in 2018 or 2020 (vertical axis) relative to 2019 (horizontal axis) at the block level. To that end, we split the sample of blocks into ten evenly-spaced groups using the share of payments in 2019 (horizontal axis). For each bin, we then compute the average share of payments in 2018, 2019, and 2020. The red triangles compare 2018 to 2019, and the blue circles compare 2020 to 2019. The 45◦ line corresponds to the situation where compliance remains unchanged over time. The figure suggests that the drop in compliance in 2020 highlighted in Figure B.16 is more prominent for higher levels of baseline compliance. That is, blocks that had high compliance in 2019 are those where the payment rate decreased the most in the first nine months of 2020. In contrast, 2018 and 2019 display similar levels of compliance. This stylized fact suggests that blocks with high compliance in 2019 (and low compliance in 2020) are more likely to be nudged by our intervention and, thus, where spillovers are more likely to manifest. B-30 C Additional Numerical Illustration Figure 1 summarizes the distribution of cluster sizes in five published studies employing partial population designs: epon et al. (2013), Gin´ Cr´ undeln (2012) and e and Mansuri (2018), Haushofer and Shapiro (2016), Ichino and Sch¨ Imai, Jiang and Malani (2021). For this numerical illustration, we calculate the estimators standard errors and minimum detectable effects based on our formulas from Section 3 using the cluster size distribution of these four studies. We refer to these magnitudes as “adjusted” standard errors and MDEs, since they are adjusted for cluster size variation. For comparison, we also calculate the “unadjusted” standard errors and MDEs using average cluster size and assuming that the variance of group size is equal to zero, that is, ignoring cluster size heterogeneity. To make the results comparable, we use as a benchmark the design in our application to tax compliance, which has four saturations: p0 = 0, p1 = 0.2, p2 = 0.5, p3 = 0.8. We compute the optimal probabilities {q0 , q1 , q2 , q3 } using Theorem 2. We assume for simplicity that outcomes are homoskedastic with σ 2 (dt, dt) = 1 for all d, t so that effects are measured in standard deviations, and consider four values for the intraclass correlation, ρ ∈ {0.1, 0.2, 0.5, 0.8}. The parameter of interest is the spillover effect on untreated units in groups with 80% treated. Table A6: Numerical results Standard error MDE Adj. Unadj. Ratio Adj. Unadj. Ratio ρ = 0.1 GM 0.1262 0.1181 1.0687 0.3536 0.3308 1.0689 HS 0.1053 0.0932 1.1307 0.2951 0.2610 1.1307 IS 0.1768 0.1667 1.0608 0.4954 0.4670 1.0608 IJM 0.0569 0.0497 1.1453 0.1595 0.1393 1.1450 ρ = 0.5 GM 0.2593 0.2393 1.0835 0.7265 0.6705 1.0835 HS 0.2098 0.1783 1.1761 0.5877 0.4997 1.1761 IS 0.3437 0.3171 1.0840 0.9630 0.8884 1.0840 IJM 0.1136 0.0950 1.1961 0.3183 0.2661 1.1962 ρ = 0.8 GM 0.3252 0.2997 1.0851 0.9112 0.8397 1.0851 HS 0.2622 0.2218 1.1818 0.7345 0.6215 1.1818 IS 0.4284 0.3941 1.0869 1.2002 1.1042 1.0869 IJM 0.1420 0.1181 1.2024 0.3979 0.3309 1.2025 The numerical results are shown in Table A6. When the intraclass correlation is low (ρ = 0.1), accounting for cluster size heterogeneity increases standard errors and MDEs between 6.8% and 14.5%. The problem worsens for larger intraclass correlations. When ρ = 0.5, adjusted standard errors and MDEs are between 8.3% and 19.6% larger, and between 8.5% and 20.2% larger when ρ = 0.8. D-31 D Proofs D.1 Setup and Definitions Following the notation in the paper, consider clusters g = 1, . . . , G with cluster size ng , units i = 1, . . . , ng and total sample size n = g ng . The cluster-level treatment assignment is Tg ∈ {0, . . . , M } with P[Tg = t] = qt , and the individual-level treatment indicator Dig with P[Dig = d|Tg = t] = pg (d|t). Within each cluster, the total number of i 1(Dig = d), and conditional on Ng > 0, the within-cluster average d = units receiving treatment Dig = d is Ng d outcome under Dig = d is Y ¯ d = ng Yig 1(Dig = d)/N d . g i=1 g Letting 1dt ig = 1(Dig = d, Tg = t), ig )(d,t) and 1g = (1ig , . . . , 1ng g ) , the vector of OLS estimators 1ig = (1dt for the sample means is: −1 µ ˆn = 1g 1g 1g Yg = (N)−1 1g Yg g g g where N = diag (N (d, t))(d,t) is a diagonal matrix with entries N (d, t) = g g Ng and where 1g = 1(Tg = t). 1t d t 2 (d, t), Cov(Y , Y |D Also define E[Yig |Dig = d, Tg = t] = µg (d, t), V[Yig |Dig = d, Tg = t] = σg ig jg ig = d, Djg = d , Tg = t) = cg (d, d , t) with Cov(Yig , Yjg |Dig = d, Djg = d, Tg = t) = cg (d, t) and similarly ρg (d, d , t) = cg (d, d , t)/(σg (d, t)σg (d , t)) and ρg (d, t) = ρg (d, d, t). Finally, let pg (d, d |t) = P[Dig = d, Djg = d |Tg = t]. D.2 Auxiliary Results Lemma 1 (Convergence of Sample Sizes) Under Assumptions 1 and 3, ï ò−1 ïò N N N ×E →P I2M +1 , E = diag qt ng pg (d|t)/n . n n n g (d,t) Proof. For any (d, t), 1 1 1 1t V[1t V 1t 1 ¶ î ó î ó© d d d t d V g Ng = g Ng ] = g E[ Ng |Tg ] + E g V [ Ng |Tg ] n g n2 g n2 g 1 = 2 n2 2 g pg (d|t) qt (1 − qt ) + qt ng pg (d|t)(1 − pg (d|t)) + qt ng (ng − 1) pg (d, d|t) − pg (d|t) 2 n g n2 g ng = qt (1 − qt ) pg (d, t)2 + qt pg (d|t)(1 − pg (d|t)) g n2 g n2 ng (ng − 1) + qt 2 pg (d, d|t) − pg (d|t)2 g n 2 Ç å g ng =O = o(1). n2 D-32 since g n2 2 g /n ≤ maxg ng /n → 0. Therefore, by Markov’s inequality, ñ ï ò−1 ô Å ã2 N N N (d, t)/n P ×E − I2M +1 > ε = P − 1 > ε2 n n d,t E[N (d, t)/n] 1 1 ε E[N (d, t)] ≤ P 1t d g Ng − E 1t d g Ng >√ d,t n g n g 2M + 1 n Ç å2 (2M + 1)2 n 1 ≤ V 1t d g Ng ε2 d,t qt g g pg (d|t) n n g (2M + 1)3 1 1 ≤ 2 · · max V 1t d g Ng → 0 ε c d,t n g using that g ng pg (d|t)/n is bounded below. ¯ d ) Under Assumptions 1 and 2, Lemma 2 (Moments of Yg g E[Yg |Tg = t, Dg ] = 1g µg (d, t) 1t ¯d t 1t 1d ig 1jg d 1t g V [ ¯ Yg d |Tg = t, Dg ] = σ (d, t) + 2cg (d, t)1t g 2 d g g d )2 Ng i j>i (Ng 1t î ó d 2 ¯d 2 E g (Ng ) V[Yg |Tg = t, Dg ] = σg (d, t)qt ng pg (d|t) + cg (d, t)ng (ng − 1)qt pg (d, d|t). ig = 1(Dig = d), Proof. By direct calculation, letting 1d 1t ñ ô 1 1t ¯d g E[Yg |Tg = t, Dg ] = 1 t gE d Yig d1 Tg = t, Dg = ig d g E[Yig |Tg = t, Dg ]1d ig Ng i Ng i =1 t g µg (d, t) where the last equality follows from Assumption 2. Similarly, 1t 1t ¯d g V[Yg |Tg = t, Dg ] = g d )2 V[Yig |Tg = t, Dig = d]1d ig + 2 ig 1jg Cov(Yig , Yjg |Dig = d, Djg = d) 1d d (Ng i i j>i 1t 1 1 d d + 21t g 2 ig jg = σ (d, t) d g g cg (d, t) d 2 Ng i j>i ( N g) and the third expression follows from taking expectation. Lemma 3 (Convergence of squared sums) Given a vector of random variables Xg = (X1g , . . . , Xng g ) and (Xg )G g =1 , ng 1 let Xg = i=1 Xig X 2 . Suppose that: (i) (Xg )G and define Tn = g =1 are independent across g ; (ii) Assumption îg g ó n 3(i) holds; (iii) For some > r, supi,g E |Xig | < ∞. Then |Tn /E[Tn ] − 1| →P 0. D-33 Proof. This proof follows those of Theorems 2 and 3 in Hansen and Lee (2019). Write 2 ng Tn 1 Xg 1 2 Xig = 1 = Zg , Zig := , Zg := Zig . E[Tn ] n g E n g 2 Xg n g E[Tn ]1/2 i=1 Fix ε > 0. We show that for n large enough, E [|Tn /E[Tn ] − 1|] < ε and the result follows by Markov’s inequality. Set δ = ε2 /4. Then, using that: 1 1 1 E 2 Zg =1= 2 E[Zg 1(Zg 2 > nδ )] + 2 E[Zg 1(Zg 2 ≤ nδ )] n g n g n g we have: ï ò Tn 1 E −1 ≤E 2 Zg 1(Zg 2 2 > nδ ) − E[Zg 1(Zg 2 > nδ )] E[Tn ] n g 1 +E 2 Zg 1(Zg 2 2 ≤ nδ ) − E[Zg 1(Zg 2 ≤ nδ )] . n g and by the triangle inequality, ï ò Tn 2 E −1 ≤ 2 E[Zg 1(Zg 2 > nδ )] (13) E[Tn ] n g 1 + E 2 Zg 1(Zg 2 2 ≤ nδ ) − E[Zg 1(Zg 2 ≤ nδ )] . (14) n g Consider the term (13). For r ≥ 2, |Zg |r Ä ñ ô 1 1 1 1 |Zg | > (nδ ) ä 2 2 r −2 r/2−1 E[Zg (Zg > nδ )] = E n g n g |Zg |r−2 1 1 î Ä äó r 1/2 ≤ E | Z g | | Z g | > ( nδ ) n(nδ )r/2−1 g ñ r Ç åô 1 Zg Zg (nδ )1/2 ≤ r n E 1 > nr/2 δ r/2−1 g g ng ng ng r Ç å1/2 1 Zg Zg n ≤ r n E 1 > δ 1/2 nr/2 δ r/2−1 g g ng ng maxg n2 g r r Ç å1/2 1 g ng Zg Zg n ≤ · sup E 1 > δ 1/2 δ r/2−1 nr/2 g ng ng maxg n2 g r Ç å1/2 C r/2 Zg Zg n ≤ sup E 1 > δ 1/2 δ r/2−1 g ng ng maxg n2 g D-34 where the last equality follows from Assumption 3(i). Now, by Condition (iii), for > r, î ó î ó supi,g E |Xig | sup E |Zig | = < ∞. i,g E[Tn ]l/2 Thus by Lemma 1 in Hansen and Lee (2019), there is a B large enough such that: ñ r Ç åô Zg Zg εδ r/2−1 sup E 1 >B ≤ g ng ng 2C r/2 and by Assumption 3(i) there is an n large enough such that: Ç å1/2 n B≤ δ 1/2 , maxg n2 g from which: r Ç å1/2 Zg Zg n εδ r/2−1 sup E 1 > δ 1/2 ≤ . g ng ng maxg n2 g 2C r/2 Therefore, 1 ε 2 E[Zg 1(Zg 2 > nδ )] ≤ . n g 2 Next, consider the term (14). We have that: 2 1/2   1 1  E 2 Zg 1(Zg 2 2 ≤ nδ ) − E[Zg 1(Zg 2 ≤ nδ )] ≤ E 2 Zg 1(Zg 2 2 ≤ nδ ) − E[Zg 1(Zg 2 ≤ nδ )]  n g n g 1/2 1 = V 2 Zg 1 2 (Zg ≤ nδ ) n g 1/2 1 1 1 2 î ó 2 2 2 2 = E Zg (Zg ≤ nδ ) − E[Zg (Zg ≤ nδ )] n g 1/2 1 ≤ E 4 Zg 1 2 (Zg ≤ nδ ) n g where the first line uses Jensen’s inequality, the second line uses the definition of variance, the third line uses the fact that clusters are independent and the fourth line uses that for any random variable A, V[W ] ≤ E[W 2 ]. Next, 4 1(Z 2 ≤ nδ ) = Z 2 1(Z 2 ≤ nδ ) use that Zg 2 1(Z 2 ≤ nδ ) ≤ nδZ 2 and thus Zg g g g g g 1/2 1/2 1 1 ε E 4 Zg 1 2 (Zg ≤ nδ ) ≤δ 1/2 2 E[Zg ] ≤ δ 1/2 = n g n g 2 D-35 since 2 ]/n = 1. Collecting these results, E[Zg g ï ò Tn E −1 ≤ε E[Tn ] as required. D.3 Proof of Theorem 1 For any (d, t), g 1t d ¯d p g Ng (Yg − µn (d, t)) ˆ(d, t) − µp µ n (d, t) = N (d, t) g 1 ¯d t N d (Y g g g− µg (d, t)) 1 t d g ( g Ng − qt ng pg (d|t))(µg (d, t) − µp n (d, t)) = + N (d, t) N (d, t) where the second equality uses that: qt ng pg (d|t)(µg (d, t) − µp n (d, t)) = qt ng pg (d|t)µg (d, t) − µp n (d, t) ng pg (d|t) = 0. g g g Next, g 1t d ¯d g Ng (Yg − µg (d, t)) 1 − qt ng pg (d|t))(µg (d, t) − µp t d g ( g Ng n (d, t)) ˆ(d, t) − µ µp n (d, t) = + N (d, t) N (d, t) E[N (d, t)] 1 g Ng (Yg − µg (d, t)) + (1g Ng − qt ng pg (d|t))(µg (d, t) − µn (d, t)) 1t d ¯d t d p = · N (d, t) n g qt g ng pg (d|t)/n E[N (d, t)] 1 = · ψg (d, t) N (d, t) n g where 1t g Ng (Yg − µg (d, t)) + (1g Ng − qt ng pg (d|t))(µg (d, t) − µn (d, t)) d ¯d t d p ψg (d, t) = , E[ψg (d, t)] = 0 ¯n (d|t) qt p D-36 ¯n (d|t) = with p g ng pg (d|t)/n and 1 g (Ng ) V[Yg |Tg = t, Dg ] + (µg (d, t) − µn (d, t)) V[1g Ng ] E 1t ¶ î ó © V[ψg (d, t)] = d 2 ¯d p 2 t d 2p qt ¯n (d|t)2 2 n (d, t))Cov(1g Ng (Yg − µg (d, t)), 1g Ng ) t d ¯d + 2p (µg (d, t) − µp t d qt ¯n (d|t)2 1 = 2 σ 2 (d, t)qt ng pg (d|t) + cg (d, t)ng (ng − 1)qt pg (d, d|t) ¯n (d|t)2 g qt p (µg (d, t) − µp n (d, t)) 2 + 2p qt (1 − qt )n2 2 g pg (d|t) + qt ng pg (d|t)(1 − pg (d|t)) qt ¯n (d|t)2 (µg (d, t) − µp n (d, t)) 2 + 2p qt ng (ng − 1) pg (d, d|t) − pg (d|t)2 . qt ¯n (d|t)2 From this, 1 1 V ψg (d, t) = V[ψg (d, t)] n g n2 g 1 ng 2 = σ (d, t)qt pg (d|t) 2p qt ¯n (d|t)2 g n2 g 1 ng (ng − 1) + 2 cg (d, t)qt pg (d, d|t) ¯n (d|t)2 qt p g n2 qt (1 − qt ) n2 g + 2p pg (d|t)2 (µg (d, t) − µp n (d, t)) 2 qt ¯n (d|t)2 g n2 qt ng + 2 pg (d|t)(1 − pg (d|t))(µg (d, t) − µp n (d, t)) 2 qt p¯n (d|t)2 g n2 qt ng (ng − 1) + 2 pg (d, d|t) − pg (d|t)2 (µg (d, t) − µp n (d, t)) 2 ¯n (d|t)2 qt p g n2 2 Ç å g ng =O = o(1) n2 2 (d, t) and |µ (d, t) − µ (d, t)| are bounded by Assumption 3, p since σg ¯n (d|t) is bounded from below and maxg ng /n → g n 0. This implies that: ˆ(d, t) − µp |µ n (d, t)| →P 0 for all (d, t), which gives the consistency result. Next, stack the elements ψg (d, t) in a vector ψg and note that 1 1 Ωn = V √ ψg = E[ψg ψg ] n g n g D-37 where pg (d, d|t) ß Å ã 1 2 n 2 E[ψg (d, t) ] = 2 ng σg (d, t)pg (d|t) 1 + ρg (d, t)(ng − 1) n g qt g ng pg (d|t) g pg (d|t) + ng (µg (d, t) − µp 2 2 n (d, t)) ng (1 − qt )pg (d|t) + pg (d|t)(1 − pg (d|t))) + (ng − 1)Cov(1d ig , 1jg |Tg = t) ä© d , 1 n , t)ng pg (d, d |t) g cg (d, d E[ψg (d, t)ψg (d , t)] = n g qt g ng pg (d|t) g ng pg (d |t) − µpn (d, t))(µg (d , t) − µn (d , t))Cov(1g Ng , 1g Ng ) n p t d t d g (µg (d, t) + , qt g ng pg (d|t) g ng pg (d |t) n2 p p 1 n g g pg (d|t)pg (d |t )(µg (d, t) − µn (d, t))(µg (d , t ) − µn (d , t )) E[ψg (d, t)ψg (d , t )] = − . n g g ng pg (d|t) g ng pg (d |t ) and this variance matrix is invertible because its minimum eigenvalue is bounded below by assumption. Finally, write 1 1 ψg (d, t) = ψig (d, t) n g n g i where g 1ig 1t d ® Ç å ´ 1 ψig (d, t) = 11 t d g ig (Yig − µg (d, t)) + − qt pg (d|t) (µg (d, t) − µp n (d, t)) . ¯n (d|t) qt p ng Then we have that for > r ≥ 2, Å î ã î ó1/ 1 ó1/ p 1/ E |ψig (d, t)| ≤ E |Yig − µg (d, t)| + |µg (d, t) − µn (d, t)| <∞ qt c uniformly over i, g, d, t since as moments are uniformly bounded and using Minkowski’s inequality and Assumption 3. Thus, /2 sup E[ ψig ] ≤ (2M + 1) sup E[|ψig (d, t)| ] < ∞ i,g i,g,d,t and by Theorem 2 in Hansen and Lee (2019), 1/2 1 Ω− n √ ψg →D N (0, I(2M +1) ). n g To complete the proof, notice that by Lemma 1 and the Slutsky theorem: −1/2 √ −1 −1/2 1 −1/2 1 Ωn n(µ ˆp ˆn − µn ) = N E[N]Ωn √ ψg = Ωn √ ψg + oP (1) →D N (0, I(2M +1) ) n g n g as required. D-38 D.4 Proof of Proposition 1 By Equation (7), ˆ cr (d, t) = n g 1t d 2 ¯d ˆ(d, t))2 g (Ng ) (Yg − µ Ω N (d, t)2 g 1g 1t i 1ig (Yig − µ 1d Ä äÄ ä t d g ˆ(d, t)) i ig (Yig − µ ˆ(d , t )) ˆ cr ((d, t), (d , t )) = n Ω N (d, t)N (d , t ) ˆ cr (d, t, d t ) = 0 for t = t . For the main diagonal terms, recall that: and notice that Ω 1 g (Ng ) V[Yg |Tg , Dg ] + (µg (d, t) − µn (d, t)) V[1g Ng ] . E 1t ¶ î ó © Ωn (d, t) = d 2 ¯d 2 t d 2 ¯n (d|t)2 nqt p g Adding and subtracting µp n (d, t) and expanding the square, the variance estimator is: Å ã2 n 1 ˆ cr (d, t) = Ω 1t d 2 ¯d p g (Ng ) (Yg − µn (d, t)) 2 (15) N (d, t) n g 1 + (µp ˆ(d, t))2 n (d, t) − µ 1t d 2 g (Ng ) (16) n g 1 + 2(µp n (d, t) − µ ˆ(d, t)) 1t d 2 ¯d p g (Ng ) (Yg − µn (d, t)) (17) n g where   g n2 g |µ ˆ(d, t) − µp n (d, t)| = OP . n2 as shown in the proof of Theorem 1. By Lemma 1 and the continuous mapping theorem, Å ã2 n 1 = 2p (1 + oP (1)). N (d, t) qt ¯n (d|t)2 On the other hand, for term (16), 1 1 2 E[1t d 2 g (Ng ) ] = g 1ig ] + E[1t d g 1ig 1jg ] E[1t d d n g n g i n g i j>i 1 1 = ng qt pg (d|t) + ng (ng − 1)qt pg (d, d|t) n g n g qt ¯n (d|t) + = qt p ng (ng − 1)pg (d, d|t) n g 2 Ç å gng =O n g 1ig , by Lemma 3, and letting Xig = 1t d 1 n 1t d 2 g (Ng ) g = 1 + oP (1) g E[1g (Ng ) ] 1 t d 2 n D-39 so 1 1 (µp ˆ(d, t))2 n (d, t) − µ 1t d 2 p ˆ(d, t))2 g (Ng ) = (µn (d, t) − µ E[1t d 2 g (Ng ) ](1 + oP (1)) n g n g 2 2 maxg n2 Ç å Ç å Ç å g ng g ng g = OP O ≤ OP = oP (1) n2 n n under Assumption 3. For term (17), 1 1 1 1t d 2 ¯d p g (Ng ) (Yg − µn (d, t)) = 1t d 2 ¯d g (Ng ) (Yg − µg (d, t)) + 1t d 2 p g (Ng ) (µg (d, t) − µn (d, t)). n g n g n g Now, 1t î ó d 2 ¯d E g (Ng ) (Yg − µg (d, t)) = 0 and n2 Ç å 1 1 1 g g î ó t d 2 ¯d ¯d E g (Ng ) (Yg − µg (d, t)) ≤ n2 gE Y g − µg (d, t) =O n g n g n so by Markov’s inequality, n2 Ç å 1 1t d 2 ¯d g (Ng ) (Yg − µg (d, t)) = OP g g . n g n On the other hand, 1 n2 1t d 2 p g (Ng ) (µg (d, t) − µn (d, t)) ≤ max |µg (d, t) − µp n (d, t)| g n g g n which implies n2 Ç å 1 1t d 2 ¯d p g (Ng ) (Yg − µn (d, t)) = OP g n g n and therefore   n2 Ç å 1 n2 (µp n (d, t) − µ ˆ(d, t)) 1t d 2 ¯d g (Ng ) (Yg − µp n (d, t)) = OP g g OP g n g n2 n     g n6 g n2 g g n2 g = OP ≤ OP max · = oP (1). n4 g n n2 Thus, Å ã2 n 1 ˆ cr (d, t) = Ω 1t d 2 ¯d p 2 g (Ng ) (Yg − µn (d, t)) + oP (1). N (d, t) n g Next, under Assumption 3 and by Lemma 3 1 n g 1t d 2 ¯d p g (Ng ) (Yg − µn (d, t)) 2 = 1 + oP (1) g 1g (Ng ) (Yg − µn (d, t)) 1 t d 2 ¯d p 2 E n D-40 and therefore: Å ã2 n 1 ˆ cr (d, t) = Ω E 1t d 2 ¯d p g (Ng ) (Yg − µn (d, t)) 2 (1 + oP (1)) + oP (1) N (d, t) n g 1 E 1 ¯ d − µp (d, t))2 t (N d )2 (Y g g g n = 2p (1 + oP (1)) + oP (1). n g qt ¯n (d|t)2 But 1 E 1t d 2 ¯d p g (Ng ) (Yg − µn (d, t)) 2 1 E 1t d 2 ¯d p 2 g (Ng ) E (Yg − µn (d, t)) |Tg , Dg 2p = 2p n g qt ¯n (d|t)2 n g qt ¯n (d|t)2 1t î Ä äó ¯gd |Tg , Dg + (µg (d, t) − µp d 2 V Y 2 1 E g (Ng ) n (d, t)) = 2p n g qt ¯n (d|t)2 1 E 1t d 2 ¯d g (Ng ) V Yg |Tg , Dg + V 1t d p g Ng (µg (d, t) − µn (d, t)) 2 = 2p n g qt ¯n (d|t)2 1 E 1t d 2 (µ (d, t) − µp (d, t))2 g Ng g n + 2p n g qt ¯n (d|t)2 n2 pg (d|t) 2 Å ã g 2 = Ωn (d, t) + (µg (d, t) − µp n (d, t)) g n p ¯n (d|t) which implies that: ã2 2 ˆ cr (d, t) n2 pg (d|t) (µg (d, t) − µp Å Ω g n (d, t)) =1+ + oP (1). Ω(d, t) g n ¯n (d|t) p Ωn (d, t) ˆ cr (0, 0). The true ˆ cr (d, t) + Ω ˆcr (d, t) = Ω Finally, consider the variance estimator for the difference in means, V variance is: Vn (d, t) = Ωn (d, t) + Ωn (0, 0) − 2Ω(d, t, 0, 0) n2 pg (d|t) Å ã g = Ωn (d, t) + Ωn (0, 0) + 2 (µg (d, t) − µn (d, t))(µg (0, 0) − µn (0, 0)). g n ¯n (d|t) p Therefore, ã2 ˆcr (d, t) n2 pg (d|t) Å V g p p 1 =1+ (µg (d, t) − µn (d, t)) − (µg (0, 0) − µn (0, 0)) + oP (1). Vn (d, t) g n ¯n (d|t) p Vn (d, t) so that ˆcr (d, t) V plim ≥ 1. n→∞ Vn (d, t) as required. D-41 D.5 Proof of Theorem 2 Based on Theorem 1, the variance for the difference in means can be approximated as: ng pg (d|t) pg (d, d|t) ï ß ™ ˆ(d, t)] ≈ 1 V[ β 2 σg (d, t) 1 + ρg (d, d, t) (ng − 1) qt 2 ¯ (d|t)2 pg (d|t) g n p n pg (d, d|t) ß ™ò 2 + (µg (d, t) − µn (d, t)) 1 + (ng − 1) pg (d|t) 1 ng 2 + σ (0, 0) {1 + ρg (0, 0, 0)(ng − 1)} + ng (µg (0, 0) − µn (0, 0))2 q0 g n 2 g ò2 n2g pg (d|t) ï − 2 p (µg (d, t) − µn (d, t)) − (µg (0, 0) − µn (0, 0)) g n ¯n (d|t) where the last term does not depend on {qt }t so after dropping this term and rescaling by n2 , the minimization problem is equivalent to: M Bt (ω ) B0 min + = f (q0 , q1 , . . . , qM ) q0 ,q1 ,...,qM t=1 qt q0 subject to qt > 0, t qt = 1 where B0 and Bt (ω ) are defined in the statement of the theorem. The first-order conditions for each qt , t > 0 are given by:   ∂f Bt (ω ) B0 ∗ Bt (ω ) ∗ =− 2 + 2 =0 ⇐⇒ qt = q ∂qt qt q0 B0 0 Since t>0 qt = 1 − q0 ,   ∗ ∗ Bt (ω ) 1− q0 = q0 t>0 B0 and thus: √ √ ∗ B0 ∗ Bt q0 =√ , qt =√ , t > 0. B0 + t>0 Bt (ω ) B0 + t>0 Bt (ω ) On the other hand, the second-order conditions for t > 0 are given by: ∂2f 2Bt (ω ) 2B0 ∂2f 2B0 2 = 3 + 3 , = 3 ∂qt qt q0 ∂qt ∂ql q0 and therefore the Hessian matrix H can be written as: Å ã Å ã 2B1 (ω ) 2BM (ω ) 2B0 H = diag 3 ,..., 3 + 3 1M 1M q1 qM q0 where 1M is an M × 1 vector of ones. Thus, for any non-zero M × 1 vector v, M M M 2 2 2Bt (ω )vt Å 2B0 ã 2 2Bt (ω )vt Å 2B0 ã v Hv = 3 + 3 v 1M 1M v = 3 + 3 vt >0 t=1 q1 q0 t=1 q1 q0 t=1 D-42 using that Bt (ω ) > 0 for all t so the Hessian is positive definite as required. D.6 Proof of Corollary 1 This proof follows from Theorem 1 and Proposition 1 setting µg (d, t) = µp n (d, t) = µ(d, t) throughout. D.7 Proof of Proposition 2 Let Dg (d, t) denote the set of possible values for D(i)g given Dig = d and Tg = t. Then, E[Yig |Dig = d, Tg = t] = E[Yig |Dig = d, D(i)g = dg , Tg = t]P[D(i)g = dg |Dig = d, Tg = t] dg ∈Dg (d,t) = E[Yig (d, dg )|Dig = d, D(i)g = dg , Tg = t]P[D(i)g = dg |Dig = d, Tg = t] dg ∈Dg (d,t) = E[Yig (d, dg )]P[D(i)g = dg |Dig = d, Tg = t] dg ∈Dg (d,t) n g −1 ï Å ãò sg = E Yig d, P[Sig = sg |Dig = d, Tg = t] sg =0 ng − 1 where the first equality follows by the law of iterated expectations, the second equality plugs in the potential out- comes under the exclusion restriction (Assumption 6), the third equality uses independence (Assumption 7), and the fourth equality uses exchangeability (Assumption 8). D.8 Proof of Theorem 3 We verify the conditions for Theorem 1. First, condition (i) implies that Proposition 2 holds. Second, condition (ii) and Proposition 2 imply Assumption 1. Next, condition (iii) implies that: 1 − D = s , D = d|T = t] P[ Ng P[Sig = sg , Dig = d|Tg = t] ig g ig g P[Sig = sg |Dig = d, Tg = t] = = pg (d|t) pg (d|t) P[Ng1 = s + d, D = d|T = t] P[Dig = d|Tg = t] = 1(sg + d = ng pg (1|t)) g ig g = pg (d|t) pg (d|t) = 1(sg = ng pg (1|t) − d) D-43 î Ä n p (1|t)−d äó g and thus by Proposition 2, E[Yig |Dig = d, Tg = t] = E Yig d, g n g −1 . This fact also implies that E[Yig |Dig = d, Tg = t, D(i)g ] = E[Yig |Dig = d, Tg = t, D(i)g = dg ]1(D(i)g = dg ) dg = E[Yig (d, (dg 1g − d)/(ng − 1))]1(D(i)g = dg ) dg = E[Yig (d, (sg /(ng − 1))]1(Sig = sg ) sg = E[Yig (d, (ng pg (1|t) − d)/(ng − 1))] and an analogous argument gives the result for the joint moments, so Assumption 2 holds. Next, condition (iii) implies that Assumption 3 holds, so all the requirements for Theorem 1 are satisfied. Finally, by Proposition 2 and condition (iv), ng p(1|t) − d ï Å ãò ng ng ng ng βn (d, t) : = µg (d, t) − µg (0, 0) = E Yig d, − E[Yig (0, 0)] g n g n g n ng − 1 g n which completes the proof. D.9 Proof of Theorem 4 This result follows from the fact that conditions (i) and (ii) imply Assumption 5 and thus under the conditions for Theorem 3, Corollary 1 holds. D-44