Measuring Inequality from Top to Bottom

This paper presents a new methodology to measure inequality that optimally combines household survey information and tax records to construct a complete income distribution. Combining the two data sources is necessary because, on the one hand, household surveys do not accurately represent the wealthiest segment of the population, while tax records do; on the other hand, the opposite is true for the lower end of the income distribution: tax records only include incomes above a certain threshold. The key innovation of the proposed methodology, and the main difference from the existing literature, is the choice of an optimal income threshold b. The Gini coefficient for the population is then computed combining the conditional income distributions for incomes below b (using household survey data) and above b (using tax records). Central to this methodology is the fact that b is not chosen arbitrarily: it should be determined in such a way as to minimize reliance on household survey data to compute the top of the income distribution. In practice, the optimal b corresponds to the minimum income level that triggers mandatory tax filing. The proposed methodology is applied to the case of Colombia.

Introduction
Welfare measures such as poverty and inequality are central to the study of economic development. The causes and consequences of inequality have always been a key concern for governments and scholars, but interest in the evolution of inequality has recently spiked.
This paper focuses on how to optimally measure inequality.
Household surveys have traditionally been the main data source to estimate welfare measures like poverty and inequality. However, they are subject to measurement error, underreporting, and non-response, among other problems. These issues are particularly severe when measuring inequality: they mostly affect the top of the income distribution, which can lead to incorrect estimates of the levels and trends of the Gini coefficient. If non-response rates were constant over time, the measurement of changes in inequality would not be greatly affected by these problems (Gasparini et al. 2000). However, it is likely that these rates do fluctuate over time. Income underreporting at the top of the distribution is neither constant nor homogeneous: it can vary by economic strata, by source of income, and over time. Yet there is no mechanism available to detect and model precisely the level of underreporting in household surveys. The most notable pattern is that richer households are more hesitant to disclose their income and assets, which would explain why underreporting tends to affect the top of the income distribution, ultimately resulting in an underestimation of the Gini coefficient (Székely and Hilgert 1999). Time variation in underreporting at the top of the income distribution will also affect the estimated time series of the coefficients (see Burkhauser et al. 2004, 2009; Piketty and Saez 2006b; and Atkinson 2007). Finally, household surveys may fail to represent all types of income sources (for example, capital gains), which may result in further bias (Burkhauser et al. 2012).
An alternative data source used to study top incomes in a population is income tax records, which contain extensive information on firms and individuals that is often not captured by traditional data collection mechanisms such as household surveys. Therefore, tax records can provide valuable input to improve the measurement of income inequality, with the caveat that they also present important limitations. First, tax records contain little information about the bottom segments of the income distribution, where individuals are not required to file a tax return. Second, filers have a financial incentive to report their income in a way that limits their tax liabilities, therefore reducing the marginal tax rate (Burkhauser et al. 2012). Third, the estimation could be biased due to tax evasion.
That said, examining the top of the income distribution is essential to understanding inequality, and tax records are a fundamental source for this purpose.
As shown by the key contributions of Atkinson (2007) and Alvaredo (2011), the share of income concentrated at the top of the income and wealth distribution can have dramatic effects on inequality. Recent studies have therefore proposed to rely on both household surveys and tax records to improve the estimation of the Gini coefficient. In particular, Atkinson (2007) proposes an approximate methodology to estimate the Gini coefficient.
The first stage of his methodology consists in choosing a group of top income earners (say, the top 0.1 percent). Then, if this group of top earners is approximately an infinitesimal fraction of the population, with share $S^*$ of total income, the Gini coefficient can be approximated by the following calculation: $S^* + (1 - S^*)G$. In this formula, G is the Gini coefficient of the population that excludes the top earners. More recently, Alvaredo (2011) extends this methodology to the case in which the group of top earners chosen is not infinitesimal relative to the size of the population (for example, the top 1 or 5 percent of the income distribution).
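As a rough illustration of the two formulas just described, the following Python sketch evaluates Atkinson's approximation and one common statement of Alvaredo's extension for a non-infinitesimal top group. The function names and input values are hypothetical. The Alvaredo expression is written here as an assumption (with P the population share of the top group and G** the Gini coefficient within that group), not as a quotation from that paper.

```python
def gini_atkinson(top_share: float, gini_rest: float) -> float:
    """Approximation S* + (1 - S*) * G for an (almost) infinitesimal top group."""
    return top_share + (1.0 - top_share) * gini_rest


def gini_alvaredo(top_share: float, top_pop_share: float,
                  gini_rest: float, gini_top: float) -> float:
    """Assumed form of the extension: G** * P * S + G * (1 - P) * (1 - S) + S - P."""
    s, p = top_share, top_pop_share
    return gini_top * p * s + gini_rest * (1.0 - p) * (1.0 - s) + s - p


if __name__ == "__main__":
    # Hypothetical inputs: the top group holds 20% of income, the rest has Gini 0.50.
    print(gini_atkinson(0.20, 0.50))
    # Same top share, but the top group is 1% of the population with internal Gini 0.40.
    print(gini_alvaredo(0.20, 0.01, 0.50, 0.40))
```

When the population share of the top group is set to zero, the second function collapses to the first, which is the sense in which the extension generalizes the approximation.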
These methodologies have two salient features worth noting. First, the choice of which group of top earners to consider in the formula is arbitrary: the top income group is not optimally chosen based on specific criteria. Second, the Gini coefficient G for the population that does not include the chosen top earners group is computed using the household surveys.
This means that, since we cannot determine precisely the extent of underreporting at the top of the income distribution, it is not clear that the G estimated from household surveys will correctly capture the Gini coefficient of the population that excludes the selected top earners. For example, suppose that we choose to apply the methodology focusing on the top 0.1 percent of the population, so $S^*$ in the formula above would be the share of income of the top 0.1 percent computed from tax records. Suppose also that the household survey is only representative of the bottom 95 percent of the population, because individuals above the 95th percentile do not correctly report their income. In that case, the G computed from the household survey would not correspond to the true Gini coefficient of the population outside the selected group of top earners (in this case the bottom 99.9 percent), since the 95-99.9 percentiles are not correctly captured in the household survey. The methodology would then suffer from omitting a nontrivial segment of the population from the analysis.
The methodology proposed in this paper instead selects an income threshold b that minimizes reliance on household survey data at the top of the income distribution, and computes the population Gini coefficient by combining the Gini coefficients of the population segments with incomes below b (using household survey data) and above b (using tax records). Since the underreporting distortions in the household survey are stronger for higher incomes, the optimal b that satisfies these requirements is the lowest level of income that triggers mandatory tax filing. This is the key element of the proposed methodology. Note that this income threshold can lie well below the top 1 percent cutoff traditionally used in the existing literature.
The proposed methodology, therefore, solves the issue of arbitrarily choosing the top income group, instead relying on the optimal income threshold b, which gives the maximum weight possible to tax records at the top of the income distribution. In addition, the methodology explicitly takes into account the underreporting of household surveys at the top of the income distribution, and only relies on household survey data to compute the Gini coefficient for the segment of the income distribution that lies below the threshold b.
This paper is organized as follows. After presenting the methodology and its properties in Section 2, the paper demonstrates an application to the case of Colombia (Section 3).
Although in recent years an increasing number of governments have made income tax data available to the public, Latin American countries have lagged behind. Colombia is the first country in the region to make personal income tax micro-data available to researchers, which determined the case study selection in this analysis. Section 4 offers conclusions.

Methodology
This section describes a simple procedure to estimate the entire income distribution in a population by combining the information in two data sets that are only informative about conditional income distributions. In this case, we focus on the fact that household surveys are representative of the bottom segment of the income distribution but not the top, while tax records capture the top end of the income distribution but not the bottom. The proposed methodology applies a result of Dagum (1997), who showed how to decompose a population's Gini coefficient into a combination of the Gini coefficients of its subpopulations. A related methodology has been applied by Alvaredo (2011) to measure income inequality in the United States and Argentina. The end of this section summarizes the main differences between the proposed methodology and the previous literature.

Setup and Assumptions
Suppose that the cumulative distribution function of income $y$ in a population is $F(y)$.
We would like to obtain an estimate of $F(y)$ and some related statistics (like the Gini coefficient, $G[F]$). However, we can only obtain consistent estimates of some conditional distributions of $y$. In particular, we assume:
Assumption 1. Let
\[
F_1(y) = \Pr(Y \le y \mid Y \le b), \qquad F_2(y) = \Pr(Y \le y \mid Y > b)
\]
for some b. We observe $\hat{F}_1(y)$ and $\hat{F}_2(y)$, pointwise consistent estimators of $F_1(y)$ and $F_2(y)$ respectively.
Assumption 2. The value $F(b)$ is known or can be consistently estimated.
The first assumption simply states that out of the available information (for example, household surveys or tax records) we can obtain consistent estimators of the conditional distribution of income; $\hat{F}_1$ estimates the conditional distribution below an income threshold b, while $\hat{F}_2$ estimates the conditional distribution for incomes above b.
The second assumption is necessary to combine the information in $\hat{F}_1$ and $\hat{F}_2$ correctly into an estimator of $F$. This assumption does not require knowing the full distribution $F$, only the value at point b.

Results
We provide two simple results. The first proposition shows that under Assumptions 1 and 2, we can obtain a consistent estimate of the distribution $F$. The second proposition shows that it is possible to express the Gini coefficient of the income distribution $F$ as a combination of Gini coefficients constructed using the conditional distributions $F_1$ and $F_2$.
Proposition 1. Under Assumptions 1 and 2, the estimator $\hat{F}(y)$ constructed as:
\[
\hat{F}(y) =
\begin{cases}
F(b)\,\hat{F}_1(y) & \text{if } y \le b, \\
F(b) + (1 - F(b))\,\hat{F}_2(y) & \text{if } y > b,
\end{cases}
\]
is a pointwise consistent estimator of $F(y)$.
Proof. In the population, by the definition of conditional distribution we have, for $y \le b$,
\[
F(y) = F(b)\,F_1(y),
\]
and for $y > b$,
\[
F(y) = F(b) + (1 - F(b))\,F_2(y).
\]
Given that under Assumptions 1 and 2 we have consistent estimators for all the right-hand side variables, the consistency of $\hat{F}(y)$ follows immediately.
Proposition 1 tells us how we can reconstruct the underlying unconditional income distribution if we know the two conditional distributions above and below an income threshold b and the fraction of the population with income lower than b. From the estimated $\hat{F}(y)$, we can then study the properties of the income distribution.
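The construction in Proposition 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming unweighted samples (survey weights are ignored) and a known share of the population below b. The names `ecdf`, `combined_cdf`, and `p_below` are hypothetical and do not come from the paper.

```python
import numpy as np


def ecdf(sample):
    """Return the empirical CDF of a sample as a callable."""
    xs = np.sort(np.asarray(sample, dtype=float))

    def cdf(y):
        # Share of observations less than or equal to y.
        return np.searchsorted(xs, y, side="right") / xs.size

    return cdf


def combined_cdf(survey_below_b, tax_above_b, b, p_below):
    """Glue two conditional CDFs into one unconditional CDF.

    survey_below_b: incomes at or below b (e.g. from a household survey)
    tax_above_b:    incomes above b (e.g. from tax records)
    p_below:        share of the population with income at or below b, i.e. F(b)
    """
    f1 = ecdf(survey_below_b)
    f2 = ecdf(tax_above_b)

    def cdf(y):
        y = np.asarray(y, dtype=float)
        below = p_below * f1(y)                    # branch for y <= b
        above = p_below + (1.0 - p_below) * f2(y)  # branch for y > b
        return np.where(y <= b, below, above)

    return cdf


if __name__ == "__main__":
    # Toy data: 80% of the population lies below the threshold b = 5.
    F = combined_cdf([1, 2, 3, 4], [6, 8, 10, 12], b=5, p_below=0.8)
    print(F([2, 5, 8, 100]))
```

The resulting function is continuous in the share `p_below` at the threshold: it equals `p_below` at y = b and rises to 1 at the largest tax-record income.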
In some cases, it may be convenient to obtain directly the Gini coefficient for the full population, $G[F]$. The next proposition shows that the Gini coefficient can be obtained as a simple linear combination of the Gini coefficients computed on the conditional distributions. We start from a few definitions.
Definition. The Gini coefficient $G[F]$ is defined as: [...], where the mean appearing in the expression is the unconditional mean of $Y$ (i.e., the population mean).
Definition. The Gini coefficient of a conditional distribution $G[F_j]$, $j \in \{1, 2\}$, is defined as: [...] These Gini coefficients are simply the standard Gini coefficients computed on each of the two distributions.
Proposition 2. The full-population (unconditional) Gini coefficient can be written as: [...] This proposition tells us that we can estimate the overall Gini coefficient as a linear combination of the Gini coefficients of the conditional distributions, plus an adjustment term that takes into account that the two conditional distributions are obtained from different underlying portions of the unconditional distribution.
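To make the idea of Proposition 2 concrete, the sketch below uses the standard Gini decomposition for two non-overlapping groups: G = p s1 G1 + (1 − p) s2 G2 + (s2 − (1 − p)), where p is the population share below b and s1, s2 are the income shares of the two groups. This form is assumed to agree in substance with the proposition, though its presentation may differ. The function names and the toy data are hypothetical, and the samples are treated as unweighted.

```python
import numpy as np


def gini(x, w=None):
    """Weighted Gini coefficient computed from the Lorenz curve (trapezoid rule)."""
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    order = np.argsort(x)
    x, w = x[order], w[order]
    # Cumulative population and income shares, starting from the origin.
    cum_pop = np.concatenate(([0.0], np.cumsum(w) / w.sum()))
    cum_inc = np.concatenate(([0.0], np.cumsum(x * w) / np.sum(x * w)))
    area = np.sum(np.diff(cum_pop) * (cum_inc[1:] + cum_inc[:-1]) / 2.0)
    return 1.0 - 2.0 * area


def gini_combined(survey_below_b, tax_above_b, p_below):
    """Combine the two conditional Gini coefficients into a population Gini."""
    x1 = np.asarray(survey_below_b, dtype=float)
    x2 = np.asarray(tax_above_b, dtype=float)
    mu = p_below * x1.mean() + (1.0 - p_below) * x2.mean()
    s1 = p_below * x1.mean() / mu   # income share of the bottom group
    s2 = 1.0 - s1                   # income share of the top group
    within = p_below * s1 * gini(x1) + (1.0 - p_below) * s2 * gini(x2)
    between = s2 - (1.0 - p_below)  # adjustment for the gap between groups
    return within + between


if __name__ == "__main__":
    bottom, top, p = [1, 2, 3, 4], [6, 8, 10, 12], 0.8
    # Cross-check against a direct computation on the pooled, reweighted data.
    weights = [p / 4] * 4 + [(1 - p) / 4] * 4
    print(gini_combined(bottom, top, p), gini(bottom + top, weights))
```

The cross-check in the last lines pools the two samples with weights that reproduce the population shares, so the two printed numbers should agree up to floating-point error.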

Implementation and Relation to the Literature
A crucial prerequisite in the derivation of Propositions 1 and 2 is that we can obtain a consistent estimate of the conditional distribution of income above and below a threshold b. The specific case of combining household surveys with tax records provides a natural example in which to apply this methodology.
Household surveys tend to be poorly representative of the top of the income distribution because high-income individuals tend to underreport or simply not report their income. While the survey data will have a bias that primarily affects the top of the income distribution, the bottom segment of the population might be much better represented in the survey. For simplicity, suppose that all incomes below a threshold a are well captured in the household survey. Similarly, for the case of tax records, only individuals with incomes above a threshold c file taxes, so only this high-income population is well represented. In this case, the dataset will correctly represent the conditional distribution of income above that threshold.
Suppose now that c < a. In other words, suppose that the level of income that triggers tax reporting is below the level of income above which we fear significant censoring in the household survey. Then, any income level b between c and a (c ≤ b ≤ a) will satisfy the following two properties: (1) all incomes above b are well represented in the tax records data (since they lie above the minimum threshold c that triggers tax filing); and (2) all incomes below b are well represented in the household survey, because censoring only occurs above a, and b ≤ a.
Any such b would be a valid income threshold to apply the methodology above. In practice, it is hard to know where censoring in the household survey starts (a). This suggests that rather than viewing b as a purely free parameter, one should optimally choose b to minimize the potential bias due to household survey underreporting at the top of the income distribution. If tax records are representative of incomes above the minimum threshold to file (c), the natural choice for the optimal b would be precisely c.
In other words, since we fear that household surveys underreport at the top of the income distribution, we optimally choose the lowest possible b that still ensures that incomes above b are well represented in the tax records.

Application to Colombia
In recent years, an increasing number of governments have been granting public access to administrative records and other information. The use of administrative records as a statistical tool is a recent trend that increases transparency and makes a greater wealth of information available to citizens, analysts, and policymakers. Tax records contain extensive information on firms and individuals that is often not captured by traditional data collection mechanisms such as household surveys, and, as such, they can provide valuable input to statistical systems. However, many Latin American countries have yet to make their income tax data public. Colombia is the first country in the region to share the disaggregated micro-data from personal income tax records.
This section applies the methodology described above to compute the Gini coefficient for Colombia in 2010 using data from both household surveys and administrative tax records. Colombians file their income tax returns differently depending on whether they are small or large taxpayers. This analysis considers both types of taxpayers: those who are not required to keep accounting books (small taxpayers) and those who are required to do so (large taxpayers). The small taxpayer data (Form 210) consist of a balanced micro-data panel and tabulations made by the tax agency in Colombia (Dirección de Impuestos y Aduanas Nacionales, DIAN). In the case of large taxpayers (Form 110), the data set consists only of a balanced micro-data panel. The data provide annual information about labor income, capital gains, other incomes, deductions, exemptions, and taxes paid.

Data
The Great Integrated Household Survey (GEIH) collects information about labor force conditions, socio-demographic characteristics, and different sources of income. The bottom of the income distribution is well represented in the GEIH, which collects data monthly and provides information at the national, urban, and rural levels as well as at the departmental level. In order to compare annual values from the administrative data with those from the household survey, we multiply by 14 the total individual monthly income (12 months plus a bonus equivalent to 2 months' pay).
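The annualization step can be written out explicitly. The short Python sketch below applies the factor of 14 described above and, as a convenience, converts the 2010 filing threshold of 81 million pesos back into its monthly equivalent. The function and constant names are hypothetical.

```python
MONTHS_PER_YEAR = 12
BONUS_MONTHS = 2  # bonus equivalent to two months' pay, as described in the text
ANNUAL_FACTOR = MONTHS_PER_YEAR + BONUS_MONTHS  # = 14


def annualize(monthly_income: float) -> float:
    """Convert a survey's monthly individual income into an annual figure."""
    return monthly_income * ANNUAL_FACTOR


def monthly_equivalent(annual_income: float) -> float:
    """Inverse conversion, useful for expressing an annual threshold in monthly terms."""
    return annual_income / ANNUAL_FACTOR


if __name__ == "__main__":
    print(annualize(1_000_000))            # annual income for 1 million pesos a month
    print(monthly_equivalent(81_000_000))  # monthly income matching the 2010 threshold
```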

Results
In general, all individuals whose incomes exceed a certain threshold appear as a tax unit in the administrative data. [6] Individuals with incomes below this threshold are not captured by the tax records, but they are represented in the household survey. In 2010, this threshold corresponded to 81 million Colombian pesos per year. As explained above, this income level is used to determine the parameter b of the formulas presented in Propositions 1 and 2: the bottom conditional distribution in the propositions is obtained from the household survey by focusing on all individuals with incomes below b, and the top conditional distribution uses all available tax records (since the threshold b is chosen to be the minimum income level that appears in the tax records). Since tax records are based on individual returns, our estimation only considers the total individual income data collected through household surveys. If more than one person per household files a tax return, they will appear separately, each with their own individual income. We define as a control for population all individuals aged 20 and above, since few individuals under 20 years of age contribute income tax revenue; excluding them from the denominator does not significantly affect the results (Atkinson 2007). [8]
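One way to turn this setup into inputs for Propositions 1 and 2 is sketched below in Python. It assumes that the share of the population below b is estimated as one minus the ratio of tax units to the population control; this is an assumption about the implementation, not a step stated in the text. The population figure is the 2010 control total reported in the paper, while the number of tax units and the sample incomes are purely hypothetical.

```python
import numpy as np

THRESHOLD_B = 81_000_000               # pesos per year, 2010 filing threshold
POPULATION_CONTROL_2010 = 28_104_576   # population control reported in the paper


def share_below_threshold(n_tax_units: int,
                          population: int = POPULATION_CONTROL_2010) -> float:
    """Assumed estimate of F(b): share of the control population not in the tax data."""
    if not 0 <= n_tax_units <= population:
        raise ValueError("number of tax units must lie between 0 and the population")
    return 1.0 - n_tax_units / population


def survey_below(annual_incomes, b: float = THRESHOLD_B):
    """Keep only survey incomes strictly below the threshold b."""
    incomes = np.asarray(annual_incomes, dtype=float)
    return incomes[incomes < b]


if __name__ == "__main__":
    # Hypothetical count of tax units; the true figure is not given here.
    print(share_below_threshold(1_000_000))
    print(survey_below([10e6, 50e6, 90e6]))
```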
The red vertical line in Figure 1 indicates the minimum total individual income needed to trigger mandatory tax filing.
Table 1 shows the estimates of the Gini coefficient using the methodology described above [...] (Table 1). Considering that the average annual reduction of the Gini coefficient in Latin American countries over the last 10 years was 0.51 percentage points, the differences in the estimates are not trivial. An even more striking difference can be seen when comparing these numbers to the ones obtained by Alvaredo [...] (Table 1). This shows that choosing an optimal threshold to combine the two datasets and explicitly considering the underreporting of incomes in the household survey when computing the Gini coefficients for conditional distributions can have dramatic effects on the measured coefficient. [11]
[6] A caveat is worth mentioning here: the level of tax evasion and its changes over time can affect the results obtained for the Gini coefficient.
[7] We calculate the total income distribution considering all personal incomes: labor income, transfers, remittances, and capital gains, among others. We also include pension claims, since in Colombia pension payments are considered labor income.
[8] The control for population uses population projections data for 2010. The total population is 28,104,576.
[9] Figure 2 is truncated at the highest income levels for readability.

Conclusion
This paper proposes a new methodology to optimally measure inequality by combining household survey and tax records data. The motivation stems from the fact that household surveys poorly represent the top of the income distribution, while tax records cover the top but not the bottom of the distribution. The key innovation of the proposed methodology is the choice of an income threshold b used to combine the two data sources; the paper shows that b should not be chosen arbitrarily, but should be chosen optimally to minimize the distortions from household survey underreporting at the top of the income distribution.
In particular, we discuss why the optimal income threshold b corresponds to the minimum income level required to file a tax return (i.e., the lowest income captured in the tax records).
After presenting the methodology, we apply it to the case of Colombia. We find that the Gini coefficient for Colombia in 2010 is 0.5978 when computed using an optimal income threshold of 81 million pesos, which is significantly different from the one that would be obtained if using an (arbitrary) threshold corresponding to the top 1 percent of the income distribution (0.5960). The difference of roughly 0.2 percentage points is not trivial, particularly when we consider that over the last 10 years the average annual reduction of the Gini coefficient in Latin America was 0.51 percentage points. The methodology presented in this paper builds on and improves upon the methodologies proposed by Atkinson (2007) and Alvaredo (2011).
[11] When applying the methodology presented in Alvaredo (2011) and Alvaredo and Londoño (2013), instead of estimating the Gini coefficient for the bottom 99 percent of the population using the bottom 99 percent of the household survey's distribution, an alternative could be to simply use the Gini coefficient of the full distribution of the household survey as an estimator of the Gini coefficient for the bottom 99 percent of the true income distribution. This would be correct under the assumption that the household survey underreports incomes exactly above the 99th percentile. The Gini coefficient computed under this assumption (which generally will not be true) would be 0.6468.

Proof of Proposition 2
Start by writing: [...]
Substituting the expressions above: [...]
Now note that [...]
Similarly, we obtain [...]
Substituting: [...]
Now note that by Fubini's theorem, since income is nonnegative, [...] $(1 - F(y))\,dy$ [...] $(1 - F(b))(1 - F_2(y))\,dy$ [...]
So: [...]
Now we focus on the part of $G[F]$ that does not involve $F(y)^2$ terms: [...]
The first part (first line) is: [...]
Now recall that the second distribution $F_2$ has support $(b, \infty)$ and is 0 below b, so that the mean income of the upper distribution can be computed as (these are conditional means): [...]
and the average income of the first distribution is: [...]
So we can write: [...]
Now we can look at the first part: [...]
So to conclude we have: [...]
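For readers who want to check the decomposition independently, the following is a short derivation sketch. It assumes the survival-function form of the Gini coefficient for nonnegative incomes, $G[F] = 1 - \frac{1}{\mu}\int_0^\infty (1 - F(y))^2\,dy$, which may differ in presentation from the definition used in the proof. The symbols $p = F(b)$, $\mu_1$, $\mu_2$, and $\mu = p\mu_1 + (1-p)\mu_2$ are notation introduced only for this sketch.

By the definition of the conditional distributions,
\[
1 - F(y) =
\begin{cases}
(1 - p) + p\,(1 - F_1(y)) & \text{if } y \le b, \\
(1 - p)\,(1 - F_2(y)) & \text{if } y > b.
\end{cases}
\]
Because $F_1$ has support in $[0, b]$ and $F_2$ has support in $(b, \infty)$,
\[
\mu_1 = \int_0^b (1 - F_1(y))\,dy, \qquad \mu_2 = b + \int_b^\infty (1 - F_2(y))\,dy,
\]
\[
\int_0^b (1 - F_1(y))^2\,dy = \mu_1\,(1 - G[F_1]), \qquad \int_b^\infty (1 - F_2(y))^2\,dy = \mu_2\,(1 - G[F_2]) - b.
\]
Splitting the integral at b and expanding the square on $[0, b]$ gives
\[
\begin{aligned}
\int_0^\infty (1 - F(y))^2\,dy
&= (1-p)^2 b + 2p(1-p)\mu_1 + p^2 \mu_1 (1 - G[F_1]) + (1-p)^2 \left[\mu_2 (1 - G[F_2]) - b\right] \\
&= 2p(1-p)\mu_1 + p^2 \mu_1 (1 - G[F_1]) + (1-p)^2 \mu_2 (1 - G[F_2]).
\end{aligned}
\]
Subtracting this from $\mu = p\mu_1 + (1-p)\mu_2$ and collecting terms,
\[
\mu\,G[F] = p^2 \mu_1\,G[F_1] + (1-p)^2 \mu_2\,G[F_2] + p(1-p)(\mu_2 - \mu_1),
\]
so that
\[
G[F] = p^2\,\frac{\mu_1}{\mu}\,G[F_1] + (1-p)^2\,\frac{\mu_2}{\mu}\,G[F_2] + p(1-p)\,\frac{\mu_2 - \mu_1}{\mu}.
\]
Writing $s_1 = p\mu_1/\mu$ and $s_2 = (1-p)\mu_2/\mu$ for the income shares of the two groups, this is $G[F] = p\,s_1\,G[F_1] + (1-p)\,s_2\,G[F_2] + s_2 - (1-p)$. As $p \to 1$ with $s_2$ held at $S^*$, the expression tends to $S^* + (1 - S^*)\,G[F_1]$, the approximation discussed in the Introduction.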