80575

WORLD BANK
ECONOMICS OF TOBACCO TOOLKIT
Editors: Ayda Yurekli & Joy de Beyer

Tool 2. Tobacco Data
Data for Economic Analysis
Christina Czart and Frank Chaloupka

DRAFT
USERS: PLEASE PROVIDE FEEDBACK AND COMMENTS TO
Joy de Beyer (jdebeyer@worldbank.org) and Ayda Yurekli (ayurekli@worldbank.org)
World Bank, MSN G7-702
1818 H Street NW
Washington DC, 20433 USA
Fax: (202) 522-3234

I. Introduction

What’s This Tool About?

This tool provides a general introduction to “the art” of building databases. It addresses a number of issues pertaining to the search for, identification of and preparation of data for meaningful economic analysis. It can best be thought of as a reference that supports the occasionally frustrated but endlessly hungry researcher working through the adventures of tobacco control analysis.

Who Can Use This Tool?

Anyone can reference this tool. Whether you’re a sociologist, an economist, a public policy researcher or another social scientist beginning new research in the economics of tobacco control, you are sure to find some assistance here. The list of potential users is endless: veteran economic researchers, student economists new to the tobacco industry, epidemiologists seeking economic evidence to complement their epidemiological findings, and policy makers eager to produce evidence that supports their political and economic agenda.

Regardless of your country’s culture, social traditions or economic system, similar rules concerning data collection and data analysis apply. This tool outlines the bare minimum for you.

How This Tool Works

This tool addresses a variety of data issues as they pertain to the economic analyses presented in the remaining tools of this toolkit. It has been designed to follow a series of step-by-step discussions and examples.
The tool follows:
- Step 1: A discussion of the different types of available data
- Step 2: Identification of possible data sources
- Step 3: Definitions and examples of the key variables required for the aggregate and individual level analyses presented in tools 3 through 7
- Step 4: Presentation of issues pertaining to data preparation and analysis, including:
  - Evaluate and clean raw data
  - Import raw data into a statistical package
  - Quality check the raw data
    - Missing observations
    - Outliers
  - Summarize raw data
  - Plot raw data
  - Recode survey data into usable form
  - Plot the data to determine the functional form to be used in analyses

II. The “What, When, Who, Where and How” About Collecting Data

What’s There to Know About Data?

Did You Know that Data Differs in Types?

It’s true! In the world of numbers, figures can be aggregated or summed to various sub-levels of our society. The national level is the highest categorization of data for any country. It captures summary numerical information about every person (e.g. the country’s total population), every item traded (e.g. prices, production) and every societal or economic mechanism (e.g. interest rates) and reports it for the country as a whole. The next sub-level breaks a country into regions (e.g. north, south, east, west). The same set of information (population, prices, etc.) is captured to reflect the average for each region (at “the regional level”).

It is helpful to think of levels of data by example. Let’s consider a frequently reported economic figure: earnings. In market economies across the globe, income or earnings is always a strong point of discussion. The Wall Street Journal highlights our country’s income or Gross Domestic Product (GDP) each and every day. Financial analysts and investment brokers report corporate earnings four times a year.
Census Bureaus and Central Statistical Offices regularly track household earnings in homes across their nations. Our employers track our hourly and monthly wages. We, as citizens, report our annual incomes to the government.

The above examples suggest that income can be measured and defined in a number of ways. For example, Mr. Smith’s employer, Company A, defines Mr. Smith’s income by establishing his hourly wage. Mr. Smith’s tax lawyer determines Mr. Smith’s income based on the annual sum of his wages earned at Company A plus the sum of his stock, savings and other investment earnings. A census worker reports Mr. Smith’s income along with Mrs. Smith’s income through a household income measure. Finally, the Wall Street Journal reports Mr. Smith’s income along with the income of thousands of other individuals and businesses through a national measure of income called Gross Domestic Product (GDP).

Each of the income definitions presented above is a valid measure of Mr. Smith’s income, but they present income at different levels (individual, household, national) and over different time units (hourly, monthly, annual). When conducting economic analyses, it is important to capture comparable and compatible sets of variables which, when combined, tell a clear and cohesive story about the person, community or nation you want to address.
Over the years, economic studies have used various types of data in their analyses, including:

Aggregate Time Series:
- Consists of several years of aggregate data
- Constructed from stacked annual national estimates

Aggregate Cross-Sectional:
- Consists of data drawn at a single moment in time
- Based on a nationally representative survey of households

Pooled Time-Series:
- Consists of several years of individual or household level data
- Pools together several years of aggregate cross-sectional data into a single database

Longitudinal:
- Consists of several years of individual level data
- Tracks and repeatedly surveys the same sample of individuals across time

These economic analyses have used a wide variety of statistical and econometric techniques to examine the effects of economic factors and socio-demographic characteristics on issues related to the supply of and demand for tobacco in the consumer market.

Do You Know Where to Begin Looking?

Various institutions around the world collect information about people and the societies they live in. Most often and most regularly, our own governments keep close track of our actions, including:
- Who we are (e.g. age, gender, race, religion, education)
- Where we live (e.g. city versus rural location, rates of migration)
- Who we live with (e.g. description of household including number of children, marital status)
- Where we work
- How we earn our income
- What we buy (e.g. expenditure on and consumption of various goods and services)

Most governments regularly conduct household surveys (monthly, annual) in order to map the demographic, socio-economic, expenditure and employment characteristics of their national society. By doing so, a government can better understand the economic and social conditions that exist and can identify the resources needed to improve national welfare.
Regardless of their political and/or economic systems, most governments today maintain large institutions equipped with a number of tools and methods for gathering such information.

Most countries in the world have two main bodies that gather, record, manage and disseminate data. These can be defined as:
1. A Centralized Data Collection Agency
2. Ministries or Departments of Government

Central Data Collection Agency

Different countries use different titles for their central data collection agency. Common names include: Central Statistical Office, National Bureau of Statistics, General Statistical Office or National Institute of Statistics. A national data collection agency generally has two main duties:
1. Collecting and publishing primary data through censuses as well as household and individual surveys
2. Gathering and reporting secondary data collected by Ministries or Departments of Government

The collection of data through a national agency helps ensure that the data gathered represents the entire national society and that it is not influenced by interest groups. Furthermore, because it is generally accepted that government has the authority to collect such data, cooperation in the data collection process is quite strong.

Ministries or Departments of Government

A central data collection agency often relies on other ministries or departments of government for help with the collection of national data. Ministries or Departments covering various sectors of society, including agriculture, commerce, finance, health, industry, justice, trade and others, regularly monitor relevant aspects of a society. The following Ministries or Departments of government are examples of tobacco relevant data sources.
• Ministry of Finance: directs and records tobacco taxes
• Ministry of Commerce: tracks all tobacco products, brands, prices and sales
• Department of Industry: oversees tobacco production
• Department of Agriculture: tracks tobacco farming
• Ministry or Department of Trade: monitors tobacco imports and exports; determines trade duties

Country data can be obtained from a number of different sources including: a country’s respective central data agency, international data sources (e.g. various organizations of the United Nations), non-governmental data sources (e.g. Action on Smoking and Health), private data companies (e.g. A.C. Nielsen) as well as select US agencies (e.g. the Centers for Disease Control and Prevention). Information on country specific as well as international data sources and how to access them is discussed in greater detail in section ___ of this tool.

Aggregate Data

Most countries in the world today report at least a basic set of national economic and social information. In addition, aggregate or “macro” level data is also largely available at sub-national levels of these societies and captures information that reflects regional, state, provincial, county or other jurisdictional divisions of the country.

The International Financial Statistics (IFS) of the International Monetary Fund (IMF) provide a good example of basic economic information that is collected and reported by countries each month. The IFS reports monthly figures for such economic measures as GDP, money supply, consumer prices (CPI), producer prices (PPI), interest rates and industrial production. Such information provides a researcher with a summary of the overall economic status or performance of each country. Such data is especially helpful when trying to account for fluctuations in inflation over time and differences in the cost of living across countries.
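As a rough illustration of how a CPI series is used to account for inflation, the sketch below applies the conversion this tool describes for real prices: divide the nominal value by the CPI level and multiply by 100. The function name `real_price`, the pack prices and the CPI levels are all invented for illustration and are not taken from any country's data.

```python
def real_price(nominal_price, cpi, base=100.0):
    """Deflate a nominal price: nominal / CPI * base (base-year CPI = 100)."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return nominal_price / cpi * base

# Hypothetical nominal pack prices and CPI levels, with 1995 as the base year.
nominal = {1995: 2.00, 1996: 2.30, 1997: 2.53}
cpi = {1995: 100.0, 1996: 115.0, 1997: 126.5}

# Real prices by year, rounded to two decimals for display.
real = {year: round(real_price(nominal[year], cpi[year]), 2) for year in nominal}
print(real)
```

In this made-up series the nominal price rises every year, yet the real price stays flat because the increases only keep pace with inflation.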
In most countries, a similar array of national figures specific to tobacco is reported to central authorities such as the Central Statistical Office, the Ministry of Finance, the Ministry of Commerce, the Ministry of Trade and others. The reported set of national and sub-national information may include: consumption and sales of tobacco products, retail prices and taxes for tobacco products, exports and imports of raw tobacco and finished tobacco products, information on consumer tobacco-related expenditures and demographic characteristics of consumers.

Consumption

Consumption represents product use. Therefore, data on tobacco consumption reflects the amount of tobacco products used by a consumer. Data on tobacco product consumption is required for any economic analysis related to the demand for tobacco. National and sub-national measures of use or consumption of cigarettes and/or other tobacco products are imperative to each of the tools presented in this toolkit and particularly to tools three, four and seven.

Tobacco consumption information can be obtained through surveys of households and/or individual consumers. National population surveys and censuses interview random samples of individuals and/or households in an effort to obtain behavioral and socio-economic information that will best describe the characteristics of the nation’s current population. See Appendix 1 for an example of a national individual population survey that captures socio-demographic information pertaining to the respondent.

Such surveys generally include a few direct questions about tobacco related behaviors. A survey will usually ask if the individual respondent or household uses tobacco, whether cigarettes in particular are smoked regularly and, if so, how much. In this manner, a country’s central statistical office or national bureau of statistics is able to gather direct consumption information from individuals and households.
Such information can later be used to represent current consumption statistics for the national population and to produce estimates of future tobacco consumption behaviors.

Example - Consumer Survey Regarding Smoking Behavior

How many cigarettes a day do you smoke on average? (One pack equals 20 cigarettes)
A. None
B. Less than one cigarette
C. Less than half a pack
D. About half a pack
E. More than half a pack, but less than a pack
F. A pack
G. More than a pack

Given the above example question regarding tobacco use, individual level data is aggregated to reflect national consumption measures. Aggregate consumption measures can be reported in two distinct formats:

Form A: The Prevalence of Tobacco Use

Individuals who report smoking “none” are defined as non-smokers, while those who report smoking anything from less than one cigarette per day upward (responses B through G) are defined as smokers. The percentage of defined smokers relative to the total number of respondents (smokers plus non-smokers) gives the prevalence rate of tobacco use within a national sample of respondents.

Form B: Conditional Demand for Tobacco

Using the above question on cigarette use, a quasi-continuous measure of daily cigarette consumption can be constructed. This demand measure is conditional on the respondent being a smoker.
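The two forms can be sketched in a few lines of code. The example below is only illustrative: it assumes responses are stored as the letters A through G from the example question, uses the cigarettes-per-day recode values that this section assigns to each response category, and works on an invented list of ten answers. The names `CIGS_PER_DAY`, `prevalence` and `conditional_demand` are hypothetical.

```python
# Letter codes follow the example question (A = none ... G = more than a pack).
# Values are the cigarettes-per-day recodes given in this section.
CIGS_PER_DAY = {"A": 0.0, "B": 0.5, "C": 5.0, "D": 10.0,
                "E": 15.0, "F": 20.0, "G": 30.0}

def prevalence(responses):
    """Form A: percentage of respondents reporting any smoking (B through G)."""
    if not responses:
        raise ValueError("no responses")
    smokers = sum(1 for r in responses if r != "A")
    return 100.0 * smokers / len(responses)

def conditional_demand(responses):
    """Form B: average cigarettes per day among smokers only."""
    amounts = [CIGS_PER_DAY[r] for r in responses if r != "A"]
    if not amounts:
        raise ValueError("no smokers in sample")
    return sum(amounts) / len(amounts)

# Hypothetical answers from ten respondents.
sample = ["A", "A", "A", "C", "D", "F", "A", "G", "A", "A"]
print(prevalence(sample))          # percent of respondents who smoke
print(conditional_demand(sample))  # cigarettes per day, smokers only
```

A real national estimate would normally also apply the survey's sampling weights, which this unweighted sketch leaves out.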
Using the format of the question presented above, the conditional demand for cigarettes is equal to a value of:
- 0.5, if on average a respondent smokes less than one cigarette per day
- 5, if on average a respondent smokes less than 10 cigarettes per day
- 10, if on average a respondent smokes approximately 10 cigarettes per day
- 15, if on average a respondent smokes between 10 and 20 cigarettes per day
- 20, if on average a respondent smokes a pack of 20 cigarettes per day
- 30, if on average a respondent smokes more than a pack of cigarettes per day

In order to produce an annual estimate that reflects the conditional demand of a national population, these individual averages are aggregated to a national level. Once this aggregation is complete, the resulting national average reflects the average daily cigarette consumption of smokers in the population. For example recodes of these consumption measures, please refer to section___ of this tool.

Smoking prevalence is defined as the percentage of current smokers in the total population. When talking about prevalence and tobacco use, pay attention to the type of tobacco product that is being addressed with this statistic. The prevalence of smoking covers people who report smoking tobacco in the form of cigarettes, bidis, cigarillos, cigars, pipes, rolled tobacco or others. A more comprehensive measure of tobacco consumption is the prevalence of all tobacco use, which includes the prevalence of smoking behavior plus the percentage of people who chew tobacco or use other forms of smokeless tobacco.

\[ \text{Smoking Prevalence} = \%\ \text{of Population Who Smoke} \]

Measures of smoking prevalence are often not comparable across countries because the basic definition of a current smoker tends to vary from country to country. Surveys are often administered to varying age, gender and social groups.
For example, adult daily smokers in country X may include everyone aged 16 years and over, while adult daily smokers in country Y may only include smokers aged 21 years and above.

The World Health Organization (WHO) defines a current smoker as someone who smokes at the time of the survey and has smoked daily for a period of at least six months (WHO, 1998). Other definitions of smoking prevalence are somewhat less restrictive. Other research groups have defined a current smoker as someone who has smoked one or more cigarettes in the 30 days prior to the survey.

Ongoing efforts by the WHO, the Centers for Disease Control and Prevention (CDC) and others aim to improve the consistency of survey data related to tobacco use across countries. The CDC’s Global Youth Tobacco Survey (GYTS) is an international, youth-focused survey that has been conducted in a large and continuously growing number of countries all over the world since 1999. This survey contains a standard set of questions, which are administered in the same manner across several countries. Such uniformity in survey design and survey administration makes it feasible to conduct a standard set of analyses across countries. Please contact the CDC (or view their website, http://www.cdc.org) for additional information on the GYTS surveys.

Beware of underreporting! While survey data provides generally accurate measures of prevalence (depending on the quality of the survey), there is some potential for individuals to underreport smoking and/or other tobacco use. This is particularly true in countries and among populations characterized by strong social disapproval of smoking behaviors. In addition, survey data on prevalence may also be biased as a result of the manner in which the survey is conducted.
For example, household surveys that are conducted orally by an interviewer can lead to inaccurately reported measures if the survey is not conducted privately; youth and young adults, for instance, are less likely to report honestly that they smoke when their parents may overhear their responses.

Finally, measures of total consumption derived from survey data are likely to be inaccurate. Past research has demonstrated that the level of total cigarette consumption derived from survey data on smoking participation and average cigarette consumption by smokers is significantly lower than cigarette sales. The degree of underreporting is likely to be positively related to the social disapproval of smoking.

The conditional demand for cigarettes is the actual number of cigarettes smoked by those consumers who have declared being cigarette smokers. For an example of nationally aggregated conditional demand figures, please see Appendix 2.

\[ \text{Cigarette Consumption} = \frac{\#\ \text{of Cigarettes (or Packs)}}{\text{Unit of Time}} \]

Define Pack Size! Standard Definition: 20 cigarettes/pack
Keep Units Consistent! Cigarettes smoked per:
- Day?
- Week?
- Month?

Researchers may define cigarette consumption as the number of cigarette packs (usually understood to consist of twenty cigarettes) smoked by individuals or households during a given unit of time (e.g. during the last month, during the last week or daily). Tobacco consumption questions generally ask about the number of packs or number of cigarettes consumed per month.

What is “a pack” of cigarettes? Survey designers should be sensitive to the fact that standard pack sizes vary from country to country (e.g. 10, 12, 20 or 25 individual cigarettes or “pieces” per pack). Also, in some countries, the sale of single cigarettes (cigarette “sticks”) is common. Survey questions that inquire about the number of packs consumed per month should clearly define the size of a cigarette pack in the survey questionnaire.
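One way to keep units consistent is to convert every reported quantity to cigarettes per day before analysis. The sketch below is a minimal illustration under stated assumptions: a 30-day month, a 7-day week, and a pack size that the analyst supplies for each country or survey. The function `cigarettes_per_day` and its parameters are hypothetical names, not part of any survey package.

```python
# Days in each reporting period; the 30-day month is a simplifying assumption.
DAYS_PER_PERIOD = {"day": 1.0, "week": 7.0, "month": 30.0}

def cigarettes_per_day(quantity, unit="cigarettes", period="day", pack_size=20):
    """Convert a reported quantity (packs or single cigarettes) to cigarettes per day."""
    if unit not in ("cigarettes", "packs"):
        raise ValueError("unit must be 'cigarettes' or 'packs'")
    if period not in DAYS_PER_PERIOD:
        raise ValueError("period must be 'day', 'week' or 'month'")
    if pack_size <= 0:
        raise ValueError("pack_size must be positive")
    cigarettes = quantity * pack_size if unit == "packs" else quantity
    return cigarettes / DAYS_PER_PERIOD[period]

# 30 packs per month means very different things for 20-stick and 10-stick packs.
print(cigarettes_per_day(30, unit="packs", period="month"))
print(cigarettes_per_day(30, unit="packs", period="month", pack_size=10))
# 140 single cigarettes reported per week.
print(cigarettes_per_day(140, unit="cigarettes", period="week"))
```

Making the pack size an explicit argument forces the analyst to state it, which is the same discipline the questionnaire itself should follow.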
Errors in the design and analysis of consumption variables often stem from confusion over the unit of the consumption measure. For example, a researcher may be under the impression that the consumption measure asked about in a survey is the number of packs consumed per month (e.g. 1 pack per day translates into approximately 30 packs per month), while the survey respondent may be reporting the number of single cigarettes (e.g. 20-30 individual cigarettes) smoked per day. To avoid miscoding consumption information, a worthwhile check on consumption measures is to verify the corresponding price per pack supplied by the survey respondent.

Sales

Cigarette sales information, specifically tax paid sales data, can be used as a proxy (substitute measure) for cigarette consumption in aggregate cigarette demand models. This means that total annual tax paid cigarette sales can be modified to produce per capita proxies of cigarette consumption. Per capita cigarette sales are computed by dividing total annual cigarette sales in country X at time y by the total population of country X at time y. Similarly, adult per capita cigarette sales can be obtained by dividing total annual cigarette sales by the appropriately defined adult population measure (15 years and older or 18 years and older are commonly used).

As with aggregate estimates of prevalence and consumption obtained from survey data, tax paid sales data can be systematically biased. This is particularly true for countries where there is a significant black market in tobacco products. In this case, cigarette sales provide an underestimate of total consumption. See tool 7 for alternative approaches to estimating the magnitude of the black market for tobacco products.

In addition, cigarette sales may provide misestimates for reasons related to hoarding.
Hoarding scenarios include:
- although a consumer purchases a pack of twenty cigarettes in time y, we cannot be certain that this individual consumes all twenty cigarettes in time y
- cigarettes may be purchased in large quantities in time y as a safeguard against higher taxes in time z. Such quantities often go unsold and unused in time y or time z, and are discarded after expiration

Therefore, cigarette sales, although an appropriate proxy for tobacco consumption, are likely to provide a distorted estimate of cigarette consumption and should be clearly distinguished from consumption data.

Tobacco Price

Cigarette and other tobacco price information is critical to each of the economic tools discussed in this volume. Price plays a critical role in tobacco demand estimates and is a key factor in most, if not all, economic issues related to tobacco, including smuggling and taxation.

Microeconomic theory dictates that as the price of a normal good rises, the quantity of that good that is demanded by a consumer falls.

[Figure: Microeconomic Theory Teaches. As the price of a normal good rises, the quantity demanded of that good falls. Vertical axis: Price of Good X (original price, increased price); horizontal axis: Quantity of Good X (original quantity, fallen quantity demanded).]

However, for many years, economists believed that because of their addictive nature, cigarettes and other tobacco products were not normal goods. As a result, it was believed that the consumption patterns of a tobacco consumer would not be responsive to changes in price.

[Figure: Economists Once Believed. Because of their addictive nature, as the price of tobacco products rises, the quantity demanded of tobacco will remain unchanged. Vertical axis: Price of Tobacco (original price, increased price); horizontal axis: Quantity of Tobacco (no change in quantity demanded).]

Today, through improved econometric techniques and sophisticated statistical programs, many studies have shown that the demand for tobacco is, in fact, sensitive to changes in the price of tobacco.
Many studies conclude that by altering the price of cigarettes (through tobacco taxation) governments can change tobacco use.

The sensitivity of tobacco demand to changes in tobacco prices is measured by the price elasticity of demand. It is defined as the percentage change in consumption that results from a 1% change in the price of a good.

\[ \text{Price Elasticity of Demand} = \frac{\%\ \text{Change in Cigarette Consumption}}{\%\ \text{Change in the Price of Cigarettes}} \]

In order to understand how price changes may influence smoking decisions, we need to measure the above ratio within the population at hand. This relationship between price and consumption carries very strong policy implications and helps us determine which taxes need to be altered, and by how much, to achieve a planned reduction in consumption. This in turn also provides estimates of how much government revenue will increase as a result of higher taxes and decreased consumption.

As Cigarette Price ↑ the Quantity Demanded of Cigarettes ↓

An increase in cigarette taxes and cigarette prices will affect smokers’ decisions about their smoking behavior through a number of mechanisms. For the addicted cigarette smoker, higher taxes and prices on cigarettes:
- have a negative effect on the number of cigarettes consumed
- often stimulate the decision to switch to cheaper brands of cigarettes
- encourage the decision to quit, or to begin thinking about quitting, the smoking habit

By the same token, higher tobacco prices also discourage those who do not smoke. That is, non-smokers, when faced with rising cigarette prices, may think twice before initiating smoking behaviors.

\[ \text{Cigarette Price} = \sum\ [\text{sales, excise, ad valorem, VAT}]\ \text{Taxes} \]

The monetary price of a pack of cigarettes that consumers encounter when purchasing their cigarettes consists of several individual and variable components of price. It includes the retail
It includes the retail 15 price of a pack of cigarettes plus any combination of the following tobacco taxes including: - percentage sales tax - flat excise tax - ad valorem tax - Value Added Taxes (VAT) A variety of tobacco price data may be used in demand analysis including the prices of various categories and types of tobacco products. For example, including the prices of alternative tobacco products in the demand analysis, is useful to understanding the potential for substitution among tobacco products in response to relative price changes. Real versus Nominal Cigarette Prices! The actual price paid by an individual at a particular moment in time is called the nominal price. However, in many econometric analyse of cigarette demand, a set of nominal prices should not be used. Instead, it is correct to use the real value of price. A deflated price measure Here, the price variable is adjusted for inflation. The common method for converting nominal prices into real prices is to divide the nominal price by the CPI level and multiply by 100. (For further details, see tool 3 and tool 6.) When price data are unavailable, tobacco product excise tax data are often a good proxy for price. Similarly to prices, tax levels on tobacco products tend to vary depending on the type, origin and size of the tobacco products. Research from developed countries has found that tobacco prices are very highly correlated with tobacco taxes and that increases in taxes are generally fully passed on to consumers. Using tax per pack in a demand equation gives an estimate of tax elasticity. Elasticities estimated from demand models which use tobacco tax rather than tobacco price must be converted to price elasticities (for discussion of conversion process, see Section in this Tool and Section in Tool 3). Employment Four types of employment related data are required to count the total number of jobs (employment) directly related to tobacco. 
Gathering tobacco employment information includes obtaining information on the number of jobs associated with:
1) tobacco farming
2) leaf marketing and processing
3) cigarette manufacturing
4) cigarette wholesaling and retailing

To obtain such information, researchers need to check specific data sources and publication agencies in their own countries. Generally speaking, such detailed employment data can be found in government statistical offices. For example, in the United Kingdom, this information is available from the Department of Employment. In most countries of Central and Eastern Europe, such information is available from the Central Statistical Office. In the U.S., the Bureau of Labor Statistics and the Department of Commerce publish most information on employment.

\[ \text{Total Tobacco Employment} = \text{Jobs in Tobacco Farming} + \text{Jobs in Tobacco Leaf Processing} + \text{Jobs in Cigarette Manufacturing} + \text{Jobs in Cigarette Wholesale/Retail} \]

Tobacco Leaf Processing

Tobacco leaf processing can be broken down into two specific components needed to prepare raw tobacco leaves for use in production. These two components are:
- the auctioning and warehousing of raw tobacco leaves
- the stemming and redrying of raw tobacco

Example: The United States
In the US, employment associated with leaf marketing and processing can be obtained from various publications produced by the Bureau of the Census.
- Auction warehousing information, specifically the number of auctioning establishments and corresponding employment statistics, can be obtained from the Census of Wholesale Trade.
- Information on the number of stemming and redrying establishments and corresponding employment is available from the Census of Manufactures.

The organization of tobacco production, and therefore the organization of tobacco leaf processing, varies from country to country.
As a result, in many countries, tobacco leaf auction warehousing may not be regarded as a separate production activity. Similarly, in many countries, the stemming and redrying of tobacco leaves may be considered part of the cigarette manufacturing industry. In such cases, employment associated with these two activities should not be estimated separately, since it is already counted in tobacco farming and manufacturing.

Cigarette Manufacturing

Data on employment in cigarette manufacturing are commonly available in government statistical offices. This information is usually classified according to market sectors or industries of the national economy.

Example: In the United States
- Information on the number of tobacco producing establishments and tobacco manufacturing jobs can be obtained from the Bureau of the Census’ Census of Manufactures.

Example: Across Countries
- The number of persons employed in cigarette manufacturing is often published by international organizations. The United Nations Industrial Development Organization (UNIDO) database is a good place to start your search.

Cigarette Wholesaling and Retailing

Cigarette wholesaling is performed by different entities in different countries.

Case 1: Monopolized Tobacco Industry
In many countries, cigarette manufacturing and sales are monopolized. In such environments, the wholesale of cigarettes is part of the cigarette manufacturing industry. Here, centralized manufacturers have regional depots and transport facilities for the distribution of tobacco products (a function otherwise performed by wholesalers). In these countries, the number of jobs related to wholesaling is included in statistics that measure total employment in cigarette manufacturing.

Case 2: Tobacco Industry Functioning in a Competitive Market
In many other countries, wholesaling is a distinct function in the competitive and open market. Here, wholesalers begin to handle tobacco products immediately after they leave the manufacturer.
Example: The United States
- In the U.S., employment associated with wholesaling can be obtained from the Bureau of the Census’ Census of Wholesale Trade.
- Estimates of jobs associated with cigarette retailing can be imputed from information on the number of distribution outlets, the total number employed in each outlet and the share of tobacco product sales.
- Information on the number of distribution outlets and the total number employed can be found in Employment and Earnings, published by the Bureau of Labor Statistics of the US Department of Labor.
- Tobacco’s share of total retail sales by individual retail outlet can be obtained from the Census of Retail Trade.

Note: In most countries, statistics on the distribution channels of tobacco products and tobacco’s share of total sales are poor. In such cases, a retailer survey is required to capture such information.

Tobacco-Related Employment versus Total Employment

Macro-employment measures are needed to estimate the proportion of tobacco related employment to total employment by sector. The relevant sectors of employment include: agricultural production, agricultural marketing, manufacturing, wholesaling and retail trade. These macro measures are used to create four ratios:

\[ \frac{\#\ \text{employed in tobacco farming}}{\text{total}\ \#\ \text{employed in agricultural production}} \]

\[ \frac{\#\ \text{employed in tobacco leaf marketing and processing}}{\text{total}\ \#\ \text{employed in agricultural marketing}} \]

\[ \frac{\#\ \text{employed in cigarette manufacturing}}{\text{total}\ \#\ \text{employed in manufacturing}} \]

\[ \frac{\#\ \text{employed in tobacco wholesaling and retailing}}{\text{total}\ \#\ \text{employed in wholesaling and retail trade}} \]

In many countries, information on employment by sector is available from government statistical data on employment. Researchers must check specific sources and publication agencies in their own countries.

Examples:
- In the United States, the Bureau of Labor Statistics and the Department of Commerce publish information on employment.
- In the United Kingdom, these data are available from the Department of Employment.
- In Poland and other formerly centrally planned Central and Eastern European (CEE) countries, employment information by industry and sector is available through the Central Statistical Office.

Other Data Relevant to Tobacco Employment
- Data on consumer expenditures on finished tobacco products are needed to examine the impacts of tobacco control policies on national and/or regional tobacco employment.
- Other information relevant to studying the impacts of tobacco control policies on employment and production (see next section) includes the amount of labor input required for the production of a unit of tobacco or an acre of planted (or harvested) tobacco. In the United States, this information is available through the Census Bureau's Census of Agriculture.

Tobacco Production

Annual national production of cigarettes (often reported in billions of cigarettes) is available in most countries. Information on tobacco production and on the acreage used in tobacco farming is frequently available through the national agricultural statistics of individual countries. When gathering information on tobacco production, it may also be useful to capture data that reflect the economic importance of tobacco to a given economy (both the value of raw tobacco and the value of finished tobacco products). Such measures include:
- the monetary value of tobacco leaf grown within a defined area
- the value added by tobacco manufacturing.

Example: Poland 1923-1998
Figure 1 provides an example of cigarette production data from Poland for the years 1923 through 1998. These data were obtained from Poland's Central Statistical Office yearbooks. Note: as in many European countries, a gap in cigarette production data occurs around the World War II period (1939-1947).
Sources of Tobacco Production Information:
- The Food and Agriculture Organization (FAO) of the United Nations (Production Yearbook) and the United States Department of Agriculture (World Tobacco Situations) regularly publish data on several sectors of agricultural production, including tobacco.

Consumer Expenditures

Tools 4, 5 and 6 show how information on individual or household expenditures on tobacco products, as well as household expenditures on other goods and services, is important to the economic analysis of tobacco.

Toolkit Examples

Tool 4
In Tool 4, you will learn that national tobacco tax revenues depend on the level of domestic legal sales of tobacco products. Consumer tobacco expenditure data are a good proxy for domestic cigarette sales, and so provide important sales information for tobacco taxation analyses. As a result, tobacco expenditure data allow for both simulation of optimal tobacco taxes and estimation of future tobacco sales and revenues.

Tool 5
Tool 5 shows how changes in consumer expenditures on finished tobacco products have an indirect yet determining effect on national (and/or sub-national) levels of employment. Because tobacco expenditures help shape employment in the tobacco industry and in other sectors of the economy, consumer expenditure information is valuable for basic employment analyses and simulations.

Tool 6
Consumer expenditures on tobacco, and expenditure ratios (explained below) in particular, are especially interesting when examined in the context of total household income. A key tobacco control policy concern is to understand how households from different socio-economic backgrounds differ in their tobacco expenditures. These and other equity issues are addressed further in Tool 6.
As with tobacco consumption data, information on consumer spending patterns (either by households or individuals) can be obtained from published or unpublished governmental statistics on consumer expenditure. Tobacco expenditure data are based on information collected through national surveys of random samples of households and/or individual consumers. These surveys generally include a few direct questions concerning general household expenditures, including tobacco product expenditures. National household surveys generally inquire about expenditures on a variety of household items including perishables (e.g. fruits, vegetables, meats, poultry, fish), dairy products, rice, potatoes, eggs, tea and coffee, alcoholic beverages, oils and fats, sugar, salt and, of course, tobacco. From such questions, a central statistical office or national bureau of statistics is able to gather expenditure information directly from individuals and households, which can later be used to represent total expenditures made by the national population.

Example: Cambodia National Socio-Economic Survey, 1999

Section Title: "Household Consumption Expenditures and Main Sources of Income"

Directions: The following questions should be asked of the head of household, spouse of the head of household, or of another adult household member, if both head and spouse are absent.

Question: What was the total value of food, beverages and tobacco consumed in your household during the past week?

Value of Consumption (in Riels), recorded for each item in the following columns:
- (1) Food item
- (2) Purchased
- (3) Own produce, gifts etc.
- Total consumption [= (2) + (3)]

Items listed:
- Rice
- Sugar, salt
- Fruit (banana, orange, mango, pineapple, lemon, watermelon, papaya, durian, grape, apple, canned and dried fruit, etc.)
- Meat (pork, beef, buffalo, mutton, dried meat, innards such as liver and spleen, and other meat)
- Tea, Coffee, Cocoa
- Tobacco Products (cigarettes, mild tobacco, strong tobacco)

* Question 16, taken from the 1999 Cambodia Socio-Economic Survey

As can be deduced from the questionnaire presented above, a simple comparison of household expenditures on tobacco products with household expenditures on other consumables reveals how important tobacco is to the national economy. That is, countries with low expenditures on tobacco products relative to expenditures on other necessary goods (e.g. rice, fruits, vegetables) conceivably have lower smoking prevalence rates and depend less on tobacco sales within their economy.

The Expenditure Ratio

By calculating an expenditure ratio, a researcher can better understand how important tobacco is in the lives of the national population and the livelihood of the local economy. The ratio relates tobacco expenditures to expenditures on other goods and services:

Expenditure Ratio = (monthly expenditure on tobacco products) / (monthly expenditure on food + housing + energy + other necessities)

Case A: If the Expenditure Ratio > 1
A value greater than 1 indicates that the average monthly amount of money spent on tobacco products is larger than the total monthly amount spent on other household consumables (e.g. food and beverages) and household necessities (e.g. housing and energy). A value of exactly 1 indicates that the two amounts are equal.

Case B: If the Expenditure Ratio = ½
A value of ½ means that households spend half as much each month on tobacco products as on other household consumables and necessities.

Case C: If the Expenditure Ratio = 0
A zero value for the expenditure ratio signals that a household spends none of its financial resources on tobacco products.
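The expenditure ratio defined above can be sketched in a few lines of Python. This is only an illustration: the function name, the budget categories and all of the figures are hypothetical and are not taken from any survey cited in this tool.

```python
def expenditure_ratio(tobacco_spending, necessity_spending):
    """Return monthly tobacco spending divided by total monthly
    spending on necessities (food, housing, energy, other)."""
    total_necessities = sum(necessity_spending.values())
    if total_necessities <= 0:
        raise ValueError("spending on necessities must be positive")
    return tobacco_spending / total_necessities


# Hypothetical monthly household budget, all in the same currency.
household = {"food": 300.0, "housing": 150.0, "energy": 40.0, "other": 10.0}

ratio = expenditure_ratio(50.0, household)
print(f"Expenditure ratio: {ratio:.2f}")  # 50 / 500 = 0.10
```

A ratio of 0.10 would mean that this hypothetical household spends one tenth as much on tobacco as on the listed necessities. Computing the same ratio separately for households in different income groups is one simple way to begin the kind of comparison discussed in Tool 6.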
Expenditures by Type of Tobacco Product

Household expenditures by type of tobacco product may also be available through the statistical offices of some countries. Depending on the country in question and the mix of legally available tobacco products, expenditure figures may be reported for one or more of the following:
- Cigarettes
- Bidis
- Cigars
- Cigarillos
- Chewing Tobacco

Comparisons of expenditure values across the categories listed above provide proxy measures for the market share of each tobacco product category. Cigarettes are the most widely used tobacco product in most countries around the world. As a result, cigarette expenditures generally dominate total tobacco expenditures.

Example: Expenditure Data in the United States
In the U.S., consumer expenditure information on various tobacco products can be found in tobacco statistics published by the U.S. Department of Agriculture and the U.S. Department of Commerce. In other countries, and particularly in the post-Stalinist economies of Central and Eastern Europe, these data are collected by the national Central Statistical Office. Similarly, in the Southeast Asian countries of Vietnam and Cambodia, these data have only recently begun to be collected by each country's National Office of Statistics.

Demographic Information

National socio-demographic information is used to summarize or define the population sample being examined. Commonly collected national demographic information includes:
1. Figures which provide a statistical breakdown of the population by year and by:
   - Age
   - Gender
   - Education level
   - Religious denomination
   - Area of residence (rural, urban, etc.)
2. Annual measures of gross household or gross per capita income
3. Annual measures of net household or net per capita income

Such information can be obtained from either the Central Statistical Office or the National Statistical Bureau of most middle income and developing countries. For an example of national socio-demographic data, please see Appendix 2.

Economic Indices

National economic indicators are required for even the simplest descriptive analyses of aggregate data. Two measures are always important:
- Gross Domestic Product (GDP): a measure of national income. This measure is also often cited as a good indicator of national economic performance. By dividing GDP by national population, a researcher can obtain a satisfactory proxy for income per capita. Similarly, by dividing GDP by the number of households, a researcher obtains a proxy for household income.
- Consumer Price Index (CPI): an aggregate measure of overall prices that serves as a popular indicator of inflation in a given economy. The rate of change in the CPI gives the rate of inflation in a given country. The CPI makes it possible to measure today's prices against an overall price level. In other words, the CPI is an instrument (a deflator) that allows economists and other researchers to deflate monetary measures (e.g. taxes, prices, income) to make them comparable over time. By deflating prices by the CPI, researchers are able to measure prices in real rather than nominal terms.

Example: Nominal versus Real Cigarette Prices
The nominal price of a pack of cigarettes is the pack's current monetary value or absolute price. Consider the following scenario. Today, supermarkets in the United States sell a pack of regular Marlboro cigarettes for a nominal price of approximately $3.25, while ten years ago a pack of Marlboro cigarettes sold for $2.75.
From this example scenario, we can conclude that the nominal price of a pack of Marlboro cigarettes was $2.75 in 1991 and $3.25 in 2001. The real price of a pack of cigarettes is its nominal price relative to the CPI. By dividing each nominal price by its respective CPI measure, a researcher is able to compare the two prices against one another.

Assuming a 1991 CPI of 1.20 and a 2001 CPI of 1.32:

Real Cigarette Price in 1991 = $2.75 / 1.20 = $2.29
Real Cigarette Price in 2001 = $3.25 / 1.32 = $2.46

Therefore, according to the data provided above, both the nominal and the real price of a pack of regular Marlboro cigarettes were higher in 2001 than in 1991.

The economic indices described above are easily obtained through a number of different sources. As mentioned at the beginning of this tool, the International Monetary Fund's (IMF) International Financial Statistics (IFS) provide monthly, up-to-date reports of both GDP and CPI.

Tobacco Trade Information

Trade information is important to both Tool 5 (analyses of tobacco related employment) and Tool 7 (issues in tobacco smuggling). Important trade measures include tobacco and/or cigarette:
1. Exports
2. Imports
3. Domestic sales
4. Export sales

Tobacco trade data are available through the Central Statistical Office or the National Statistics Bureau of most countries. Information is also likely to be available, although less accessible, through the national Ministry or Department of Trade.

The Market for Tobacco

In addition to understanding the regulatory environment surrounding tobacco, it is often helpful to gain a detailed and descriptive understanding of what is happening on both the demand and supply sides of the tobacco market. This includes understanding annual production as well as the market shares of various tobacco producers and of their product brands, sizes and subcategories.
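As a rough illustration of the market share idea, a producer's share can be computed as its sales volume divided by total sales across all producers. The sketch below assumes this simple volume-based definition; the producer names and sales volumes are hypothetical and do not come from any source cited in this tool.

```python
def market_shares(sales_by_producer):
    """Return each producer's share of total sales volume (0 to 1)."""
    total = sum(sales_by_producer.values())
    if total <= 0:
        raise ValueError("total sales must be positive")
    return {name: volume / total for name, volume in sales_by_producer.items()}


# Hypothetical annual sales, in billions of cigarettes.
sales = {"Producer A": 45.0, "Producer B": 30.0, "Producer C": 25.0}

shares = market_shares(sales)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The same function can be applied to sales grouped by product type, cigarette category, size, packaging or brand, provided the sales figures are measured in the same units.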
Descriptions of the tobacco market can be calculated from tobacco related data that are available through various government sources (the CSO or Bureau of Statistics, the Ministry of Finance and the Ministry of Commerce). Private institutions also monitor and track the activities, practices and performance of the tobacco industry. The Marketfile is an example of a privately held organization that monitors tobacco in countries worldwide. (For more detailed information regarding the Marketfile's tobacco data and reports, see the company website: http://www.marketfile.com/market/tobacco/)

Examples of Market Analyses

Calculating Market Shares of Various Tobacco Products

Market shares according to:

I. Type of Tobacco Product
- Cigarettes
- Cigars
- Cigarillos
- Smokeless Tobacco
- Loose Tobacco
- Other Tobacco products

II. Cigarette Category
- Filtered Cigarettes
- Unfiltered Cigarettes
- Menthol Flavored Cigarettes
- Lights (and ultra light) Cigarettes
- Other emerging categories

III. Cigarette Size
- under 70mm
- regular size (70mm)
- king size
- superkings

IV. Cigarette Packaging
- Soft packs
- Box packs
- Cartons

V. Cigarette Producer
- Domestic Producers (vary from country to country)
- International Conglomerates
  - Examples:
    - Philip Morris
    - RJ Reynolds
    - British American Tobacco

VI. Cigarette Brand
- Domestic Brands (vary from country to country)
- International Brands
  - Examples:
    - Marlboro
    - L&M
    - Winston
    - Lucky Strike
    - Salem

Tobacco Price Measures

The retail prices of cigarette packs are available from a number of governmental as well as private data sources. Generally, cigarette price information can be obtained, by request, from the Ministry of Finance and/or the Ministry of Commerce. Both of these Ministries are required to track the retail prices of tobacco products. The Ministry of Finance does so because of the tax implications associated with varying tobacco prices.
The Ministry of Commerce monitors the retail prices of nearly all goods sold in the domestic market. In addition to Ministries, many CSOs or governmental data bureaus also report the retail price of cigarettes and/or smokeless tobacco in their annual data yearbooks.

Alternatively, tobacco price data can be purchased from a number of private data collection firms. For example, AC Nielsen collects cigarette price data in a wide range of developed, middle income and developing countries. Other international private data collection firms include Information Resources International (IRI) and Taylor Nelson Sofres. See section __ for additional information regarding these sources.

Tobacco Regulatory Environment

Any researcher conducting tobacco related economic analyses in a given country must, first and foremost, understand the regulatory environment surrounding tobacco products in that country. A complete understanding of a country's tobacco regulatory environment is required for a number of analyses, particularly the tobacco demand analyses presented in Tool 3 and the studies of tobacco smuggling presented in Tool 7. In collecting these data, both the date on which a regulation was announced to the public and the actual date of enactment should be recorded. Key regulatory information includes:

I. Tobacco Taxation
- How is tobacco taxed?
  - Taxes on raw tobacco leaves
  - Import duties
  - Excise, ad valorem and sales taxes

II. Restrictions on smoking
- Are there legal restrictions on smoking?
- If so, where is smoking restricted?
- What is the extent of these restrictions?
  - Is smoking totally banned in workplaces, theaters, health care facilities, etc.?
  - Is smoking partially banned in workplaces, theaters, health care facilities, etc.?
- To what extent are these smoking restrictions enforced?

III. Advertising restrictions
- Are there legal restrictions on cigarette advertising?
- If so, what is the extent of the restriction?
  - Is cigarette advertising totally banned?
  - Is cigarette advertising partially banned?
- Are advertising restrictions strictly enforced by authorities?

IV. Restrictions on Youth Access
- Is there a minimum age requirement for the legal sale and/or purchase of tobacco products?
- To what extent are youth access laws enforced?
- What are the associated fees, fines, etc. for violations of youth access laws?

V. Counter advertising
- Is information on the consequences of tobacco use (counter advertising) propagated nationally and locally?
- If so, how? Which of the following policies are required by government?
  - Government issued health warning labels on cigarette packs
  - Government issued health warning labels on cigarette advertisements
  - Warnings against underage purchase of tobacco products
  - Warnings of penalties for underage purchase of tobacco products
- What is the industry's policy on counter-advertising?
  - Does the industry post warnings against underage purchase of tobacco products?

VI. Access to smoking cessation therapies
- Are cessation therapies accessible in the market?
- If so, then:
  - Which therapies are available?
    1. Pharmaceutical treatments
       - Nicotine Replacement Therapies (NRTs) including nasal sprays, microtabs, patches, gum, inhalators
       - Zyban
       - Nicotine analogs (e.g. Tabex)
       - Herbal curatives (e.g. Tobaccoff, Nicofree)
    2. Non-pharmaceutical methods
       - Hypnosis
       - Acupuncture
       - Behavioral methods (e.g. individual, family and group therapies; self-control)
    3. Cessation accessories
       - filters
       - fake cigarettes
       - lock-boxes
  - Are they available over the counter or by prescription only?
  - How are they priced?

The Pitfalls of Using Aggregate Data

As with any analysis that involves data, a researcher should be aware of a number of obstacles that may arise when using a gathered set of aggregated variables. The following paragraphs highlight some of the issues that every researcher should be aware of when using aggregate measures in economic analyses.
Pitfall #1: Multicollinearity
Several difficulties are encountered in studies that use aggregate level time-series data. One such difficulty originates from the high correlations that exist between price and many other key independent variables. For example, in cigarette demand models, estimated price and income elasticities of demand depend on which descriptive variables (those which control for the effects of other important determinants of smoking such as advertising, health awareness, etc.) have been included in the model. Consequently, estimates of the impacts of price and other factors on the demand for cigarettes will be sensitive to which variables are or are not included in the econometric models. Including highly correlated variables may result in multicollinearity and unstable estimates. At the same time, excluding variables that are potentially significant and important to cigarette demand may produce biased estimates of the impact of price on demand.

Pitfall #2: Bias
A second complication arises when tax paid sales are used as measures of sales and/or consumption. Tax paid sales do not always reflect true consumption. More specifically, in those countries or regions where cross-border shopping and smuggling are significant, tax paid sales are likely to understate consumption in jurisdictions with relatively high tobacco taxes and prices. At the same time, consumption may be overstated in relatively low tax and price jurisdictions. Failing to account for such factors can produce upward-biased estimates of the impact of price and taxes.

Pitfall #3: Simultaneity
A third problem in the analysis of aggregate data arises because cigarette (or other tobacco product) prices, sales and consumption are simultaneously determined. That is, all three measures are determined by the simultaneous interaction of both the supply of and demand for cigarettes or other tobacco products.
Failing to account for this simultaneity leads to biased estimates of the effect of price. Several studies have tried to model the supply and demand for cigarettes theoretically, and others have used data from large natural experiments (e.g. large increases in cigarette taxes) to avoid the simultaneity issue.

Pitfall #4: Limitations from Units of Measure
Finally, studies that use aggregate data are limited to estimating the impact of changes in prices and other factors on aggregate or per capita estimates of cigarette consumption. Therefore, these studies cannot provide information on the effects of these factors on specific issues such as the prevalence of tobacco use, initiation, cessation, or the quantity and/or type of tobacco product consumed. Also, these studies do not allow one to explore differences in responsiveness to changes in price or other factors among population subgroups that may be of particular interest (e.g. by age, gender, race/ethnicity, socioeconomic status, education, etc.).

Individual Level Data

An increasing number of studies use data on individuals derived from large-scale surveys. In cigarette demand models, the price elasticities of demand estimated using individual level data are comparable to those estimated using aggregate data. Individual data taken from surveys help avoid some of the problems that arise with the use of aggregate data. First, data collected by individual surveys provide measures of smoking prevalence and cigarette consumption. This helps avoid some of the difficulties associated with using sales data as a proxy for consumption. Second, because an individual's smoking decisions are too small to affect the market price of cigarettes, potential simultaneity biases are less likely. Similarly, individual-level income data and other key socio-demographic determinants of demand are less correlated with price and policy variables than are comparable aggregate measures.
This creates fewer estimation problems and is likely to produce more stable parameter estimates. Finally, the use of individual-level data allows for the exploration of issues that are more difficult to address with aggregate data, including estimating separate effects of price and other factors on smoking prevalence, frequency and level of use, initiation, cessation, and type of product consumed. Also, each of these can be examined in the context of various population subgroups.

For example, the Living Standards Measurement Surveys (LSMS) are country level household surveys conducted in collaboration with the World Bank. The LSMS surveys collect information that is representative of entire households as well as information on individuals residing within a household. Such survey data allow researchers to explore the effects of individual or general population characteristics such as gender, age, income, marital status, education, religion, social status and occupation on smoker responsiveness to changes in tobacco prices, taxes, availability and access. Individual level survey data, in particular, allow for the estimation of the impacts of prices and tobacco related policies on smoking prevalence, initiation and cessation, as well as on the quantity or type of cigarettes purchased and consumed.

The following paragraphs define a number of important variables needed to conduct economic analyses related to tobacco and tobacco control. They also include a number of examples of survey questions that have been used to gather social and economic information from individual respondents.

Consumption

Various forms of tobacco consumption data can be obtained from an individual survey respondent. Individual information on current smoking participation (do you smoke presently?) and the nature of smoking behavior (are you a daily, occasional, never or ex-smoker?) may be obtained from carefully designed survey questionnaires.
The following series of example questions was taken from surveys conducted in Poland since 1973.

Example: Questions Used to Capture Cigarette Consumption

Q1. Have you smoked at least 100 cigarettes during the course of your lifetime?
A. Yes
B. No

Q2. How old were you when you began smoking regularly?
Age: _________________

Q3. Do you presently smoke tobacco?
A. Yes
B. No

Q4. Have you ever smoked tobacco daily for a period of 6 months?
A. Yes
B. No

Q5. How old were you when you quit smoking?
Age: _________________

Q6. During the past 6 months, did you smoke tobacco daily?
A. Yes
B. No

Q7_A1. During the past six months did you smoke filtered cigarettes?
A. Yes
B. No

Q7_A2. How many filtered cigarettes do you usually smoke?
_____________________________

Q7_B1. During the past six months did you smoke unfiltered cigarettes?
A. Yes
B. No

Q7_B2. How many unfiltered cigarettes do you usually smoke?
_____________________________

This series of questions allows a researcher to extract various consumption related information for an individual respondent, including smoking participation, the number of cigarettes smoked, smoking frequency, smoking intensity and type of smoker (daily, occasional, never and ex-smoker). Questions Q2, Q4 and Q5 show that smoking behaviors, as they relate to the age of a respondent, are also important in defining key consumption data. Information on the age at which a respondent began smoking regularly (Q2), whether the respondent has ever smoked daily over a sustained period (Q4) and the age at which the respondent quit smoking (Q5) is collected in order to extract information on average age of initiation, length of use, extent of addiction and average age of successful cessation.

Tobacco Price

Self-reported price per pack measures provide researchers with an alternative measure of tobacco price. Although highly endogenous, these reported prices can be aggregated to either a city or regional level to reflect the average local or regional price paid per pack of cigarettes.
These price measures also provide a good basis of comparison for the price data collected by governments and/or private agencies.

Example 1: Open-ended question concerning consumer cigarette price
Q1. What is the price per pack (a pack is 20 cigarettes) of the cigarettes you smoke most often?
Answer: __________________

Example 2: Closed question concerning consumer cigarette price
Q1. How much do you usually pay for a pack of the cigarettes you usually smoke?
A. Do not smoke
B. Less than $3.00 per pack
C. $3.00-$3.49 per pack
D. $3.50-$3.99 per pack
E. $4.00-$4.49 per pack
F. $4.50-$5.00 per pack
G. Over $5.00 per pack

This price information is not entirely independent of respondents' decisions about whether to smoke and how much to smoke. That is, because surveys collect self-reported cigarette price information from respondents who already smoke, these reported prices can reflect endogenous choices, particularly when it comes to the choice of cigarette brand and cigarette quality. As a result, the price variable may be correlated with unobservable differences in preferences, yielding biased estimates in analyses that depend on this price measure. This produces a number of analytical concerns:
1. Smokers who smoke heavily may be more likely than other smokers to seek out lower priced cigarettes.
2. Smokers may be more likely to purchase cigarettes in larger quantities to which significant market discounts apply (e.g. by the carton rather than the single pack).
3. Heavy smokers in particular may be prone to smoke less expensive cigarette brands.

For any of the above reasons, analyses using these self-reported prices may produce biased estimates of the effects of price on smoking behavior. One way to help reduce bias in price estimates is to include a few additional questions concerning brand and product type in a survey which already asks for self-reported cigarette price. Example 3 highlights possible additional survey questions.
Example 3: Additional Questions to Help Minimize Biased Estimates

Q2. Which brand of cigarettes do you smoke most often?
_______________________

Q3. What size cigarettes do you smoke most often?
A. Less than 70mm
B. 70mm (Regular Size)
C. Over 100mm (King Size)
D. Other: __________________________

Q4. What type of cigarettes do you smoke most often? (Mark all that apply)
A. Lights
B. Ultra Lights
C. Filtered
D. Unfiltered
E. Menthol

In order to help avoid biased estimates, researchers should test whether or not the price variable used in the relevant econometric model is exogenous. Various estimation methods can be applied, including:
A. Two-stage least squares (2SLS) estimation with an instrumental variable (IV) approach
B. Cragg's (1971) two-part model

Additional methods to help solve the endogeneity problem in self-reported price variables are discussed in greater detail in Tool 3.

Tobacco Taxes as a Proxy for Price

In cases where a cigarette price measure is clearly endogenous to an estimating equation, tobacco taxes may be used as a proxy for retail price. Tobacco taxes are regarded as good proxies for tobacco prices, particularly because tobacco taxes (either national or local) are generally independent of an individual's decision to smoke and/or how much to smoke. As a result, the most appropriate proxy for the retail price of a pack of cigarettes is the total per pack tobacco tax.

In cigarette demand equations, the use of a tax per pack measure yields an estimate of tax elasticity. The tax elasticity must be converted into a price elasticity. Consider the following linear demand model:

$$\text{Consumption} = \alpha + \beta_t \,\text{Tax} + \varepsilon$$

Here, the tax variable is used instead of a price variable to estimate the demand for cigarettes. Because a change in tax affects consumption only through its effect on price, the response of consumption to price is $\beta_t / (\partial p / \partial t)$, and the price elasticity of demand evaluated at the sample means is:

$$\eta_p = \beta_t \cdot \frac{1}{\partial p / \partial t} \cdot \frac{\bar{p}}{\bar{y}}$$

where $\beta_t$ is the estimated coefficient of the tax variable in the regression equation above; $\bar{p}$ is the sample mean of the cigarette price; $\bar{y}$ is the sample mean of per capita cigarette consumption; and $\partial p / \partial t$ is the change in cigarette prices resulting from a change in excise taxes. The latter can be estimated by regressing price on tax, where the estimated coefficient of tax ($\delta$) is $\partial p / \partial t$:

$$\text{Price} = \gamma + \delta \,\text{Tax} + \nu$$

In econometric models, variation in data points allows for statistically significant and sound findings. This is also true for price measures. Cigarette demand studies typically obtain variation in price from tax differences across time and jurisdictions. For example, in the United States, the fifty U.S. states and Washington DC have different levels of cigarette taxes, so a single cross-section of a national survey has considerable variation in tax measures. On the other hand, tax levels in most developing countries, particularly smaller countries, rarely vary within the country because local taxes are rarely levied. Here, one, two or even three years of household or individual level survey data do not provide enough variation in prices or taxes to be used in statistical analyses.

In most countries, cigarettes are frequently taxed at different rates based on length, production size, quality, type, manufacturing process (hand-made, machine-made) and origin. Once the characteristics of the cigarettes which individuals smoke are identified from the survey data, there may be enough tax variation within a single cross-sectional sample. If there is no information on tobacco product characteristics other than price, then the researcher should find other sources that show very detailed price information by type, size, quality, origin, etc. This information is generally available from commerce departments and/or customs and tax administration departments in a country's Ministry of Finance.
Such information allows researchers to use prices to infer the types of cigarettes smoked and to assign the corresponding tax level. Researchers should be aware of price variations of brands in urban versus rural areas, and across different types of points of sale.

Measures of Income

A survey respondent is often asked to report his or her income. Common income measures include net total household income per month or net total per capita household income per month.

Example 1: Per capita Income

Q. What is your household’s total net per capita income per month (include all employment, investment and governmental or non-governmental benefit earnings)?
Answer: ________________

Alternatively, a survey may ask for net total household income and, in a separate question, the number of persons residing in the household. This question format allows a researcher to obtain information about household size, household income and per capita income.

Example 2: Household Income

Q1. What is your net total household income per month (include all employment, investment and governmental or non-governmental benefit earnings)?
Answer: _____________

Q2. How many people constitute your household? (Mark one reply)
A. 1   B. 2   C. 3   D. 4   E. 5   F. 6   G. 7   H. 8   I. 9

Example 3: Using a Proxy for Income

Surveys often ask individuals or households to report information on their:
A. Education
B. Self-reported standard of living
C. Occupation

These measures are often highly correlated with measures of income. That is, as educational attainment, standard of living or level of occupation increases, so does the associated level of income earned. As a result, these variables serve as good proxies for per capita income or household income.

Example 3A: Education as a Proxy for Income

The following are sample survey questions about an individual respondent’s education.

Q. What is your educational background?
A. Less than or equal to primary education
B. Technical/Vocational School
C. Less than high school
D. High School
E. Some technical schooling beyond high school
F. Some college level schooling
G. A college degree

Where income levels of a child, youth or young adult are needed, questions regarding parental education may be used. Examples of these types of questions include:

Q. Did your parents (mother, father) attend college?
A. Neither father nor mother attended college
B. Father attended college
C. Mother attended college
D. Both father and mother attended college

Q. How far did your father (mother) go in school?
A. Less than high school
B. High School
C. Some college or technical schooling beyond high school
D. Four year college degree or more
E. Don’t know
F. Not applicable

Note: In addition to its income effect, education also has a negative effect on smoking decisions. More highly educated individuals are more likely to have access to information on the adverse health impacts of tobacco use and may therefore reduce their tobacco consumption even as income levels rise. The overall effect of this income proxy on tobacco consumption thus depends on which of the two effects, the income effect or the information effect, is stronger.

Example 3B: Standard of Living as a Proxy for Income

The example question below captures income information by asking about an individual respondent’s standard of living.

Q. How would you best describe your standard of living?
A. Very good
B. Good
C. Fair
D. Rather poor
E. Very poor

Example 3C: Occupation as a Proxy for Income

The following question gives an example of the occupation categories that may be offered to a respondent in an individual level survey.

Q. Which best describes your current occupation?
A. Management
B. Unskilled
C. Skilled
D. Farmer
E. Self-employed
F. Student
G. Disabled
H. Unemployed
I. Housewife
J. Do not work

Socio-Demographic Information

Other socio-demographic data that can easily be collected in individual surveys include measures of age, gender, race, ethnicity, religious denomination, religious participation, religiosity, marital status, number of children, household structure, employment status, type of employment, educational attainment, area of residence, and more. Prior research has shown that these socio-economic and demographic factors can be important determinants of tobacco use, expenditures, and other related issues. The following highlights a few examples of individual level socio-economic and demographic information.

Age

Surveys can ask a respondent his or her age or inquire about the respondent’s date of birth. Once the ages of respondents are known, age groups can be defined (e.g. 16-25, 26-40, 41-55 etc.) and a dummy indicator can be constructed for each age range. Each person is assigned a value of 1 for the age group variable that corresponds to his or her current age, and 0 for the others.

Religion

Some religions are openly opposed to smoking and other addictive or substance use behaviors. Examples include the Mormon faith in the United States and Islam in Egypt. Surveys often aim to identify the religious denomination, religious participation and/or religiosity of respondents.

Examples: Questions pertaining to Religion

What is your religious denomination?
a. Atheist
b. Catholic
c. Jewish
d. Muslim
e. Protestant
f. Other religion. Please specify: ______________

How religious are you?
a. Very religious
b. Somewhat religious
c. A little religious
d. Not religious

Do you participate in religious services or practices?
a. Yes, a few times per week
b. Yes, once per week
c. Yes, once or twice per month
d. Yes, a few times per year
e. Do not participate in religious services or practices

How would you describe your position in relation to your faith?
a. Religious and regularly attend services
b. Religious but irregularly attend services
c. Religious but do not practice my faith
d. Atheist

How important is your religion in your decision not to smoke?
a. Very Important
b. Important
c. Somewhat Important
d. Not at all Important

The Pitfalls with Using Individual Level Data

Like analyses based on aggregate data, analyses that use individual-level data face a number of challenges.

Pitfall #1: Bias

First, such data may be subject to an ecological bias, in that omitted variables that affect tobacco use are correlated with the included variables. Excluding such variables may produce biased estimates for the included variables.

Second, individual-level data are subject to potential reporting biases. A comparison of self-reported consumption with aggregate sales data by Warner (1978) shows that survey-based, self-reported consumption understates actual sales. Underreporting of consumption may cause problems in interpreting estimates produced from individual-level data. In general, studies using individual-level data assume that the extent of underreporting among respondents is proportional to their actual level of use. This assumption implies that the estimated effects of price and other factors will not be systematically biased. However, the assumption has yet to be demonstrated.

Third, as with aggregate data, failing to account for differences in cigarette prices across countries or regional borders may bias elasticity estimates towards zero. With individual level data, one often has information on where an individual resides, and studies have used a number of approaches to control for the cross-border shopping that results from differing tobacco prices. Some studies have limited their samples to individuals who do not live near lower-price localities (Lewit and Coate, 1982; Wasserman et al., 1991; Chaloupka and Grossman, 1996; Chaloupka and Wechsler, 1997).
Other analyses have included an indicator of a price differential (Lewit et al., 1981; Chaloupka and Pacula, 1998a, 1998b). Still others have used a weighted average price based on the price in the respondent’s own locality as well as the prices found in nearby localities (Chaloupka, 1991).

Pitfall #2: Limitations in available data

Another limitation of individual level survey data is that data on price, availability, advertising, policies, and other important macro-level determinants of demand are generally not collected in the surveys. As a result, many relevant variables may be omitted from the analysis.

III. Data Preparation and Management: Easy Steps to Building Your Own Database

Choosing a Software Package

Today, statistical researchers and analysts can draw upon a number of computer-based tools that facilitate data manipulation, variable construction and analysis. In general, a successful software package is one that is flexible and easy to use yet powerful enough to handle large amounts of data in the shortest time possible.

The selection of a software package depends mostly on budget and desired program performance. The market price of statistical software varies from producer to producer, as do the power and sophistication of the software. For the purposes of this toolkit, a “good” software package should provide the following at an affordable price:
• ease of data access
• sufficient capacity to manage and manipulate data
• availability of moderately advanced statistical tools
• capability to present analysis results easily and clearly

The statistical software market offers a number of differently equipped packages at widely ranging costs. The following paragraphs highlight some of the possibilities.

Spreadsheets

In recent years, traditionally easy-to-use spreadsheet packages have been greatly improved and, as a result, have become quite sophisticated analysis tools.
For example, Microsoft Excel and Corel Quattro Pro are two popular spreadsheet programs that are compatible with all Microsoft Windows operating systems. Spreadsheet programs are easily accessible and are almost always included with the basic software of a new computer. In general, spreadsheet programs are reasonably priced and offer straightforward methods for data access and manipulation, but provide only moderate capabilities for statistical data analysis. Because of their limited space and computing power, spreadsheet programs are really only equipped to handle aggregated sets of data.

Statistical Packages

Popular higher-powered statistical programs include SAS, SPSS and STATA. These packages are equipped to handle much larger bodies of data than the spreadsheet programs mentioned above. All three offer a large array of data manipulation and data analysis tools at widely varying prices.

Retail Price

The packages mentioned above differ greatly in retail price, so the choice of package is largely determined by a researcher’s budget. SAS is one of the most expensive statistical programs on the market. In addition to the cost of purchasing the package, SAS requires annual updates to its corporate license. STATA, on the other hand, is quite affordable, and standard packages may be purchased in bulk for as little as $50 a copy. SPSS is also quite expensive but falls somewhere between SAS and STATA.

Capacity

Statistical packages are far better equipped than spreadsheet packages to manipulate and maintain data sets. Statistical packages, particularly SAS, can house very large sets of data. The amount that can be stored depends on the memory of the computer hosting the SAS program. A spreadsheet such as Microsoft Excel is much more limited (i.e. it holds just under 300 columns and just over 65,000 rows of data).
Data Management

The grid-like nature of spreadsheet programs makes it very easy to view data and to use functions or equations to create new variables. However, spreadsheets are limited in the calculations they can perform. In addition, they do not allow for easy merging with other aggregations and types of data files.

Statistical packages require only a few lines of code to create new variables, merge several different data sets, and sort and aggregate data. Such data manipulations are carried out quickly and efficiently, even with very large sets of information.

Statistical Tools

In recent years, spreadsheet programs have been greatly enhanced to perform a number of sophisticated operations. Most spreadsheet packages today can calculate statistical summaries of data and produce basic ordinary least squares and logit estimates. As a result, studies using small data sets and requiring only basic regression models can easily make do with Microsoft Excel or a comparable spreadsheet program.

In relative terms, however, statistical packages are much more powerful than conventional spreadsheet programs. Each of these statistical packages contains advanced modeling tools and various statistical tests that allow for sophisticated econometric estimation.

Presentation

In terms of presentation, spreadsheets are by far the best equipped to create elaborate tables, figures and graphs. Although statistical packages can plot observations, these plots are no more than basic. SAS offers a separate SAS Graph package that can be purchased at additional cost, but this supplemental program is complicated to use and requires a relatively large amount of programming. Spreadsheet packages make it easy to quickly and effectively plot or graph data for use in presentations.
In addition, these graphs can be individually saved and imported into word processing or presentation programs.

Data Manipulation

Reading the Raw Data

When either aggregate or survey data are received from an institution or data agency, they may require some manipulation and cleaning before they are ready for even the most basic forms of statistical or econometric analysis. That is, raw data must first be converted into a form that can be read by one or more statistical estimation packages (e.g. SAS, SPSS, LIMDEP, RATS, TSP, Microfit, or STATA). Most often, raw data files are imported into statistical programs as ASCII or text format data files (example extensions .txt or .csv). For additional information and guidance on reading data files into SAS, see Chapter 2 of The Little SAS Book.

Example: Viewing an ASCII data file

An ASCII file (identified by an .asc extension on the data file name) is a simple text file that contains rows and columns of numerical information. Any ASCII file can easily be opened and viewed in Word, WordPad or Notepad.

The following is an ASCII file containing 1994 individual level survey data from Poland.

Note: Although this example uses individual level data, identical steps apply to ASCII and other text files containing aggregate measures.

The filename for this data is: data94.asc

11340.3.211112112. .. ... .. .. .276432.36
3.......222222122. .. ... .. .. .258122.25
110...212121129111 .12511151 32 .1541211.2
140...212111111111 .19911202 .2 .2506612.3
110...222121112111 .12111202 .2 .1674211.2
11230.211111111111 .13011 42 .2 .250112.42
110...1.221111222. .. ... .. .. .276732.34
110...1.2211112511 .12011202 .2 .128712.12
21230.3.2222222212171152. .. .. .1631411.3
11230.1.122111222. .. ... .. .. .278112.32
110...1.221111112. .. ... .. .. .2401414.5
21230.1.111311212. .. ... .. .. .276432.36
210...3.1121122211 .11811302 .2 .171122.52
210...3.1121122211 .12011202 .2 .1392211.2
110...1.111111112. .. ... .. .. .269142.23
110...1.2213211212221202. .. .. .2686411.3
110...1.123113222. .. ... .. .. .239642.13
210...3.1123223511 .11511402 .2 .163612.42
110...1.111111222. .. ... .. .. .231642.13
1130..211111112111 .12311 62 .2 .2536511.3
210...1.2211221112321172. .. .. .1596715.6
2120..1.122111212. .. ... .. .. .271672.33
210...1.112112322. .. ... .. .. .2576211.1
2130..1.2211112112351222. .. .. .2426712.4
220...3.1221121211 .12012 .1202 .2541314.5
3.......223221112. .. ... .. .. .229112.15
220...3.111111112. .. ... .. .. .225112.15
220...3.2211111211 .19912 .1202 .134112.15
220...3.122121112. .. ... .. .. .176122.42
110...211121112111 .11712 .1202 .163122.42
3.......222222112. .. ... .. .. .278112.32
210...1.1121121112501212. .. .. .128122.12
210...1.221111212. .. ... .. .. .126112.12
3.......111122112. .. ... .. .. .2411412.3
110...1.221112112. .. ... .. .. .154442.12
11230.1.2121112112491182. .. .. .144452.14
110...211221111111 .11911152 .2 .1564112.2
210...1.222122112. .. ... .. .. .2735211.2
1120..1.111111112. .. ... .. .. .2485112.6

Before they can be used further, the above data must be read into either a spreadsheet or directly into a statistical package.

Example: Reading a Text file into Microsoft Excel

The greatest advantage of reading an ASCII file directly into Microsoft Excel is the ease of viewing and manipulating data. Because it is an easy-to-use spreadsheet program, Microsoft Excel allows even a beginning researcher with limited Excel experience to view data quickly. In addition, formulas can easily be constructed to capture descriptive statistics of the raw data. Finally, sophisticated graphics can easily be produced in Excel.

To read the above or another text file into Microsoft Excel, take the following steps:

Step 1: Open the Microsoft Excel program.

Step 2: Open the ASCII file by clicking first on “File” and then on “Open” in the command bar found at the top of the screen.
Note: To locate the text file in your computer’s directories, look to “Files of Type” and select “All Files”.

Step 3: Click and open the selected text file. In this example, file data94.asc will be opened.

Step 4: Use Steps 1-3 in the “Text Import Wizard” to properly open the ASCII file.

In Step 1: Select “Fixed Width”.

In Step 2: Click on the vertical lines in the “Data Preview” window and place them so as to break up the columns of data.

To segment the columns in this text file properly, a code book or index of column codes is needed. The following column codes were obtained from the Polish data institution that prepared this set of raw data.

Once the data have entered this software environment, the raw data columns must be assigned appropriate variable names. Often, an index or code book containing information for each column of survey data accompanies the raw survey data. The following is an example of a column index.

CODEBOOK/INDEX OF COLUMN CODES

Question        Column
p45             1
p4601-p4605     2-6
p47             7
p48             8
p49             9
p50             10
p51             11
p52             12
p53             13
p54             14
p55             15
p56             16
p57             17
p58l            18
p58m            19-20
p59l            21
p59m            22-23
p60             24
p6101l          25
p6101m          26-27
p6102l          28
p6102m          29-30
p6103l          31
p6103m          32-33
m1              34
m2              35-36
m4              37
m6              38
m7              39
m8              40
m9              41
m10             42

In Step 3: Under “Column Data Format” select “General” and then click FINISH.

Step 5: Use the index of column codes to assign titles (in this example, column titles read p45 through m10) to each column of data in the Excel file.

Note: In the above example, a researcher must manually assign and enter a title for each column. In spreadsheet programs, column headings are simply typed into the top cell of each column. In SAS, STATA or another statistical program, a program must be written and run in order to assign a variable name to each column of data.

The following table provides a quick glimpse of how a researcher could assign column headings in data94.xls.

p45   p46_01   p46_02   p46_03   p46_04
1     1        3        4        0
3     .        .        .        .
1     1        0        .        .
1     4        0        .        .
1     1        0        .        .

Step 6: Save this data file as an Excel spreadsheet (for purposes of this example, let’s call this file data94.xls), note the location of the data file and close Excel.

Steps 1 through 6 have transformed ASCII file data94.asc into a Microsoft Excel worksheet called data94.xls.

Example: Reading an Excel data file into SAS

Once a data set exists in the Microsoft Excel environment, it can easily be moved into SAS statistical software. To import our example file, data94.xls, into SAS, the following code must be entered in the “Program Editor” window of version eight of SAS.

/**************** Set SAS Options ****************/
OPTIONS mprint ps=55 ls=100 ERRORS=1;
Libname cc 'c:\Data\SASData\Indata';
Libname cbos 'c:\Data\SASData\Outdata';
Libname wb 'c:\Data\Raw';

/**************************************************
Use the Proc Import Command to read in xls file into SAS
Create Permanent SAS data file: cbos.data94.sd2
**************************************************/
PROC IMPORT OUT=cbos.data94
    DATAFILE="C:\Data\SASData\Raw\data94.xls"
    DBMS=EXCEL2000 REPLACE;
    GETNAMES=YES;
RUN;

The above import procedure imports Excel file data94.xls into SAS and creates a permanent SAS data set called: data94.sd2

Example: Reading a Text file directly into SAS

The code presented in the previous two examples assumes that a researcher would first like to bring the text file into a spreadsheet environment before moving the data into a more sophisticated statistical package. However, researchers who feel at ease with their data and are comfortable with statistical software would probably choose to move their data directly into SAS in order to conduct simple analyses, plots and summary statistics.
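For a fixed-width file such as data94.asc, one option is to read each field by its column position. The sketch below is illustrative and is not the procedure used for this toolkit: it assumes the column positions listed in the codebook shown earlier, borrows the variable names used in the SAS listings in this section, and reads two rows copied from the sample file through DATALINES so that it runs without an external file. The data set name fixed94 is hypothetical.

```sas
/* Sketch: column input driven by the codebook positions.         */
/* A single period or a blank field is read as a missing value.   */
data fixed94;   /* hypothetical data set name */
  input p45 1      p46_1 2      p46_2 3     p46_3 4     p46_4 5
        p46_5 6    p47 7        p48 8       p49 9       p50 10
        p51 11     p52 12       p53 13      p54 14      p55 15
        p56 16     p57 17       p58l 18     p58m 19-20  p59l 21
        p59m 22-23 p60 24       p6101l 25   p6101m 26-27
        p6102l 28  p6102m 29-30 p6103l 31   p6103m 32-33
        m1 34      m2 35-36     m4 37       m6 38       m7 39
        m8 40      m9 41        m10 42;
  datalines;
11340.3.211112112. .. ... .. .. .276432.36
110...212121129111 .12511151 32 .1541211.2
;
run;

/* Print a few fields to compare against the raw rows */
proc print data=fixed94;
  var p45 p46_1 p46_2 p46_3 p46_4 p59m m2;
run;
```

To read the full file instead of the two sample rows, the DATALINES block would be replaced by an INFILE statement, placed before the INPUT statement, that points at the raw file.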
The following code imports a .csv (or other type of text file) directly into SAS and also creates a permanent SAS data set called: cbos.data94.sd2

/*********************************************************
Read in Raw Data and Create Temporary SAS file x
Note: use delimiter=',' as this raw csv file is comma delimited
      use dsd so that two consecutive commas are read as a missing value
      use firstobs=2 as data begins in row #2
*********************************************************/
Data x;
    infile 'c:\Data\Raw\data94.csv' delimiter=',' dsd firstobs=2 missover;
    input p45 p46_1 p46_2 p46_3 p46_4 p46_5 p47 p48 p49 p50
          p51 p52 p53 p54 p55 p56 p57 p58l p58m p59l p59m p60
          p6101l p6101m p6102l p6102m p6103l p6103m
          m1 m2 m4 m6 m7 m8 m9 m10;
Run;

/********************************************
Create Permanent SAS file cbos.data94.sd2
********************************************/
Data cbos.data94;
    Set x;
Run;

Example: Quality Checks on the Raw Data

Before a researcher begins analyzing and reporting the findings of a study, he or she should quality check the data with a series of basic tests. This helps ensure that the observations and variables are in order.

Because some data sets, particularly survey data sets, are quite large, it is almost impossible for a researcher to find outliers or other odd quirks simply by looking at the data. Instead, a researcher can use two procedures to quickly scan and troubleshoot the data:
1. Generating descriptive statistics for each of the raw variables
2. Checking the frequencies of the values of each variable

PART 1: Calculating Summary Statistics for Each Raw Variable

The SAS statistical package offers several procedures that can be used to generate summary or descriptive statistics for each raw variable in a data set. The two most commonly used are the means procedure and the univariate procedure.
Although both procedures generate largely similar output, the commands used to produce summary statistics in a proc means are the more straightforward.

The following code contains a Proc Means command that generates a set of descriptive or summary statistics for our raw SAS data set: data94.sd2

This code is entered and run from the Program Editor window of SAS.

/***********************************************************
Produce Summary Statistics including: Average, Minimum,
Maximum, Sum and Standard Deviation values.
**********************************************************/
PROC MEANS data=cbos.data94 N Nmiss mean min max sum std;
RUN;

This proc means requests that SAS generate seven individual pieces of information about this data set. The requested statistics are:
N (number of non-missing observations);
Nmiss (number of missing observations);
Mean (a variable’s mean);
Min (a variable’s minimum value);
Max (a variable’s maximum value);
Sum (the sum of the values of a single variable);
Std (the standard deviation of the values of a single variable).

The output generated by the above means procedure is displayed in a SAS Output window. The means output is presented as follows:

[Figure: SAS Output window showing the PROC MEANS results]

Column 1: “Variable”. Lists all the variables contained in the SAS version of the raw data (data name: data94.sd2).

Column 2: “N”. Indicates the number of non-missing observations reported for each variable.

Column 3: “N MISS”. The complement of column 2: the number of missing observations for each variable.

Note: Columns 2 and 3 should sum to the total number of observations in the data set.

Column 4: “MINIMUM”. The lowest value reported for each variable in the data set.

Column 5: “MAXIMUM”. Similarly to column 4, the highest value reported for each variable in the data set.
Column 6: “STD DEV”. The standard deviation of the values reported for each variable in the data set.

Column 7: “SUM”. The sum of all values reported for each variable in the data set.

PART 2: Generating Frequencies

Frequency checks on the variables in a data set allow a researcher to check, easily and exactly, the range and frequency of values for any given variable. As discussed in the following section on data recodes, the output from a frequency check is also needed to correctly reconstruct raw survey observations into informative econometric and statistical variables.

/***********************************************************
Produce Frequencies of each variable. Save Output as a codebook.
**********************************************************/
PROC FREQ data=cbos.data94;
    TABLES p45 p46_1 p46_2 p46_3 p46_4 p46_5 p47 p48 p49 p50
           p51 p52 p53 p54 p55 p56 p57 p58l p58m p59l p59m p60
           p6101l p6101m p6102l p6102m p6103l p6103m
           m1 m2 m4 m6 m7 m8 m9 m10 / list;
RUN;

This frequency procedure asks SAS to generate information on the frequency of values for each and every variable in data set data94.sd2

As with Proc Means, the output generated by the Proc Freq command is displayed in a SAS Output window. The SAS output containing frequency information is presented as follows:

[Figure: SAS Output window showing the PROC FREQ results for variables p45, p46_1 and p46_2 from data set data94.sd2]

The results of the proc freq for raw variable p45 show that this variable takes on values ranging from 1 to 3. Among the 1041 respondents captured by variable p45, 50.91% (or 530 persons) reported a value of 1, another 40.35% (or 420 respondents) answered 2 and, finally, 8.74% (or 91 respondents) reported a value of 3. This range of values captures responses for the full sample of 1041 persons surveyed. As a result, there are no missing observations for variable p45.
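The percentages in a frequency table can be checked by hand, which is a quick way to confirm how the output should be read. A short worked check using only the counts quoted above for p45:

```latex
\[
\frac{530}{1041} \approx 50.91\%, \qquad
\frac{420}{1041} \approx 40.35\%, \qquad
\frac{91}{1041} \approx 8.74\%
\]
\[
530 + 420 + 91 = 1041
\]
```

Because the three counts add up to the full sample, the denominator here is the total number of respondents.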
The results for the frequency on variable p46_1 reveal that the values of this variable range from 1 to 5. Of the 1041 possible respondents, only 950 provided a value for this variable, and the reported percentages are calculated over those 950: 87.79% (or 834 respondents) reported a value of 1, another 8.00% (or 76 persons) reported a value of 2, and so on. The remaining 91 respondents did not provide a value (answer) for variable p46_1. As a result, the “Frequency Missing” field reports a value of 91.

Once raw items of information have been imported into a statistical program and all visually apparent peculiarities have been identified and fixed, additional steps can be taken to prepare a strong set of variables for statistical analysis. These steps include:
• constructing new variables
• merging in other sources of data
• cleaning the final data

Creating New Variables

Many raw variables, particularly those from surveys, may require recoding before they can be used for statistical modeling.

Some survey questions have simple yes or no answers. In this case, a variable is coded 1 for yes and 0 for no. For example, a variable SMOKE is constructed for the question “Do you currently smoke?”, which has two possible answers: 1) Yes or 2) No. The variable SMOKE equals 1 if the respondent is a smoker and 0 if the respondent is not a smoker. Measures of this type are dichotomous variables.

Other survey questions require continuous or categorical coding. First, survey responses that are categorical in nature take on a range of values, depending on the possible responses to the survey question. For example, possible answers to the question “What is your level of education?” include: no education, completed elementary level schooling, completed high school or obtained a college degree. Here, the variable EDUCATION takes on some whole number value between 1 and 4 for each respondent.
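The yes/no recode described above can be sketched in a short DATA step. This is a minimal illustration and not code from the survey itself: the raw variable names q_smoke and q_educ and the four sample records are hypothetical, and the raw answer codes (1 = yes, 2 = no) are assumed from the question wording.

```sas
/* Hypothetical raw codes: q_smoke 1 = yes, 2 = no; q_educ 1 to 4 */
data raw;
  input id q_smoke q_educ;
  datalines;
1 1 3
2 2 4
3 . 2
4 1 1
;
run;

data recoded;
  set raw;
  /* Dichotomous variable: 1 = smoker, 0 = non-smoker;            */
  /* a missing or unexpected answer stays missing                 */
  if q_smoke = 1 then smoke = 1;
  else if q_smoke = 2 then smoke = 0;
  else smoke = .;
  /* Categorical variable kept as whole-number codes 1 to 4 */
  education = q_educ;
run;

/* Cross-check the recode against the raw codes, including missing */
proc freq data=recoded;
  tables q_smoke*smoke / list missing;
run;
```

Leaving unanswered questions as missing, instead of letting them fall into the 0 group, keeps non-respondents from being counted as non-smokers.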
The variable EDUCATION is coded as 1 for respondents with no education, 2 for respondents with elementary schooling, 3 for secondary or high school, and 4 for completion of a college degree. Second, some questions (e.g. age or income) are coded as continuous measures. Here, the variable AGE is set equal to the actual age of the respondent.

Each of the variables discussed above can be recoded into additional variables, depending on the purpose of the research and the interest of the researcher. For example, the variable AGE can be transformed from a continuous variable with values of 18, 19, 20, … into a categorical age variable that equals 1 for all respondents under the age of 18, 2 for respondents aged 18 to 20, 3 for respondents aged 21 to 30, and so on.

Groups of variables can also be constructed to reflect individuals from different income strata. Here, different income group variables can be created (e.g. poor, middle income, high income), where the variable POOR equals 1 for all respondents earning less than $2000 per month, the variable MIDDLE equals 1 for all respondents earning $2000 or more but less than $4000 per month, and the variable HIGH equals 1 for all respondents earning $4000 per month or more.

Merging Data Sets

Often, researchers want to include specific sets of information in their analyses. This may require additional data sources that contain information on specific variables. When using more than one data set, a researcher should merge the data sets together.

Merging data sets is a common and often necessary practice. For example, when a household survey and a survey of individuals both contain tobacco-related information, merging the two data sets will enrich and simplify the analysis.
Household size can be gathered from household survey data, while employment status, education level and marital status for an individual member of a household are obtained from individual level survey data.

Two or more data sets can be merged if they have one or more fields or variables in common. For example, every household in a survey is usually assigned two identification numbers: a household identifier and an identifier for the household’s location (city, region, state etc.). When individuals in households are surveyed, they are assigned an individual identification number as well as the same household identification number that is used in the household survey. As a result, household level survey data can be merged with individual level survey data to create a larger and more comprehensive data set.

In other situations, household survey data and individual level survey data may lack information on particular variables important to the analysis. In this case, researchers try to find another data set to merge with the household and individual level data. For example, if income information is not contained in either the household or the individual level data, a researcher may turn to state, city or region data for an average income measure. The national or central statistical offices of most countries usually collect such information. The researcher should use his or her judgement when deciding which income measures are most suitable for a household level or individual level analysis. State, province or city identification numbers may make it possible to merge household data with government income data.

After data sets are merged and results are obtained, it is important to remember the level at which each piece of information was measured when interpreting results.
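The two merges described above, individuals with their households and then households with a location-level income measure, can be sketched as follows. All data set names, variable names and values here (hh, ind, city, hhid, cityid, avg_city_income and so on) are hypothetical and chosen only for illustration.

```sas
/* Hypothetical household-level data: one row per household */
data hh;
  input hhid cityid hhsize;
  datalines;
101 1 4
102 1 2
103 2 5
;
run;

/* Hypothetical individual-level data: one row per respondent */
data ind;
  input hhid personid employed educ;
  datalines;
101 1 1 3
101 2 0 2
102 1 1 4
103 1 1 1
;
run;

/* Hypothetical city-level averages from a statistical office */
data city;
  input cityid avg_city_income;
  datalines;
1 2500
2 1800
;
run;

/* Inputs must be sorted by the common key before MERGE ... BY */
proc sort data=hh;  by hhid; run;
proc sort data=ind; by hhid; run;

/* Attach household fields to every individual in the household */
data ind_hh;
  merge ind(in=in_ind) hh(in=in_hh);
  by hhid;
  if in_ind and in_hh;   /* keep records found in both files */
run;

/* Attach the city-level income measure via the location identifier */
proc sort data=ind_hh; by cityid; run;
proc sort data=city;   by cityid; run;

data analysis;
  merge ind_hh(in=in_a) city;
  by cityid;
  if in_a;               /* keep every individual record */
run;
```

In this sketch avg_city_income is a city average attached to each individual record, so any estimate based on it describes a response to city-level income and not to the household's own income.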
For example, in cigarette demand analyses, if the demand for cigarettes is estimated at the household level but the income variable is an average city income measure, the result should be interpreted as the change in household consumption (in packs, pieces, etc.) in response to a change in average city income.

Cleaning the Data

Once data sets have been merged and the necessary variables have been created, the next step is to filter or clean the data of inconsistencies in coding. The most important filtering task is to deal with missing or miscoded information.

First, once a data set has been constructed, researchers examine the descriptive statistics (the mean, standard deviation, minimum, maximum and number of observations) for each variable of interest contained in the data set. Careful examination of the descriptive statistics is a quick first step to finding missing values, outliers and miscoded information. If a variable is missing information for one or more observations, then its number of observations will be smaller than that of the other variables in the data set. By checking minimum and maximum values, researchers can identify outliers or miscoded variables. For example, gender variables are generally coded as 0 or 1, where males take a value of 1 and females take a value of 0 (or vice versa). If the maximum value of the gender variable is 2, it becomes clear that one or more observations have been miscoded or incorrectly imported into the working data file. Similarly, when checking income values, a value of 0 or an extremely high value flags a potential problem with the data. If the odd values cannot be corrected, observations with missing or miscoded information are usually deleted or dropped from the estimation.
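The checks described above can be sketched with Python's standard statistics module. The records, field names and flagging rules below are hypothetical and chosen only to mirror the examples in the text (a gender code of 2, a missing income and a zero income).

```python
import statistics

# Hypothetical survey extract; None marks a missing answer.
records = [
    {"gender": 1, "income": 1500.0},
    {"gender": 0, "income": 2200.0},
    {"gender": 2, "income": 1800.0},   # miscoded: gender must be 0 or 1
    {"gender": 1, "income": None},     # missing income
    {"gender": 0, "income": 0.0},      # zero income: flag for review
    {"gender": 1, "income": 2500.0},
]


def describe(records, name):
    """Descriptive statistics over the non-missing values of one variable."""
    values = [r[name] for r in records if r[name] is not None]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


summary = {name: describe(records, name) for name in ("gender", "income")}

# A count below the number of records reveals missing values.
has_missing = [name for name, s in summary.items() if s["n"] < len(records)]

# Rows to inspect: invalid gender codes, or missing or zero income.
# An upper income threshold could be added in the same way.
flagged = [
    i for i, r in enumerate(records)
    if r["gender"] not in (0, 1) or r["income"] is None or r["income"] <= 0
]

# Drop flagged rows only when they cannot be corrected from the source.
clean = [r for i, r in enumerate(records) if i not in flagged]
```

Here the maximum of 2 for gender and the count of 5 for income point to the problem rows before any model is estimated.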
Second, it is important to look at frequency tables and the distribution of the variables as a means of checking the data and deciding on an appropriate model form or data transformation. For example, the distribution of cigarette consumption is usually skewed. Given this, researchers can consider using the log of consumption in the demand model and may choose an appropriate estimation method (linear, log-linear, two-part model, etc.) for the analysis. If, for example, the data show that 80% of individuals are male and only 20% are female, then there appears to be selection bias in the survey data, and it may be a good idea to weight the data or stratify the sample by gender.

Imputing Missing Values

Most statistical packages automatically drop observations with missing values from the estimation. When the remaining number of observations is not large enough, researchers prefer not to drop observations with missing values and instead try to impute values. For example, if the income variable is missing, researchers will impute income based on the income levels of other households with the same characteristics. An alternative approach is to run a regression of income on other characteristics (age, education, occupation, etc.) and use the regression results to predict or estimate income for observations with missing income values. Before running the regression, all missing variables should be assigned a value (0 or 1) so that observations are not dropped from the regression. The following presents a regression equation used to impute income, where the level of income is a function of age, gender, neighborhood of residence, size, type of location of house, education attained, marital status, occupation and assets (e.g. a car, bicycle, television, personal computer, etc.):

Income = f(age, gender, house, education, marital status, occupation, assets)

In the above model, the regression technique estimates income for each observation.
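A minimal Python sketch of regression-based imputation follows. It makes several simplifying assumptions that are not part of the text: only two explanatory variables (age and education) are used, the regression is fitted by ordinary least squares on the observations that report income, and the hypothetical data are constructed so that income is an exact linear function of the two variables. In practice a statistical package would do the estimation.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return OLS coefficients (constant first) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Constant term plus each characteristic times its coefficient."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Hypothetical records: (age, education code 1-4, monthly income or None).
survey = [
    (25, 2, 750.0), (40, 3, 1100.0), (35, 4, 1250.0),
    (50, 1, 800.0), (30, 3, 1000.0), (45, 2, 950.0),
    (28, 4, None), (60, 1, None),
]

# Fit on the observations that report income.
complete = [(age, edu, inc) for age, edu, inc in survey if inc is not None]
coefs = fit_ols([(age, edu) for age, edu, _ in complete],
                [inc for _, _, inc in complete])

# Keep reported incomes; fill missing ones with the regression prediction.
imputed = [
    (age, edu, inc if inc is not None else predict(coefs, (age, edu)))
    for age, edu, inc in survey
]
```

Imputed values should be marked as such in the working data file so that later analyses can be run with and without them.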
Researchers can use the regression coefficients to estimate income for observations with missing income values, where the income of individual i is estimated as the constant term + age * coefficient for age + sex * coefficient for sex + each other variable in the equation * its coefficient.

Adjusting for Reporting Bias in Cigarette Consumption

As mentioned earlier in this tool, individuals often underreport their actual levels of consumption of tobacco and alcohol. Researchers should adjust the survey data to agree with aggregate sales data (and carefully explain how and why this is done). In making this adjustment, the researcher needs to know what fraction of the total population is covered by the survey and how representative the sample is of the total population. At the very least, researchers should estimate the degree of underestimation and acknowledge the reporting bias in their results.

Variable Selection

The first step in any analysis is to identify a data set which can be analyzed. In some countries where data sources are plentiful, a researcher may have a choice between various sets of aggregate and individual level data. The following outlines the strengths and weaknesses associated with the use and analysis of various types of data.

V. Suggestions for Data Sources

Where Can I Find Economic and Tobacco Related Data?
The World Health Organization

International Organizations

Organization: World Bank
Web site: http://www1.worldbank.org/tobacco/
Keyword: Tobacco Information

Organization: International Monetary Fund
Web site: http://www.imf.org and http://www.imf.org/external/pubs/ft/fandd/1999/12/jha.htm
Keyword: Economics of Tobacco Control

Organization: WHO - Tobacco Free Initiative
Web site: http://tobacco.who.int/ or http://tobacco.who.int/en/research/index.html
Keyword: activities for tobacco prevention and awareness

Non-Governmental Organizations

Organization: The International Tobacco Control Network
Web site: http://www.globalink.org/

Organization: Research on Nicotine and Tobacco
Web site: http://www.srnt.org/

Organization: Research for International Tobacco Control
Web site: http://www.idrc.ca/tobacco/en/index.html
Keyword: tobacco production and consumption to human health

Organization: PATH Canada
Web site: http://www.pathcanada.org/english/tobacco.html (Vietnam)
Keyword: POLITICAL MAPPING FOR TOBACCO CONTROL

Organization: TobaccoPedia
Web site: http://www.tobaccopedia.org/

Organization:
Web site: http://tobacco.org/
Keyword:

Private Data Agencies

Organization: MarketFile
Web site: http://www.marketfile.com/market/tobacco/

Organization: The Tobacco Manufacturers' Association (TMA)
Web site: http://www.the-tma.org.uk/miscellaneous/main.htm
Keyword: tobacco industry

Organization: The Retail Tobacco Dealers of America, Inc.
Web site: http://www.rtda.org/
Keyword: sale and promotion of legal tobacco products

Organization: AC Nielsen Inc.
Web site: http://www.acnielsen.com

US Agencies with International Interests

Organization: Centers for Disease Control and Prevention
Web site: http://www.cdc.gov/tobacco/
Keyword: tobacco information and links

Organization: U.S. Department of Health and Human Services
Web site: http://www.dhhs.gov/ or http://www.dhhs.gov/topics/smoking.html
Keyword:

Governmental Publications

International Monetary Fund (IMF). 1999. Government Financial Statistics. Washington D.C. Data on total revenues, revenues from excise taxes and all taxes.

Industry Reports

Sources of Aggregate Level Data

International Sources

Food and Agriculture Organization (FAO) Statistical Database (http://apps.fao.org/cgi-bin/nph-db.pl?subset=agriculture). Data on tobacco production and harvest area, producer prices for tobacco leaves, and trade (export-import) in tobacco leaves and cigarettes by volume and value.

International Monetary Fund (IMF). 1999. Government Financial Statistics. Washington D.C. Data on total revenues, revenues from excise taxes and all taxes.

MarketFile (http://www.marketfile.com). A commercial online tobacco database. Data on cigarette consumption, production, price and tobacco control measures. A subscription is required to obtain access to these data.

United Nations Industrial Development Organization (UNIDO) (http://www.unido.org). Tobacco manufacturing employment.

World Bank. 1998. World Bank Economic Survey of Tobacco Use: http://www.worldbank.org/tobacco. Data on average retail price for the most popular domestic and foreign cigarettes, cigarette excise tax, and tobacco tax revenue. (need to check if already available?)

World Bank. 1998. World Development Indicators. Washington D.C. General socioeconomic, population, and health indicators for 148 countries and 14 country groups. Select pieces of the database are available at http://www.worldbank.org/data/wdi/home.html.

World Health Organization. 1996. Investing in Health Research and Development: Report of the Ad Hoc Committee on Health Research Relating to Future Intervention Options. Geneva, Switzerland. Estimates of tobacco-attributable burden of disease by region, 1990-2020.

World Health Organization. 1997. Tobacco or Health: a Global Status Report. Geneva, Switzerland. Country-level data on smoking prevalence, cigarette consumption, tobacco production, trade, industry, health impact and tobacco control legislation. Available online at http://www.cdc.gov/tobacco/who/whofirst.htm.

World Health Organization. 1999. The World Health Report 1999: Making a Difference. Geneva, Switzerland. Estimates of tobacco-attributable burden of disease by region, 1998.

Sources in the United States

United States Department of Agriculture (USDA), Economic Research Service (ERS) (http://www.econ.ag.gov/briefing/tobacco/). Data on cigarette sales and on cigarette and tobacco leaf production.

U.S. Centers for Disease Control and Prevention (CDC), Office on Smoking and Health (OSH) (http://www.cdc.gov/tobacco/index.htm). Current and historical state-level data on the prevalence of tobacco use, the health impact and costs associated with tobacco use, tobacco agriculture and manufacturing, and tobacco control laws in the United States.

US Department of Commerce, Bureau of the Census provides US population data.

Tobacco Situation and Outlook Reports of the US Department of Agriculture, Economic Research Service provide data on total US expenditure on cigarettes, per capita cigarette consumption, US cigarette production, exports, wholesale prices, and the market share of filter cigarettes.

The Tobacco Institute, various publications and years, provides data on total US cigarette prices and sales.
US Federal Trade Commission’s reports to Congress pursuant to the Federal Cigarette Labeling and Advertising Act provide data on the US market share of low-tar cigarettes, US tobacco advertising expenditure and nicotine delivery per cigarette (Barnett, Keeler and Hu, 1995; Harris, 1994).

Publications and reports by J.C. Maxwell in Business Week, Advertising Age, and Tobacco Reporter provide information on US cigarette brand sales and the Herfindahl index of industry concentration for the tobacco sector.

US Department of Commerce, Bureau of the Census annual surveys of industry groups and industries, and census reports on wholesale trade and retail trade, publish employment and wage data for US tobacco manufacturers, wholesalers and retailers.

US Department of Labor, Bureau of Labor Statistics publications provide information on US tobacco industry capital stock estimates.

US Department of Commerce, Bureau of the Census annual surveys of industry groups and industries provide US data on inventories and capital expenditures.

US Department of Health and Human Services provides information on US statutes restricting smoking in general categories of public places, but state laws are not available in published compilations (Ohsfeldt, Boyle and Capilouto, 1999).

Wasserman et al. (1991) and Ohsfeldt, Boyle and Capilouto (1999) provide updated indices of the rigour of smoking regulations in US states.

US Surgeon-General’s Report (1989) contains a four-step scale of restrictiveness of smoking laws (Barnett, Keeler and Hu, 1995).

Information on US laws which restrict smoking in public areas is available from state legislative records.

Alternative measures of anti-smoking regulation (based on major local smoking ordinances published by Americans for Non-smokers’ Rights and weighted by local population) are given in Keeler et al., 1993 and Sung, Hu and Keeler, 1994.
Gruber (2000) categorizes clean indoor air laws according to a Youth Access Index based on an index developed by the National Cancer Institute to evaluate state laws limiting youth access to cigarettes.

Sources in the United Kingdom

National Income and Expenditure Yearbook, produced by the Central Statistical Office (CSO), provides economic data for the United Kingdom including expenditure on tobacco and other consumer nondurables.

Monthly Digest of Statistics, also published by the UK’s CSO, provides population data.

The Advertising Statistics Yearbook of the UK Advertising Association and the Quarterly Digest of Advertising Expenditure provide data on UK tobacco and other advertising expenditure (this information has been partially abstracted by Duffy (1995)).

Sources in Australia

National Accounts, published by the Australian Bureau of Statistics, provide data on nominal expenditures on cigarette and tobacco products, an implicit price deflator for tobacco products, and measures of nominal household disposable income.

The Bureau of Statistics’ All Groups: Capital Cities Consumer Price Index provides Australia’s Consumer Price Index.

Commonwealth of Australia Year Book, produced by the Australian Government Publishing Service, provides Australian population data.

Hu, Sung and Keeler (1995) quantified total pages of cigarette advertising in Life magazine distributed in California as a representative sample of industry media presence in the state. They point out that comprehensive and systematic data on tobacco industry advertising and promotion activities are very difficult to obtain, especially at the sub-national level. They therefore suggest that the frequency of advertisements in newspapers and magazines may be the best proxy for the industry’s countervailing behavior in response to tobacco control policies.

Commercial Economic Advisory Service of Australia publishes data on Australian advertising expenditure by cigarette companies (Bardsley and Olekalns, 1999).
Examples of Subnational Level Data Sources

Tobacco Institute provides state level cigarette sales, prices and tax rates for the United States (see Wasserman et al., 1991; Sung, Hu and Keeler, 1994; Barnett, Keeler and Hu, 1995; Hu, Sung and Keeler, 1995; Chaloupka and Grossman, 1996; Harris, Connelly and Davis, 1996; Chaloupka and Pacula, 1998a, 1998b; Gruber, 2000).

The Tobacco Tax Council also provides state level cigarette sales, prices and tax rates for the United States (Becker, Grossman and Murphy, 1994).

The California Board of Equalisation obtains monthly cigarette sales data for California from tax-paid wholesale sales.

Population Research Unit of the California Department of Finance provides Californian metropolitan area population data.

The Bureau for Economic Analysis of the US Department of Commerce reports per capita income data for California as calculated from estimates of total personal income (see Keeler et al., 1993; Hu, Sung and Keeler, 1995).

Examples of Household Survey Data Sources

US Current Population Survey files contain economic and demographic data on a large sample of respondents, as both individuals and households. This source includes data on cigarette prevalence and intensity, and the prevalence of smokeless tobacco use. The large size of this data source allows for age cohort analyses. The files contain state and metropolitan area identifiers, which permit assessment of applicable tobacco tax rates and restrictions. Disadvantages are that proxy responses are often given for tobacco use, particularly in the case of teenage householders, which tends to underestimate tobacco use even more substantially than the systematic under-reporting generally associated with surveys (Ohsfeldt, Boyle and Capilouto, 1999). A similar reservation about proxy responses is raised by Gruber (2000) concerning the National Household Survey on Drug Abuse, as well as the fact that the data do not contain state identifiers.
Townsend (1987) obtained survey data on UK cigarette consumption by social class from the Tobacco Research Council, which inflated the data to agree with cigarette sales figures to correct for the problem of survey under-reporting of tobacco use (as discussed in Mitchell Hoyt and Chaloupka, 1994). Data on incomes of selected occupational groups were obtained from the Family Expenditure Survey.

Townsend, Roderick and Cooper (1994) obtained data on UK smoking prevalence by sex, age and socio-economic group from the UK General Household Survey.

Sources of Individual Level Data

Examples of Survey Data Sources

Data on the percentage of US adults currently smoking cigarettes and on US cigarette consumption are available from the National Health Interview Survey. Results are provided by the Office on Smoking and Health, US Centers for Disease Control.

Data on youth smoking prevalence are obtainable from the Monitoring the Future surveys conducted by the Institute for Social Research of the University of Michigan. These provide data on a variety of independent variables with which to control for age, income, gender, ethnicity, marital status, parental educational level, family structure, mother’s work status during respondent’s childhood, presence of siblings, average weekly working hours, rural vs. urban location, and religious observation.

Data on smoking propensity by different demographic groups in the state of New South Wales, Australia, are available from a survey conducted by the Australian Bureau of Statistics.

VI. References