WPS4216

Underlying Dimensions of Knowledge Assessment: Factor Analysis of the Knowledge Assessment Methodology Data

Derek H. C. Chen* The World Bank
Kishore Gawande** Texas A&M University

The Knowledge Assessment Methodology (KAM) database measures variables that may be used to provide an assessment of countries' readiness for the knowledge economy, and has many policy uses. Formal analysis employing KAM data confronts the problem of which variables to choose and why. Rather than make these decisions in an ad hoc manner, we recommend factor-analytic methods to distill the information contained in the many KAM variables into a smaller set of "factors". The main objective of the paper is to quantify the factors for each country, and to do so in a way that allows comparisons of the factor scores over time. We investigate both principal components and true factor-analytic methods, and emphasize simple structures, which not only give the factors a clear political-economic meaning but also allow comparisons over time.

World Bank Policy Research Working Paper 4216, April 2007

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the view of the World Bank, its Executive Directors, or the countries they represent. Policy Research Working Papers are available online at http://econ.worldbank.org.

*Economist, Knowledge for Development Program, Human Development Department, World Bank Institute. **Professor, Bush School of Government and Public Service.

1. Introduction

To help countries make the transition to the knowledge economy, the Knowledge Assessment Methodology (KAM) was developed at the World Bank (Chen and Dahlman, 2004, 2005). It is designed to provide an assessment of countries' readiness for the knowledge economy, and identifies sectors or areas in which policymakers should focus their attention and make future investments. The KAM is widely used both within and outside the World Bank, and frequently facilitates engagements and policy discussions with government officials from client countries. This rich database is also potentially useful for research by political economists and political scientists. The KAM database includes variables such as tariff and non-tariff barriers, regulatory quality, rule of law, adult literacy rate, secondary enrollment, tertiary enrollment, researchers in R&D, patent applications granted by the USPTO, scientific and technical journal articles, telephones, computers, and internet users (fn 1). They are constructed for over 120 countries, and are available at different points in time. Any formal analysis employing KAM data must confront the problem of which variables to choose and why. Rather than make these decisions in an ad hoc manner, we recommend "reducing" the set of KAM variables to a smaller set of variables without losing the information contained in the full set of variables.
Factor-analytic methods are concerned with precisely this problem - reducing the data in a way that parsimoniously represents essentially the same information contained in the many variables. The parsimonious set of variables is the set of "factors" to which the data in the large number of variables are reduced.

Footnote 1: Source: The Knowledge Assessment Methodology (KAM) website (www.worldbank.org/kam).

Our main objective in undertaking the factor analysis is to quantify the factors for each country, that is, to compute "factor scores" on each factor. Importantly, we wish to accomplish this in a way that allows comparisons of the factor scores over time. To this end, the paper takes up several issues in the factor analysis of the KAM data in detail. The first is whether the KAM data should be factor-analyzed and which factor-analytic method may be most appropriate; the second is determining the optimal dimensionality of the data, that is, the number of factors to which the data may be adequately reduced; the third, and perhaps most important, is giving clear meaning to the factors. Each of these issues is treated exhaustively in the paper.

If subsets of variables are correlated, then, depending on the extent of the correlation, factor analysis is worth doing. A formal test shows that the KAM data are not just amenable to factor analysis but that they greatly benefit from it. There are enough intercorrelations among the variables that the real information in the data can be distilled down to a smaller number of dimensions. What is the optimal dimensionality to which the information contained in the variables can be reduced? The answer differs depending on the factor-analytic method that is chosen. For example, in principal components analysis it is determined by the number of principal components required to explain, say, 95% of the total variance in the data. In "true" factor analysis (which we estimate using maximum likelihood), a formal chi-squared test or information criteria that measure fit in terms of explained intercorrelations, not just variance, are used to determine the optimal dimensionality.

The most important contribution of the paper is that it gives political-economic meaning to the dimensions, whether they go by the name of "factors" or "principal components". Ultimately, we hope to make factor analysis a useful policy tool to indicate warning signals about the health of countries. The tools we will use to give political-economic meaning to the factors are the "factor loadings". Intuitively, these are the coefficients of the regression of each variable on the factors. Thus, if one variable has a very high coefficient on one factor but not on any of the others, we say that the variable loads heavily on that factor. If the data, or the information in the variables, can be reduced to a smaller set of factors, then what we should find is that some variables load heavily on the same factor and other variables load heavily on other factors. That is, the structure of factor loadings should be "simple". One definition of a simple structure of loadings is as follows: a structure in which any single variable loads on only one factor and minimally on the others, and in which more than one variable loads on each factor. We will spend considerable effort on producing simple structures, because simple structures make the political-economic content of the factors unambiguous and clear. We will also test for the adequacy of our simple structures.
Obviously, the preceding discussion of factor loadings as regression coefficients is meant only to fix ideas because, unlike in regression analysis, the factors themselves are unknown. In other words, the factor scores, or the values that the factors take, are not known, and no regression in the usual sense can be estimated. Section 2 provides the theory behind how factor scores and factor loadings are computed simultaneously within a factor-analytic framework.

The paper proceeds as follows. In Section 2, we outline the generic factor model. In this section, two fundamentally different methods of factor analysis, principal components analysis and true factor analysis, are explained in detail. A special case of true factor analysis, the error components method, is also discussed here. Section 3 discusses the data and sources. The analysis is carried out on 12 variables measured across 120 countries. The data are from two time periods, 1995 and a more recent vintage around 2003. The same section discusses how we impute missing data in order to cover the sample of 120 countries, not all of which have complete data on all 12 variables. We also point to a data pitfall that should be avoided before doing the factor analysis. Section 4 contains the empirical results and the main contribution of the paper. We analyze principal components separately from the true factor analysis results. There are three main components to this section. The first is the use of factor loadings in order to name the factors. We show that with the KAM data we are able to achieve a fairly simple structure. The second is a set of formal tests for the dimensionality. The third is another methodological pitfall, whose resolution confronts us with a trade-off. We indicate how and why we choose to resolve it in the manner we do. The choice follows from our overriding objective of computing factor scores as precisely as possible. Section 5 discusses the output from this factor analysis. We use graphs to show how countries have changed their rankings on the underlying dimensions over this ten-year period. Section 6 concludes.

2. Factor Analysis Models

The notation and material in this section are borrowed from Reyment and Joreskog (1993, Sections 2 and 4). The general factor analysis model is

X_{N×p} = F_{N×k} A_{k×p} + E_{N×p},   (1)

where X is the data matrix of p variables, F is the matrix of k < p factors, and N is the sample size. The k × p "factor loadings" matrix A is used to linearly sum the factors to predict each column of X. What cannot be predicted is collected in the error matrix E. In the context of the KAM data, each column of X is a measure (i.e. variable) containing "scores" for a set of N countries. There are p such measures on which country scores have been compiled (fn 2). The individual components of F are the "scores" for common factors, since they are common to several different measures. The KAM measures are thus predicted as linear combinations of the factors. The coefficients of the factors, called the factor loadings, are the elements of A. For example, consider the ith measure (variable) x_i. It can be written as a regression model

x_i = a_{i1} f_1 + a_{i2} f_2 + ... + a_{ik} f_k + e_i,   (2)

where f_1, ..., f_k are the "exogenous" factors, and the coefficients a_{i1}, ..., a_{ik} are the "loadings" contained in the ith column of A.
While e_i is given the interpretation of a regression residual, in fact it is made up of the measurement error in the measure x_i plus a "specific" factor that x_i does not share in common with other measures. Thus, each of the p variables x_i, i = 1, ..., p, can be written as a regression model with the factors acting as the common "exogenous" variables weighted by the coefficients a_{i1}, ..., a_{ik}, and where e_i is the regression residual. Writing the model in this form makes it clear that factor analysis is a method of data reduction. The method seeks to parsimoniously represent in a small set of variables (f_1, ..., f_k) essentially the same information contained in a much larger set of variables (x_1, ..., x_p). We will reduce the KAM data variables to their essential factors using two different factor-analytic methods.

Footnote 2: p need not be fixed. Factor analysis of the KAM variables may be performed separately on subsets of the KAM variables. For example, each of the four pillars of the KAM data - (i) economic and institutional regime data, (ii) education and skills data, (iii) infrastructure data, and (iv) innovation potential data - may be distilled down to one or two factors.

The difference between model (1) and ordinary regression models is that the factors and the coefficients are both unknown. That is, neither F nor A is known; both must be estimated. There is a fundamental indeterminacy in the model. If we (linearly) transform F and A, respectively, as F* = F C^{-1} and A* = C A, then (1) is equivalently written as

X_{N×p} = F*_{N×k} A*_{k×p} + E_{N×p}.   (3)

By observing X alone we cannot distinguish between these two models. This should be familiar from econometric textbook discussions of identification (e.g. Greene, 2004). Devising "simple" structures, in which as many factor loadings as possible are zeros, facilitates identification and interpretation of the factors. We will explore simple structures in detail. We now formally discuss the two popular methods of factor analysis that we will use: the Principal Components (PC) method and the pure factor analysis model, which we estimate by maximum likelihood (ML).

Fixed versus Random Factors

A distinction is made between models that presume the factor matrix F in (1) to be fixed, and models that presume F to be random. The random factors model is appropriate when we want to extend our inferences to different samples (say, of individuals), while the non-random factors model is appropriate when the specific observations (here countries), and not just the model structure, are of interest. The KAM data pertain to specific countries, and are exhaustive across countries, which makes a compelling case for the use of fixed-factor models. However, if inferences from the factor analysis were to be applied to countries not in the sample, or to the same countries in a future period, then it is advisable to use random-factor models. The likelihood function for (identified) models with random F is well defined (see e.g. Anderson, 1984, p. 552). Estimation of models with non-random F proceeds on least squares criteria (for which, unlike the random factors case, no distributional assumptions need be made unless statistical testing is to be done).

2.1 Principal Components Analysis of the Fixed Factors Model (fn 3)

In Stata, estimation of the principal components model proceeds as in a fixed factor model. Let Y be the mean-removed data matrix, scaled so that the matrix S = Y'Y is the data covariance matrix.
Consider the (non-random factor) model for Y:

Y_{N×p} = F_{N×k} A_{k×p} + E_{N×p},   (4)

where A is the factor loadings matrix and F is the matrix containing the factor scores. Using least squares to fit the (fixed data) model implies estimating F and A (for a given k; see Section 4 on determining k) in order to minimize the sum of the squares of the residual matrix

E = Y - FA.   (5)

The singular value decomposition (SVD) theorem indicates that the solution based on the largest k singular values \gamma_1, \gamma_2, ..., \gamma_k is given by

\hat{F}\hat{A} = \gamma_1 v_1 u_1' + \gamma_2 v_2 u_2' + ... + \gamma_k v_k u_k',   (6)

where u_j is a (p × 1) vector, v_j is an (N × 1) vector, and \gamma_j^2 is the jth eigenvalue of the data covariance matrix S. Define the matrices V_k = [v_1, v_2, ..., v_k], U_k = [u_1, u_2, ..., u_k] and \Gamma_k = diag[\gamma_1, \gamma_2, ..., \gamma_k]. Their dimensionalities are V_k: (N × k), U_k: (p × k), \Gamma_k: (k × k). Then the solution is

\hat{F}\hat{A} = V_k \Gamma_k U_k'.   (7)

Note that there is not a unique solution for F and A individually. Our solution will be in the direction of "simple" structures for \hat{A}. Consider the following solution:

\hat{F} = V_k,   \hat{A} = \Gamma_k U_k'.   (8)

Then the factor scores for the k factors, \hat{F}, are also in standardized form with covariance equal to the identity matrix. That is, they are pairwise uncorrelated. If E is small, so that Y is approximated by \hat{F}\hat{A}, then the data covariance is approximately

S = Y'Y ≈ \hat{A}'\hat{F}'\hat{F}\hat{A} = \hat{A}'\hat{A}.   (9)

The "pca" routine in Stata calculates principal components in the following steps:
1. Compute the covariance matrix S (fn 4).
2. Compute the k eigenvectors corresponding to the largest eigenvalues of S. Arrange the eigenvectors in the p × k matrix U_k.
3. Estimate the factor loadings as \hat{A} = U_k.
4. Estimate the factor scores as \hat{F} = Z\hat{A}, where Z is the data matrix with the p variables standardized to have zero mean and unit variance.

Thus, the factor loading matrix is the set of eigenvectors corresponding to the largest k eigenvalues. This is also the factor scoring matrix. Note that Stata computes factor scores using the standardized variables Z, not Y. In this solution the factors have different variances and are not comparable (their "units" are different). They must be scaled by \Gamma_k^{-1/2} to be comparable (and have unit variance).

Footnote 3: The principal components (PC) method is applicable to both fixed and random factors models (Reyment and Joreskog, 1993). We focus on PC as applied to fixed factors since Stata estimates PC for the fixed factors model. We indicate how to estimate the PC model for random factors in fn 5.

Footnote 4: In our analysis we use the Stata default of analyzing the data correlation matrix, which produces quantitatively somewhat different loadings and scores from the analysis of the variance matrix - known as the "scaling" problem of PC analysis - but qualitatively the results are close.

2.2 True Factor Analysis of Intercorrelations (using Maximum Likelihood)

True factor analysis is based on the random factors model. While the model in the random factors case is the same as (1), the population covariance matrix is

\Sigma = A'A + \Psi   (10)

if the factors are uncorrelated, and

\Sigma = A'\Phi A + \Psi   (11)

if the factors are correlated, where \Phi is the covariance matrix of the factor scores. In (10) and (11) \Psi is the true error covariance matrix (fn 5). In order to estimate the parameters of the model, we proceed by analyzing data that are mean-removed, so that the data covariance is S = X'X. We make the following assumptions about the true covariances:

(1/N) X'X → \Sigma,   (1/N) F'F → \Phi,   (1/N) F'E → 0,   (1/N) E'E → \Psi,   (12)

that is, finite second moments and orthogonality of the error and factor score matrices.
We will assume that the error covariance \Psi is diagonal, that is, measurement (and other) errors are uncorrelated across different variables. This diagonal error covariance is constant across observations (a "homoskedastic" covariance). The factors may be correlated, that is, \Phi is permitted to be non-diagonal (if the factors are uncorrelated - more on this below - then \Phi = I). Therefore, the population covariance is a function of the model parameters A, \Phi and \Psi:

\Sigma = A'\Phi A + \Psi.   (13)

Footnote 5: PC analysis of the random factors model is also possible, but requires the assumption that \Psi is small (that is, that E in (1) is small). The unweighted least-squares (ULS) criterion fits the factor model so that the sum of squares of the elements of S - A'A (presuming the factors are uncorrelated) is minimized. The PC solution to this problem may be computed in the following steps: 1. Compute the covariance matrix S. 2. Compute the k largest eigenvalues and arrange them in a diagonal matrix \Gamma_k. 3. Compute the corresponding k eigenvectors U_k of S and compute \hat{A} = U_k \Gamma_k^{1/2}; each eigenvector is now scaled so that its length equals the corresponding eigenvalue. 4. Compute the factor scores as \hat{F} = Y\hat{A}\Gamma_k^{-1}. While this solution is different from the fixed factor solution, it is applicable to the fixed factor case with the ULS criterion applied to the error matrix E in (4) and (5).

In PC analysis of the random factors model (see fn 5), the factors are determined so that they account for the maximum variance of all the observed variables. Thus, the emphasis in PC analysis is on eigenvalues, because the sum of all the eigenvalues is the total variance in all the variables. In true factor analysis, the factors are determined so that they best account for the intercorrelations of the variables. In true factor analysis the errors are presumed to be uncorrelated with each other, so that \Psi is diagonal (in PC analysis \Psi is simply assumed to be small, in the sense that \Sigma ≈ A'A; the rank of A'A, and therefore of \Sigma, is approximately k). In true factor analysis, \Psi in (13) has diagonal elements only, so that the off-diagonal elements of \Sigma are exactly equal to the off-diagonal elements of A'\Phi A, and the parameters are estimated to make the off-diagonal elements of the data correlation matrix as close as possible to the off-diagonal elements of A'\Phi A. The diagonal elements of \Sigma are equal to the sum of the diagonal elements of A'\Phi A (the "communalities" of the variables) and those of \Psi (the "uniquenesses" of the variables). The off-diagonal elements assume greater importance in true factor analysis than in PC analysis (where they are assumed away).

ML estimation of A and \Psi is based on the assumption that the error vector for observation i, E_i, is multivariate normal with mean 0 and covariance \Psi. The fit function for the multivariate data is

ln|\Sigma| + tr(S\Sigma^{-1}) - ln|S| - p,   (14)

which is minimized over the parameters A and \Psi (equivalently, the likelihood is maximized). The MLEs have well-defined limiting distributions, which are used for testing (fn 6).

Computing Factor Scores and Standard Errors

In order to estimate factor scores from the ML method, consider a single observation on the factor model:

x_{p×1} = A'_{p×k} f_{k×1} + e_{p×1},   (15)

where the lower-case letters denote the vector counterparts of the matrices in (1), written as column vectors. We proceed as described in Anderson (1971, p. 575). The data vector x and the factor score vector f have a joint normal distribution with mean (0', 0')' and covariance matrix

cov(x, f) = [ \Psi + A'\Phi A   A'\Phi ;  \Phi A   \Phi ],

where the first block row corresponds to x and the second to f. The factor scores are computed by the regression of f on x.
In terms of the population parameters, this regression is

E(f | x) = \Phi A (\Psi + A'\Phi A)^{-1} x.   (16)

Using the conditional variance formula, the covariance of the regression is

cov(f | x) = \Phi - \Phi A (\Psi + A'\Phi A)^{-1} A'\Phi.   (17)

Replacing the parameters by their ML estimates yields an estimate of the (conditional) covariance, which may be used to test hypotheses about an observation's scores on the different factors. The square roots of the diagonal elements of the (estimated) covariance are the standard errors of the estimated k-vector of factor scores for that observation. These standard errors are constant across observations. Dividing a factor score by the corresponding standard error produces a t-statistic for testing the statistical significance of individual factor scores.

Footnote 6: In Stata, the "factor" command is used together with the "ml" option in order to estimate the parameters of the factor model.

To take a simple example, suppose the data are aggregated into a single factor, k = 1. Then the matrix \Phi collapses to unity, and the estimator for the (scalar) factor score is

E(f | x) = A (\Psi + A'A)^{-1} x,   (18)

and its (scalar) variance is

cov(f | x) = 1 - A (\Psi + A'A)^{-1} A'.   (19)

For this single-factor case, denoting the ML parameter estimates with "hats", the factor score (for the single observation) is computed as the conditional mean

E(f | x) = \hat{A} (\hat{\Psi} + \hat{A}'\hat{A})^{-1} x,   (20)

and its standard error is

se(f | x) = [ 1 - \hat{A} (\hat{\Psi} + \hat{A}'\hat{A})^{-1} \hat{A}' ]^{0.5}.   (21)

2.3 Error Components Method

The error-components (EC) approach used by Kaufmann et al. (2005) to measure governance across several countries is a random-factors approach based on econometric methods developed for latent data models (see e.g. Goldberger (1972) and the MIMIC models of Joreskog (1967) and Joreskog and Sorbom (1979)). The Kaufmann et al. approach is to fix the number of variables that map into a factor and then estimate scores for the factor as conditional means, conditional on parameters estimated by maximum likelihood. Thus, one major difference from the PC and ML methods of factor analysis described above is that the number of variables that map into a factor is prespecified (fn 7). That is, the number of variables p (and which ones they are) is treated as prior information. The computation of EC factor scores proceeds in two steps. First, the model parameters are estimated by maximum likelihood. Next, they are used to compute the scores as conditional means. The method also produces conditional variances, which may be used to construct confidence intervals for the factor scores or for testing. They (implicitly) consider the following factor model:

X_{N×p} = F_{N×1} \beta'_{1×p} + E_{N×p}.   (22)

This corresponds to (1) except that the factor loading vector \beta takes the place of the factor loading matrix A in (1). Whereas in (1) p variables mapped into k factors, here p variables map into a single factor. Note that, while we have chosen to use the same notation to indicate matrix dimensions, the number of variables p may be chosen to be a specific set of variables, and not the entire data matrix at hand (as it was in the case of the ML and PC methods, in which the number of factors k is determined by the data). Since in the EC method k = 1, the p variables may be chosen to be a "homogeneous" subset of the variables designed for mapping into that factor. The EC likelihood function is as follows. Let \alpha, \beta, and \psi be (p × 1) parameter vectors.
As before, \Psi is defined to be the diagonal (p × p) error covariance matrix, with diagonal elements \psi_1, ..., \psi_p. Let the (p × p) matrix \Sigma = \beta\beta' + \Psi. Then the likelihood function for the data is

L = -0.5 [ N ln|\Sigma| + \sum_{j=1}^{N} (x_j - \alpha') \Sigma^{-1} (x_j - \alpha')' ].   (23)

In (23) the parameter \alpha is simply the vector of the means of the p variables in X. For observation j the 1 × p data vector is denoted x_j. Denoting the ML parameter estimates with "hats", the factor score for observation j is computed as the conditional mean (conditional on x_j)

\hat{F}_j = \hat{\beta}' \hat{\Sigma}^{-1} (x_j - \hat{\alpha}')',   (24)

and the standard error of this estimate is computed as

se_j = [ 1 - \hat{\beta}' \hat{\Sigma}^{-1} \hat{\beta} ]^{0.5}.   (25)

This is exactly the same as (21), with A = \beta' (so that \Psi + A'A = \Psi + \beta\beta' = \Sigma). Where the EC method differs from the (random) factor method estimated by ML is in the specification of the likelihood functions. Whereas in the EC method the data likelihood is maximized over the parameters, in the ML factor method the likelihood of the intercorrelations in the data is maximized over the parameters. In this sense, the EC method is still a variance method (driven by a squared-error loss objective), while the ML factor method pays attention to the intercorrelations among the variables.

Footnote 7: For example, Chen and Dahlman (2005) partition the KAM variables into four "pillars": Economic Incentive and Institutional Regime, Education and Human Resources, Innovation System, and Information Infrastructure. In the Chen-Dahlman scheme, since tariff and non-tariff barriers, regulatory quality, and rule of law represent the Economic Incentive and Institutional Regime pillar, p = 3 for this factor.

3. Data

The Knowledge Assessment Methodology (KAM) database consists of more than 80 structural and qualitative variables that measure how countries perform as "knowledge economies". We will use the subset of 12 variables that are used by the KAM method to compute each country's "basic scorecard". They are: tariff and non-tariff barriers, regulatory quality, rule of law, adult literacy rate (% age 15 and above), secondary enrollment, tertiary enrollment, researchers in R&D, patent applications granted by the USPTO, scientific and technical journal articles, telephones (mainlines + mobile phones), computers, and internet users. The KAM website (see fn 1) indicates the variety of sources from which the data are drawn. In addition to these unscaled data, we will also perform factor analysis on these variables with a subset of them scaled so that country size does not influence the analysis. The scaled set of variables is: tariff and non-tariff barriers, regulatory quality, rule of law, adult literacy rate (% age 15 and above), secondary enrollment, tertiary enrollment, researchers in R&D (per million population), patent applications granted by the USPTO (per million population), scientific and technical journal articles (per million population), telephones per 1,000 persons (telephone mainlines + mobile phones), computers per 1,000 persons, and internet users per 10,000 persons. Data on the 12 variables are available at two points in time, one measured in 1995 and another during a more recent period, between 2002 and 2004. We will use the term "2002" to indicate the recent data. Table 1 describes the variables and reports descriptive statistics for the 12 variable-pairs.

3.1 Missing Data Imputation

The factor analysis restricts the sample to observations with complete data on all included variables.
Hence, a crucial pre-estimation step is to impute missing data in order to have as broad a coverage of countries as possible. The imputations are carried out using a simple regression of the variable with missing data on a conceptually closely related independent variable. For example, (unscaled) research95 has data for only 86 countries. However, the closely related research03 has data for 95 countries. Therefore, nine observations can be additionally imputed by regressing research95 on research03. The first column of Table 2 shows the results of this regression. The R-squared of 0.91 indicates a good fit for the imputation. Having filled these nine data points, we now have data for 95 countries for research95. That is still not enough. The next closely related variable is technical journal output in 1995 (techjour95). The second column of Table 2 indicates that this regression has a reasonably good fit, with an R-squared of 0.70. The variable techjour95 is statistically significant at 1%. Therefore, the two-step regression process makes data on research95 available for 120 countries. A similar two-step regression process is used to impute missing research03 data via the regressions shown in columns 3 and 4 of Table 2. The last three columns in Table 2 impute data for computer95, computer04 and tariffs and NTBs for '95 (tntb95) using, respectively, GDP per capita for the two computer variables and tntb05 as regressors (fn 8). After completing the imputations we have data for 120 of the 128 countries. Unavailability of data on the regressors prevents imputing missing data for the remaining 8 countries. The factor analysis is based on the sample of these 120 countries. The authors' working paper provides details on the countries for which variables are imputed and the imputed values.

Footnote 8: For imputing missing country values we use not only available data across countries but also data for the ten regions, including the world. There is additional information in these aggregated regions which can be brought to bear on the imputations.

3.2 Is It Worth Doing Factor Analysis on the KAM Data?

The main objective of the factor analysis is to understand whether countries have advanced their positions over the 10-year period in terms of (i) the absolute measures of the factors, and (ii) their factor score ranks vis-a-vis other countries. Before proceeding with factor analysis and the computation of factor scores, it is important to understand whether and how much we can gain from undertaking a factor analysis. The cross-country data on the 12 variables have considerable correlations among them. However, if the correlations are driven by common underlying factors, then the factors become the main objects of interest for us. If two variables share a common factor with other variables, their partial correlation, controlling for all remaining variables, will be small. The Kaiser-Meyer-Olkin (KMO) statistic, based on this idea, computes the ratio of (i) the sum of squared correlations of each variable in the analysis with every other variable to (ii) the same sum plus the sum of squared partial correlations of each variable with every other variable, controlling for all remaining variables. Large values of this "overall" KMO measure indicate that the partial correlations are small, that is, that common underlying factors are responsible for the correlations among the variables. A large value of the KMO measure therefore indicates considerable gains from undertaking a factor analysis.
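To fix ideas, the following sketch illustrates how an overall KMO measure of this kind can be computed. It is an illustrative Python fragment with simulated stand-in data, not the code or software used for the results in this paper; the partial correlations are obtained here from the inverse of the correlation matrix.

```python
import numpy as np

def kmo_overall(X):
    """Overall Kaiser-Meyer-Olkin measure for the columns of X (N x p).

    Sketch only: partial correlations (controlling for all other
    variables) are derived from the inverse of the correlation matrix.
    """
    R = np.corrcoef(X, rowvar=False)           # simple correlations
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(R_inv))
    P = -R_inv / np.outer(d, d)                # partial correlations
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(R, 0.0)                   # keep off-diagonals only
    r2 = (R ** 2).sum()
    p2 = (P ** 2).sum()
    return r2 / (r2 + p2)

# Example with simulated data standing in for the KAM variables:
rng = np.random.default_rng(0)
common = rng.normal(size=(120, 1))             # one shared underlying factor
X = common + 0.5 * rng.normal(size=(120, 12))  # 12 noisy measures of it
print(round(kmo_overall(X), 3))                # values near 1 favor factoring
```

When a single common factor drives all the measures, as in this simulated example, the partial correlations are small relative to the simple correlations and the KMO measure is close to one.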
Table 3 indicates that the overall KMO measure is 0.875, which provides solid support for proceeding with factor analysis of the KAM data. Further, the KMO statistic for each variable individually indicates that their high correlations are driven by underlying factors (fn 9).

Footnote 9: The variables in Table 3 are an amalgam of each variable-pair over the two years; see below.

3.3 Avoiding a Pitfall

Performing factor analyses separately on the 1995 variables and the 2002 variables is a pitfall one should avoid if the purpose of the factor analysis is to compare factor scores across the two periods. Separate analyses produce factor scores (that is, the quantity of a factor contained in each country) that are not strictly comparable. For example, in PC analysis separate analyses produce factors with different variances, so that their magnitudes are not comparable (that is, their "units" are different). In order to solve this problem, we proceed as follows. First, we combine the 1995 and 2002 variables into one set of 12 variables. To be consistent, the same factor-analytic method is used to combine each pair of variables as is used in the factor analysis of the full set of 12 variables. For example, Computer95 and Computer04 are factor-analytically combined into one computer variable using either maximum likelihood (ML) or principal components (PC). Second, we proceed with the factor analysis of this set of 12 amalgamated variables. Third, we use the common estimate of the "scoring coefficients" matrix (see below) produced by the factor analysis, but apply that matrix separately to the 1995 and 2002 variables in order to compute separate sets of factor scores, one for 1995 and one for 2002. These scores are used to analyze changes in the factors over the two periods. In this paper the scores are used ordinally to rank countries. However, making simple adjustments to the mean and standard deviation allows cardinal comparisons as well, for example in regression analyses.

In the following section we report and analyze the results from the Principal Components (PC) method and the Maximum Likelihood (ML) method. The discussion is from the ground up and provides details about (i) why the specific number of factors is chosen in each method, (ii) why we choose "simple" factor loading structures for our analyses, (iii) why we choose the specific method for obtaining the simple structure, and (iv) an approximation that makes the structure especially simple and is essential to achieving our objective of comparing factor scores across years (plus a chi-squared test of whether the approximation is statistically accurate).

4. Empirical Results

4.1 Principal Components Analysis

The first step in factor analysis is choosing k, the number of factors that will fit the data "adequately". In PC analysis, an oft-used criterion is to set k to be no less than the number required to explain 95% or more of the total variance in the data (fn 10). Table 4.1 shows that six factors are required to explain at least 95% of the variance in the 12 KAM variables. Thus, we choose k = 6. Even though all six factors are required to account for 95% of the variance in the data, the first factor accounts for the lion's share of the data variance. Table 4.1 indicates that the first principal component accounts for 60.2% of the total variance. We might expect that this component will also have the maximum number of large loadings among all principal components. Table 4.2 reports the loadings with k = 6.
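As a concrete illustration of the extraction and k-selection steps just described, the following sketch selects the smallest k that explains at least 95% of total variance and computes unrotated loadings and scores as in steps 1-4 of Section 2.1. It is illustrative Python applied to a generic standardized data matrix; the array names are hypothetical and this is not the Stata routine used for the results reported here.

```python
import numpy as np

def pca_extract(Z, var_target=0.95):
    """Eigen-decomposition of the correlation matrix of Z (N x p).

    Returns (k, loadings, scores): k is the smallest number of components
    whose eigenvalues explain at least var_target of total variance,
    loadings are the corresponding eigenvectors (p x k), and scores are
    Z times the loadings, mirroring steps 1-4 of Section 2.1.
    """
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    share = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(share, var_target) + 1)
    A_hat = eigvecs[:, :k]                      # loadings = eigenvectors U_k
    F_hat = Z @ A_hat                           # scores F = Z A
    return k, A_hat, F_hat

# Usage with simulated stand-in data (120 countries, 12 variables):
rng = np.random.default_rng(1)
Z = rng.normal(size=(120, 12))
Z = (Z - Z.mean(0)) / Z.std(0)                  # standardize the variables
k, A_hat, F_hat = pca_extract(Z)
print(k, A_hat.shape, F_hat.shape)
```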
As expected, a majority of the variables load heavily on this first component. Not only does this make it a catch-all factor, but since the variables do not load on the remaining factors, those factors have little political-economic content. For this reason, factor analysts have sought to design factors with "simple" structures so that all factors have meaningful content.

Footnote 10: While this criterion is the most popular, other criteria have also been used. They are: (i) The size of individual factor loadings: the squared factor loadings (for orthogonal factors) indicate the variance of a variable accounted for by a particular factor, and factors not contributing much may be dropped; if parsimony is a driving concern, the rule of thumb proposed by Reyment and Joreskog (1993) - that there should be at least three significant loadings on each factor - may be used. (ii) The variance explained by a factor: the sum of squared loadings for a given factor represents the information content of the factor, and the ratio of this sum of squares to the trace of the correlation matrix is the proportion of total information residing in the factor; a cutoff value can then be used to determine how many factors to retain. (iii) Significant residuals: a residual correlation matrix may be calculated after each factor has been extracted, and k is determined at the point when the residual matrix consists of correlations solely due to random error; the standard error of the residual correlations (estimated roughly as 1/sqrt(N - 1)) can be used to determine whether the correlations are significantly greater than zero.

Simple Structure: Orthogonal and Oblique Rotations

The criteria advanced by Thurstone (1947) have been influential in producing computationally feasible methods that deliver simple structures:
· There should be at least one zero in each row of the factor loadings matrix.
· There should be many (at least k) zeros in each column of the factor matrix.
· For every pair of factors, only a few variables should have sizable loadings on both; some variables should load heavily on one and not at all on the other; and several variables should have near-zero loadings on both factors.

Two classes of methods have evolved that produce simple structures. The first class of methods, orthogonal rotations, maintains the uncorrelatedness of the factors, while the second class, oblique rotations, seeks simple structures with correlated factors. Since the latter relax the constraint of orthogonality of the factors, they are capable of producing even simpler structures than orthogonal rotations. Technically, rotations work as follows. In the Stata code, a k × k rotation matrix T rotates the factor loadings in Step 4 (see Section 2.1) so that the rotated factor loadings matrix, denoted \hat{A}_R, is given by

\hat{A}_R = \hat{A} T = U_k T.   (26)

If T is an orthogonal transformation matrix, the rotation preserves the orthogonality of the factor score matrix F. Otherwise the rotation is oblique, that is, the factors are correlated. Table 4.3 displays the oblique rotation matrix T that produces the simple structure that we use to proceed with our analysis and computations. In PC analysis, even after rotation the total variance explained by the factors is still the same (95.30%), but the portion accounted for by each factor is now different. As Table 4.4 shows, rotation distributes the explained variance more evenly across factors than the unrotated solution. This is the point of the simple structure: to identify a factor associated with only a few variables.
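For readers who wish to see the mechanics of an orthogonal rotation, the sketch below implements a basic varimax criterion directly in Python. It is a minimal illustration of equation (26) under the assumption that an unrotated loadings matrix is available; it is not the rotation routine used to produce Table 4.3, and it omits refinements such as Kaiser normalization.

```python
import numpy as np

def varimax(A, max_iter=100, tol=1e-8):
    """Orthogonal (varimax) rotation of a p x k loadings matrix A.

    Returns the rotated loadings A_R and the k x k rotation matrix T,
    so that A_R = A @ T, as in equation (26).
    """
    p, k = A.shape
    T = np.eye(k)
    prev = 0.0
    for _ in range(max_iter):
        L = A @ T
        # Gradient of the varimax criterion (Kaiser normalization omitted)
        G = A.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt
        crit = s.sum()
        if crit - prev < tol:
            break
        prev = crit
    return A @ T, T

# Usage on the unrotated loadings from the previous sketch:
# A_rot, T = varimax(A_hat)
```

An oblique criterion such as oblimin replaces the orthogonality constraint on T with a looser normalization, which is why it can deliver a structure at least as simple as the varimax solution.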
A graphical analysis makes the connection between rotation and simple structure clear. Figure A1 (see appendix) plots the unrotated factor loadings for k = 6 factors. A row of the factor loadings matrix, indicating how the corresponding variable loads on each of the 6 factors, is depicted as a point in the 6-dimensional space of factors. The projections of these points onto the 15 possible two-dimensional subspaces are displayed in the panels of Figure A1. Consider the top row of five graphs, in which the y-axis measures the loadings of the 12 variables on the first principal component (C1). While the C1 vs. C2 graph shows some evidence of clustering, the structure of the loadings is not very simple as we move across the row of graphs. Now compare the same row in Figure A1 with the first row in Figure A2, in which the axes have been rotated orthogonally (that is, rotating the axes while keeping the origin at the same point and maintaining the angle between the axes at 90 degrees) in order to achieve a simpler structure. This type of rotation is known as the "varimax" rotation. There is a clear separation of the loadings into two y-clusters: one set of variables (patapp, techjour, tel) projects into high y-values, that is, high C1 loadings, and the other into low y-values. This type of simple structure is in evidence not only for loadings on C1 but also on C4 (tertiary and secondary enrollment), C5 (adult literacy), and C6 (tariffs & ntbs). While the C2 and C3 rows indicate the presence of clusters with high loadings (regulation quality and law load on C2, and computers and net users load on C3), the structure of loadings on these two components is not as simple as Thurstone's ideal. Regardless, the varimax rotation has made the structure of loadings much simpler.

Can an oblique rotation that relaxes the constraint of uncorrelatedness of the principal components (i.e. the 90-degree angle between the axes) achieve an even simpler structure? Figure A3 depicts the result of an oblique rotation (known as the "Oblimin" rotation) (fn 11). There is no visible difference between the orthogonal rotation results in Figure A2 and the oblique rotation loadings in Figure A3; a comparison of the two figures shows that the difference in the loadings is almost negligible. In other words, in principal components analysis the orthogonal rotation considerably simplifies the structure of loadings, and the oblique rotation reproduces it but does not simplify it further. However, in the true factor analysis below (using ML) an oblique rotation produces a significant improvement over the orthogonal rotation. We therefore adopt the oblique rotation results for computing factor scores.

Footnote 11: Although in the figure the axes appear perpendicular to each other, they are not; that is done merely for convenience. Correlation between two components implies that the angle is less than 90 degrees. What is important to us, however, is the projection of the points onto the axes.

Outside of economics and political science, in psychometrics for example, researchers conducting exploratory factor analysis have generally assumed orthogonal factors (fn 12). In economics and political science, however, there is every reason to believe that factors should be correlated. Multiple regression is prevalent in economics and political science precisely because non-experimental data are correlated. In order to satisfy the ceteris paribus assumption, considerable care is taken to include appropriate control variables.
We should embrace the idea that political-economic data are correlated when such data are determined in general equilibrium. It is therefore almost impossible for the data to be orthogonal. Within a data class there may be strong interdependencies, while across data classes these interdependencies may be weak. In that case, the assumption of partial equilibrium for each data class may be justified. In Chen and Dahlman (2004) this assumption leads the authors to think of their data classes as "pillars". Here, we let the data decide how to form groups. There are two related messages here. The first is that the main objective of the factor analysis is to identify the underlying dimensions that the observed data purport to measure. Second, and related to this objective, there is no theoretical reason why the underlying factors should be uncorrelated. The underlying dimensions are determined by the same general-equilibrium mechanism that generates the measures of these factors (i.e. the variables). Theoretically, factors should be correlated (fn 13).

Footnote 12: Traditionally, a clear distinction has been made between confirmatory factor analysis (CFA) and exploratory factor analysis (EFA). In CFA, if theory suggests two factors are correlated, then an oblique rotation is justified. In EFA, there is neither a theoretical basis for knowing how many factors there are nor for knowing whether they are correlated. In economics and political science, we argue, CFA will generally indicate correlated factors due to the interdependencies of the general equilibrium under which the data are generated.

Footnote 13: It is reasonable that correlations among factors should be weaker than correlations among the variables measuring any single factor (else the two factors should be combined into one).

Political-Economic Dimensions of the Data: Naming the Factors

What names are appropriate for the principal components? Table 4.5 shows that the variables researchers, technical journals, and patent applications load heavily on the first principal component (RC1). Therefore, this component is named the Innovation Potential factor, since the ability of an economy to innovate is appropriately measured by these important inputs. Since law and the quality of regulations load heavily on RC2, we call it the Law and Regulation factor. RC3 is named the ICT factor since computers, net users and telephone lines load heavily on it. RC4 is the Education factor since secondary and tertiary enrollments load heavily on it. RC5 is named the Literacy factor after the single variable, adult literacy. The final factor, RC6, is the Openness factor because tariffs and NTBs load almost entirely on it. Of note is the fact that the unexplained variance (last column), after accounting for the six principal components, is quite small for every variable. This indicates that the factor model with six components fits the data well at the individual variable level (thereby also satisfying criterion (ii) in fn 10).

Computing Factor Scores

The scores on any factor indicate how much of the factor is "contained" in a particular country. We use the oblique-rotated factor loadings as the basis for our factor score computations. As indicated in (26), the unrotated principal components (eigenvectors) \hat{A} are transformed into the rotated components \hat{A}_R as \hat{A}_R = \hat{A}T = U_k T (Table 4.3 displays the matrix T). In order to estimate the factor scores, we use the direct method (as distinct from the regression method; see Reyment and Joreskog, pp. 223-225) (fn 14).

Footnote 14: This is different from the method used to compute the scoring matrix in the ML method below.

In this method, the
In order to estimate the factor scores, we use the direct method (as different from the regression method, see Reyment and Joreskog, pp 223-225).14 In this method, the 13It is reasonable that correlations among factors should be weaker than correlations among variables measuring any single factor (else the two factor should be combined into one). 14This is different from the method used to compute the scoring matrix in the ML method below. 23 factor scores F are computed as ^ F = Z[(ARAR)- AR] , ^ ^ ^ 1 ^ (27) where F is the (n × k) matrix containing factor scores on each factor for the 120 countries, ^ and Z is the (n × p) matrix containing the standardized data variables. The (p × k) scor- ^ ing coefficient matrix in Table 4.6 (produced by Stata) is the transpose of the coefficients (ARAR)- AR. However, before computing factor scores using (16), an important step is ^ ^ 1 ^ required to avoid another pitfall. Avoiding a Pitfall (and trade-offs involved) The scoring coefficients in Table 4.6 show that a few coefficients, indicated in bold, should dominate the measurement of the factor scores. In practice, however, other elements of the matrix can and do influence the computation of the factor scores, with unexpected consequences. Consider the first factor, the ICT factor, in Table 4.6. It consists of three large positive scoring coefficients (computers, internet users and telephones) and nine small coefficients, some of which are negative. These negative coefficients can actually produce contrarian factor scores. Take Angola, for example. Applying the direct method in (16) pro- duces an ICT factor score for Angola that ranks it 70th among the 120 countries. However, when the countries in the sample are ranked individually according to the three variables that measure ICT, Angola ranks near the bottom of the list, below 115th, in all three rank- ings. The reason why its rank on the ICT factor score is much higher than its rank on any of the three variables is because some negative coefficients multiply into (large) negative values of the corresponding standardized variables to create positive numbers. The point is that, although the small coefficients appear innocuous, using them in a formulaic manner can lead to mismeasuring factor scores, sometimes quite poorly. Because accurately measuring the factors scores is critically important, we take care to 24 produce the simplest structure possible.15 The example above indicates that despite those efforts, the structure is still not as simple as Thurstone's ideal. Had that ideal been achieved by the oblique rotation, it would also have produced accurate factor scores. In order to overcome the pitfall, exemplified by the Angola case, we propose to keep only the leading scoring coefficients in each column while computing factor scores, and to set the remaining coefficients to zero. This scoring matrix with the embedded zeros is presented in Table 4.7. For example, in the first column we retain the first three elements of the scoring coefficient matrix in Table 4.6 that correspond to the main loadings on this factor. The remaining elements are set to zero. This approximation may be formally tested using Anderson's eigenvector test (Reyment and Joreskog, 1993, p. 101). 
In order to test whether a specific vector b is equal to the eigenvector ai associated with the eigenvalue i of a matrix S (ai is the ith principal component of the data correlation matrix), the Anderson test statistic is: 2eig = (N - 1) ib S- b + 1 1 (28) i b Sb - 2 The statistic is distributed as 2 with p - 1 degrees of freedom, where p is the number of elements of the eigenvector (here p = 12). Inserting the eigenvector a in place of b in (17) results in a value of 2eig = 0. We will use this property to adapt (17) to test for the rotated factor loadings AR which was created by transforming A using the rotation matrix T , AR = AT. Even though the columns of AR are not eigenvectors, since the columns of ART- are the original eigenvectors (17) can equivalently be written as a test statistic for 1 the equality of the rotated vector corresponding to eigenvector ai, aRi, with a specific vector bR as: 2Reig = (N - 1) iT- bRS- bRT- + 1 1 1 1 1 1 i i (29) i T- bRSbRT- - 2 , i i 15The ordinal measure is a unit-free standardized factor score, relative to the median country whose score is zero. 25 where T- is the column of the T- that corresponds to eigenvector being tested for equality 1 1 i with the specific vector bR. We use (18) to test for the equality of each component (column) of the rotated factor loading matrix AR ­ the simplest possible principal component structure ^ given the data ­ with the corresponding column of the rotated factor loading matrix with embedded zeros AR ­ our "ideal" Thurstonian structure given the data.16 The statistic ^ 0 re-rotates the components back into the original unrotated eigenvector space and compares whether the simpler structure can map back closely to the unrotated loadings (which is the basis for the test statistic). Thus, the computed statistics correspond to the order of the unrotated eigenvectors. For the six columns we get the calculated chi-squared statistics, with 11 degrees of freedom, to be: 304.8, 54.6, 2.17, 35.0, 28.8, and 7.60. The critical value of 24.7 rejects equality of four of the six principal components (i.e. columns of AR and AR ). ^ ^ 0 The trade-off that this result forces is between (a) accepting the results of the test and using the full scoring matrix to compute (sometimes unreliably) the factor scores, and (b) to pro- ceed using a scoring matrix with zeros replacing the elements for which the corresponding loadings are small. The latter option is the one we choose for two reasons. First, we believe that while imperfect, as shown by the formal test procedure, it is a good approximation because the simple structure has delivered a clear picture of the variables that are strong measures of each factor. They come close to approximating the Thurstone ideal, and replac- ing the small loadings with zeros accomplishes that ideal. Second, and more important, is the overwhelming need to have consistent estimates of factor scores. As the Angola example drives home, the scores on a factor must be consistent with the underlying rankings of those variables that overwhelmingly determine the characteristics of the factors. For these reasons, we proceed with the use of the zero-embedded scoring coefficient matrix 16This method may not be used with the ML method below because the ML method's loadings matrix is not the matrix of eigenvectors, so there is no correspondence between the loadings matrix and the eigenvalues. 26 in Table 4.7 to determine the scores on the six factors. 
The same matrix is used to compute the 1995 scores and the 2002 scores on the six factors. so that the scores across the two periods may be compared. These factor scores will be used to depict two important features of the sample. First, the scores allow us to rank each country according to the values of each factor, for any specific period. Second, the scores from the two periods indicate how a country's relative position on a factor has changed over that time. Before performing these comparisons, we undertake a different kind of factor analysis, estimated by maximum likelihood. We will then be able to draw on a richer and robust set of results when we inspect country rankings and how they have changed. 4.2 True Factor Analysis with Maximum Likelihood The salience of many of the issues discussed while analyzing the PC results ­ rotation, simple structures, correlation of factors ­ are relevant for true factor analysis as well. Here too, our objective is to achieve the simplest structure, which for the ML estimates requires an oblique factor rotation. As was the case with PC analysis, in order to avoid inconsistent estimates and rankings in the true factor analysis, we set the factor scoring coefficients corresponding to small factor loadings to zero. The important difference from the PC analysis is that ML favors fewer factors. The communalities are reasonably high as indicated by the fairly low (below 0.35) uniqueness in the variables. With the focus now on intercorrelations rather than variances (see (14)), the appropriate measure of fit used to assess how many factors best fit the data, is no longer the amount of total variance explained by the factors as in PC analysis. Three measures are appropriate here, a chi-squared measure of fit, denotes 2fit, and two information-based criteria - the Bayes information criterion (BIC), and Akaike's information criterion (AIC). The first column Table 5.1 indicates the number of factors. The next five columns are related to the chi-squared fit 27 statistic corresponding to the number of factors in the first column. 2fit is distributed with 0.5[(p - k)2 - (p + k)] degrees of freedom. It is used to test the hypothesis that k or less factors are required to rationalize the data. At the 1% level of significance the calculated statistic rejects k = 1 and k = 2 but fails to reject k = 3. The smallest k is therefore three according to this measure of fit. Another use of this statistic is to see if the difference in the statistic with every increase in k is "statistically significant". Thus, going from k = 5 to k = 6 is the first increase (starting from k = 1) for which the change in the statistic is not significant. According to this variant, k = 5. The two information criteria reward parsimony and penalize over-parameterization, with the BIC penalizing over-parameterization more strictly. The smaller the BIC and AIC, the more preferred the model. The BIC chooses k = 4 while the AIC chooses k = 5. Thus, the statistical tests conclude that we should focus our attention on no more than five and no less than four factors. We estimated the model with both four and five factors. Upon examining the simplest loading structure we found the four factor model to have cleaner political- economic content since the fifth factor is not distinct in the sense of clearly generating even one of the variables. That is, it consists of many small undistinguished loadings that are collectively significant but not individually so. Thus, we proceed with k = 4. 
Table 5.2 reports the oblique-rotated factor loading matrix with four factors (the rotation matrix T is reported in Table 5.3). The oblique rotation improves upon the orthogonal (varimax) rotation and produces a simple structure. The first factor is named the ICT factor because the variables computers, internet users and telephones load heavily on this factor. Further, these three variables do not load heavily on any of the remaining factors, thus satisfying an important simple-structure requirement. The second factor is named the Law, Regulation and Openness Factor because the three variables law, quality of regulation, and tariff and non-tariff barriers load heavily on this factor. These variables also do not load heavily on the other factors. Factor 3 is named the Literacy and Education Factor because the three variables adult literacy, secondary enrollment and tertiary enrollment load heavily on this factor (and not on any other). Finally, factor four is named the Innovations Factor because the number of patent applications, the number of researchers, and the number of articles in technical journals load heavily on this factor. Thus, Table 5.2 indicates a clear and simple structure of factors. These four factors define the underlying dimensions in the data, which are measured by the observed variables. That is, computers, internet users and telephones are essentially different measures of the ICT dimension, and adult literacy, secondary enrollment and tertiary enrollment are different measures of the Literacy and Education dimension.

An attractive feature of the four factors is that they account for the communalities in the variables quite well. The residual variances are small, as indicated by the last column of Table 5.2. None of the variables has a large measure of "uniqueness". If one of the variables did, it would mean that the error variance from a regression of that variable on the factors would be large. As a rule of thumb, a uniqueness measure greater than 0.50 for a variable would indicate the presence of a unique factor, uncorrelated with the four common factors. Fortunately, the four factors rationalize our data well. Finally, just as for the PC analysis, in order to compute factor scores we use the Thurstonian scoring coefficient matrix in Table 5.5, obtained by replacing the undistinguished elements of the full scoring coefficient matrix in Table 5.4 with zero and retaining the significant loadings in each column.

4.3 Weighted Data

In addition to the data set analyzed above, it is instructive to analyze a data set in which variables that increase with the size of the country are scaled down. Thus, we also factor-analyze a "weighted" data set which differs from the "unweighted" data (analyzed thus far) with regard to three variables: patent applications, researchers in R&D, and scientific and technical journal articles. In the "weighted" data these three variables are scaled by population, while the remaining nine variables are exactly the same as in the "unweighted" data. The scaling of these three variables does influence the optimal number of principal components required to rationalize the data. For brevity, we refer the reader to Chen and Gawande (2006) for details such as the factor loading matrices for the "weighted" data. The methods for estimating those matrices and then using them to estimate the factor scores are exactly the same as described in Section 3.
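To make the common-scoring logic of Sections 3.3, 4.1 and 4.2 concrete, the sketch below zeroes the small scoring coefficients and applies the single resulting scoring matrix to the standardized 1995 and 2002 variables. The array names, the index lists, and the use of a common mean and standard deviation for standardizing both periods are assumptions made for illustration; this is not the paper's code.

```python
import numpy as np

def zero_embed(scoring, keep):
    """Keep only the leading coefficients in each factor's scoring column.

    `scoring` is p x k; `keep` lists, for each factor, the row indices of
    the variables that load heavily on it (for example, computers, internet
    users and telephones for the ICT factor). All other entries are zeroed,
    in the spirit of Tables 4.7 and 5.5.
    """
    S0 = np.zeros_like(scoring)
    for j, rows in enumerate(keep):
        S0[rows, j] = scoring[rows, j]
    return S0

def period_scores(X, mean, sd, scoring0):
    """Standardize one period's data with a common mean/sd (an assumption
    here), then apply the common zero-embedded scoring matrix."""
    Z = (X - mean) / sd
    return Z @ scoring0

# Hypothetical usage: `scoring` from the amalgamated-variable analysis,
# `keep` from the retained entries of the scoring table, X95 and X02 the
# raw 1995 and 2002 data matrices.
# scoring0 = zero_embed(scoring, keep)
# F95 = period_scores(X95, mean, sd, scoring0)
# F02 = period_scores(X02, mean, sd, scoring0)
```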
The main differences between the two data sets are that, in the "weighted" data, the optimal number of ML factors (according to the Bayes information criterion) is three, one fewer than for the "unweighted" data, while the optimal number of principal components is seven, one more than for the "unweighted" data. In the graphical analysis of the factor scores and rankings below, we differentiate the findings from the "unweighted" and "weighted" data sets.

5. Analysis of the Factor Output

The main objective of the factor score computations is to use them to describe how countries rank on the basis of these factors, and how those rankings have changed over the two periods. The authors' working paper contains a more complete analysis for 20 underdeveloped, developing, emerging, oil-rich and industrialized economies. Here we discuss these results for five countries.

Figure 1, for Albania, has four panels. The panel on the top left depicts Albania's rank vis-a-vis the other 120 countries in the sample on each of the six principal components. The spider chart on the top right depicts Albania's rank on the four ML factors. The bottom-row panels contain the weighted-data counterparts to the top row; there are seven principal components and three ML factors in these data. The green line inside each spider chart shows how Albania ranked on each principal component or ML factor in 1995. The red line shows Albania's ranking in the most recent period, around 2002. If, along any factor axis, the red line is closer to the center than the green line, then Albania's position relative to other countries in the sample on that factor has worsened over the decade. This unpleasant and surprising finding applies to the Literacy factor, the ICT factor, and the Education factor. Albania's ranking on the Literacy factor dropped from near the top 25th percentile to the bottom 35th percentile over this decade. Similar deteriorations are in evidence for the ICT factor and the Education factor. Whether this decline in rankings implies that Albania degraded in absolute terms on the factor score, or whether it improved but at a far slower pace than other countries, is not obvious from the graphs. However, since we have used a common factor scoring matrix for computing factor scores for the two periods, the scores can be put to use in cardinal comparisons as well. The ML factors, although fewer in number, convey the same difficult message about the change in Albania's ranking on the ICT factor and the Literacy & Education factor.
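As an illustration of how such panels can be read, the following is a minimal sketch, not the paper's plotting code, of a spider chart of the kind shown in Figure 1. The factor names, the rank values, and the use of matplotlib are illustrative assumptions; ranks are expressed as percentiles so that a line closer to the center means a worse relative position, matching the convention described above.

```python
import numpy as np
import matplotlib.pyplot as plt

factors = ["ICT", "Law/Reg/Openness", "Literacy & Education", "Innovation"]
rank_1995 = [40, 55, 72, 30]       # illustrative percentile ranks only
rank_2002 = [35, 50, 45, 33]

angles = np.linspace(0, 2 * np.pi, len(factors), endpoint=False).tolist()
angles += angles[:1]                # close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for ranks, color, label in [(rank_1995, "green", "1995"), (rank_2002, "red", "2002")]:
    vals = list(ranks) + [ranks[0]]
    ax.plot(angles, vals, color=color, label=label)
    ax.fill(angles, vals, color=color, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(factors)
ax.set_ylim(0, 100)                 # closer to the center = lower percentile rank
ax.legend(loc="upper right")
plt.show()
```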
Angola ranks towards the bottom of the list of 120 countries in almost all dimensions, whether measured by principal components or maximum likelihood. It ranks abysmally in literacy, law, education, and innovation potential. The "unweighted" data may stack the odds against small countries like Angola, since the variables patent applications, number of researchers, and technical journal output are unscaled by population. The "weighted" data do indicate hope for Angola. Its rank in terms of its (scaled) patent applications is closer to the median. Its ranking on internet users and (scaled) number of researchers has also increased over the ten-year period, indicating that the country is taking steps to keep up with the technological changes in the world.

One reason for separately analyzing the "weighted" and "unweighted" data sets is the belief that there is a scale effect in the sheer numbers. That is, there may be threshold effects in innovation potential based on the stock of intellectual and R&D capital, as measured by technical journal output, the number of patent applications, and the number of researchers. This is the sense in which the "unweighted" data differ from the "weighted" data. In addition to the obvious examples of the US, Japan and Western European countries, India and China have also demonstrated such threshold effects. On the other hand, scaling these variables by population indicates the extent to which the full technical potential of the population is being tapped. High levels of these scaled measures are also indicators of innovation potential, as countries like Finland and Iceland have demonstrated in the last decade. So while there is no compelling reason that sheer numbers should be more or less important than the proportion of the population involved in technical pursuits, it is clear that both contribute to the potential to innovate.

Argentina has, as one might expect after a major currency and banking crisis, degraded along many dimensions. In the "unweighted" data, it has fallen to the bottom quartile on the law dimension, as well as in openness. Rising inequality due to the recession is probably responsible for the degradation on the law dimension. The devaluation was probably not enough to make its exports competitive and therefore, while the rest of the world has cut back on trade barriers, Argentina has maintained or increased them. The four-dimensional ML factors show a stark picture on the law and openness dimensions. Surprisingly, Argentina has not lost its ranking in the other three dimensions. Its literacy ranking has actually increased, on innovation potential it has kept pace, and on the ICT dimension it has maintained its position. The "weighted" data reiterate the same messages as the "unweighted" data.

Brazil has made gains and presents a contrasting picture to Argentina on at least the law dimension. While its high income inequality is probably responsible for placing Brazil in the bottom half of the sample on the law and regulation dimension, the country has improved on this dimension during the last decade. In the four-dimensional ML graph, the red line contains the green line, indicating that over this ten-year period Brazil has improved its ranking on each dimension. The principal components show that its rank on the openness dimension has fallen, which probably has to do with Mercosur (Argentina and Brazil shared similar rankings on openness in 1995), or is a result of keeping trade barriers at fixed levels while the rest of the world has liberalized. The ML graph indicates impressive gains in literacy and education in Brazil. It is probably a good bet that this trend will also lead to an increase in Brazil's ranking on the law dimension in future years (recall that the factors are correlated). The "weighted" data paint a similar picture.
China, being a populous country, will obviously show different rankings on the "weighted" versus the "unweighted" dimensions. We should be cautious about interpreting the meaning of the innovation-potential factors in the "unweighted" versus the "weighted" data. In the unscaled data China ranks high on the innovation-potential list because of the sheer strength of its size. The "weighted" data present quite a contrast along the dimension measured by researchers and technical journal articles. In other words, while China has a critical mass in innovation potential (which may be the reason it attracts foreign direct investment), it still has a long way to go in achieving its full potential on innovation as measured by the scaled data. If it produced patents, researchers and technical journal articles at the same per capita rate as the more advanced countries, China would probably be an OECD country. Such trends are already in evidence. Along each of these dimensions in the "weighted" data, China is already at the median of the sample and has made strides to move ahead, especially in patent applications. On other dimensions, literacy has not improved greatly; however, China's ranking on the ICT factor has leaped from the bottom quartile to close to the median of the sample.

6. Conclusion

We factor-analyze the Knowledge Assessment Methodology (KAM) data. The KAM database was developed at the World Bank to assess countries' readiness for the knowledge economy, and it potentially draws the attention of policymakers to specific areas deserving of attention and future investment. We factor-analyze the KAM data in order to reduce its many variables to their essential dimensions or factors. Our main objective in undertaking the factor analysis is to quantify the factors for each country, that is, to compute factor scores on each factor. To this end, the paper treats three issues in the factor analysis of the KAM data in detail: whether the KAM data should be factor-analyzed, the optimal dimensionality of the data, and giving political-economic meaning to the factors.

We find that the KAM data are not just amenable to factor analysis but benefit greatly from it. There are enough intercorrelations among the variables that the real information in the data can be distilled down to a smaller number of dimensions. We use two factor-analytic methods: Principal Components (PC) analysis and "true" factor analysis, which we estimate using maximum likelihood (ML). While PC analysis focuses on explaining the variance in the data, the ML method seeks to explain the intercorrelations in the data. We should therefore expect the two methods to produce different results. While the results do differ (PC analysis requires many more dimensions to rationalize the data than ML analysis), there are common themes. A contribution of the paper is identifying the political-economic dimensions in the KAM data and measuring them for (ordinal) comparisons over time. We embrace the idea of a simple structure of the dimensions and allow these dimensions to be correlated with each other. The output from the factor analysis is used to graphically analyze how countries have changed their rankings on the underlying dimensions over the 1995-2002 period.

References

Anderson, T. W., 1984. An Introduction to Multivariate Statistical Analysis. New York, NY: Wiley.

Bohara, A. K., A. I. Camargo, T. Grijalva, and K. Gawande, 2005. "Fundamental Dimensions Underlying the Regulation of U.S. Trade." Journal of International Economics 65(1): 93-125.

Bollen, K. A., 1989. Structural Equations with Latent Variables. New York, NY: Wiley.

Chen, D. H. C., and C. J. Dahlman, 2005. "The Knowledge Economy, the KAM Methodology, and World Bank Operations." Manuscript.

Chen, D. H. C., and C. J. Dahlman, 2004. "Knowledge and Development: A Cross-Section Approach." World Bank Policy Research Working Paper #3366.

Goldberger, A., 1972. "Maximum Likelihood Estimation of Regressions Containing Unobservable Independent Variables." International Economic Review 13: 1-15.
Joreskog, K. G., and D. Sorbom, 1996. LISREL 8: User's Reference Guide. Chicago, IL: Scientific Software International, Inc.

Joreskog, K. G., and D. Sorbom, 1979. Advances in Factor Analysis and Structural Equation Models. Cambridge, MA: Abt Books.

Joreskog, K. G., 1967. "A General Approach to Confirmatory Maximum Likelihood Factor Analysis." Psychometrika 34: 183-202.

Kaufmann, D., A. Kraay, and M. Mastruzzi, 1999. "Aggregating Governance Indicators." World Bank Policy Research Working Paper #2195.

Kaufmann, D., A. Kraay, and M. Mastruzzi, 2004. "Governance Matters III: Governance Indicators for 1996, 1998, 2000, and 2002." World Bank Economic Review 18: 253-287.

Lawley, D. N., and A. E. Maxwell, 1971. Factor Analysis as a Statistical Method. New York, NY: American Elsevier.

Reyment, R., and K. G. Joreskog, 1993. Applied Factor Analysis in the Natural Sciences. Cambridge, UK: Cambridge University Press.

Rubin, D. B., and D. T. Thayer, 1982. "EM Algorithms for ML Factor Analysis." Psychometrika 47(1), March 1982.

Theil, H., 1971. Principles of Econometrics. New York, NY: John Wiley.