WPS4216
Underlying Dimensions of Knowledge Assessment:
Factor Analysis of the Knowledge Assessment Methodology Data
Derek H. C. Chen*
The World Bank
Kishore Gawande**
Texas A&M University
The Knowledge Assessment Methodology (KAM) database measures variables that may
be used to provide an assessment of countries' readiness for the knowledge economy, and
has many policy uses. Formal analysis employing KAM data is faced with the problem of
which variables to choose and why. Rather than make these decisions in an ad hoc
manner, we recommend factor-analytic methods to distill the information contained in the
many KAM variables into a smaller set of "factors". The main objective of the paper is
to quantify the factors for each country, and do so in a way that allows comparisons of
the factor scores over time. We investigate both principal components as well as true
factor analytic methods, and emphasize simple structures which help to not only provide
a clear political-economic meaning of the factors, but also allow comparisons over time.
World Bank Policy Research Working Paper 4216, April 2007
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the
exchange of ideas about development issues. An objective of the series is to get the findings out quickly,
even if the presentations are less than fully polished. The papers carry the names of the authors and should
be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely
those of the authors. They do not necessarily represent the view of the World Bank, its Executive Directors,
or the countries they represent. Policy Research Working Papers are available online at
http://econ.worldbank.org.
*Economist, Knowledge for Development Program, Human Development Department, World Bank
Institute. ** Professor, Bush School of Government and Public Service.
1. Introduction
To facilitate countries' transition to the knowledge economy, the Knowledge Assessment
Methodology (KAM) was developed at the World Bank (Chen and Dahlman, 2004, 2005). It is
designed to provide an assessment of countries' readiness for the knowledge economy, and
identifies sectors or areas in which policymakers should focus their attention and make
future investments. The KAM is currently widely used both within and outside the World
Bank, and frequently facilitates engagements and policy discussions with government
officials from client countries. This
rich database is also potentially useful for research by political economists and political
scientists.
The KAM database includes variables such as tariff and non-tariff barriers, regulatory qual-
ity, rule of law, adult literacy rate, secondary enrollment, tertiary enrollment, researchers in
R&D, patent applications granted by the USPTO, scientific and technical journal articles,
telephones, computers, and internet users.1 They are constructed for over 120 countries, and
are available at different points in time.
Any formal analysis employing KAM data must confront the problem of which variables to
choose and why. Rather than make these decisions in an ad hoc manner, we recommend
"reducing" the set of KAM variables to a smaller set of variables without losing information
contained in the full set of variables. Factor-analytic methods are concerned with precisely
this problem: reducing the data in a way that parsimoniously represents essentially the
same information contained in the many variables. The parsimonious set of variables is the
set of "factors" to which the data in the large number of variables is reduced.
1Source: The Knowledge Assessment Methodology (KAM) website (www.worldbank.org/kam).
Our main objective in undertaking the factor analysis is to quantify the factors for each
country, that is, compute "factor scores" on each factor. Importantly, we wish to accomplish
this in a way that allows comparisons of the factor scores over time. To this end, the
paper treats three issues in the factor analysis of the KAM data in detail. The first is
whether the KAM data should be factor-analyzed and what factor-analytic method may be most
appropriate; the second is determining the optimal dimensionality of the data, that is, the
number of factors to which the data may be adequately reduced; the third, and perhaps
most important, is giving clear meaning to the factors. Each of these issues is treated
exhaustively in the paper.
If subsets of variables are correlated, then depending on the extent of the correlation, factor
analysis is worth doing. A formal test shows that the KAM data are not just amenable to
factor analysis but they greatly benefit from it. There are enough inter-correlations among
the variables that the real information in the data can be distilled down to a smaller number
of dimensions.
What is the optimal dimensionality to which the information contained in the variables
can be reduced? Depending on the factor analytic method that is chosen, the answer is
different. For example, in principal components analysis this is determined by the number
of principal components required to explain, say, 95% of the total variance in the data. In
"true" factor analysis (which we estimate using maximum likelihood) a formal chi-squared
test or information criteria that measure fit in terms of explained intercorrelations, not just
variance, are used to determine optimal dimensionality.
The most important contribution of the paper is that it gives political-economic meaning
to the dimensions, whether they go by the name of "factors" or "principal components".
Ultimately, we hope to make factor analysis a useful policy tool to indicate warning signals
about the health of countries. The tool we will use to give political-economic meaning to the
factors is the set of "factor loadings". Intuitively, these are the coefficients of the regression of
each variable on the factors. Thus, if one variable has a very high coefficient on one factor but
not on any of the others, we say that variable loads heavily on that factor. If the data, or the
information in the variables can be reduced to a smaller set of factors then what we should
find is that some variables load heavily on the same factor and other variables load heavily
on other factors. That is, the structure of factor loadings should be "simple". One definition
of a simple structure of loadings is as follows: a structure in which any single variable loads
heavily on only one factor and minimally on the others, and in which more than one variable
loads on each factor. We will spend considerable effort in producing simple structures, because simple
structures make the political-economic content of the factors unambiguous and clear. We
will also test for the adequacy of our simple structures.
Obviously, the preceding discussion about factor loadings as regression coefficients is meant
only to set ideas because, unlike in regression analysis, the factors themselves are unknown.
In other words, the factor scores, or the value that the factors take, are not known and
no regression in the usual sense can be estimated. Section 2 provides the theory behind
how factor scores and factor loadings are computed simultaneously within a factor analytic
framework.
The paper proceeds as follows. In section 2, we outline the generic factor model. In this
section, two fundamentally different methods of factor analysis, principal components analysis
and true factor analysis, are explained in detail. A special case of true factor analysis,
the error components method, is also discussed here. Section 3 discusses the data and
sources. The analysis is carried out on 12 variables measured across 120 countries. The data
are from two time periods, 1995 and a more recent vintage around 2003. The same section
discusses how we impute missing data in order to cover the sample of 120 countries, not all
of which have complete data on all 12 variables. We also point to a data pitfall that should
be avoided before doing the factor analysis. Section 4 contains the empirical results and
the main contribution of the paper. We analyze principal components separately from the
true factor analysis results. There are three main components to this section. The first is
the use of factor loadings in order to name the factors. We show that with the KAM data
we are able to achieve a fairly simple structure. The second is a set of formal tests for the
dimensionality. The third is another methodological pitfall, whose resolution confronts us
with a tradeoff. We indicate how and why we choose to resolve this in the manner we do.
The choice is obvious from our overriding objective of computing factor scores as precisely
as possible. Section 5 discusses the output from this factor analysis. We use graphs to show
how countries have changed their rankings on the underlying dimensions over this ten-year
period. Section 6 concludes.
2. Factor Analysis Models
The notation and material in this section borrows from Reyment and Joreskog (1993, Sections
2, 4). The general factor analysis model is
X(N×p) = F(N×k) A(k×p) + E(N×p),    (1)
where X is the data matrix of p variables, F is the matrix of k < p factors, and N is the
sample size. The k × p "factor loadings" matrix A is used to linearly sum the factors to
predict each column of X. What cannot be predicted is collected in the error matrix E.
In the context of the KAM data, each column of X is a measure (i.e. variable) containing
"scores" for a set of N countries. There are p such measures on which country scores have
been compiled.2 The individual components of F are the "scores" for common factors since
they are common to several different measures. The KAM measures are thus predicted as
linear combinations of the factors. The coefficients of the factors, called the factor loadings,
are the elements of A. For example, consider the ith measure (variable) x_i. It can be written
as a regression model

x_i = a_i1 f_1 + a_i2 f_2 + … + a_ik f_k + e_i,    (2)

where f_1, …, f_k are the "exogenous" factors, and the coefficients a_i1, …, a_ik are the "loadings"
contained in the ith column of A. While e_i is given the interpretation of a regression
residual, in fact it is made up of the measurement error in the measure x_i plus a "specific"
factor that x_i does not share in common with other measures. Thus, each of the p variables
x_i, i = 1, …, p, can be written as a regression model with the factors acting as the common
"exogenous" variables weighted by the coefficients a_i1, …, a_ik, and where e_i is the regression
residual.
Writing the model in this form makes it clear that factor analysis is a method of data reduction. The
method seeks to parsimoniously represent in a small set of variables (f1, . . . , fk) essentially
the same information contained in a much larger set of variables (x1, . . . , xp). We will reduce
the KAM data variables to their essential factors using two different factor-analytic methods.
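To fix ideas, the data-reduction claim can be checked in a small simulation (our own illustration; the sizes, loadings, and noise level below are invented, not KAM values). We generate data from model (1) with a simple structure, two factors each driving three of six variables, and observe that variables sharing a factor are strongly intercorrelated while the others are nearly independent:

```python
import numpy as np

rng = np.random.default_rng(42)
N, p, k = 500, 6, 2                  # N "countries", p measures, k factors (toy sizes)

F = rng.normal(size=(N, k))          # unobserved factor scores
A = np.array([[0.9, 0.8, 0.7, 0.0, 0.0, 0.0],   # factor 1 loads on the first 3 variables
              [0.0, 0.0, 0.0, 0.9, 0.8, 0.7]])  # factor 2 loads on the last 3 variables
E = 0.3 * rng.normal(size=(N, p))    # measurement error plus "specific" factors

X = F @ A + E                        # model (1): X = F A + E

R = np.corrcoef(X, rowvar=False)     # variables sharing a factor correlate strongly
```

The correlation structure of X is what factor analysis works backward from: it sees only X and R, and recovers (up to the indeterminacy discussed next) the two underlying factors.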
The difference between model (1) and ordinary regression models is that the factors and
coefficients are both unknown. That is, neither F nor A is known, and both must be estimated.
There is a fundamental indeterminacy in the model. If we (linearly) transform F and A,
respectively, as F* = FC⁻¹ and A* = CA, then (1) is equivalently written as:
2p need not be fixed. Factor analysis of the KAM variables may be performed separately on subsets of
the KAM variables. For example, each of the four pillars of the KAM data (i) economic and institutional
regime data, (ii) education and skills data, (iii) infrastructure data, and (iv) innovation potential data may
be distilled down to one or two factors.
X(N×p) = F*(N×k) A*(k×p) + E(N×p).    (3)

Then, by observing X we cannot distinguish between these two models. This should be
familiar from econometric textbook discussions on identification (e.g. Greene, 2004).
Devising "simple" structures, in which as many factor loadings as possible are zeros, facilitates
identification and interpretation of the factors. We will explore simple structures in detail.
We now formally discuss the two popular methods of factor analysis that we will use: the
Principal Components (PC) method and the pure factor analysis model which we estimate
by maximum likelihood (ML).
Fixed versus Random Factors
A distinction is made between models that presume the factor matrix F in (1) to be fixed,
and models that presume F to be random. The random factors model is appropriate when
we want to extend our inferences to different samples (say, of individuals), while the non-
random factors model is appropriate when the specific observations (here countries), and not
just the model structure, are of interest. The KAM data pertain to specific countries, and
are exhaustive across countries, which makes a compelling case for the use of fixed-factor
models. However, if inferences from the factor analysis were to be applied to countries not
in the sample, or to the same countries but in a future period, then it is advisable to use
random-factor models. The likelihood function for (identified) models with random F is well
defined (see e.g. Anderson, 1984 p 552). Estimation of models with non-random F proceeds
based on least squares criteria (for which, unlike the random factors case, no distributional
assumptions need be made unless statistical testing is to be done).
2.1 Principal Components Analysis of the Fixed Factors Model3
In Stata, estimation of the Principal components model proceeds as in a fixed factor model.
Let Y be the mean-removed data matrix, scaled by 1/√N so that the matrix S = Y′Y
is the data covariance matrix. Consider the (non-random factor) model for Y:

Y(N×p) = F(N×k) A(k×p) + E(N×p),    (4)
where A is the factor loadings matrix and F is the matrix containing the factor scores. Using
least squares to fit the (fixed data) model implies estimating F and A (for a given k, see
Section 4 on determining k) in order to minimize the sum of the squares of the residual
matrix:
E = Y − FA.    (5)

The singular value decomposition (SVD) theorem indicates that a solution based on the largest
k singular values λ₁, λ₂, …, λ_k is given as:

F̂Â = λ₁v₁u₁′ + λ₂v₂u₂′ + … + λ_k v_k u_k′,    (6)

where u_j is a (p × 1) vector, v_j is an (N × 1) vector, and λ_j is the jth singular value of Y
(so that λ_j² is the jth eigenvalue of the data covariance S). Define the matrices
V_k = [v₁, v₂, …, v_k], U_k = [u₁, u₂, …, u_k], and Λ_k = diag[λ₁, λ₂, …, λ_k].
Their dimensionalities are V_k: (N × k), U_k: (p × k), Λ_k: (k × k).
Then the solution is:

F̂Â = V_k Λ_k U_k′.    (7)
Note that there is not a unique solution for F̂ and Â individually. Our solution will be in
the direction of "simple" structures for Â.
3The principal components (PC) method is applicable to both, fixed and random factors models (Reyment
and Joreskog, 1993). We focus on PC as applied to fixed factors since Stata estimates PC for the fixed factors
model. We indicate how to estimate the PC model for random factors in fn. 4
Consider the following solution:

F̂ = V_k,   Â = Λ_k U_k′.    (8)

Then the factor scores for the k factors, F̂, are also in standardized form, with covariance
equal to the identity matrix. That is, they are pairwise uncorrelated.

If E is small, so that Y is approximated by F̂Â, then the data covariance is approximately:

S = Y′Y ≈ Â′F̂′F̂Â = Â′Â.    (9)
The "pca" routine in Stata calculates principal components in the following steps:

1. Compute the covariance matrix S.4

2. Compute the k eigenvectors corresponding to the largest eigenvalues of S. Arrange the
eigenvectors in the p × k matrix U_k.

3. Estimate the factor loadings as Â = U_k′.

4. Estimate the factor scores as F̂ = ZÂ′, where Z is the data matrix with the p variables
standardized to have zero mean and unit variance.

Thus, the factor loading matrix is the set of eigenvectors corresponding to the largest k
eigenvalues. This is also the factor scoring matrix. Note that Stata computes factor scores
using the standardized variables Z, not Y. In this solution the factors have different variances
and they are not comparable (their "units" are different). They must be scaled by Λ_k^(−1/2) to
be comparable (and have unit variance).
4In our analysis we use the Stata default to analyze the data correlation matrix, which produces
quantitatively somewhat different loadings and scores from the analysis of the covariance matrix
(known as the "scaling" problem of PC analysis), but qualitatively the results are close.
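The four steps can be sketched in a few lines of linear algebra (our own illustration, not Stata's implementation; `pca_scores` and its return values are names we made up). The final line applies the Λ_k^(−1/2) rescaling discussed above so that the scores have unit variance:

```python
import numpy as np

def pca_scores(X, k):
    """Sketch of the four PC steps above. Returns loadings (k x p),
    raw scores (N x k), and unit-variance scores (N x k)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize: zero mean, unit variance
    S = np.corrcoef(X, rowvar=False)           # step 1 (Stata default: correlation matrix)
    evals, evecs = np.linalg.eigh(S)           # eigh returns eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]          # step 2: largest k eigenvalues
    Uk, lam = evecs[:, top], evals[top]
    A_hat = Uk.T                               # step 3: loadings A = U_k'
    F_hat = Z @ Uk                             # step 4: scores F = Z A' (variance = eigenvalue)
    return A_hat, F_hat, F_hat / np.sqrt(lam)  # rescale by Lambda^(-1/2) for comparability
```

Because the eigenvectors of a symmetric matrix are orthogonal, the rescaled scores come out pairwise uncorrelated with unit variance, which is what makes them comparable across factors.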
2.2 True Factor Analysis of Intercorrelations (using Maximum Likelihood)
True factor analysis is based on the random factors model. While the model in the random
factors case is the same as (1), the population covariance matrix is:

Σ = A′A + Ψ    (10)

if the factors are uncorrelated, and

Σ = A′ΦA + Ψ    (11)

if the factors are correlated. In (10) and (11), Ψ is the true error covariance matrix.5

In order to estimate the parameters of the model, we proceed by analyzing data that are
mean-removed, so that the data covariance is S = X′X. We make the following assumptions
about the true covariances:

(1/N) X′X → Σ,   (1/N) F′F → Φ,   (1/N) F′E → 0,   (1/N) E′E → Ψ,    (12)
that is, finite second moments and orthogonality of the error and factor score matrices. We
will assume that the error covariance Ψ is diagonal, that is, measurement (and other) errors
are uncorrelated across different variables. This diagonal error covariance is constant across
observations (a "homoskedastic" covariance). The factors may be correlated, that is, Φ is
5PC analysis of the random factors model is also possible, but requires the assumption that Ψ is small
(that is, that E in (1) is small). The Unweighted Least-Squares (ULS) criterion fits the factor model so that
the sum of squares of the elements of S − A′A (presuming factors are uncorrelated) is minimized. The PC
solution to this problem may be computed using the following steps:

1. Compute the covariance matrix S.

2. Compute the k largest eigenvalues and arrange them in a diagonal matrix Λ_k.

3. Compute the corresponding k eigenvectors U_k of S. Compute Â = Λ_k^(1/2) U_k′. Each eigenvector is now
scaled so that its squared length equals the corresponding eigenvalue.

4. Compute the factor scores as F̂ = YÂ′Λ_k^(−1).

While this solution is different from the fixed factor solution, it is applicable to the fixed factor case with
the ULS criterion applied to the error matrix E in (4) and (5).
permitted to be non-diagonal (if the factors are uncorrelated, more on this below, then
Φ = I). Therefore, the population variance is a function of the model parameters A, Φ
and Ψ:

Σ = A′ΦA + Ψ.    (13)
In PC analysis of the random factors model (see fn 5), factors are determined so that they
account for maximum variance of all the observed variables. Thus, the emphasis in PC
analysis is on eigenvalues, because the sum of all eigenvalues is the total variance in all the
variables. In true factor analysis, the factors are determined so that they best account for
the intercorrelations of the variables. In true factor analysis the errors are presumed to be
uncorrelated with each other, so that Ψ is diagonal (in PC analysis Ψ is simply assumed to be
small, in the sense that Σ ≈ A′A). The rank of A′A, and therefore of Σ, is approximately k.

In true factor analysis, Ψ in (13) has diagonal elements only, so that the off-diagonal elements
of Σ are exactly equal to the off-diagonal elements of A′ΦA, and the parameters are estimated
to make the off-diagonal elements of the data correlation matrix as close as possible to the
off-diagonals of A′ΦA. The diagonal elements of Σ are equal to the sum of the diagonal
elements of A′ΦA (the "communalities" of the variables) and those of Ψ (the "uniquenesses"
of the variables). The off-diagonal elements assume greater importance in true factor analysis
than in PC analysis (where they are assumed away).
The ML estimation of A and Ψ is based on the assumption that the error vector for
observation i, E_i, is multivariate normal with mean 0 and variance Ψ. The fit function
for the multivariate data is

ln|Σ| + tr(SΣ⁻¹) − ln|S| − p,    (14)

which is minimized over the parameters A and Ψ (minimizing this discrepancy is equivalent
to maximizing the likelihood). The resulting ML estimates have well-defined limiting
distributions, which are used for testing.6
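The criterion in (14) is straightforward to evaluate for candidate parameter values. The sketch below (our own illustration, not an estimator) computes it under the uncorrelated-factors covariance (10), Σ = A′A + Ψ; it equals zero exactly when Σ reproduces S, and is positive otherwise:

```python
import numpy as np

def ml_fit(S, A, Psi):
    """Evaluate the discrepancy (14): ln|Sigma| + tr(S Sigma^-1) - ln|S| - p,
    with Sigma = A'A + Psi as in (10) (uncorrelated factors)."""
    p = S.shape[0]
    Sigma = A.T @ A + Psi
    _, logdet_Sigma = np.linalg.slogdet(Sigma)   # stable log-determinants
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_S - p
```

An ML routine searches over (A, Ψ) to drive this quantity as low as possible; the minimized value also feeds the chi-squared test of dimensionality used later in the paper.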
Computing Factor Scores and Standard Errors
In order to estimate factor scores from the ML method, consider a single observation on the
factor model:

x′(p×1) = A′(p×k) f′(k×1) + e′(p×1),    (15)

where the lower-case letters denote the vector elements of their matrix counterparts in (1).
We proceed as described in Anderson (1971, p. 575). The transposed vectors are column
vectors. The data vector x′ and the factor score vector f′ have a joint normal distribution
with mean (0, 0) and covariance:

        ( x′ )   ( Ψ + A′ΦA   A′Φ )
    cov (    ) = (                )
        ( f′ )   ( ΦA          Φ  )

The factor scores are computed by the regression of f′ on x′. In terms of the population
parameters, this is:

E(f′|x′) = ΦA(Ψ + A′ΦA)⁻¹x′.    (16)

Using the conditional variance formula, the covariance of the regression is

cov(f′|x′) = Φ − ΦA(Ψ + A′ΦA)⁻¹A′Φ.    (17)
6In Stata, the "factor" command is used together with the "ml" option in order to estimate the parameters
of the factor model.
factors. The square roots of the diagonal elements of the (estimated) covariance are the
standard errors of the estimated k-vector of factor scores on that observation. These standard
errors are constant across observations. Dividing the factor score by the corresponding
standard error produces a t-statistic for testing the statistical significance of individual factor
scores. To take a simple example, suppose the data are aggregated into a single factor,
k = 1. Then the matrix Φ collapses to unity, and the estimator for the (scalar) factor
score is

E(f′|x′) = A(Ψ + A′A)⁻¹x′,    (18)

and its (scalar) variance is

cov(f′|x′) = 1 − A(Ψ + A′A)⁻¹A′.    (19)

For this single-factor case, denoting the ML parameter estimates with "hats", the factor
score (for the single observation) is computed as the conditional mean

Ê(f′|x′) = Â(Ψ̂ + Â′Â)⁻¹x′,    (20)

and its standard error is

se(f′|x′) = [1 − Â(Ψ̂ + Â′Â)⁻¹Â′]^0.5.    (21)
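For the single-factor case, (20) and (21) translate directly into code. The sketch below is our own illustration; the loading and uniqueness values in the usage are invented, standing in for ML estimates:

```python
import numpy as np

def single_factor_score(x, A, Psi):
    """Score a single observation on one factor (k = 1).
    x: (p,) data vector; A: (1, p) loading row; Psi: (p, p) diagonal.
    Returns the conditional mean (20) and its standard error (21)."""
    M = np.linalg.inv(Psi + A.T @ A)          # (Psi + A'A)^{-1}
    score = (A @ M @ x).item()                # E(f|x) = A (Psi + A'A)^{-1} x'
    se = np.sqrt(1.0 - (A @ M @ A.T).item())  # [1 - A (Psi + A'A)^{-1} A']^{0.5}
    return score, se
```

As the text notes, the standard error depends only on the estimated parameters, so it is the same for every observation; shrinking Ψ (less measurement error) shrinks it.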
2.3 Error Components Method
The error-components approach (EC) used by Kaufmann et al. (2005) to measure gover-
nance across several countries is a random-factors approach based on econometric methods
developed for latent data models (see e.g. Goldberger (1972) and MIMIC models of Joreskog
(1967) and Joreskog and Sorbom (1979)). The Kaufmann et al. approach is to fix the number
of variables that map into a factor and then estimate scores for the factor as conditional
means, conditional on parameters estimated by maximum likelihood. Thus, one major
difference from the PC and ML methods of factor analysis described above is that the number
of variables that map into a factor is prespecified.7 Thus, the number of variables p (and
which ones they are) is treated as prior information. The computation of EC factor scores
proceeds in two steps. First, the model parameters are estimated using maximum likelihood.
Next, they are used to compute the scores as conditional means. The method also
produces conditional variances, which may be used to construct confidence intervals for the
factor scores or for testing.
They (implicitly) consider the following factor model:
X(N×p) = F(N×1) β′(1×p) + E(N×p).    (22)

This corresponds to (1) except that the factor loading vector β takes the place of the factor
loading matrix A in (1). Whereas in (1) p variables mapped into k factors, here p variables
map into a single factor. Note that while we have chosen to use the same notation to indicate
matrix dimensions, the number of variables p may be chosen to be a specific set of variables,
and not the entire data matrix at hand (as we did in the case of the ML and PC methods in
which the number of factors k is determined by the data). Since in the EC method k = 1,
the p variables may be chosen to be a "homogeneous" subset of the variables designed for
mapping into that factor.
The EC likelihood function is as follows. Let α and β be (p × 1) parameter vectors. As before,
7For example, Chen and Dahlman (2005) partition the KAM variables into four "pillars": Economic
Incentive and Institutional Regime, Education and Human Resources, Innovation System, and Information
Infrastructure. In the Chen-Dahlman scheme, since tariff and non-tariff barriers, regulatory quality, rule of
law represent the Economic Incentive and Institutional Regime pillar, p = 3 for this factor.
Ψ is defined to be the diagonal (p × p) error covariance matrix with elements ψ₁, …, ψ_p. Let
the (p × p) matrix Ω = ββ′ + Ψ. Then, the likelihood function for the data is:

L = −0.5 × [ N ln|Ω| + Σ_{j=1}^{N} (x_j − α′) Ω⁻¹ (x_j − α′)′ ].    (23)

In (23) the parameter α is simply the vector of the means of the p variables in X. For
observation j, the 1 × p data vector is denoted x_j.

Denoting the ML parameter estimates with "hats", the factor score for observation j is
computed as the conditional mean (conditional on x_j)

F̂_j = β̂′Ω̂⁻¹(x_j − α̂′)′,    (24)

and the standard error of this estimate is computed as

se_j = [1 − β̂′Ω̂⁻¹β̂]^0.5.    (25)

This is exactly the same as (21), where A′ = β (so that Ψ + A′A = Ψ + ββ′ = Ω).
Where the EC method differs from the (random) factor method estimated by ML is in the
specification of the likelihood functions. Whereas in the EC method the data likelihood is
maximized over the parameters (α, β, Ψ), in the ML factor method the likelihood of the
intercorrelations in the data is maximized over the parameters. In this sense, the EC method
is still a variance method (driven by a squared-error loss objective), while the ML factor
method pays attention to the intercorrelations among the variables.
3. Data
The Knowledge Assessment Methodology (KAM) database consists of more than 80 structural
and qualitative variables that measure how countries perform as "knowledge economies".
We will use the subset of 12 variables that are used by the KAM
method to compute each country's "basic scorecard". They are: tariff and non-tariff barriers,
regulatory quality, rule of law, adult literacy rate (% age 15 and above), secondary enroll-
ment, tertiary enrollment, researchers in R&D, patent applications granted by the USPTO,
scientific and technical journal articles, telephones (mainlines + mobile phones), comput-
ers, and internet users. The KAM website (see fn 1) indicates the variety of sources from
which the data are drawn. In addition to this unscaled data, we will also perform factor
analysis on these variables, but now a subset of the variables will be scaled so that country
size does not influence the analysis. The scaled set of variables are: tariff and non-tariff
barriers, regulatory quality, rule of law, adult literacy rate (% age 15 and above), secondary
enrollment, tertiary enrollment, researchers in R&D (per million population), patent
applications granted by the USPTO (per million population), scientific and technical journal
articles (per million population), telephones per 1000 persons (telephone mainlines + mobile
phones), computers per 1000 persons, and internet users per 10,000 persons.
Data on the 12 variables are available at two points in time, one measured in 1995 and
another during a more recent period, between 2002 and 2004. We will use the term "2002" to
indicate the recent data. Table 1 describes the variables and reports descriptive statistics
for the 12 variable-pairs.
3.1 Missing Data Imputation
The factor analysis restricts the sample to one which has complete data on all included
variables. Hence, a crucial pre-estimation step is to impute missing data in order to have as
broad a coverage of countries as possible. The imputations are carried out using a simple
regression of the variable with missing data, using as the independent variable a conceptually
closely related variable. For example, (unscaled) research95 has data for only 86 countries.
However, the closely related research03 has data for 95 countries. Therefore, nine
observations can be additionally imputed by regressing research95 on research03. The first column
of Table 2 shows the results of this regression. The R-squared of 0.91 indicates a good fit for
the imputation. Having filled these nine data points, we now have data for 95 countries for
research95. That is still not enough. The next closely related variable is technical journal
output in 1995 (techjour95). The second column of Table 2 indicates that this regression
has a reasonably good fit, with an R-squared of 0.70. The variable techjour95 is statistically
significant at 1%. Therefore, the two-step regression process makes available data on
research95 for 120 countries.
A similar two-step regression process is used to impute missing research03 data via the
regressions shown in columns 3 and 4 of Table 2. The last three columns in Table 2 impute
data for computer95, computer04 and tariffs and NTBs for '95 (tntb95) using, respectively,
GDP per capita for the two computer variables and tntb05 as regressors.8 After completing
the imputations we have available data for 120 out of the 128 countries. Unavailability of
data on the regressors prevents imputing missing data for the remaining 8 countries. The
factor analysis is based on the sample of these 120 countries. The authors' working paper
provides details on the countries for which variables are imputed and the imputed values.
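The imputation step can be sketched as a small helper (our own; `impute_by_regression` is an invented name, and the actual regressors are those reported in Table 2). It fits OLS on the rows where both the target and the proxy are observed, then predicts only the missing entries:

```python
import numpy as np

def impute_by_regression(y, x):
    """Fill missing entries of y (coded np.nan) with OLS predictions from x,
    estimated on the rows where both variables are observed."""
    y = y.copy()
    both = ~np.isnan(y) & ~np.isnan(x)        # complete cases for estimation
    fillable = np.isnan(y) & ~np.isnan(x)     # missing target, observed regressor
    X = np.column_stack([np.ones(both.sum()), x[both]])
    beta, *_ = np.linalg.lstsq(X, y[both], rcond=None)
    y[fillable] = beta[0] + beta[1] * x[fillable]
    return y
```

Chaining two calls, first with the closest proxy and then with the next (e.g. research03, then techjour95), mirrors the two-step process described above.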
3.2 Is It Worth Doing Factor Analysis on the KAM Data?
The main objective of the factor analysis is to understand whether countries have advanced
their positions over the 10-year period in terms of (i) the absolute measures of the factors,
and (ii) their factor score ranks vis-a-vis other countries. Before proceeding with factor
analysis and the computation of factor scores, it is important to understand whether and
how much we can gain from undertaking a factor analysis. The cross-country data on the 12
8For imputing missing country values we use not only the available data across countries but also data
for the ten regions, including the world. There is additional information in these aggregated regions which
can be brought to bear on the imputations.
variables have considerable correlations among them. However, if the correlations are driven
by common underlying factors, then the factors become the main objects of interest for us.
If two variables share a common factor with other variables, their partial correlation, con-
trolling for all remaining variables, will be small. The Kaiser-Meyer-Olkin (KMO) statistic,
based on this idea, computes the ratio of (i) the sum of squared correlations of each variable
in the analysis with every other variable to (ii) the same sum plus the sum of squared partial
correlations of each variable with every other variable, controlling for all remaining variables.
Large values for this "overall" KMO measure indicate that the partial correlations are small,
that is, common underlying factors are responsible for the correlations among variables. A
large value for the KMO measure indicates considerable gains from undertaking a factor
analysis. Table 3 indicates that the overall KMO measure is 0.875, which provides solid
support for proceeding with factor analysis of the KAM data. Further, the KMO statistic
for each variable individually indicates that their high correlations are driven by underlying
factors.9
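The overall KMO measure described above can be computed directly from the correlation matrix: the partial correlations, controlling for all remaining variables, come from the negated, rescaled inverse of R. A sketch (our own; `kmo_overall` is an invented name, and Stata reports the same statistic via `estat kmo`):

```python
import numpy as np

def kmo_overall(R):
    """Overall Kaiser-Meyer-Olkin statistic from a correlation matrix R."""
    Q = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(Q), np.diag(Q)))
    P = -Q / d                                 # partial correlations given all other variables
    off = ~np.eye(R.shape[0], dtype=bool)      # mask for off-diagonal elements
    r2, p2 = np.sum(R[off] ** 2), np.sum(P[off] ** 2)
    return r2 / (r2 + p2)                      # large when partial correlations are small
```

Values near 1 indicate that common factors drive the observed correlations, as with the 0.875 reported for the KAM data in Table 3.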
3.3 Avoiding a Pitfall
Performing factor analyses separately on the 1995 variables and the 2002 variables is a
pitfall one should avoid if the purpose of the factor analysis is to compare factor scores
across the two periods. Separate analyses produce factor scores (that is, the quantity of
a factor contained in each country) that are not strictly comparable. For example, in PC
analysis separate analyses produce factors with different variances, so that their magnitudes
are not comparable (that is, their "units" are different). In order to solve this problem,
we proceed as follows. First, we combine the 1995 and 2002 variables into one set of 12
variables. To be consistent, the same factor analytic method is used to combine each pair
9The variables in Table 3 are an amalgam of each variable-pair over the two years; see below.
of variables as is used in the factor analysis for the full set of 12 variables. For example,
Computer95 and Computer04 are factor-analytically combined into one computer variable
using either maximum likelihood (ML) or principal components (PC). Second, we proceed
with the factor analysis of this set of 12 amalgamated variables. Third, we use the common
estimate of the "scoring coefficients" matrix (see below) produced by the factor analysis, but
apply that matrix separately to the 1995 and 2002 variables in order to compute separate
sets of factor scores, one for 1995 and one for 2002. These scores are used to analyze changes in
the factors over the two periods. In this paper the scores are used ordinally to rank countries.
However, making simple adjustments to the mean and standard deviation allows cardinal
comparisons as well, for example in regression analyses.
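The three-step procedure can be sketched as follows. The data are synthetic stand-ins, and the scoring matrix B is a placeholder for the one the factor analysis of the amalgamated variables would estimate; only the PC variant of the pairwise amalgamation is shown.

```python
import numpy as np

def zscore(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

def first_pc(pair):
    """Amalgamate a standardized 1995/2002 variable pair into one variable
    via its first principal component (the PC variant; the ML variant is
    analogous)."""
    pair = zscore(pair)
    w = np.linalg.eigh(np.corrcoef(pair, rowvar=False))[1][:, -1]
    return pair @ w

# Hypothetical data: 120 countries, 12 variables observed in both years.
rng = np.random.default_rng(1)
n, p, k = 120, 12, 6
X95 = rng.normal(size=(n, p))
X02 = X95 + 0.5 * rng.normal(size=(n, p))      # 2002 values drift from 1995

# Steps 1-2: combine each year-pair into one amalgamated variable, then
# factor-analyze the amalgams (the factor analysis itself is elided; B is a
# stand-in for the estimated p x k scoring coefficient matrix).
amalgam = np.column_stack(
    [first_pc(np.column_stack([X95[:, j], X02[:, j]])) for j in range(p)])
B = rng.normal(size=(p, k))                    # stand-in scoring coefficients

# Step 3: apply the SAME scoring matrix to each year's standardized data so
# the two sets of scores share common "units" and can be compared.
scores95 = zscore(X95) @ B
scores02 = zscore(X02) @ B
print(scores95.shape, scores02.shape)
```

The key design choice is in step 3: because a single scoring matrix is applied to both years, differences between `scores95` and `scores02` reflect changes in the data rather than changes in the estimated factor model.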
In the following section we report and analyze the results from the Principal Components
(PC) method and the Maximum Likelihood (ML) method. The discussion is from the
ground up and provides details about (i) why the specific number of factors is chosen
in each method, (ii) the reason why we choose "simple" factor loading structures for our
analyses, (iii) the reason we choose the specific method for obtaining the simple structure,
and (iv) an approximation that makes the structure especially simple and is essential to
achieve our objective of comparing factor scores across years (plus a chi-squared test that
tests whether the approximation is statistically accurate).
4. Empirical Results
4.1 Principal Components Analysis
The first step in factor analysis is choosing k, or the number of factors that will fit the data
"adequately". In PC analysis, an oft-used criterion is to set k to be no less than the number
required to explain 95% or more of the total variance in the data.10 Table 4.1 shows that
six factors are required to explain at least 95% of the variance in the 12 KAM variables.
Thus, we choose k = 6. Even though all six factors are required to account for 95% of the
variance in the data, the first factor accounts for the lion's share of the data variance. Table
4.1 indicates that the first principal component accounts for 60.2% of the total variance. We
might expect that this component will also have the maximum number of large loadings
among all principal components. Table 4.2 reports the loadings with k=6. As expected, a
majority of the variables load heavily on this factor. Not only does this make it a catch-all
factor, but because few variables load on the remaining factors, those factors have little
political-economic content. For this reason, factor analysis has sought to design factors with "simple"
structures so that all factors have meaningful content.
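The variance-based choice of k described above can be sketched as follows, on synthetic stand-in data rather than the KAM variables.

```python
import numpy as np

def choose_k(R, threshold=0.95):
    """Smallest number of principal components whose cumulative share of
    total variance (the trace of the correlation matrix) reaches
    `threshold`."""
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]     # descending
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum, threshold)) + 1

# Synthetic example: 12 standardized variables driven by 4 common factors
# plus a little noise, so a handful of components should suffice.
rng = np.random.default_rng(2)
F = rng.normal(size=(120, 4))
L = rng.normal(size=(4, 12))
X = F @ L + 0.3 * rng.normal(size=(120, 12))
R = np.corrcoef(X, rowvar=False)
print(choose_k(R))
```

With uncorrelated data (an identity correlation matrix) every component is needed, while strong common factors push the required k down sharply.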
Simple Structure: Orthogonal and Oblique Rotations
The criteria advanced by Thurstone (1947) have been influential in producing computation-
ally feasible methods that deliver simple structures:
· There should be at least one zero in each row of the factor loadings matrix.
· There should be many (at least k) zeros in each column of the factor matrix.
10 While this criterion is the most popular, other criteria have also been used. They are:
(i) The size of individual factor loadings: the factor loadings squared (for orthogonal factors) indicate the
variance of a variable accounted for by a particular factor. Factors not contributing much may be dropped.
If parsimony is a driving concern, the thumb-rule proposed by Reyment and Joreskog (1993) that there
should be at least three significant loadings in each factor may be used.
(ii) The variance explained by a factor: The sum of squared loadings for a given factor represents the
information content in the factor. The ratio of the sum of squares to the trace of the correlation matrix is
the proportion of total information residing in the factor. A cutoff value can be used to then determine how
many factors to retain.
(iii) Significant residuals: A residual correlation matrix may be calculated after each factor has been
extracted. k is determined at the point when the residual matrix consists of correlations solely due to
random error. The standard error of the residual correlations (estimated roughly as 1/√(N − 1))
can be used to determine whether the correlations are significantly greater than zero.
· For every pair of factors:
  - some variables should load heavily on one and not at all on the other;
  - several variables should have near-zero loadings on both factors;
  - only a few variables should have large loadings on both.
Two classes of methods have evolved that produce simple structures. The first class of
methods - orthogonal rotations - maintains the uncorrelatedness of the factors, while the
second class of methods - oblique rotations - seeks to find simple structures with correlated
factors. Since the latter relax the constraint on orthogonality of factors, they are capable of
producing even simpler structures than orthogonal rotations.
Technically, rotations work as follows. In the Stata code, a k × k rotation matrix T rotates
the factor loadings in Step 4 (see Section 2.1) so that the rotated factor loadings matrix,
denoted Â_R, is given by

    Â_R = Â T = U_k T.    (26)
If T is an orthogonal transformation matrix, the rotation preserves the orthogonality of the
factor score matrix F. Otherwise, the rotation is oblique, that is, the factors are correlated.
Table 4.3 displays the oblique rotation matrix T that produces the simple structure that we
use to proceed with our analysis and computations. In PC analysis, even after rotation the
total variance explained by the factors is still the same (95.30%), but the portion accounted
by each factor is now different. As Table 4.4 shows, rotation distributes the portion of the
variance explained more evenly than the unrotated factors. This is the point of the simple
structure: to identify a factor associated with only a few variables.
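The mechanics of (26) can be illustrated numerically. This sketch uses a stand-in loadings matrix and a simple planar rotation rather than the paper's estimated T; it shows that an orthogonal rotation preserves the total variance explained while redistributing it across factors.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(12, 2))                   # stand-in unrotated loadings

theta = np.pi / 6                              # a 30-degree planar rotation
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A_R = A @ T                                    # rotated loadings, A_R = A T

per_factor = (A ** 2).sum(axis=0)              # variance credited per factor
per_factor_R = (A_R ** 2).sum(axis=0)
print(per_factor.round(3))
print(per_factor_R.round(3))
print(np.isclose(per_factor.sum(), per_factor_R.sum()))  # total preserved
```

An oblique rotation replaces the orthogonal T with a general invertible matrix, so the rotated "factors" may be correlated and the per-factor variance accounting no longer sums so cleanly.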
A graphical analysis makes the connection between rotation and simple structure clear.
Figure A1 (see appendix) plots the unrotated factor loadings for k = 6 factors. A row of
the factor loadings matrix, indicating how the corresponding variable loads on each of the
6 factors, is depicted as a point in the 6-dimensional space of factors. The projection of
these points on the (15) possible 2-dimensional subspaces is displayed, respectively, in each
panel of Figure A1. Consider the top row of five graphs in which the y-axis measures the
loadings of the 12 variables on the first principal component (C1). While the C1 vs. C2
graph shows some evidence of clustering, the structure of the loadings is not very simple as
we move across the row of graphs. Now compare the same row in Figure A1 with the first
row in Figure A2, in which the axes have been rotated orthogonally (that is, rotating the
axes while keeping the origin at the same point and maintaining the angle between the axes
at 90 degrees) in order to achieve a simpler structure. This type of rotation is known as the
"varimax" rotation. There is a clear separation of the loadings into two y-clusters: one set
of variables (patapp, techjour, tel) projects into high y-values, that is, high C1 loadings, and
the other into low y-values.
This type of simple structure is in evidence not only for loadings on C1 but on C4 (tertiary
and secondary enrollment), C5 (adult literacy), and C6 (tariffs & ntbs). While the C2 and
C3 rows indicate the presence of clusters with high loadings (regulation quality and law
load on C2 and computers and net users load on C3), the structure of loadings on these
two components is not as simple as Thurstone's ideal. Regardless, the varimax rotation has
made the structure of loadings much simpler.
Can an oblique rotation that relaxes the constraint on uncorrelatedness of the principal
components (i.e. the 90 degree angle between the axes) achieve an even simpler structure?
Figure A3 depicts the result of an oblique rotation (known as the "Oblimin" rotation).11
There is no visible difference between the orthogonal rotation results in Figure A2 and the
11Although in the figure the axes appear perpendicular to each other, they are not. That is done merely
for convenience. Correlation between two components implies that the angle is less than 90 degrees. However,
what is important to us is the projection of the points on the axes.
oblique rotation loadings in Figure A3; the difference in the loadings is almost negligible.
In other words, in Principal Components Analysis the orthogonal
rotation considerably simplifies the structure of loadings, and the oblique rotation repro-
duces it but does not simplify it further. However, in the True Factor analysis below (using
ML) an oblique rotation produces significant improvement over the orthogonal rotation. We
therefore adopt the oblique rotation results for computing factor scores.
Outside of economics and political science, in psychometrics for example, researchers con-
ducting exploratory factor analysis have generally assumed orthogonal factors.12 In eco-
nomics and political science, however, there is every reason for believing that factors should
be correlated. Multiple regression is prevalent in economics and political science precisely
because non-experimental data are correlated. In order to satisfy the ceteris paribus as-
sumption, considerable care is taken to include appropriate control variables. We should
embrace the idea that political-economic data are correlated when such data are determined
in general equilibrium. It is therefore almost impossible for the data to be orthogonal.
Within a data class there may be strong interdependencies while across data classes these
interdependencies may be weak. In that case, the assumption of partial equilibrium for each
data class may be justified. In Chen and Dahlman (2004) this assumption leads the authors
to think of their data classes as "pillars". Here, we let the data decide how to form into
groups. There are two related messages here. The first is that the main objective of the
factor analysis is to be able to identify the underlying dimensions that the observed data
purport to measure. Second, and related to this objective, there is absolutely no theoretical
reason why the underlying factors should be uncorrelated. The underlying dimensions are
12Traditionally, a clear distinction has been made between confirmatory factor analysis (CFA) and ex-
ploratory factor analysis (EFA). In CFA, if theory suggests two factors are correlated, then an oblique
rotation is justified. In EFA, there is neither a theoretical basis for knowing how many factors there
are nor one for knowing whether they are correlated. In economics and political science, we argue, CFA will generally indicate
correlated factors due to interdependencies of general equilibrium under which the data are generated.
determined by the same general equilibrium mechanism that generates measures of these
factors (i.e. the variables). Theoretically, factors should be correlated.13
Political-Economic Dimensions of the Data: Naming the Factors
What names are appropriate for the principal components? Table 4.5 shows that the first
principal component (RC1) has the variables researchers, technical journals, and patent ap-
plications load heavily on it. Therefore, this component is named the Innovation Potential
factor, since the ability of an economy to innovate is appropriately measured by these impor-
tant inputs. Since RC2 has law and the quality of regulations load heavily on it, we call it
the Law and Regulation factor. RC3 is named the ICT factor since computers, net users and
telephone lines load heavily on it. RC4 is the Education factor since secondary and tertiary
enrollments load heavily on it. RC5 is named the Literacy factor after the single variable,
adult literacy. The final factor RC6 is the Openness factor because tariffs and NTBs almost
entirely load on it. Of note is the fact that the unexplained variance (last column), after
accounting for the six principal components, is quite small for every variable. This indicates
that the factor model with six components fits the data well at the individual variable level
(therefore satisfying criterion (ii) in fn 10, as well).
Computing Factor Scores
The scores on any factor indicate how much of the factor is "contained" in a particular
country. We use the oblique-rotated factor loadings as the basis for our factor score com-
putations. As indicated in (26), the unrotated principal components (eigenvectors) Â are
transformed into the rotated components Â_R as Â_R = Â T = U_k T (Table 4.3 displays the
matrix T). In order to estimate the factor scores, we use the direct method (as distinct
from the regression method; see Reyment and Joreskog, pp. 223-225).14 In this method, the
13It is reasonable that correlations among factors should be weaker than correlations among variables
measuring any single factor (else the two factors should be combined into one).
14This is different from the method used to compute the scoring matrix in the ML method below.
factor scores F̂ are computed as

    F̂ = Z[(Â_R′ Â_R)^(-1) Â_R′]′,    (27)

where F̂ is the (n × k) matrix containing factor scores on each factor for the 120 countries,
and Z is the (n × p) matrix containing the standardized data variables. The (p × k) scoring
coefficient matrix in Table 4.6 (produced by Stata) is the transpose of the coefficients
(Â_R′ Â_R)^(-1) Â_R′. However, before computing factor scores using (27), an important step is
required to avoid another pitfall.
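The direct method is a one-line matrix computation. A sketch with stand-in inputs of the paper's dimensions (120 countries, 12 variables, 6 factors):

```python
import numpy as np

def direct_scores(Z, A_R):
    """Direct-method factor scores from equation (27):
    F = Z [(A_R' A_R)^(-1) A_R']', with Z the n x p standardized data and
    A_R the p x k rotated loadings. The bracketed term, transposed, is the
    p x k scoring coefficient matrix."""
    B = A_R @ np.linalg.inv(A_R.T @ A_R)       # p x k scoring coefficients
    return Z @ B

# Stand-in inputs, not the KAM estimates.
rng = np.random.default_rng(4)
Z = rng.normal(size=(120, 12))
A_R = rng.normal(size=(12, 6))
F = direct_scores(Z, A_R)
print(F.shape)
```

One useful property: if the data were generated exactly as Z = F A_R′, the direct method recovers F, since F A_R′ A_R (A_R′ A_R)^(-1) = F.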
Avoiding a Pitfall (and trade-offs involved)
The scoring coefficients in Table 4.6 show that a few coefficients, indicated in bold, should
dominate the measurement of the factor scores. In practice, however, other elements of
the matrix can and do influence the computation of the factor scores, with unexpected
consequences. Consider the first factor, the ICT factor, in Table 4.6. It consists of three
large positive scoring coefficients (computers, internet users and telephones) and nine small
coefficients, some of which are negative. These negative coefficients can actually produce
contrarian factor scores. Take Angola, for example. Applying the direct method in (27) pro-
duces an ICT factor score for Angola that ranks it 70th among the 120 countries. However,
when the countries in the sample are ranked individually according to the three variables
that measure ICT, Angola ranks near the bottom of the list, below 115th, in all three rank-
ings. The reason why its rank on the ICT factor score is much higher than its rank on any of
the three variables is because some negative coefficients multiply into (large) negative values
of the corresponding standardized variables to create positive numbers. The point is that,
although the small coefficients appear innocuous, using them in a formulaic manner can lead
to mismeasuring factor scores, sometimes quite poorly.
Because accurately measuring the factor scores is critically important, we take care to
produce the simplest structure possible.15 The example above indicates that despite those
efforts, the structure is still not as simple as Thurstone's ideal. Had that ideal been achieved
by the oblique rotation, it would also have produced accurate factor scores. In order to
overcome the pitfall, exemplified by the Angola case, we propose to keep only the leading
scoring coefficients in each column while computing factor scores, and to set the remaining
coefficients to zero. This scoring matrix with the embedded zeros is presented in Table 4.7.
For example, in the first column we retain the first three elements of the scoring coefficient
matrix in Table 4.6 that correspond to the main loadings on this factor. The remaining
elements are set to zero.
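The zero-embedding step can be sketched as follows. The 12 × 6 scoring matrix and the variable groupings in `keep` are hypothetical illustrations, not the paper's Table 4.7.

```python
import numpy as np

def zero_embed(B, keep):
    """Keep only the leading scoring coefficients of each factor (column)
    and set the rest to zero, so small (possibly negative) coefficients on
    unrelated variables cannot produce contrarian factor scores of the
    kind seen in the Angola example."""
    B0 = np.zeros_like(B)
    for j, rows in enumerate(keep):
        B0[rows, j] = B[rows, j]
    return B0

# Stand-in scoring matrix; variables 0-2 (say computers, net users,
# telephones) are the main measures of the first factor.
rng = np.random.default_rng(5)
B = rng.normal(scale=0.05, size=(12, 6))
B[0:3, 0] = [0.40, 0.38, 0.35]
keep = [[0, 1, 2], [3, 4], [5, 6], [7, 8], [9], [10, 11]]  # hypothetical
B0 = zero_embed(B, keep)
print(B0[:, 0].round(2))
```

After zero-embedding, a country's score on a factor depends only on the variables that are its strong measures, so the factor rankings cannot contradict the rankings on those variables.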
This approximation may be formally tested using Anderson's eigenvector test (Reyment and
Joreskog, 1993, p. 101). In order to test whether a specific vector b is equal to the eigenvector
a_i associated with the eigenvalue λ_i of a matrix S (a_i is the ith principal component of the
data correlation matrix), the Anderson test statistic is:

    χ²_eig = (N − 1)[λ_i b′S^(-1)b + (1/λ_i) b′Sb − 2].    (28)

The statistic is distributed as χ² with p − 1 degrees of freedom, where p is the number
of elements of the eigenvector (here p = 12). Inserting the eigenvector a_i in place of b in
(28) results in a value of χ²_eig = 0. We will use this property to adapt (28) to test the
rotated factor loadings A_R, which were created by transforming A using the rotation matrix
T, A_R = AT. Even though the columns of A_R are not eigenvectors, the columns of
A_R T^(-1) are the original eigenvectors, so (28) can equivalently be written as a test statistic
for the equality of the rotated vector corresponding to eigenvector a_i, a_Ri, with a specific
vector b_R as:

    χ²_Reig = (N − 1)[λ_i (T_i^(-1))′ b_R′ S^(-1) b_R T_i^(-1) + (1/λ_i)(T_i^(-1))′ b_R′ S b_R T_i^(-1) − 2],    (29)

where T_i^(-1) is the column of T^(-1) that corresponds to the eigenvector being tested for equality
with the specific vector b_R. We use (29) to test the equality of each component (column)
of the rotated factor loading matrix Â_R - the simplest possible principal component structure
given the data - with the corresponding column of the rotated factor loading matrix with
embedded zeros, Â_R0 - our "ideal" Thurstonian structure given the data.16 The statistic
re-rotates the components back into the original unrotated eigenvector space and compares
whether the simpler structure maps back closely to the unrotated loadings (which is the
basis for the test statistic). Thus, the computed statistics correspond to the order of the
unrotated eigenvectors. For the six columns the calculated chi-squared statistics, with
11 degrees of freedom, are: 304.8, 54.6, 2.17, 35.0, 28.8, and 7.60. The critical value of
24.7 rejects equality for four of the six principal components (i.e. columns of Â_R and Â_R0).

15The ordinal measure is a unit-free standardized factor score, relative to the median country, whose score
is zero.
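The key property of (28) can be checked numerically on a stand-in correlation matrix: the statistic vanishes when b is the eigenvector itself and is positive otherwise (by the arithmetic-geometric mean inequality applied to λ_i/λ_j + λ_j/λ_i).

```python
import numpy as np

def anderson_chi2(b, S, i, N):
    """Anderson's eigenvector test statistic, equation (28): tests whether
    the (unit-length) vector b equals the eigenvector of S paired with the
    i-th largest eigenvalue; chi-squared with p - 1 df under the null."""
    lam = np.sort(np.linalg.eigvalsh(S))[::-1][i]
    b = np.asarray(b, dtype=float)
    b = b / np.linalg.norm(b)                  # the test assumes unit norm
    return (N - 1) * (lam * b @ np.linalg.inv(S) @ b + (b @ S @ b) / lam - 2)

# Stand-in correlation matrix from synthetic data, not the KAM data.
rng = np.random.default_rng(6)
X = rng.normal(size=(120, 5)) @ rng.normal(size=(5, 5))
S = np.corrcoef(X, rowvar=False)
a1 = np.linalg.eigh(S)[1][:, -1]      # eigenvector of the largest eigenvalue
print(round(anderson_chi2(a1, S, 0, 120), 8))   # ~0 at the eigenvector
print(round(anderson_chi2(np.ones(5), S, 0, 120), 2))
```

At the eigenvector, b′S^(-1)b = 1/λ_i and b′Sb = λ_i, so the bracketed term is exactly 1 + 1 − 2 = 0.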
The trade-off that this result forces is between (a) accepting the results of the test and using
the full scoring matrix to compute (sometimes unreliably) the factor scores, and (b) proceeding
with a scoring matrix in which zeros replace the elements whose corresponding
loadings are small. The latter option is the one we choose for two reasons. First, we believe
that while imperfect, as shown by the formal test procedure, it is a good approximation
because the simple structure has delivered a clear picture of the variables that are strong
measures of each factor. They come close to approximating the Thurstone ideal, and replac-
ing the small loadings with zeros accomplishes that ideal. Second, and more important, is
the overwhelming need to have consistent estimates of factor scores. As the Angola example
drives home, the scores on a factor must be consistent with the underlying rankings of those
variables that overwhelmingly determine the characteristics of the factors.
For these reasons, we proceed with the use of the zero-embedded scoring coefficient matrix
16This method may not be used with the ML method below because the ML method's loadings matrix is
not the matrix of eigenvectors, so there is no correspondence between the loadings matrix and the eigenvalues.
in Table 4.7 to determine the scores on the six factors. The same matrix is used to compute
the 1995 scores and the 2002 scores on the six factors, so that the scores across the two
periods may be compared. These factor scores will be used to depict two important features
of the sample. First, the scores allow us to rank each country according to the values of
each factor, for any specific period. Second, the scores from the two periods indicate how
a country's relative position on a factor has changed over that time. Before performing
these comparisons, we undertake a different kind of factor analysis, estimated by maximum
likelihood. We will then be able to draw on a richer and more robust set of results when we inspect
country rankings and how they have changed.
4.2 True Factor Analysis with Maximum Likelihood
Many of the issues discussed while analyzing the PC results - rotation, simple
structures, correlation of factors - are relevant for true factor analysis as well. Here too, our
objective is to achieve the simplest structure, which for the ML estimates requires an oblique
factor rotation. As was the case with PC analysis, in order to avoid inconsistent estimates
and rankings in the true factor analysis, we set the factor scoring coefficients corresponding
to small factor loadings to zero. The important difference from the PC analysis is that ML
favors fewer factors. The communalities are reasonably high, as indicated by the fairly low
(below 0.35) uniquenesses of the variables.
With the focus now on intercorrelations rather than variances (see (14)), the appropriate
measure of fit used to assess how many factors best fit the data is no longer the amount of
total variance explained by the factors as in PC analysis. Three measures are appropriate
here: a chi-squared measure of fit, denoted χ²_fit, and two information-based criteria - the Bayes
information criterion (BIC) and Akaike's information criterion (AIC). The first column of Table
5.1 indicates the number of factors. The next five columns are related to the chi-squared fit
statistic corresponding to the number of factors in the first column. χ²_fit is distributed as
chi-squared with 0.5[(p − k)² − (p + k)] degrees of freedom. It is used to test the hypothesis that k or fewer
factors are required to rationalize the data. At the 1% level of significance the calculated
statistic rejects k = 1 and k = 2 but fails to reject k = 3. The smallest k is therefore three
according to this measure of fit. Another use of this statistic is to see if the difference in
the statistic with every increase in k is "statistically significant". Thus, going from k = 5 to
k = 6 is the first increase (starting from k = 1) for which the change in the statistic is not
significant. According to this variant, k = 5.
The two information criteria reward parsimony and penalize over-parameterization, with
the BIC penalizing over-parameterization more strictly. The smaller the BIC and AIC, the
more preferred the model. The BIC chooses k = 4 while the AIC chooses k = 5. Thus, the
statistical tests conclude that we should focus our attention on no more than five and no less
than four factors. We estimated the model with both four and five factors. Upon examining
the simplest loading structure we found the four factor model to have cleaner political-
economic content since the fifth factor is not distinct in the sense of clearly generating even
one of the variables. That is, it consists of many small undistinguished loadings that are
collectively significant but not individually so. Thus, we proceed with k = 4.
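The AIC/BIC selection can be sketched as follows, using scikit-learn's FactorAnalysis as a stand-in for the paper's ML estimator on synthetic data with three true factors. The parameter count p·k + p − k(k−1)/2 (loadings plus uniquenesses minus rotational indeterminacies) is the usual one for an unrotated ML factor model.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic stand-in data: 500 observations, 12 variables, 3 true factors.
rng = np.random.default_rng(7)
n, p = 500, 12
F = rng.normal(size=(n, 3))
L = rng.normal(size=(3, p))
X = F @ L + 0.5 * rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize

crit = {}
for k in range(1, 7):
    fa = FactorAnalysis(n_components=k).fit(X)
    loglik = n * fa.score(X)                   # score() = mean loglik/sample
    q = p * k + p - k * (k - 1) // 2           # free parameters
    crit[k] = (-2 * loglik + 2 * q,            # AIC
               -2 * loglik + q * np.log(n))    # BIC
best_aic = min(crit, key=lambda k: crit[k][0])
best_bic = min(crit, key=lambda k: crit[k][1])
print(best_aic, best_bic)
```

Because the BIC penalty grows with log(n) while AIC's does not, the BIC tends to select a k no larger than AIC's choice, mirroring the k = 4 versus k = 5 split reported above.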
Table 5.2 indicates the oblique-rotated factor loading matrix with four factors (the rotation
matrix T is reported in Table 5.3). The oblique rotation improves upon the orthogonal
(varimax) rotation and produces a simple structure. The first factor is named the ICT
factor because the variables computers, internet users and telephones load heavily on this
factor. Further, these three variables do not load heavily on any of the remaining factors,
thus satisfying an important simple structure requirement. The second factor is named the
Law, Regulation and Openness Factor because the three variables law, quality of regulation,
and tariff and non-tariff barriers load heavily on this factor. These variables also do not load
heavily on the other factors. Factor 3 is named the Literacy and Education Factor because
the three variables adult literacy, secondary enrollment and tertiary enrollment load heavily
on this factor (and not on any other). Finally, factor four is named the Innovations Fac-
tor because the numbers of patent applications, researchers, and articles in technical
journals load heavily on this factor. Thus, Table 5.2 indicates a clear and simple struc-
ture of factors. These four factors define the underlying dimensions in the data, which are
measured by the observed variables. That is, computers, internet users and telephones are
essentially different measures of the ICT dimension, and adult literacy, secondary enrollment
and tertiary enrollment are different measures of the Literacy and Education dimension.
An attractive feature of the four factors is that they account for the communalities in the
variables quite well. The residual variances are small, as indicated by the last column of
Table 5.2. None of the variables have a large measure of "uniqueness". If one of the variables
did, then it would mean that the error variance from a regression of that variable on the
factors would be large. As a rule of thumb, a uniqueness measure for a variable greater than
0.50 would indicate the presence of a unique factor, uncorrelated with the four common
factors. Fortunately, the four factors rationalize our data well. Finally, just as for the PC
analysis, in order to compute factor scores we use the Thurstonian scoring coefficient matrix
in Table 5.5, achieved by replacing the undistinguished elements in the full scoring coefficient
matrix in Table 5.4 by zero and retaining the significant loadings in each column.
4.3 Weighted data
In addition to the data set analyzed above, it is instructive to analyze a data set in which
variables that increase with the size of the country are scaled down. Thus, we also factor-
analyze a "weighted" data set which is different from the "unweighted" data (analyzed thus
far) with regard to three variables: patent applications, researchers in R&D, and scientific
and technical journal articles. In the "weighted" data these three variables are scaled by
population, while the remaining nine variables are exactly the same as in the "unweighted"
data. The scaling of these three variables does influence the optimal number of principal
components required to rationalize the data. For brevity, we refer the reader to Chen and
Gawande (2006) for details such as the factor loading matrices for the "weighted" data. The
methods for estimating those matrices and then using them to estimate the factor scores are
exactly the same as described in Section 3.
The main differences between the two data sets are that in the "weighted" data with the ML
method the optimal number of factors (according to the Bayes information criterion) is
three, one less than for the "unweighted" data, while in PC analysis the optimal number of
components is seven, one more than for the "unweighted" data. In the graphical analysis of
the factor scores and rankings below, we differentiate the findings from "unweighted" and
"weighted" data sets.
5. Analysis of the Factor Output
The main objective of the factor score computations is to use them to describe how countries
rank on the basis of these factors, and how those rankings have changed over the two peri-
ods. The authors' working paper contains a more complete analysis for 20 underdeveloped,
developing, emerging, oil-rich and industrialized economies. Here we discuss these results
for five countries.
Figure 1, for Albania, has four panels. The panel on the top left depicts Albania's rank
among the 120 countries in the sample on each of the six principal components. The
spider chart on the top right depicts Albania's rank on the four ML factors. The bottom
row panels contain the weighted-data counterparts to the top row. There are seven principal
components and three ML factors in these data. The green line inside each spider chart shows
how Albania ranked on each principal component or ML factor in 1995. The red line in the
spider graph shows Albania's ranking in the most recent period, around 2002. If, along any
factor axis, the red line is closer to the center than the green line, then it indicates
that Albania's position relative to other countries in the sample on that factor has worsened
over the decade. This unpleasant and surprising finding applies to the Literacy factor, the
ICT factor, and the Education factor. Albania's ranking on the Literacy factor dropped from
being near the top 25th percentile to the bottom 35th percentile over this decade. Similar
deteriorations are in evidence for the ICT factor and the Education factor. Whether this
decline in rankings implies that Albania degraded in absolute terms on the factor score or
whether it improved, but at a far slower pace than other countries, is not obvious from the
graphs. However, since we have used a common factor scoring matrix for computing factor
scores for the two periods, the scores can be put to use in cardinal comparisons as well. The
ML factors, although fewer in number, convey the same difficult message about the change
in Albania's ranking on the ICT factor and the Literacy & Education factor.
Angola ranks towards the bottom of the list of 120 countries in almost all dimensions, whether
measured by principal components or maximum likelihood. It ranks abysmally in literacy,
law, education, and innovation potential. The "unweighted" data may stack the odds against
small countries like Angola, since the variables patent applications, number of researchers, and
technical journal output are unscaled by population. The "weighted" data do indicate hope
for Angola. Its rank in terms of its (scaled) patent applications is closer to the median. Its
ranking on net users and (scaled) number of researchers has also increased over the 10-year
period, indicating that the country is taking steps to keep up with the technological changes in
the world.
One reason for separately analyzing the "weighted" and "unweighted" data sets is the belief
that there is a scale effect in the sheer numbers. That is, there may be threshold effects
in innovation potential based on the stock of R&D and intellectual capital, as measured by
technical journal output, number of patent applications, and number of researchers. This is the
sense in which the "unweighted" data are different from the "weighted" data. In addition to
the obvious examples of the US, Japan and Western European countries, India and China
have also demonstrated such threshold effects. On the other hand, scaling these variables by
population indicates the extent to which the full technical potential of the population is being
tapped. High levels of these scaled measures are also indicators of innovation potential as
countries like Finland and Iceland have demonstrated in the last decade. So while there is no
compelling reason that sheer numbers should be more or less important than the proportion
of the population that is involved in technical pursuits, it is clear that both have led to the
potential to innovate.
Argentina has, as one might expect after a major currency and banking crisis, degraded
along many dimensions. In the "unweighted" data, it has fallen to the bottom quartile on
the law dimension, as well as in openness. Rising inequality due to the recession is probably
responsible for the degradations in the law dimension. The devaluation was probably
not enough to make its exports competitive and therefore, while the rest of the world has
cut back on trade barriers, Argentina has maintained or increased them. The four dimen-
sional ML factors show a stark picture on the law and openness dimension. Surprisingly,
Argentina has not lost its ranking in the other three dimensions. Its literacy ranking has
actually increased, on innovation potential it has kept pace, and on the ICT dimension it
has maintained its position. The "weighted" data reiterate the same messages from the
"unweighted" data.
Brazil has made gains and presents a contrasting picture to Argentina on at least the law
dimension. While its high income inequality is probably responsible for placing Brazil in
the bottom half of the sample on the law and regulation dimension, the country has improved
on this dimension during the last decade. In the four-dimensional ML graph, the green line
contains the red line, indicating that over this ten year period Brazil has improved its ranking
on each dimension. The principal components show that its rank on the openness dimension
has fallen, which probably has to do with Mercosur (Argentina and Brazil shared similar
rankings on openness in 1995) or is a result of keeping trade barriers at fixed levels while the
rest of the world has liberalized. The ML graph indicates impressive gains in literacy and
education in Brazil. It is probably a good bet that this trend will also lead to an increase in
Brazil's rankings on the law dimension in future years (recall that the factors are correlated).
The "weighted" data paint a similar picture.
China, being a populous country, will obviously show different rankings for "weighted" versus
"unweighted" dimensions. We should be cautious about interpreting the meaning of the
innovation potential factors in the "unweighted" versus the "weighted" data. In the unscaled
data China ranks high on the innovation potential list because of its sheer size.
The "weighted" data present quite a contrast along the dimension measured by researchers
and technical journals. In other words, while China has a critical mass in innovation potential
(which may be the reason it attracts foreign direct investment), China still has a long way to
go in achieving its full potential on innovation as measured by the scaled data. If it produced
patents, researchers, and technical journals at the same per capita rate as the more advanced
countries, China would probably be an OECD country. Such trends are already in evidence.
Along each of these dimensions in the "weighted" data, China is already at the median of
the sample and has made strides to move ahead, especially in patent applications. On the other
dimensions, China's literacy ranking has not improved greatly, but its ICT factor has leaped
from the bottom quartile to close to the median of the sample.
6. Conclusion
We factor-analyze the Knowledge Assessment Methodology (KAM) data. The KAM data
were developed at the World Bank to assess countries' readiness for the knowledge economy.
The data can draw the attention of policymakers to specific areas deserving of further
investment. Factor analysis reduces these many
variables to their essential dimensions or factors. Our main objective in undertaking the
factor analysis is to quantify the factors for each country, that is, compute factor scores on
each factor. To this end, the paper addresses in detail the key issues in the factor analysis of
the KAM data: whether the data should be factor-analyzed at all, the optimal dimensionality of
the data, and how to give political-economic meaning to the factors. We find that the KAM data
are not just amenable to factor analysis but greatly benefit from it. There are enough
inter-correlations among the variables that the real information in the data can be distilled
down to a smaller number of dimensions.
We use two factor-analytic methods: Principal Components (PC) analysis and "true" factor
analysis, which we estimate by maximum likelihood (ML). While PC analysis focuses on
explaining the variance in the data, the ML method seeks to explain the intercorrelations in
the data. We should therefore expect the two methods to produce different results. While
the results are different (PC analysis requires many more dimensions to rationalize the data
than ML analysis), there are common themes.
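The contrast between the two methods can be sketched in a few lines with scikit-learn. This is a minimal illustration on synthetic data standing in for the KAM variables, not the estimation code used in the paper; note also that scikit-learn's `FactorAnalysis` fits the common-factor model by EM-based maximum likelihood without the oblique rotations discussed earlier:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the KAM data: two latent factors driving
# six observed indicators, plus idiosyncratic noise.
n_countries, n_vars, n_factors = 120, 6, 2
latent = rng.normal(size=(n_countries, n_factors))
loadings = rng.normal(size=(n_factors, n_vars))
X = latent @ loadings + 0.5 * rng.normal(size=(n_countries, n_vars))
X = StandardScaler().fit_transform(X)

# PC analysis: orthogonal components that maximize explained variance.
pca = PCA(n_components=n_factors).fit(X)

# "True" factor analysis, estimated by maximum likelihood: models the
# common variance (intercorrelations) and leaves a separate noise
# variance free for each observed variable, unlike PCA.
fa = FactorAnalysis(n_components=n_factors).fit(X)

# Factor scores: one row per "country", one column per factor.
scores = fa.transform(X)
print(scores.shape)  # (120, 2)
```

Because PCA absorbs all variance, including the idiosyncratic part, into its components, it typically needs more dimensions to rationalize the same data than the ML factor model, which is the pattern we observe with the KAM variables.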
A contribution of the paper is identifying the political-economic dimensions in the KAM
data and measuring them for (ordinal) comparisons over time. We embrace the idea of a
simple structure of the dimensions and allow these dimensions to be correlated with each
other. The output from the factor analysis is used to graphically analyze how countries have
changed their rankings on the underlying dimensions over the 1995-2002 period.