WPS4632

Policy Research Working Paper 4632

Instrumental Variables Regressions with Honestly Uncertain Exclusion Restrictions

Aart Kraay
The World Bank
Development Research Group
Macroeconomics and Growth Team
May 2008

Abstract

The validity of instrumental variable regression models depends crucially on fundamentally untestable exclusion restrictions. Typically exclusion restrictions are assumed to hold exactly in the relevant population, yet in many empirical applications there are reasonable prior grounds to doubt their literal truth. This paper shows how to incorporate prior uncertainty about the validity of the exclusion restriction into linear instrumental variable models, and explores the consequences for inference. In particular the paper provides a mapping from prior uncertainty about the exclusion restriction into increased uncertainty about parameters of interest. Moderate prior uncertainty about exclusion restrictions can lead to a substantial loss of precision in estimates of structural parameters. This loss of precision is relatively more important in situations where instrumental variable estimates appear to be more precise, for example in larger samples or with stronger instruments. These points are illustrated using several prominent recent empirical papers that use linear instrumental variable models.

This paper--a product of the Macroeconomics and Growth Team, Development Research Group--is part of a larger effort in the department to develop tools for the analysis of development issues. Policy Research Working Papers are also posted on the Web at http://econ.worldbank.org. The author may be contacted at akraay@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished.
The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

_____________________________
1818 H Street NW, Washington DC, 20433, akraay@worldbank.org. I would like to thank Daron Acemoglu, Laura Chioda, Frank Kleibergen, Dale Poirier and Luis Serven for helpful comments. The opinions expressed here are the author's and do not reflect the official views of the World Bank, its Executive Directors, or the countries they represent.

"The whole problem with the world is that fools and fanatics are always so certain of themselves, but wiser people are so full of doubts"
Bertrand Russell

The validity of the widely-used linear instrumental variable (IV) regression model depends crucially on the exclusion restriction that the error term in the structural equation of interest is orthogonal to the instrument. In virtually all applied empirical work this identifying assumption is imposed as if it held exactly in the relevant population. But in the vast majority of empirical studies using non-experimental data, it is hard to be certain that the exclusion restriction is literally true, as it is fundamentally untestable.1 Recognizing this, careful empirical papers devote considerable effort to selecting clever instruments and arguing for the plausibility of the relevant exclusion restrictions. But despite the best efforts of the authors, readers (and authors) of these papers may in many cases legitimately entertain doubts about the extent to which the exclusion restriction holds.
In this paper I consider the implications of replacing the standard identifying assumption that the exclusion restriction is literally true with a weaker one: that there is prior uncertainty over the correlation between the instrument and the error term, captured by a well-specified prior distribution centered on zero. The standard and stark prior assumption is that this distribution is degenerate with all of the probability mass concentrated at zero, so that the exclusion restriction holds with probability one in the population of interest. In most applications however a more honest, or at least more modest, prior assumption is that there is some possibility that the exclusion restriction fails, even if our best guess is that it is true. I then explore the consequences for inferences about the structural parameters of interest of such prior uncertainty about the validity of the exclusion restriction.

I find that even modest prior uncertainty about the validity of the exclusion restriction can lead to a substantial loss of precision in the IV estimator. Somewhat surprisingly, this loss of precision is relatively more important in situations in which the usual IV estimator would otherwise appear to be more precise, for example, when the sample size is large or the instrument is particularly strong. The intuition for this is straightforward. If I am willing to entertain doubts about the literal validity of the exclusion restriction, having a stronger instrument or having a larger sample size cannot reduce my uncertainty about the exclusion restriction, as the data are fundamentally uninformative about its validity.

1 Murray (2006) poetically refers to this as the "cloud of uncertainty that hovers over instrumental variable estimation".
Since prior uncertainty about the exclusion restriction is unaffected by sample size or the strength of the instrument, while the variance of the IV estimator declines with sample size and the strength of the instrument for the usual reasons, the effects of prior uncertainty about the exclusion restriction become relatively more important in circumstances where the IV estimator would otherwise appear to be more precise.

In this paper I rely on the Bayesian approach to inference. With its explicit treatment of prior beliefs about parameters of interest, it provides a natural framework for considering prior uncertainty about the exclusion restriction. I use recently-developed techniques from the literature on Bayesian analysis of linear IV models, and extend them to allow for prior uncertainty over the validity of the exclusion restriction. However, to keep the results as familiar as possible (and hopefully as useful as possible) to non-Bayesian readers, I confine myself to particular cases that mimic standard frequentist results as closely as possible.

The broader goal of this paper is to provide a practical tool for producers and users of linear IV regression results who are willing to entertain doubts about the validity of their exclusion restrictions. Too often discussions of empirical papers that use IV regressions have an absolutist character to them. The author of the paper feels compelled to assert that the exclusion restriction relevant to his or her instrument and application is categorically true, and the skeptical reader, or seminar participant, or referee, is left in an uncomfortable "take it or leave it" position. One possibility is to wholeheartedly accept the author's untestable assertions regarding the literal truth of the exclusion restriction, and with them the results of the paper. The stark opposite possibility is to reject the literal truth of the exclusion restriction, and with it the results of the paper.
The results in this paper provide a modest but useful step away from such "foolish and fanatical" behaviour that the quote from Bertrand Russell reminds us of. For example, using the results in this paper, the producers and consumers of a particular IV regression can readily agree on how much prior uncertainty about the validity of the exclusion restriction would be consistent with the author's results remaining significant at conventional levels. In some circumstances results might be quite robust to substantial prior uncertainty about the exclusion restriction, in which case the author and skeptical reader might agree that the author's conclusions are statistically significant even if they do not agree on the likelihood that the exclusion restriction is in fact true. In other circumstances, even a little bit of prior uncertainty about the exclusion restriction might be enough to overturn the significance of the author's results, in which case the reader who is skeptical about the validity of the exclusion restriction would be justified in rejecting the conclusions of the paper. The contribution of this paper is to provide an explicit tool to enable such robustness checks for uncertainty about the exclusion restriction.

I illustrate these results using three prominent studies that use linear IV regressions. Rajan and Zingales (1998) study the relationship between financial development and growth, using measures of legal origins and institutional quality as instruments for financial development. Frankel and Romer (1999) study the effects of trade on levels of development across countries, using the geographically-determined component of trade as an instrument. Finally, Acemoglu, Johnson and Robinson (2001) study the effects of institutional quality on development in a sample of former colonies, using historical settler mortality rates in the 18th and 19th centuries as instruments.
In all three cases, reasonable readers might entertain some doubts as to the literal validity of the exclusion restriction. I show how to adjust the standard errors in core specifications from these papers to reflect varying degrees of uncertainty about the exclusion restriction. For the first two papers I find that moderate uncertainty about the exclusion restriction is sufficient to call into question whether the findings are indeed significant at conventional levels, while the findings of the third paper appear to be more robust to all but extreme prior uncertainty about the exclusion restriction.

Most theoretical and empirical work using the linear IV regression model proceeds from the assumption that the exclusion restriction holds exactly in the relevant population. One notable recent exception, closely related to this paper, is Hahn and Hausman (2006). They study the asymptotic properties of OLS and IV estimators when there are known "small" violations of the exclusion restriction. In particular, they allow for a known correlation between the instrument and the error term, and in order to obtain asymptotic results they assume that this correlation shrinks with the square root of the sample size. Since the violation of the exclusion restriction is "local" in this particular sense, they find no effects on the asymptotic variance of the IV estimator. They then go on to compare the asymptotic mean squared error of the OLS and IV estimators, and show that IV dominates OLS according to this criterion unless violations of the exclusion restriction are strong. My approach and results differ importantly in two respects. First, I do not assume that the strength of violations of the exclusion restriction declines with sample size. While this assumption is analytically convenient when deriving asymptotic properties of estimators, it is not very intuitive.
Since in general the data are uninformative about exclusion restrictions, it is unclear why we should think that concerns about the validity of the exclusion restriction are diminished in larger samples. Second, I explicitly incorporate uncertainty about the exclusion restriction, by assuming that there is a well-specified prior distribution over the correlation between the instrument and the error term. In contrast Hahn and Hausman (2006) treat violations of the exclusion restriction as a certain but unknown parameter to be chosen by the econometrician.2 The uncertainty about the exclusion restriction that I emphasize is central to my results, as this uncertainty is responsible for the increased posterior uncertainty about parameters of interest. Closely related to their paper is Berkowitz, Caner and Fang (2008) who assume the same 'local' violation of the exclusion restriction, and demonstrate that standard test statistics in the IV regression model tend to over-reject the null hypothesis.

The results in this paper are also closely related to (although developed independently of) those in Conley, Hansen, and Rossi (2007). They study linear IV regression models in which there are potentially failures of the exclusion restriction (which they refer to as "plausible exogeneity"). They propose a number of strategies for investigating the robustness of inference in the presence of potentially invalid instruments, including a fully-Bayesian approach like the one taken here. While very similar in approach, this paper complements theirs in three respects. First, I focus on special cases in which analytic or near-analytic results on the effects of prior uncertainty about the exclusion restriction are available, which helps to develop some key insights.

2 A similar approach of considering the sensitivity of coefficient estimates and tests of overidentifying restrictions to parametric violations of the exclusion restriction is taken by Small (2007).
In contrast, their paper uses numerical methods to construct and sample from the posterior distribution of the parameters of interest. Second, I characterize how the consequences for inference of prior uncertainty about the exclusion restriction depend on the characteristics of the observed sample. This can provide guidance to applied researchers as to whether such prior uncertainty is likely to matter significantly in particular samples. Finally, I provide several macroeconomic cross-country applications of this approach that complement the more microeconomic examples in their paper.

The rest of the paper proceeds as follows. In order to develop intuitions based on analytic results, I begin in Section 2 with the simplest possible example of a bivariate OLS regression. This is of course a particular case of IV in which the regressor serves as its own instrument. I consider the consequences of introducing prior uncertainty about the correlation between the regressor and the error term for inference about the slope coefficient. In this simple case I can analytically characterize the effect of prior uncertainty on the precision of the OLS estimator. In Section 3 I turn to the IV regression model, focusing on the particular case of a just-identified specification with a single endogenous regressor. The same insights and analytic results from the OLS case apply to the OLS estimates of the reduced form of the IV regression model. Although I am no longer able to analytically characterize the effect of prior uncertainty on the precision of the IV estimator of the structural slope coefficient of interest, it is straightforward to characterize it numerically and show how it depends on the characteristics of alternative realized samples. Section 4 of the paper applies these results to three empirical applications. Section 5 offers concluding remarks and discusses potential extensions of the results.

2. The Ordinary Least Squares Case

I begin by showing how to incorporate prior uncertainty about the exclusion restriction in the simplest possible case: a linear OLS regression. It is helpful to begin with this simple case by way of introduction. In the next section of the paper we will see how these results extend in a very straightforward way to linear IV regression models.

2.1 Basic Setup and the Likelihood Function

Consider the following bivariate linear regression:

(1)  $y_i = \beta x_i + \epsilon_i$

The regressor x is normalized to have zero mean and unit standard deviation. Assume further that the regressor and the error term are jointly normally distributed:

(2)  $\begin{pmatrix} x_i \\ \epsilon_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^2 \end{pmatrix} \right)$

The key assumption here is that I allow for the possibility that the error term is correlated with the regressor, i.e. $\rho$ might be different from zero. In the case of OLS this is the relevant failure of the exclusion restriction. In the next section when I discuss the IV case, I will assume that an instrumental variable z is available for x, but might be invalid in the sense that the instrument is correlated with the error term $\epsilon$. The distribution of the error term conditional on x is:

(3)  $\epsilon_i \mid x_i \sim N\left( \rho\sigma x_i,\; \sigma^2\left(1-\rho^2\right) \right)$

Note of course that when $\rho \neq 0$, the usual conditional independence assumption $E[\epsilon_i \mid x_i] = 0$ that is normally used to justify OLS does not hold.

Let y and X denote the Tx1 vectors of data on y and x in a sample of size T, and note that the normalization of x implies that X'X = T. Also let $\hat{\beta} = T^{-1}X'y$ denote the OLS estimator of the slope coefficient, and let $s^2 = (y - X\hat{\beta})'(y - X\hat{\beta})/(T-1)$ be the OLS estimator of the variance of the error term. Finally, define $\omega^2 \equiv \sigma^2(1-\rho^2)$. With this notation the likelihood function can be written as:

(4)  $L(y, X; \beta, \sigma, \rho) \propto \omega^{-T} \exp\left[ -\frac{1}{2}\left( \frac{(T-1)s^2}{\omega^2} + \frac{\left(\hat{\beta} - \beta - \rho\sigma\right)^2}{\omega^2/T} \right) \right]$

2.2 The Prior Distribution

In Bayesian analysis, the parameters of the model, in this case $\beta$, $\sigma$, and $\rho$, are treated as random variables.
The analyst begins by specifying a prior probability distribution over these parameters, reflecting any prior information that might be available. This prior distribution for the parameters is then multiplied with the likelihood function, which is simply the distribution of the observed data conditional on the parameters. Using Bayes' Rule this delivers the posterior distribution of the model parameters conditional on the observed data sample. Inferences about the parameters of interest are based on this posterior distribution.

In many applications, choosing an appropriately uninformative or diffuse prior distribution for the parameters results in a posterior distribution that is closely analogous to the usual frequentist results. In the case of a simple OLS regression where $\rho = 0$ with certainty, an example of such a diffuse prior distribution is to assume that $\beta$ and $\ln(\sigma)$ are independently and uniformly distributed, which implies that their joint prior distribution is proportional to $1/\sigma$. In this case, a well-known textbook Bayesian result is that the marginal posterior distribution for $\beta$ is a Student-t distribution with mean equal to the OLS slope estimate and variance equal to the estimated variance of the OLS slope. As a result, a standard frequentist 95 percent confidence interval would be analogous to the range from the 2.5th percentile to the 97.5th percentile of the posterior distribution for $\beta$.

In order to retain this link with standard frequentist results, I will maintain this diffuse prior assumption for $\beta$ and $\sigma$. My main interest is in specifying a non-degenerate prior distribution for the correlation between the regressor and the error term, $\rho$. Note that in the standard case there is a drastic asymmetry between prior beliefs about $\rho$ and the other parameters of the model.
In particular, prior beliefs about $\rho$ are usually assumed to be highly informative in the sense that the prior probability distribution for $\rho$ is degenerate with all the probability mass at zero, while prior beliefs about $\beta$ and $\sigma$ are assumed to be diffuse or totally uninformative. My objective is to relax this asymmetry by allowing for some prior uncertainty about the exclusion restriction. In particular, I assume the prior distribution for $\rho$ is proportional to $(1-\rho^2)^{\phi}$ over the support (-1,1), where $\phi$ is a parameter that governs prior confidence as to the validity of the identifying assumption. In particular, when $\phi = 0$ we have a uniform prior over (-1,1). As $\phi$ increases the prior becomes more concentrated around zero, and in the limit we approach the standard assumption that $\rho = 0$ with probability one. Figure 1 plots this prior distribution for alternative values of $\phi$. The top panel of Table 1 reports the 5th and 95th percentiles of the distribution for alternative values of $\phi$. For example, setting $\phi = 500$ corresponds to the rather strong prior belief that there is a 90 percent probability that $\rho$ is between -0.05 and 0.05, and only a 10 percent probability that it is further away from zero.

A natural extension is to allow the prior distribution for $\rho$ to have a non-zero mean, in order to encompass prior beliefs that there might be systematic violations of the exclusion restriction. Although this is straightforward to do, I do not pursue this option here as it adds little in the way of additional conceptual insights. For example, if our prior is that the mean of $\rho$ is positive, then there will be a corresponding downward adjustment in the mean of the posterior distribution for the slope coefficient. Moreover, the adjustments to the variance of the posterior distribution due to uncertainty about the exclusion restriction will be the same as what we have in the case where $\rho$ has a zero mean, and these adjustments to the variance are of primary interest here.
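The percentiles of this prior are easy to reproduce numerically. The sketch below is illustrative rather than taken from the paper: it assumes the prior $g(\rho) \propto (1-\rho^2)^{\phi}$ with the concentration parameter written as `phi`, and inverts the prior CDF on a fine grid (the function name `prior_quantile` is a hypothetical label):

```python
import numpy as np

def prior_quantile(phi, q, grid_size=200001):
    """Quantile of the prior g(rho) proportional to (1 - rho^2)^phi on (-1, 1),
    obtained by numerically inverting the CDF on a fine grid."""
    rho = np.linspace(-1.0, 1.0, grid_size)
    dens = (1.0 - rho**2) ** phi        # unnormalized prior density
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]                      # normalize so the CDF runs from ~0 to 1
    return float(np.interp(q, cdf, rho))

# 5th and 95th percentiles for a few values of phi (cf. the top panel of Table 1)
for phi in (10, 100, 500):
    lo, hi = prior_quantile(phi, 0.05), prior_quantile(phi, 0.95)
    print(f"phi={phi:4d}: 90 percent of prior mass in ({lo:+.3f}, {hi:+.3f})")
```

For $\phi = 500$ this gives roughly (-0.05, 0.05), matching the 90 percent prior range quoted in the text; for $\phi = 100$ it gives roughly (-0.12, 0.12).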
Assuming further that the prior distribution for $\rho$ is independent of the prior distribution for the other two parameters, we have the following joint prior distribution for the three parameters of the model:

(5)  $g(\beta, \sigma, \rho) \propto \sigma^{-1}\left(1-\rho^2\right)^{\phi}$

2.3 The Posterior Distribution

The posterior density is proportional to the product of the likelihood and the prior density, i.e. from applying Bayes' Rule. Multiplying these two distributions and performing some standard rearrangements gives:

(6)  $f(\beta, \sigma, \rho \mid y, X) \propto \left(\omega^2/T\right)^{-1/2} \exp\left[ -\frac{1}{2} \frac{\left(\beta - \left(\hat{\beta} - \rho\sigma\right)\right)^2}{\omega^2/T} \right]$
     $\qquad\qquad \times\; \sigma^{-T}\left(1-\rho^2\right)^{\phi-(T-1)/2} \exp\left[ -\frac{1}{2} \frac{(T-1)s^2}{\sigma^2\left(1-\rho^2\right)} \right]$

The first line is proportional to a normal distribution for $\beta$ conditional on $\sigma$ and $\rho$, with mean $\hat{\beta} - \rho\sigma$ and variance $\omega^2/T$. When $\rho = 0$, this is the very standard Bayesian result for the linear regression model with a diffuse prior. In particular, when $\rho = 0$, the posterior conditional distribution of $\beta$ is normal and is centered on the OLS estimate $\hat{\beta}$. When $\rho$ is different from zero, the mean of the conditional posterior distribution for $\beta$ needs to be adjusted to reflect this failure of the exclusion restriction. If the correlation between the regressor and the error term is positive (negative), then intuitively, the posterior mean needs to be adjusted downwards (upwards) from the OLS slope estimator.

The second line is the joint posterior distribution of $\sigma$ and $\rho$. Written in terms of $\omega$, it consists of the product of an inverted gamma distribution for $\omega^2$ and the posterior distribution for $\rho$.3 The posterior distribution for $\omega^2$ is also standard, and intuitively has a mean equal to the OLS variance estimator (times a small degrees-of-freedom correction), i.e. $E[\omega^2] = \frac{T-1}{T-3}s^2$. The only novel part of Equation (6) is the posterior distribution for $\rho$, which is identical to the prior distribution. This is what Poirier (1998) refers to as a situation in which the data are marginally uninformative about the unidentified parameter $\rho$.
This in turn is a consequence of our prior assumption that $\rho$ is independent of the other parameters of the model.4 Although the data are uninformative about $\rho$, since we have now explicitly incorporated uncertainty about the exclusion restriction, we can explicitly average over this uncertainty when performing inference about the slope coefficient of interest, $\beta$. In particular, we know that the marginal posterior distribution of $\beta$ will reflect our uncertainty about the exclusion restriction. We turn to this next.

2.4 Inference About $\beta$ With an Uncertain Exclusion Restriction

Inferences about $\beta$ are based on its marginal posterior distribution, which is obtained by integrating $\sigma$ and $\rho$ out of the joint posterior distribution of all three parameters. This integration does not appear to be tractable analytically.5 However, given the conditional structure of the posterior distribution, it is straightforward to compute the mean and variance of the marginal posterior distribution of $\beta$ by repeated application of the law of iterated expectations. In particular, for the posterior mean we find:

(7)  $E[\beta] = \hat{\beta} - s\,B(T)\,E\left[\frac{\rho}{\sqrt{1-\rho^2}}\right] = \hat{\beta}$

where $B(T) \equiv \Gamma\left(\frac{T-2}{2}\right)\sqrt{\frac{T-1}{2}} \,\Big/\, \Gamma\left(\frac{T-1}{2}\right) \rightarrow 1$ as T becomes large, and we have used the fact that $E[\omega] = B(T)s$. Note that the last expectation is with respect to the marginal posterior distribution of $\rho$.

3 A random variable x follows an inverted gamma distribution, x ~ IG($\alpha$, $\lambda$), if its pdf is $f(x; \alpha, \lambda) = \lambda^{\alpha}\Gamma(\alpha)^{-1}x^{-(\alpha+1)}\exp(-\lambda/x)$. Setting $x = \omega^2$, $\alpha = \frac{T-1}{2}$ and $\lambda = \frac{s^2(T-1)}{2}$ and disregarding the unimportant constant of proportionality gives the result in the text.

4 If by contrast the prior distribution allowed for some dependence between the unidentified parameter and the identified ones, then the posterior distribution for $\rho$ would no longer be identical to the prior. Intuitively, if the unidentified and identified parameters are a priori dependent, then the data will through this channel be informative about the unidentified parameters.
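The factor $B(T) = \Gamma\left(\frac{T-2}{2}\right)\sqrt{\frac{T-1}{2}}\,/\,\Gamma\left(\frac{T-1}{2}\right)$ can be evaluated directly. A small check (the function name is a hypothetical label; the log-gamma form avoids overflow of the Gamma function for large T):

```python
from math import exp, lgamma, sqrt

def B(T):
    """B(T) = sqrt((T-1)/2) * Gamma((T-2)/2) / Gamma((T-1)/2),
    evaluated in log space so that large T does not overflow."""
    return sqrt((T - 1) / 2.0) * exp(lgamma((T - 2) / 2.0) - lgamma((T - 1) / 2.0))

for T in (10, 30, 100, 1000):
    print(T, B(T))
```

B(T) exceeds one slightly in small samples and approaches one as T grows, consistent with the limit stated above.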
When $\rho$ is identically equal to zero, we have the usual result that the mean of the posterior distribution of $\beta$ is the OLS slope estimate. However, when there is prior (and thus also posterior) uncertainty about $\rho$, we have an additional term reflecting this uncertainty. This term involves the expectation (with respect to the posterior density for $\rho$) of $\rho/\sqrt{1-\rho^2}$. When the prior (and posterior) are symmetric around $\rho = 0$, this term is unsurprisingly zero in expectation. If we are agnostic as to whether the correlation between the error term and x is positive or negative, on average this does not affect the posterior mean of $\beta$. Of course for other priors (and posteriors) not symmetric around zero this would not be the case, and the posterior mean of $\beta$ would have to be adjusted accordingly.

The posterior unconditional variance is more interesting, and can also be found by repeated application of iterated expectations:

(8)  $V[\beta] = s^2\left( \frac{1}{T} + E\left[\frac{\rho^2}{1-\rho^2}\right] \right)\frac{T-1}{T-3}$

Disregarding the small degrees-of-freedom correction $(T-1)/(T-3)$, the first term is just the standard OLS estimator of the variance of $\hat{\beta}$, which is $s^2/T$. The second term is a correction to the variance estimator coming from the fact that there is uncertainty about the conditional mean of $\beta$ coming from our uncertainty about $\rho$. In fact, the second term is recognizable as the variance of the adjustment $\rho\sigma$ to the conditional mean that we saw above. This correction to the posterior variance of $\beta$ is quantitatively very important because it does not decline with the sample size T. The reason for this is straightforward -- since the data are uninformative about the correlation between the regressor and the error term, having a larger sample cannot reduce our uncertainty about this parameter.

5 When $\rho = 0$, standard results show that integrating $\sigma$ out of the joint posterior distribution results in a marginal t-distribution for $\beta$. However this convenient standard result does not go through when $\rho$ differs from zero.
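Because the posterior has this conditional structure, it can be sampled directly: draw $\rho$ from its (prior-equal) posterior, $\omega^2$ from its inverted gamma posterior, then $\beta$ from its conditional normal. The sketch below checks the posterior variance expression $s^2\left(1/T + E[\rho^2/(1-\rho^2)]\right)(T-1)/(T-3)$ by Monte Carlo; the summary statistics T, $\hat\beta$, s and the value of $\phi$ are purely illustrative, and the Beta-stretching trick uses the fact that $(1-\rho^2)^{\phi}$ on (-1,1) is a Beta($\phi+1$, $\phi+1$) density rescaled from (0,1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative observed-sample summaries (not taken from any real data set)
T, beta_hat, s = 500, 0.40, 1.0
phi = 100            # prior concentration in g(rho) proportional to (1 - rho^2)^phi
N = 400_000          # number of posterior draws

# rho: posterior equals the prior; a Beta(phi+1, phi+1) draw stretched to (-1, 1)
rho = 2.0 * rng.beta(phi + 1, phi + 1, size=N) - 1.0

# omega^2: inverted gamma posterior, equivalently (T-1)s^2 / omega^2 ~ chi-square(T-1)
omega2 = (T - 1) * s**2 / rng.chisquare(T - 1, size=N)

# beta | rho, omega: normal with mean beta_hat - rho*sigma and variance omega^2/T,
# where sigma = omega / sqrt(1 - rho^2)
sigma = np.sqrt(omega2 / (1.0 - rho**2))
beta = rng.normal(beta_hat - rho * sigma, np.sqrt(omega2 / T))

# Analytic posterior variance: s^2 * (1/T + E[rho^2/(1-rho^2)]) * (T-1)/(T-3)
E_term = np.mean(rho**2 / (1.0 - rho**2))
V_analytic = s**2 * (1.0 / T + E_term) * (T - 1) / (T - 3)

print("posterior mean of beta:", beta.mean())
print("simulated variance:", beta.var(), "analytic:", V_analytic)
print("sd inflation over s/sqrt(T):", np.sqrt(beta.var()) / (s / np.sqrt(T)))
```

With these illustrative values (T = 500, $\phi$ = 100) the posterior standard deviation comes out close to 1.9 times the naive $s/\sqrt{T}$, in line with the magnitudes discussed next.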
The bottom panel of Table 1 gives a sense of the quantitative importance of this adjustment to the posterior variance. Define

$R \equiv \left(1 + T \cdot E\left[\frac{\rho^2}{1-\rho^2}\right]\right)^{1/2}$

as the ratio of the standard deviation of the posterior distribution of $\beta$ in the case where there is prior uncertainty about $\rho$, to the same standard deviation in the standard case where $\rho$ is identically equal to zero. This ratio captures the inflation of the posterior standard deviation due to uncertainty about $\rho$. This ratio can be large, particularly in cases where the sample size is large and/or when there is greater prior uncertainty about $\rho$. For example, for the case where $\phi = 100$, so that 90 percent of the prior probability mass for $\rho$ lies between -0.12 and 0.12, the posterior standard deviation is 22 percent higher in a sample size of 100, 87 percent higher when the sample size is 500, and nearly two and a half times as large in a sample of size 1000. Moving to the left in the table to cases with greater prior uncertainty about $\rho$ results in even greater inflation of the posterior standard deviation.

In summary, in this section I have shown how to incorporate prior uncertainty about the relevant exclusion restriction in a very simple OLS example. The main insight from this section is that even modest doses of prior uncertainty about the exclusion restriction can substantially magnify the variance of the posterior distribution of $\beta$. Moreover, this effect is greater the larger is the sample size, as the intrinsic uncertainty about the exclusion restriction becomes relatively more important. The results of this section will be helpful in developing results for the IV case in the following section, and the key insight regarding the role of sample size will generalize naturally.

3. The Instrumental Variables Case

I now extend the results of the previous section to the case of the linear IV regression model in which there is prior uncertainty about the validity of the exclusion restriction.
In this section I show that this type of uncertainty magnifies the posterior variance of the slope coefficients in the reduced-form version of the model, and this in turn makes the unconditional posterior distribution of the structural slope coefficient of interest more dispersed. I also show how this increase in dispersion depends on the characteristics of the observed sample.

3.1 Basic Setup

To keep things as simple as possible I focus on the particular case where the dependent variable y is a linear function of a single potentially endogenous regressor, x, and a single instrument z is available for x. The structural form of the model is:

(9)  $y_i = \beta x_i + \epsilon_i$
     $x_i = \gamma z_i + v_i$

The main parameter of interest is $\beta$, which captures the structural relationship between y and x. The parameter $\gamma$ captures the relationship between the instrument z and the endogenous variable x. For convenience I assume that, like the endogenous regressor x, the instrument z has also been normalized to have a zero mean and unit standard deviation. I assume further that the two error terms and the instrument are jointly normally distributed:

(10)  $\begin{pmatrix} \epsilon_i \\ v_i \\ z_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\epsilon^2 & \rho_{\epsilon v}\sigma_\epsilon\sigma_v & \rho\sigma_\epsilon \\ \rho_{\epsilon v}\sigma_\epsilon\sigma_v & \sigma_v^2 & 0 \\ \rho\sigma_\epsilon & 0 & 1 \end{pmatrix} \right)$

where $\sigma_\epsilon^2$ and $\sigma_v^2$ are the variances of the two error terms, and $\rho_{\epsilon v}$ and $\rho$ are the correlations of $\epsilon$ with v, and of $\epsilon$ with z, respectively.

The standard assumption used to identify the linear IV model is that the correlation $\rho$ between the instrument z and the error term $\epsilon$ is identically equal to zero. This is the exclusion restriction which stipulates that the only channel through which the instrument z affects the dependent variable y is through the endogenous variable x. When the exclusion restriction holds, it is possible to separate the regressor x into (i) an endogenous component, v, that has a potentially nonzero correlation with the error term, and (ii) an exogenous component $\gamma z$ that is uncorrelated with the error term when $\rho = 0$. This latter exogenous source of variation in x can then be used to identify the slope coefficient $\beta$.
In fact, this is precisely the intuition behind two-stage least squares (2SLS) estimation. In the first stage, the endogenous variable is regressed on the instrument z. The fitted values from this first-stage regression are used as a proxy for the exogenous component of x in the second-stage regression.

When the exclusion restriction fails to hold, the instrumental variables estimator of $\beta$ is biased, with a bias equal to $\rho\sigma_\epsilon/\gamma$. This bias is larger (in absolute value) the larger is the correlation $\rho$ between the instrument and the error term, and the weaker is the correlation between the instrument and the endogenous variable x, i.e. the smaller is $\gamma$. Standard practice is to impose the identifying assumption and proceed as if it were literally true. This approach is appealing because it ensures -- albeit purely by assumption -- that the IV estimator will be consistent for $\beta$. But in most empirical applications using non-experimental data, it is impossible to be sure that the exclusion restriction in fact holds, as it is fundamentally untestable.

Bayesian analysis of the linear IV model is most conveniently based on the reduced form of the model in Equation (9). The reduced form is obtained by substituting the second equation into the first:

(11)  $y_i = \theta z_i + u_i$
      $x_i = \gamma z_i + v_i$

where $u_i \equiv \epsilon_i + \beta v_i$ and $\theta \equiv \beta\gamma$. This latter identity allows us to retrieve the slope parameter of interest, $\beta = \theta/\gamma$, from the coefficients of the reduced-form model. This is precisely the principle of indirect least squares. In particular, in the just-identified case I consider here, the 2SLS estimator of $\beta$ is the ratio of the OLS estimators of $\theta$ and $\gamma$ from the two equations of the reduced form.
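The indirect-least-squares logic and the bias under a failed exclusion restriction are easy to verify by simulation. In the sketch below all parameter values are illustrative (none come from the paper); it draws $(\epsilon, v, z)$ with the covariance structure of (10), deliberately sets corr($\epsilon$, z) = 0.1, and confirms that the ratio of the two reduced-form OLS slopes converges to $\beta + \rho\sigma_\epsilon/\gamma$ rather than $\beta$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative structural parameters: y = beta*x + eps, x = gamma*z + v
T, beta, gamma = 200_000, 0.5, 0.6
sigma_eps, sigma_v = 1.0, 1.0
rho_ev = 0.4   # corr(eps, v): x is endogenous, so OLS of y on x would be biased
rho = 0.1      # corr(eps, z): a deliberate failure of the exclusion restriction

# Joint normal draws for (eps, v, z)
cov = np.array([
    [sigma_eps**2,                 rho_ev * sigma_eps * sigma_v, rho * sigma_eps],
    [rho_ev * sigma_eps * sigma_v, sigma_v**2,                   0.0],
    [rho * sigma_eps,              0.0,                          1.0],
])
eps, v, z = rng.multivariate_normal(np.zeros(3), cov, size=T).T

x = gamma * z + v
y = beta * x + eps

# Indirect least squares: beta_IV is the ratio of the two reduced-form OLS slopes
theta_hat = (z @ y) / (z @ z)   # y regressed on z
gamma_hat = (z @ x) / (z @ z)   # x regressed on z
beta_iv = theta_hat / gamma_hat

print("IV estimate:              ", beta_iv)
print("beta + rho*sigma_eps/gamma:", beta + rho * sigma_eps / gamma)
```

With a weaker instrument (smaller $\gamma$) the same $\rho$ produces a larger bias, as the formula indicates.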
The distributional assumptions for the structural form of the model imply the following distribution for the reduced-form errors and the instrument:

(12)  (u_i, v_i, z_i)' ~ N( 0 , [ σ_u²             ρ_uv·σ_u·σ_v    ψ·σ_u
                                  ρ_uv·σ_u·σ_v     σ_v²            0
                                  ψ·σ_u            0               1     ] )

where:

(13)  σ_u² = σ_ε² + β²·σ_v² + 2·β·ρ·σ_ε·σ_v
      ρ_uv = (ρ·σ_ε + β·σ_v)/σ_u
      ψ = η·σ_ε/σ_u

Note that ψ, the correlation between the reduced-form error u and the instrument z, is the counterpart of η, the correlation between the structural-form error ε and the instrument z. When the exclusion restriction holds exactly, η=ψ=0, and we have the standard linear IV regression model. In the next section of the paper I show how to replace this exact exclusion restriction with something weaker: a non-degenerate prior probability distribution over the correlation between the instrument and the error term.

The distribution of the reduced-form errors u and v conditional on the instrument is:

(14)  (u_i, v_i)' | z_i ~ N( (ψ·σ_u·z_i, 0)' , Ω ) ,   Ω = [ σ_u²·(1-ψ²)     ρ_uv·σ_u·σ_v
                                                             ρ_uv·σ_u·σ_v    σ_v²          ]

This in turn implies the following distribution for y and x conditional on the instrument:

(15)  (y_i, x_i)' | z_i ~ N( ((π + ψ·σ_u)·z_i, γ·z_i)' , Ω )

Let Y denote the T×2 matrix with the T observations on (y_i, x_i) as rows; let Z denote the T×1 vector containing the T observations on z_i; and recall that Z has been normalized such that Z'Z = T. Let Ω denote the variance-covariance matrix of (y_i, x_i) conditional on z_i, as above. Define the 1×2 matrix G ≡ (π : γ); let Ĝ ≡ (Z'Z)⁻¹Z'Y denote the matrix of OLS estimates of the reduced-form slope coefficients; and define S ≡ (Y - ZĜ)'(Y - ZĜ)/(T-1) as the estimated variance-covariance matrix of the residuals from the OLS estimation of the reduced-form slopes. The multivariate generalization of the likelihood function in Equation (4) is:6

(16)  L(Y, X, Z; G, Ω, ψ) ∝ |Ω|^(-T/2) · exp{ -(1/2)·tr Ω⁻¹ [ (T-1)·S + T·(Ĝ - G - (ψ·σ_u : 0))'·(Ĝ - G - (ψ·σ_u : 0)) ] }

3.2 Bayesian Analysis of the IV Regression Model

When the exclusion restriction holds exactly, i.e.
η = ψ = 0, the reduced-form model in Equation (11) becomes a standard multivariate linear regression model, in this particular case with two equations in which the dependent variable y and the endogenous regressor x are both regressed on the instrument z. Bayesian analysis of the linear IV model builds on well-established textbook results for Bayesian analysis of the multivariate regression model (for textbook treatments of the latter see Zellner (1971), Ch. 8 and Poirier (1996), Ch. 10). In particular, the multivariate regression model admits a natural conjugate prior, meaning that the prior and posterior distributions have the same analytic form. Moreover, there are analytic results providing the mapping from the parameters of the prior distribution to the parameters of the posterior distribution, which make transparent how the observed data are used to update prior beliefs.

Hoogerheide, Kleibergen and Van Dijk (2008) extend these tools to the analysis of the linear IV regression model. Their key insight is that, since there is a one-to-one mapping between the structural and the reduced-form parameters, the familiar prior and posterior distributions for the reduced-form parameters in the multivariate regression model induce well-behaved prior and posterior distributions over the structural parameters. They analytically characterize these distributions for the structural parameters for a number of particular cases, and provide an application to the Angrist-Krueger data. I follow their approach, but with a further extension to allow for prior uncertainty over the validity of the exclusion restriction.

6 See for example Zellner (1971), Equation 8.6, or Poirier (1996), Equation 10.3.12.

3.3 The Prior Distribution

I begin by specifying the same prior distribution over ψ, the correlation between the reduced-form error and the instrument, that was used in the previous section of the paper, i.e.
g(ψ) ∝ (1-ψ²)^ω over the support (-1, 1), where ω is a parameter that governs the strength of the prior belief that this correlation is zero. For the remaining parameters, I make the standard multivariate analog of the diffuse prior assumptions used for these parameters in the OLS case. In particular, define ω11 ≡ σ_u²·(1-ψ²), ω12 ≡ ρ_uv·σ_u·σ_v, and ω22 ≡ σ_v², so that

Ω = [ ω11   ω12
      ω12   ω22 ]

and let the prior distribution for the elements of Ω be proportional to |Ω|^(-3/2). This prior corresponds to the Jeffreys prior for the multivariate regression model when ψ=0. And, as in the OLS case, this choice of prior distribution ensures that the Bayesian results mimic the frequentist ones for the case where ψ=0. In this case, the posterior distribution for the reduced-form slopes is a multivariate Student-t distribution centered on the OLS slope estimates. With the further assumption that the prior distribution of the reduced-form slopes is uniform and independent of the other parameters, we have the following joint prior distribution:

(17)  g(G, Ω, ψ) ∝ |Ω|^(-3/2) · (1-ψ²)^ω

Before proceeding, it is useful to characterize the prior distribution that this implies for η, the correlation between the structural disturbance ε and the instrument. Since

η = ψ·σ_u / (σ_u² - 2·β·ρ_uv·σ_u·σ_v + β²·σ_v²)^(1/2)

the prior distribution of η will in general depend on the entire joint prior distribution of all of the structural parameters. However, since the prior distribution of the remaining parameters is chosen to be uninformative, it is straightforward to verify numerically that the distribution of η has the same shape and percentiles as the distribution of ψ.7 As a result, we can use the percentiles reported in Table 1 for the prior distribution of ψ in the OLS case to interpret the prior distributions of ψ and η in the IV case.

3.4 The Posterior Distribution

The posterior distribution for the parameters of interest is proportional to the product of the likelihood function and the prior, i.e. from applying Bayes' Rule.
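The prior g(ψ) ∝ (1-ψ²)^ω has a convenient exact sampler, which is useful for the numerical work below: if B ~ Beta(ω+1, ω+1), then ψ = 2B - 1 has density proportional to (1-ψ)^ω·(1+ψ)^ω = (1-ψ²)^ω on (-1, 1). The sketch below (a sketch of mine, with illustrative values of ω) shows how larger ω concentrates the prior around zero:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_psi_prior(omega, size, rng):
    """Exact draws from g(psi) proportional to (1 - psi^2)^omega on (-1, 1).

    If B ~ Beta(omega+1, omega+1), then psi = 2B - 1 has density
    proportional to (1-psi)^omega * (1+psi)^omega = (1-psi^2)^omega."""
    return 2.0*rng.beta(omega + 1, omega + 1, size=size) - 1.0

# Larger omega means a tighter prior belief that the exclusion restriction holds
for omega in (5, 10, 100, 500):
    psi = draw_psi_prior(omega, 100_000, rng)
    print(omega, np.percentile(psi, [5, 95]).round(3))
```

The 5th and 95th percentiles shrink toward zero as ω grows, mirroring the percentiles reported at the top of the paper's tables.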
Multiplying these two distributions and rearranging gives:

(18)  L(G, Ω, ψ | Y, X, Z) ∝ |Ω/T|^(-1/2) · exp{ -(1/2)·(G - Ĝ + (ψ·(ω11/(1-ψ²))^(1/2) : 0))·(Ω/T)⁻¹·(G - Ĝ + (ψ·(ω11/(1-ψ²))^(1/2) : 0))' }
      × |Ω|^(-(T+2)/2) · exp{ -(1/2)·tr Ω⁻¹·(T-1)·S } · (1-ψ²)^ω

This expression is just the multivariate generalization of Equation (6). The first line is proportional to a normal distribution for the matrix of reduced-form slopes, G, conditional on Ω and ψ, with mean (π̂ - ψ·(ω11/(1-ψ²))^(1/2) : γ̂) and variance-covariance matrix Ω/T. When ψ=0, we again retrieve the standard Bayesian result for the multivariate linear regression model with a diffuse prior for the reduced form of the IV regression. In particular, when ψ=0, the posterior conditional distribution of the reduced-form slopes is normal and is centered on their OLS estimates. However, when ψ is different from zero, the mean of the conditional posterior distribution for π needs to be adjusted to reflect this failure of the exclusion restriction, which induces a correlation between the instrument and the error term in the first equation of the reduced form. If this correlation is positive (negative), then intuitively, the posterior mean needs to be adjusted downwards (upwards) from the OLS slope estimator. In contrast, no adjustment is required for the conditional mean of γ, since by assumption the error term in the second structural equation is orthogonal to the instrument. The second line is the joint posterior distribution of Ω and ψ, and is again precisely analogous to the OLS case.

7 It is straightforward although tedious to compute the Jacobian of the mapping from the structural parameters to the reduced-form parameters, and use this to write down the joint prior distribution of all the structural-form parameters. It does not however appear to be tractable to extract analytically from this the implied marginal distribution of η. This is why I instead characterize this distribution numerically.
It consists of the product of an inverted Wishart distribution for Ω and the posterior distribution for ψ. The posterior inverted Wishart distribution for Ω is the multivariate generalization of the inverted gamma distribution for the error variance in the OLS case, and again it is intuitively centered on the OLS variance estimator, i.e. E[Ω] = S·(T-1)/(T-3).

As in the OLS case, the only novel part of Equation (18) is the posterior distribution for ψ, which once again is identical to the prior distribution. As before, the prior and the posterior are identical because the data are marginally uninformative about this parameter, given the prior independence between ψ and the other parameters of the model. However, since we have explicitly incorporated uncertainty about the exclusion restriction, we can explicitly average over this uncertainty when performing inference about the slope coefficients of interest.

3.5 Inference with an Uncertain Exclusion Restriction

As in the OLS case, we want to base inferences about β on its marginal posterior distribution, which is obtained by integrating all of the other parameters out of the joint posterior distribution. Again, this is unfortunately not tractable analytically and needs to be done numerically. However, we can obtain some useful insights by first studying how the distribution of the reduced-form slopes is affected by prior uncertainty about the exclusion restriction. We begin by using the law of iterated expectations to compute the unconditional posterior mean and variance of the reduced-form slopes:

(19)  E[(π : γ)] = (π̂ - √s11·B(T)·E[ψ/(1-ψ²)^(1/2)] : γ̂) = (π̂ : γ̂)

and

(20)  V[(π : γ)] = (T-1)/(T-3) · [ s11·(1/T + E[ψ²/(1-ψ²)])   s12/T
                                   s12/T                       s22/T  ]

These expressions are just the multivariate generalizations of Equations (6) and (7) in the OLS case, and the intuitions for them are identical.
Since the prior (and posterior) distribution for ψ has zero mean, the expectation in Equation (19) is equal to zero, and so the unconditional posterior mean for the reduced-form slopes is equal to their OLS estimates.

The effects on the posterior variance are substantively more interesting. As before, we see that the posterior variance of π increases due to uncertainty about the exclusion restriction. In fact, the posterior variance of π is identical to the OLS case. It consists of the usual component that declines with sample size, s11/T, as well as an adjustment capturing the variance of the adjustment to the posterior mean due to uncertainty about the exclusion restriction, s11·E[ψ²/(1-ψ²)]. The key point once again is that this adjustment does not decline with sample size, and so uncertainty about the exclusion restriction has proportionately larger effects on the posterior variance of the reduced-form slope coefficient when the sample size is large. In contrast, there is no change in the posterior variance of the slope coefficient from the first-stage regression, γ, as the exclusion restriction is not relevant to the estimation of this slope parameter.

This adjustment to the posterior variance of the reduced-form coefficient will also be reflected in the distribution of the structural-form coefficients. In particular, since β = π/γ, and since uncertainty about the exclusion restriction expands the posterior variance of π alone, we would expect to see a similar increase in the dispersion of the posterior distribution of β as well. I characterize this effect by sampling from the posterior distribution of β. In fact, since the posterior distribution of β conditional on Ω and ψ is a Cauchy-like ratio of correlated normal random variables, it is not even clear that moments of the unconditional posterior distribution of β exist.
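The size of the non-vanishing adjustment term E[ψ²/(1-ψ²)] is easy to evaluate for the (1-ψ²)^ω prior. Writing ψ = 2B - 1 with B ~ Beta(ω+1, ω+1), the term works out analytically to 1/(2ω); this closed form is my own derivation, not stated in the text, so the sketch below checks it by Monte Carlo and then traces out the implied variance inflation 1 + T·E[ψ²/(1-ψ²)] as T grows:

```python
import numpy as np

rng = np.random.default_rng(0)

omega = 10
psi = 2.0*rng.beta(omega + 1, omega + 1, size=1_000_000) - 1.0

# The adjustment term in the posterior variance of pi: E[psi^2/(1 - psi^2)].
# For this Beta-form prior it equals 1/(2*omega) analytically (my derivation).
adj = float(np.mean(psi**2 / (1.0 - psi**2)))
print(adj, 1.0/(2*omega))

# Relative inflation of V(pi): (s11/T + s11*adj)/(s11/T) = 1 + T*adj.
# The adjustment does not shrink with T, so the inflation grows with T.
for T in (100, 500, 1000):
    print(T, 1.0 + T*adj)
```

This makes the section's central point concrete: the term that embodies exclusion-restriction uncertainty is fixed at roughly 1/(2ω), so its share of the posterior variance grows linearly in the sample size.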
In general, the effects of prior uncertainty about the exclusion restriction on the posterior distribution of the structural slope coefficient of interest will be sample-dependent. This is because the posterior distribution in Equation (18) depends on the observed sample through the OLS estimates of the reduced-form slopes and residual variances, π̂, γ̂, and S. In order to give a sense of how the effects of prior uncertainty about the exclusion restriction might vary in different observed samples, I present some simple illustrative calculations for alternative hypothetical observed samples.

I begin by innocuously assuming that the observed data on y and x are scaled to have mean zero and variance one, as is z. The observed sample can therefore be characterized by three sample correlations, Ryx, Ryz, and Rxz, and the observed reduced-form slopes and residual variances can be expressed in terms of these correlations as:

(21)  (π̂ : γ̂) = (Ryz : Rxz)   and   S = [ 1-Ryz²            Ryx - Ryz·Rxz
                                           Ryx - Ryz·Rxz     1-Rxz²        ]

For each hypothetical sample summarized by a combination of assumptions on the three sample correlations, I sample from the posterior distribution of β, for a range of values of the parameter governing prior uncertainty about the exclusion restriction, ω, and for different values of the sample size, T. I take 10,000 draws from the posterior distribution of β in each case, and compute the 2.5th and 97.5th percentiles of the distribution. This is analogous to a standard frequentist 95 percent confidence interval for the IV estimate of the slope coefficient.

The results of this exercise are summarized in Table 2. Each row of the table corresponds to a set of assumptions on the observed sample correlations and the sample size. These assumptions are spelled out in the left-most columns, in italics. In each row I also report the 2.5th and 97.5th percentiles of the posterior distribution for β in the standard case where there is no uncertainty about the exclusion restriction, i.e. when η=ψ=0.
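The sampling exercise just described can be sketched in a few lines. This is my own implementation of the steps as I read them -- draw ψ from its Beta-form prior (which equals its posterior), Ω from its inverted-Wishart posterior, the reduced-form slopes from their conditional normal, and then form β = π/γ -- with function and variable names of my own; `scipy` supplies the inverted-Wishart sampler.

```python
import numpy as np
from scipy.stats import invwishart

def beta_posterior_draws(omega, T, Ryx, Ryz, Rxz, n_draws, rng):
    """Sample the structural slope's posterior under an uncertain exclusion
    restriction, for a hypothetical sample summarized by three correlations."""
    pi_hat, gamma_hat = Ryz, Rxz                      # Equation (21)
    S = np.array([[1 - Ryz**2,     Ryx - Ryz*Rxz],
                  [Ryx - Ryz*Rxz,  1 - Rxz**2]])
    draws = np.empty(n_draws)
    for i in range(n_draws):
        # 1. psi from its prior/posterior (identical, data are uninformative)
        psi = 2.0*rng.beta(omega + 1, omega + 1) - 1.0
        # 2. Omega from its inverted-Wishart posterior, mean (T-1)S/(T-3)
        Omega = invwishart.rvs(df=T, scale=(T - 1)*S, random_state=rng)
        # 3. slopes from their conditional normal; pi's mean shifts by -psi*sigma_u
        sigma_u = np.sqrt(Omega[0, 0] / (1.0 - psi**2))
        mean = [pi_hat - psi*sigma_u, gamma_hat]
        pi, gam = rng.multivariate_normal(mean, Omega/T)
        # 4. structural slope via indirect least squares
        draws[i] = pi/gam
    return draws

rng = np.random.default_rng(0)
draws = beta_posterior_draws(omega=10, T=100, Ryx=0.5, Ryz=0.5, Rxz=0.5,
                             n_draws=10_000, rng=rng)
lo, hi = np.percentile(draws, [2.5, 97.5])
print(lo, hi)   # a Bayesian analog of a 95 percent interval for beta
```

With all three correlations at 0.5, the interval brackets the indirect-least-squares point estimate π̂/γ̂ = 1, and widening ω from 10 to 500 shrinks the interval back toward the no-uncertainty benchmark.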
This serves as a benchmark. The right-most columns correspond to various assumptions about ω, corresponding to varying degrees of prior certainty about the exclusion restriction. I consider the same range of values as in Table 1, and for reference at the top of the table I report the 5th and 95th percentiles of the prior distribution of ψ (and η) that these imply. Each cell entry reports the length of the interval from the 2.5th to the 97.5th percentile of the posterior distribution of β, expressed as a ratio to the length of this same interval when η=ψ=0, i.e. relative to the standard case.

Not surprisingly, all of the entries in Table 2 are greater than one, reflecting the fact that prior uncertainty about the exclusion restriction increases the dispersion of the posterior distribution of β. This increase in posterior uncertainty regarding β is of course higher the greater is prior uncertainty regarding the exclusion restriction. Consider for example the case when all three sample correlations are equal to 0.5 and the sample size is equal to 100. When ω=10, corresponding to significant uncertainty about the exclusion restriction, the 95 percent confidence interval for β is 2.14 times larger than in the benchmark case where η=ψ=0 by assumption. However, as ω increases this magnification of posterior uncertainty is smaller, and when ω=500 the confidence intervals are just 1.03 times larger than in the benchmark case.

Unsurprisingly, Table 2 also confirms that in all cases the magnification of posterior uncertainty is greater the larger is the sample size. For example, when all three sample correlations are equal to 0.5, the confidence interval for β is inflated by a factor of 2.14 when T=100, but by a factor of 4.45 when T=500. The reason for this is the same as in Section 2 in the OLS case.
There we saw that the correction to the posterior variance of π to capture uncertainty about the exclusion restriction does not decline with sample size, and so its effect on posterior uncertainty is proportionately greater the larger is the sample size.

The more interesting insight from Table 2 is that the magnification of posterior uncertainty about β also depends on the moments of the observed sample in a very intuitive way. Consider the first panel of Table 2, where I vary the strength of the first-stage sample correlation between the instrument and the endogenous variable, Rxz, holding constant the other two correlations.8 In the standard case where η=ψ=0 by assumption, the confidence intervals of course shrink as the strength of the first-stage relationship increases. However, the magnification of the posterior variance increases as the strength of the first-stage relationship increases. The intuition for this is analogous to the intuition for the effects of sample size. A larger sample size, and also a stronger first-stage relationship between the instrument and the endogenous variable, permit more precise inferences about β. However, a larger sample size and a stronger first-stage regression cannot reduce our intrinsic uncertainty about the validity of the exclusion restriction, and so the adjustment to the posterior variance to account for this is proportionately greater. Of course this does not mean that uncertainty about the exclusion restriction is less important in an absolute sense in small samples or with weak instruments -- only that its effects on posterior uncertainty are smaller relative to other sources of posterior imprecision about the parameters of interest.

The same insight holds in the second and third panels of Table 2. In the second panel, I vary the strength of the observed sample correlation between the dependent variable and the instrument, Ryz. Since I am holding constant the other two correlations in this panel, larger values of Ryz correspond to greater endogeneity problems, and hence less precise IV estimates of the structural slope coefficient in the benchmark case where η=ψ=0 by assumption. Since varying the extent of the endogeneity problem does not affect the intrinsic uncertainty about the exclusion restriction, I find that the magnification of the confidence interval declines as Ryz increases. A similar effect occurs in the third panel, where I vary the strength of the observed correlation between y and x. Since I am holding the other two correlations constant, higher values of Rxy imply a more precisely-estimated structural relationship between these two variables. However, once again this does not affect intrinsic prior (and posterior) uncertainty about the exclusion restriction, and so the magnification of the confidence intervals increases as Rxy increases.

In summary, we have seen that prior uncertainty about the exclusion restriction can substantially increase posterior uncertainty about the key structural slope coefficient of interest, β.

8 In these examples I have chosen hypothetical samples in which we are unlikely to encounter well-known weak-instrument pathologies. In fact, the minimum correlation of 0.3 between the endogenous variable and the instrument in this table is deliberately chosen to ensure that the first-stage F-statistic is almost 10 in the smallest sample of size T=100 that I consider, and is greater than 10 in all other cases. This corresponds to the rule of thumb proposed by Staiger and Stock (1997) for distinguishing between weak and strong instruments. These weak-instrument pathologies pose no particular difficulties for Bayesian analysis that bases inference on the entire posterior distribution of β. However, with weak instruments the Bayesian highest posterior density intervals I focus on would no longer necessarily be symmetric around the mode of the posterior distribution.
The magnitude of this inflation of posterior uncertainty depends, of course, on the degree of prior uncertainty about the exclusion restriction. But it also depends on the characteristics of the observed sample in a very intuitive way. Holding other things constant, a greater sample size, a stronger first-stage relationship between the instrument and the endogenous variable, a stronger structural correlation between the dependent variable and the endogenous variable, and a weaker reduced-form correlation between the dependent variable and the instrument all imply a more precise IV estimator, absent any prior uncertainty about the exclusion restriction. However, since none of these factors helps to reduce prior (or posterior) uncertainty about the exclusion restriction, this uncertainty becomes relatively more important.

4. Empirical Applications

I next demonstrate the quantitative importance for inference of prior uncertainty about exclusion restrictions in three well-known empirical studies that use linear instrumental variables models.

Acemoglu, Johnson and Robinson (2001, hereafter AJR) study the causal effects of institutions on economic development. Using a sample of 64 former colonies, they regress the logarithm of GDP per capita on a measure of property rights protection. They propose using historical data on mortality rates experienced by settlers during the colonial period as a novel instrument for institutional quality. AJR argue that in areas where settlers experienced high mortality rates, colonial powers had few incentives to set up institutions that protect property rights and provide a foundation for subsequent economic activity. In a simple bivariate specification there are a number of obvious concerns regarding the validity of the exclusion restriction that settler mortality rates matter for development only through their effects on institutional quality.
Historical settler mortality rates might be correlated with the tropical location and intrinsic disease burden of a country, and these factors may matter directly for modern development. AJR seek to address such concerns in their paper through the addition of various control variables to capture these effects. For example, I will show results using one of their core specifications in which they control for latitude to capture such locational effects (Table 4, Column 2 in AJR). And in the paper they also present a wide range of results with direct controls for location and the disease burden.9 Nevertheless, a reader of AJR might reasonably entertain some doubts as to whether the exclusion restriction holds exactly even in these extended specifications. There are many potential correlates of settler mortality rates that might in turn be correlated with development outcomes. For example, Glaeser et al. (2004) argue that low settler mortality rates may have operated through investments in human capital rather than institutions to protect property rights. Here I do not take any stand as to which of these potential failures of the exclusion restriction is the right one. Rather, I simply argue that reasonable people might question whether the exclusion restriction holds exactly, and might entertain some probability that it is not in fact true.

9 Ideally I would like to use one of AJR's specifications with a more complete set of control variables to illustrate the effects of uncertainty about exclusion restrictions. However, in many of their specifications with more control variables, their instruments are much weaker, and I do not want to conflate my point about uncertainty regarding exclusion restrictions with the well-known concerns with weak instruments. For example, in Columns (7) and (8) of Table 4, AJR introduce continent dummies, and continent dummies together with latitude. In these specifications, I find first-stage F-statistics on the excluded instrument of 6.83 and 3.97, well below the Staiger and Stock (1997) rule of thumb of 10. This suggests that the settler mortality instrument does not have sufficiently strong explanatory power within geographic regions.

My second example is Frankel and Romer (1999, hereafter FR), who study the relationship between trade openness and development in a large cross-section of countries. They regress log GDP per capita on trade as a share of GDP. To address concerns about potential reverse causation and omitted variables, they propose a novel instrument based on the geographical determinants of bilateral trade. In particular, they estimate a regression of bilateral trade between country pairs on the distance between the countries in the pair, their size as measured by log population and log area, and a dummy variable indicating whether either country in the pair is landlocked. They then use the fitted values from this bilateral trade regression to construct a trade share for each country that reflects only these geographical determinants of trade, and use this constructed share as an instrument for trade. In their core specification, they also control directly for country size, as measured by log population and log land area, to address the problem that large countries tend to trade less, and these size variables also enter the bilateral trade equation. There are, however, various reasons why the necessary exclusion restriction (that the geographically-determined component of trade matters for development only through its effects on overall trade) may not hold exactly. For example, Rodríguez and Rodrik (2000) discuss various channels through which the geographical variables in the FR bilateral trade regression might have direct effects on per capita incomes.

My third example comes from Rajan and Zingales (1998, hereafter RZ), who study the relationship between financial development and growth.
In contrast with the previous two papers, which exploit purely cross-country variation, this paper uses a novel identification strategy that exploits within-country, cross-industry differences in manufacturing growth rates. RZ construct a measure of the dependence of different manufacturing sectors on financial services, and then ask whether industries that are more financially dependent grow faster in countries where financial development is greater. In particular, they estimate regressions of the growth rate of industry i in country j on a set of country dummies, a set of industry dummies, the initial size of the industry, and an interaction of the financial dependence of the sector with the level of financial development in the country. In a number of specifications, RZ instrument for this final interaction term with variables capturing the legal origins of the country and a measure of institutional quality, all interacted with the measure of financial dependence. In particular, I will focus on the specification in Table 4, column 6 of RZ, where the relevant measure of financial development is an index of accounting standards recording the types of information provided in the annual reports of publicly-traded corporations in a cross-section of countries.

This third example differs from the previous ones in two key respects. First, because RZ rely on within-country variation in sectoral growth rates, potential violations of the exclusion restriction are less obvious than in the previous two cases. In RZ, the requirement is that the instruments be orthogonal to the country- and industry-specific components of growth, since the regressions contain country and industry dummies. Thus, for example, the concern about the exclusion restriction is not that countries with faster growth adopt better accounting standards, but rather that countries with relatively faster growth in financially-dependent industries would adopt better accounting standards.
Nevertheless, there might be residual concerns about the validity of the exclusion restriction in this case. The second difference is that RZ use multiple instruments, while the results I present above apply to the case of a single instrument. To make the RZ results fit into the framework of this paper, I choose just one of their instruments and first reproduce the RZ results in this just-identified case. For this purpose I choose their index of the efficiency and integrity of the legal system, produced by a commercial risk-rating agency, as the one instrument. Doing so gives a result that is of comparable statistical significance to the RZ core result, although the magnitude of the estimated coefficient becomes somewhat larger than what RZ report.10

I use datasets provided by the authors to reproduce their results. In each of the three examples, I first project the dependent variable, the regressor of interest, and the instrument on all the remaining control variables that these authors treat as exogenous, so that I can identify the resulting residuals with y, x, and z in the discussion above. I also normalize the variance of z to be equal to one, consistent with the discussion above. I then take 10,000 draws from the posterior distribution of β, for alternative values of ω corresponding to varying degrees of prior uncertainty about the exclusion restriction, and compute the 2.5th, 50th and 97.5th percentiles of this distribution. Table 3 summarizes the results, with three panels corresponding to the three examples.

10 An alternative is to use just their dummy variable for Scandinavian legal origins as an instrument, which generates results that are quite similar to those reported by RZ. Conversely, using either the dummy for British or for French legal origins alone as an instrument does not deliver significant IV estimates of the coefficient on the interaction variable of interest.
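The residualization step just described -- projecting y, x, and z on the exogenous controls and normalizing z to unit variance -- can be sketched with ordinary least squares. The data below are simulated stand-ins of my own, not the AJR, FR, or RZ datasets:

```python
import numpy as np

def partial_out(controls, *variables):
    """Residualize each variable on the controls plus a constant (OLS),
    so the residuals are orthogonal to the controls by construction."""
    T = controls.shape[0]
    W = np.column_stack([np.ones(T), controls])
    out = []
    for v in variables:
        coef, *_ = np.linalg.lstsq(W, v, rcond=None)
        out.append(v - W @ coef)
    return out

# Simulated stand-in data (names and values are illustrative only)
rng = np.random.default_rng(0)
T = 200
controls = rng.normal(size=(T, 2))                 # e.g. latitude, country size
z_raw = controls @ np.array([0.3, -0.2]) + rng.normal(size=T)
x_raw = 0.5*z_raw + rng.normal(size=T)
y_raw = x_raw + controls @ np.array([0.1, 0.4]) + rng.normal(size=T)

y, x, z = partial_out(controls, y_raw, x_raw, z_raw)
z = z / z.std()                                    # normalize z to unit variance

print(np.abs(controls.T @ y).max())  # near zero: residuals orthogonal to controls
```

By the Frisch-Waugh-Lovell logic, working with these residuals is equivalent to including the controls directly in the structural and first-stage equations.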
In the first column of each panel I report the sample size and my replication of the relevant IV slope coefficient and standard error from each paper. In the remaining columns of the table I provide summary statistics on the posterior distribution for the slope coefficient, for varying degrees of prior uncertainty about the exclusion restriction. In addition, Figure 2 plots the posterior densities for the slope coefficient for selected values of ω. Unsurprisingly, in all three panels of this figure we clearly see how the posterior distribution of the slope coefficient becomes more dispersed as uncertainty about the exclusion restriction increases.

This increase in posterior dispersion is quantified in the table, which reports the 2.5th, 50th, and 97.5th percentiles of the posterior distribution of the structural slope coefficient for each of the three papers. To read this table, it is useful to begin with the last column, which reports these percentiles for the limiting case where ω tends to infinity and thus the prior distribution imposes ψ=0 with certainty. This corresponds to the standard Bayesian IV estimates in which there is no uncertainty regarding the exclusion restriction. Because of my choice of diffuse priors for all of the parameters other than ψ, when ψ=0 these Bayesian results mimic the classical ones quite closely, with these percentiles quite similar to the 95 percent confidence intervals reported in the first column. This is particularly so for RZ, while for FR and AJR the posterior distribution of the slope has a somewhat longer right tail, with the result that the 97.5th percentiles are a bit higher than the upper bounds of the classical confidence intervals. This is also apparent in Figure 2, where the thin solid line plots a normal distribution with mean and standard deviation corresponding to the classical IV slope coefficient estimate and its estimated standard error.
For RZ this normal distribution coincides almost perfectly with the posterior distribution for the slope when ψ=0, while there are some small discrepancies for the other two papers.

Moving from right to left in Table 3 illustrates the effects of greater prior uncertainty about the exclusion restriction. In each of the three panels, I summarize this increase in the dispersion of the posterior distribution by reporting the length of the interval from the 2.5th percentile to the 97.5th percentile, relative to the length of the same interval when ψ=0 with certainty. These intervals expand substantially as uncertainty about the exclusion restriction increases. For example, for FR in the middle panel, this interval is 2.8 times as wide when ω=10, while for RZ in the bottom panel it is 7.26 times as wide. This greater proportional effect on posterior uncertainty about the structural slope is consistent with what we saw in the artificial samples in Table 2, as RZ have a larger sample size and a stronger instrument than do FR. In contrast, for AJR, with their smaller sample, the increase in posterior dispersion is smaller.

Table 3 can also be used to determine how great prior uncertainty about the exclusion restriction needs to be in order for the interval from the 2.5th percentile to the 97.5th percentile of the posterior distribution of β to include zero. In the case of AJR, the particular specification that I report is the most robust to uncertainty about the exclusion restriction. Even when ω=5, so that there is a great deal of prior uncertainty, with 90 percent of the prior probability mass for ψ (and η) between -0.46 and 0.46, the 2.5th percentile of the posterior distribution of the slope remains greater than zero. This is not however the case for FR and RZ. For both, moving from ω=200 to ω=100, the 2.5th percentile of the posterior distribution of the slope falls below zero.
This in turn means that if the prior distribution of φ is such that more than 10 percent of the prior probability mass falls outside the interval of about (-0.1, 0.1), then the Bayesian analog of the 95 percent confidence interval includes zero.

5. Extensions and Conclusions

The validity of the IV estimator depends crucially on the validity of fundamentally untestable exclusion restrictions. Typically these exclusion restrictions are assumed to hold exactly in the relevant population. However, in many empirical examples it is reasonable to doubt their validity. In this paper I have shown how to explicitly incorporate prior uncertainty about the exclusion restriction into the linear IV regression model. This prior uncertainty about the exclusion restriction leads to greater posterior uncertainty about parameters of interest, in some cases quite substantially so. This enables straightforward checks of the robustness of inferences about structural parameters to varying degrees of prior uncertainty about the exclusion restriction. There are at least two natural extensions of the results presented here. The first I have already discussed: allowing the prior distribution for the correlation between the instrument and the error term to have a non-zero mean. This would encompass not only prior uncertainty about the validity of the exclusion restriction, but also prior beliefs about the direction of likely violations of the exclusion restriction. For example, one might specify a prior distribution for φ that is a translation of a beta distribution, i.e. (φ+1)/2 ~ Beta(η1, η2). With appropriate choices of the prior parameters η1 and η2, a prior such as this can capture prior beliefs regarding both the mean and the variance of φ. 
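To make the translated-beta prior concrete, the following sketch (in Python; the function names are mine, not the paper's) computes its analytic moments and draws from it. In the symmetric case η1 = η2 = η, the prior for φ has mean zero and variance 1/(2η+1), so larger values of η concentrate the prior more tightly around φ = 0.

```python
import random
import statistics

def translated_beta_moments(eta1, eta2):
    """Analytic mean and variance of phi when (phi + 1)/2 ~ Beta(eta1, eta2).

    Since phi = 2*b - 1 for b ~ Beta(eta1, eta2):
      E[phi]   = 2*E[b] - 1 = (eta1 - eta2) / (eta1 + eta2)
      Var[phi] = 4*Var[b]   = 4*eta1*eta2 / ((eta1 + eta2)**2 * (eta1 + eta2 + 1))
    """
    s = eta1 + eta2
    mean = (eta1 - eta2) / s
    var = 4.0 * eta1 * eta2 / (s ** 2 * (s + 1))
    return mean, var

def sample_phi(eta1, eta2, n, seed=0):
    """Draw n values of phi from the translated beta prior on (-1, 1)."""
    rng = random.Random(seed)
    return [2.0 * rng.betavariate(eta1, eta2) - 1.0 for _ in range(n)]

# Symmetric case: mean 0, variance 1/(2*eta + 1).
mean, var = translated_beta_moments(10, 10)
draws = sample_phi(10, 10, 50_000)
print(mean, var)              # 0.0 and roughly 0.0476
print(statistics.mean(draws)) # close to 0
```

An asymmetric choice such as η1 = 2, η2 = 6 gives a prior mean of -0.5, illustrating how directional prior beliefs about violations of the exclusion restriction can be encoded.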
Since there is no updating of the prior distribution of φ, we will have the same posterior distribution, and we can simply (numerically) integrate over this distribution to arrive at the marginal posterior distribution for the slope coefficients of interest. This will have predictable effects on the results presented here: the posterior mean of the distribution of the structural slope coefficients will need to be adjusted to reflect the non-zero prior and posterior mean for the distribution of φ, since the expectation in Equation (19) will no longer be zero. While this extension may be practically useful in many situations where there are obvious potential directions for violations of the exclusion restriction, conceptually it adds little in the way of additional insights. The second extension is to consider the case of multiple instruments and multiple endogenous variables. In this paper, I have focused on the case of a single endogenous variable and a single instrument in order to keep the results as transparent as possible. Moving to the case of multiple endogenous variables and potential overidentification poses no particular conceptual problems, although it does pose two modest practical difficulties. First, when there are multiple instruments, we need to elicit a prior distribution over the correlation between each of the instruments and the structural error term, rather than just the simple univariate prior over a single parameter that I have used here. In practice, it may be difficult to flexibly specify such a prior in a way that captures differing degrees of certainty about the exclusion restriction for each instrument. Second, in the case of overidentification, the mapping from the reduced-form parameters to the structural parameters is more complex, and therefore it is more difficult to simulate the prior and posterior distribution of the structural parameters implied by the prior and posterior distribution over the reduced-form parameters. 
Hoogerheide, Kleibergen and van Dijk (2007) provide further details on this case.

References

Acemoglu, Daron, Simon Johnson, and James A. Robinson (2001). "The Colonial Origins of Comparative Development: An Empirical Investigation." American Economic Review. 91(5):1369-1401.

Berkowitz, Daniel, Mehmet Caner and Ying Fang (2008). "Are 'Nearly Exogenous' Instruments Reliable?" Economics Letters. (article in press).

Conley, Tim, Christian Hansen and Peter E. Rossi (2007). "Plausibly Exogenous." Manuscript. Graduate School of Business, University of Chicago.

Frankel, Jeffrey A. and David Romer (1999). "Does Trade Cause Growth?" American Economic Review. 89(3):379-399.

Glaeser, Edward, Rafael La Porta, Florencio Lopez-de-Silanes, and Andrei Shleifer (2004). "Do Institutions Cause Growth?" Journal of Economic Growth. 9(3):271-303.

Hahn, Jinyong and Jerry Hausman (2006). "IV Estimation with Valid and Invalid Instruments." Annales d'Economie et de Statistique.

Hoogerheide, Lennart, Frank Kleibergen and Herman van Dijk (2007). "Natural Conjugate Priors for the Instrumental Variables Regression Model Applied to the Angrist-Krueger Data." Journal of Econometrics. 138(1):63-103.

Murray, Michael (2006). "Avoiding Invalid Instruments and Coping with Weak Instruments." Journal of Economic Perspectives. 20(4):111-132.

Poirier, Dale J. (1998). "Revising Beliefs in Nonidentified Models." Econometric Theory. 14:483-509.

Rajan, Raghuram and Luigi Zingales (1998). "Financial Dependence and Growth." American Economic Review. 88(3):559-586.

Rodriguez, Francisco and Dani Rodrik (2001). "Trade Policy and Economic Growth: A Skeptic's Guide to the Cross-Country Evidence." NBER Macroeconomics Annual. 15:261-325.

Small, Dylan (2007). "Sensitivity Analysis for Instrumental Variables Regression With Overidentifying Restrictions." Journal of the American Statistical Association. 102(479):1049-1058.

Staiger, D. and J.H. Stock (1997). "Instrumental Variables Regression With Weak Instruments." Econometrica. 
65:557-586.

Table 1: Inference in the OLS Case

                                      Value of Prior Parameter η
                                    5      10     100    200    500   1000
90% Prior Probability of φ Between:
  Lower                           -0.46  -0.34  -0.12  -0.08  -0.05  -0.04
  Upper                            0.46   0.34   0.12   0.08   0.05   0.04

Inflation of Posterior Standard Deviation of the Slope:
  T=100                            3.32   2.45   1.22   1.12   1.05   1.02
  T=200                            4.58   3.32   1.41   1.22   1.10   1.05
  T=500                            7.14   5.10   1.87   1.50   1.22   1.12
  T=1000                          10.05   7.14   2.45   1.87   1.41   1.22

Table 2: Inference in the IV Case

                                      Value of Prior Parameter η
                                    5      10     100    200    500
90% Prior Probability of φ Between:
  Lower                           -0.46  -0.34  -0.12  -0.08  -0.05
  Upper                            0.46   0.34   0.12   0.08   0.05

Assumptions on Observed Sample: width of 95% confidence interval for the slope at
the indicated value of η, relative to its width when φ=0.

Vary Strength of First-Stage CORR(x,z)
Rxy=0.5, Ryz=0.5, Rxz=0.3
  T=100 (classical 95% CI (1.00, 4.02))   1.85   1.49   1.05   1.05   1.04
  T=500 (classical 95% CI (1.32, 2.20))   4.44   3.16   1.41   1.26   1.11
Rxy=0.5, Ryz=0.5, Rxz=0.5
  T=100 (classical 95% CI (0.65, 1.52))   2.95   2.14   1.17   1.10   1.03
  T=500 (classical 95% CI (0.84, 1.19))   6.42   4.45   1.72   1.42   1.17
Rxy=0.5, Ryz=0.5, Rxz=0.7
  T=100 (classical 95% CI (0.47, 0.99))   3.28   2.41   1.22   1.13   1.05
  T=500 (classical 95% CI (0.61, 0.83))   7.19   5.00   1.86   1.48   1.23

Vary Strength of Reduced Form CORR(y,z)
Rxy=0.5, Ryz=0.3, Rxz=0.5
  T=100 (classical 95% CI (0.24, 0.99))   3.71   2.64   1.24   1.11   1.05
  T=500 (classical 95% CI (0.45, 0.76))   8.09   5.75   2.01   1.60   1.29
Rxy=0.5, Ryz=0.5, Rxz=0.5
  T=100 (classical 95% CI (0.65, 1.52))   2.95   2.14   1.17   1.10   1.03
  T=500 (classical 95% CI (0.84, 1.19))   6.42   4.45   1.72   1.42   1.17
Rxy=0.5, Ryz=0.7, Rxz=0.5
  T=100 (classical 95% CI (1.01, 2.11))   2.02   1.57   1.07   1.06   1.05
  T=500 (classical 95% CI (1.21, 1.65))   4.17   3.13   1.33   1.21   1.09

Vary Strength of Structural CORR(y,x)
Rxy=0.3, Ryz=0.5, Rxz=0.5
  T=100 (classical 95% CI (0.60, 1.63))   2.57   1.92   1.12   1.06   1.02
  T=500 (classical 95% CI (0.81, 1.23))   5.34   3.70   1.53   1.31   1.13
Rxy=0.5, Ryz=0.5, Rxz=0.5
  T=100 (classical 95% CI (0.65, 1.52))   2.95   2.14   1.17   1.10   1.03
  T=500 (classical 95% CI (0.84, 1.19))   6.42   4.45   1.72   1.42   1.17
Rxy=0.7, Ryz=0.5, Rxz=0.5
  T=100 (classical 95% CI (0.72, 1.39))   3.65   2.70   1.28   1.16   1.07
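The inflation factors in Table 1 line up, to two decimals, with the closed form sqrt(1 + T/(2η)), which follows from a symmetric translated-beta prior with variance v = 1/(2η+1), since T·v/(1-v) = T/(2η). This closed form is my reconstruction from the table entries, not a formula quoted from the paper's text; a minimal sketch checking a few entries:

```python
import math

def posterior_sd_inflation(eta, T):
    """Reconstructed inflation factor for the posterior standard deviation
    of the slope, given prior precision parameter eta and sample size T.

    With prior variance v = 1/(2*eta + 1) for phi, the factor
    sqrt(1 + T*v/(1 - v)) simplifies to sqrt(1 + T/(2*eta)).
    NOTE: reverse-engineered from Table 1, not stated in the paper.
    """
    return math.sqrt(1.0 + T / (2.0 * eta))

# Compare against a few Table 1 entries (rounded to two decimals):
for eta, T, table_value in [(5, 100, 3.32), (10, 200, 3.32),
                            (100, 500, 1.87), (5, 1000, 10.05)]:
    print(eta, T, round(posterior_sd_inflation(eta, T), 2), table_value)
```

The formula makes the comparative statics of Table 1 explicit: the precision loss grows with the sample size T and shrinks with the prior precision η, so tighter priors on φ are needed in larger samples to retain a given level of posterior precision.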
  T=500 (classical 95% CI (0.87, 1.15))   8.14   5.73   2.03   1.64   1.28

Table 3: Empirical Examples

                                      Value of Prior Parameter η
                                    5      10     100    200    500    η→∞
90% Prior Probability of φ Between:
  Lower                           -0.46  -0.34  -0.12  -0.08  -0.05   0.00
  Upper                            0.46   0.34   0.12   0.08   0.05   0.00

Acemoglu-Johnson-Robinson (2001) (Table 4, Column 2)
T=64, IV Slope = 0.96, IV Standard Error = 0.21, 95% C.I. = (0.53, 1.39)
Posterior Distribution for Slope:
  2.5th Percentile                 0.08   0.32   0.61   0.63   0.63   0.65
  Mode                             0.95   0.96   0.96   0.96   0.96   0.96
  97.5th Percentile                2.31   2.06   1.80   1.81   1.76   1.75
  Increase in P2.5-P97.5 range     2.02   1.57   1.08   1.07   1.02   1.00

Frankel-Romer (1999) (Table 3, Column 2)
T=150, IV Slope = 1.97, IV Standard Error = 0.91, 95% C.I. = (0.18, 3.76)
Posterior Distribution for Slope:
  2.5th Percentile                -5.62  -3.61  -0.28   0.01   0.14   0.31
  Mode                             1.98   1.95   1.97   1.98   1.96   1.96
  97.5th Percentile               10.73   8.66   5.37   5.04   4.83   4.69
  Increase in P2.5-P97.5 range     3.73   2.80   1.29   1.15   1.07   1.00

Rajan-Zingales (1998) (Table 4, Column 6)
T=1067, IV Slope = 0.31, IV Standard Error = 0.08, 95% C.I. = (0.16, 0.46)
Posterior Distribution for Slope:
  2.5th Percentile                -1.27  -0.82  -0.06   0.02   0.10   0.16
  Mode                             0.30   0.31   0.31   0.31   0.31   0.31
  97.5th Percentile                1.85   1.41   0.70   0.60   0.53   0.47
  Increase in P2.5-P97.5 range    10.14   7.26   2.45   1.90   1.40   1.00

Figure 1: The Prior Distribution for φ, OLS Case
[Figure: prior densities of the correlation φ between x and the error term, plotted over (-1, 1), for η = 0, 5, 10, 100, and 500.]

Figure 2: Posterior Distribution for Structural Slopes
[Figure: three panels plotting posterior densities of the structural slope for Acemoglu, Johnson and Robinson (2001), Frankel and Romer (1999), and Rajan and Zingales (1998), each showing η = 10, η = 100, the φ=0 case, and a normal approximation.]
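As a quick arithmetic check on the "Increase in P2.5-P97.5 range" rows of Table 3, the ratios can be recomputed directly from the reported percentiles; small discrepancies reflect rounding in the tabulated percentiles. A sketch:

```python
def width_ratio(p025, p975, p025_limit, p975_limit):
    """Ratio of the posterior 2.5th-97.5th percentile interval length to
    its length in the limiting phi = 0 case (last column of Table 3)."""
    return (p975 - p025) / (p975_limit - p025_limit)

# Frankel-Romer panel, eta = 10: percentiles (-3.61, 8.66) versus the
# phi = 0 interval (0.31, 4.69); Table 3 reports a ratio of 2.80.
fr = width_ratio(-3.61, 8.66, 0.31, 4.69)
print(round(fr, 2))  # 2.8

# Acemoglu-Johnson-Robinson panel, eta = 5: (0.08, 2.31) versus
# (0.65, 1.75); Table 3 reports 2.02.
ajr = width_ratio(0.08, 2.31, 0.65, 1.75)
print(round(ajr, 2))  # 2.03 (vs 2.02 in the table, a rounding artifact)
```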