Policy Research Working Paper 9629

Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements

Holger Breinlich, Valentina Corradi, Nadia Rocha, Michele Ruta, J.M.C. Santos Silva, Tom Zylkin

Development Economics, Development Research Group & Macroeconomics, Trade and Investment Global Practice
April 2021

Abstract

Modern trade agreements contain a large number of provisions in addition to tariff reductions, in areas as diverse as services trade, competition policy, trade-related investment measures, or public procurement. Existing research has struggled with overfitting and severe multicollinearity problems when trying to estimate the effects of these provisions on trade flows. Building on recent developments in the machine learning and variable selection literature, this paper proposes data-driven methods for selecting the most important provisions and quantifying their impact on trade flows, without the need to make ad hoc assumptions on how to aggregate individual provisions. The analysis finds that provisions related to antidumping, competition policy, technical barriers to trade, and trade facilitation are associated with enhancing the trade-increasing effect of trade agreements.

This paper is a product of the Development Research Group, Development Economics, and the Macroeconomics, Trade and Investment Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at h.breinlich@surrey.ac.uk, v.corradi@surrey.ac.uk, nrocha@worldbank.org, mruta@worldbank.org, jmcss@surrey.ac.uk, and tzylkin@richmond.edu.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues.
An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Machine Learning in International Trade Research – Evaluating the Impact of Trade Agreements

Holger Breinlich, Valentina Corradi, Nadia Rocha, Michele Ruta, J.M.C. Santos Silva, Tom Zylkin

KEY WORDS: Lasso, Machine Learning, Preferential Trade Agreements, Deep Trade Agreements.
JEL CLASSIFICATION: F14, F15, F17.

Research for this paper has been supported in part by the World Bank's Multidonor Trust Fund for Trade and Development. We gratefully acknowledge financial support through ESRC grant EST013567/1, and thank Scott Baier, Maia Linask, Yoto Yotov, and seminar participants at the World Bank Economics of Deep Trade Agreements Seminar Series for useful comments. Alvaro Espitia and Jiayi Ni provided excellent research assistance. The usual disclaimer applies.

Holger Breinlich: University of Surrey, CEP and CEPR. Email: h.breinlich@surrey.ac.uk
Valentina Corradi: University of Surrey. Email: v.corradi@surrey.ac.uk
Nadia Rocha: World Bank. Email: nrocha@worldbank.org
Michele Ruta: World Bank. Email: mruta@worldbank.org
J.M.C. Santos Silva: University of Surrey. Email: jmcss@surrey.ac.uk
Tom Zylkin: University of Richmond.
Email: tzylkin@richmond.edu.

1 Introduction

International trade is of vital importance for modern economies, and governments around the world try to shape their countries' export and import patterns through numerous interventions. Given the difficulties facing multilateral trade negotiations through the World Trade Organization (WTO), in the last two decades countries have increasingly turned their focus to preferential trade agreements (PTAs) involving only one or a small number of partners. At the same time, attention has shifted from the reduction of import tariffs to the role of non-tariff barriers and behind-the-border policies, such as differences in regulations, technical standards, or intellectual property rights protection. Accordingly, modern trade agreements contain a host of provisions besides tariff reductions, in areas as diverse as services trade, competition policy, trade-related investment measures, or public procurement (Hofmann, Osnago, and Ruta, 2017).

Against this background, researchers and policy makers interested in the effects of trade agreements face difficult challenges. In particular, recent research has tried to move beyond estimating the overall impact of PTAs and to establish the relative importance of individual trade agreement provisions in determining an agreement's overall impact (e.g., Kohl, Brakman, and Garretsen, 2016, Mulabdic, Osnago, and Ruta, 2017, Dhingra, Freeman, and Mavroeidi, 2018, and Regmi and Baier, 2020). However, such attempts face the difficulty that the large number of provisions, and the fact that similar provisions appear in different trade agreements, create severe multicollinearity problems, which make it very difficult to identify the effects of individual provisions. Traditional methods such as gravity regressions of trade flows on dummies for individual provisions are not able to deal with such multicollinearity. Instead, researchers have grouped or aggregated provisions in different ways.
For example, Mattoo, Mulabdic, and Ruta (2017) use the count of provisions in an agreement as a measure of its 'depth', hence implicitly giving equal weight to each measure. Dhingra, Freeman, and Mavroeidi (2018) overcome multicollinearity problems by grouping services, investment, and competition provisions and examining the effect of these "provision bundles" on trade flows.

In this paper we propose a new method to estimate the impact of individual provisions on trade flows that does not require ad hoc assumptions to aggregate individual provisions. Instead, we propose a data-driven method based on recent developments in the machine learning and variable selection literature to select the most important provisions and quantify their impact on trade flows. In doing so, we build on recent advances in variable selection methods that address difficulties arising from a key feature exhibited by trade data, namely the high degree of correlation between individual PTA provisions. We propose an extension of the Belloni, Chernozhukov, Hansen, and Kozbur (2016) approach to the case of nonlinear models with high-dimensional fixed effects, which have become standard in the analysis of trade flows in recent years (see, for example, Head and Mayer, 2014, Yotov, Piermartini, Monteiro, and Larch, 2016). In particular, we use a Poisson pseudo-maximum likelihood (PPML) version of the well-known lasso (Least Absolute Shrinkage and Selection Operator) method for variable selection (see, for example, Hastie, Tibshirani, and Friedman, 2009) and show how to choose the tuning parameter of this estimator using either a plug-in method based on Belloni, Chernozhukov, Hansen, and Kozbur (2016) or cross-validation. Notably, this requires overcoming a number of practical problems inherent in the nature of trade data, such as the nonlinearity of the underlying gravity model and the need to control for multilateral resistance and unobserved trade barriers.
We apply our method to a comprehensive data set on PTA provisions recently made available by the World Bank (Mattoo, Rocha and Ruta, 2020). Importantly, this database is very rich, to the point where the number of provision variables we consider is larger than the number of PTAs we observe in our data. In addition, due to template effects and possible synergies between groups of provisions, these provision variables can be highly correlated with one another. For these reasons, we complement our plug-in lasso results with a novel methodology that seeks to identify potentially important variables that may have been missed in the initial lasso step. As we show using simulation evidence, this new method, termed the "iceberg lasso", presents a favorable balance between the rigor of the plug-in lasso and the lenience of cross-validation methods in small-to-moderate samples where the true causal variables may be highly correlated with an unknown number of other variables in the data set. To be clear, this two-step approach does not completely answer the question of "which provisions matter most for trade?", but it does lead to substantial improvements in our ability to find the correct provision variables and narrow down the number of potential candidates in the presence of such rich data.

Our work contributes to several different literatures. Most directly, we contribute to the large and growing literature on the effects of PTAs on trade flows. This literature has been predominantly interested in estimating the overall effects of trade agreements rather than individual provisions (see, for example, Baier and Bergstrand, 2007). More recently, attention has shifted to trying to decompose the overall PTA effect and to disentangle the effects of individual trade agreement provisions.
As previously discussed, this literature often needs to make strong assumptions about the importance of individual provisions or needs to aggregate them in essentially arbitrary ways (see Mattoo, Mulabdic, and Ruta, 2017; Dhingra, Freeman, and Mavroeidi, 2018). We propose instead a novel set of methods to select the most important provisions and to quantify their impact on trade flows. To provide some headline results, our plug-in lasso results find that six provisions related to antidumping, competition policy, technical barriers to trade, and trade facilitation are associated with enhancing the trade-increasing effect of trade agreements. When we then use our iceberg lasso procedure to look beyond the "tip" of the proverbial iceberg, we subsequently identify a set of 43 provisions out of 305 provision variables in our data that may be impacting trade. For comparison, a more conventional approach based on cross-validation selects 124 provisions as being relevant and, based on our simulations, is actually less likely to include all of the "correct" provisions.

In addition, we contribute to the subset of the machine learning literature interested in variable selection. In particular, we extend and adapt existing work by Belloni, Chernozhukov, Hansen, and Kozbur (2016) to make it applicable to the context of international trade flows and trade agreements. This requires an extension to the estimation of nonlinear models with high-dimensional fixed effects using PPML. The international trade context also throws up some interesting challenges when trying to select the tuning parameter that governs the extent to which our PPML-lasso estimator penalizes coefficients on included variables and hence selects included variables. In particular, standard cross-validation methods such as k-fold or leave-one-out approaches are not feasible in practice, requiring us to propose a novel approach based on out-of-sample predictions of the effects of PTAs.
We find that the number of provisions selected when the tuning parameter is chosen by cross-validation is too large to have a meaningful interpretation and that, in contrast, the number of provisions identified when using the plug-in penalty is too small to allow us to be confident that it includes the majority of relevant provisions. The two-step method that we propose builds on the results obtained using the plug-in penalty and identifies an additional set of provisions that may have a causal effect on trade.

Finally, we contribute to a small existing literature that has used machine learning and other related methods to study the effects of trade agreements in a gravity context. For example, Regmi and Baier (2020) use an unsupervised learning method to group PTAs by textual similarity, so as to provide a more nuanced notion of PTA depth. Following a similar motivation, Hofmann, Osnago, and Ruta (2017) propose an earlier depth measure for PTAs based on principal components analysis applied to their provisions data. In contrast, Baier, Yotov, and Zylkin (2019) use a two-step methodology where pair-specific PTA effects are estimated in a first stage and then predicted out of sample using country- and pair-specific variables.

The rest of this paper is structured as follows. Section 2 presents the data on PTA provisions and provides a descriptive analysis of these data, highlighting a number of stylized facts about the provisions present in recent trade agreements. Section 3 introduces the variable selection problem in the three-way gravity model context and explains how we implement PPML-lasso estimation with high-dimensional fixed effects. It also includes simulation evidence comparing the relative performance of different lasso methods in a simplified setting with high correlation between regressors. Section 4 applies our methods to our database on PTA provisions and shows which individual provisions are the strongest predictors of trade flows. Section 5 concludes.
2 Data

Our analysis combines data on international trade flows from Comtrade with the new database on the content of PTAs that has been collected by Mattoo, Rocha and Ruta (2020). On trade, we use merchandise trade exports between 1964 and 2016 from 220 exporters to 270 importers. Country pairs without export information are treated as zeros. The database on the content of trade agreements includes information on 283 PTAs that have been signed and notified to the WTO between 1958 and 2017. The data focus on the sub-sample of 18 policy areas that are most frequently covered in trade agreements, defined as areas that are present in 20 percent or more of the agreements that have been mapped in Hofmann, Osnago, and Ruta (2017). These policy areas range from environmental laws and labor market regulations, which are covered in roughly 20 percent of the PTAs, to areas such as export taxes and trade facilitation, which are present in over 80 percent of the agreements (see Figure 1).

Figure 1: Share of PTAs that cover selected policy areas. Figure shows the share of PTAs that cover a policy area. Source: Mattoo, Rocha and Ruta (2020).

For each agreement and policy area, the database provides granular information on the specific provisions covering stated objectives and substantive commitments, as well as aspects relating to transparency, procedures, and enforcement. The coding exercise focuses on the legal text of the agreements and therefore excludes information on the actual implementation of the commitments included in the agreements.[1] To alleviate the problems caused by the high dimensionality of the data and the high level of correlation across the provisions included in the agreements, the analysis presented in this paper focuses on a sub-set of "essential" provisions.
This includes the set of substantive provisions (those that require specific integration/liberalization commitments and obligations) plus the disciplines among procedures, transparency, enforcement, or objectives that are viewed as indispensable and complementary to achieving the substantive commitments. Non-essential provisions are referred to as "corollary".[2] The share of essential provisions in the total number of provisions included in an agreement ranges from less than 10 percent for public procurement, movement of capital, and visa and asylum, to more than 50 percent for policy areas such as environmental laws and labor market regulations. Overall, the sub-set of essential provisions represents almost one-third (305/937) of the total number of provisions coded in this exercise (see Table 1).

[1] In this data set, information coming from secondary law (the body of law that derives from the principles and objectives of the treaties) has not been coded. This is of particular importance for agreements such as the EU, since most policy areas covered have used secondary law such as regulations, directives, and other legal instruments to pursue integration.
[2] The classification into essential and corollary in the database is based on experts' knowledge and, hence, is subjective.
Table 1: Distribution of essential provisions by policy area

Policy Area                                  Provisions   Essential   Share
Anti-dumping and Countervailing Duties           53          11       20.8%
Competition Policy                               35          14       40.0%
Environmental Laws                               48          27       56.3%
Export Taxes                                     46          23       50.0%
Intellectual Property Rights                    120          67       55.8%
Investment                                       57          15       26.3%
Labor Market Regulations                         18          12       66.7%
Movement of Capital                              94           8        8.5%
Public Procurement                              100           5        5.0%
Rules of Origin                                  38          19       50.0%
Sanitary and Phytosanitary                       59          24       40.7%
Services                                         64          21       32.8%
State-Owned Enterprises                          53          13       24.5%
Subsidies                                        36          13       36.1%
Technical Barriers to Trade                      34          19       55.9%
Trade Facilitation and Customs                   52          11       21.2%
Visa and Asylum                                  30           3       10.0%
Total                                           937         305       32.6%

The coverage of essential provisions also varies widely across trade agreements and disciplines, indicating that not all PTAs cover the same set of essential provisions. As shown in Table 2, more than three-quarters of agreements cover 25 percent or less of the essential provisions included in policy areas such as environmental laws, antidumping, sanitary and phytosanitary measures, and technical barriers to trade. Conversely, for policy areas such as visa and asylum, rules of origin, and trade facilitation and customs, more than 70 percent of the mapped agreements cover between 25 and 75 percent of essential provisions. With the exception of services and investment, coverage of more than 75 percent of essential provisions is rare and happens in less than 15 percent of the mapped agreements.

One important caveat regarding this data set is that it does not cover all of the trade agreements that have been in force during the period under study. Specifically, our information on provisions is limited to agreements that are in effect at present, i.e., excluding any agreements that are no longer in effect. For this reason, we drop observations associated with an agreement no longer in effect.
This means that the effects of newer agreements are identified by changes in trade relative to when that pair did not have any agreement, rather than relative to pre-existing agreements. The majority of the observations that are dropped are due to pre-accession agreements that new European Union (EU) members sign before joining the EU. Thus, to use one of these cases as an example, Italy-Croatia is included in our data for the years 1992-2000 (after Croatian independence and before the initial EU-Croatia PTA in 2001) and for the year 2016 (after Croatia joined the EU in 2013). The EU is treated differently in our analysis for this reason, as we discuss further in Section 4. To identify agreements no longer in effect, we consult the NSF-Kellogg database created by Jeff Bergstrand and Scott Baier, crosschecked with data from the WTO. The EU and the earlier European Community are treated as the same agreement for these purposes, though it is allowed to evolve as new provisions are added.

Table 2: Coverage of essential provisions by policy area

                                             Share of agreements covering:
Policy Area                                  0 to 25%   25% to 75%   over 75%
Anti-dumping and Countervailing Duties          99%          1%         0%
Competition Policy                              48%         47%         5%
Environmental Laws                              88%         12%         0%
Export Taxes                                    41%         59%         0%
Intellectual Property Rights                    76%         23%         1%
Investment                                       6%         64%        30%
Labor Market Regulations                        68%         17%        15%
Movement of Capital                             44%         42%        13%
Public Procurement                              53%         40%         7%
Rules of Origin                                  7%         93%         0%
Sanitary and Phytosanitary Measures             87%         13%         0%
Services                                         6%         62%        33%
State-Owned Enterprises                         45%         54%         1%
Subsidies                                       59%         41%         0%
Technical Barriers to Trade                     93%          7%         0%
Trade Facilitation and Customs                  21%         78%         0%
Visa and Asylum                                 27%         70%         3%

Note: Coverage ratio refers to the share of essential provisions for a policy area contained in a given agreement relative to the maximum number of essential provisions in that policy area.
Source: Mattoo, Rocha and Ruta (2020).

3 Determining Which Provisions Matter for Trade

We now outline the methodology we use to identify which PTA provisions have the largest impact on bilateral trade. To preview our approach, we will first specify a typical panel data gravity model for trade flows. Following the latest recommendations from the methodological literature (Yotov, Piermartini, Monteiro, and Larch, 2016, Weidner and Zylkin, 2020), we will use a multiplicative model where expected trade flows are given by an exponential function of our covariates of interest plus three sets of fixed effects. Drawing on this standard framework, we will then consider the estimation challenges that arise when the number of covariates (here, provision variables) is allowed to be very large. As we will discuss, it will be convenient to reformulate the usual estimation problem as a "variable selection" problem, where we suppose that many of the provisions have zero or approximately zero effect.

Bringing together these elements will require that we extend recent computational advances in high-dimensional fixed effects estimation to incorporate lasso and lasso-type penalties. It will also require that we introduce our own innovation, the iceberg lasso method, which we will motivate as providing a balance between "cross-validation" approaches that tend to select too many variables and more rigorous, "plug-in" methods that may select too few. We also include simulation evidence comparing the performance of these various methods.

3.1 The Gravity Model

Our starting point for estimation is the following multiplicative gravity model:

    μ_ijt := E(y_ijt | x_ijt, α_it, γ_jt, η_ij) = exp(x_ijt′β + α_it + γ_jt + η_ij).   (1)

Here, i, j, and t respectively index exporter, importer, and time.
Bilateral trade flows from exporter i to importer j at time t are therefore given by y_ijt, x_ijt are our covariates of interest, and α_it, γ_jt, and η_ij are, respectively, exporter-time, importer-time, and exporter-importer ("pair") fixed effects. Because of the three fixed effects, the model in (1) is often called the "three-way gravity model".

The use of the term "gravity" is most closely associated with the exporter-time and importer-time fixed effects α_it and γ_jt. Intuitively, these two fixed effects may be thought of as controlling for changes over time in the "gravitational pull" that the exporter and importer each exert on world trade flows. More formally, these fixed effects can be shown to depend on the market sizes of the two countries as well as on what Anderson and van Wincoop (2003) call "multilateral resistance", a theoretical measure of each country's connectedness to the overall trade network. The inclusion of the pair fixed effect η_ij was suggested by Baier and Bergstrand (2007), who convincingly argue that estimates of the effects of trade agreements and other similar variables would otherwise be biased due to omitted cross-sectional heterogeneity. In terms of a trade model, this omitted heterogeneity is often motivated as coming from unobserved trade costs.

An important point about (1) is that it motivates estimating the model in its original nonlinear form using PPML; see Gourieroux, Monfort and Trognon (1984). In principle, one could instead use a linear model after taking logs, but Santos Silva and Tenreyro (2006) have pointed out two main pitfalls with this approach. First, if the correct model for trade flows is given by (1), OLS estimation is consistent only if the distribution of the error term satisfies very strong conditions. Second, it cannot deal with zero trade flows. Given the exponential mean form, there are good reasons to instead estimate using PPML.
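The Santos Silva and Tenreyro (2006) point can be illustrated with a small simulation. The sketch below is our own illustration, not the paper's implementation: the data-generating process, sample size, and tolerances are arbitrary choices for demonstration. It draws data from a multiplicative model with heteroskedastic log-normal errors (so E[ε | x] = 1 but E[log ε | x] varies with x) and compares a minimal PPML estimator, implemented via iteratively reweighted least squares, with OLS on logged outcomes.

```python
import numpy as np

# Illustrative DGP: y = exp(0.5 + 1.0*x) * eps, with E[eps | x] = 1 but
# E[log eps | x] = -x, so log-OLS is inconsistent while PPML is not.
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(0.0, 1.0, n)
eps = rng.lognormal(mean=-x, sigma=np.sqrt(2.0 * x))
y = np.exp(0.5 + 1.0 * x) * eps

X = np.column_stack([np.ones(n), x])

# PPML via iteratively reweighted least squares (IRLS).
beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)
    z = X @ beta + (y - mu) / mu                      # working response
    W = mu                                            # Poisson working weights
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

# Log-linear OLS (feasible here only because all y > 0 in this DGP).
beta_ols = np.linalg.lstsq(X, np.log(y), rcond=None)[0]

# PPML recovers the true slope (1.0); the log-OLS slope converges to
# 1 + d/dx E[log eps | x] = 1 - 1 = 0 under this form of heteroskedasticity.
```

Note that in a real gravity application the regression would also need the three sets of fixed effects from (1); this stripped-down version only isolates the heteroskedasticity argument.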
Though the resulting model is nonlinear with three sets of high-dimensional fixed effects, estimation is feasible due to recent computational innovations by Correia, Guimarães, and Zylkin (2020) and others.[3] Weidner and Zylkin (2020) have recently established the consistency and asymptotic distribution of the three-way PPML estimator, and Yotov, Piermartini, Monteiro, and Larch (2016) recommend it as the workhorse method for estimating the effects of trade policies. It is frequently applied to the context of trade agreements in particular.

Having established these details, our focus is on the set of covariates, x_ijt. In most applications in the trade agreements literature, x_ijt is often either a single variable (i.e., a dummy for the presence of a trade agreement) or a minor variant thereof, such as introducing interactions with either the depth of the agreement or the bilateral characteristics of the two countries (Baier, Bergstrand, and Feng, 2014; Baier, Bergstrand, and Clance, 2018). However, a major estimation challenge that arises in our setting is that we must treat the number of provisions as being very large. With our data set, this high dimensionality, combined with the relatively small number of PTAs, leads to implausibly large and uninterpretable estimates due to multicollinearity. Furthermore, the estimated model will have poor predictive performance due to overfitting.

[3] Correia, Guimarães, and Zylkin (2020) and Stammann (2018) have each proposed algorithms for estimating nonlinear fixed effects models based on iteratively re-weighted least squares (IRLS). Heuristically, this type of algorithm exploits the linearity of the weighted least squares step in the IRLS algorithm to wipe out the fixed effects in each iteration, then uses an application of the Frisch-Waugh-Lovell theorem to update the weights, repeating until convergence. For a different approach, see Larch, Wanner, Yotov, and Zylkin (2019).
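The heuristic described in footnote 3 can be conveyed with a toy version of the algorithm. The sketch below is our own simplification, assuming two-way (exporter and importer) rather than three-way fixed effects and a single regressor: within each IRLS iteration, the fixed effects are absorbed by weighted alternating demeaning of the working response and the regressor, and the coefficient is then recovered from a small weighted regression by the Frisch-Waugh-Lovell theorem.

```python
import numpy as np

# Toy two-way gravity data: Poisson trade flows with exporter and importer FEs.
rng = np.random.default_rng(1)
n_exp, n_imp = 30, 30
i_id = np.repeat(np.arange(n_exp), n_imp)          # exporter ids
j_id = np.tile(np.arange(n_imp), n_exp)            # importer ids
x = rng.normal(size=n_exp * n_imp)
alpha = rng.normal(0, 0.3, n_exp)                  # exporter fixed effects
gamma = rng.normal(0, 0.3, n_imp)                  # importer fixed effects
y = rng.poisson(np.exp(1.0 + alpha[i_id] + gamma[j_id] + 0.5 * x)).astype(float)

def wdemean(v, w, ids):
    """Subtract weighted group means of v for one set of fixed effects."""
    return v - (np.bincount(ids, w * v) / np.bincount(ids, w))[ids]

eta = np.log(y + 1.0)                              # starting linear predictor
b = 0.0
for _ in range(60):                                # outer IRLS loop
    mu = np.exp(eta)
    z = eta + (y - mu) / mu                        # working response
    zt, xt = z.copy(), x.copy()
    for _ in range(100):                           # absorb FEs: alternating demeaning
        zt = wdemean(wdemean(zt, mu, i_id), mu, j_id)
        xt = wdemean(wdemean(xt, mu, i_id), mu, j_id)
    b = np.sum(mu * xt * zt) / np.sum(mu * xt * xt)    # FWL weighted regression
    eta = z - (zt - xt * b)                        # fitted values, FEs included
```

The `eta` update works because the fitted working response equals the observed one minus the partialled-out residual; the fixed effects themselves never need to be estimated explicitly. The production implementations cited above are far more careful about convergence checks, acceleration, and separation issues than this sketch.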
We therefore must discuss how the standard gravity estimation approach must be modified in order to deal with this additional source of high dimensionality.

3.2 Variable Selection and Gravity

The starting point for our methodological innovations is to suppose that only a handful of our provision variables have a non-negligible effect on trade flows. To be more precise, we have p = 305 essential provision variables, coded as dummies, of which a subset s < p are assumed to have non-zero effects. We do not know s beforehand, nor do we know the identities of any of the s provisions that substantively affect trade. Our goal then is to use statistical methods along with the model described in (1) in order to identify these provisions.

Because of the high dimensionality of x_ijt, experimenting with different subsets of provisions to see which has the best performance is unlikely to be fruitful. Instead, we adopt a penalized regression (or "regularization") approach that involves appending a penalty term to the Poisson pseudo-likelihood one would use to estimate the unpenalized gravity model. The idea is that the penalty term "shrinks" all estimated coefficients towards zero and forces some of them to be exactly equal to zero. The higher the penalty, the fewer the variables that are found to have non-zero coefficients and are therefore "selected". By design, the variables that are selected should be those that exert the strongest influence on the fit of the model; coefficients for variables that are not as relevant should end up being shrunk to zero completely.

Because of its computational feasibility, the most frequently used approach to this type of variable selection problem is the lasso, introduced by Tibshirani (1996). In our setting, the penalized objective function that defines the three-way PPML-lasso is

    PL(β, α, γ, η) = −(1/n) Σ_{i,j,t} (y_ijt ln μ_ijt − μ_ijt) + (λ/n) Σ_{k=1}^{p} ω̂_k |β_k|,   (2)

where the first term is the (negative of the) PPML pseudo-likelihood and the second term is the lasso penalty. Here n is the number of observations,[4] μ_ijt = exp(α_it + γ_jt + η_ij + x_ijt′β) is the conditional mean as in (1) above, and λ ≥ 0 and the ω̂_k ≥ 0 are tuning parameters that determine the penalty.

As indicated in (2), the first term in this expression is the standard PPML objective function one would minimize in order to estimate the three-way gravity model. Thus, the PPML-lasso nests PPML as a special case when λ is set to zero. The second term in (2) is a modified lasso penalty that allows for regressor-specific penalty weights, as opposed to having λ as the only tuning parameter as in the standard lasso. Intuitively, larger penalties increasingly shrink the estimated β-coefficients towards zero. The coefficients for any variables that do not sufficiently increase the likelihood are set to exactly zero, thereby giving us a way of identifying which x_ijt variables to include in the final model. For some illustration, if we consider λ → ∞, the only way to minimize PL is to set all β̂_k's equal to zero, meaning that no variables are selected. As in Belloni, Chernozhukov, Hansen, and Kozbur (2016), we will use the regressor-specific ω̂_k penalty terms to iteratively refine the model while also constructing them appropriately to reflect any heteroskedasticity and within-pair correlation featured in the data.

Importantly, the fixed effects parameters α, γ, and η are not penalized. This is mainly because there is no reason to believe that most of the fixed effects parameters are actually zero. In addition, it turns out they do not pose special issues for computation, because they do not depend on the penalty. As such, for any given β, the fixed effects can be obtained by solving their usual PPML first-order conditions from the standard unpenalized regression approach.
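The mechanics of an objective like (2) can be made concrete with a toy example. The sketch below is our own illustration, not the authors' estimator: it drops the fixed effects, uses uniform penalty loadings, and minimizes a penalized Poisson objective with a simple proximal-gradient ("ISTA") loop, whose soft-thresholding step is what forces weak coefficients to exactly zero.

```python
import numpy as np

# Toy data: 10 dummy "provision" regressors, only the first two matter.
rng = np.random.default_rng(2)
n, p = 5_000, 10
X = (rng.uniform(size=(n, p)) < 0.3).astype(float)
beta_true = np.zeros(p)
beta_true[0], beta_true[1] = 1.0, -0.8
y = rng.poisson(np.exp(0.5 + X @ beta_true)).astype(float)

lam = 0.05                      # penalty level (arbitrary here)
omega = np.ones(p)              # uniform loadings; (2) allows regressor-specific ones
step = 0.05
b0, beta = 0.0, np.zeros(p)
for _ in range(3000):           # proximal gradient on the penalized objective
    mu = np.exp(b0 + X @ beta)
    b0 -= step * np.mean(mu - y)                    # intercept is not penalized
    u = beta - step * (X.T @ (mu - y)) / n          # gradient step, pseudo-likelihood part
    beta = np.sign(u) * np.maximum(np.abs(u) - step * lam * omega, 0.0)  # soft-threshold

selected = [k for k in range(p) if abs(beta[k]) > 1e-8]
```

Raising `lam` shrinks the surviving coefficients further and eventually empties `selected`, which is the λ → ∞ limit described in the text. The paper's actual estimation instead keeps the three sets of fixed effects and handles them with the HDFE machinery discussed next.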
In practice, this means that the fixed effects can be dealt with in exactly the same manner as in Correia, Guimarães, and Zylkin (2020). More details on our computational methods are provided in the Appendix, but, basically, we use the original HDFE-IRLS algorithm of Correia, Guimarães, and Zylkin (2020) to take care of the fixed effects but replace the weighted linear regression step from that algorithm with a weighted lasso regression.[5]

3.3 Implementing the Lasso

The next question, of course, is how to determine the tuning parameters λ and ω̂_k. As a starting point, the two existing approaches we will first examine are the "plug-in" lasso of Belloni, Chernozhukov, Hansen, and Kozbur (2016) and the traditional cross-validation approach, both of which we have modified to fit the demands of the three-way PPML setting. As we will discuss, each of these methods has its strengths and weaknesses. Therefore, we will then turn to describing an extension of the plug-in lasso, termed the "iceberg lasso", that is intended to address one of the plug-in lasso's key shortcomings in this context.

[4] Naturally, the number of observations will depend on the number of countries for which we have data and on the number of years we observe them. For simplicity, we do not make that relation explicit.
[5] For the lasso regression step, we use the coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010).

Plug-in Lasso

The plug-in lasso is so named because it specifies appropriate functional forms for the penalty parameters based on statistical theory and then uses plug-in estimates for these parameters. It is therefore a relatively "theory-driven" approach to the variable selection problem, whereas cross-validation, discussed next, is a more traditional machine learning method that relies on out-of-sample prediction.
The plug-in lasso was first proposed by Belloni, Chen, Chernozhukov, and Hansen (2012), though the specific implementation we build on is the panel data lasso method of Belloni, Chernozhukov, Hansen, and Kozbur (2016), which allows for correlated errors within cross-sectional units. Without delving too much into technical details, which we defer to the Appendix, variable selection using the plug-in lasso can be thought of as involving the following three ingredients:

i. The absolute value of the score for each β_k when evaluated at β_k = 0,
ii. The standard error of the score for each β_k,
iii. Values for λ and ω̂_k set high enough so that the absolute value of the score for β_k must be statistically large relative to its standard error in order for regressor x_ijt,k to be selected.

Intuitively, the value of the score reflects the impact that a small change in β_k has on the fit of the model. When evaluated at β_k = 0, it tells us how much the fit of the model improves when we make β_k non-zero. The standard logic of the lasso is that this improvement in fit must be large relative to the penalty in order for β̂_k to be non-zero. One of the main innovations of the plug-in lasso is to allow the regressor-specific penalty ω̂_k to adjust to reflect the standard error of the score. This way, we counteract the possibility that regressors could be mistakenly selected due to estimation noise rather than because of their true impact on the model. These regressor-specific penalties play an important role in the presence of heteroskedasticity, which of course is an important feature of trade data. Since the gravity context often assumes that errors are correlated over time within pair, we take this correlation into account as well in constructing these penalty weights.

A principal advantage of the plug-in lasso is that it is very rigorous in terms of the number of variables it selects.
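In code, the three ingredients above might be assembled as follows. This is a stylized sketch in the spirit of Belloni, Chernozhukov, Hansen, and Kozbur (2016), not the authors' exact implementation: the penalty loadings are taken to be cluster-robust standard deviations of the scores (summed within pairs to reflect within-pair correlation), and the constants `c` and `gamma` are conventional choices assumed here for illustration; exact scaling conventions for λ differ across implementations.

```python
import numpy as np
from statistics import NormalDist

def plugin_penalty(X, resid, cluster_ids, c=1.1, gamma=None):
    """Illustrative plug-in lambda and penalty loadings (stylized sketch).

    X           : (n, p) regressor matrix
    resid       : (n,) residuals y - mu from a current fit
    cluster_ids : (n,) integer pair ids; scores are summed within pairs
                  to account for within-pair correlation over time
    """
    n, p = X.shape
    if gamma is None:
        gamma = 0.1 / np.log(n)                 # a common default choice
    # Within-cluster sums of the scores x_k * resid.
    S = np.zeros((cluster_ids.max() + 1, p))
    np.add.at(S, cluster_ids, X * resid[:, None])
    # Loadings: cluster-robust standard deviation of each score.
    omega = np.sqrt((S ** 2).sum(axis=0) / n)
    # Overall level: a score must exceed roughly Phi^{-1}(1 - gamma/(2p))
    # of its standard errors for the regressor to survive selection.
    lam = c * np.sqrt(n) * NormalDist().inv_cdf(1 - gamma / (2 * p))
    return lam, omega

# Toy usage with simulated data (hypothetical values throughout).
rng = np.random.default_rng(3)
n, p = 1_000, 5
X = rng.normal(size=(n, p))
resid = rng.normal(size=n)
pairs = rng.integers(0, 100, size=n)
lam, omega = plugin_penalty(X, resid, pairs)
```

In a full implementation these loadings would be refined iteratively, re-estimating the residuals after each lasso step, as described in the Appendix reference.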
As shown by Drukker and Liu (2019), the plug-in method offers superior performance versus cross-validation approaches in finite samples, in large part because these other methods tend to select too many variables. Furthermore, the "post-lasso" estimates obtained using unpenalized PPML on the covariates selected by the plug-in lasso have a "near-oracle" property that ensures they will capture the correct model if the sample is sufficiently large relative to the number of potential regressors (see Belloni, Chen, Chernozhukov, and Hansen, 2012).6

However, the plug-in method's rigor can also be a weakness. In general, it attempts to select a small number of variables that are most useful for predicting the outcome. However, in data settings where there are a substantial number of regressors that are highly correlated, as is the case with our provisions data, it is possible that the plug-in lasso will wrongly select a regressor that does not affect the outcome but is strongly correlated with another regressor that does, since either (or perhaps both) can have similar predictive value for fitting the model. We discuss this issue in more detail when we introduce the iceberg lasso method.

Cross-Validation

As an alternative to the plug-in method, we also consider a more traditional approach based on cross-validation. Under cross-validation, one repeatedly holds out some of the data and chooses λ in order to maximize the predictive fit of the model when evaluated on the held-out data. The regressor-specific φ̂_k do not play a role and are set equal to 1. Because of the size of the data and the nature of our model, implementing this approach presents some interesting challenges.

A standard implementation would be a "k-fold" approach that randomly partitions the sample into k folds and then uses k−1 subsets to estimate the parameters and the excluded one to evaluate the predictive ability of the model. To adapt this idea to our setting, we validate our model by repeatedly dropping random groups of agreements in our data, and then predicting their effects on trade out of sample, similar to the approach taken by Baier, Yotov, and Zylkin (2019). In this case, all fixed effects are always present in each practice sample, so that we can always form the necessary predictions for the omitted trade flows associated with the PTA that has been dropped.7

The main advantage of cross-validation is that it is explicitly designed to optimize predictive performance. Thus, it may offer a conceptual advantage where forecasting tasks are concerned. However, a known weakness of the standard lasso with cross-validation is that it often errs on the side of selecting too many variables that are not relevant.8 Furthermore, it does not take into account heteroskedasticity when performing the selection, and it generally does not have either an oracle or near-oracle property in large samples. For these reasons, cross-validation is not our preferred method for answering the question of which provisions matter for trade; we consider it mainly to illustrate the basic mechanics of the lasso and as a check on our plug-in results.9

6 The "oracle" property of estimators such as the adaptive lasso of Zou (2006) refers to their ability to correctly recover which parameters are zero and non-zero in a setting where the number of potential regressors is fixed and the number of observations is large. The "near-oracle" property of the plug-in lasso is similar, but its rate of convergence is slower and depends on the number of potential regressors because in the setting considered by Belloni, Chen, Chernozhukov, and Hansen (2012) the number of potential regressors is allowed to grow with the sample size.
7 It may, however, happen that some provisions are not included in the agreements used in the estimation sample. This is less likely to happen if k is large and therefore we use k = 25.
8 In linear models, tuning λ using cross-validation is analogous to selection based on the Akaike information criterion, which ensures that the probability of selecting too few variables goes to zero but does not eliminate the possibility of selecting too many. Relatedly, Drukker and Liu (2019) find that selecting λ using cross-validation also leads to the inclusion of too many regressors in Poisson regressions. In our own application, we too find that the cross-validation method selects many more provisions than the plug-in method.

The Iceberg Lasso

One important feature of the lasso is that it selects variables that are good predictors of the outcome, but these are not necessarily variables that have a causal impact on the outcome. Indeed, Zhao and Yu (2006) show that only when the so-called "irrepresentability condition" is valid can we expect the variables selected by the lasso to have a causal interpretation; the condition essentially imposes limits on the degree of collinearity between the variables with a causal effect on the outcome and the other candidate regressors. As we have noted, in the case of our data set, there is a very high degree of collinearity between some of the variables, and therefore we cannot expect the irrepresentability condition to hold. Furthermore, for the plug-in lasso especially, which tends to select a very parsimonious model, we should be worried whether the selected provisions mask the effects of a potentially more complex set of other provisions that are often included in the same agreements as the provisions that are selected. To address this important complication, we introduce what we call the "iceberg lasso". Simply put, it involves performing a subsequent set of plug-in lasso regressions in which each of the provisions selected by the plug-in PPML-lasso estimator is regressed on all of the provisions that were excluded.
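In code, this second stage amounts to one auxiliary lasso per selected provision, with the selected provision as the dependent variable and the excluded provisions as regressors. The sketch below substitutes a plain lasso with a single penalty λ for brevity, whereas our actual implementation uses the plug-in penalty in this step as well; the function names are ours.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=500):
    """Standard lasso, min_b 0.5*||y - Xb||^2 + lam*||b||_1,
    solved by coordinate descent."""
    b = np.zeros(X.shape[1])
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for k in range(X.shape[1]):
            r_k = y - X @ b + X[:, k] * b[k]              # partial residual
            rho = X[:, k] @ r_k
            b[k] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[k]
    return b

def iceberg_stage(P, selected, lam):
    """Regress each first-stage provision on all excluded provisions and
    return, for each selected provision, the bundle of correlated
    provisions that the auxiliary lasso picks up."""
    excluded = [j for j in range(P.shape[1]) if j not in selected]
    bundles = {}
    for s in selected:
        b = lasso_cd(P[:, excluded], P[:, s], lam)
        bundles[s] = [excluded[j] for j in np.flatnonzero(b)]
    return bundles
```

In the intended use, P holds the (binary) provision indicators across agreements, so each bundle collects the excluded provisions that track a selected one closely.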
The purpose of these regressions is to identify bundles of provisions that are highly correlated with the selected ones and therefore may be representable by them, in the sense of Zhao and Yu (2006). That is, each of the variables selected by the PPML-lasso with the plug-in tuning parameter may be just "the tip of the iceberg" of a bundle of variables that have a causal impact on trade, and these additional lasso regressions may help to identify these bundles. As such, the iceberg lasso may be interpreted as a data-driven alternative to the method used in Dhingra, Freeman, and Mavroeidi (2018) to construct provision bundles.10

Having described the ideas behind our methods, several further caveats are in order. First, by construction, not all of the provisions selected by the iceberg lasso can be said to have causal effects. Whether or not this is more informative than other methods that are already known to over-select regressors is an empirical matter, and the answer will depend on the application. Second, in general, we need to be very humble about the potential causal interpretation of our results. We view our approach as a statistical method to select a group of variables that is likely to include the ones most relevant to the fit of the three-way gravity model. This of course requires taking the model to be an appropriate representation of the determinants of trade. The three-way gravity model has the considerable advantage that it isolates a particular variation in the data that is empirically relevant for the study of trade agreements, namely the within-pair variation that is time-varying and independent of country-specific changes in trade. However, the initial PPML-lasso with the tuning parameter selected by the plug-in method is likely to omit relevant variables, and that obviously complicates the interpretation of those estimates. The additional step in the iceberg lasso is explicitly designed to address this latter issue and should at least partially alleviate this problem, at the cost of possibly selecting some variables that effectively have little or no impact on trade.

9 Alternatively, we could consider the adaptive lasso (Zou, 2006), which adds a second tuning parameter and is known to deliver consistent variable selection. However, we have still found that the adaptive lasso is similar to the standard lasso in that it is much too lenient and it keeps too many regressors that are not relevant.
10 Our approach complements the one adopted by Regmi and Baier (2020), who use machine learning tools to construct groups of provisions and then use these clusters in a gravity equation. The main difference between the two approaches is that Regmi and Baier (2020) use what is called an unsupervised machine learning method, which uses only information on the provisions to form the clusters. In contrast, we select the provisions using a supervised method that also considers the impact of the provisions on trade, and then add another step which can be interpreted as unsupervised learning.

3.4 Simulation Evidence

In this section we report the results of a small simulation exercise investigating the finite-sample properties of the three methods we will use to identify the set of PTA provisions that are likely to have more impact on trade flows. The simulation design we use covers a range of scenarios that, to different degrees, combine two important features of our application: a relatively small sample and a high degree of collinearity between several potential explanatory variables. The results we obtain, therefore, provide information on the performance of the different methods in conditions similar to those we face, and illustrate how these performances change when we progressively move towards less challenging environments.
In all the experiments we use n observations on a set of p potential explanatory variables; we consider cases with sample size n ∈ {250, 1000, 4000}, and set p to 5⌈√n⌉, where ⌈·⌉ denotes the ceiling function; that is, depending on the value of n, p is either 80, 160, or 320. The p potential explanatory variables are obtained as random draws from the normal distribution; the first ℓ variables are correlated with each other, and the remaining ones are independent of all other variables. The covariance matrix of the first ℓ regressors is given by U'U, where U is an ℓ × ℓ matrix in which each entry is a draw from the uniform distribution on the interval (u, 1). All regressors have zero mean and variance 1, and we perform simulations with ℓ ∈ {5, 10, 20} and u ∈ {0.0, 0.3, 0.6}.11 For all combinations of n, u, and ℓ, the dependent variable is generated as

y = exp(1 + βx1 + z + σε),

where x1 is the first of the p potential explanatory variables described above, β and σ are parameters, and z and ε are independent random draws from the standard normal distribution. The parameters β and σ determine the relevance of x1 and the signal-to-noise ratio: because gravity equations typically have an excellent fit, we set β = 0.2 and σ = 0.3, which ensures that the model has a reasonably high R2 and that the effect of x1 is neither too small (which makes its role very difficult to detect) nor too large (in which case all approaches have an excellent performance). When performing the selection of the relevant elements of the p potential explanatory variables, z is always included as a regressor whose coefficient is not penalized.

11 These values of u imply average correlations between the first ℓ variables of around 0.75, 0.91, and 0.98, respectively.
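For concreteness, one draw from this design can be generated as follows (an illustrative sketch; the function name is ours, and the rescaling of U'U implements the requirement that all regressors have unit variance):

```python
import numpy as np

def simulate_draw(n, ell, u, beta=0.2, sigma=0.3, seed=0):
    """One sample from the simulation design: p = 5*ceil(sqrt(n)) candidate
    regressors, the first ell mutually correlated via U'U with U drawn
    uniformly on (u, 1), plus the always-included regressor z."""
    rng = np.random.default_rng(seed)
    p = 5 * int(np.ceil(np.sqrt(n)))
    U = rng.uniform(u, 1.0, size=(ell, ell))
    S = U.T @ U
    d = np.sqrt(np.diag(S))
    S = S / np.outer(d, d)                       # unit variances, as in the text
    X = np.empty((n, p))
    X[:, :ell] = rng.multivariate_normal(np.zeros(ell), S, size=n)
    X[:, ell:] = rng.standard_normal((n, p - ell))
    z = rng.standard_normal(n)
    eps = rng.standard_normal(n)
    y = np.exp(1.0 + beta * X[:, 0] + z + sigma * eps)
    return X, z, y
```

With u = 0.6, for example, the average correlation among the first ℓ columns is around 0.98, in line with footnote 11.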
Therefore, in this design, x1 plays the role of the presumably small number of provisions that effectively affect trade and are correlated with others that do not, and z mimics the role of the fixed effects that explain a significant share of the variation of trade and are always included without penalty.

The selection of the relevant explanatory variables is performed using each of the three methods presented before: plug-in lasso, cross-validation lasso, and the proposed iceberg lasso, which uses the plug-in penalty in both steps. Additionally, we also perform the variable selection using the adaptive lasso of Zou (2006), with penalty chosen by cross-validation.12 Unlike the other methods we consider, the adaptive lasso has the so-called oracle property, implying that asymptotically it will choose the right set of regressors, and therefore it provides an interesting benchmark against which the performance of the other methods can be judged.13

We repeat the simulations 1,000 times, recording the number of times the variable x1 is correctly selected as a regressor, and the total number of variables selected by each method. For each of the cases considered, Tables 3 and 4 present the percentage of times the regressor x1 is selected and the average and standard error of the number of regressors selected by each method.

The results in Table 3 reveal a number of interesting patterns. For n = 250, lasso with the penalty chosen by the plug-in method (PI) is the method that most often fails to identify x1 as a relevant regressor, and its performance deteriorates quickly as u increases. The adaptive lasso (AL) performs better, but its performance is also very poor when u is high. Lasso with the penalty chosen by cross-validation (CV) provides a substantial improvement, but it also struggles for larger values of u.

12 The adaptive lasso requires a set of initial estimates; we used those obtained by the cross-validation lasso.
13 Note, however, that the plug-in lasso has a related near-oracle property.
The iceberg lasso (IL) is marginally outperformed by CV when u = 0.0, but in the more challenging cases it can have a substantial advantage over all other methods.14 The performance of all methods improves for the larger sample sizes, but the iceberg lasso maintains its advantage in the more challenging cases.

The results in Table 4 are equally interesting. In all cases considered, CV tends to lead to a high average number of selected regressors; this method also generally leads to high variability in the number of selected regressors. Remarkably, the average number of regressors picked by CV increases with n, and therefore with p, but is almost insensitive to ℓ. The average number of regressors selected by PI is always very small, and we do not see a clear pattern as n and ℓ vary. In contrast, the average number of variables selected by AL drops with the sample size, and for n = 4,000 it is always very close to 1, as we would expect from its oracle property. Finally, not surprisingly, the average number of variables selected by the IL increases with ℓ, and this is the feature that allows it to more frequently identify x1 as a relevant regressor.

14 Part of the reason why in some cases IL does not perform well is that sometimes PI selects no regressors at all, and in those cases IL cannot improve on it.
Table 3: Percentage of times x1 is selected

                  u = 0.0                   u = 0.3                   u = 0.6
 n        ℓ=5     ℓ=10    ℓ=20     ℓ=5     ℓ=10    ℓ=20     ℓ=5     ℓ=10    ℓ=20
 250  CV  95.49   96.89   97.70    82.16   82.87   79.80    56.41   47.80   37.70
      AL  93.69   95.09   95.60    76.35   75.15   71.10    47.39   37.78   28.60
      PI  85.37   82.76   81.90    67.23   60.62   54.50    41.38   32.06   21.80
      IL  94.09   93.99   92.00    90.18   86.77   83.10    79.66   71.64   60.00
 1000 CV  99.60  100.00  100.00    97.20   98.10   99.00    82.20   78.90   72.10
      AL  98.90   99.90  100.00    93.20   95.20   96.40    71.50   66.40   57.80
      PI  98.50   98.60   99.00    92.60   94.30   93.10    73.30   67.50   58.40
      IL 100.00  100.00   99.70    99.60   99.30   98.90    96.30   93.10   89.60
 4000 CV 100.00  100.00  100.00    99.80  100.00  100.00    96.20   98.30   98.60
      AL  99.80  100.00  100.00    98.50   99.50  100.00    86.60   90.70   91.80
      PI  99.80  100.00  100.00    99.10   99.90  100.00    94.50   95.20   94.50
      IL 100.00  100.00  100.00   100.00  100.00  100.00    99.80  100.00   99.30

Table 4: Average and standard error of the number of selected regressors

                  u = 0.0                   u = 0.3                   u = 0.6
 n        ℓ=5     ℓ=10    ℓ=20     ℓ=5     ℓ=10    ℓ=20     ℓ=5     ℓ=10    ℓ=20
 250  CV   8.51    9.08    8.76     8.46    9.14    8.64     8.40    8.79    8.16
          (7.69)  (7.74)  (7.62)   (7.43)  (7.60)  (7.05)   (7.42)  (7.54)  (7.07)
      AL   7.32    7.59    7.38     7.32    7.65    7.22     7.00    7.20    6.71
          (7.10)  (6.94)  (6.68)   (7.01)  (7.00)  (6.55)   (6.84)  (6.80)  (6.45)
      PI   1.33    1.59    1.85     1.40    1.68    1.98     1.26    1.43    1.52
          (0.66)  (0.84)  (1.08)   (0.72)  (0.86)  (1.10)   (0.60)  (0.73)  (0.80)
      IL   5.12    5.98    9.95     5.73    6.20   10.67     5.09    5.72    9.16
          (5.32)  (2.29)  (4.19)   (7.18)  (2.31)  (4.22)   (8.02)  (2.19)  (3.70)
 1000 CV   9.63    9.90   10.11     9.94   10.34   10.94    10.02   10.32   10.66
          (9.31)  (9.25) (10.16)   (9.39)  (9.34) (10.32)   (9.36)  (9.26)  (9.52)
      AL   4.35    4.91    4.86     5.02    5.89    6.37     5.20    6.40    6.70
          (8.13)  (9.32)  (8.86)   (8.60)  (9.63)  (9.70)   (8.63)  (9.98)  (9.61)
      PI   1.41    1.62    1.99     1.69    2.08    2.68     1.69    2.08    2.54
          (0.61)  (0.81)  (1.16)   (0.71)  (0.99)  (1.37)   (0.67)  (0.88)  (1.13)
      IL   5.37    7.16   11.86     5.98    7.22   12.99     6.55    7.23   12.94
          (5.27)  (1.98)  (4.11)   (7.20)  (2.04)  (4.03)  (11.25)  (1.91)  (3.66)
 4000 CV  10.48   10.34   10.77    10.91   10.71   11.52    11.11   11.06   12.09
         (11.52) (11.13) (11.15)  (11.57) (11.06) (11.59)  (11.54) (11.12) (11.57)
      AL   1.00    1.04    1.00     1.03    1.14    1.15     1.16    1.31    2.04
          (0.04)  (1.10)  (0.00)   (0.50)  (1.98)  (2.27)   (1.84)  (2.76)  (6.01)
      PI   1.35    1.52    1.79     1.68    2.05    2.49     1.93    2.40    3.02
          (0.57)  (0.74)  (1.03)   (0.74)  (1.01)  (1.30)   (0.80)  (1.10)  (1.30)
      IL   5.19    8.36   15.39     6.00    8.04   14.16     6.51    7.55   14.13
          (5.65)  (1.70)  (3.44)  (11.38)  (1.89)  (3.88)  (10.51)  (1.96)  (3.54)

In summary, for very large samples, the adaptive lasso with penalty parameter selected by cross-validation is the preferred method; this is justified both by our simulation results and by its oracle property. However, for small to medium samples, and especially with high correlation between potential explanatory variables, the adaptive lasso is outperformed by other methods. In these cases, the choice of method depends on whether we favor selecting the relevant regressors or having a parsimonious model. If parsimony is paramount, the lasso with penalty parameter selected by the plug-in method is difficult to beat. However, if selecting the relevant regressor is important, the iceberg lasso is a safe bet and the best method overall. This is particularly the case if the relevant variable is highly correlated with other potential controls, because in that case the iceberg lasso outperforms the adaptive lasso even for the larger samples considered in our experiments.

These results, which confirm and extend the findings of Drukker and Liu (2019), have important implications for our work. Given that in our application we only have data on 283 trade agreements,15 we cannot expect any of the methods considered to be able to precisely identify the set of provisions that matter for trade. The task of identifying the correct set of explanatory variables is particularly challenging in our application because many of the provisions have very strong correlations with others, and there are even cases of perfect collinearity.
In this challenging context, the iceberg lasso emerges as providing a good compromise between parsimony and the ability to identify the relevant variables. It consequently is our preferred approach.

4 Lasso Results

In this section, we present our lasso results obtained using the methods described in the previous section. We first present results for the plug-in method before briefly discussing the results obtained using cross-validation. We then turn to the iceberg lasso results, which themselves are based on provisions selected by the plug-in method.

4.1 Plug-in Lasso Results

Table 5 presents results for the plug-in lasso and post-lasso regressions discussed before. In column (1), we start by presenting the results of a traditional PPML estimation with a dummy for the presence of a preferential trade agreement between the trading partners. This shows that we can replicate the usual finding that PTAs lead to a significant increase in trade flows in our data. Specifically, we find that the PTAs in our data increase trade by exp(0.130) − 1 = 13.8%. Column (2) then shows the results of our first-step lasso regression, showing only the coefficients that the lasso finds to be non-zero. In a subsequent step, we then estimate a "post-lasso" PPML regression— a standard PPML regression using only the provisions that were selected by the lasso in the first step.

15 Note that the information on the effect of the different provisions is limited by the relatively small number of PTAs that are observed. Therefore, despite having a large number of observations, we effectively only have a small sample to identify the effect of the different provisions.

Table 5: PPML, PPML-lasso, and post-lasso PPML results for plug-in approach
Dependent variable: Bilateral Trade Flows (1964-2016, every 4 years)

                                       PPML    Lasso   Post-    PPML    PPML    Lasso   Post-
                                                       lasso                            lasso
                                        (1)     (2)     (3)      (4)     (5)     (6)     (7)
PTA                                    0.130                    0.030   0.083
                                      (0.038)                  (0.054) (0.038)
EU                                                                      0.688   0.416   0.589
                                                                       (0.065)         (0.084)
AD14. Anti-dumping – Material Injury           0.172   0.313    0.303           0.188   0.343
                                                      (0.114)  (0.116)                 (0.105)
CP23. Competition Policy –
  Transparency / Coordination                  0.031   0.075    0.078           0.011   0.046
                                                      (0.056)  (0.056)                 (0.054)
SUB12. Subsidies – Discipline (general)        0.008   0.099    0.108
                                                      (0.052)  (0.055)
TBT provisions:
TBT2. Mutual Recognition                       0.084   0.073    0.068
                                                      (0.093)  (0.094)
TBT7. Technical Reg's: use
  International Standards                      0.034   0.111    0.121           0.055   0.106
                                                      (0.080)  (0.082)                 (0.077)
TBT33. Standards: use Regional
  Standards                                    0.067   0.046    0.050           0.039   0.039
                                                      (0.066)  (0.066)                 (0.051)
Trade Facilitation provisions:
TF41. Harmonization and Common
  Legal Framework                                                               0.038   0.550
                                                                                       (0.126)
TF42. Customs and Other Duties
  Collection                                   0.227   0.354    0.352
                                                      (0.121)  (0.121)
TF45. Issuance of Proof of Origin              0.022   0.076    0.096           0.016   0.079
                                                      (0.029)  (0.043)                 (0.028)

Notes: Gravity estimates are obtained using PPML with exporter-time, importer-time, and exporter-importer fixed effects. The number of observations is 194,092. Columns labelled "Post-lasso" report PPML coefficients for all variables selected by a plug-in lasso method in a prior step. The difference between columns 2-3 and 6-7 is that the latter include the EU dummy in the lasso step as a possible predictor to be selected. All other columns report further experiments using PPML. Cluster-robust standard errors are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.

Using the plug-in approach, the lasso selects a small number of trade agreement provisions related to anti-dumping, competition policy, domestic subsidies, technical barriers to trade (TBT), and trade facilitation. Broadly speaking, these variables can all be rationalized as having intuitive effects on trade. The selected anti-dumping, competition policy, and subsidy provisions all create more certainty as to how disciplinary investigations and proceedings will be carried out in these various policy areas. This increased certainty may increase entry by foreign exporting firms.
The inclusion of provisions related to technical barriers to trade and trade facilitation is likewise intuitive, but the selection of TF45, which facilitates obtaining certificates of origin, seems of particular note in that it highlights the costs of complying with rules of origin.

The corresponding post-lasso PPML results, shown in column (3), find that some of the selected provisions have large effects when estimated in the conventional way. For example, the inclusion of anti-dumping provision AD14, which requires that anti-dumping proceedings establish "material injury" to domestic producers, is associated with an increase in trade flows of about 36.8% (exp(0.313) − 1 = 0.368). Even larger effects are found for having trade facilitation provisions that regulate customs and other duties collection (TF42), which has an estimated effect of 42.5% (exp(0.354) − 1 = 0.425). Interestingly, not all of the provisions selected by the lasso step are found to be statistically significant in the post-lasso step. This apparent contradiction arises for two reasons. First, the lasso focuses on the implications for model fit when a variable is not included, which is not the same as testing whether its coefficient is statistically different from zero. Second, because the lasso shrinks all coefficients towards zero simultaneously, it reduces the influence of the collinearity between them and can allow individual provisions that are not significant in the conventional regressions to speak more loudly.

In column (4), we re-estimate the model using the same covariates as column (3) but now re-add our original PTA dummy from column (1). In this case, the coefficient on PTA captures any effect on trade flows that is not already captured by the 8 provision variables that were selected.
With this in mind, we take the insignificant and near-zero coefficient on PTA in column (4) as an encouraging indication that the selected provisions completely explain the average PTA effect estimated in column (1).

Next, column (5) returns to our original simple model from column (1) but adds a second dummy variable for the EU agreement. Our reasons for treating the EU separately from other agreements are three-fold. First, we suspect that not all of the EU's efforts to promote trade are captured in how their provisions variables are coded in our data. There could also be unobserved effects that are channeled through the EU's secondary law process, in which the EU's governing institutions are empowered to pass new regulations and directives on an ongoing basis. Second, our provisions data do not include agreements that are no longer in effect. For the most part, the agreements that cannot be included are EU pre-accession agreements, which obviously are subsumed by the EU agreement once each new member joins the EU. As discussed in Section 2, we deal with this data issue in practice by dropping all observations associated with obsolete agreements. Nonetheless, this could lead to biased estimates of the EU agreement and the provisions associated with it. Third, the EU has in place six of the eight provisions selected in column (2) (all except AD14 and TBT7); thus, we want to make sure we are not simply picking up an "EU effect" in the provisions that are selected.

As the PPML results in column (5) show, the estimated EU effect is large, several times that of non-EU PTAs in fact. However, the more important exercise is in column (6), where we now treat the EU as a possible predictor in the lasso. Because the EU is indeed selected as being an important predictor of changes in trade flows, the value of this exercise is that the selection of other predictors is solely based on information from agreements other than the EU.
Consequently, the set of provision variables selected by the lasso is now slightly different than in column (2), adding TF41 (which calls for harmonization of customs procedures) but losing TBT2, SUB12, and TF42. Notably, the post-lasso estimates in column (7) find TF41 to be highly significant both statistically and economically, with an estimated effect of exp(0.550) − 1 = 73.3%. Given the possible issues with the EU we have outlined, this last set of provision variables is our preferred set to work with in the subsequent iceberg lasso analysis.

4.2 Cross-Validation Lasso Results

As discussed previously, the plug-in approach to choosing λ is conservative, in the sense that it tends to choose a relatively small set of regressors and may fail to pick the "correct" regressors. For comparison, we now discuss the choice of regressors when we use the cross-validation approach. Figure 2 shows how the out-of-sample mean square error (MSE) varies with the log of the tuning parameter, which is scaled by Σ_ijt y_ijt so that the results do not depend on the scale of the data. The out-of-sample MSE decreases as λ is increased and then increases again, with a minimum reached initially at λ/Σ_ijt y_ijt = 0.00025.

Figure 2: Cross-validation MSE vs. tuning parameter

For more illustration, Figures 3 and 4 show the corresponding regularization paths for selected provisions. That is, the figures show how the value of the estimated (post-lasso) coefficient on the selected provisions changes as we vary λ. As expected, fewer provisions are selected as we increase λ.
At the optimal value of λ/Σ_ijt y_ijt = 0.00025, our cross-validation approach selects 124 provisions to have non-zero effects, which is many more than what we found using our plug-in approach.16 Note, however, that it is not necessarily the case that the set of provisions selected at lower levels of λ includes the set of provisions selected at higher levels. For example, Figure 3 shows that provision AD14, which was one of the provisions selected by our plug-in approach, is only selected for higher values of λ. Intuitively, as we lower λ, more provisions are selected and some of these are correlated with provision AD14. This then implies that adding AD14 itself does not lead to significant improvements in out-of-sample forecasts during cross-validation and hence it is no longer selected. It is only when the provisions correlated with AD14 are purged from the model as λ increases that AD14 on its own gains predictive power and is included. That said, for higher values of λ, we generally see a close correspondence between the results along the regularization path indicated in Figures 3 and 4 and those that we found earlier using the plug-in method.

Overall, Figures 3 and 4 show that our two approaches to selecting λ lead to very different sets of trade agreement provisions being selected. While some provisions, such as CP23 or SUB12, are selected by both approaches, others, such as AD14, are only selected by the plug-in method, and many provisions are only selected using cross-validation, such as anti-dumping provision AD05. Furthermore, we also see in Figures 3 and 4 that many of the estimated effects for the provisions that are selected are too large in absolute magnitude to be plausible when interpreted on their own. These observations reflect the known shortcomings of the cross-validation approach that we stated earlier and found support for in our simulations.
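For reference, the agreement-level fold construction behind these cross-validation results (dropping whole agreements rather than individual observations, with k = 25) can be sketched as follows. The function names and the generic fit/predict interface are ours, standing in for the penalized PPML estimator.

```python
import numpy as np

def agreement_folds(agreement_id, k=25, seed=0):
    """Assign each agreement (not each observation) to one of k folds, so
    that every held-out set consists of whole agreements and all fixed
    effects remain estimable in the retained sample."""
    rng = np.random.default_rng(seed)
    ids = np.unique(agreement_id)
    fold_of_id = dict(zip(ids, rng.permutation(len(ids)) % k))
    return np.array([fold_of_id[a] for a in agreement_id])

def cv_mse(lambdas, folds, fit, predict, y):
    """Held-out mean squared error for each candidate penalty. `fit` and
    `predict` are placeholders for the penalized estimator and its
    out-of-sample predictions for the dropped agreements."""
    mse = []
    for lam in lambdas:
        errs = []
        for f in np.unique(folds):
            hold = folds == f
            model = fit(lam, ~hold)              # estimate without fold f
            errs.append(np.mean((y[hold] - predict(model, hold)) ** 2))
        mse.append(np.mean(errs))
    return np.array(mse)
```

The chosen penalty is then the value of λ that minimizes the held-out MSE.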
4.3 Iceberg Lasso Results

As previously mentioned, we cannot be certain whether the variables selected by the lasso have a causal effect on trade, or are simply highly correlated with the variables that have a causal effect. In this section, we investigate this issue further by carrying out the iceberg lasso analysis we proposed earlier. That is, for each of the provisions from our preferred set of estimates (those from the last column of Table 5), we run an additional plug-in lasso regression where we regress each selected provision on all of the provisions excluded by our first-stage lasso. As discussed, the purpose of these auxiliary regressions is to construct bundles of provisions that, at least when combined together, are likely to have a causal impact on trade flows when included in trade agreements. As we have noted, the reader should be cautioned that we will not be able to say with high certainty whether a given provision is important for promoting trade but, as we will see, this method gives us significantly increased parsimony versus instead relying on cross-validation. Furthermore, as we have seen from our simulations, it should also give us more confidence in the results.

16 In each panel of the figure, the second-to-last set of estimates corresponds to the 124 variables selected by the cross-validation method.

Figure 3: Regularization path for selected provisions (AD, ET, CM, STE, SUB, ENV, LM, and MIG)

Figure 4: Regularization path for selected provisions (IPR, TBT, SPS, SER, ROR, TF, INV, MOC, and PP)

Table 6 presents the results of our iceberg lasso analysis. The first two rows of Table 6 list each of the six provisions selected by the first-stage plug-in lasso when the EU dummy is included, as well as their estimated impact on trade flows from column (6) of Table 5.
The subsequent rows of Table 6 report all provisions that were not selected by the lasso in the first step but are identified in the second step of the iceberg lasso; we also report the correlation of each of these provisions with the selected provision in the first row. Finally, the last row reports the R2 of the regression of each selected provision on the corresponding correlated provisions. For example, column (1) shows that antidumping provision AD14 is highly correlated with two further antidumping provisions (AD06 and AD08) as well as with one provision on environmental protection (ENV42);17 the R2 of the regression of AD14 on these three provisions is 0.95.

Table 6: Iceberg lasso results

   (1) AD14     (2) CP23     (3) TBT07    (4) TBT33    (5) TF41     (6) TF45
    (+41%)      (+4.7%)      (+11.2%)      (+4%)       (+73.3%)     (+8.2%)

AD06 (0.97)   AD06 (0.46)   AD06 (0.54)   AD06 (0.48)   AD05 (0.89)   AD11 (0.09)
AD08 (0.97)   AD08 (0.46)   AD08 (0.54)   AD08 (0.48)   CP15 (0.73)   ENV42 (0.97)
CP22 (0.78)   ENV42 (0.54)  AD12 (-0.11)  ET03 (0.51)   CP24 (0.89)   ENV44 (0.06)
ENV42 (0.48)  SUB10 (0.25)  ENV42 (0.46)  SPS21 (0.23)  ENV44 (-0.01) SUB11 (0.28)
ET41 (0.16)   SUB07 (0.08)  INV24 (0.11)  TF44 (0.98)   IPR42 (-0.00) TBT15 (0.73)
IPR71 (-0.08) IPR55 (-0.01) TBT34 (0.94)  IPR103 (-0.11) IPR63 (-0.00) IPR107 (-0.12)
IPR74 (-0.01) MOC26 (-0.10) PP08 (0.08)   SPS21 (0.19)  SPS21 (0.17)  SUB04 (-0.11)
STE31 (0.57)  SUB07 (0.07)  TBT02 (0.56)  TBT05 (0.61)  TBT15 (0.37)  TBT06 (0.98)
TBT29 (0.56)  TBT15 (0.69)  TF42 (0.56)   TBT32 (0.61)  TF44 (0.38)   TBT34 (0.53)

R2:  0.95         0.83         0.89         0.97         0.80         0.96

Notes: The table shows PTA provisions associated with increases in bilateral trade flows (row 1), together with the estimated increase in trade flows (row 2), as well as other provisions that predict the provision in row 1 (rows 3-20; numbers in brackets are raw correlations with the provision from row 1). The last row displays the R2 of the regression of each selected provision on the corresponding correlated provisions.
The results in Table 6 show that the iceberg lasso identifies 43 provisions that are likely to be associated with increased trade. This finding contrasts with the 124 provisions identified by the cross-validation lasso, and the 6 provisions selected by the plug-in lasso. Therefore, as in the simulations in the preceding section, the iceberg lasso appears to provide a good compromise between the cross-validation lasso, which selects so many provisions that it is difficult to interpret its results, and the plug-in lasso, which is likely to miss important provisions.17

17 In our data, ENV42 is perfectly collinear with AD06 and AD08.

As noted above, we find that provision AD14 is correlated with other antidumping provisions; this correlation is not surprising because all these provisions fulfill a similar purpose, which is to increase transparency in the use of antidumping duties. In that sense, one conclusion to be drawn from this exercise is that antidumping provisions are likely to increase trade flows, although we cannot say which of them has the biggest effect. Table 6 shows that, more surprisingly, AD14 is also strongly correlated with ENV42. This correlation seems to be due to what might be called a template effect, that is, the tendency of important trading blocs such as the EU and the US to use similar provisions in all their agreements. For example, most agreements signed by the EU include provisions on antidumping and the environment, hence leading to a high correlation between the corresponding provisions in our data.

Template effects may also be important for understanding the variables highly correlated with the selected TBT provisions, TBT07 and TBT33. Indeed, some of the same anti-dumping and environmental provisions that were found to be correlated with AD14 show up here as well (AD06, AD08, ENV42). That said, the strongest correlations in these cases are with other TBT provisions such as TBT06, TBT15, and TBT34.
This is not surprising, as these provisions also relate to the use of international standards. Thus, it seems likely that provisions encouraging the use of international standards in the area of technical barriers to trade are behind the trade increases associated with provisions TBT07 and TBT33, although we cannot say which of the individual TBT provisions is driving the observed effect.

The lasso also selects two provisions that reduce the administrative burden resulting from compliance with rules of origin and other customs procedures (TF41 and TF45), which are estimated to have a very large trade-increasing effect (over 70% for TF41). Table 6 also indicates that other trade facilitation provisions are correlated with some of the provisions selected by the lasso; this is true both for TF45 and CP23. Thus, our results suggest that trade facilitation procedures are likely to be associated with significant trade flow increases.

Finally, we find that provision CP23, which serves to promote transparency in competition policy, is correlated with some of the previously identified types of provisions, as well as with two further provisions on competition policy (CP22 and CP24). Thus, it seems likely that the presence of provisions on competition policy is behind the observed trade-increasing effect of CP23, although we are again unable to say which provision exactly is driving this effect.

The iceberg lasso also identifies provisions from other areas that help predict the provisions identified in the first step. For example, provisions in policy areas such as intellectual property rights and sanitary and phytosanitary measures are related both to CP23 and TBT33, but these types of provisions are associated with smaller raw correlations. By the logic of the lasso, it is likely that these provisions are informative for predicting the presence of CP23 and TBT33 in a relatively small number of agreements where other provisions with higher raw correlations are not found.
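The two-step procedure underlying these results can be sketched in a few lines. The example below uses synthetic 0/1 provision dummies with hypothetical names (it is not the authors' data or code, and it uses a generic cross-validated-style lasso penalty rather than the plug-in penalty): each provision selected in the first step is lasso-regressed on the excluded provisions, and the selected correlates and the R² of that regression are reported, analogous to the columns of Table 6.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-in for the PTA provision data: rows are agreements,
# columns are 0/1 provision dummies (all names here are hypothetical).
n_agreements, n_provisions = 300, 40
X = (rng.random((n_agreements, n_provisions)) < 0.3).astype(float)
provision_names = [f"P{k:02d}" for k in range(n_provisions)]

selected = ["P03", "P17"]      # stand-ins for the first-stage lasso picks
excluded = [p for p in provision_names if p not in selected]
excl_idx = [provision_names.index(p) for p in excluded]

for prov in selected:
    y = X[:, provision_names.index(prov)]
    Z = X[:, excl_idx]
    fit = Lasso(alpha=0.05).fit(Z, y)     # second-step ("iceberg") lasso
    picked = [excluded[k] for k in np.flatnonzero(fit.coef_)]
    r2 = fit.score(Z, y)                  # R², as in the last row of Table 6
    print(prov, picked, round(r2, 2))
```

In the paper's setting the second step also uses the plug-in penalty; the reported bundles are then the union of each first-step provision with its selected correlates.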
In summary, although it is not possible to identify with certainty which provisions are most important for increasing trade, our results allow us to find a relatively small bundle of provisions that are likely to have the desired effect. In particular, provisions related to TBTs, antidumping, trade facilitation, and competition policy are likely to enhance the trade-increasing effect of trade agreements.

5 Conclusions

In this paper, we have proposed new methods for assessing the impact of individual trade agreement provisions on trade flows. While other work in this area has relied on summary measures of agreement depth or on specific provision bundles of interest, our approach is instead to study the rich provision content of PTAs as a variable selection problem. By combining the three-way PPML estimator that is popular in the study of PTAs with lasso methods for variable selection, we are able to identify which of the many provisions in our data set should be treated as relevant for affecting trade flows. Using our preferred method, a two-step "iceberg lasso" approach, we identify a relatively parsimonious set of 43 provisions that are most likely to impact trade. While these 43 provisions span a range of policy areas, our results generally support the conclusion that a select number of provisions related to anti-dumping, competition policy, technical barriers to trade, and trade facilitation are most effective at promoting trade as compared to other types of provisions that appear in PTAs.

Interpreting these results requires some important caveats. Our preferred method may fail to discover important trade-promoting provisions, and it is almost certain to lead to the inclusion of provisions that are not relevant. At present, we are not able to quantify either type of uncertainty. Developing metrics that can be used to guide researcher confidence represents an important avenue for future research.
In terms of broader applications, our methods are not limited to PTAs or even to trade. There are many other contexts in which the iceberg lasso method we have introduced could be a helpful tool for any researcher wishing to determine which of a large number of variables are most relevant for an outcome. Furthermore, by integrating the lasso into a nonlinear model with high-dimensional fixed effects, we show how variable selection and other related machine learning approaches can be applied in much more general settings than was previously possible.

Appendix: Provisions list

Table A1: Provisions selected by the iceberg lasso

Anti-dumping
AD05 Export price less than comparable price when destined for consumption in the exporting country
AD06 If there are no sales in the normal course of trade in the domestic market of the exporting country
AD08 Cost of production in the country of origin plus a reasonable amount
AD11 Price effects of dumped imports
AD12 The consequent impact of dumped imports on the domestic industry
AD14 Requirement to establish material injury to domestic producers

Competition Policy
CP15 Does the agreement prohibit/regulate cartels/concerted practices?
CP22 Does the agreement contain provisions that promote predictability?
CP23 Does the agreement contain provisions that promote transparency?
CP24 Does the agreement contain provisions that promote the right of defense?

Environmental Laws
ENV42 Does the agreement require states to comply with the UN Conference on Environment and Development?
ENV44 Does the agreement require states to comply with the International Energy Program?

Export Taxes
ET03 Prohibits new export quotas/quantitative restrictions between the parties
ET41 Prohibits non-tariff measures related to export of goods

Investment
INV24 Does the FET clause prohibit arbitrary, unreasonable or discriminatory measures?
Intellectual Property Rights
IPR42 Prohibits requiring the recording of a trade mark license to establish license validity or as a condition for use
IPR55 Requires patent be made available for new processes of a known product
IPR63 Requires a period of sui generis protection for patents
IPR71 Requires system for protection of industrial designs
IPR74 Seeks to improve industrial design systems
IPR103 Stipulates practices to be followed by collective management organizations
IPR107 Patent Law Treaty (2000)

Table A1 (cont'd): Provisions selected by the iceberg lasso

Movement of Capital
MOC26 Does the transfer provision explicitly exclude "good faith" and non-discriminatory application of its laws related to prevention of deceptive and fraudulent practices?

Public Procurement
PP08 Does the agreement contain explicit provisions on MFN treatment of third parties?

Sanitary and Phytosanitary Measures
SPS21 B. Risk Assessment: Is there reference to international standards/procedures?

State-Owned Enterprises
STE31 Does the agreement prohibit anti-competitive behavior of state enterprises?

Subsidies
SUB04 Does the agreement prohibit or regulate local-content subsidies?
SUB07 Does the agreement introduce any ceiling to permitted subsidies?
SUB10 Does the agreement include any specific regulation of fisheries subsidies?
SUB11 Does the agreement include any specific discipline for public services?
SUB12 Does the agreement include any other specific discipline for certain sectors or objectives?

Technical Barriers to Trade
TBT02 B. Technical Regulations - Is mutual recognition in force?
TBT05 B. Technical Regulations - Are there specified existing standards to which countries shall harmonize?
TBT06 B. Technical Regulations - Is the use or creation of regional standards promoted?
TBT07 B. Technical Regulations - Is the use of international standards promoted?
TBT15 C. Conformity Assessment - Is the use of international standards promoted?
TBT29 A. Standards - Is mutual recognition in force?
TBT32 A. Standards - Are there specified existing standards to which countries shall harmonize?
TBT33 A. Standards - Is the use or creation of regional standards promoted?
TBT34 A. Standards - Is the use of international standards promoted?

Trade Facilitation and Customs
TF41 Does the agreement require customs harmonization and a common legal framework?
TF42 Does the agreement regulate customs and other duties collection?
TF44 Do trade facilitation provisions simplify requirements for proof of origin?
TF45 Do trade facilitation provisions simplify procedures to issue proof of origin?

More Details on HDFE-PPML-Lasso Estimation

The minimization problem that defines the three-way PPML-lasso is

$$(\hat{\beta}, \hat{\alpha}, \hat{\gamma}, \hat{\eta}) := \arg\min_{\beta,\alpha,\gamma,\eta}\; \frac{1}{n}\sum_{i,j,t} \exp\!\left(x_{ijt}'\beta + \alpha_{it} + \gamma_{jt} + \eta_{ij}\right) - \frac{1}{n}\sum_{i,j,t} y_{ijt}\left(x_{ijt}'\beta + \alpha_{it} + \gamma_{jt} + \eta_{ij}\right) + \frac{\lambda}{n}\sum_{k=1}^{p} \hat{\phi}_k\,|\beta_k|. \tag{3}$$

The first-order conditions (FOCs) for this problem are

$$\hat{\alpha}_{it}:\;\; \frac{1}{n}\sum_{j}\left(y_{ijt} - \hat{\mu}_{ijt}\right) = 0,\;\; \forall i,t;$$
$$\hat{\gamma}_{jt}:\;\; \frac{1}{n}\sum_{i}\left(y_{ijt} - \hat{\mu}_{ijt}\right) = 0,\;\; \forall j,t;$$
$$\hat{\eta}_{ij}:\;\; \frac{1}{n}\sum_{t}\left(y_{ijt} - \hat{\mu}_{ijt}\right) = 0,\;\; \forall i,j;$$
$$\hat{\beta}_k:\;\; -\frac{1}{n}\sum_{i,j,t}\left(y_{ijt} - \hat{\mu}_{ijt}\right)x_{ijt,k} + \frac{\lambda}{n}\hat{\phi}_k\,\mathrm{sign}(\hat{\beta}_k) = 0,\;\; k = 1,\ldots,p;$$

where $\hat{\mu}_{ijt}$ denotes $\mu_{ijt} := \exp(x_{ijt}'\beta + \alpha_{it} + \gamma_{jt} + \eta_{ij})$ evaluated at $\hat{\beta}$, $\hat{\alpha}$, $\hat{\gamma}$, $\hat{\eta}$. Notice that the penalty only affects the FOCs for the main covariates of interest. The FOCs for the fixed effects are exactly the same as they would be in unpenalized PPML. That said, further simplification is still needed because it is generally not possible to estimate all of the parameters directly, with or without the penalty. Instead, we first need to "concentrate out" the fixed-effect parameters. That is, instead of minimizing (3) over all of the parameters, we treat $\hat{\alpha}_{it}(\beta)$, $\hat{\gamma}_{jt}(\beta)$, and $\hat{\eta}_{ij}(\beta)$ as functions of $\beta$ that are implicitly defined by their FOCs. The resulting "concentrated" minimization problem is

$$\hat{\beta} := \arg\min_{\beta}\; \frac{1}{n}\sum_{i,j,t} \exp\!\left(x_{ijt}'\beta + \hat{\alpha}_{it}(\beta) + \hat{\gamma}_{jt}(\beta) + \hat{\eta}_{ij}(\beta)\right) - \frac{1}{n}\sum_{i,j,t} y_{ijt}\left(x_{ijt}'\beta + \hat{\alpha}_{it}(\beta) + \hat{\gamma}_{jt}(\beta) + \hat{\eta}_{ij}(\beta)\right) + \frac{\lambda}{n}\sum_{k=1}^{p}\hat{\phi}_k\,|\beta_k|, \tag{4}$$

such that $\beta$ is now the only argument we need to solve for.
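To make the idea of concentrating out the fixed effects concrete, the toy sketch below solves the three fixed-effect FOCs for a given value of the slope coefficient by alternating closed-form updates, each of which makes one block of FOCs hold exactly, in the spirit of iterative proportional fitting. All data, dimensions, the single regressor, and the fixed coefficient are illustrative assumptions, and this is not necessarily how the authors' code proceeds.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy three-way panel: exporters i, importers j, periods t.
I, J, T = 5, 5, 3
y = rng.poisson(5.0, size=(I, J, T)).astype(float)
x = rng.random((I, J, T))   # one hypothetical provision-type regressor
beta = 0.2                  # fixed effects are concentrated out *given* beta

a = np.zeros((I, T))        # alpha_it (exporter-time)
g = np.zeros((J, T))        # gamma_jt (importer-time)
e = np.zeros((I, J))        # eta_ij   (pair)

def mu(a, g, e):
    return np.exp(beta * x + a[:, None, :] + g[None, :, :] + e[:, :, None])

# Alternate closed-form updates; each makes its FOC sum(y - mu) = 0 hold
# exactly for the corresponding block, so the three blocks converge jointly.
for _ in range(200):
    a += np.log(y.sum(axis=1) / mu(a, g, e).sum(axis=1))   # over j: (i,t) FOCs
    g += np.log(y.sum(axis=0) / mu(a, g, e).sum(axis=0))   # over i: (j,t) FOCs
    e += np.log(y.sum(axis=2) / mu(a, g, e).sum(axis=2))   # over t: (i,j) FOCs

m = mu(a, g, e)
# Largest violation of the exporter-time FOCs; shrinks toward zero.
print(np.abs(m.sum(axis=1) - y.sum(axis=1)).max())
```

Given such a routine, the functions $\hat{\alpha}_{it}(\beta)$, $\hat{\gamma}_{jt}(\beta)$, and $\hat{\eta}_{ij}(\beta)$ in (4) are simply the values it returns at each candidate $\beta$.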
The FOC for each $\hat{\beta}_k$ associated with this modified problem is

$$\hat{\beta}_k:\;\; -\frac{1}{n}\sum_{i,j,t}\left[y_{ijt} - \exp\!\left(x_{ijt}'\hat{\beta} + \hat{\alpha}_{it}(\hat{\beta}) + \hat{\gamma}_{jt}(\hat{\beta}) + \hat{\eta}_{ij}(\hat{\beta})\right)\right]\tilde{x}_{ijt,k} + \frac{\lambda}{n}\hat{\phi}_k\,\mathrm{sign}(\hat{\beta}_k) = 0,$$

where

$$\tilde{x}_{ijt,k} := x_{ijt,k} + \frac{d\hat{\alpha}_{it}(\beta)}{d\beta_k} + \frac{d\hat{\gamma}_{jt}(\beta)}{d\beta_k} + \frac{d\hat{\eta}_{ij}(\beta)}{d\beta_k} \tag{5}$$

captures both the direct and indirect effects of a change in $\beta$ on the conditional mean of $y_{ijt}$.

To explain how we deal with the fixed effects, assume for the moment that we know the true values of $\mu_{ijt} := \exp(x_{ijt}'\beta + \alpha_{it} + \gamma_{jt} + \eta_{ij})$ that we will eventually estimate. If that is the case, then the penalized PPML solution $(\beta, \alpha, \gamma, \eta)$ is also the solution to the following weighted least squares problem

$$\min_{\beta,\alpha,\gamma,\eta}\;\; \frac{1}{2n}\sum_{i,j,t} \mu_{ijt}\left(z_{ijt} - \alpha_{it} - \gamma_{jt} - \eta_{ij} - x_{ijt}'\beta\right)^2 + \frac{\lambda}{n}\sum_{k=1}^{p}\hat{\phi}_k\,|\beta_k|,$$

where

$$z_{ijt} = \frac{y_{ijt} - \mu_{ijt}}{\mu_{ijt}} + \log \mu_{ijt}$$

is the transformed dependent variable that is used to motivate estimation via iteratively re-weighted least squares (IRLS). The convenient thing about this representation of the problem is that we can rewrite it as

$$\min_{\beta}\;\; \frac{1}{2}\sum_{i,j,t} \mu_{ijt}\left(\tilde{z}_{ijt} - \tilde{x}_{ijt}'\beta\right)^2 + \sum_{k=1}^{p}\lambda\hat{\phi}_k\,|\beta_k|, \tag{6}$$

where $\tilde{z}_{ijt}$ and $\tilde{x}_{ijt}$ are respectively defined as the "partialed-out" versions of $z_{ijt}$ and $x_{ijt}$, which are obtained by within-transforming $z_{ijt}$ and $x_{ijt}$ with respect to $it$, $jt$, and $ij$ and weighting by $\mu_{ijt}$. The within-transformation steps involved in computing $\tilde{z}_{ijt}$ and $\tilde{x}_{ijt}$ are the same as in Correia, Guimarães, and Zylkin (2020) and can be computed quickly using the methods of Gaure (2013). Furthermore, one can show that the $\tilde{x}_{ijt}$ that appears in (6) is consistent with the definition given for $\tilde{x}_{ijt,k}$ in (5). The nice thing about expressing the problem as in (6) is that it now resembles a simple penalized regression problem. It can thus be quickly solved using the coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010). Furthermore, though we do not know the correct estimation weights (the $\mu_{ijt}$) beforehand, we can follow the approach of Correia, Guimarães, and Zylkin (2020) by repeatedly updating them until convergence after each new estimate of $\beta$, as in IRLS estimation.
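The IRLS logic behind (6) can be sketched as follows. The example fits a penalized Poisson regression on synthetic data by repeatedly solving a weighted lasso on the working response z. It is a deliberately simplified sketch: there are no fixed effects (so no within-transformation is needed), the penalty weights are constants rather than the plug-in weights, and the mu-weighting is implemented by scaling rows by the square root of mu.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# Synthetic Poisson data with two relevant and eight irrelevant regressors.
n, p = 1000, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [0.8, -0.5]
y = rng.poisson(np.exp(X @ beta_true)).astype(float)

beta = np.zeros(p)
for _ in range(25):
    mu = np.exp(X @ beta)
    z = (y - mu) / mu + np.log(mu)    # IRLS working (transformed) response
    w = np.sqrt(mu)                   # scaling rows by sqrt(mu) makes the
    fit = Lasso(alpha=0.01, fit_intercept=False).fit(X * w[:, None], z * w)
    beta = fit.coef_                  # squared loss mu-weighted, as in (6)
print(np.round(beta, 2))              # first two entries near 0.8 and -0.5
```

Each pass updates the weights with the newest coefficient estimates and re-solves the penalized weighted least squares problem, which is exactly the iteration structure described in the text.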
Altogether, our algorithm closely follows Correia, Guimarães, and Zylkin (2020) and otherwise only involves swapping out their weighted least squares step for a penalized weighted least squares step, as shown in (6). In principle, this algorithm can be easily modified for other settings that feature multi-way fixed effects in order to simplify estimation.

More Details on Plug-in Lasso

Rather than relying on out-of-sample performance, the Belloni, Chernozhukov, Hansen, and Kozbur (2016) "plug-in" lasso method chooses the penalty parameters $\lambda$ and $\hat{\phi}_k$ using statistical arguments. Their specific framework is a simple linear panel data model, but their reasoning involves modifying the standard lasso penalty to reflect the variance of the score. These concepts are quite general; thus, we can modify their approach to take into account the more complex case of a nonlinear model with multiple fixed effects.

The key condition in choosing these penalty parameters is that they should satisfy the following inequality for all $k$:

$$\frac{\lambda\hat{\phi}_k}{n} \geq \frac{c}{n}\left|\sum_{i,j,t}\left(y_{ijt} - \exp\!\left(x_{ijt}'\beta + \alpha_{it} + \gamma_{jt} + \eta_{ij}\right)\right)\tilde{x}_{ijt,k}\right| \quad \forall k, \tag{7}$$

for some $c > 1$. Intuitively,

$$\frac{1}{n}\left|\sum_{i,j,t}\left(y_{ijt} - \exp\!\left(x_{ijt}'\beta + \alpha_{it} + \gamma_{jt} + \eta_{ij}\right)\right)\tilde{x}_{ijt,k}\right|$$

is the absolute value of the score for $\beta_k$. When evaluated at $\beta_k = 0$, it tells us to what degree moving $\beta_k$ away from zero will affect the fit of the model. If it does not produce a sufficient improvement in fit as compared to the penalty $\lambda\hat{\phi}_k$, then regressor $x_{ijt,k}$ will not be selected. Next, set

$$\hat{\phi}_k^2 = \frac{1}{n}\sum_{i,j}\left(\sum_{t}\tilde{x}_{ijt,k}\,\hat{\omega}_{ijt}\right)^2 = \frac{1}{n}\sum_{i,j}\sum_{t}\sum_{t'}\tilde{x}_{ijt,k}\,\tilde{x}_{ijt',k}\,\hat{\omega}_{ijt}\,\hat{\omega}_{ijt'},$$

where $\hat{\omega}_{ijt} = y_{ijt} - \exp(x_{ijt}'\hat{\beta} + \hat{\alpha}_{it} + \hat{\gamma}_{jt} + \hat{\eta}_{ij})$, which can also be obtained as $\hat{\omega}_{ijt} = \hat{\mu}_{ijt}(\tilde{z}_{ijt} - \tilde{x}_{ijt}'\hat{\beta})$. By inspection, this expression provides an estimate of the variance of the score for $\beta_k$ under the assumption that errors are correlated over time within the same pair, as is commonly assumed in this context.
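The construction of the penalty weights and of the plug-in penalty level can be sketched as follows. The arrays standing in for the residuals and the partialed-out regressors, and all the dimensions, are purely illustrative assumptions; in practice they come from the estimation steps described above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Illustrative stand-ins: e[i, j, t] plays the role of the residuals
# omega-hat, and xt[i, j, t, k] the partialed-out regressors x-tilde.
I, J, T, p = 10, 10, 5, 8
e = rng.normal(size=(I, J, T))
xt = rng.normal(size=(I, J, T, p))
n = I * J * T

# phi_k^2 = (1/n) * sum_{i,j} ( sum_t xt[i,j,t,k] * e[i,j,t] )^2.
# Summing over t *before* squaring allows arbitrary serial correlation
# of the errors within each (i, j) pair.
pair_sums = np.einsum("ijtk,ijt->ijk", xt, e)     # sum over t per (i,j,k)
phi = np.sqrt((pair_sums ** 2).sum(axis=(0, 1)) / n)

# lambda_plug = 2c sqrt(n) Phi^{-1}(1 - xi/(2p)), with c = 1.1 and
# xi = 0.1/log(n), as in the formula above.
c, xi = 1.1, 0.1 / np.log(n)
lam = 2 * c * np.sqrt(n) * norm.ppf(1 - xi / (2 * p))
print(phi.shape, round(lam, 1))
```

As noted in the text, the phi weights are not known beforehand; in the full algorithm they are recomputed from the latest residuals at each IRLS iteration.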
Provided there is weak temporal dependence (in the sense described by Hansen, 2007), $\hat{\phi}_k^2 - \phi_k^2 = o_p(1)$ uniformly in $k$, where $\phi_k^2$ is the analogue of $\hat{\phi}_k^2$ evaluated at the true values of $\mu_{ijt}$. By choosing $\hat{\phi}_k$ in this way we ensure that the score for $\beta_k$, when evaluated at zero, must be large as compared to its standard deviation in order for regressor $k$ to be selected.

The choice of $\lambda$ then involves setting a value that is sufficiently large that the probability that an irrelevant regressor is selected is small. By the maximal inequality for self-normalized sums (see Jing, Shao, and Wang, 2003), it follows that

$$\frac{\Pr\left(\frac{1}{\hat{\phi}_k}\left|\frac{1}{\sqrt{n}}\sum_{i,j,t}\tilde{x}_{ijt,k}\,\omega_{ijt}\right| \geq m\right)}{\Pr\left(\left|N(0,1)\right| \geq m\right)} - 1 = o(1),$$

for $|m| = o(n^{1/6})$, thus establishing a bound for the tails of the normalized sum. This suggests that by choosing a $\lambda$ that is sufficiently large to dominate a $p$-dimensional standard normal, the inequality in (7) is satisfied. Hence, following Belloni, Chernozhukov, Hansen, and Kozbur (2016), we set

$$\lambda = \lambda_{plug} = 2c\sqrt{n}\,\Phi^{-1}\!\left(1 - \xi/(2p)\right),$$

where $c = 1.1$ and $\xi = 0.1/\log(n)$.

As discussed in the main text, after the lasso step we then perform an unpenalized PPML estimation using the selected covariates, a so-called "post-lasso" regression. Let $\hat{\beta}_{PL}$ be the estimator of the parameters associated with the $s$ selected covariates. Such an estimator is said to have the "oracle property" if the asymptotic distribution of $\hat{\beta}_{PL}$ coincides with that of the estimator we would obtain if we knew exactly which coefficients were equal to zero, i.e., for large enough samples we would have $\hat{\beta}_{PL,k} = 0$ if and only if $\beta_k = 0$ for $k = 1, \ldots, p$. Hence, for estimators with the oracle property, asymptotically the post-lasso model is indeed the right model. In general, the lasso does not satisfy the oracle property. Nevertheless, under some additional regularization conditions, the use of the plug-in lasso method just described ensures the following "near-oracle" property for $\hat{\beta}_{PL}$:

$$\left\|\hat{\beta}_{PL} - \beta\right\|_1 = O_p\!\left(\sqrt{\frac{s^2\,\max(\log n, \log p)}{n}}\right),$$

and hence the post-lasso estimates are consistent at a rate that differs from the oracle rate only up to the log factor $\max(\log n, \log p)$.

In practice, the plug-in lasso method only requires adding one additional step to the procedure used for the estimation of the PPML-lasso with high-dimensional fixed effects described before. Though the $\hat{\phi}_k$ penalty terms are not known beforehand, they, too, can be iterated on in the same fashion as $\mu_{ijt}$. Simply use the most recent values of $\hat{\omega}_{ijt}$ in each iteration to construct new values for $\hat{\phi}_k$.

More Details on Cross-Validation

As discussed in the main text, the idea behind cross-validation (CV) is to repeatedly hold out a subset of the sample during estimation and then use it to validate the resulting estimates. In our setup, rather than holding out observations in an unstructured way, we keep together all observations for which a given agreement is in effect, and hold out subsets of agreements. Doing so allows us to obtain estimates for all the fixed effects in the model.

To describe the implementation of CV, suppose that the observations associated with trade agreements are partitioned into $G$ subsets. Each resulting hold-out sample $g$ will have $n_g$ observations, where $n_g$ is the number of observations associated with agreements that are held out in partition $g$. Because our variables of interest are all dummies, a problem that may occur is that over some subsamples some regressors may not be present, but that is less likely to happen when $G$ is large. The CV approach sets all regressor-specific penalty weights $\hat{\phi}_k$ equal to 1. Let $\hat{\beta}_{L,-g}(\lambda)$ be the lasso estimator obtained via the minimization of (4) when holding out the $n_g$ observations contained in partition $g$.
Define the CV bandwidth as

$$\lambda_{CV} = \arg\min_{\lambda}\; \frac{1}{G}\sum_{g=1}^{G}\frac{1}{n_g}\sum_{(i,j,t)\in g}\left(y_{ijt} - \exp\!\left(x_{ijt}'\hat{\beta}_{L,-g}(\lambda) + \hat{\alpha}_{it}^{L,-g}(\lambda) + \hat{\gamma}_{jt}^{L,-g}(\lambda) + \hat{\eta}_{ij}^{L,-g}(\lambda)\right)\right)^2.$$

Since $\lambda_{CV}$ is based on the minimization of the average MSE over different subsamples, we expect it to deliver a much more lenient variable selection. There is some disagreement over whether dummy variables, such as the ones used in our application, should be standardized before applying the CV lasso. This consideration does not arise for the plug-in lasso, since standardization of the covariates simply causes the $\hat{\phi}_k$ terms to be re-scaled without otherwise affecting estimation in that case. We have computed CV lasso results with and without first standardizing and found that the results with standardization are noticeably more similar to the plug-in lasso results. Thus, our preference is to work with standardized dummy covariates.

References

Anderson, J. and E. Van Wincoop (2003). "Gravity with gravitas: A solution to the border puzzle," American Economic Review, 93, 170-192.

Baier, S.L. and J.H. Bergstrand (2007). "Do free trade agreements actually increase members' international trade?," Journal of International Economics, 71, 72-95.

Baier, S.L., J.H. Bergstrand, and M.W. Clance (2018). "Heterogeneous effects of economic integration agreements," Journal of Development Economics, 135, 587-608.

Baier, S.L., J.H. Bergstrand, and M. Feng (2014). "Economic integration agreements and the margins of international trade," Journal of International Economics, 93, 339-350.

Baier, S.L., Y.V. Yotov, and T. Zylkin (2019). "On the widely differing effects of free trade agreements: Lessons from twenty years of trade integration," Journal of International Economics, 116, 206-228.

Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012). "Sparse models and methods for optimal instruments with an application to eminent domain," Econometrica, 80, 2369-2429.

Belloni, A., V. Chernozhukov, and C. Hansen (2014). "Inference on treatment effects after selection among high-dimensional controls," Review of Economic Studies, 81, 608-650.

Belloni, A., V. Chernozhukov, C. Hansen, and D. Kozbur (2016). "Inference in high-dimensional panel models with an application to gun control," Journal of Business & Economic Statistics, 34, 590-605.

Correia, S., P. Guimarães, and T. Zylkin (2020). "Fast Poisson estimation with high-dimensional fixed effects," Stata Journal, 20, 90-115.

Dhingra, S., R. Freeman, and E. Mavroeidi (2018). "Beyond tariff reductions: What extra boost to trade from agreement provisions?," LSE Centre for Economic Performance Discussion Paper 1532.

Drukker, D.M. and D. Liu (2019). "A plug-in for Poisson lasso and a comparison of partialing-out Poisson estimators that use different methods for selecting the lasso tuning parameters," mimeo.

Fu, W. and K. Knight (2000). "Asymptotics for lasso-type estimators," Annals of Statistics, 28, 1356-1378.

Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, 33, 1-22.

Gaure, S. (2013). "OLS with multiple high dimensional category variables," Computational Statistics & Data Analysis, 66, 8-18.

Gourieroux, C., A. Monfort, and A. Trognon (1984). "Pseudo maximum likelihood methods: Applications to Poisson models," Econometrica, 52, 701-720.

Hansen, C. (2007). "Asymptotic properties of a robust variance matrix estimator for panel data when T is large," Journal of Econometrics, 141, 597-620.

Hastie, T., R. Tibshirani, and J.H. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York (NY): Springer.

Head, K. and T. Mayer (2014). "Gravity equations: Workhorse, toolkit, and cookbook," Handbook of International Economics, Vol. 4, 131-195.

Hofmann, C., A. Osnago, and M. Ruta (2017). "Horizontal depth: A new database on the content of preferential trade agreements," World Bank Policy Research Working Paper 7981.

Jing, B.Y., Q.M. Shao, and Q. Wang (2003). "Self-normalized Cramér-type large deviations for independent random variables," The Annals of Probability, 31, 2167-2215.

Kohl, T., S. Brakman, and H. Garretsen (2016). "Do trade agreements stimulate international trade differently? Evidence from 296 trade agreements," The World Economy, 39, 97-131.

Larch, M., J. Wanner, Y.V. Yotov, and T. Zylkin (2019). "Currency unions and trade: A PPML re-assessment with high-dimensional fixed effects," Oxford Bulletin of Economics and Statistics, 81, 487-510.

Mattoo, A., A. Mulabdic, and M. Ruta (2017). "Trade creation and trade diversion in deep agreements," World Bank Policy Research Working Paper 8206.

Mattoo, A., N. Rocha, and M. Ruta (2020). Handbook of Deep Trade Agreements. Washington, DC: World Bank.

Mulabdic, A., A. Osnago, and M. Ruta (2017). "Deep integration and UK-EU trade relations," World Bank Policy Research Working Paper 7947.

Regmi, N. and S. Baier (2020). "Using machine learning methods to capture heterogeneity in free trade agreements," mimeo.

Santos Silva, J.M.C. and S. Tenreyro (2006). "The log of gravity," Review of Economics and Statistics, 88, 641-658.

Stammann, A. (2018). "Fast and feasible estimation of generalized linear models with high-dimensional k-way fixed effects," arXiv:1707.01815.

Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.

Weidner, M. and T. Zylkin (2020). "Bias and consistency in three-way gravity models," arXiv:1909.01327.

Yotov, Y.V., R. Piermartini, J.-A. Monteiro, and M. Larch (2016). An Advanced Guide to Trade Policy Analysis: The Structural Gravity Model. Geneva: World Trade Organization.

Zhao, P. and B. Yu (2006). "On model selection consistency of lasso," Journal of Machine Learning Research, 7, 2541-2563.

Zou, H. (2006). "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, 101, 1418-1429.