Latent Dirichlet Allocation Models for World Trade Analysis

Abstract

International trade is one of the classic areas of study in economics. Given the current availability of data, the tools used for its analysis can be complemented and enriched with new methodologies and techniques that go beyond the traditional approach. This paper applies Latent Dirichlet Allocation (LDA) models, a well-known technique from Natural Language Processing, to search for latent dimensions in the product space of international trade and for their distribution across countries over time. We apply this technique to a dataset of countries' exports of goods from 1962 to 2016. The findings show that it is possible to generate higher-level classifications of goods from empirical evidence, and to study the distribution of those classifications within countries. The latter offers interesting insights into countries' trade specialisation.


Diego Kozlowski, Viktoriya Semeshenko, and Andrea Molinari

Affiliations: DRIVEN, FSTM, University of Luxembourg, Luxembourg; Universidad de Buenos Aires, Facultad de Ciencias Económicas, Buenos Aires, Argentina; CONICET-Universidad de Buenos Aires, Instituto Interdisciplinario de Economía Política de Buenos Aires, Buenos Aires, Argentina


Keywords:

COMTRADE Data, Data Analysis, Topic Modelling, Latent Dirichlet Allocation, Unsupervised Learning

The role that countries play in the global market is profoundly determined by their insertion into global value chains, and by the types of goods they produce for the global market (Coe et al. 2004; Gereffi, Humphrey, and Sturgeon 2005; Gereffi 1994).

Production systems, which were traditionally analyzed as almost independent national systems, are now increasingly connected on a global scale. Due to the increasingly complex and interconnected nature of global supply chain networks, a recent strand of research has applied network science methods to model global supply chain growth and subsequently analyse various topological features of these structures. This depends on the dataset in use, as it defines the topology of the network.

In recent years, we have witnessed a continuous growth of available data. This situation also poses a great challenge, namely, how to extract hidden relations and determine appropriate patterns, clusters and trends in order to draw valuable conclusions from such large volumes of data (Padhy, Mishra, and Panigrahi 2012). Traditional analysis tools cannot handle such complexity alone, because extracting and analysing the information requires time and effort. On the other hand, interdisciplinary sciences provide different techniques and tools to apply to the analysis of this volume of data. The application of network formalism in the field of socioeconomic science has experienced unprecedented growth in recent decades (Barabasi 2011; Caldarelli 2007; Ermann and Shepelyansky 2013; Fagiolo, Squartini, and Garlaschelli 2013). There is also a wide literature that studies international trade at the product level (Balassa 1965; Lall 2000; Lall, Weiss, and Zhang 2006; Haveman and Hummels 2004). In particular, these connections can be analyzed as a bipartite graph among countries and products (Guan et al. 2018; Straka, Guido Caldarelli, and Saracco 2017; Araújo and Ferreira 2016; Guido Caldarelli et al. 2012), and the complexity of production can be explored using the product space (César A. Hidalgo 2009; C. Hidalgo and Hausmann 2009; C. A. Hidalgo et al. 2007). The world trade network can also be examined using multiplex and multilayer networks (Battiston, Nicosia, and Latora 2014; Kivela et al. 2011; Alves et al. 2019).

In this paper, we adopt a different approach to extract interesting and significant patterns from bilateral trade data, using the Latent Dirichlet Allocation (LDA) modelling technique (Blei, Ng, and Jordan 2003). Topic models have emerged as an effective method for discovering useful structure in data, and LDA is a statistical approach used in topic modeling for discovering hidden topics in large corpora of text.

Recently, a growing number of researchers have begun to apply topic models to various kinds of datasets, not only document collections (Pritchard, Stephens, and Donnelly 2000; Rosa et al. 2015; Fei-Fei and Perona 2005; Kim, Narayanan, and Sundaram 2009; Hu and Saul 2009). To the best of our knowledge, our work is the first effort to adapt and apply this technique to countries' exports.

We find the analogy between topic modeling in texts and trade very suitable. In our adaptation of LDA, a set of countries plays the role of text documents, products play the role of words, and components (i.e. latent dimensions within which these products group) play the role of topics. Based on the model of Blei, Ng, and Jordan 2003, we suggest a generative process to detect these latent dimensions in the product space and build an alternative trade nomenclature directly from data. Then, using these latent dimensions, we analyze those components' participation within countries' export baskets.

Our main contributions and results can be summarized as follows: we develop a generative model, based on a well-established methodology usually used in the field of Natural Language Processing, to study international trade flows. This model looks for an automatic grouping of the products into latent components. We study these latent components, characterizing each by type of production, complexity and its relation to a specific country over time. Then, we use the components to briefly characterize the role in global trade of different groups of countries. The results that emerge from our model are in line with the specialized economic and trade history literature.

The paper is organized as follows: in section 2 we describe the dataset in use, in section 3 we introduce the notation and explain the methodology applied in the model, in section 4 we present the results, and in section 5 we conclude.

2 Data

To apply the LDA technique, we used the United Nations Commodity Trade Statistics Database (COMTRADE) dataset of each country's disaggregated (four-digit) exports, from the Center for International Development at Harvard University. This dataset contains trade data for around 250 countries and territories, and takes the raw trade data on goods from countries' reporting to the United Nations Statistical Division (COMTRADE).

We used these data instead of the raw COMTRADE statistics because the raw data may contain some inconsistencies. To address this issue, the Center for International Development uses the Bustos-Yildirim Method to clean data and "account for inconsistent reporting practices and thereby generate estimates of trade flows between countries". This method assumes that, since these data are recorded both as exports and as imports, cross-referencing countries' reported trade flows against each other can produce reliable estimates. It consists of first correcting bilateral import values and then comparing them to the reverse flows reported by the exporting partner. Their (per-country) estimated index of reliability for reporting trade flows measures the consistency of trade totals reported by all exporter and importer combinations over time. Finally, they generate their own estimates of trade values using the data reported by countries together with this reliability index.

Bilateral trade flows are mainly recorded in two trade classification systems, the Harmonized System (HS) and the Standard International Trade Classification (SITC), and the data have four dimensions: exporter, importer, product, and year. While both classifications are valid, there is a "time versus disaggregation" trade-off involved in deciding which dataset to use. SITC data has a longer time series (1962-2016) but covers fewer goods (i.e. it is more aggregated, with up to 4 digits and approximately 750 products). On the other hand, HS data, being a newer classification, offers a more contemporary and detailed classification of goods (i.e. disaggregated up to 6 digits, with approximately 5,000 goods), but with the downside of a relatively shorter time period (1995-2017).

We chose to work with SITC (in this case Revision 2) in order to have a longer time series, at the cost of slightly more aggregated data (i.e. 4 instead of 6 digits) (United Nations Statistics Division 1975). Moreover, we reckon that 750 products should be enough to apply the LDA technique, as it should allow for enough (but not too much) granularity when labelling the components. For this dataset, we make an empirical search for the best number of latent dimensions.
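As an illustration of how four-dimensional records (exporter, importer, product, year) can be collapsed into the country-year by product layout that a topic model expects, the following is a minimal sketch using only the Python standard library. The record layout, the function name `build_country_year_matrix`, the country codes, the product codes and the values are all assumptions made for this example; they are not taken from the dataset or from the authors' code.

```python
from collections import defaultdict

# Hypothetical bilateral records: (exporter, importer, product_code, year, value_usd).
# Codes and values are placeholders invented for illustration.
records = [
    ("ARG", "BRA", "0412", 1962, 120.0),
    ("ARG", "USA", "0412", 1962, 80.0),
    ("ARG", "BRA", "2222", 1962, 50.0),
    ("CHL", "USA", "6821", 1962, 300.0),
    ("ARG", "BRA", "2222", 1963, 70.0),
]

def build_country_year_matrix(rows):
    """Sum bilateral flows over importers into (country, year) x product totals."""
    totals = defaultdict(float)
    for exporter, _importer, product, year, value in rows:
        totals[(exporter, year, product)] += value

    docs = sorted({(c, y) for c, y, _ in totals})   # country-years play the role of documents
    vocab = sorted({p for _, _, p in totals})       # products play the role of words
    row = {d: i for i, d in enumerate(docs)}
    col = {p: j for j, p in enumerate(vocab)}

    matrix = [[0.0] * len(vocab) for _ in docs]
    for (c, y, p), v in totals.items():
        matrix[row[(c, y)]][col[p]] = v
    return docs, vocab, matrix

docs, vocab, matrix = build_country_year_matrix(records)
for doc, values in zip(docs, matrix):
    print(doc, values)
```

Summing over importers is a design choice of this sketch: it keeps only the export basket of each country-year, which is the unit the rest of the paper works with.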

(Footnote: This dataset was extracted in March 2019. See https://atlas.cid.harvard.edu/about-data for more details.)

(Footnote: Imports are reported CIF (i.e. including freight and insurance costs) and exports free on board (FOB).)

In this section we describe a probabilistic model constructed to study the trade flow data with the aim of generating an automatic grouping of the products.

This cannot be achieved using traditional clustering techniques in high dimensional spaces (Aggarwal, Hinneburg, and Keim 2001), due to the fact that a product can be used or consumed as an intermediate and/or final product at the same time, which means that groups cannot be exclusive (Molinari and De Angelis 2016). Therefore, the problem we are dealing with can be examined with fuzzy clustering.

At the same time, we need to mitigate high-dimensional data issues through dimensionality reduction. This is possible because we can exploit similarities within the $\mathbb{R}^{N \times P \times Y}$ space, that is, the interaction of N countries, P products and Y years.

We find it appropriate to use LDA to group products. While Blei, Ng, and Jordan 2003 look for a latent dimension of k topics, embedded in a highly dimensional dictionary distributed over the texts that compose the corpus, here we are looking for a latent dimension of k components, embedded in a highly dimensional classification of products distributed along the countries over the years.

We use the following terms to define our probabilistic topic model:

• product is the basic discrete unit of analysis, defined as an item in a classification (SITC). We represent products using unit-basis vectors, where the superscript i stands for the i-th product in the classification and the i-th element in the vector. The $v$-th product of the classification is the vector $w$ such that $w^v = 1$ and $w^u = 0$ for $u \neq v$.
• country-year is a sequence of N products, defined as $W = (w_1, w_2, \ldots, w_N)$.
• corpus is the collection of M country-years, defined as $D = (d_1, d_2, \ldots, d_M)$.
• component is a latent dimension of the corpus; we denote the number of components by $K$.

The objective behind the classification of the products is twofold: on the one hand, to look for a distribution of components over each country-year; on the other, to analyse the distribution of the products within each of the components.

In the original model proposed by Blei, Ng, and Jordan 2003, the words are supposed to be random realisations of chained distributions, ignoring the order in which the words appear in the document. Even though we know that the real data generating process is far from what our model proposes, this inference process can still provide useful insight into the latent dimensions we are looking for. The basic idea of the generative process is that, given the amount of dollars exported by a country in a specific year, the assignment of the product that will be exported comes from a random mixture over latent components, where each component is characterised by a distribution over products.

The sequence of the data generation can be described as follows:

• For each country-year in the corpus, we assume that exports come from the following two-stage process:
  – choose randomly a distribution for the components,
  – for every dollar exported:
    ∗ choose randomly the component to which it belongs, and
    ∗ choose randomly a product from the distribution corresponding to that component.

The data generating process can be formalised as follows:

1. For every component $k \in \{1, 2, \ldots, K\}$: $\beta_k \sim \mathrm{Dir}(\eta)$, where $\eta \in \mathbb{R}_{>0}$ is fixed.

2. For each country-year $d \in \{1, 2, \ldots, D\}$:
   • Generate a vector of component proportions $\theta_d \sim \mathrm{Dir}(\alpha)$, where $\alpha \in \mathbb{R}_{>0}$ is fixed.
   • For every exported dollar:
     (a) generate an allocation of the component, $z_{dn} \sim \mathrm{Mult}(\theta_d)$;
     (b) assign the product, $w_{dn} \sim \mathrm{Mult}(\beta_{z_{dn}})$.

The Dirichlet distribution is a family of distributions whose realizations are themselves probability distributions (here, vectors of proportions that sum to one). It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables, that is, how likely it is that the random variables are distributed according to one or another particular distribution.

The parameters defining the Dirichlet distribution (here, $\eta$ and $\alpha$) determine the degree of concentration of the resulting distributions. For a $\mathrm{Dir}(\alpha)$ distribution, $\alpha$ defines the degree of symmetry of the multinomial distributions that it generates. With values much smaller than 1, the resulting distributions will be highly concentrated on some elements, while values much larger than 1 would generate very uniform distributions. In terms of our problem, $\alpha$ controls the mixture of components for any given country, and $\eta$ controls the distribution of products per component. A very small $\alpha$ means that each country has few characteristic components, while a very small $\eta$ generates a very asymmetric distribution over the products, so that there are a few very important products and the rest have almost null probability.

In this section we present results for the analysis of components, looking first at their distribution and then at their main country over time. We confine our analysis to the period 1962-2016, for the 250 reporting countries and P products (goods, not services), which, as mentioned, in SITC Rev. 2 (United Nations Statistics Division 1975) at 4 digits are approximately 750. In other words, we work within an order of magnitude similar to that of a regular dataset in a traditional Topic Modelling problem. As mentioned before, prioritising a longer time series, we decided to use the SITC (Rev. 2) 4-digit nomenclature.

In the following sections, results are discussed in two stages. We first walk the reader through the decision on the number of components, also discussing the labelling process adopted, and then analyse the evolution of exports in the main country for each component.
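As a reading aid for the generative process formalised in the methodology, the sketch below simulates it with the Python standard library: component-product distributions and a country-year mixture are drawn from symmetric Dirichlet priors, and each exported dollar is then assigned a component and a product. It only simulates the assumed data generating process and performs no inference. The toy sizes (`K = 5`, `P = 20`), the helper names and the Gamma-normalisation sampler are assumptions of this example, not the authors' implementation.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one symmetric Dirichlet vector by normalising Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_country_year(theta, betas, n_dollars, rng):
    """Assign each exported dollar to a component, then to a product of that component."""
    n_products = len(betas[0])
    exports = [0] * n_products
    for _ in range(n_dollars):
        z = rng.choices(range(len(theta)), weights=theta)[0]      # z_dn ~ Mult(theta_d)
        w = rng.choices(range(n_products), weights=betas[z])[0]   # w_dn ~ Mult(beta_z)
        exports[w] += 1
    return exports

def mean_largest_share(concentration, size, rng, n_draws=200):
    """Average of the largest coordinate over repeated Dirichlet draws."""
    draws = (max(sample_dirichlet(concentration, size, rng)) for _ in range(n_draws))
    return sum(draws) / n_draws

rng = random.Random(0)
K, P = 5, 20                     # toy sizes, far smaller than K = 30 and P of about 750
eta = alpha = 1.0 / K            # symmetric priors set to 1/K
betas = [sample_dirichlet(eta, P, rng) for _ in range(K)]   # beta_k ~ Dir(eta)
theta = sample_dirichlet(alpha, K, rng)                     # theta_d ~ Dir(alpha)
exports = generate_country_year(theta, betas, n_dollars=1000, rng=rng)

# Effect of the concentration parameter on how peaked the mixtures are.
concentrated = mean_largest_share(0.05, K, rng)   # small value: one component tends to dominate
uniform = mean_largest_share(50.0, K, rng)        # large value: shares stay near 1/K
print(exports, round(concentrated, 2), round(uniform, 2))
```

The last two quantities illustrate the point made about concentration: with a value much smaller than 1 the largest share is typically far above 1/K, while with a value much larger than 1 it stays close to 1/K.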

(Footnote: In this case $\eta = 1/K$.)

(Footnote: In this case $\alpha = 1/K$.)

In this subsection, we first explain the trade-off (granularity vs. economic interpretation) faced when using trade data with LDA. We then describe the process of finding the most suitable number of components ($k$) for our problem and the labelling of each component, and conclude with some reflections on the findings for the chosen $k$.

$k$ stands for the number of components and plays a fundamental role in the model. Fewer components (i.e. small $k$) will tend to reflect broader concepts. On the other hand, if $k$ is larger than the cardinality of the latent space (i.e. the implicit space for the grouped products is smaller than the number of proposed components), this can generate repeated or over-specific components. In other words, in our case this issue poses a trade-off between granularity and well-defined (i.e. easily "taggable") components.

(Footnote: In the limit, in our case, one could have one topic per product.)

First, we ran the model for various values of $k$, starting from $k = 2$. A first result, observed for all these values of $k$, is that the components that group best are those containing petroleum and derivatives, electronics, machinery and textiles. As mentioned above, hyperparameter $k$ defines the components' specificity. However, phenomena worth exploring (and amenable to economic interpretation) can be found at different levels of granularity. Hence, a first problem when analysing the different exercises is to define a suitable granularity for the components. For relatively low values of $k$ (i.e. up to $k = 10$) the petroleum component always stands out. From $k = 20$ we also find other sectors (e.g. electronic products, textiles, etc.) in some components, while other components hold a mixture of products that is harder to rationalise as a latent dimension. For values of $k$ between 20 and 50 the composition of each component is rather stable, resulting in a good balance between more easily interpretable (i.e. taggable) components and an interesting level of granularity. For values of $k$ higher than 50, components tend to repeat themselves.

Figure 1 shows an irregular distribution of components (for $k = 2$ to $k = 50$). Some components stand out, such as the first component for several values of $k$ (including $k = 2$), or the eighth component for others (including $k = 8$). These components are mainly composed of petroleum products. This result can be interpreted in two (not mutually exclusive) ways. It could reflect that the dataset contains relatively more primary-producing countries than industrial manufacturing ones, but it could also show that the former countries' exports are more concentrated in primary products.

(Footnote: This does not weight components by total exports, i.e. it shows an equal basis among countries.)

A second step consisted of choosing an appropriate $k$, considering this granularity versus taggability trade-off. There is no clear optimum value for $k$ and, although the literature within the text analysis domain has contributed some proposals, this parameter should come from a substantive search in which the topics (or components) found are closer to the object of study (Bonilla and Grimmer 2013; Quinn et al. 2010).

To explore the distribution of products over components, and their cumulative function, we developed a dynamic dashboard. This distribution is plotted according to a widely used technological exports classification (Lall 2000). After a substantive search (which involved the mentioned estimation, comparison and manual exploration of the model with different $k$), we feel comfortable choosing $k = 30$ as the number of components offering the best trade-off between enough (economically interpretable) granularity and relatively little repetition of components.

(Footnote: See https://diego-kozlowski.shinyapps.io/LDA_worldtrade/. Appendix A shows the step-by-step labelling exercise and some possible economic interpretation of results for different values of $k$, starting from $k = 2$.)

For $k = 30$, we also tested different values of $\eta$. As we want our components to have an asymmetric distribution, to facilitate their labelling, we ran the model with several small values of $\eta$. The composition of the components did not show substantive changes with different values of $\eta$, suggesting that the model is robust to variations in the priors. For this reason, and given that the default value $\alpha = 1/k$ gives good results in terms of countries' specialisation, we decided to keep the default values of $\alpha = \eta = 1/k$ for the mentioned $k$.

In a third and final step, we manually labelled the components for the chosen model ($k = 30$). The mentioned dynamic dashboard (which shows the product composition of each component, with the distribution plotted according to a widely used technological exports classification) helped the labelling exercise by re-grouping products within each component by their technological complexity. Frequently, even in text topic modeling, component labeling is quite difficult due to the lack of a generalisation criterion. We found that a downside of the LDA technique within the trade flows domain is that this issue is reinforced, since the subjective search for a comprehensive concept of products traded among countries can turn out to be a more complex task than searching for a general concept over a group of words. On the upside, polysemy, a frequent problem in texts, does not exist when using trade data, where all signifiers (classification indexes) refer to a single and unambiguous meaning. However, new problems arise, e.g. deciding upon the trade nomenclature or the data disaggregation level (which could be associated with choosing the language of the corpus in text analysis).
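The idea of inspecting the distribution of products within a component together with its cumulative function, as an aid to labelling, can be sketched as follows. This is a hypothetical helper written for illustration, not the code behind the dashboard; the function names, the product names and the probabilities are invented.

```python
def rank_products(beta_k, product_names):
    """Sort products by probability and attach the cumulative probability mass."""
    ranked = sorted(zip(product_names, beta_k), key=lambda pair: pair[1], reverse=True)
    table, cumulative = [], 0.0
    for name, prob in ranked:
        cumulative += prob
        table.append((name, prob, cumulative))
    return table

def products_to_cover(beta_k, product_names, mass=0.8):
    """Smallest number of top products whose probabilities add up to at least `mass`."""
    table = rank_products(beta_k, product_names)
    for count, (_name, _prob, cumulative) in enumerate(table, start=1):
        if cumulative >= mass:
            return count
    return len(table)

# Hypothetical component: names and probabilities are invented for illustration.
names = ["crude petroleum", "refined petroleum", "natural gas", "copper", "soya beans"]
beta_k = [0.55, 0.30, 0.10, 0.04, 0.01]

table = rank_products(beta_k, names)
n_needed = products_to_cover(beta_k, names, mass=0.8)
for name, prob, cumulative in table:
    print(f"{name:20s} {prob:.2f} {cumulative:.2f}")
```

A component whose top few products cover most of the mass is easier to label than one whose mass is spread thinly, which is why a fixed cut-off such as the first ten products may not be enough for every component.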
In our model, we first observed that the usual practice of looking at the first ten elements of the distribution was not sufficient to find a general label for each component, and for this reason we developed a more comprehensive dashboard.

Table 1 shows the labels for our model (with $k = 30$), with a general description by component, except when that is not possible (e.g. component 19), together with a ‘subgroup’ that allows for a finer (or more detailed) product specification and, in the case of industrial products, the level of technological complexity (according to Lall 2000). Finally, the last column displays the country for which each component has the highest share (taking an average over the whole time period).

It is worth highlighting component 5. Although (as mentioned below) it is not defined by a few products, it groups goods that had high technological complexity at the beginning of the series (during the ’60s) but later fell into disuse or decreased their share in international trade. In this sense, it is unsurprising that Czechoslovakia would be the most characteristic country of this component, given that, due to the country’s dissolution in 1992, its time series is shorter than the rest.

Having the labelled components, this subsection analyses each country’s exports basket composition over the period under study (1962-2016). Since by definition our unit is the country-year, it is possible to compare the evolution of the components’ distribution within each country. Below, we highlight some regularities that can be inferred from looking at the export shares of the main country in each component.

The following analysis presents one of the various possible analyses that could be performed with the LDA application proposed in this paper. Rather than being exhaustive, the intention of this section is to present results in a way that allows the reader to understand the possible derivations of the methodology. A previous time-series analysis for various country groups (i.e. covering some oil-producing, North American, European, Latin American and Asian countries) focused on the evolution of those countries’ export baskets. The overall conclusion is that the evolution of exports in Asia leaves a very different image from that of Europe. While in the latter the concentration of the EU countries’ export baskets in a single component shows a national differentiation, this is not the case in Asian countries, which show a homogenization process.

LDA results for the structure of Chinese exports provide an interesting example (see Figure 2). At the beginning of the ’60s the most relevant component (28) was composed of rice, cotton, tea and some textile products. This component shows a downward trend, while clothing, toys, etc. (component 4) increases and becomes the most important over the period 1980-2003. However, from 1993 textiles and toys start decreasing, with a simultaneous rise in component 23 (televisions, computers, microcircuits and transistors), which towards the last years of the period constitutes approximately 80% of the country’s exports. This change in the specific nature of Chinese exports reflects three stages of increasing complexity of the country’s manufacturing industry, starting from a basically agricultural economy and, after a period of low-complexity industrialisation, becoming one of the world’s leading exporters of highly complex products (Chenery et al. 1986; Costantino 2013).

Among other findings, Kozlowski 2019 highlights the concentration of EU countries’ export baskets in a single component that varies between countries, hence showing national differentiation, while most Asian countries tend to show much more homogeneous export baskets.
Further, it finds a clear concentration of the export baskets of the founding members of the Organization of the Petroleum Exporting Countries (OPEC), with some differences for Venezuela (mainly due to its particular history and some of its active public policies to diversify the country’s exports; Bértola and Ocampo 2010). Another result worth mentioning is that our LDA model captures the export specialisation in electronic products in the United States moving from analogue to digital technologies over the period of study, together with the Maquila phenomenon in Mexico.

(Footnote: That is, recording tapes, telephone lines or photographic paper.)

(Footnote: This means that its denominator is lower than that of the other countries of the dataset.)

Moreover, in component 25, the second and third products are significant in terms of Paraguayan exports and show important rises: “Oilcake and other residues (except dregs)” increased from 2% to 12.3% and sales of “Soya beans” from 0.3% in 1963 to 23% (becoming the country’s main exporting product, even including services, in 2016). Moreover, British exports of “Passenger motor vehicles (excluding buses)” remained practically stable (5.2 to 5.3%), although the following relevant products in component 30 (“Parts, nes of the aircraft of heading 792” and “Medicaments (including veterinary medicaments)”) saw significant increases (from 0.3% to 3.6%, and from 0.6% to 5.3%, respectively).

A second group of (three) components shows significant falls over the period. In component 8, Chilean exports of “Copper and copper alloys, refined or not, unwrought” fell from 30.3% in 1962 to 22.6% in 2016, while “Copper ore and concentrates; copper matte; cement copper” exports decreased from 33.1% to 19.1%. Also, Finnish (component 17) “Wood of coniferous species, sawn, planed, tongued, grooved, etc” exports fell from 21.4% (1962) to 2.7% (2016), while Pakistan (component 28) saw a shrinking share of its “Raw cotton, excluding linters, not carded or combed” exports, from 9.8% to 0.2%.

Another result worth highlighting is one component with a mixture of second or third main products (which still have a probability similar to that of the first one) showing both significant falls and increases in exports. In component 13, Switzerland’s exports of “Watches, watch movements and case” fell from 12.7% (in 1962) to 6.5% (in 2016), but “Gold, non-monetary (excluding gold ores and concentrates)” rose from 0% to 28% over the same period.

A fourth group is formed by two components that show relatively constant trade over the period. In component 4, Macao experienced stable “Footwear” exports (from 4.1% in 1962 to 3.9% in 2016), and hence its emergence can probably be explained by its significant share in services exports (with tourism taking 88.8%). On the other hand, Germany (in component 11) exported 8% (in 1962) and 11.2% (in 2016) in “Passenger motor vehicles (excluding buses)”, although its preponderance may be due to the fact that it is the main world exporter of this good (see below).

A singularity of this LDA trade data application is that in some (five) components it singles out countries with a short time series due to their shorter data history, as mentioned to explain Czechoslovakia in component 5 in the previous subsection. This is the case of the aforementioned (component 7) Turkmenistan (with data from 1992), while Réunion data ranges over the 1962-1995 period and it mainly exports “Sugars, beet and cane, raw, solid” (third main product from component 9, with a 4% probability), with its exports basket showing an important concentration of [...]

(Footnote: The remaining products in the component are not relevant in terms of the country’s export shares.)

(Footnote: This country’s exports are stable in the other two main products of the component: “Rice, semi-milled or wholly milled” (6% to 7.3%) and very low in “Precious jewellery, goldsmiths’ or silversmiths’ wares” (0% to 0.2%).)

(Footnote: Other products within this component do not seem to be relevant in the country’s exports basket: “Children’s toys, indoor games, etc” fall from 0.6% to 0.1% and “Outerwear knitted or crocheted, not elastic nor rubberized; jerseys, pullovers, slip-overs, cardigans, etc” from 0.4% to 0.1% over the same period.)

Further, only one component (22) does not show a particular regularity that can explain the representative country (Ghana): its main product (“Petroleum gases and other gaseous hydrocarbons, nes, liquefied”, with a 38% probability) is currently mainly exported by Qatar, rising from 0.2% (in 1975) to 22.4% (in 2016).

Finally, another interesting fact derived from our LDA model is that there is one product (“Passenger motor vehicles (excluding buses)”) captured as the main one in six of the 30 components (3, 6, 10, 11, 17 and 30). This seems to reflect different export basket specialisations in the main country for each component (respectively, Belgium, Japan, Mexico, Germany, Finland and UK). As previously mentioned, Germany (component 11) has been the main exporter of this product over the whole period (albeit with a falling share, from 37.6% to 22.1% of total exports), while the Japanese share (component 6) grew from 1.9% to 13.5%, those of the UK and Belgium fell (from 19.6% to 5.9%, component 30; and from 4.7% to 3.8%, component 3; respectively), Mexico’s rose (from 0% to 4.7%; component 10), and Finland’s was the lowest (from 0.1% to 1.8%; component 17).
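The notion of the main country of a component, described earlier as the country for which the component has the highest share on average over the time period, can be sketched as below. This is one plausible reading, assumed here for illustration: each country's shares are averaged over the years in which that country appears, so a country with a short series is averaged over fewer observations. The function name and the toy shares are hypothetical.

```python
from collections import defaultdict

def main_country_per_component(theta):
    """Return, for each component, the country with the highest average share.

    `theta` maps (country, year) to the list of component shares of that
    country-year. Averages are taken over the years in which each country appears.
    """
    sums = {}
    n_years = defaultdict(int)
    for (country, _year), shares in theta.items():
        acc = sums.setdefault(country, [0.0] * len(shares))
        for k, share in enumerate(shares):
            acc[k] += share
        n_years[country] += 1

    averages = {c: [s / n_years[c] for s in acc] for c, acc in sums.items()}
    n_components = len(next(iter(averages.values())))
    return [max(averages, key=lambda c: averages[c][k]) for k in range(n_components)]

# Hypothetical component shares for two countries and three components.
theta = {
    ("A", 1962): [0.7, 0.2, 0.1],
    ("A", 1963): [0.5, 0.3, 0.2],
    ("B", 1962): [0.1, 0.1, 0.8],   # B has one year only, so its average uses one observation
}
leaders = main_country_per_component(theta)
print(leaders)
```

Under this reading, a country observed only in years when a component was prominent can end up as that component's main country, which is consistent with the short-series cases discussed above.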

(Footnote: The UK and India were, respectively, the main exporters of this product in 1962 and 2016. This result may be reflecting Botswana’s relative comparative advantage in diamonds (i.e. a large share within its exports basket vis-à-vis the world average).)

(Footnote: Conversely, Ghana’s exports rose in “Palm oil” (with 10% probability in the component) but from 0.01% to 0.6%, “Natural rubber latex; natural rubber and gums” (6%), from 0.1% to 0.2%, and “Cocoa butter and paste” (1%; from 2.8% to 4.4%). Over the period, Ghanaian exports fell in “Sawlogs and veneer logs, of non-coniferous species” (4%) from 7.6% to 0.9% and “Cocoa beans, raw, roasted” (3%) from 59.8% to 16.9%, “Wood, non-coniferous species, sawn, planed, tongued, grooved, etc” (3%) from 6.1% to 0.9%, “Plywood consisting solely of sheets of wood” (3%) from 0.4% to 0.02%, and remained stable in “Tin and tin alloys, unwrought” (2%) and “Palm kernel oil” (1%), both with null (or almost null) exports.)

The present work proposes applying a technique widely explored in Natural Language Processing to the field of international trade. By shifting the data domain from text to each country’s export flows of each product, we managed to develop a typology of global trade based on a number of latent components. This allows us to do two things. On the one hand, we build an automatic classification of products based on data. On the other, we are able to study the trends in countries’ exports, based on those components. Our findings are mostly in line with the specialized literature for each country or region, showing that this methodology is able to give insight into the position of countries’ exports in global trade, making use of a single type of metric. Given that this methodology requires a minimum number of arbitrary decisions to be built, it turns out to be an interesting complement to the traditional forms of analysis.

The limit of the proposed methodology is its dependence on the data inputs. Decisions made with respect to the curation of the dataset can potentially affect all the results. If the dataset used [...] th century, the resultant components would be very different from the ones presented in the article due to the larger set of technologies involved, and the optimal number of components would probably increase. On the other hand, if a country is restricted to a subset of the years considered, it will have an overall closer relation with components specialized in technologies of that time-frame, as in the case of Czechoslovakia. Even though each country-year weighs the same in the optimization of the model, i.e. we are not considering the weight of the total exports of each country-year in the cost function, countries with larger exports tend to show smoother results, as is the case of China. This is because higher exports make it difficult for a specific product to drastically change its proportion in the total exports of the country from one year to another. Small countries are prone to sudden changes in the proportion of components, because a small change in the nominal value of the exports of any specific product implies a big change in its proportion of the total basket of exports. There is also an interesting phenomenon that occurs in the model with countries that have a highly concentrated export basket. For the OPEC countries we can see a drastic change by the end of the ’70s. If we take the case of Iraq, for example, it goes from an equal distribution over components 20 and 12 to 100% in component 12 some years later.
The distribution on the original SITC classification shows that this country exported 61.68% in "Crude petroleum" and 36.5% in "Petroleum products, refined" in 1977, and the next year this changed to 85.03% and 12.59% respectively. This implies an increase of more than 23 percentage points of the overall basket in a single product. Still, it is not a change of 50 percentage points as shown by the proposed model. The explanation for this is that both latent components, 12 and 20, include crude and refined petroleum in different proportions. The model infers that the refined petroleum exported from 1978 onwards comes from a different latent component than the one exported previously. We can say that if a country's exports can be correctly described with only two products, as in this case, using a model like LDA is not necessary for studying the export basket.

Another interesting phenomenon that this model cannot fully capture is the case where bilateral interactions involve both imports and exports of highly complex products, and where one of the poles only performs a simple step in the production, like the mentioned Mexican maquilas. As we only use exports data, the model can only account for half of the process, producing potentially misleading conclusions if not used carefully. This problem, however, will arise in every metric that only accounts for exports.

Benchmarking the results of the LDA model is a complicated task, as it is an unsupervised model. The best model should be the one that gives the most interpretable results and that can be used for the most insightful analysis. To test our model, we tried three other approaches for the same task: finding the latent dimensions of international trade. First, we tried two other methods traditionally used for Topic Modeling in Natural Language Processing, namely Latent Semantic Analysis (LSA) (Landauer et al. 2013) and Non-Negative Matrix Factorization (NMF) (Lee and Seung 1999). Then, we tried to adapt the product space (C. A. Hidalgo et al. 2007; C. Hidalgo and Hausmann 2009) to achieve the same task as LDA, by using clustering techniques (Kaufman and Rousseeuw 1987). The three techniques showed results that are in line with the ones found by LDA, but at a lower level of detail, where the interpretation of results became a harder task.

It is interesting to look at the feasibility of the model given the change in the domain of the problem. The very different nature of the data traditionally used in text mining and Topic Modelling, with respect to international trade data, raises the question of whether the model can operate in the new domain. However, in terms of data structure, both problems have more similarities than it seems. First, the traditional dimension of the problem is NxV (N observations, in the order of magnitude of thousands, and V the vocabulary, also in the order of magnitude of thousands). In this case, the problem is approximately NxP, where the N observations are the year-country pairs, with 250 countries and 54 years, and P products, which in SITC at 4 digits are approximately 750. In other words, we are in an order of magnitude similar to that of a small dataset in a traditional Topic Modelling problem. Finally, an important change between both domains is the difference between the frequency of words in a text (tens or hundreds, depending on the size of the documents) and the dollars exported of a product by each country-year (millions or billions). This difference in principle should not affect the model, since what the model considers in its optimization are the distributions between the different elements (word frequencies or exported values per product) and not the absolute values.

As future lines of work, as results are deeply connected with the input dataset, new data sources could provide different insights. For example, while our period seems long enough to reflect structural changes, economic historians could find an even longer time series more useful to describe some phenomena.
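As a side note on the data-structure argument made in this discussion, the following minimal sketch shows how raw export values could be arranged into one share vector per country-year, so that two exporters of very different size end up with the same profile. The function name `export_shares`, the record layout and the toy values are assumptions made for illustration; they are not taken from the paper's dataset or code, and the sketch says nothing about how a particular LDA implementation treats row totals.

```python
from collections import defaultdict


def export_shares(records):
    """Collapse (country, year, product, value) records into share vectors.

    Returns {(country, year): {product: share}}, where the shares of each
    country-year sum to 1. (Hypothetical helper, for illustration only.)
    """
    values = defaultdict(lambda: defaultdict(float))
    for country, year, product, value in records:
        values[(country, year)][product] += value

    shares = {}
    for key, row in values.items():
        total = sum(row.values())
        if total > 0:  # skip country-years with no recorded exports
            shares[key] = {p: v / total for p, v in row.items()}
    return shares


# Hypothetical toy records in US dollars (not taken from the paper's dataset).
records = [
    ("Large", 1990, "crude", 8.0e9),
    ("Large", 1990, "refined", 2.0e9),
    ("Small", 1990, "crude", 8.0e5),
    ("Small", 1990, "refined", 2.0e5),
]
shares = export_shares(records)

# Upper bound on the number of rows N implied by the figures quoted in the text
# (250 countries, 54 years); the number of columns P is approximately 750.
n_rows_upper_bound = 250 * 54
```

In this toy case both country-years map to the same distribution over products, although their dollar values differ by four orders of magnitude.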
Future work could also include services in the dataset, which could show different aspects of global trade that cannot be captured in an analysis covering only trade in goods. That said, data limitations would pose a trade-off, as this would imply either a lower product disaggregation or a shorter time series dataset. Other lines of work involve an exploration by country groups, in order, for example, to explore specialisation or complementarity among countries' export baskets, e.g. within a regional trade bloc.

As a final remark, we do not think these new types of techniques will be able to replace traditional metrics and empirical work on international trade; rather, we intend to complement traditional analysis and bring a new tool that might help in the understanding of this field.

Conflict of interest

The authors declare that they have no conflict of interest.

Acknowledgment

The Doctoral Training Unit Data-driven computational modelling and applications (DRIVEN) is funded by the Luxembourg National Research Fund under the PRIDE programme (PRIDE17/12252781), https://driven.uni.lu

This research was partly funded by the Préstamo BID - Proyecto de Investigación Científica y Tecnológica (PICT) 2016-1185.

The authors would like to acknowledge useful discussions with Daniel Heymann, Daniel Aromí and Jun Pang.

References

[1] Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. “On the surprising behavior of distance metrics in high dimensional space”. In: International conference on database theory. Springer. 2001, pp. 420–434.

[2] L GA Alves et al. “The nested structural organization of the worldwide trade multi-layer network”. In: Scientific Reports

Physica A: Statistical Mechanics and its Applications

World Bank Staﬀ Working Paper

The Manchester School

Nature Physics

Phys. Rev. E 89 (2014), p. 032804.

[8] Luis Bértola and José Antonio Ocampo. Desarrollo, vaivenes y desigualdad. Una historia económica de América Latina desde la Independencia. Secretaría General Iberoamericana = Secretaria-Geral Ibero-Americana, 2010.

[9] David M Blei, Andrew Y Ng, and Michael I Jordan. “Latent dirichlet allocation”. In: Journal of machine Learning research

Poetics

Scale-free networks: complex webs in nature and technology. Oxford University Press, 2007.

[12] Guido Caldarelli et al. “A network analysis of countries’ export flows: firm grounds for the building blocks of the economy”. In: PloS one

La renta de la tierra: formas, fuentes y apropiación. Ediciones Imago Mundi, 2017.

[14] Hollis Burnley Chenery et al. Industrialization and growth. Oxford University Press New York, 1986.

[15] Neil M Coe et al. “‘Globalizing’ regional development: a global production networks perspective”. In: Transactions of the Institute of British geographers

Nueva Sociedad 244 (2013), pp. 84–96.

[17] L Ermann and D L Shepelyansky. “Ecological analysis of world trade”. In: Physics Letters A

Journal of Economic Interaction and Coordination. Vol. 2. 2005, 524–531 vol. 2.

[20] Gary Gereffi. “The organization of buyer-driven global commodity chains: how US retailers shape overseas production networks”. In: Commodity Chains and Global Capitalism (1994), pp. 95–122.

[21] Gary Gereffi, John Humphrey, and Timothy Sturgeon. “The governance of global value chains”. In: Review of international political economy

PLOS ONE. url: https://doi.org/10.1371/journal.pone.0197575

[23] Jon Haveman and David Hummels. “Alternative hypotheses and the volume of trade: the gravity equation and the extent of specialization”. In: Canadian Journal of Economics/Revue canadienne d’économique

Science. issn: 00368075.

[25] César A. Hidalgo. “The Dynamics of Economic Complexity and the Product Space over a 42 year period”. In: CID Working Papers. issn: 6507247197.

[26] César Hidalgo and Ricardo Hausmann. “The building blocks of economic complexity”. In: Proceedings of the National Academy of the Sciences of the United States of America. issn: 0027-8424.

[27] Diane J. Hu and Lawrence K. Saul. A Probabilistic Topic Model for Music Analysis. 2009.

[28] Leonard Kaufman and Peter J Rousseeuw. “Clustering by means of medoids. Statistical Data Analysis based on the L1 Norm”. In: Y. Dodge, Ed (1987), pp. 405–416.

[29] S. Kim, S. Narayanan, and S. Sundaram. “Acoustic topic model for audio information retrieval”. In: . 2009, pp. 37–40.

[30] M. Kivela et al. “Multilayer networks”. In: Journal of Complex Networks

Oxford development studies

World development

Handbook of latent semantic analysis. Psychology Press, 2013.

[35] Daniel D Lee and H Sebastian Seung. “Learning the parts of objects by non-negative matrix factorization”. In: Nature

DT IIEP (2016), pp. 1–59.

[37] N. Padhy, D. Mishra, and R. Panigrahi. “The survey of data mining applications and feature scope”. In: arXiv preprint arXiv:1211.5723 (2012).

[38] Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. “Inference of population structure using multilocus genotype data”. In: Genetics

American Journal of Political Science

BMC Bioinformatics 16 (2015), S2–S2.

[41] Michael Ross. The oil curse: How petroleum wealth shapes the development of nations. Princeton University Press, 2012.

[42] Mika J Straka, Guido Caldarelli, and Fabio Saracco. “Grand canonical validation of the bipartite international trade network”. In: Physical Review E

Standard International Trade Classification Revision 2. ST/ESA/STAT/SER.M/34/Rev.2. Series M: Miscellaneous Statistical Papers, No.34 Rev.2, New York: United Nations. 1975.

Appendix A Model with k=2

Figure 3 displays the mentioned interface in the case of k = 2, showing each 4-digit SITC (Rev. 2) code and its product description, together with its individual and accumulated probabilities within the component. Further, Figure 3(a) shows that the distribution of the first component assigns a large weight to crude oil, followed by other petroleum products (e.g. diesel oil, propane gas, etc.). Hence, a plausible label for such a component would be "Petroleum and derivatives". However, it is also worth noting that component 1 holds other products as well, such as coal and metals (e.g. iron, gold and copper). Figure 3(b) shows the distribution of the second component (with k = 2), which is more homogeneous than the first component, as the first product weighs only 5%, and the most outstanding products are passenger vehicles, electronic microcircuits, parts and accessories, etc. Hence, this component can be labelled to represent manufactured products in general.

Moreover, Figure 4 shows the components' distribution (for k = 2) according to the mentioned classification developed by Lall (2000). According to that figure, the first component is essentially composed of primary products and manufactures that use primary products as inputs. On the other hand, component 2 presents a more uniform distribution, where medium and high technology manufactures (e.g. engineering and electronics) stand out.

However, it is worth noting that for k = 2, agricultural, livestock and forestry products cannot be singled out in a single component. That said, an interesting finding is that the division of the product space into only two groups allows the LDA model to find a first component mainly formed by petroleum products (and derivatives), while the other holds mostly manufactured products (SITC 5-8). In this sense, such a model could help in understanding the classic corollary of comparative advantage models, where developed countries export manufactures (i.e. component 2) while developing countries specialise their trade in raw materials (Balassa 1979). Some of the literature assigns a particular role to oil production (and exports) within an economy's structure (Ross 2012; Carrera 2017). In this sense, with k = 2, oil-producing countries' exports seem to lead the LDA model in finding its optimum by building one of the two components with such products. However, this dichotomy should be taken with care in the case of petroleum. As Ross (2012) states, estimates of the resource curse of oil-producing countries may be biased upward in poorer countries when measured through their dependence on hydrocarbon exports, which can produce "spurious associations between oil export dependence and a variety of economic and political maladies that are highly correlated with low incomes". This is hence an arguable statement, as oil exports reflect an indirect measure of a country's non-oil economic size, although the so-called "Dutch Disease" in oil-exporting countries has also often crowded out their agricultural and manufacturing exports due to the cited comparative advantage (Ross 2012).

Figure 3: Screenshot of the interface for component characterization, highlighting the proportion of each product in the component and the cumulative distribution, for k=2. (a) First component. (b) Second component.

(a) First component (b) Second component
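The table described for Figure 3, with each product's individual probability within a component and the accumulated probability, can be approximated with a few lines of code. The sketch below is only an illustration under stated assumptions: the function name `component_table` and the toy weights are hypothetical and are not the fitted values or the interface code from the paper.

```python
def component_table(product_probs, top_n=5):
    """Rank products by probability within one component.

    Returns a list of (product, probability, cumulative probability) rows
    for the top_n products, most probable first.
    (Hypothetical helper, for illustration only.)
    """
    ranked = sorted(product_probs.items(), key=lambda kv: kv[1], reverse=True)
    rows, cumulative = [], 0.0
    for product, prob in ranked[:top_n]:
        cumulative += prob
        rows.append((product, prob, cumulative))
    return rows


# Hypothetical weights for illustration only; not the fitted values of the paper.
toy_component = {
    "crude oil": 0.40,
    "diesel oil": 0.20,
    "propane gas": 0.15,
    "coal": 0.15,
    "iron": 0.10,
}

for product, prob, cumulative in component_table(toy_component, top_n=3):
    print(f"{product:<12} {prob:7.2%} {cumulative:7.2%}")
```

Reading the cumulative column shows how concentrated a component is: a component dominated by one product reaches a high accumulated probability after very few rows, while a more homogeneous one, such as the second component described above, accumulates slowly.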