Latent Dirichlet Allocation Models for World Trade Analysis
Diego Kozlowski ∗1, Viktoriya Semeshenko, and Andrea Molinari
DRIVEN, FSTM, University of Luxembourg, Luxembourg
Universidad de Buenos Aires. Facultad de Ciencias Económicas. Buenos Aires, Argentina
CONICET-Universidad de Buenos Aires. Instituto Interdisciplinario de Economía Política de Buenos Aires. Buenos Aires, Argentina
Abstract
International trade is one of the classic areas of study in economics. Nowadays, given the availability of data, the tools used for the analysis can be complemented and enriched with new methodologies and techniques that go beyond the traditional approach. This paper applies Latent Dirichlet Allocation, a well-known technique from Natural Language Processing, to search for latent dimensions in the product space of international trade and for their distribution across countries over time. We apply this technique to a dataset of countries' exports of goods from 1962 to 2016. The findings show that it is possible to generate higher-level classifications of goods based on empirical evidence, and to study the distribution of those classifications within countries. The latter yields interesting insights into countries' trade specialisation.
Keywords:
COMTRADE Data, Data Analysis, Topic Modelling, Latent Dirichlet Allocation, Unsupervised Learning
The role that countries play in the global market is profoundly determined by their insertion into global value chains, and by the types of goods they produce for the global market (Coe et al. 2004; Gereffi, Humphrey, and Sturgeon 2005; Gereffi 1994).
Production systems, which were traditionally analyzed as almost independent national systems, are now increasingly connected on a global scale. Due to the increasingly complex and interconnected nature of global supply chain networks, a recent strand of research has applied network science methods to model global supply chain growth and subsequently analyse various topological features of these structures. This analysis depends on the dataset in use, as it defines the topology of the network.
In recent years, we have witnessed a continuous growth of available data. This situation also poses a great challenge, namely, how to extract hidden relations and determine appropriate patterns, clusters and trends in order to draw valuable conclusions from such large volumes of data (Padhy, Mishra, and Panigrahi 2012).
Traditional analysis tools cannot handle such complexity alone, because extracting and analysing the information requires time and effort. On the other hand, interdisciplinary sciences provide different techniques and tools that can be applied to the analysis of this volume of data. The application of network formalism in the field of socioeconomic science has experienced unprecedented growth in recent decades (Barabasi 2011; Caldarelli 2007; Ermann and Shepelyansky 2013; Fagiolo, Squartini, and Garlaschelli 2013). Also, there is a wide literature that studies international trade at the product level (Balassa 1965; Lall 2000; Lall, Weiss, and Zhang 2006; Haveman and Hummels 2004). In particular, these connections can be analyzed as a bipartite graph among countries and products (Guan et al. 2018; Straka, Guido Caldarelli, and Saracco 2017; Araújo and Ferreira 2016; Guido Caldarelli et al. 2012), and the complexity of production can be explored using the product space (César A. Hidalgo 2009; C. Hidalgo and Hausmann 2009; C. A. Hidalgo et al. 2007). The world trade network can also be examined using multiplex and multilayer networks (Battiston, Nicosia, and Latora 2014; Kivela et al. 2011; Alves et al. 2019).
In this paper, we adopt a different approach to extract interesting and significant patterns from bilateral trade data, using the Latent Dirichlet Allocation (LDA) modelling technique (Blei, Ng, and Jordan 2003). Topic models have emerged as an effective method for discovering useful structure in data, and LDA is a statistical approach used in topic modeling for discovering hidden topics in large corpora of text.
Recently, a growing number of researchers have begun to apply topic models to various kinds of datasets, not only document collections (Pritchard, Stephens, and Donnelly 2000; Rosa et al. 2015; Fei-Fei and Perona 2005; Kim, Narayanan, and Sundaram 2009; Hu and Saul 2009). To the best of our knowledge, our work is the first effort to adapt and apply this technique to countries' exports.
We find the analogy between topic modeling in texts and trade very suitable. In our adaptation of LDA, a set of countries plays the role of text documents, products play the role of words, and components (i.e. latent dimensions within which these products group) play the role of topics. Based on the model of Blei, Ng, and Jordan 2003, we suggest a generative process to detect these latent dimensions in the product space and build an alternative trade nomenclature directly from data. Then, using these latent dimensions, we analyze those components' participation within countries' export baskets.
Our main contributions and results can be summarized as follows: we develop a generative model, based on a well-established methodology usually used in the field of Natural Language Processing, to study international trade flows. This model looks for an automatic grouping of the products into latent components. We study these latent components, characterizing each by type of production, complexity and its relation to a specific country over time. Then, we use the components to briefly characterize the role in global trade of different groups of countries. The results that emerge from our model are in line with the specialized economic and trade history literature.
The paper is organized as follows: in section 2 we describe the dataset in use, in section 3 we introduce the notation and explain the methodology applied in the model, in section 4 we present the results obtained, and in section 5 we conclude.
2 Data
To apply the LDA technique, we used the United Nations Commodity Trade Statistics Database (COMTRADE) dataset of each country's disaggregated (four-digit) exports from the Center for International Development at Harvard University. This dataset contains trade data for around 250 countries and territories, and takes the raw trade data on goods from countries' reporting to the United Nations Statistical Division (COMTRADE).
We used these data instead of the raw COMTRADE statistics because the raw statistics may contain some inconsistencies. To address this issue, the Center for International Development uses the Bustos-Yildirim Method to clean data and "account for inconsistent reporting practices and thereby generate estimates of trade flows between countries". This method assumes that, since these data are recorded both as exports and as imports, cross-referencing countries' reported trade flows against each other can produce reliable estimates. It consists of first correcting bilateral import values and then comparing them to the reverse flows reported by the exporting partner. Their (per-country) estimated index of reliability for reporting trade flows measures the consistency of trade totals reported by all exporter and importer combinations over time. Finally, they generate their own estimates of trade values using the data reported by countries together with this reliability index.
Bilateral trade flows are mainly recorded in two trade classification systems, the Harmonized System (HS) and the Standard International Trade Classification (SITC), and the data have four dimensions: exporter, importer, product, and year. While both classifications are valid, there is a "time versus disaggregation" trade-off in deciding which dataset to use. SITC data have a longer time series (1962-2016) but cover fewer goods (i.e. at higher levels of aggregation, up to 4 digits, approximately 750 products). On the other hand, HS data, being a newer classification, offer a more contemporary and detailed classification of goods (i.e. disaggregated up to 6 digits, with approximately 5,000 goods), but with the downside of a relatively shorter time period (1995-2017).
We chose to work with SITC (in this case Revision 2) in order to have a longer time series, at the cost of slightly more aggregated data (i.e. 4 instead of 6 digits) (United Nations Statistics Division 1975). Moreover, we reckon that 750 products should be enough to apply the LDA technique, as it should allow for enough (but not too much) granularity when labelling the components. For this dataset, we make an empirical search for the best number of latent dimensions.
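The four dimensions described above (exporter, importer, product, year) can be collapsed into one export basket per exporter and year, which is the unit of observation used in the rest of the paper. The following is a minimal sketch under stated assumptions: the record layout, the country codes and the four-character product codes are hypothetical placeholders, and the functions `build_country_year_baskets` and `to_shares` are illustrative names, not part of the Center for International Development data tooling.

```python
from collections import defaultdict

# Hypothetical bilateral records:
# (exporter, importer, product code, year, export value in USD).
# All codes and values are placeholders, not real trade data.
records = [
    ("AAA", "BBB", "1111", 1962, 120.0),
    ("AAA", "CCC", "1111", 1962, 80.0),
    ("AAA", "BBB", "2222", 1962, 50.0),
    ("BBB", "CCC", "3333", 1962, 300.0),
    ("BBB", "AAA", "3333", 1963, 200.0),
]


def build_country_year_baskets(rows):
    """Sum export values over importers, keyed by (exporter, year)."""
    baskets = defaultdict(lambda: defaultdict(float))
    for exporter, _importer, product, year, value in rows:
        baskets[(exporter, year)][product] += value
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(products) for key, products in baskets.items()}


def to_shares(basket):
    """Express each product as a share of the basket's total exports."""
    total = sum(basket.values())
    return {product: value / total for product, value in basket.items()}


baskets = build_country_year_baskets(records)
shares_aaa_1962 = to_shares(baskets[("AAA", 1962)])
print(shares_aaa_1962)
```

Each key of `baskets` plays the role of a document and each product code the role of a word; dropping the importer dimension reflects the fact that only exports by product enter the model.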
In this section we describe a probabilistic model constructed to study the trade flow data with the aim of generating an automatic grouping of the products.
This cannot be achieved using traditional clustering techniques in high-dimensional spaces (Aggarwal, Hinneburg, and Keim 2001), because a product can be used or consumed as an intermediate and/or final product at the same time, which means that groups cannot be exclusive (Molinari and De Angelis 2016). Therefore, the problem we are dealing with can be examined with fuzzy clustering.
At the same time, we need to mitigate high-dimensional data issues through dimensionality reduction. This is possible because we can exploit similarities in the $\mathbb{R}^{N \times P \times Y}$ space, that is, the interaction of $N$ countries, $P$ products and $Y$ years.
[Footnote: This dataset was extracted in March 2019. See https://atlas.cid.harvard.edu/about-data for more details.]
[Footnote: Imports are reported CIF (i.e. including freight and insurance costs) and exports free on board (FOB).]
We find it appropriate to use LDA to group products. While Blei, Ng, and Jordan 2003 look for a latent dimension of $k$ topics, embedded in a high-dimensional dictionary distributed over the texts that compose the corpus, here we look for a latent dimension of $k$ components, embedded in a high-dimensional classification of products distributed across countries over the years.
We use the following terms to define our probabilistic topic model:
• product is a basic discrete unit of analysis, defined as an item in a classification (SITC). We represent products using unit-basis vectors, where the superscript $i$ stands for the $i$-th product in the classification and the $i$-th element in the vector. The $v$-th product of the classification is the vector $w$ such that $w^v = 1$ and $w^u = 0$ for $u \neq v$.
• country-year is a sequence of $N$ products, defined as $W = (w_1, w_2, \ldots, w_N)$.
• corpus is the collection of $M$ country-years, defined as $D = (d_1, d_2, \ldots, d_M)$.
• component is a latent dimension on the corpus, defined as $K$.
The objective behind the classification of the products is twofold: on the one hand, to look for a distribution of components over each country-year; on the other, to analyse the distribution of the products within each of the components.
In the original model proposed by Blei, Ng, and Jordan 2003, the words are assumed to be random realisations of chained distributions, ignoring the order in which the words appear in the document. Even though we know that the real data generating process is far from what our model proposes, this inference process can still provide useful insight into the latent dimensions we are looking for. The basic idea of the generative process is that, given the amount of dollars exported by a country in a specific year, the assignment of the product that will be exported comes from a random mixture over latent components, where each component is characterised by a distribution over products.
The sequence of the data generation can be described as follows:
• For each country-year in the corpus, we assume that exports come from the following two-stage process:
  – choose randomly a distribution for the components,
  – for every dollar exported:
    ∗ choose randomly the component to which it belongs, and
    ∗ choose randomly a product from the distribution corresponding to that component.
The data generating process can be formalised as follows:
1. For every component $k \in \{1, 2, \ldots, K\}$: $\beta_k \sim \mathrm{Dir}(\eta)$, where $\eta \in \mathbb{R}_{>0}$ is fixed.
2. For each country-year $d \in \{1, 2, \ldots, D\}$:
  • Generate a vector of component proportions $\theta_d \sim \mathrm{Dir}(\alpha)$, where $\alpha \in \mathbb{R}_{>0}$ is fixed.
  • For every exported dollar:
    (a) generate an allocation of the component $z_{dn} \sim \mathrm{Mult}(\theta_d)$;
    (b) assign the product $w_{dn} \sim \mathrm{Mult}(\beta_{z_{dn}})$.
A Dirichlet process is a family of stochastic processes whose realizations are themselves probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables, that is, how likely it is that the random variables are distributed according to one or another particular distribution.
The parameters defining the Dirichlet distribution (here, $\eta$ and $\alpha$) determine the degree of concentration of the resulting distributions. For a $\mathrm{Dir}(\alpha)$ distribution, $\alpha$ defines the degree of symmetry of the multinomial distributions that the process generates. With values much smaller than 1, the resulting distributions will be highly concentrated on some elements, while values much larger than 1 generate very uniform distributions. In terms of our problem, $\alpha$ controls the mixture of components for any given country, and $\eta$ controls the distribution of products per component. A very small $\alpha$ means that each country has few characteristic components, while a very small $\eta$ produces a very asymmetric distribution over the products, so that there are a few very important products and the rest have almost null probability.
In this section we present results for the analysis of components, looking first at their distribution and then at their main country over time. We confine our analysis to the period 1962-2016, for the 250 reporting countries and $P$ products (goods, not services), which, as mentioned, in SITC Rev. 2 (United Nations Statistics Division 1975) at 4 digits are approximately 750. In other words, we work within an order of magnitude similar to that of a regular dataset in a traditional Topic Modelling problem.
As mentioned before, prioritising a longer time series, we decided to use the SITC (Rev. 2) 4-digit nomenclature.
In the following sections, results are discussed in two stages. We first walk the reader through the decision on the number of components, also discussing the labelling process adopted, and then analyse the evolution of exports in the main country for each component.
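The generative process formalised in the methodology (a product distribution per component drawn from Dir(η), component proportions per country-year drawn from Dir(α), then a component and a product drawn for every exported dollar) can be made concrete with a small forward simulation. This is a minimal sketch under stated assumptions: it only samples from the model and does not perform the inference used to obtain the results; the sizes `K`, `P` and `n_dollars` are toy values; and the helper names are hypothetical. Dirichlet draws are obtained by normalising Gamma variates, a standard construction.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalising Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def simulate_country_year(rng, betas, alpha, n_dollars):
    """Simulate one country-year: theta ~ Dir(alpha), then one
    component z and one product w for every exported dollar."""
    n_components = len(betas)
    n_products = len(betas[0])
    theta = sample_dirichlet(rng, alpha, n_components)
    exports = [0] * n_products
    for _ in range(n_dollars):
        # z ~ Mult(theta): component to which this dollar belongs.
        z = rng.choices(range(n_components), weights=theta)[0]
        # w ~ Mult(beta_z): product drawn from that component.
        w = rng.choices(range(n_products), weights=betas[z])[0]
        exports[w] += 1
    return theta, exports


rng = random.Random(0)
K, P = 3, 8             # toy sizes; the paper uses k = 30 and about 750 products
eta = alpha = 1.0 / K   # same form as the defaults reported in the paper
betas = [sample_dirichlet(rng, eta, P) for _ in range(K)]
theta, exports = simulate_country_year(rng, betas, alpha, n_dollars=1000)
print([round(t, 3) for t in theta])
print(exports)
```

With concentrations below 1, as here, most simulated baskets are dominated by one or two components and a handful of products, which is the behaviour described for small α and η.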
In this subsection, we first explain the trade-off between granularity and economic interpretation faced when using trade data with LDA. We then describe the process of finding the most suitable number of components ($k$) for our problem and the labelling of each component, and conclude with some reflections on the findings for the chosen $k$.
[Footnote: In this case $\eta = 1/K$.]
[Footnote: In this case $\alpha = 1/K$.]
The hyperparameter $k$ stands for the number of components and plays a fundamental role in the model. Fewer components (i.e. small $k$) will tend to reflect broader concepts. On the other hand, if $k$ is larger than the cardinality of the latent space (i.e. the implicit space for the grouped products is smaller than the number of proposed components), this can generate repeated or over-specific components. In other words, in our case this issue poses a trade-off between granularity and well-defined (i.e. easily "taggable") components.
First, we ran the model for various values of $k$ ($k = 2, \ldots$). A first result, observed for all these values of $k$, is that the components that group best are those containing petroleum and derivatives, electronics, machinery and textiles. As mentioned above, the hyperparameter $k$ defines the components' specificity. However, phenomena worth exploring (and amenable to economic interpretation) can be found at different levels of granularity. Hence, a first problem when analysing the different exercises is to define a suitable granularity for the components. For relatively low values of $k$ (i.e. up to $k = 10$) the petroleum component always stands out. Conversely, from $k = 20$ we also find other sectors (e.g. electronic products, textiles, etc.) in some components, while others hold a mixture of products that is harder to rationalise as a latent dimension. For values of $k$ between 20 and 50 the resulting composition of each component is rather stable, giving a good balance between more easily interpretable (i.e. taggable) components and an interesting level of granularity. For values of $k$ higher than 50, components tend to repeat themselves.
Figure 1 shows an irregular distribution of components (for $k = 2$ to $k = 50$). Some components stand out, such as the first component for $k = 2, \ldots$, or the eighth component for $k = 8, \ldots$. These components are mainly composed of petroleum products. This result can be interpreted in two (not mutually exclusive) ways. It could reflect that the dataset contains relatively more primary producing countries than industrial manufacturing ones, but it could also show that the former countries' exports are more concentrated on primary products.
A second step consisted of choosing an appropriate $k$, considering this trade-off between granularity and taggability. There is no clear optimum value for $k$, and although the literature within the text analysis domain has contributed some proposals, this parameter should come from a substantive search in which the topics (or components) found are closer to the object of study being analyzed (Bonilla and Grimmer 2013; Quinn et al. 2010).
To explore the distribution of products over components, and their cumulative function, we developed a dynamic dashboard. This distribution is plotted according to a widely used technological exports classification (Lall 2000). After a substantive search (which involved the aforementioned estimation, comparison and manual exploration of the model with different $k$), we feel comfortable choosing $k = 30$ as the number of components offering the best trade-off between enough (economically interpretable) granularity and relatively low repetition of components.
For $k = 30$, we also tested different values of $\eta$. As we want our components to have an asymmetric distribution, to facilitate their labelling, we ran the model with small values of $\eta$. Specifically, we tested the model for four such values.
Components' composition did not show substantive changes with different values of $\eta$, suggesting that the model is robust to variations in the priors. For this reason, and given that the default value $\alpha = 1/k$ gives good results in terms of countries' specialisation, we decided to keep the default values of $\alpha = \eta = 1/k$ for the chosen $k$.
[Footnote: In the limit, in our case, one could have one topic per product.]
[Footnote: This does not weight components by total exports, i.e. it shows countries on an equal basis.]
[Footnote: See https://diego-kozlowski.shinyapps.io/LDA_worldtrade/.]
[Footnote: Appendix A shows the step-by-step labelling exercise and some possible economic interpretation of results for $k = 2$.]
In a third and final step, we manually labelled the components for the chosen model ($k = 30$). The aforementioned dynamic dashboard (which shows each component's product composition, with the distribution plotted according to a widely used technological exports classification) helped the labelling exercise by re-grouping products within each component by their technological complexity. Frequently, even in text topic modeling, component labeling is quite difficult due to the lack of a generalisation criterion. We found that a downside of the LDA technique within the trade flows domain is that this issue is reinforced, since the subjective search for a comprehensive concept of products traded among countries can turn out to be a more complex task than searching for a general concept over a group of words. On the upside, polysemy, a frequent problem in texts, does not exist when using trade data, where all signifiers (classification indexes) refer to a single and unambiguous meaning. However, other new problems arise, e.g. deciding upon the trade nomenclature or the data disaggregation level (which could be associated with choosing the language of the corpus in text analysis).
In our model, we first observed that the usual practice of looking at the first ten elements of the distribution was not sufficient to find a general label for each component, and for this reason we developed a more comprehensive dashboard.
Table 1 shows the labels for our model (with $k = 30$), with a general description by component, except when that is not possible (e.g. component 19), together with a 'subgroup' that allows for a finer (or more detailed) product specification and, in the case of industrial products, the level of technological complexity (according to Lall 2000). Finally, the last column displays the country for which each component has the highest share (taking an average over the whole time period).
Component 5 is particularly interesting: although (as mentioned below) it is not concentrated in a few products, it groups products that were technologically complex at the beginning of the series (during the 1960s) but later fell into disuse or lost share in international trade. In this sense, it is unsurprising that Czechoslovakia is the most characteristic country of this component, given that, due to the country's dissolution in 1992, its time series is shorter than the rest.
With the components labelled, this subsection analyses each country's exports basket composition over the period under study (1962-2016). Since by definition our unit is the country-year, it is possible to compare the evolution of the components' distribution within each country. Below, we highlight some regularities that can be inferred from looking at the export shares of the main country in each component.
The following analysis presents one of the various possible analyses that could be performed with the LDA application proposed in this paper. Rather than being exhaustive, the intention of this section is to present results in a way that makes it possible to understand the possible derivations of the methodology. A previous time-series analysis for various country groups (i.e. covering some oil producing, North American, European, Latin American and Asian countries) focused on the evolution of those countries' export baskets. The overall conclusion is that the evolution of exports in Asia leaves a very different image from that of Europe. While in the latter the concentration of the EU countries' export baskets in a single component shows a national differentiation, this is not the case in Asian countries, which show a homogenization process.
LDA results for the Chinese exports structure provide an interesting example (see Figure 2). At the beginning of the 1960s the most relevant component (28) was composed of rice, cotton, tea and some textile products. This component shows a downward trend, while clothing, toys, etc. (component 4) increases and becomes the most important over the period 1980-2003. However, from 1993 textiles and toys start decreasing, with a simultaneous rise in component 23 (televisions, computers, microcircuits and transistors), which towards the last years of the period constitutes approximately 80% of the country's exports. This change in the specific nature of Chinese exports reflects three stages of increasing complexity of the country's manufacturing industry, starting from a basically agricultural economy and, after a period of low-complexity industrialisation, becoming one of the world's leading exporters of highly complex products (Chenery et al. 1986; Costantino 2013).
Among other findings, Kozlowski 2019 highlights the concentration of EU countries' export baskets in a single component that varies between countries, hence showing national differentiation, while most Asian countries tend to show much more homogeneous exports baskets.
Further, it finds a clear concentration of the exports baskets of the founding members of the Organization of the Petroleum Exporting Countries (OPEC), with some differences for Venezuela (mainly due to its particular history and some of its active public policies to diversify the country's exports; Bértola and Ocampo 2010). Another result worth mentioning is that our LDA model captures the exports specialisation in electronic products in the United States moving from analogue to digital technologies over the period of study, together with the Maquila phenomenon in Mexico.
[Footnote: That is, recording tapes, telephone lines or photographic paper.]
[Footnote: This means that its denominator is lower than that of the other countries of the dataset.]
Moreover, in component 25, the second and third products are significant in terms of Paraguayan exports and show important rises: “Oilcake and other residues (except dregs)” increased from 2% to 12.3% and sales of “Soya beans” from 0.3% in 1963 to 23% (becoming the country's main export product, even including services, in 2016). Moreover, British exports of “Passenger motor vehicles (excluding buses)” remained practically stable (5.2% to 5.3%), although the next most relevant products in component 30 (“Parts, nes of the aircraft of heading 792” and “Medicaments (including veterinary medicaments)”) saw significant increases (from 0.3% to 3.6%, and from 0.6% to 5.3%, respectively).
A second group of (three) components shows significant falls over the period. In component 8, Chilean exports of “Copper and copper alloys, refined or not, unwrought” fell from 30.3% in 1962 to 22.6% in 2016, while “Copper ore and concentrates; copper matte; cement copper” exports decreased from 33.1% to 19.1%.
Also, Finnish (component 17) exports of “Wood of coniferous species, sawn, planed, tongued, grooved, etc” fell from 21.4% (1962) to 2.7% (2016), while Pakistan (component 28) saw a shrinking share of its “Raw cotton, excluding linters, not carded or combed” exports, from 9.8% to 0.2%.
Another result worth highlighting is one component with a mixture of second or third main products (which still have a probability similar to that of the first one) showing both significant falls and significant increases in exports. In component 13, Switzerland's exports of “Watches, watch movements and case” fell from 12.7% (in 1962) to 6.5% (in 2016), but “Gold, non-monetary (excluding gold ores and concentrates)” rose from 0% to 28% over the same period.
A fourth group is formed by two components that show relatively constant trade over the period. In component 4, Macao experienced stable “Footwear” exports (from 4.1% in 1962 to 3.9% in 2016), and hence its emergence can probably be explained by its significant share in services exports (with tourism taking 88.8%). On the other hand, Germany (in component 11) exported 8% (in 1962) and 11.2% (in 2016) in “Passenger motor vehicles (excluding buses)”, although its preponderance may be due to the fact that it is the main world exporter of this good (see below).
A singularity of this LDA trade data application is that in some (five) components it singles out countries with a short time series, as mentioned to explain Czechoslovakia in component 5 in the previous subsection. This is the case of the aforementioned (component 7) Turkmenistan (with data from 1992), while Réunion's data cover the 1962-1995 period and it mainly exports “Sugars, beet and cane, raw, solid” (third main product of component 9, with a 4% probability), with its exports basket showing an important concentration of […]
[Footnote: The remaining products in the component are not relevant in terms of the country's export shares.]
[Footnote: This country's exports are stable in the other two main products of the component: “Rice, semi-milled or wholly milled” (6% to 7.3%) and very low in “Precious jewellery, goldsmiths' or silversmiths' wares” (0% to 0.2%).]
[Footnote: Other products within this component do not seem to be relevant in the country's exports basket: “Children's toys, indoor games, etc” fall from 0.6% to 0.1% and “Outerwear knitted or crocheted, not elastic nor rubberized; jerseys, pullovers, slip-overs, cardigans, etc” from 0.4% to 0.1% over the same period.]
Further, only one component (22) does not show a particular regularity that can explain the representative country (Ghana): its main product (“Petroleum gases and other gaseous hydrocarbons, nes, liquefied”, with a 38% probability) is currently mainly exported by Qatar, rising from 0.2% (in 1975) to 22.4% (in 2016).
Finally, another interesting fact derived from our LDA model is that one product (“Passenger motor vehicles (excluding buses)”) is captured as the main one in six of the 30 components (3, 6, 10, 11, 17 and 30). This seems to reflect different export basket specialisations in the main country for each component (respectively, Belgium, Japan, Mexico, Germany, Finland and the UK). As previously mentioned, Germany (component 11) has been the main exporter of this product over the whole period (albeit with a falling share from 37.6% to 22.1% over total exports), while the Japanese share (component 6) grew from 1.9% to 13.5%, those of the UK and Belgium fell (from 19.6% to 5.9%, component 30; and from 4.7% to 3.8%, component 3; respectively), Mexico's rose (from 0% to 4.7%; component 10), and Finland's was the lowest (from 0.1% to 1.8%; component 17).
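The "main country" of a component used throughout this section is the country with the highest share of that component, averaged over the years for which the country has data. The following minimal sketch shows that selection rule on toy numbers. The component shares, the country codes and the function name `main_country_per_component` are hypothetical and only illustrate the averaging; they are not the fitted values behind Table 1.

```python
from collections import defaultdict

# Hypothetical component shares per country-year (each row sums to 1).
theta = {
    ("AAA", 1962): [0.7, 0.2, 0.1],
    ("AAA", 1963): [0.6, 0.3, 0.1],
    ("BBB", 1962): [0.1, 0.8, 0.1],
    ("BBB", 1963): [0.2, 0.7, 0.1],
    ("CCC", 1963): [0.0, 0.1, 0.9],  # short series: a single year
}


def main_country_per_component(shares_by_country_year):
    """Return, for each component, the country whose share of that
    component is highest on average over its own available years."""
    sums = {}
    counts = defaultdict(int)
    for (country, _year), shares in shares_by_country_year.items():
        if country not in sums:
            sums[country] = [0.0] * len(shares)
        sums[country] = [s + x for s, x in zip(sums[country], shares)]
        counts[country] += 1
    # Each country is averaged over its own number of years.
    averages = {c: [s / counts[c] for s in sums[c]] for c in sums}
    n_components = len(next(iter(averages.values())))
    return [
        max(averages, key=lambda c: averages[c][k])
        for k in range(n_components)
    ]


print(main_country_per_component(theta))
```

Because each country is averaged over its own number of years, a country observed only briefly (like "CCC" here) can top a component on the strength of very few observations, which is the short time series effect discussed for Czechoslovakia, Turkmenistan and Réunion.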
[Footnote: The UK and India were, respectively, the main exporters of this product in 1962 and 2016. This result may be reflecting Botswana's relative comparative advantage in diamonds (i.e. a large share within its exports basket vis-à-vis the world average).]
[Footnote: Conversely, Ghana's exports rose in “Palm oil” (with 10% probability in the component) but from 0.01% to 0.6%, “Natural rubber latex; natural rubber and gums” (6%), from 0.1% to 0.2%, and “Cocoa butter and paste” (1%; from 2.8% to 4.4%). Over the period, Ghanaian exports fell in “Sawlogs and veneer logs, of non-coniferous species” (4%) from 7.6% to 0.9% and “Cocoa beans, raw, roasted” (3%) from 59.8% to 16.9%, “Wood, non-coniferous species, sawn, planed, tongued, grooved, etc” (3%) from 6.1% to 0.9%, “Plywood consisting solely of sheets of wood” (3%) from 0.4% to 0.02%, and remained stable in “Tin and tin alloys, unwrought” (2%) and “Palm kernel oil” (1%), both with null (or almost null) exports.]
The present work proposes applying a technique widely explored in Natural Language Processing to the field of international trade. By shifting the data domain from text to each country's export flows of each product, we managed to develop a typology of global trade based on a number of latent components. This allows us to do two things. On the one hand, we build an automatic classification of products based on data. On the other, we are able to study the trends in countries' exports, based on those components. Our findings are mostly in line with the specialized literature for each country or region, showing that this methodology can give insight into the position of countries' exports in global trade using a single type of metric. Given that this methodology requires a minimum number of arbitrary decisions, it turns out to be an interesting complement to the traditional forms of analysis.
The limit of the proposed methodology is its dependence on the data inputs. Decisions made with respect to the curation of the dataset can potentially affect all the results. If the dataset used […] th century, the resulting components would be very different from the ones presented in the article due to the larger set of technologies involved, and the optimal number of components would probably increase. On the other hand, if a country is restricted to a subset of the years considered, it will have an overall closer relation with components specialized in technologies of that time frame, as in the case of Czechoslovakia. Even though each country-year weighs the same in the optimization of the model, i.e. we do not consider the weight of the total exports of each country-year in the cost function, countries with larger exports tend to show smoother results, as is the case of China. This is because larger exports make it difficult for a specific product to drastically change its proportion in the total exports of the country from one year to another. Small countries are prone to sudden changes in the proportion of components, because a small change in the nominal value of the exports of any specific product implies a big change in its proportion of the total export basket. There is also an interesting phenomenon that occurs in the model with countries that have a highly concentrated export basket. For the OPEC countries we can see a drastic change by the end of the 1970s. If we take the case of Iraq, for example, it goes from an equal distribution over components 20 and 12 to 100% in component 12 some years later.
The distribution on the original SITC classification shows that this country exported 61.68% in "Crude petroleum" and 36.5% in "Petroleum products, refined" in 1977, and the next year this changed to 85.03% and 12.59% respectively. This implies an increase of more than 23 percentage points of the overall basket in a single product. Still, it is not a change of 50 percentage points as shown by the proposed model. The explanation for this is that both latent components, 12 and 20, include crude and refined petroleum in different proportions. The model infers that the refined petroleum exported from 1978 onwards comes from a different latent component than the one exported previously. We can say that if a country’s exports can be correctly described with only two products, as in this case, a model like LDA is not necessary for studying the export basket.

Another interesting phenomenon that this model cannot fully capture is the case where bilateral interactions imply both imports and exports of highly complex products, and where one of the poles only performs a simple step in the production, like the mentioned Mexican maquilas. As we only use exports data, the model can only account for half of the process, potentially producing misleading conclusions if not used carefully. This problem, however, will arise in every metric that only accounts for exports.

Benchmarking the results of the LDA model is a complicated task, as it is an unsupervised model. The best model should be the one that gives the most interpretable results, and that can be used for the most insightful analysis. To test our model, we tried three other approaches for the same task: finding the latent dimensions of international trade. First, we tried two other methods traditionally used for Topic Modelling in Natural Language Processing, namely Latent Semantic Analysis (LSA) (Landauer et al. 2013) and Non-Negative Matrix Factorization (NMF) (Lee and Seung 1999). Then, we tried to adapt the product space (C. A. Hidalgo et al. 2007; C. Hidalgo and Hausmann 2009) to achieve the same task as LDA, by using clustering techniques (Kaufman and Rousseeuw 1987). The three techniques showed results that are in line with the ones found by LDA, but at a lower level of detail, where the interpretation of results became harder.

It is interesting to look at the feasibility of the model given the change in the domain of the problem. The very different nature of the data traditionally used in text mining and Topic Modelling, with respect to international trade data, raises the question of whether the model can operate in the new domain. However, in terms of data structure, the two problems have more similarities than it seems. First, the traditional dimension of the problem is N × V (N observations, in the order of magnitude of thousands, and V the vocabulary, also in the order of magnitude of thousands). In this case, the problem is approximately N × P, where the N observations are the year-country pairs, with 250 countries and 54 years, and P products, which in SITC at 4 digits are approximately 750. In other words, we are in an order of magnitude similar to that of a small dataset in a traditional Topic Modelling problem. Finally, an important change between the two domains is the difference between the frequency of words in a text (tens or hundreds, depending on the size of the documents) and the dollars exported of a product by each country-year (millions or billions). This difference should in principle not affect the model, since what the model considers in its optimization are the distributions between the different elements (word frequencies or exported values per product) and not the absolute values.

As future lines of work, since results are deeply connected with the input dataset, new data sources could provide different insights. For example, while our period seems long enough to reflect structural changes, economic historians could find an even longer time series more useful to describe some phenomena.
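The scale argument made earlier in this section (dollar values in the millions or billions versus word counts in the tens or hundreds) can be illustrated with a minimal sketch. It assumes that each country-year basket is reduced to product shares before baskets are compared; the function name `export_shares` and the basket values are hypothetical and are not taken from the COMTRADE data.

```python
def export_shares(exports):
    """Turn exported values per product into shares of the total basket."""
    total = sum(exports.values())
    if total <= 0:
        raise ValueError("total exports must be positive")
    return {product: value / total for product, value in exports.items()}


# Hypothetical country-year basket in US dollars (illustrative values only).
basket_usd = {
    "Crude petroleum": 6.0e9,
    "Petroleum products, refined": 3.0e9,
    "Other products": 1.0e9,
}
# The same basket expressed in millions of dollars.
basket_musd = {product: value / 1e6 for product, value in basket_usd.items()}

shares_usd = export_shares(basket_usd)
shares_musd = export_shares(basket_musd)

# The two share vectors agree up to floating-point error.
max_gap = max(abs(shares_usd[p] - shares_musd[p]) for p in basket_usd)
print(shares_usd)
print("largest difference between the two share vectors:", max_gap)
```

Under this assumption, changing the unit of account leaves the distribution over products untouched, which is the sense in which the absolute magnitude of the values should not matter.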
Also, including services in the dataset could show different aspects of global trade that cannot be captured in an analysis only covering trade in goods. That said, data limitations would pose a trade-off, as this would imply either a lower product disaggregation or a shorter time series dataset. Other lines of work involve an exploration by country groups, for example to explore specialisation or complementarity among countries’ export baskets, e.g. within a regional trade bloc.

As a final remark, we do not think these new types of techniques will be able to replace traditional metrics and empirical work on international trade; rather, we intend to complement traditional analysis and bring a new tool that might help in the understanding of this field.

Conflict of interest
The authors declare that they have no conflict of interest.
Acknowledgment
The Doctoral Training Unit Data-driven computational modelling and applications (DRIVEN) is funded by the Luxembourg National Research Fund under the PRIDE programme (PRIDE17/12252781), https://driven.uni.lu.
This research was partly funded by the Préstamo BID - Proyecto de Investigación Científica y Tecnológica (PICT) 2016-1185.
The authors would like to acknowledge useful discussions with Daniel Heymann, Daniel Aromí and Jun Pang.
References

[1] Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. “On the surprising behavior of distance metrics in high dimensional space”. In: International conference on database theory. Springer. 2001, pp. 420–434.
[2] L GA Alves et al. “The nested structural organization of the worldwide trade multi-layer network”. In: Scientific Reports
Physica A: Statistical Mechanics and its Applications
World Bank Staff Working Paper
The Manchester School
Nature Physics
Phys. Rev. E 89 (2014), p. 032804.
[8] Luis Bértola and José Antonio Ocampo. Desarrollo, vaivenes y desigualdad. Una historia económica de América Latina desde la Independencia. Secretaría General Iberoamericana = Secretaria-Geral Ibero-Americana, 2010.
[9] David M Blei, Andrew Y Ng, and Michael I Jordan. “Latent dirichlet allocation”. In: Journal of Machine Learning Research
Poetics
Scale-free networks: complex webs in nature and technology. Oxford University Press, 2007.
[12] Guido Caldarelli et al. “A network analysis of countries’ export flows: firm grounds for the building blocks of the economy”. In: PloS one
La renta de la tierra: formas, fuentes y apropiación. Ediciones Imago Mundi, 2017.
[14] Hollis Burnley Chenery et al. Industrialization and growth. Oxford University Press New York, 1986.
[15] Neil M Coe et al. “‘Globalizing’ regional development: a global production networks perspective”. In: Transactions of the Institute of British geographers
Nueva Sociedad 244 (2013), pp. 84–96.
[17] L Ermann and D L Shepelyansky. “Ecological analysis of world trade”. In: Physics Letters A
Journal of Economic Interaction and Coordination. Vol. 2. 2005, 524–531 vol. 2.
[20] Gary Gereffi. “The organization of buyer-driven global commodity chains: how US retailers shape overseas production networks”. In: Commodity Chains and Global Capitalism (1994), pp. 95–122.
[21] Gary Gereffi, John Humphrey, and Timothy Sturgeon. “The governance of global value chains”. In: Review of international political economy
PLOS ONE. url: https://doi.org/10.1371/journal.pone.0197575
[23] Jon Haveman and David Hummels. “Alternative hypotheses and the volume of trade: the gravity equation and the extent of specialization”. In: Canadian Journal of Economics/Revue canadienne d’économique
Science. issn: 00368075.
[25] César A. Hidalgo. “The Dynamics of Economic Complexity and the Product Space over a 42 year period”. In: CID Working Papers. issn: 6507247197.
[26] César Hidalgo and Ricardo Hausmann. “The building blocks of economic complexity”. In: Proceedings of the National Academy of the Sciences of the United States of America. issn: 0027-8424.
[27] Diane J. Hu and Lawrence K. Saul. A Probabilistic Topic Model for Music Analysis. 2009.
[28] Leonard Kaufman and Peter J Rousseeuw. “Clustering by means of medoids. Statistical Data Analysis based on the L1 Norm”. In: Y. Dodge, Ed (1987), pp. 405–416.
[29] S. Kim, S. Narayanan, and S. Sundaram. “Acoustic topic model for audio information retrieval”. In: . 2009, pp. 37–40.
[30] M. Kivela et al. “Multilayer networks”. In: Journal of Complex Networks
Oxford development studies
World development
Handbook of latent semantic analysis. Psychology Press, 2013.
[35] Daniel D Lee and H Sebastian Seung. “Learning the parts of objects by non-negative matrix factorization”. In: Nature
DT IIEP (2016), pp. 1–59.
[37] N. Padhy, D. Mishra, and R. Panigrahi. “The survey of data mining applications and feature scope”. In: arXiv preprint arXiv:1211.5723 (2012).
[38] Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. “Inference of population structure using multilocus genotype data”. In: Genetics
American Journal of Political Science
BMC Bioinformatics 16 (2015), S2–S2.
[41] Michael Ross. The oil curse: How petroleum wealth shapes the development of nations. Princeton University Press, 2012.
[42] Mika J Straka, Guido Caldarelli, and Fabio Saracco. “Grand canonical validation of the bipartite international trade network”. In: Physical Review E
Standard International Trade Classification Revision 2. ST/ESA/STAT/SER.M/34/Rev.2. Series M: Miscellaneous Statistical Papers, No.34 Rev.2, New York: United Nations. 1975.

Appendix A Model with k=2
Figure 3 displays the mentioned interface in the case of k = 2, showing each 4-digit SITC (Rev. 2) code and its product description, together with its individual and accumulated probabilities within the component. Further, Figure 3(a) shows that the distribution of the first component assigns a large weight to crude oil, followed by other petroleum products (e.g. diesel oil, propane gas, etc.). Hence, a plausible label for this component would be "Petroleum and derivatives". However, it is worth noting that component 1 also holds other products such as coal and metals (e.g. iron, gold and copper). Figure 3(b) shows the distribution of the second component (with k = 2), which is more homogeneous than the first, as the first product weighs only 5%, and the most prominent products are passenger vehicles, electronic microcircuits, parts and accessories, etc. Hence, this component can be labelled as representing manufactured products in general.

Moreover, Figure 4 shows the components’ distribution (for k = 2) according to the mentioned classification developed by Lall (2000). According to that figure, the first component is essentially composed of primary products and manufactures that use primary products as inputs. On the other hand, component 2 presents a more uniform distribution, where medium and high technology manufactures (e.g. engineering and electronics) stand out.

However, it is worth noting that for k = 2, agricultural, livestock and forestry products cannot be singled out in a single component. That said, an interesting finding is that the division of the product space into only two groups allows the LDA model to find a first component mainly formed by petroleum and its derivatives, while the other holds mostly manufactured products (SITC 5-8). In this sense, such a model could help in understanding the classic corollary of comparative advantage models, where developed countries export manufactures (i.e. component 2) while developing countries specialise their trade in raw materials (Balassa 1979). Some of the literature assigns a particular role to oil production (and exports) within an economy’s structure (Ross 2012; Carrera 2017). In this sense, with k = 2 oil-producing countries’ exports seem to lead the LDA model to find its optimum by building one of the two components with such products. However, this dichotomy should be taken with care in the case of petroleum. As Ross (2012) states, the resource curse of oil-producing countries may be biased upward in poorer countries when measured through their dependence on hydrocarbon exports, yielding "spurious associations between oil export dependence and a variety of economic and political maladies that are highly correlated with low incomes". This is hence an arguable statement, as oil exports reflect an indirect measure of a country’s non-oil economic size, although the so-called "Dutch Disease" in oil-exporting countries has also often crowded out their agricultural and manufacturing exports due to the cited comparative advantage (Ross 2012).

Figure 3: Screenshot of the interface for component characterization, highlighting the proportion of the product in the component and the cumulative distribution, k=2. (a) First component. (b) Second component.

(a) First component (b) Second component
(a) First component (b) Second component
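The component characterisation described at the start of this appendix (products ranked by their probability within a component, together with the accumulated probability) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `characterise_component`, the `top_n` parameter and the probabilities are hypothetical and are not output of the fitted model.

```python
from itertools import accumulate


def characterise_component(product_probs, top_n=3):
    """Rank products by probability and attach the running cumulative sum."""
    ranked = sorted(product_probs.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:top_n]
    running = accumulate(prob for _, prob in top)
    return [(name, prob, cum) for (name, prob), cum in zip(top, running)]


# Hypothetical probabilities for one component (illustrative, not model output).
component = {
    "Passenger motor vehicles": 0.05,
    "Electronic microcircuits": 0.03,
    "Parts and accessories": 0.04,
    "Crude petroleum": 0.01,
}

table = characterise_component(component)
for name, prob, cum in table:
    print(f"{name:28s} {prob:6.2%} {cum:6.2%}")
```

A slowly growing cumulative column, as in this toy case, corresponds to what the text calls a more homogeneous component; a component dominated by one product would reach a large cumulative share within the first row.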