Algorithmic Complexity and Reprogrammability of Chemical Structure Networks
Hector Zenil, Narsis A. Kiani, Ming-Mei Shang, Jesper Tegnér
Algorithmic Dynamics Lab, Centre for Molecular Medicine, Karolinska Institute, Stockholm, Sweden
Unit of Computational Medicine, Department of Medicine, Karolinska Institute, Stockholm, Sweden
Science for Life Laboratory, SciLifeLab, Stockholm, Sweden
Algorithmic Nature Group, LABORES for the Natural and Digital Sciences, Paris, France
Biological and Environmental Sciences and Engineering Division, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Kingdom of Saudi Arabia
{hector.zenil, jesper.tegner}@ki.se

Abstract
Here we address the challenge of profiling causal properties and tracking the transformation of chemical compounds from an algorithmic perspective. We explore the potential of applying a computational interventional calculus based on the principles of algorithmic probability to chemical structure networks. We profile the sensitivity of the elements and covalent bonds in a chemical structure network algorithmically, asking whether reprogrammability affords information about thermodynamic and chemical processes involved in the transformation of different compound classes. We arrive at numerical results suggesting a correspondence between some physical, structural and functional properties. Our methods are capable of separating chemical classes that reflect functional and natural differences without considering any information about atomic and molecular properties. We conclude that these methods, with their links to chemoinformatics via algorithmic probability, hold promise for future research.
Keywords: molecular complexity; algorithmic probability; Kolmogorov-Chaitin complexity; causality; causal path; information signature; chemical compound complexity; algorithmic information theory; Shannon entropy

∗ An online implementation of the graph complexity estimations is available online.

Background and Preliminaries
One of the major challenges in modern physics is to provide proper and suitable representations of network systems for use in fields ranging from physics [3] to chemistry [7]. A common problem is the description of order parameters with which to characterize the ‘complexity of a network’. Graph complexity has traditionally been characterized using graph-theoretic measures such as degree distribution, clustering coefficient, edge density, and community or modular structure.

A previous algorithmic information-theoretic view of systems toxicity, applicable both to network analysis and pharmacokinetic analysis, has been proposed [13], with the overarching aim of not only describing but also seeking out causal mechanisms. The suggestion was that, since designing new compounds for new targets with the aim of developing drugs is challenging, whereas prediction is easier from an inference point of view than elucidating the mechanisms driving toxicity, complementary approaches are warranted.

For example, instead of engineering a drug to target a unique pathway or mutation of a tiny subset of diseases, drug repositioning starts with approved drugs and looks for combinations that can be used to treat diseases other than the ones they were designed for, with the advantage that approved drugs can bypass much regulation if the effects they can have are correctly controlled for. Thus prediction and simulation are key. This means that the whole field has to move towards causal modelling and functional inference rather than employing traditional statistical and purely geometric approaches (e.g.
distances between compounds or grid-based docking).

Here we are interested in combining techniques originating in fundamental mathematics and theoretical computer science to take a fresh look at long-standing challenges in molecular complexity from an algorithmic information perspective as applied to networks [33, 28].

Algorithmic information indices may facilitate the characterization of some properties of chemical compounds. Statins, for example, are associated with the heart and cholesterol, while morphine, codeine and heroin share structural properties and effects. Algorithmic information-theoretic approaches like the one showcased here are concerned with predictive causal models, going beyond statistical/descriptive approaches (such as structural alignments). This is important because statins, for example, block the cholesterol synthesis pathway by inhibiting HMG-CoA reductase owing to their similarity to the HMG-CoA structure, which is the rationale for using them to treat cardiovascular diseases. The algorithmic approach deployed here is thus equipped to find model-based mechanistic candidates for this kind of causal process involved in the transformation and interactions of compounds. Here we will focus on one kind of representation, namely chemical structure networks, at first independent of physical properties, which can then be tested and connected back to determine what is a consequence of the compound's causal structure/topology and what is a consequence of other intrinsic features such as atomic charge and thermodynamic constraints.

For this approach we will use the techniques and methods developed in [33, 31]. The basic idea is to estimate the likelihood of similarity between compounds based on an induced partition of possible common underlying mechanisms (models are found by an exhaustive algorithm that explains small pieces of the data).
This is, in general, a hard if not impossible (uncomputable) task, but approximations have been shown to be useful, and new numerical methods have been advanced that are complementary to previous approaches such as the use of lossless compression algorithms to approximate algorithmic complexity, which are very limited in accounting for causation [30, 32]. Moreover, even though the method is based on the idea of finding minimal programs, the more practical aim is to find any program, or a set of programs, explaining the data, rather than the smallest one, and so the problem becomes computationally feasible [9, 21, 27, 30].
There are two main notations for chemical substances. The simplified molecular-input line-entry system, or SMILES, is an ASCII string specification describing the structure of a chemical. The string is obtained by printing the symbol nodes encountered in a depth-first tree traversal of the chemical graph. The chemical graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numerical suffix labels are included to indicate the connected nodes, and parentheses are used to indicate points of branching on the tree. For example, nicotine is written as CN1CCC[C@H]1c2cccnc2. A SMILES string thus encodes and contains information about a molecule and is an upper bound on its information content.

A more standardized notation in chemistry is the IUPAC International Chemical Identifier, or InChI, another textual identifier for chemical substances. Its chief advantage over SMILES is that the InChI algorithm converts the structural information of the chemical substance in a 3-step process that can be tweaked to a desired level of structural chemical detail, except for an unchanged substring representing the substance (called the main layer). The algorithm then removes redundant information, keeps as much of the information of the structure as desired, and encodes it in a string. In some sense, InChI is thus a tighter bound on the algorithmic complexity of the chemical structure captured by ASCII strings and an improvement over SMILES.
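Since a SMILES string is an ASCII encoding of the molecular graph, its losslessly compressed size gives a (loose) upper bound on the molecule's information content, in the spirit of the bound just mentioned. A minimal sketch using Python's standard zlib; the comparison string is an artificial, maximally regular "molecule" of the same length, not a real compound:

```python
import zlib

def compressed_bits(s: str) -> int:
    """Crude upper bound (in bits) on the information content of an
    ASCII string: the size of its zlib-compressed form. Short strings
    carry fixed format overhead, so the bound is loose."""
    return 8 * len(zlib.compress(s.encode("ascii"), 9))

nicotine = "CN1CCC[C@H]1c2cccnc2"   # the SMILES string quoted above
chain = "C" * len(nicotine)          # artificial, maximally regular string

# The repetitive chain compresses to fewer bits than the branched,
# ring-containing nicotine string.
print(compressed_bits(chain), "<", compressed_bits(nicotine))
```

Any off-the-shelf compressor works here; the point is only that regularities in the notation translate into a smaller upper bound.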
The concept of molecular complexity has been shown to be relevant in the design of syntheses through minimizing the sum of molecular complexities of the synthetic intermediates [10]. A number of proposals have been made in the literature for defining molecular complexity. For example, enumerations of graph invariants for comparing chemical structures have been used for at least three decades, and QSAR regression models [16] for even longer. Different approximations emphasize different aspects of the molecule and are heavily observer-dependent, because the observer has to make a pre-selection of features of interest (e.g. clustering coefficient, some eigenvalues in graph spectra, degree distributions, etc.). This is what the chemistry community has been doing with what it calls "chemical fingerprints", and before that with molecular formulae comparing degree distributions.

In 1981, Bertz [5] introduced a measure of molecular complexity by applying Shannon's entropy to the distribution of subgraphs in molecular graphs. That was the starting point of a systematic search in chemical theory for relevant measures of molecular complexity. Ever since, graph and molecular complexity measures have focused on statistics of the topological properties of graphs, such as the size of the non-repeating subgraph set, among similar approaches. For recent results and a survey see [4, 15], including a proposal for using lossless compression as a graph complexity index [17]. An up-to-date review of mainstream techniques in the area of molecular networks can be found in [8].

Molecular complexity is not easy to define or to quantify, and all previous approaches have focused on combinatorial or statistical properties of the molecular graphs, either as a function of bond connectivities, specificity of structures or diversity of elements.
Researchers agree that the complexity of a molecule increases with increasing size, increasing branching, and increasing cyclicity for acyclic and cyclic structures [19], but no computable measure can cover all possible enumerable computable features like these (both currently defined and undefined) at the same time. Indeed, a more universal and robust measure of molecular complexity should take into account all these features of interest at the same time, without having to enumerate them explicitly or to define an ad-hoc measure for each of them. A measure that only focuses on some of these properties in the expectation that it could later be generalized to deal with other properties is out of the question.

Small molecules or compounds are commonly represented by their skeletal molecular graphs (see Fig. 1A), that is, the union of a set of points, symbolizing atoms other than hydrogen, and a set of lines, symbolizing molecular bonds. A typical similarity formula (see Fig. 1E-F) is given by the total number of elements in the zero bin divided by the square of the total count of atoms of the largest molecule; the closer to 0, the more dissimilar. The formula can be relaxed by taking elements near the zero bin, but this only works for the simplest cases of structural similarity. This kind of approach is descriptive rather than predictive. For example, regardless of different biological mechanisms of action, aspirin and statins have shown similar beneficial effects on cardiovascular diseases (CVDs) at the population level, and combined usage of the two drugs has additive effects, but it is not yet clear in what precise ways aspirin and statins differ, while the general agreement is that they possess similarities (e.g. accumulating evidence from basic and observational research demonstrates that the anti-inflammatory effects of both drugs contribute to CVD treatment, and that combined usage of the two drugs has additive and synergistic effects over the use of only one [1]).
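The zero-bin similarity score just described can be sketched in a few lines. The exact bin width used in Fig. 1E-F is not stated in the text, so the tolerance parameter below is an assumption:

```python
def bin0_similarity(distances, n_atoms_a, n_atoms_b, tol=0.0):
    """Sketch of the alignment-histogram similarity described above:
    the number of atom pairs whose alignment distance falls in (or,
    with tol > 0, near) the zero bin, divided by the square of the
    atom count of the larger molecule. Values closer to 0 indicate
    more dissimilar compounds."""
    hits = sum(1 for d in distances if abs(d) <= tol)
    return hits / max(n_atoms_a, n_atoms_b) ** 2

# Hypothetical alignment distances for two small molecules of 4 atoms each:
print(bin0_similarity([0.0, 0.0, 1.2, 3.4], 4, 4))  # -> 2/16 = 0.125
```

Relaxing `tol` implements the "near zero bin" variant mentioned above, which, as the text notes, only helps in the simplest cases.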
It is not difficult to see how in some cases compounds or compound substructures may share information from complementary regions, e.g. between drugs and targets, because the structure of the docking entity is energetically and structurally the complement of the docking region of the other entity, and thus the compounds may display (partially or entirely) similar classical and algorithmic information properties and estimations.

The number of possible statistical and algorithmic properties in all possible networks is countably infinite, but no effective (computable) measure can account for all of them [14]. This leaves us only with uncomputable measures that can serve as general universal measures of complexity equipped to find any effective (statistical or algorithmic) regularity. Our approach may find some applications. For example, one may find that low algorithmic complexity molecules are easier to synthesize or to assemble into larger new molecules and drugs, because high algorithmic complexity molecules would share fewer physical properties.
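Entropy-based indices of the kind introduced by Bertz can be illustrated on the degree distribution of a skeletal graph. This is a sketch in the spirit of [5], not a reconstruction of Bertz's original subgraph-based measure; it captures the agreed-upon intuition that branching raises complexity:

```python
from collections import Counter
from math import log2

def degree_entropy(edges):
    """Shannon entropy (in bits) of the degree distribution of a graph
    given as a list of bonds (node pairs)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    counts = Counter(deg.values())
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A 6-cycle (benzene-like skeleton): every atom has degree 2, entropy 0.
ring = [(i, (i + 1) % 6) for i in range(6)]
# Adding a pendant branch introduces degree diversity, so entropy > 0.
branched = ring + [(0, 6)]
print(degree_entropy(ring), degree_entropy(branched))
```

As the surrounding discussion stresses, any such index is observer-dependent: it rewards one pre-selected feature (degree diversity) and is blind to all others.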
The concept of algorithmic complexity [11, 6] is at the core of the challenge of complexity in discrete dynamic systems, as it involves finding the most statistically likely generating mechanism (computer program) that produces some given data. Formally, the algorithmic complexity (also known as Kolmogorov-Chaitin complexity) of an object is the length of the shortest computer program that reproduces the data from its compressed form when running on a universal Turing machine.

We follow the so-called Coding Theorem Method (CTM) and Block Decomposition Method (BDM) as introduced in [9, 21, 27, 30], based on the seminal concept of Algorithmic Probability [22, 12], which in turn is strongly related to algorithmic complexity [11, 6]. The only parameters used for the decomposition of BDM, as suggested in [30], were a maximum block size of 12 for strings and 4 for arrays, given the current best CTM approximation [21], based on an empirical distribution over all Turing machines with up to 5 states, with no string/array overlap in the decomposition, both for maximum efficiency (it runs in linear time) and so that the error (due to boundary conditions) is bounded [30]. However, the algorithm introduced here is independent of the method used to approximate algorithmic complexity, such as BDM. BDM assigns an index associated with the size of the most likely generating mechanism producing the data according to Algorithmic Probability [22]. BDM is capable of capturing features in data beyond statistical properties [30, 29], and thus represents an improvement over classical information theory. Because finding the program that reproduces a large object is computationally very expensive even to approximate, BDM finds short candidate programs (which are generative models) using a method introduced in [9, 21] that finds and reproduces fragments of the original object and then puts them together as a candidate algorithmic model of the whole object [30, 27].
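The decomposition just described can be sketched compactly. Note that real CTM values come from an exhaustive enumeration of small Turing machines; the fallback score used below (block length) is only a placeholder so the sketch runs self-contained, and in this toy example it is the log2 multiplicity term that does the work:

```python
from collections import Counter
from math import log2

def bdm(bits, block=12, ctm=None):
    """Toy Block Decomposition Method: split a binary string into
    non-overlapping blocks, score each distinct block with a CTM
    estimate of its algorithmic complexity, and add log2(multiplicity)
    for repeated blocks. `ctm` maps blocks to published CTM values;
    without it, block length is used as a crude placeholder."""
    blocks = Counter(bits[i:i + block] for i in range(0, len(bits), block))
    score = lambda b: ctm[b] if ctm and b in ctm else float(len(b))
    return sum(score(b) + log2(mult) for b, mult in blocks.items())

regular = "01" * 24  # four identical 12-bit blocks
irregular = "011010001110" + "110100101011" + "011100011010" + "111101001010"

# Repetition is compressed into a multiplicity term, so the regular
# string scores far lower than the string with four distinct blocks.
print(bdm(regular), "<", bdm(irregular))
```

With a real CTM table the per-block scores would also differ, separating simple blocks from algorithmically random ones even when no block repeats.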
These short computer programs are effectively candidate mechanistic models explaining each fragment, with the long finite sequence of short models being itself a generating mechanism.

In this sense, a causal path is a path where the changes between one state and another are merely the product of an underlying dynamical system following its normal course, sans external intervention [31].

An important concept is that of the information signature of an object [31]. An information signature quantifies the algorithmic resilience of an object to transformations, that is, how much its most likely mechanistic model may change after modifying the object. In the case of networks, perturbations can be applied to nodes or edges, that is, in the context of chemical structure networks, to atoms and molecular bonds, which means that one can have both node and edge information signatures (see Fig. 1G,H). Comparing information signatures is therefore a way to perform an algorithmic alignment among different objects such as chemical compounds.
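A perturbation analysis of this kind can be sketched as follows. The paper scores perturbations with CTM/BDM; here a degree-distribution entropy stands in as the complexity estimator (an assumption made so the sketch stays self-contained), and the toy graph is an arbitrary ring-plus-pendants skeleton, not a real compound:

```python
from collections import Counter
from math import log2

def entropy_proxy(edges):
    """Stand-in complexity estimator: Shannon entropy (bits) of the
    degree distribution. Any graph-complexity estimator (e.g. BDM)
    can be plugged in instead."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    counts = Counter(deg.values())
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def edge_signature(edges, complexity=entropy_proxy):
    """Edge information signature: for each single-bond deletion, the
    drop C(G) - C(G - e) in estimated complexity, sorted from most to
    least disruptive. Positive entries mark bonds whose removal
    simplifies the graph's candidate generative model."""
    base = complexity(edges)
    deltas = [base - complexity([f for f in edges if f != e]) for e in edges]
    return sorted(deltas, reverse=True)

# Toy skeleton: a 6-ring with two pendant atoms.
g = [(i, (i + 1) % 6) for i in range(6)] + [(0, 6), (3, 7)]
sig = edge_signature(g)
print(sig)  # non-negative entries: every deletion simplifies (or is neutral)
```

A node signature is obtained the same way by deleting each node together with its incident edges instead of deleting single edges.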
We perturb the structure of a chemical compound network and observe the effect on the set of candidate generating models by performing interventions and ranking them by their disruptiveness and causal contribution to the network's original algorithmic information content, and therefore to the network's original hypothesized generative models (as found by the CTM/BDM method).

We may attach the rubric in silico alchemy to the digital computer simulation of the types of changes that a molecule can be subject to, regardless of the thermodynamic aspects of said molecule or the processes involved (later, we will compare it to the known processes and physical properties associated with the old and new compounds). Central to the ideas exploited here is the notion of the information signature as the result of an in silico simulation measuring the sensitivity of the causal generative model of a compound to perturbations. The information signature depicted in Fig. 1G illustrates a set of such interventions/perturbations simulating the kinds of transformations that a compound such as aspirin can undergo, measuring the structural sensitivity to single-bond changes and the susceptibility of aspirin to being reprogrammed (artificially converted) into a different or similar (causal) structure in what would be a causal path.

This is an alchemy of sorts, because some of these transformations may be thermodynamically unlikely and the simulation takes no account of any physical properties (one can easily transform any element into gold under such conditions). But all likely causally topological paths are studied as equally possible in order to determine whether there are thermodynamic effects that
can be explained by algorithmic causality, rather than being external effects following particular laws, and are thus intrinsic properties of the compound in question. All values in the node (Fig. 1G) and edge (Fig. 1H) information signatures of aspirin are positive, that is, no intervention targeting any element pushes aspirin to become a more complex compound. Their algorithmic alignment is similar to the one produced by classical geometric alignments, yet with fewer arbitrary cutoff values (atom-to-atom distance thresholds). The result is consistent with the literature characterizing aspirin as a simple structure. More importantly, the signature of aspirin has two clearly identifiable regions.

Figure 1: A: Canonical structural diagram of aspirin. B: The undirected chemical structure network of aspirin, where shape, particular elements and bond types are not retained. C and D: Statins have similar contact maps. Here we feature lovastatin and simvastatin. Contact maps of a compound are calculated from the distances among its constituent atoms; the further away, the darker. E: Alignment histograms between lovastatin and simvastatin. The greater the number of atoms in or around the zero bin, the more similar. The x-axis represents the average distance among atoms. F: Weaker alignment between lovastatin and aspirin than among statins. G: Algorithmic (mis)alignment from the node information signatures of aspirin versus statins (normalized by aspirin size), with lovastatin topping all others. H: The edge information signature of aspirin is all positive, i.e. all single-bond perturbations to aspirin make its generative mechanistic model even simpler. In comparison, statins (average signature normalized by aspirin size) have very similar signatures but differ in the number of molecular bonds that, when removed, send the compound networks towards algorithmic randomness.
In the disruptive regime, at about 25 bits (measuring the difference between the mutated/disrupted compound and the original aspirin structure), are the elements of the carbon ring that is identified as the most stable structure in aspirin. The reduction in algorithmic complexity comes from the fact that breaking the carbon cycle produces a simple tree graph with a long path, a graph of even lower algorithmic complexity, because there is an even shorter program producing a tree with a long path than one producing a structure with the ring/cycle. In contrast, removing all hydrogen and oxygen elements makes an algorithmically neutral contribution, meaning that their removal is less disruptive to the core of the structure of aspirin (even though they may be more deeply implicated in its functioning, a limitation of this type of analysis if, e.g., valency values or electric charges are not incorporated in the network description (e.g. as weights), which it is possible to do, though we do not cover it in this paper). Yet the signature analysis indicates that such elements may more easily be found in more causal paths than atoms from the carbon ring. In other words, it is algorithmically less random to find a carbon ring in the middle of a structure than to add some elements to the molecule by, e.g., methylation or phosphorylation. Another observation from Fig. 1H is that fluvastatin has the most negative node information signature among the statins, and it is also the statin with the lowest number of interactions compared to most other statins, while lovastatin is the most positive and also similar to aspirin in its information changes, remaining algorithmically simple, with all node and edge perturbations positive. Lovastatin has similar interactions to atorvastatin and simvastatin, which are also close to lovastatin in the signature information landscape. Aspirin is, however, in the middle of the statin pack, suggesting a greater similarity than classical alignment methods suggest (Fig.
1F).

Algorithmic causal transformations
Chemical compounds that may be in the same causal path will have similar signatures [31] under single-bond or atomic knock-out interventions. One such example may be that of acids and alcohols, compared to compounds that are structurally, and causal-structurally, more removed, such as compounds one may expect to form between organic versus inorganic substances (as tested in Fig. 2D). This can be seen by analyzing the tails of the signature distributions (Fig. 2C-D) that provide a network with the means of moving towards or away from its original algorithmic model.

It is also of interest to find a correspondence with the algorithmic difficulty of transforming a compound such as an acid into an alcohol. What is suggested in the preliminary experiments is that it is (slightly) more difficult, because it implies a reduction of algorithmic randomness (Fig. 2C), as compared to transforming an alcohol into an acid. This is consistent with the literature showing that oxidants able to perform this operation in, e.g., complex organic molecules require substantial selectivity, therefore making it less likely to happen by the chance imposition of a thermodynamic direction, something also suggested by the algorithmic causal calculus. This is, however, less dramatic than transforming inorganic into organic compounds, according to the simulation (Fig. 2D), where inorganic compounds seem to require a larger increase of algorithmic information content to reach the complexity of organic compounds, suggesting that inorganic compounds are algorithmically simpler; as they are algorithmically more probable, they may occur naturally with much greater ease.
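Comparing signature tails amounts to aligning two sorted perturbation profiles. Below is a minimal sketch of such a comparison; the rank-wise alignment and zero-padding are assumptions, since the text does not fix one particular distance function:

```python
def signature_distance(sig_a, sig_b):
    """Rank-aligned distance between two information signatures
    (lists of perturbation effects): sort each in decreasing order,
    pad the shorter one with zeros, and return the mean absolute
    rank-wise difference. Smaller values are read as evidence that
    two compounds may lie on a common causal path."""
    n = max(len(sig_a), len(sig_b))
    pad = lambda s: sorted(s, reverse=True) + [0.0] * (n - len(s))
    return sum(abs(x - y) for x, y in zip(pad(sig_a), pad(sig_b))) / n

# Hypothetical signature values: two similar profiles vs. a dissimilar one.
print(signature_distance([3.1, 1.0, 0.2], [3.0, 1.1, 0.2]))
print(signature_distance([3.1, 1.0, 0.2], [-2.0, -3.5]))
```

Restricting the comparison to the largest and smallest entries gives the tail analysis used in Fig. 2C-D.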
Counterintuitively, the results pertaining to organic versus inorganic substances may suggest that organic compounds are much more stable (less reprogrammable) than inorganic compounds in general, and structurally this may be the case, given that the main difference separating the two classes is the stability provided by the carbon atoms that are the building blocks of organic matter.

Figure 2(A) illustrates the finding that the class of heavy metals is the least complex according to the algorithmic complexity estimation by BDM. The reason is that most of them tend to be very simple and small while possessing properties that endow them with stronger covalent bonds. In contrast, pyrimidines, for example, which comprise the basis of DNA and RNA nucleotides, are the most complex (together with purines, they form the other nucleotides among the highest-complexity heterocyclic aromatic organic compounds; see Fig. 2(A)). The results show that taking the most algorithmically complex compounds would profile all pyrimidines (558) with high accuracy out of all the other compounds considered in this
Figure 2: A and B: Classification of compound classes according to their estimations of algorithmic probability/complexity by BDM. C and D: In silico alchemy by intervening in chemical structure networks and analyzing their algorithmic properties. The algorithmic sensitivity of acids and alcohols is very similar, corresponding to known mechanistic processes that can transform one into the other. A minor asymmetry can be found, similar to the difficulty of converting one compound into another. In contrast, organic vs inorganic compounds are among the most dissimilar.

Class | Count | Class | Count | Class | Count
Acids | 537 | Iodinated | 774 | Drugs | 4630
Alcohols | 1190 | Ketones | 865 | Fluorinated | 3470
Aldehydes | 723 | Monomers | 994 | Halogenated | 11223
Amides | 531 | Nitriles | 1100 | Heterocyclic | 10964
Amines | 1916 | Pyridines | 1435 | Inorganic | 3605
Amino Acid Derivatives | 799 | Pyrimidines | 558 | Liquids | 10474
Carboxylic Acids | 1145 | Aromatic | 18567 | Organic | 27969
Esters | 1409 | Biomolecules | 6159 | Organometallic | 2315
Ethers | 619 | Brominated | 2620 | Salts | 4718
Heavy Molecules | 1060 | Chiral | 5915 | Solids | 22936
Hydrocarbons | 1499 | Chlorinated | 5680 | |
Table 1: Number of elements per compound class used in the experiments. A total of 158,399 compounds extracted from ChemicalData[] in the Wolfram Language, the sources relied upon being provided in the documentation.

database (44,089), and likewise the lowest complexity retrieves all inorganic compounds, followed by heavy and iodinated molecules. The total number of compounds per class can be found in Table 1.

It is also of interest to note that classes of compounds whose algorithmic complexity estimation median values are close to each other are causally related. For example, acids and alcohols appear to have very similar algorithmic information content, notwithstanding the fact that alcohol structures are much larger in size than acids. Their structure and causal origin can be regarded as similar, as they can be derived from each other with their main structure unchanged. However, esters are difficult to reduce to ethers, as they decompose to yield alcohols via decomposition of the intermediate hemiacetals, even though esters can be reduced to ethers [25]. It is therefore interesting to emulate the evolution of these networks through all possible chemical trajectories and see how algorithmically easy or difficult it is for them to become other compounds favouring certain properties, and how (un)stable they may be in the face of perturbations, independently of thermodynamics, while being ultimately related (and we will devise some tests in this regard). We use a measure of sophistication based on logical depth and a measure of reprogrammability gauging the susceptibility of a chemical compound to being converted into some other, more random or simpler, chemical compound.

Statistical overlap of complexity estimations is to be expected, given that classes are not disjoint. Aspirin is a drug, but it is also classified as a biomolecule, organic, solid and aromatic. Many classes also share elements with chirality properties. However, a significant divergence between contrasting classes, such as those not closely causally related, is to be expected, as measured by the algorithmic probability of a compound being in the causal path of another class of compounds (i.e. there being no simple chemical/thermodynamic process, natural or artificial, to convert most elements of one class into another), such as those of an organic nature (including, for example, elements under organic, biomolecules, heterocyclic, pyridines, pyrimidines and monomers) into those of a more inorganic nature (including, for example, the category inorganic itself and heavy molecules). Other features driving the complexity estimation are properties such as are found in heterocyclic compounds, which contain atoms from at least two different elements as members of their rings and are thus structurally richer on average. Nucleic acids and most drugs are also high in algorithmic complexity and have large variance values, indicating that designed compounds tend to be wide-ranging in nature while inclining towards complexity. Another contrasting/disjoint pair of classes is liquids versus solids, with statistically different complexity values.

Figure 3: Graph branching is one of the most important measures proposed in molecular complexity. Here it is shown how algorithmic-based measures such as lossless compression (Compress and Bzip2) are correlated with graph spectra, which in turn capture graph branching [18]. The correlation is stronger than that obtained when using only Shannon entropy. All data come from the Wolfram Language database retrieved by the ChemicalData[] function.
It is interesting to note that chemical structure networks of liquid compounds are significantly less complex than those of solids, and that inorganic compounds are the least complex, while drugs are the most complex and feature a high incidence of human-made synthetic compounds.

Figure 4: SMILES and InChI molecular strings correlate with complexity even when normalized by size, as would be expected since both notations capture very important properties of the compounds. InChI has a greater correlation statistic, as would also be expected from the fact that the sequence notation was designed to capture more information about the molecular compounds it represents.
The simplified molecular-input line-entry standards SMILES and InChI are specifications in the form of a line notation for describing the structure of chemical species using short ASCII strings. SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree. Hence one should find some correlations with properties of the represented molecular graph. InChI contains all the information contained in a SMILES description plus more atomic information, such as bond connectivity, tautomeric information, isotope information, stereochemistry, and electronic charge. Figure 4 reports the correlation found between these notations and complexity according to several complexity indexes, and Fig. 3 describes the correlation between branching (a common order parameter in molecular complexity) and measures of statistical and algorithmic complexity.

The results reported in Fig. 5 (Appendix) are interesting because one can think of algorithmic complexity as a method sorting by the size of the underlying generating mechanism causing each structure, and one would expect compounds that behave similarly or produce similar states to be generated by similar mechanisms.

Figure 5: Random sample of chemical structures sorted by algorithmic complexity: BDM sorts molecular graphs by their adjacency matrix, not by their size but by their shape. As shown in Fig. 7, this makes for biases reflected in the class memberships of the molecular compounds, mainly between organic and inorganic, and between solids, liquids and solvents.

The results reported together with the class enrichment and depletion analysis shown in Fig. 7 (Appendix) are interesting because, as has been suggested, in the case of organic molecules, the lower the information content, the fewer the possibilities for different interactions with other molecular compounds. Fig.
6 illustrates how the algorithmiccomplexity approximated by BDM correlates with some physical propertiesof the molecular compounds. This is interesting because it suggests– if thecorrelation is actually correct– that some information about properties thatare global properties, such as temperature, is in the local structure of themolecule, which is not surprising if one recalls that how rigid or chargeda particle is may have an impact on its dynamic interactions with othermolecular compounds. 15
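The Block Decomposition Method (BDM) applied to these adjacency matrices can be sketched as follows. This is a minimal illustration, not the authors' implementation: the CTM values for 2x2 binary blocks are placeholders (real applications use the precomputed Coding Theorem Method tables referenced in the paper), and only non-overlapping block partitioning is shown. BDM sums, over the distinct blocks of the matrix, the CTM value of each block plus log2 of its multiplicity:

```python
from collections import Counter
from math import log2

# Hypothetical CTM values for 2x2 binary blocks, indexed by flattened tuple.
# Placeholders only: they preserve the idea that uniform blocks are
# algorithmically simpler (lower CTM) than mixed ones.
CTM_2x2 = {(0, 0, 0, 0): 3.3, (1, 1, 1, 1): 3.3}
DEFAULT_CTM = 4.0  # assumed value for all other 2x2 blocks

def bdm(matrix, block=2):
    """BDM of a square binary matrix: sum over distinct blocks of
    CTM(block) + log2(multiplicity of that block)."""
    n = len(matrix)
    blocks = []
    for i in range(0, n - n % block, block):
        for j in range(0, n - n % block, block):
            flat = tuple(matrix[i + di][j + dj]
                         for di in range(block) for dj in range(block))
            blocks.append(flat)
    counts = Counter(blocks)
    return sum(CTM_2x2.get(b, DEFAULT_CTM) + log2(m) for b, m in counts.items())

# Adjacency matrix of a 4-cycle (a tiny ring-like graph) vs. an empty graph.
ring = [[0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0]]
empty = [[0] * 4 for _ in range(4)]
print(bdm(ring), bdm(empty))  # the non-empty regular graph scores higher
```

The log2(multiplicity) term is what makes BDM sensitive to shape rather than raw size: repeating the same block many times adds only logarithmically to the total, as a short program looping over one block would.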
Conclusions
We have identified and illustrated interesting research avenues that a causal interventional calculus based on algorithmic probability/complexity (as the study of a system's changes in algorithmic information) can bring to the discussion of molecular complexity and chemoinformatics, in particular to chemical structure networks. Indeed, a drug usually has a backbone derived from high-throughput screening hits, and intensive modification of the chemical structure is then undertaken to make it drug-like, providing improved stability, solubility, etc.

We have found that this algorithmic approach suggests similarities among graphs/networks that may be explained by common generative mechanisms, suggesting an algorithmic likelihood of causal transformations derived from candidate models found by algorithmic probability. We found that the method separates distinct classes of chemical compounds, both by estimations of the algorithmic complexity of chemical structures and by causal sensitivity based on a measure of agnostic (involving no physical properties) reprogrammability. The experiments with statins, whose similarity of effect to aspirin is an open question, suggest a measure of similarity and of interaction/toxicity that should be further tested. We have also shown that the measures display various degrees of correlation with some of the chemical compounds' physical properties.

Our approach effectively introduces a new dimension into the study of information-theoretic properties and algorithmic transformations of compounds, and further explorations and generalizations to more general compound networks (where several compounds are connected), binding networks and reaction networks should be investigated.
Acknowledgements
H.Z. was supported by the Swedish Research Council (Vetenskapsrådet) grant No. 2015-05299.
Appendix
Figure 6: Four physical properties found to be slightly correlated with the graph algorithmic complexity of a large set of molecular networks (built from contact maps extracted from atomic data in ChemicalData[] in the Wolfram Language, drawn from public chemical data banks), according to their BDM values, with Pearson correlation statistics -0.16, -0.20, -0.34 and -0.3, and p-values all < 0.05 except for combustion heat, at p = 0.07. Fitting lines (red) were found by least squares. While the correlation values are weak, in all these cases no element with a high value for a given property was also found to have a high BDM value. In other words, all elements with high values for each property also had high algorithmic probability estimations and therefore low algorithmic complexity, so low BDM values pinpoint elements with high values for these physical properties.

Figure 7: Top enriched and depleted classes separated by estimations of graph algorithmic complexity (approximated by CTM/BDM vs. randomized membership) out of a random sample of 100 complex chemicals sorted by class. The probability on the yy