Transfer entropy: where Shannon meets Turing
David [email protected]
ASML Netherlands B.V.
(Dated: May 28, 2019)

Transfer entropy is capable of capturing nonlinear source-destination relations between multivariate time series. It is a measure of association between source data that are transformed into destination data via a set of linear transformations between their probability mass functions. The resulting tensor formalism is used to show that in specific cases, e.g., when the system consists of three stochastic processes, bivariate analysis suffices to distinguish true relations from false relations. This allows us to determine the causal structure as far as it is encoded in the probability mass functions of noisy data. The tensor formalism is also used to derive the Data Processing Inequality for transfer entropy.
Efficient inference of the source-destination relations within a complex system from observational data is essentially a "catch-22" situation. Pairwise analysis is relatively cheap, but a bivariate analysis will reveal a relation between two non-interacting processes that are correlated only due to a common source. Multivariate analysis leads to a higher precision, but it is computationally very costly. For transfer entropy [1] several approaches have been developed to resolve the trade-off between computational cost and precision, e.g., [2, 3].

In this letter we report a novel approach. Our vantage point is that of a machine builder. Photolithography machines are extremely complex systems consisting of tens of thousands of interacting components. Designing and building these systems is impossible without the notion of causality. Because the output of a machine can be thought of as the result of a computation, a Turing machine can be used to model a real machine [4]. This is not a tautology: the laws governing the real machine are encoded in the transition function of the Turing machine. We applied this notion to transfer entropy (TE), a measure of "information transfer" between source data and destination data [5]. It is also capable of capturing "true" causal relations ("true" in an interventional sense [6]). If we interpret the source data as the input for a Turing machine and the destination data as the output, the transition function should (also) encode causality as captured by TE.

We will start with a short recap of the relevant concepts of information theory and transfer entropy. It is then shown that the probability mass function (PMF) of the source data is transformed into the PMF of the destination data via a set of linear transformations. The resulting tensors are used to prove that the Data Processing Inequality, or DPI [8], is valid for transfer entropy. It is also shown that in well-defined cases a bivariate approach suffices to infer the causal structure of a complex system. We end this letter with an experiment to illustrate that our approach is indeed capable of capturing nonlinear relations.

Information theory was introduced in 1948 by C. Shannon [7]. It relates two data sets $x$ and $y$. The data are indexed realizations of quantized random variables representing discrete-time stationary ergodic Markov processes $X$ and $Y$ respectively. If there is a dependency between the two messages, information is shared between them. The data are ordered sets of symbols from finite alphabets. In this letter we will use three alphabets: $\mathcal{X} = \{\chi_1, \chi_2, \cdots, \chi_{|\mathcal{X}|}\}$, $\mathcal{Y} = \{\psi_1, \psi_2, \cdots, \psi_{|\mathcal{Y}|}\}$, and $\mathcal{Z} = \{\zeta_1, \zeta_2, \cdots, \zeta_{|\mathcal{Z}|}\}$. The random variable $X$ is associated with the alphabet $\mathcal{X}$, $Y$ with $\mathcal{Y}$, and $Z$ with $\mathcal{Z}$ respectively.

Mutual information (MI) is a measure of the information shared between two time series,
$$I(X;Y) = \sum_{x,y} p(x,y)\,\log\!\left[\frac{p(y|x)}{p(y)}\right]. \qquad (1)$$
It is nonnegative and symmetric in $X$ and $Y$. The information sharing results from data transmission over a communication channel (or channel in short). Source data is transmitted, destination data is received. In a channel every input alphabet symbol has its own input "socket". Likewise, every output alphabet symbol has its own output socket. Data is transmitted one symbol at a time. The input symbol is fed to the related input socket. The channel transforms the input symbol into an output symbol in a probabilistic fashion and makes it available on the associated output socket.
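To make Eq. (1) concrete, here is a minimal NumPy sketch (our illustration, not part of the original analysis) that evaluates MI from a joint PMF; the function name `mutual_information` and the example channel are assumptions of ours.

```python
import numpy as np

def mutual_information(p_xy):
    """Mutual information I(X;Y) in bits from a joint PMF.

    p_xy[i, j] = p(x = chi_i, y = psi_j); rows index the input
    alphabet, columns the output alphabet.
    """
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                          # convention: 0 log 0 := 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Example: a noisy binary channel driven with input PMF (0.5, 0.5).
p_xy = np.array([[0.45, 0.05],
                 [0.10, 0.40]])
print(mutual_information(p_xy))  # shared information in bits
```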
The simplest type of channel is the noisy discrete memoryless communication channel (DMC). In a memoryless channel the output ($y_t$) only depends on the input ($x_t$) and not on the past inputs or outputs: $p(y_t \mid x_t, x_{t-1}, \ldots, y_{t-1}, \ldots) = p(y_t \mid x_t)$. A memoryless channel embodies the Markov property. The maximum rate at which information can be transmitted over a channel is called the channel capacity, $C_{XY} = \max_{p(x)}[I(X;Y)]$. This maximum is achieved for a so-called capacity-achieving input distribution.

In a noisy channel the output depends on the input and on another random variable representing perturbations, i.e., noise. Transmission of data over a discrete memoryless communication channel transforms the probability mass function of the input into the PMF of the output via a linear transformation represented by a probability transition matrix [8]. This probability transition matrix fully characterizes the DMC. Instead of matrix and vector notation we use index notation. Index $i$ is associated with $x$, index $j$ with $y$, and index $k$ with $z$ respectively. The $j$th element of the PMF $p(y)$ equals $p(y = \psi_j)$. Because every random variable has its own alphabet letter associated with it, this can be written as $p(\psi_j)$, or even as $p^j$. Using the Einstein summation convention, where we sum over double indices, the transmission of $x$ over a noisy channel resulting in $y$ equals
$$p^j = p^i A^j_i. \qquad (2)$$
The row-stochastic probability transition matrix elements $A^j_i = p(\psi_j | \chi_i)$ represent the elements of the probability transition tensor $A$ [9]. In this letter the placement of the indices is used as a mnemonic device. The subscript or covariant index indicates over which alphabet element we have to condition. The superscript or contravariant index indicates which alphabet element is conditioned. It follows directly from Eq. (2) that the input distribution can be reconstructed from the output distribution: $p^j A^{\ddagger i}_j = p^i$, with $A^{\ddagger i}_j = p(\chi_i | \psi_j)$. We call this reversal in analysis direction the $\ddagger$ operation. If the directed graph $X \rightarrow Y$ represents the transmission of data from $X$ to $Y$ with the associated tensor $A^j_i$, the $\ddagger$-operation associates $A^{\ddagger i}_j$ with $X \leftarrow^{\ddagger} Y$.

Because mutual information is a function of $A$ and the input PMF, we write it as $I(X,Y) := f(A, \star)$. The $\star$ indicates that apart from $A$ there is another input. As such, MI might not be the best measure to indicate the underlying structure for systems whose structure is independent of the input. In contrast, the earlier mentioned channel capacity only depends on the elements of the probability transition tensor [10]. We indicate the channel capacity with the equivalent lower-case Greek letter: $C_{XY}(A) := \alpha$, $C_{YZ}(B) := \beta$, and $C_{XZ}(C) := \gamma$.

To understand the usefulness of the tensor formalism we will perform a thought experiment using a simple system consisting of the three random variables $X$, $Y$, and $Z$. We assume that the bivariate relations have the following associated tensors: $A: X \rightarrow Y$, $B: Y \rightarrow Z$, and $C: X \rightarrow Z$. The aim is to determine the true structure: (1) the chain $X \rightarrow Y \rightarrow Z$; (2) the fork $X \rightarrow Y$, $X \rightarrow Z$; (3) the triangle itself. To be able to analyze this graph we need to introduce two concepts [11]: (1) the causal Markov condition, and (2) the faithfulness assumption. The causal Markov condition states that a process is independent of its non-effects, given its direct causes, i.e., parents.
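The linear transformation of Eq. (2) and the channel capacity can be made concrete in a few lines. The sketch below is our illustration, not the letter's implementation: `push_forward` applies Eq. (2), and `channel_capacity` is a standard Blahut-Arimoto iteration, which works because the capacity depends only on the transition tensor [10].

```python
import numpy as np

def push_forward(p_x, A):
    """Eq. (2): p^j = p^i A^j_i -- the channel maps the input PMF
    to the output PMF by a linear transformation."""
    return p_x @ A

def channel_capacity(A, iters=500):
    """Blahut-Arimoto iteration for the capacity (in bits) of a DMC
    with row-stochastic transition matrix A[i, j] = p(y_j | x_i)."""
    n = A.shape[0]
    p = np.full(n, 1.0 / n)                        # start from uniform input
    for _ in range(iters):
        q = push_forward(p, A)                     # current output PMF
        with np.errstate(divide="ignore", invalid="ignore"):
            d = np.where(A > 0, A * np.log2(A / q[None, :]), 0.0).sum(axis=1)
        c = 2.0 ** d                               # per-input information gain
        p = p * c / np.sum(p * c)                  # re-weight the inputs
    return float(np.log2(np.sum(p * c)))

# A binary symmetric channel with crossover 0.1: capacity = 1 - H(0.1).
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
print(channel_capacity(A))  # ~0.531 bits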
A directed graph is said to be faithful to the underlying probability distributions if the independence relations that follow from the graph are the exact same independence relations that follow from the underlying probability distributions.

Assuming faithfulness and applicability of the causal Markov condition, let's consider the chain. Because it is a straightforward exercise, we leave it to the reader to confirm that for the chain we have $C^k_i = A^j_i B^k_j$. If we assume that the actual structure is the fork, which can be interpreted as a chain thanks to the $\ddagger$ operation, we get $B^k_j = A^{\ddagger i}_j C^k_i$. From these expressions it follows that we cannot distinguish a chain from a fork when $A^{\ddagger i}_{j'} A^j_i = \delta^j_{j'}$ and $A^j_{i'} A^{\ddagger i}_j = \delta^i_{i'}$. The Kronecker delta $\delta^{i'}_i$ is defined as: $\delta^{i'}_i = 0$ if $i' \neq i$ and $\delta^{i'}_i = 1$ if $i' = i$. In this case $A$ represents a noiseless DMC, i.e., the probability mass function of $y$ is a permutation of the PMF of $x$.

Instead of checking both assumptions, we only need to perform one check if we use the DPI. This inequality states that processing of data can never increase the amount of information. For the chain this means that $I(X;Z) \leq \min[I(X;Y), I(Y;Z)]$. Only in the absence of noise is there equality. Because the channel capacity is the maximal achievable mutual information for a specific channel, the DPI implies that $\gamma \leq \min[\alpha, \beta]$. If $\gamma < \beta$, the real structure could be a chain and we have to verify this by using the "tensor check". In the case $\beta < \gamma$ the real structure could be a fork and we have to check for that. Please note that the tensor expressions are necessary but not sufficient conditions to decide whether a relation is false. We will discuss the second condition later in this letter. We cannot decide between a chain and a fork when $\gamma = \beta$.
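A numerical version of this "tensor check" might look as follows. This is our sketch under simplifying assumptions (binary alphabets, tensors known rather than estimated from data); `random_channel` and `dagger` are hypothetical helpers of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_channel(n_in, n_out):
    """A random row-stochastic transition tensor A^j_i = p(y_j | x_i)."""
    A = rng.random((n_in, n_out))
    return A / A.sum(axis=1, keepdims=True)

def dagger(A, p_in):
    """The ‡ operation: A‡[j, i] = p(x_i | y_j) via Bayes' rule."""
    p_out = p_in @ A
    return (p_in[:, None] * A / p_out[None, :]).T

p_x = np.array([0.5, 0.5])
A = random_channel(2, 2)   # X -> Y
B = random_channel(2, 2)   # Y -> Z

# Chain X -> Y -> Z: the bivariate (false) relation X -> Z composes as
# C^k_i = A^j_i B^k_j, so the tensor check C == A @ B passes.
C_chain = A @ B
print(np.allclose(C_chain, A @ B))                   # True

# Fork X -> Y, X -> Z with an independent tensor C: interpreting the fork
# as the chain Y ->‡ X -> Z gives B^k_j = A‡^i_j C^k_i instead.
C = random_channel(2, 2)   # X -> Z
B_fork = dagger(A, p_x) @ C
print(np.allclose(B_fork, A @ B))                    # False in general

# The two collapse only when A is noiseless, i.e., A‡ A is the identity.
print(np.allclose(dagger(A, p_x) @ A, np.eye(2)))    # False for a noisy A
```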
All this is of course also applicable to time-delayed mutual information. Schreiber, however, showed that time-delayed MI is not always capable of determining the correct relation [1]. Transfer entropy,
$$TE_{X \rightarrow Y} = \sum_{x^-, y, y^-} p(x^-, y, y^-)\,\log\!\left[\frac{p(y | x^-, y^-)}{p(y | y^-)}\right], \qquad (3)$$
outperforms time-delayed mutual information. It is assumed that $Y$ is a Markov process of order $\ell \geq 1$. With output $y = y_t$, the relevant past vector of $y$ is $y^- = (y_{t-1}, \ldots, y_{t-\ell})$, and the input vector is $x^- = (x_{t-\tau}, \ldots, x_{t-\tau-m})$ with $m \geq \tau \geq 0$.
Assuming that there is a finite interaction delay (delay from now on) $\tau$, it is proved that this modified TE is maximal for the real delay [12]. The alphabet for the input vector is $\mathcal{X}^m$, the $m$-ary Cartesian power of the input alphabet $\mathcal{X}$. Likewise, the alphabet for the relevant past vector is $\mathcal{Y}^\ell$, the $\ell$-ary Cartesian power of the output alphabet $\mathcal{Y}$.

From now on we will use the convention that the index $g$ is associated with the relevant past vector of $y$ and $h$ is associated with the relevant past vector of $z$. Transfer entropy can be associated with communication channels. We start with conditioning the MI from Eq. (1) on the event $y^- = \psi^-_g$:
$$I(X;Y | \psi^-_g) = \sum_{x^-, y} p(x^-, y | \psi^-_g)\,\log\!\left[\frac{p(y | x^-, \psi^-_g)}{p(y | \psi^-_g)}\right]. \qquad (4)$$
Because $x^-$ and $y^-$ are the only parents of the output $y$, it follows from the causal Markov condition that the associated channel is memoryless. The conditioned MI quantifies the amount of information that is transmitted over the $g$th subchannel. The transfer entropy of Eq. (3) can now be expressed as
$$TE_{X \rightarrow Y} = \sum_g p(\psi^-_g)\, I(X;Y | \psi^-_g). \qquad (5)$$
Transfer entropy is the result of transmission of data over an inverse multiplexer. Let's envision the two time series as data on two parallel vertical tapes. Our inverse multiplexer aligns the tapes by shifting the source data according to the interaction time delay $\tau$. The cell of the input tape containing $x(t-\tau)$ is positioned next to the empty cell of the output tape that will contain $y(t)$. Next it chooses a transmission channel based on the value of the relevant past vector $y^-$. The input vector $x^-$ is fed to an input socket based on the value of the input vector. The channel transforms the input in a probabilistic fashion. The output is written in the appropriate cell of the output tape. To be able to distinguish the input vector index from the output vector index, we indicate an element of the input vector based on $x$ with the index $\hat{i}$. The index $\hat{j}$ is associated with the input vector based on $y$. With $p^j_g = p(\psi_j | \psi^-_g)$, $p^{\hat{i}}_g = p(\chi^-_{\hat{i}} | \psi^-_g)$, and $A^j_{g'\hat{i}} = p(\psi_j | \chi^-_{\hat{i}}, \psi^-_{g'})$, the linear transformation associated with the $g$th channel is
$$p^j_g = \delta^{g'}_g\, p^{\hat{i}}_g\, A^j_{g'\hat{i}}. \qquad (6)$$
The delay for all the subchannels is $\tau_{xy}$. Transfer entropy is a function of the input PMF and the tensor $A$. Using the same shorthand notation as for mutual information, we define $TE_{X \rightarrow Y} := f(A, \star)$, $TE_{Y \rightarrow Z} := f(B, \star)$, and $TE_{X \rightarrow Z} := f(C, \star)$.

We now perform the same thought experiment as previously. First assume that the structure is a chain. In addition to Eq. (6) we have two additional linear transformations:
$$p^k_h = \delta^{h'}_h\, p^{\hat{j}}_h\, B^k_{h'\hat{j}}, \qquad (7a)$$
$$p^k_h = \delta^{h'}_h\, p^{\hat{i}'}_h\, C^k_{h'\hat{i}'}. \qquad (7b)$$
Because $\psi^-_{\hat{j}} \subset \{\psi_j, \psi^-_g\}$ or $\{\psi_j, \psi^-_g\} \subset \psi^-_{\hat{j}}$, we can enlarge either $\psi^-_g$ or $\psi^-_{\hat{j}}$ so that $\{\psi_j, \psi^-_g\} = \psi^-_{\hat{j}}$. Due to the causal Markov condition this does not impact the end result, so we can replace $j$ by $\hat{j}$. The next step is to condition all sides of Eq. (6) on $\zeta^-_h$ and all sides of Eq. (7a) and Eq. (7b) on $\psi^-_g$. Again thanks to the causal Markov condition we can assume that the cardinality of the input vector for the transformation for $X \rightarrow Z$ equals the cardinality of the input vector for the transformation $X \rightarrow Y$, i.e., $\hat{i}' = \hat{i}$. Because we set $\{\psi_j, \psi^-_g\} = \psi^-_{\hat{j}}$, we have $B^k_{gh\hat{j}} = B^k_{h\hat{j}}$. The reader can confirm that the causal Markov condition implies that $A^{\hat{j}}_{gh\hat{i}} = A^{\hat{j}}_{g\hat{i}}$.
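For binary data with $\ell = m = 1$, Eqs. (3)-(5) can be estimated directly with plug-in frequency counts. The sketch below is our illustration of the subchannel decomposition, not the author's Matlab implementation; `transfer_entropy` is a hypothetical helper, and the plug-in estimator is biased for short series.

```python
import numpy as np

def transfer_entropy(x, y, tau=1):
    """TE_{X->Y} in bits for binary series with l = m = 1:
    the weighted sum over subchannels of Eq. (5)."""
    # Align destination y_t, its relevant past y_{t-1}, and source x_{t-tau}.
    y_t, y_past, x_past = y[tau:], y[tau - 1:-1], x[:-tau]
    te = 0.0
    for g in (0, 1):                       # one subchannel per value of y^-
        sel = y_past == g
        p_g = sel.mean()                   # subchannel weight p(psi_g^-)
        if p_g == 0:
            continue
        mi = 0.0                           # conditioned MI of Eq. (4)
        for i in (0, 1):
            for j in (0, 1):
                p_ij = np.mean(sel & (x_past == i) & (y_t == j)) / p_g
                p_i = np.mean(sel & (x_past == i)) / p_g
                p_j = np.mean(sel & (y_t == j)) / p_g
                if p_ij > 0:
                    mi += p_ij * np.log2(p_ij / (p_i * p_j))
        te += p_g * mi
    return te

# Example: y copies x with a one-step delay plus 10% bit flips.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 100_000)
y = np.roll(x, 1) ^ (rng.random(x.size) < 0.1)
y[0] = 0
print(transfer_entropy(x, y))             # ~0.53 bits
print(transfer_entropy(y, x))             # ~0 bits: no reverse transfer
```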
Combining the three conditioned equations finally gives us
$$C^k_{\hat{i}gh} = A^{\hat{j}}_{g\hat{i}}\, B^k_{h\hat{j}}. \qquad (8)$$
When we "sum out" index $g$ by multiplying both sides with $\delta^{\hat{i}'}_{\hat{i}} \delta^{h'}_h p^g_{h'\hat{i}'}$ we get Eq. (9a). We repeat these steps assuming that the fork is the true structure. This gives us two expressions for the tensors of the false relations in terms of the tensors of the true relations, for both a chain and a fork:
$$C^k_{h'\hat{i}} = \delta^{h'}_h\, \bar{A}^{\hat{j}}_{h'\hat{i}}\, B^k_{h\hat{j}}, \quad \text{with } \bar{A}^{\hat{j}}_{h'\hat{i}} := \delta^{\hat{i}'}_{\hat{i}}\, p^g_{h'\hat{i}'}\, A^{\hat{j}}_{g\hat{i}}, \qquad (9a)$$
$$B^k_{h\hat{j}} = \delta^{h'}_h\, \bar{A}^{\ddagger\hat{i}}_{h\hat{j}}\, C^k_{h'\hat{i}}, \quad \text{with } \bar{A}^{\ddagger\hat{i}}_{h\hat{j}} := \delta^{\hat{j}'}_{\hat{j}}\, p^g_{h\hat{j}'}\, A^{\ddagger\hat{i}}_{g\hat{j}}. \qquad (9b)$$
Only when $\delta^{h'}_h \bar{A}^{\ddagger\hat{i}}_{h\hat{j}'} \bar{A}^{\hat{j}}_{h'\hat{i}} = \delta^{h'}_h \delta^{\hat{j}}_{\hat{j}'}$ and $\delta^{h'}_h \bar{A}^{\hat{j}}_{h'\hat{i}'} \bar{A}^{\ddagger\hat{i}}_{h\hat{j}} = \delta^{h'}_h \delta^{\hat{i}}_{\hat{i}'}$ can we not distinguish a chain from a fork. The reader can confirm that $\delta^{h'}_h \delta^{\hat{i}}_{\hat{i}'}$ behaves like the identity matrix for every $h$, i.e., it represents a noiseless transmission. When we use the DPI for transfer entropy, it follows that this is the case when noise is absent in the relation $X \rightarrow Y$, or in the relation $Y \rightarrow Z$, or in both relations.

The DPI is a consequence of Eq. (9a). The $h$th subchannel of the inverse multiplexer of the chain itself consists of a chain of two channels represented by the tensors $\bar{A}$ and $B$ with fixed $h$. For this subchannel the DPI is valid: $f(C_h, \star) \leq \min[f(\bar{A}_h, \star), f(B_h, \star)]$. Transfer entropy is the weighted sum of the TE per subchannel (Eq. (5)). From this it follows that $f(C, \star) \leq \min[f(\bar{A}, \star), f(B, \star)]$, i.e., the Data Processing Inequality for transfer entropy. The tensor $\bar{A}_h$ is the result of two cascaded channels represented by $A_h$ and a tensor with elements $p^g_{\hat{i}h}$. In this case the DPI leads to $f(\bar{A}, \star) \leq f(A, \star)$. Combining these inequalities we find that for the chain $X \rightarrow Y \rightarrow Z$
$$TE_{X \rightarrow Z} \leq \min[TE_{X \rightarrow Y}, TE_{Y \rightarrow Z}]. \qquad (10)$$
In conjunction with the tensor equations of Eq. (9), we need to take the delays into account to determine whether a relation is true or false. We posit that interaction delays in a chain are additive. This also applies to a fork because the $\ddagger$-operation is a time reversal operation: $\tau^\ddagger = -\tau$. The fork $X \rightarrow Y$, $X \rightarrow Z$ is equivalent to the chain $Y \rightarrow^{\ddagger} X \rightarrow Z$. The total delay for this equivalent chain is $\tau_{yz} = -\tau_{yx} + \tau_{xz}$. The fork is also equivalent to the chain $Y \leftarrow X \leftarrow^{\ddagger} Z$, so $\tau_{zy} = -\tau_{zx} + \tau_{xy}$. Of these, only the relations with a nonnegative total delay could represent physical processes. The proof of additivity is not in the scope of this letter; the DPI, however, makes it plausible. If the total optimal delay in a chain differs from the sum of the individual optimal delays, the TE of at least one individual relation is not maximized. This lowers the upper boundary as given by Eq. (10).

To determine when the bivariate approach cannot be used, we investigated the v-structure $X \rightarrow Z \leftarrow Y$. Due to the multivariate relation $D: \{X, Y\} \rightarrow Z$ there is the additional linear transformation
$$p^k_h = \delta^{h'}_h\, p^{\hat{i}''\hat{j}'}_h\, D^k_{h'\hat{i}''\hat{j}'}. \qquad (11)$$
Under the assumption that $\hat{i}'' = \hat{i}'$ and $\hat{j}' = \hat{j}$, and using the fact that $p^{\hat{i}\hat{j}}_h = \delta^{h'}_h \delta^{\hat{i}'}_{\hat{i}}\, p^{\hat{i}}_h\, p^{\hat{j}}_{h'\hat{i}'}$, we get the following two relations relating $D$ to both $B$ and $C$:
$$C^k_{h\hat{i}} = \delta^{h'}_h \delta^{\hat{i}'}_{\hat{i}}\, p^{\hat{j}}_{h\hat{i}'}\, D^k_{h'\hat{i}\hat{j}}, \qquad (12a)$$
$$B^k_{h\hat{j}} = \delta^{h'}_h \delta^{\hat{j}'}_{\hat{j}}\, p^{\hat{i}}_{h\hat{j}'}\, D^k_{h'\hat{i}\hat{j}}. \qquad (12b)$$
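Equation (10) can be checked numerically on a simulated chain. The sketch below is ours: it reuses the hypothetical `transfer_entropy` helper from the previous sketch, and the chain, delays, and noise levels are our own choices. It also illustrates the delay additivity posited above ($\tau_{xz} = \tau_{xy} + \tau_{yz} = 2$).

```python
import numpy as np

rng = np.random.default_rng(2)

def flip(s, p):
    """Pass a binary series through a binary symmetric channel."""
    return s ^ (rng.random(s.size) < p)

# Chain X -> Y -> Z: each arrow is a one-step delay plus bit flips.
x = rng.integers(0, 2, 200_000)
y = flip(np.roll(x, 1), 0.1)
z = flip(np.roll(y, 1), 0.2)

te_xy = transfer_entropy(x, y)           # helper from the previous sketch
te_yz = transfer_entropy(y, z)
te_xz = transfer_entropy(x, z, tau=2)    # total delay is additive: 1 + 1

# Eq. (10): the Data Processing Inequality for transfer entropy.
print(te_xz <= min(te_xy, te_yz))        # True
```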
FIG. 1. The channel capacity for the relations $X^1 \rightarrow X^2$ and $X^2 \rightarrow X^1$ as a function of the coupling strength $\epsilon$ (vertical axes: transfer entropy [bits] and channel capacity [bits]). Dots: channel capacity for quantized data. Line: transfer entropy as determined by Schreiber [1].

Let's assume that $\hat{i} \leq N$ and $\hat{j} \leq M$. In the bivariate approach we want to determine the tensor $D$ using the bivariate measurements. The reader can confirm that this is only possible in the case $N, M \in \{1, 2\}$. If there are 2 or more indirect paths between two nodes, the bivariate analysis can, in theory, not be used.

We finalize this letter with an experiment to illustrate that nonlinear behavior is indeed captured by measuring the probability transition tensors and calculating the channel capacities. We use the one-dimensional lattice of unidirectionally coupled maps $x^m_{n+1} = f\!\left(\epsilon x^{m-1}_n + (1-\epsilon) x^m_n\right)$. Information can only be transferred from $X^{m-1}$ to $X^m$. The Ulam map with $f(x) = 2 - x^2$ is interesting because there are two regions ($\epsilon \approx 0.18$ and $\epsilon \approx 0.82$) where no information is shared between maps [1]. We used the following quantization scheme: if $x_{n-1} \geq x_n < x_{n+1}$ or $x_{n-1} < x_n \geq x_{n+1}$ then $x'_n := 1$, otherwise $x'_n := 0$. Furthermore, we chose $\ell = m = 1$ (see Eq. (3)).
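A sketch of this experiment, under simplifications of our own (a ring boundary for the lattice, and reuse of the hypothetical `channel_capacity` helper from the earlier sketch): it simulates the coupled Ulam maps, applies the quantization scheme above, and estimates the per-subchannel transition tensors and their capacities.

```python
import numpy as np

rng = np.random.default_rng(3)

def ulam_lattice(eps, n_maps=10, n_steps=100_000, burn_in=1_000):
    """Unidirectionally coupled Ulam maps, here on a ring for simplicity:
    x^m_{n+1} = f(eps * x^{m-1}_n + (1 - eps) * x^m_n), f(x) = 2 - x^2."""
    x = rng.uniform(-2.0, 2.0, n_maps)
    out = np.empty((n_steps, n_maps))
    for n in range(burn_in + n_steps):
        x = 2.0 - (eps * np.roll(x, 1) + (1.0 - eps) * x) ** 2
        if n >= burn_in:
            out[n - burn_in] = x
    return out

def quantize(x):
    """x'_n := 1 at a turning point (local extremum), 0 otherwise."""
    lo, mid, hi = x[:-2], x[1:-1], x[2:]
    return (((lo >= mid) & (mid < hi)) | ((lo < mid) & (mid >= hi))).astype(int)

def subchannel_capacities(x, y, tau=1):
    """Estimate A^j_{g i} = p(y_t | x_{t-tau}, y_{t-1} = g), one DMC per
    past value g, and return each capacity (channel_capacity: earlier sketch)."""
    y_t, g, x_p = y[tau:], y[tau - 1:-1], x[:-tau]
    caps = {}
    for gv in (0, 1):
        counts = np.array([[np.sum((g == gv) & (x_p == i) & (y_t == j))
                            for j in (0, 1)] for i in (0, 1)], dtype=float)
        A = counts / counts.sum(axis=1, keepdims=True)
        caps[gv] = channel_capacity(A)
    return caps

data = ulam_lattice(eps=0.3)
x1, x2 = quantize(data[:, 1]), quantize(data[:, 2])
print(subchannel_capacities(x1, x2))   # clearly nonzero: flow along coupling
print(subchannel_capacities(x2, x1))   # close to zero: no flow against it
```

Weighting these per-subchannel capacities by $p(\psi^-_g)$ gives the upper boundary of Eq. (13) below.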
Instead of maximizing TE, we maximized the channel capacity to determine the optimal delay. In the case of no or weak autoregression in the data we use the upper boundary
$$\max_{p(x)}[TE_{X \rightarrow Y}] \leq \sum_g p(\psi^-_g)\, C_{XY|\psi^-_g}. \qquad (13)$$
The relations for the set $\{X^1, X^2\}$ were measured with significance level 0.01. The delays were varied between 1 and 20. The channel capacity is maximal for a delay of 1 sample. As can be seen in Fig. 1, the structure is identical to the one determined by Schreiber.

To conclude, we have shown that we are capable of determining the causal structure as far as it is encoded in the probability mass functions of quantized time series. Instead of computing transfer entropy, we determine probability transition tensors that transform source data into destination data. These were used to show that in specific cases bivariate analysis suffices to distinguish false relations from true relations. We also used them to derive the Data Processing Inequality for transfer entropy. Our approach is only applicable to noisy data. No assumptions were made about the cardinality of the alphabets. This implies that there must be an equivalent approach for non-quantized data.

I would like to thank Errol Zalmijn for introducing me to the wonderful topic of transfer entropy and Marcel Brunt for helping me to implement these principles in Matlab. Also thanks to Hans Onvlee, S. Kolumban, Rui M. Castro, and T. Heskes for their comments on earlier versions of the manuscript. This work was performed under the auspices of ASML PI System Diagnostics.
[1] Thomas Schreiber. Measuring information transfer. Phys. Rev. Lett., 85:461-464, Jul 2000.
[2] Jakob Runge, Jobst Heitzig, Vladimir Petoukhov, and Jürgen Kurths. Escaping the curse of dimensionality in estimating multivariate transfer entropy. Phys. Rev. Lett., 108:258701, Jun 2012.
[3] Michael Wibral, Patricia Wollstadt, Ulrich Meyer, Nicolae Pampu, Viola Priesemann, and Raul Vicente. Revisiting Wiener's principle of causality: interaction-delay reconstruction using transfer entropy and multivariate analysis on delay-weighted graphs. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2012, San Diego, CA, USA, August 28 - September 1, 2012, pages 3676-3679, 2012.
[4] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, s2-42(1):230-265, 01 1937.
[5] J. T. Lizier and M. Prokopenko. Differentiating information transfer and causal effect. The European Physical Journal B, 73(4):605-615, Feb 2010.
[6] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2009.
[7] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379-423, 1948.
[8] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.
[9] Wen Li and Michael K. Ng. On the limiting probability distribution of a transition probability tensor. Linear and Multilinear Algebra, 62, 03 2014.
[10] S. Muroga. On the capacity of a discrete channel. Journal of the Physical Society of Japan, 8:484-494, July 1953.
[11] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, 2000.
[12] Michael Wibral, Nicolae Pampu, Viola Priesemann, Felix Siebenhühner, Hannes Seiwert, Michael Lindner, Joseph T. Lizier, and Raul Vicente. Measuring information-transfer delays. PLoS ONE, 8(2):e55809, 2013.