arXiv [q-bio.NC] Aug

Quasi-Conscious Multivariate Systems
Jonathan W. D. Mason, Mathematical Institute, University of Oxford, UK (Submitted to Complexity 2015)
Abstract
Conscious experience is awash with underlying relationships. Moreover, for various brain regions such as the visual cortex, the system is biased toward some states. Representing this bias using a probability distribution shows that the system can define expected quantities. The mathematical theory in the present paper links these facts by using expected float entropy (efe), which is a measure of the expected amount of information needed, to specify the state of the system, beyond what is already known about the system from relationships that appear as parameters. Under the requirement that the relationship parameters minimise efe, the brain defines relationships. It is proposed that when a brain state is interpreted in the context of these relationships the brain state acquires meaning in the form of the relational content of the associated experience. For a given set, the theory represents relationships using weighted relations which assign continuous weights, from 0 to 1, to the elements of the Cartesian product of that set. The relationship parameters include weighted relations on the nodes of the system and on their set of states. Examples obtained using Monte-Carlo methods (where relationship parameters are chosen uniformly at random) suggest that efe distributions with long left tails are most important.
In the present paper we further develop the theory introduced in the article ‘Consciousness and the structuring property of typical data’ (see [1]), and demonstrate and investigate the theory through applications in a number of examples using computational methods. It is intended that the theory will provide a way into the mathematics that underpins how the brain defines the relational content of consciousness. Indeed, conscious experience clings to a substrate of underlying relationships: points in a person’s field of view can be strongly related (if close together) or unrelated (if far apart), giving geometry; colours can appear similar (e.g. red and orange) or completely different (e.g. red and green). We can make a very long list of such examples of relations involving different sounds, smells, tastes and locations of touch. Furthermore, at a higher semantic level involving several brain regions, if we see someone we know and hear a person’s name then we know whether the name relates to that person. It is hard to think of any conscious experience that does not involve relations. Whilst it is difficult to explain how the brain defines the colour blue, in the present paper we hope to provide the beginnings of a mathematical theory for how the brain defines all of the relations underlying consciousness and, therefore, explain why, for example, blue appears similar to turquoise but different to red. It is proposed that when a brain state is interpreted in the context of all these relations, defined by the brain, the brain state acquires meaning in the form of the relational content of the experience. If we consider the relations defined by the brain to be a type of statistic then we have the following analogy. A single observation of a one-dimensional random variable is almost meaningless, but in the context of the statistics of the random variable, such as mean and variance, the observation has meaning.
For arguments in support of this approach, the reader is referred to [1]. The issue of how a system such as the brain defines relations is crucial. Importantly, for various brain regions such as the visual cortex, (under temporally well spaced observations of the system) the probability distribution over the different possible states of the system is far from being uniform owing to learning rules of which the Bienenstock, Cooper and Munro (BCM) version of Hebbian theory is one candidate; see [2], [3] and [4]. Hence, the brain is not merely driven by the current sensory input, but is biased toward certain states as a result of a long history of sensory inputs. The probability distribution over the states of the system is therefore a property of the system itself, allowing the system to define expected quantities. In the theory presented in the present paper, the brain defines relations under the requirement that the expected quantity of a particular type of entropy is minimised. We call this entropy float entropy. For a collection of relations on the system and any given state of the system, the float entropy of the state is a measure of the amount of information required, in addition to the information given by the relations, in order to specify that state. We make the definition of float entropy precise in Subsection 1.1. However, later in the present paper we will give a more general definition (multi-relational float entropy) which allows the involvement of more than two relations; see Subsection 4.1. We will also consider a time dependent version, and the theory of the present paper will be compared with Integrated Information Theory and Shannon entropy.
In this subsection we provide the main definitions in the present paper. Systems such as the brain, and its various regions, are networks of interacting nodes. In the case of the brain we may take the nodes of the system to be the individual neurons or possibly larger structures such as cortical columns. The nodes of the system have a repertoire (range) of states that they can be in. For example, the states that neurons can be in could be associated with different firing frequencies. In the present paper we assume that the node repertoire is finite (as was assumed in [1]), and the state of the system is the aggregate of the states of the nodes. The original theory in [1] used a mainly set theoretic approach, where a relation on a nonempty set $S$ was usually taken to be a binary relation $R \subseteq S \times S$. Weighted relations (see below) are slightly more general than binary relations, and the further development (presented in the present paper) of the original theory uses weighted relations because they allow a system to define a weighted relation on the repertoire of its nodes. This is desirable as we will see from examples later in the paper. In Definition 1.1 the elements of the set $S$ are to be taken as the nodes of the system.

Definition 1.1.
Let $S$ be a nonempty finite set, $n := |S|$. Then a data element for $S$ is a set (having a unique arbitrary index label $i$) $S_i := \{(a, f_i(a)) : a \in S\}$, where $f_i : S \to V$ is a map and $V := \{v_1, v_2, \ldots, v_m\}$ is the node repertoire. The set of all data elements for $S$ given $V$ is $W_{S,V}$, so that $|W_{S,V}| = m^n$. For temporally well spaced observations, it is assumed that a given finite system defines a random variable with probability distribution $P : W_{S,V} \to [0,1]$ for some finite set $S$ and node repertoire $V$. If $T$ is a finite set of numbered observations of the system then $T$ is called the typical data for $S$. The elements of $T$ (called typical data elements) are handled using a function $t : \{1, \ldots, |T|\} \to \{i : S_i \in W_{S,V}\}$, where $S_{t(k)}$ is the value of observation number $k$ for $k \in \{1, \ldots, |T|\}$. In particular, the function $t$ need not be injective since small systems may be in the same state for several observations.

Remark 1.1. Note that $P$ in Definition 1.1 extends to a probability measure on the power set $2^{W_{S,V}}$ of $W_{S,V}$ by defining $P(A) := \sum_{S_i \in A} P(S_i)$, for $A \in 2^{W_{S,V}}$. Hence, we have a probability space $(W_{S,V}, 2^{W_{S,V}}, P)$ with sample space $W_{S,V}$, sigma-algebra $2^{W_{S,V}}$, and probability measure $P$. We now need the definition of a weighted relation.
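As a concrete toy illustration of Definition 1.1 (the three-node set, two-state repertoire and the typical data below are assumptions for the sketch, not data from the paper), the set $W_{S,V}$ and an empirical approximation of $P$ from typical data can be computed as follows:

```python
from itertools import product

# A data element assigns a repertoire state to each node, so W_{S,V} is the
# set of all functions S -> V and |W_{S,V}| = m^n. Hypothetical toy sizes.
S = ["node1", "node2", "node3"]          # n = 3 nodes
V = [0, 1]                               # m = 2 repertoire states

# Enumerate W_{S,V} as tuples of node states (one coordinate per node).
W = list(product(V, repeat=len(S)))
assert len(W) == len(V) ** len(S)        # |W_{S,V}| = m^n = 8

# Typical data T: a finite list of (possibly repeated) observations,
# i.e. the function t need not be injective.
T = [(0, 1, 1), (0, 1, 1), (1, 0, 0), (0, 0, 1)]

# Empirical approximation of the probability distribution P on W_{S,V}.
P = {w: T.count(w) / len(T) for w in W}
assert abs(sum(P.values()) - 1.0) < 1e-12
```

The dictionary `P` extends to the probability measure of Remark 1.1 by summing its values over any subset of `W`.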
Definition 1.2 (Weighted relations). Let $S$ be a nonempty set. A weighted relation on $S$ is a function of the form $R : S \times S \to [0,1]$, where $[0,1]$ is the unit interval. We say that $R$ is:
1. reflexive if $R(a,a) = 1$ for all $a \in S$;
2. symmetric if $R(a,b) = R(b,a)$ for all $a, b \in S$.
The set of all reflexive, symmetric weighted relations on $S$ is denoted $Y_S$.

Remark 1.2.
Except where stated, the weighted relations used in the present paper are reflexive and symmetric. Relative to such a weighted relation, the value $R(a,b)$ quantifies the strength of the relationship between $a$ and $b$, interpreted in accordance with the usual order structure on $[0,1]$ so that $R(a,b) = 1$ is a maximum. For a small finite set, it is useful to display a weighted relation on that set as a weighted relation table (i.e. as a matrix). Before giving the definition of float entropy we require Definitions 1.3 and 1.4.
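A weighted relation table can be checked for membership of $Y_S$ with a few lines of code; the 3-by-3 table below is a hypothetical example:

```python
# A weighted relation on a small set, displayed as a matrix (a "weighted
# relation table"). The 3-element set and the weights are assumptions.
R = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]

def is_reflexive(R):
    # Definition 1.2(1): R(a, a) = 1 for all a.
    return all(R[i][i] == 1.0 for i in range(len(R)))

def is_symmetric(R):
    # Definition 1.2(2): R(a, b) = R(b, a) for all a, b.
    n = len(R)
    return all(R[i][j] == R[j][i] for i in range(n) for j in range(n))

assert is_reflexive(R) and is_symmetric(R)   # hence R is in Y_S
```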
Definition 1.3.
Let $S$ be as in Definition 1.1 and let $U : V \times V \to [0,1]$ be a reflexive, symmetric weighted relation on the node repertoire $V$; i.e. $U \in Y_V$. Then, for each data element $S_i \in W_{S,V}$, we define a function $R_{\{U,S_i\}} : S \times S \to [0,1]$ by setting $R_{\{U,S_i\}}(a,b) := U(f_i(a), f_i(b))$ for all $a, b \in S$. It is easy to see that $R_{\{U,S_i\}} \in Y_S$.

Definition 1.4.
Let $S$ be a nonempty finite set. Every weighted relation on $S$ can be viewed as a $|S|^2$-dimensional real vector. Hence, the $d_n$ metric is a metric on the set of all such weighted relations by setting
$$d_n(R, R') := \Big( \sum_{(a,b) \in S \times S} |R(a,b) - R'(a,b)|^n \Big)^{1/n},$$
where $R$ and $R'$ are any two weighted relations on $S$. Similarly we have the metric
$$d_\infty(R, R') := \max_{(a,b) \in S \times S} |R(a,b) - R'(a,b)|.$$

Definition 1.5 (Float entropy). Let $S$ be as in Definition 1.1, let $U \in Y_V$, and let $R \in Y_S$. The float entropy of a data element $S_i \in W_{S,V}$, relative to $U$ and $R$, is defined as
$$fe(R, U, S_i) := \log_2\big( |\{ S_j \in W_{S,V} : d(R, R_{\{U,S_j\}}) \leq d(R, R_{\{U,S_i\}}) \}| \big),$$
where, in the present paper (unless otherwise stated), $d$ is the $d_1$ metric. Furthermore, let $P : W_{S,V} \to [0,1]$ and $T$ be as in Definition 1.1. The expected float entropy, relative to $U$ and $R$, is defined as
$$efe(R, U, P) := \sum_{S_i \in W_{S,V}} P(S_i)\, fe(R, U, S_i).$$
The $efe(R, U, T)$ approximation of $efe(R, U, P)$ is defined as
$$efe(R, U, T) := \frac{1}{|T|} \sum_{k=1}^{|T|} fe(R, U, S_{t(k)}),$$
where $t$ need not be injective by Definition 1.1. By construction, efe is measured in bits per data element (bpe). It is proposed that a system (such as the brain and its subregions) will define $U$ and $R$ (up to a certain resolution) under the requirement that the efe is minimised. Hence, for a given system (i.e. for a fixed $P$), we attempt to find solutions in $U$ and $R$ to the equation
$$efe(R, U, P) = \min_{R' \in Y_S,\, U' \in Y_V} efe(R', U', P). \quad (1)$$
In practice we replace $efe(\cdot, \cdot, P)$ in (1) with $efe(\cdot, \cdot, T)$.

Remark 1.3.
In Definition 1.5 the $d_1$ metric is used. It turns out that, amongst many metrics, a change in metric has only a small effect on the solutions to (1). There are also plenty of pathological metrics which, when used, will significantly change the solutions to (1). In Remark 1.2 we mentioned that, for a weighted relation, the value of $R(a,b)$ is interpreted in accordance with the usual order structure on $[0,1]$. We argue that the order structure to be used on $[0,1]$ should be determined by the metric that is being used in Definition 1.5. Hence, for a pathological metric, whilst the solutions to (1) will have changed, their interpretation as weighted relations may be largely unchanged when the order structure used on $[0,1]$ is determined by the metric being used (when this makes sense). In practice, we want to use the usual order structure on $[0,1]$, and this requirement limits which metrics should be used in Definition 1.5. We will look at the issue of metrics in some detail in Subsection 3.3.

Remark 1.4.
The theory presented in the present paper uses the definitions in Subsection 1.1. Suppose we restricted these definitions so that the only weighted relation we could use on the node repertoire $V$ was the Kronecker delta, and the only elements of $Y_S$ we could use were weighted relations taking values in the two point set $\{0, 1\}$. Then, under these restrictions, Definition 1.5 would yield a definition of float entropy equivalent to that given in [1]. Indeed, note that a weighted relation $R : S \times S \to \{0, 1\}$ is given by the indicator function for the relation $\{(a,b) \in S \times S : R(a,b) = 1\} \subseteq S \times S$. Hence, the theory presented in the present paper is indeed a development of the theory presented in [1].

Remark 1.5.
With reference to Remark 1.1 and Definition 1.5, for $A \in 2^{W_{S,V}}$, we have the weak conditional efe
$$efe(R, U, P \mid A) := \sum_{S_i \in W_{S,V}} P(S_i \mid A)\, fe(R, U, S_i).$$
Weak conditional efe can be useful when considering a system that has entered a particular mode such that this mode restricts the system to a particular set of data elements. There may be other useful definitions of conditional efe.

The examples in the present paper are intended to have relevance to the visual cortex and our experience of monocular vision. In lieu of typical data for the visual cortex we apply the theory to typical data for digital photographs of the world around us. If the theory, as used in the examples, is relevant to the visual cortex then the examples show that the perceived relationships between different colours, the perceived relationships between different brightnesses, and the perceived relationships between different points in a person’s field of view (giving geometry) are all defined by the brain in a mutually dependent way. Hence, in this case, there is a connection between the relationships that underlie colour perception and our perception of the underlying geometry of the world around us. Of course the states of the visual cortex are somewhat more complicated than digital photographs since some neurons have sophisticated receptive fields. However, the theory presented in the present paper does not assume that the nodes of the visual cortex have to be individual neurons. Instead, each node can consist of many neurons; effectively representing the data elements using a larger base (note that we can think of the node repertoire as being analogous to a choice of base in the representation of integers). Hence, the examples could well be relevant to the visual cortex.
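To make Definitions 1.3 and 1.5 concrete before moving on, here is a minimal computational sketch of fe and efe on a deliberately tiny two-node, two-state system, using the $d_1$ metric; the particular $U$, $R$ and typical data $T$ are illustrative assumptions:

```python
from itertools import product
from math import log2

# Minimal sketch of Definitions 1.3 and 1.5 on a tiny assumed system:
# two nodes with node repertoire V = {0, 1}. U, R and T are illustrative.
n, V = 2, [0, 1]
W = list(product(V, repeat=n))                # all data elements, |W| = 4

U = {(x, y): 1.0 if x == y else 0.3 for x in V for y in V}   # U in Y_V
R = [[1.0, 0.7], [0.7, 1.0]]                                 # R in Y_S

def induced(f):
    # R_{U,S_i}(a, b) = U(f_i(a), f_i(b))  (Definition 1.3).
    return [[U[(f[a], f[b])] for b in range(n)] for a in range(n)]

def d1(A, B):
    return sum(abs(A[a][b] - B[a][b]) for a in range(n) for b in range(n))

def fe(f_i):
    # fe = log2 of the number of data elements whose induced relation is
    # at least as close to R as that of f_i (Definition 1.5).
    r = d1(R, induced(f_i))
    return log2(sum(1 for f_j in W if d1(R, induced(f_j)) <= r))

def efe(T):
    # The efe(R, U, T) approximation: average fe over the typical data.
    return sum(fe(f) for f in T) / len(T)

T = [(0, 0), (1, 1), (0, 0)]
print(efe(T))                                 # prints 1.0
```

Here the typical data is concentrated on the two constant data elements, so each observed element has small float entropy (1 bit), whereas the unobserved mixed elements have the maximal 2 bits.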
A preliminary discussion and investigation regarding base is presented in Subsection 3.1. We also apply the theory to a system where the probability distribution $P$ in Definition 1.1 is uniform over $W_{S,V}$. In this case the solutions to (1) vary greatly (instead of all being similar) and, hence, the system fails to define weighted relations that give a coherent interpretation of the states of the system. The variation in the solutions to (1) is partly due to symmetries, and this is discussed in Example 3.4.

It is argued in [1] that the theory presented there provides a solution to the binding problem and avoids the homunculus fallacy. Those arguments also apply to the theory presented in the present paper. In particular, consciousness is not the output of some algorithmic process but it may instead, largely, be the states of the system interpreted in the context of the weighted relations that minimise expected float entropy, where here we are talking about a definition of float entropy that involves more than the two weighted relations used in (1); see Subsection 4.1. This argument may become clearer for the reader after going through the examples in the present paper.

The rest of the paper is organised as follows. Section 2 looks at obtaining typical data from digital photographs, and specifies the computational methods used for finding solutions to (1). Section 3 provides six examples in which the theory is applied. We continue the development of the theory by looking at changing the base of a system, joining and partitioning systems, and metric independence. Section 4 provides generalisations of Definition 1.5, a comparison between the present theory and both Giulio Tononi’s Integrated Information Theory (IIT) and Shannon entropy, followed by the conclusion. Appendix A lists the software used, and Appendix B provides a list of notation.
In this section we look at obtaining typical data from digital photographs, a binary search algorithm for finding solutions to (1), and the use of efe-histograms to assess guessed solutions to (1).
When obtaining a typical data element from a digital photograph, in the present paper, only a small part of the photograph is used. This is because the computational methods used in the present paper are only suitable for small systems although, at the expense of clarity and ease of implementation, other more efficient computational methods are possible for investigating larger systems; see Appendix A, which lists the software used during the research for the present paper and provides a discussion on more efficient computational methods.

Figure 1 shows the sampling of a digital photograph such that the typical data element obtained is for a system comprised of five nodes with a four state node repertoire ($|W_{S,V}| = 4^5 = 1024$). Also, in the case of Figure 1, we are using pixel brightness to determine node state.

Figure 1: Digital photograph sampling using five nodes and a four shade gray scale.

From top-left to bottom-right, the first image is the original. This image is desaturated (the colours are turned into shades of gray) and then the contrast is enhanced. The contrast enhancement is not required, but it was thought that it might reduce the number of typical data elements needed in order to obtain meaningful results. Indeed, when similar, the solutions to (1) are rather like a type of statistic and, therefore, when using typical data we need to make sure that the sample size is large enough. The image is then posterised (in this case the number of shades is reduced to four, giving a four state node repertoire). Finally, five pixels are sampled, giving the state of each of the five nodes of the typical data element; see Table 1. To obtain the typical data for the system, this way of obtaining typical data elements is used for several hundred digital photographs.

Table 1: Node states of the typical data element $S_{t(k)}$ obtained from the sampling in Figure 1 (node 1, node 2, node 3, node 4, node 5).

Importantly, whatever the geometric layout of the pixel sampling locations (in Figure 1 the layout is part of a grid that has adjacent locations every ten pixels), the same layout must be used for all of the digital photographs. Similarly, the same criteria must be used for determining the node states.

The sampling in Figure 2 obtains a typical data element for a system comprised of five nodes with a nine state node repertoire ($|W_{S,V}| = 9^5 = 59049$). Here node state is determined by pixel colour over a red/green palette.

Figure 2: Digital photograph sampling using five nodes and a nine colour red/green palette.

From top-left to bottom-right, we first have the original image to which colour contrast enhancement is applied. The image is then restricted to colours made up of red and green by setting blue values to zero. The image is then posterised (three values for red and three values for green are used, giving a nine state node repertoire).
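The grayscale sampling of Figure 1 can be sketched as follows; the synthetic image, the posterisation rule and the sampling locations below are assumptions standing in for a real photograph pipeline:

```python
# Sketch of the sampling procedure of Figure 1 on a synthetic grayscale
# "photograph" (a list of rows of 0-255 brightness values). The grid layout,
# number of shades and sampling locations are illustrative assumptions; the
# same locations and criteria must be reused for every photograph.
def posterise(value, shades=4):
    # Map a 0-255 brightness to one of `shades` node states (0..shades-1).
    return min(value * shades // 256, shades - 1)

def sample_data_element(image, locations, shades=4):
    # One typical data element: the posterised state at each sampling location.
    return tuple(posterise(image[r][c], shades) for (r, c) in locations)

# A 30x30 synthetic image with a simple top-to-bottom brightness gradient.
image = [[(r * 255) // 29 for c in range(30)] for r in range(30)]

# Five sampling locations on a grid with adjacent locations every ten pixels.
locations = [(0, 0), (0, 10), (10, 0), (10, 10), (20, 20)]

element = sample_data_element(image, locations)
print(element)   # one row of the typical data table, cf. Table 1
```

Repeating `sample_data_element` over many photographs, with the same `locations` and the same `posterise` criterion, yields the typical data $T$.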
Finally, five pixels are sampled; the result is given in Table 2. We now consider computational methods for finding solutions to (1).

Table 2: Node states of the typical data element $S_{t(k)}$ obtained from the sampling in Figure 2 (node 1, node 2, node 3, node 4, node 5).

For any given system, let $n = |S|$ and $m = |V|$.

Step 1. The initial approximation of a solution to (1) is taken to be the pair $U \in Y_V$ and $R \in Y_S$ with $U(v, v') = 1/2$ and $R(a, b) = 1/2$ for all $v, v' \in V$, $v \neq v'$, and $a, b \in S$, $a \neq b$, respectively.

Step 2. For $U$ and $R$ (shown in Table 3) a given approximate solution to (1), let $k = 2^{-(q+1)}$ where $q = \min\{i \in \mathbb{N} : 2^i u_{1,2} \in \mathbb{N}\}$. We now calculate the efe value of the system for each combination of the entries in Table 4 that give symmetric weighted relations. This is a binary search in the sense that there are two options per entry.

Table 3: Approximate solution to Equation (1).

U      | v_1     | v_2     | v_3     | ...
v_1    | 1       | u_{1,2} | u_{1,3} | ...
v_2    | u_{1,2} | 1       | u_{2,3} | ...
v_3    | u_{1,3} | u_{2,3} | 1       | ...

R      | node 1  | node 2  | node 3  | ...
node 1 | 1       | r_{1,2} | r_{1,3} | ...
node 2 | r_{1,2} | 1       | r_{2,3} | ...
node 3 | r_{1,3} | r_{2,3} | 1       | ...

Table 4: Binary entries over which to search for approximate solutions to (1).

U      | v_1         | v_2         | v_3         | ...
v_1    | 1           | u_{1,2} ± k | u_{1,3} ± k | ...
v_2    | u_{1,2} ± k | 1           | u_{2,3} ± k | ...
v_3    | u_{1,3} ± k | u_{2,3} ± k | 1           | ...

R      | node 1      | node 2      | node 3      | ...
node 1 | 1           | r_{1,2} ± k | r_{1,3} ± k | ...
node 2 | r_{1,2} ± k | 1           | r_{2,3} ± k | ...
node 3 | r_{1,3} ± k | r_{2,3} ± k | 1           | ...

Step 3. If the minimum of the efe values, obtained in Step 2, was given by only one of the pairs of weighted relations tested in Step 2 then redefine $U$ and $R$ as this new pair of weighted relations and return to Step 2. Otherwise, output $U$, $R$ and their associated efe value, and stop.

If the algorithm did not stop then the chronology of approximate solutions, given by the applications of Step 3, would be a convergent sequence with respect to $d_1$ and any of the metrics in Definition 1.4. However, for $m, n \geq 2$, both $Y_V$ and $Y_S$ are uncountably infinite sets, whereas the number of possible efe values is finite. Hence, some efe values result from infinitely many weighted relations in $Y_V$ and $Y_S$. It is not surprising then that, as the approximate solutions become closer with respect to $d_1$, ultimately the algorithm stops at Step 3. In short, the system defines $U$ and $R$ (up to a certain resolution) under the requirement that the efe is minimised.

This search algorithm works well; see its use in Section 3. However, the number of efe values calculated during each application of Step 2 is $2^{(n(n-1)+m(m-1))/2}$, so even a modest system requires the calculation of a very large number of efe values before stopping. Hence, the present paper also uses the following, computationally less expensive, method for approximating solutions to (1); also see Appendix A concerning more efficient computational methods.

efe-histograms obtained from Monte-Carlo methods

Here we choose $U \in Y_V$ and $R \in Y_S$ uniformly at random. With reference to Table 3, this is done by choosing each off-diagonal upper-triangular entry of $U$ and $R$ uniformly at random from the interval $[0,1]$ (the off-diagonal lower-triangular entries are then those making $U$ and $R$ symmetric). The efe value is then calculated and stored, and the whole process is repeated, producing a list of many thousands of efe observations from which an efe-histogram can be obtained. With this setup, if we wish to treat efe as a random variable then standard methods can be used for approximating the probability distribution from the efe values (although this can be difficult for distributions with very thin tails). In any case, provided enough observations are made, the efe-histogram can be used to help assess guesses when guessing approximate solutions to (1).

However, we need to be careful concerning what is meant by ‘choose uniformly at random from the interval $[0,1]$’. Usually, this means that all subintervals of the same length are equally probable events.
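Returning to the binary search of Subsection 2.2, a minimal sketch on a toy two-node, two-state system may help; the typical data, the compact efe implementation and the stopping behaviour shown are illustrative assumptions:

```python
from itertools import product
from math import log2
from fractions import Fraction

# Sketch of the binary search on a tiny assumed 2-node, 2-state system.
# Each cycle perturbs the free entries u_{1,2} and r_{1,2} by +/-k, keeps a
# unique efe-minimising candidate and halves k; it stops on a tie (Step 3).
V = [0, 1]
W = list(product(V, repeat=2))
T = [(0, 0), (1, 1), (0, 0), (1, 1), (0, 1)]     # illustrative typical data

def efe(u12, r12):
    U = {(x, y): 1.0 if x == y else u12 for x in V for y in V}
    R = [[1.0, r12], [r12, 1.0]]
    def d1(A, B):
        return sum(abs(A[a][b] - B[a][b]) for a in range(2) for b in range(2))
    def induced(f):
        return [[U[(f[a], f[b])] for b in range(2)] for a in range(2)]
    def fe(f_i):
        r = d1(R, induced(f_i))
        return log2(sum(1 for f_j in W if d1(R, induced(f_j)) <= r))
    return sum(fe(f) for f in T) / len(T)

u12, r12 = Fraction(1, 2), Fraction(1, 2)        # Step 1: initial pair
k = Fraction(1, 4)                               # k = 2^-(q+1) with q = 1
while True:
    candidates = [(u12 + du, r12 + dr) for du in (-k, k) for dr in (-k, k)]
    values = [efe(float(u), float(r)) for (u, r) in candidates]
    best = min(values)
    if values.count(best) > 1:                   # Step 3: tie, so stop
        break
    u12, r12 = candidates[values.index(best)]    # unique minimum: recurse
    k = k / 2

print(float(u12), float(r12))                    # prints 0.25 0.75
```

On this toy data the search moves $R$ toward strong relatedness of the two nodes and $U$ away from it, then stops at the first tie, illustrating how the resolution of the defined pair $(U, R)$ is limited.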
This is fine for us as long as the length of subintervals is determined by the metric used in Definition 1.5, which conveniently is $d_1$; see Subsection 3.3 for relevant details. We are now ready to apply the theory.

This section provides insight concerning how the theory performs in practice by way of several informative examples and investigations.
Example 3.1.
In this example 200 digital photographs of the world around us are used. The typical data is obtained using exactly the method shown in Figure 1, where the photographs have a four shade gray scale. Hence, $|T| = 200$ and the system is comprised of five nodes with a four state node repertoire ($|W_{S,V}| = 4^5 = 1024$). The binary search algorithm of Subsection 2.2 was applied to $T$ and, after ten cycles, returned the weighted relations in Table 5. Figure 3 provides a graph illustration of the weighted relations. For $U$, values above 0.2 are indicated with a solid line, whilst values from 0.02 to 0.2 are indicated with a dash line. For $R$, values above 0.9 are indicated with a solid line, whilst values from 0.75 to 0.9 are indicated with a dash line. Although $|T| = 200$ is rather small, $T$ has defined the correct relationships under the requirement that efe is minimised. As described in Subsection 2.3, Figure 4 provides an efe-histogram for $T$. For $U$ and $R$ in Table 5, the value of $efe(R, U, T)$ is indicated in Figure 4 by the triangular marker furthest to the left. The efe-histogram is negatively skewed with a long left tail, and this shape is usual for systems where the probability distribution $P$, in Definition 1.1, is far from uniform over $W_{S,V}$.

Table 5: The weighted relation $R$ returned by the binary search algorithm in Example 3.1.

R      | node 1        | node 2        | node 3        | node 4        | node 5
node 1 | 1             | 0.99853515625 | 0.62353515625 | 0.92041015625 | 0.78369140625
node 2 | 0.99853515625 | 1             | 0.94580078125 | 0.75244140625 | 0.93505859375
node 3 | 0.62353515625 | 0.94580078125 | 1             | 0.73486328125 | 0.88330078125
node 4 | 0.92041015625 | 0.75244140625 | 0.73486328125 | 1             | 0.98193359375
node 5 | 0.78369140625 | 0.93505859375 | 0.88330078125 | 0.98193359375 | 1

Figure 3: Graph illustration of the weighted relations in Table 5, showing strongest relationships (solid lines) and intermediate relationships (dash lines).

Figure 4: An efe-histogram for Example 3.1 using 200,000 observations and a bin interval of 0.01. For each cycle of the binary search algorithm, the efe value of the approximate solution obtained is shown (triangular marker).

Example 3.2 involves a larger system than that of Example 3.1. Here enlarging the system results in an increase in the difference between the minimum efe and the location (mean or median) of the efe-histogram.
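The Monte-Carlo efe-histogram procedure of Subsection 2.3 can be sketched as follows, again on an assumed toy two-node, two-state system with illustrative typical data (the real examples use far larger systems and sample counts):

```python
import random
from itertools import product
from math import log2
from collections import Counter

# Monte-Carlo sketch: draw the free entries of U and R uniformly at random
# from [0, 1], compute efe, and bin the observations into a histogram.
# The tiny 2-node, 2-state system and its typical data are assumptions.
V = [0, 1]
W = list(product(V, repeat=2))
T = [(0, 0), (1, 1), (0, 0), (1, 1), (0, 1)]

def efe(u12, r12):
    U = {(x, y): 1.0 if x == y else u12 for x in V for y in V}
    R = [[1.0, r12], [r12, 1.0]]
    def d1(A, B):
        return sum(abs(A[a][b] - B[a][b]) for a in range(2) for b in range(2))
    def induced(f):
        return [[U[(f[a], f[b])] for b in range(2)] for a in range(2)]
    def fe(f_i):
        r = d1(R, induced(f_i))
        return log2(sum(1 for f_j in W if d1(R, induced(f_j)) <= r))
    return sum(fe(f) for f in T) / len(T)

random.seed(0)                       # reproducible illustration
observations = [efe(random.random(), random.random()) for _ in range(10000)]

# Histogram with a bin interval of 0.1 (the paper's examples use 0.01-0.05).
histogram = Counter(int(x / 0.1) for x in observations)
```

Good guessed solutions to (1) should sit in the far left tail of such a histogram, which is how the triangular markers in Figures 4 to 9 are assessed.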
Example 3.2.
In this example 400 digital photographs of the world around us are used. The typical data is obtained using the method shown in Figure 1, except the number of sampling locations is increased from five to nine to form a three by three grid. Since $|T| = 400$ and $|W_{S,V}| = 4^9 = 262144$, this system is too large to apply the binary search algorithm. Instead, Table 5 in Example 3.1 was used to guess an approximate solution. Figure 5 provides an efe-histogram for $T$. The efe value for the approximate solution is indicated with a triangular marker and shows that the guess is favorable.

Figure 5: An efe-histogram for Example 3.2 using 2000 observations and a bin interval of 0.05. The efe value of the approximate solution is shown (triangular marker).

In the next two examples the theory is applied to systems where the probability distribution $P$, in Definition 1.1, is uniform over $W_{S,V}$. These two examples can be compared with Example 3.1.

Example 3.3.
In this example $|T| = 200$ and the system is comprised of five nodes with a four state node repertoire ($|W_{S,V}| = 4^5 = 1024$), as is the case in Example 3.1. However, in the present example, the elements of $T$ are chosen uniformly at random from $W_{S,V}$. Figure 6 provides an efe-histogram for $T$. The binary search algorithm was also applied to $T$ and completed thirteen cycles. The efe-histogram is not negatively skewed, and the difference between the efe value of the approximate solution, found by the binary search algorithm, and the mean of the efe-histogram is only 0.62, to three sf, compared to 4.26 for Example 3.1. A second choice for the elements of $T$ was then made uniformly at random from $W_{S,V}$. The approximate solution, found by the binary search algorithm, for the second choice of $T$ was very different to that of the first choice of $T$.

Figure 6: An efe-histogram for Example 3.3 using 200,000 observations and a bin interval of 0.01. For each cycle of the binary search algorithm, the efe value of the approximate solution obtained is shown (triangular marker).

Example 3.4.
In this example the system is again comprised of five nodes with a four state node repertoire. However, $|T| = 1024$ such that there is exactly one observation of each element of $W_{S,V}$ in $T$. In this case if we take the probability distribution $P$, in Definition 1.1, to be uniform over $W_{S,V}$ then $efe(\cdot, \cdot, T) = efe(\cdot, \cdot, P)$; see Definition 1.5. In particular, if we let $T'$ denote the typical data in Example 3.3 then the present example is the limit case for Example 3.3 as $|T'| \to \infty$. Figure 7 provides an efe-histogram for $T$. The binary search algorithm was applied to $T$ but stopped before completing one cycle because the minimum of the efe values, obtained in Step 2 of the algorithm, was given by many of the pairs of weighted relations tested in Step 2. This is due to a type of symmetry within $T$ which we now consider.

We can represent $T$ in the form of a table with each row corresponding to a typical data element; e.g. see Tables 1 and 2. A transformation of $T$ can be made by, for example, switching the content of columns 3 and 4 of $T$, which is equivalent to switching round the node labels at the top of the columns. Table 6 presents one of the pairs of weighted relations that gave the minimum efe value obtained in Step 2 of the algorithm. A transformation of $R$ in Table 6 can be made by switching the content of columns 3 and 4, and then switching the content of rows 3 and 4. Clearly, the efe is invariant under performing both the transformation to $T$ and the transformation to $R$. Now, because $T$ is comprised of exactly one observation of each element of $W_{S,V}$, the rows of the transformed version of $T$ can be reordered to give back $T$ before the transformation was made. Since efe is invariant regarding the order of typical data elements, the efe value given by $T$ relative to $U$ and the transformed version of $R$ is the same as $efe(R, U, T)$.
Since $R$ and its transformed version are different, the minimum of the efe values, obtained in Step 2 of the algorithm, is given by more than one of the pairs of weighted relations tested in Step 2. The same argument also applies to the solutions to (1) and, consequently, these solutions vary greatly with respect to $d_1$. Also in the present example, for every fixed $U$ and $R$, the transformation on $T$ is a type of efe preserving involution (i.e. $T$ has a type of symmetry). More generally beyond the present example, the extent to which $T$ is invariant, up to the order of its rows following such transformations, may be important regarding the shape of the efe-histogram.

Figure 7: An efe-histogram for Example 3.4 using 200,000 observations and a bin interval of 0.0002.

Table 6: The weighted relation $R$ from one of the pairs of weighted relations, in Example 3.4, that gave the minimum efe value obtained in Step 2 of the binary search algorithm.

R      | node 1 | node 2 | node 3 | node 4 | node 5
node 1 | 1      | 0.25   | 0.25   | 0.75   | 0.25
node 2 | 0.25   | 1      | 0.75   | 0.25   | 0.25
node 3 | 0.25   | 0.75   | 1      | 0.25   | 0.25
node 4 | 0.75   | 0.25   | 0.25   | 1      | 0.25
node 5 | 0.25   | 0.25   | 0.25   | 0.25   | 1

Remark 3.1.
Note that, in Example 3.4, the involution on $T$ can be put into a broader context as an element of a group of permutations of the contents of the columns of $T$. Similarly, the transformation applied to $R$ is an element of a group of such transformations on $Y_S$. There is also a similar group of transformations on $Y_V$. Beyond Example 3.4, for a given system it may be that such a transformation on $Y_S$ acts almost as the identity on the solutions to (1). In this case the system has defined geometry on $S$, under the requirement that the efe is minimised, that has a symmetry such as a rotation or reflection etc.

Upon consideration of the positively skewed efe-histogram in Figure 7, the reader might ask why we do not look for pairs of weighted relations that maximise efe instead of minimising it. For every given system, the weighted relations $U \in Y_V$ and $R \in Y_S$ that maximise efe are the constant functions which everywhere take the value 1; see Definition 1.5. In the next example the typical data is obtained from colour digital photographs.
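The invariance argument of Example 3.4 and Remark 3.1 can be verified computationally on an assumed three-node, two-state system in which $T$ contains every data element exactly once; the $U$ and $R$ below are illustrative:

```python
from itertools import product
from math import log2

# Sketch of the invariance argument: when T contains every data element
# exactly once, permuting two node columns of T can be undone by reordering
# rows, so T gives the same efe relative to the correspondingly permuted R.
n, V = 3, [0, 1]
W = list(product(V, repeat=n))
T = list(W)                                   # one observation per element

U = {(x, y): 1.0 if x == y else 0.3 for x in V for y in V}
R = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.6], [0.2, 0.6, 1.0]]

def efe(R, T):
    def d1(A, B):
        return sum(abs(A[i][j] - B[i][j]) for i in range(n) for j in range(n))
    def induced(f):
        return [[U[(f[a], f[b])] for b in range(n)] for a in range(n)]
    def fe(f_i):
        r = d1(R, induced(f_i))
        return log2(sum(1 for f_j in W if d1(R, induced(f_j)) <= r))
    return sum(fe(f) for f in T) / len(T)

# Transformation: swap the last two nodes in both T and R (an involution).
perm = (0, 2, 1)
T_perm = [tuple(f[p] for p in perm) for f in T]
R_perm = [[R[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

assert sorted(T_perm) == sorted(T)            # T is invariant up to row order
assert abs(efe(R, T) - efe(R_perm, T)) < 1e-9 # efe preserved, yet R != R_perm
```

Since `R_perm` differs from `R` but yields the same efe, the minimiser cannot be unique, which is why the binary search in Example 3.4 stops at a tie.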
Example 3.5.
In this example 600 digital photographs of the world around us are used. The typical data is obtained usingexactly the method shown in Figure 2, where the photographs have a nine colour red/green palette. Hence, T = andthe system is comprised of five nodes with a nine state node repertoire ( W S , V = ) . The system is too large to apply thebinary search algorithm. Hence, in this case, approximate solutions to (1) are guessed and their associated efe values arecompared with an efe -histogram for the system. Table 7 presents the guess for R; the right hand side of Figure 3 providesa graph illustration for R. Table 8 gives two different guesses, U ′ and U, for the weighted relation on V (note that the noderepertoire labels are of the form red,green i.e. , is the label for pure red). Figure 8 provides a graph illustration of Table 7: Guess for R in Example 3.5. R node 1 node 2 node 3 node 4 node 5node 1 1 0.95 0.65 0.95 0.75node 2 0.95 1 0.95 0.75 0.95node 3 0.65 0.95 1 0.60 0.75node 4 0.95 0.75 0.60 1 0.95node 5 0.75 0.95 0.75 0.95 1 the weighted relations in Table 8. Figure 9 provides an efe -histogram for T and, whilst U ′ is an obvious first guess, it isU that gives the lower efe value. With respect to U, elements of V of the form x , x + a, where a is constant, are morestrongly related than elements of the form x , a − x; i.e. with respect to U, the representative of pure red is very distinct from V in Example 3.5. U ′ U U ′ U Figure 8: Graph illustration of the weighted relations in Table 8, showing strongest relationships ( solid lines ),intermediate relationships ( dash lines ) and, for U only, weak intermediate relationships ( dotted lines ). Figure 9: An efe-histogram for Example 3.5 using 5000 observations and a bin interval of 0.05. The valuesefe ( R , U ′ , T ) = . ( R , U , T ) = . triangular marker ).14 he representative of pure green. 
We note that, given the efe-histogram and the system, R and U appear to be favourable and appropriate weighted relations. However, R and U are still only guesses and actual solutions to (1) could be somewhat different. Ideally we would use the binary search algorithm on a similar system but with a larger node repertoire and larger typical data, but this comes at a high computational expense. We now have the first of three investigations concerning the theory.
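Before moving on, here is a minimal computational sketch of float entropy and efe for a toy system. It assumes (following the structure of Definition 1.5, which is not reproduced in this section) that a state S induces the weighted relation R{U, S} with entries U(S(a), S(b)), and it uses a Euclidean-type metric between weighted-relation tables; the function names and the toy system are illustrative, not the paper's notation.

```python
import itertools
import math

def induced_relation(U, state):
    # R{U, S}: the weighted relation on the nodes induced by U and a state S,
    # assumed here to have entries U(S(a), S(b))
    n = len(state)
    return [[U[state[a]][state[b]] for b in range(n)] for a in range(n)]

def d(A, B):
    # a Euclidean-type metric between weighted-relation tables (one choice)
    return math.sqrt(sum((A[i][j] - B[i][j]) ** 2
                         for i in range(len(A)) for j in range(len(A))))

def float_entropy(R, U, Si, all_states):
    # fe(R, U, S_i): log2 of the number of states whose induced relation is
    # at least as close to R as that of S_i
    ref = d(R, induced_relation(U, Si))
    count = sum(1 for Sj in all_states if d(R, induced_relation(U, Sj)) <= ref)
    return math.log2(count)

def efe(R, U, typical_data, all_states):
    # expected float entropy: the mean float entropy over the typical data T
    return sum(float_entropy(R, U, Si, all_states)
               for Si in typical_data) / len(typical_data)

# three binary nodes; the constant weighted relations (everywhere 1) maximise efe
states = list(itertools.product([0, 1], repeat=3))
U1 = [[1, 1], [1, 1]]
R1 = [[1] * 3 for _ in range(3)]
print(efe(R1, U1, states, states))   # maximum possible: log2(8) = 3.0
```

Consistent with the remark above about constant weighted relations, every state is then indistinguishable from every other, so the efe takes its maximum value.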
In this subsection we look at base changing operations, base branching structure, and the effect of changing base on efe-histograms.
Here we look at two different types of base changing operations. One type involves combining nodes whilst the other involves splitting nodes. Many operations of the same type are equivalent in the sense that the resulting systems only differ in the choice of labels used for nodes or repertoire elements. Furthermore, every combining operation is the inverse of some splitting operation and vice versa. As an example, suppose we have a system with |S| = 6, |V| = 2, |S′| = 3 and |V′| = 4; Table 9 shows such a base change.

Table 9: Example of changing the base of a system. [The table shows the typical data S_t for the six original nodes; a node allocation (node 1, node 4), (node 5, node 2), (node 6, node 3); the resulting three-node data S′_t with paired states; and a repertoire allocation assigning a single label v′_i to each pair of states.]

More generally, from Table 9, we see that there are 6! different possible node allocations and 4! different possible repertoire allocations. Hence, in this case, the total number of such combining operations (or splitting operations if reversing the process) is 6!·4! = 17280, and |W_{S,V}| = 2^6 = 64 = 4^3 = |W_{S′,V′}|. Similarly we note that 64 = 8^2, so that there are 6!·8! different combining operations resulting in systems with two nodes and an eight state node repertoire.

Such operations do have an effect on efe-histograms. Indeed, since |W_{S,V}| = 64, we can apply a combining operation that results in a system with one node and a node repertoire that has a state for every state of the system. The resulting efe-histogram has a standard deviation of zero and is located at the maximum possible efe value for a system with 64 states, which is log₂(64) = 6; see Definition 1.5 and Subsection 2.3. We will further consider the effect of base changing operations on efe-histograms in Subsection 3.1.3.
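The combining operation just described can be sketched directly. The node grouping below mirrors the allocation pattern of Table 9, with the repertoire allocation left as tuples rather than relabelled; the code and its names are illustrative.

```python
import itertools
import math

def combine(state, node_groups):
    # a combining operation: each group of old nodes becomes one composite
    # node whose state is the tuple of the old nodes' states; a repertoire
    # allocation would then relabel these tuples with single symbols
    return tuple(tuple(state[i] for i in group) for group in node_groups)

old_states = list(itertools.product([0, 1], repeat=6))  # |W_{S,V}| = 2^6 = 64
groups = [(0, 3), (4, 1), (5, 2)]  # (node 1, node 4), (node 5, node 2), ...
new_states = [combine(s, groups) for s in old_states]

# the operation is a bijection, so |W_{S',V'}| = 4^3 = 64 as well
print(len(set(new_states)))                   # 64
# node allocations times repertoire allocations: 6! * 4!
print(math.factorial(6) * math.factorial(4))  # 17280
```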
We have already noted in Subsection 3.1.1 that many base changing operations are equivalent in the sense that the resulting systems only differ in the choice of labels used for nodes or repertoire elements. The advantage of this redundancy, for appropriate systems, is that it allows us to apply a splitting operation in the first instance (i.e. we can start with a repertoire allocation) instead of being restricted to combining operations. Alternatively we can avoid this redundancy by treating a system in its initial base as being at the bottom of a branching structure which branches under combining nodes such that each branch terminates with the system represented by a single node. Table 10 shows one such branch. We note that, with regard to weighted relations on the nodes, the order of the columns in Table 10 is not important as long as column headings and column contents are kept together. Furthermore, there is no repertoire allocation since we retain the vector form of the node states. These simplifications reduce the number of combining operations discussed in Subsection 3.1.1 from 17280 to 120.

Table 10: One branch of a base branching structure. [The table shows the six-node typical data S_t; a branch combining the nodes as (node 1, node 4), (node 5, node 2), (node 6, node 3) to give S′_t; and the end of the branch, ((node 5, node 2), (node 1, node 4), (node 6, node 3)), which represents the system by single-node data S′′_t.]

Now, the definition of float entropy in Definition 1.5 uses only one base for a system. However, multi-relational float entropy (see Subsection 4.1) involves more weighted relations by involving more than one base. For some systems it may be that particular bases are important regarding weighted relations that minimise efe and/or regarding maximising the length of the left tail of the efe-histogram.
Indeed, we have already noted that combining all of the nodes of a system into a single node is not good in this respect, showing that other bases are preferable; see Subsection 3.1.1. Moreover, a change of base may allow a system to define weighted relations at a higher level of meaning. For example, the solutions to (1) may define (to a high resolution) a weighted relation R on the nodes of some system, giving two dimensional geometry. For a particular branch of the base branching structure, the states of the composite nodes will be images under the geometry (given by R) on the nodes that have been combined. Hence, under the requirement of further minimising efe, the system defines a weighted relation on the repertoire of the composite nodes and hence on a set of images; see Subsection 4.1. This may have relevance to some aspects of the Gestalt theory of visual perception; see [5].

Comparing base changing operations with base branching structure, we note that allowing arbitrary application of successive combining and splitting operations may provide too much freedom, in the sense that a system may then define too many weighted relations (under requirements such as the minimisation of efe) to specify a single consistent interpretation of the states of the system. Hence, restricting the theory to the base branching structure may be desirable (or perhaps at least to some further generalisation of the base branching structure). In spite of this we will now look at the effect of combining nodes and splitting nodes on efe-histograms.

3.1.3 The effect of base changing operations on efe-histograms

The following lemma says that uniform randomness is preserved by both combining and splitting operations.
Lemma 3.1.
Suppose we have a system where the probability distribution P in Definition 1.1 is uniform over W_{S,V}. For any given combining or splitting operation, as described in Subsection 3.1.1, let S′ and V′ (with V′ as small as possible) be such that W_{S′,V′} is the codomain of W_{S,V} under the operation. Furthermore, for S′_i an element of the image of W_{S,V} under the operation, define P′(S′_i) := P(A_{S′_i}), where A_{S′_i} is the preimage of S′_i. Then P′ is a uniform probability distribution over W_{S′,V′}.

Proof. Immediate since the operation is a bijection from W_{S,V} to W_{S′,V′}.

We now consider the case where P is far from uniform over W_{S,V}. Because computational resources are limited, a choice had to be made between looking at the effects of many different base changing operations on just one system and looking at the effects of one or two different base changing operations per system for several different systems. The latter was chosen in order to avoid accidentally giving results for some highly unusual system. Typical data was obtained for each of the systems from digital photographs. The method in Figure 1 was used except that the number of shades in the gray scale, the location of the sampling grid and the number of nodes involved varied from system to system; details are given on the left-hand side of Table 11.

Table 11: Seven systems from which efe observations were taken both before and after the application of a base changing operation. [For each system the table gives |S|, |V|, |S′| and |V′|.]

For each of the systems, 400 typical data elements were collected. Subsequently a large number of efe observations were obtained using the method described in Subsection 2.3. The same number of efe observations was then obtained having applied a base changing operation to the typical data. Apart from the size of the base (i.e. the size of the repertoire V′ in Table 11), the base changing operation was chosen at random for each system. With respect to the seven systems in Table 11, we note that System 3 is actually the same as System 2 in the sense that the same typical data is used. However, a different base changing operation has been applied to System 3 than that applied to System 2. Similarly, System 6 and System 7 are the same but have had different base changing operations applied. For each system, Figure 10 compares statistics obtained from the efe observations made before the change of base with statistics obtained from the efe observations made after the change of base.

Figure 10: For each system the figure shows: the skewness, and mean minus minimum, of the efe-histogram when using the original base (x axis) and after changing to the alternative base (y axis); the shift in the minimum and the shift in the median when changing back to the original base from the alternative base.

Note that, in Figure 10, skewness is measured using the adjusted Fisher-Pearson standardized moment coefficient. For each of the systems investigated it can be seen from Figure 10 that, when changing back to the original base from the alternative base, the efe-histogram undergoes an increase in negative skewness and mean minus minimum as well as a right shift in location. Furthermore, for most of the systems, the minimum efe value observed when using the original base is to the left of the minimum efe value observed when using the alternative base. These results suggest that the bases maximising the length of the left tail of the efe-distribution (here approximated by an efe-histogram) are important for the theory presented in the present paper.
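For reference, the skewness statistic used in Figure 10 can be computed as follows. This is the standard adjusted Fisher-Pearson standardized moment coefficient G1; the sample values below are invented purely to illustrate the sign convention.

```python
import math

def adjusted_skewness(xs):
    # adjusted Fisher-Pearson standardized moment coefficient:
    # G1 = g1 * sqrt(n(n-1)) / (n-2), where g1 = m3 / m2^(3/2)
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return (m3 / m2 ** 1.5) * math.sqrt(n * (n - 1)) / (n - 2)

# a sample with a long left tail (like a favourable efe-histogram)
# has negative skewness under this convention
print(adjusted_skewness([0.2, 2.5, 2.8, 2.9, 3.0, 3.0, 3.1]))
```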
One caveat concerning this investigation is that, for each system, the variance in the observed minimum might be rather high because the distribution being sampled has a very thin left tail. We now move on to our next investigation.

3.2 Joining and partitioning systems

Consider the visual cortex and the auditory cortex of the brain. There is evidence that the brain defines relationships between the states of these different brain regions at a high level of meaning (i.e. between images and sounds); see [6]. However, at the lower level of meaning at which the images and the sounds are defined it may be that the two brain regions are self contained. The brains of two different people are perhaps a more overt example of self containment, or privacy.

In the context of the theory of the present paper, suppose we have two systems. We can solve (1) for both of the systems separately and sum the resulting minimised efes. If this sum is significantly more than the minimum efe obtained when joining the two systems then it makes sense to consider the two systems as a single system. Examples of such systems are easy to construct. Conversely, for a given system, it might be possible to partition the set of nodes S such that the sum of the minimum efes of the resulting systems is less than that of the original system. In this case, at least in the given base, it makes sense to consider the original system as several different systems. It is not so easy to find examples of such systems, at least when the systems are small. However, Table 12 provides an example where the minimum efe of the system is greater than 3 whilst, after partitioning, the sum of the minimum efes is only 2.8. The result was obtained from the system by investigating an efe-histogram for the system, and by running the binary search algorithm.

Table 12: Typical data of a system before and after a partition which results in lowering the total minimum efe. [The table shows 25 typical data elements for the four-node system S_t and, after partitioning into (node 1, node 2) and (node 3, node 4), five typical data elements for each of the two-node systems S′_t and S′′_t.]

Note that the typical data of the system is such that the partitioned systems are independent when considered as random variables (this is why the number of typical data elements can be divided by five after partitioning).

The number of different partitions of a system with n nodes is given by the Bell number B_n. For |S| = n we have B_n = B^(n)(0), where B(x) = exp(e^x − 1) is the exponential generating function for the Bell numbers; see [7]. This number quickly becomes large as n increases, making the investigation of all the different partitions of a system computationally expensive for all but small systems.

In the final investigation of this section we consider the metric used in Definition 1.5. Remark 1.3 suggests that the theory presented in the present paper is independent of the choice of metric used in Definition 1.5 provided that the metric determines a total order on [0, 1] in some natural way. Before considering this in more detail, we have the following example.

Example 3.6.
This example uses the typical data T that was obtained in Example 3.1. The binary search algorithm of Subsection 2.2 was again applied to T but this time d_∞ was used in Definition 1.5 instead of d. After four cycles, the weighted relations in Table 13 were returned.

Table 13: Approximate solution for Example 3.6 using d_∞ in Definition 1.5. [U is omitted here; R is as follows.]

R       node 1   node 2   node 3   node 4   node 5
node 1  1        0.96875  0.84375  0.90625  0.90625
node 2  0.96875  1        0.84375  0.90625  0.96875
node 3  0.84375  0.84375  1        0.84375  0.84375
node 4  0.90625  0.90625  0.84375  1        0.90625
node 5  0.90625  0.96875  0.84375  0.90625  1

We see that U in Table 13 yields the same graph illustration as that given by U in Table 5 from Example 3.1. The same cannot be said when comparing R in Table 13 with R in Table 5, although there are some similarities. However, it also turns out that U and R in Table 5 are a better approximate solution to (1) in the present example, i.e. when using d_∞ instead of d, than that given by U and R in Table 13; indeed the efe drops.

The result in Example 3.6 is perhaps not surprising once we appreciate certain similarities between d_∞ and d. To appreciate these similarities and further results, the following assumption will be useful.

Assumption 3.1.
Under this assumption, for a metric d : [0, 1]^n × [0, 1]^n → R^+, there exists a metric d′ : [0, 1] × [0, 1] → R^+ such that for all 1 ≤ i ≤ n, (c_1, ..., c_{i−1}, c_{i+1}, ..., c_n) ∈ [0, 1]^{n−1} and a, b ∈ [0, 1] we have

d((c_1, ..., c_{i−1}, a, c_{i+1}, ..., c_n), (c_1, ..., c_{i−1}, b, c_{i+1}, ..., c_n)) = d′(a, b).

Furthermore, there is an a_min ∈ [0, 1] (e.g. classically a_min = 0) such that ≤_d, given by

a ≤_d b ⇔ d′(a_min, a) ≤ d′(a_min, b),   (2)

is a total order on [0, 1] and (up to reverse ordering) no other choice of a_min ∈ [0, 1] in (2) gives a different total order. Moreover, ≤_d determines a maximum element a_max ∈ [0, 1] (e.g. classically a_max = 1) which, if using d in Definition 1.5 and ≤_d in the interpretation of weighted relations, would be used in the definition of a reflexive weighted relation in Definition 1.2.

Remark 3.2.
Assumption 3.1 gives rise to the following remarks.

1. Under Assumption 3.1 it can be argued that d determines a single well defined metric on [0, 1] and that this metric is d′.
2. Furthermore, d′ (and hence d) determines intervals in [0, 1], i.e. [a, b]_d := {c ∈ [0, 1] : a ≤_d c ≤_d b} for a, b ∈ [0, 1], and the length of such intervals is given by d′(a, b).
3. With the above two remarks in place, we note that, in each coordinate, d defines d-uniform random variables on [0, 1]; i.e. if [a, b]_d and [c, d]_d are of the same length then the probability of a d-uniform random variable taking a value in [a, b]_d is the same as it taking a value in [c, d]_d.
4. To appreciate some of the similarities between d_∞ and d, note that all of the metrics given in Definition 1.4 satisfy Assumption 3.1 and that d′(a, b) = |a − b| = d′_∞(a, b) for all a, b ∈ [0, 1]. There are also some important differences between d and d_∞; we will look at some of these shortly.

We now return to the issue of metric independence.
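The notion of a d-uniform random variable in Remark 3.2 can be illustrated numerically. Taking the relabelled metric d_f of Lemma 3.2 with the arbitrary bijection f(x) = x³, a d_f-uniform variable is obtained by pushing a uniform variable through f⁻¹, and intervals of equal d_f-length then carry equal probability; everything below is an illustrative sketch under that choice of f.

```python
import random

f = lambda x: x ** 3              # an illustrative bijection on [0, 1]
f_inv = lambda y: y ** (1 / 3)

def df_length(a, b):
    # interval length under the relabelled metric: |f(b) - f(a)|
    return abs(f(b) - f(a))

random.seed(1)
# d_f-uniform draws: push uniform draws through f^{-1}
samples = [f_inv(random.random()) for _ in range(200000)]

# [0, 0.5] and [0.5, x] with x = f^{-1}(0.25) both have d_f-length 0.125
x = f_inv(0.25)
p1 = sum(1 for s in samples if s <= 0.5) / len(samples)
p2 = sum(1 for s in samples if 0.5 < s <= x) / len(samples)
print(round(p1, 2), round(p2, 2))   # both close to 0.125
```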
Lemma 3.2.
Suppose Definition 1.5 uses a metric d that satisfies Assumption 3.1, and let f : [0, 1] → [0, 1] be a bijection. Then

d_f : [0, 1]^n × [0, 1]^n → R^+,  d_f((a_1, ..., a_n), (b_1, ..., b_n)) := d((f(a_1), ..., f(a_n)), (f(b_1), ..., f(b_n))),

is also a metric on [0, 1]^n and the theory in the present paper is independent of a change of metric from d to d_f in Definition 1.5 provided that, in Definition 1.2, f^{−1}(a_max) is used in the definition of a reflexive weighted relation, ≤_{d_f} is used in place of ≤_d in the interpretation of values given by weighted relations and, when obtaining efe-histograms (see Subsection 2.3), d_f-uniform random variables are used instead of d-uniform random variables.

Proof. Lemma 3.2 follows immediately from the fact that d_f is merely d under relabelling each a ∈ [0, 1] with f^{−1}(a).

It is expected that a more general result than Lemma 3.2 is possible such that d′ in Assumption 3.1 may have dependence on (c_1, ..., c_{i−1}, c_{i+1}, ..., c_n) ∈ [0, 1]^{n−1} for 1 ≤ i ≤ n. In this case, the interpretation of each value in a weighted relation table would be dependent on the other values in that table; the interpretation itself is determined by the metric d being used. To appreciate one of the differences between d and d_∞ we require the following definition.

Definition 3.1.
Let d : [0, 1]^n × [0, 1]^n → R^+ be a metric conforming to Assumption 3.1. Moreover, let a, b, c ∈ [0, 1]^n be such that d′(a_i, c_i) ≥ d′(b_i, c_i) for i = 1, ..., n, and for one or more i we have d′(a_i, c_i) > d′(b_i, c_i). If for all such a, b, c ∈ [0, 1]^n we have d(a, c) > d(b, c) then d is called an increasing function of coordinatewise distance.

The metric d is an increasing function of coordinatewise distance but, for n > 1, d_∞ is not; indeed d_∞(a, c) = 1 = d_∞(b, c) for n = 2, a = (1, 0.5), b = (1, 0) and c = (0, 0). It is hoped that, upon further investigation, a class of metrics will emerge as being the most optimal (in some natural way) in the context of the theory of the present paper. Hence, independence arguments would then only need to apply to this class of metrics. It may be that being an increasing function of coordinatewise distance is a necessary condition for a metric to be optimal, but this is for future work. Regarding the theory in the present paper, it is certainly the case that the meaning of the values in weighted relation tables is given by the characteristics of the metric being used in Definition 1.5.

We conclude this section with a reminder of the variety of different metrics that there are on R^n. Lemma 3.3 shows that, even when restricting attention to metrics that are equivalent to d, the variety is great.

Lemma 3.3.
Let d be the Euclidean metric on R^n, n ∈ N. For all a, b ∈ R^n, define d_f(a, b) := d(f(a), f(b)), where f : R^n → R^n is such that f : (R^n, d) → (R^n, d) is a homeomorphism. Then (R^n, d_f) is a metric space and d_f is equivalent to d; i.e. the open subsets of (R^n, d_f) are the same as those of (R^n, d).

Proof. Since f : (R^n, d) → (R^n, d) is a homeomorphism, f : R^n → R^n is a bijection. From this it easily follows that d_f is a metric. To show equivalence we need to show that A ⊆ (R^n, d_f) is open if and only if A ⊆ (R^n, d) is open. Hence, to show the 'only if' direction, let A ⊆ (R^n, d_f) be open. In this direction it is enough to show that f(A), as a subset of (R^n, d), is open, since then A = f^{−1}(f(A)) ⊆ (R^n, d) is open by f : (R^n, d) → (R^n, d) being a homeomorphism. Let a′ ∈ f(A). Then a′ = f(a) for some a ∈ A. Since A ⊆ (R^n, d_f) is open, there exists ε > 0 such that for all b ∈ R^n with d_f(a, b) < ε we have b ∈ A, and thus f(b) ∈ f(A). Hence, for all b′ ∈ R^n with d_f(a, f^{−1}(b′)) < ε we have f^{−1}(b′) ∈ A, and thus f(f^{−1}(b′)) = b′ ∈ f(A). Noting that d_f(a, f^{−1}(b′)) = d(f(a), f(f^{−1}(b′))) = d(a′, b′), it follows from the last statement that for all b′ ∈ R^n with d(a′, b′) < ε we have b′ ∈ f(A). Hence, f(A) ⊆ (R^n, d) is open. The proof in the other direction is similar.

Example 3.7.
Let f : R² → R² conform to the conditions of Lemma 3.3. If f maps ℓ in Figure 11 to the unit circle in (R², d) and f(0) = 0 then ℓ ⊆ R² is the unit circle in (R², d_f).

Figure 11: The path ℓ in Example 3.7.

Section 4 provides some generalisations of Definition 1.5, a comparison with both Integrated Information Theory and Shannon entropy, followed by the conclusion. In Subsection 4.1 we extend Definition 1.5 to involve many more weighted relations.
We start with a definition.
Definition 4.1 (Multi-relational float entropy). Let S be as in Definition 1.1, let U ∈ Y_V, and let R ∈ Y_S. Furthermore, let U_1, U_2, ... and R_1, R_2, ... be weighted relations analogous to U and R but for the system in different bases; see Subsection 3.1.2 on base branching structure. The multi-relational float entropy of a data element S_i ∈ W_{S,V}, relative to U, U_1, U_2, ... and R, R_1, R_2, ..., is defined as

fe(R, U, R_1, U_1, R_2, U_2, ..., S_i) := log₂ |{S_j ∈ W_{S,V} : C(R, U, S_i, S_j) ∧ C_1(R_1, U_1, S_i, S_j) ∧ ···}|,

where the first condition C(R, U, S_i, S_j) is d(R, R{U, S_j}) ≤ d(R, R{U, S_i}), as in Definition 1.5.

In Definition 4.1, all of the conditions C, C_1, ... need to be satisfied for a data element S_j to contribute toward the multi-relational float entropy of a data element S_i. The additional conditions should be those that increase the length of the left tail of the efe-distribution. For example, for some given system (and under the requirement of minimising expected multi-relational float entropy), R might be such that it defines two dimensional geometry on the nodes of the system. Furthermore, for a particular branch of the base branching structure, the states of the composite nodes will be images under the geometry (given by R) on the nodes that have been combined. Hence, for C_1 analogous to C but using the new base, the system defines a weighted relation U_1 on the repertoire of the composite nodes and hence on a set of images. This may have relevance to some aspects of the Gestalt theory of visual perception; see [5].

Suppose that the system is part of the brain. As suggested in Subsection 3.2, at the level of meaning at which images are defined by the visual cortex and sounds are defined by the auditory cortex, it may be that the two brain regions are self contained; i.e. they may be separate systems.
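The counting structure of Definition 4.1 can be sketched with the conditions treated as opaque predicates. In practice each C_k compares relations induced in a different base; the toy predicates and state set below are invented purely to show that extra conditions can only shrink the count and hence the float entropy.

```python
import math

def multi_fe(conditions, states, Si):
    # fe(..., S_i) = log2 of the number of states S_j for which every
    # condition C_k(S_i, S_j) holds
    count = sum(1 for Sj in states if all(C(Si, Sj) for C in conditions))
    return math.log2(count)

# toy stand-ins for the conditions (invented for illustration)
states = list(range(16))
c_base = lambda Si, Sj: True             # satisfied by every state
c_extra = lambda Si, Sj: abs(Si - Sj) <= 2

print(multi_fe([c_base], states, 5))           # 4.0 = log2(16)
print(multi_fe([c_base, c_extra], states, 5))  # smaller: an extra condition
                                               # can only shrink the count
```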
However, at a higher level of meaning (and for a particular branch of the base branching structure), one of the nodes of the brain will be the whole visual cortex and another will be the whole auditory cortex. The states of the visual cortex are visual objects and the states of the auditory cortex are sounds. Applying the present theory appropriately should give a weighted relation on the two cortical regions and another giving relationship values between objects and sounds. One caveat, however, is that in this case the two cortical regions as nodes do not share the same node repertoire, and so some care needs to be taken when considering how to apply the definitions of the present paper. There is also evidence of sparse coding in various cortical regions; see [8] and [9]. For example, there are neurons that are active if and only if activity related to a specific object (auditory or visual etc.) is present in the respective cortex. Hence, under the minimisation of efe also on this set of neurons, the system defines relationships between objects.

Finally, Definition 1.5 allows time dependent versions of the results presented in the present paper, and in general. Suppose in Example 3.1 that the digital photographs sampled are in fact frames from videos. Choose an integer k ∈ N. For each sampled frame, sample in the same way the subsequent k − 1 frames (i.e. each node in each typical data element is replaced by k nodes that form a sequence of states of the original node over a short time period). In this case, it is anticipated that if U and R solve (1) then R will define geometry on the nodes of the system that has a dimension for time.

This subsection starts with an initial comparison between the theory of the present paper and Giulio Tononi's Integrated Information Theory (IIT) of consciousness. IIT has gained much attention in recent years (see [10], [11], [12], [13] and [14]), and it may be that the two theories are quite compatible in some areas.
There is a significant difference in emphasis in the formulation of the two theories. In [11] IIT has been formulated and further developed with the intention that it will satisfy certain self-evident truths about consciousness, which Tononi refers to as axioms. In brief, the axioms are as follows:

• Existence: Consciousness exists.
• Composition: Within the same experience, one can see, for example, left and right, red and green, a circle and a square, a red circle on the left, a green square on the right, and so on.
• Information: Consciousness is informative: each experience differs in its particular way from an immense number of other possible experiences.
• Integration: Each experience is irreducible to independent components.
• Exclusion: At any given time there is only one experience having its full content. This axiom also states constraints on consciousness such as resolution.

From these axioms IIT postulates a number of properties that physical systems must satisfy in order to generate consciousness. These properties introduce a substantial amount of initial theory involving cause and effect within systems. Since this initial theory is fundamental to the formulation of IIT, it is crucial that the set of axioms is correct and complete. The theory of the present paper has a significant difference in emphasis because it uses the following axiom in its formulation:

• Relations: Consciousness is awash with underlying relationships which provide the relational content of experience.

It is natural that relations should be fundamental to the formulation of a theory of consciousness because, in one form or another, they are ubiquitous among mathematical structures. Hence, in the author's opinion, this axiom should be added to Tononi's list of axioms. However, IIT does have something to say about the quality of conscious experience, and this is discussed below. It is worth noting that the theory in the present paper is more or less compatible with the IIT axioms.
For example, regarding the unity of consciousness (integration), according to the theory in the present paper, when a brain state is interpreted in the context of the weighted relations that minimise expected multi-relational float entropy, the brain state acquires meaning in the form of the relational content of the associated experience. Furthermore, regarding resolution (part of the exclusion axiom), we recall that, for all but trivial examples, (1) will have many solutions and, hence, only defines weighted relations up to a certain resolution that depends on the system. Of course a more rigorous comparison with the axioms is desirable, but this is for future work.
One of the strengths of IIT is that it attempts to distinguish between brain regions that contribute toward consciousness and those that do not. This is undertaken at several different scales, from small mechanisms (i.e. small subsystems) up to whole systems such as the brain. For this purpose, at the scale of mechanisms, the theory introduces a quantity called Integrated Information, and analogous quantities are introduced for larger scales. It is worth giving the reader some insight into how this quantity is defined for mechanisms. Suppose we have a small number of logic gates that are interconnected in some way, and that the resulting mechanism updates over discrete time. The current state of the mechanism (say at time t = 0) provides causal information about what the state of the mechanism might have been at time t = −1. In fact it implies a probability distribution on the set of all states of the mechanism for t = −1. If we were to partition the mechanism in some way by cutting connections and treating cut inputs to gates as extrinsic noise then, in many cases, there would be a reduction in the amount of causal information that the current state of the mechanism provides about what the state of the mechanism might have been at t = −1. As in the case of the unpartitioned mechanism, the partitioned mechanism also implies a probability distribution on the set of all states of the mechanism for t = −1. The reduction in the causal information is quantified by measuring the distance between these two probability distributions using the Wasserstein metric, also known as the earth-mover's distance. If, out of all the different ways to partition the mechanism, the partition chosen actually loses the minimum amount of causal information then the partition is called the minimum information partition (MIP) for the mechanism in its given state at t = 0. The quantity of causal information of a mechanism in its given state is defined as the Wasserstein distance between the probability distribution for the state of the unpartitioned mechanism, at t = −1, and the probability distribution for the state of the mechanism's MIP at t = −1. In IIT there is also an analogous definition for the quantity of effect information of a mechanism in its given state. Finally, the quantity of integrated information of a mechanism in its given state is defined as the minimum of its causal information and its effect information for that state. The integration postulate of IIT says that only when the quantity of integrated information is positive can a mechanism contribute to consciousness.

We will now consider an example from [12] which formed part of the motivation behind the definition of integrated information. We will see that there is an alternative (or complementary) interpretation of the example which leads in the direction of the theory of the present paper. Consider a digital-camera sensor chip made up of 1 million photodiodes. From the perspective of an external observer, the chip has a large number of different states. From an intrinsic perspective, however, the chip can be considered as 1 million independent photodiodes; cutting the chip down into individual photodiodes would not change the performance of the camera. It is hard to imagine that the chip can be conscious of the images that fall upon it. On the other hand, the visual experiences we enjoy are integrated and we experience whole images. Accordingly, cutting the visual cortex down into individual neurones would completely change the performance of the system.

It is then stated in the example that what underlies the unity of experience is a multitude of causal interactions among the relevant parts of the brain. From this we can see why cause and effect is a fundamental part of the definitions used in IIT, and why the theory developed in the direction it did. An alternative (or complementary) interpretation of the example is that the interactions between neurons make some states of the system more likely than other states; i.e. the system is inherently biased and this defines a probability distribution P on the set of states of the system.
The probability distribution is a property of the system itself and allows the system to define expected quantities. This allows the theory of the present paper to be developed with an emphasis on relations, which is desirable since relationships are an inherent part of consciousness.

Now let's consider the camera chip in the context of the theory of the present paper. Each photodiode is unbiased since its state is driven by its input signal. The 1 million photodiodes are completely independent. If the chip defines a probability distribution on its states at all (which is debatable) then it is the uniform distribution. In Examples 3.3 and 3.4 of the present paper, we saw that when P is uniform the solutions to (1) vary greatly and, hence, the system fails to define weighted relations that give a coherent interpretation of the states of the system. Furthermore, the associated efe-histogram is without a left tail. So, in contrast with IIT, the theory of the present paper suggests that, to contribute to consciousness, a mechanism will at least need an inherent probability distribution on its set of states that gives an efe-histogram with a long left tail. The length of the left tail may turn out to be of great importance; when the tail is very long, the solutions to (1) are very distinct from other weighted relations. The length of the left tail is also important in multi-relational float entropy regarding which branches of the base branching structure should be involved; see Subsection 4.1.

From a practical perspective, we might use cause and effect to estimate the inherent probability distribution of a mechanism. For a deterministic mechanism, we can estimate the probability of a state S_i as the number of states that cause S_i divided by the total number of states of the mechanism. Of course, Markov chain theory is appropriate here (particularly in the nondeterministic case) and a rigorous approach should be taken. Suppose we have a mechanism that has n states.
In IIT (see [12]), the Qualia space (Q-space) of the mechanism is an n-dimensional space with a real axis given for each state of the mechanism. Each probability distribution on the set of states of the mechanism defines a point in an (n − 1)-dimensional subspace of Q-space, noting that, for each probability distribution, the probabilities must sum to 1. The point closest to the origin in this subspace is given by the uniform distribution. For a given state of the mechanism at t = 0, the state defines a probability distribution on the set of states of the mechanism at t = −1; each subset of the connections of the mechanism determines such a distribution and, hence, a point in Q-space. Pairs of these points are joined by q-arrows; the connections of the mechanism involved in determining the point at the bottom of a q-arrow are included in the subset of connections involved in determining the point at the top of the q-arrow. This forms a lattice, embedded in Q-space, which has the uniform distribution at its bottom and the distribution given by the complete mechanism at its top. The shape that the lattice encloses is called the quale, and the q-arrows are a geometric realization of information relationships. Changing the state of the mechanism at t = 0 changes the quale.

The theory in [12] also introduces the notion of entanglement. Suppose a lattice in Q-space has a point p that is at the bottom of two q-arrows q_1 and q_2, which terminate at points p_1 and p_2 respectively. The connections of the mechanism involved in determining the points p_1 and p_2, when taken together, determine a point p_12. Treating p_1 and p_2 as vectors from the origin, if p_12 ≠ p + q_1 + q_2 then the q-arrows q_1 and q_2 are said to be tangled. In other words, the information relationship given by the q-arrow from p to p_12 does not reduce to the information relationships given by q_1 and q_2. With respect to vision, it is suggested in [12] that, for a mechanism in the form of a grid, connections of the mechanism that are close together will give entangled q-arrows in Q-space near the bottom of the lattice, but connections of the mechanism that are far apart will not. Hence, these entanglements give rise to the concept of local regions and, therefore, geometry.

From the perspective of the author of the present paper, entanglement is a desirable addition to the theory of integrated information since it acknowledges the need for the theory to include the capacity to define relationships. For comparison regarding the quality of consciousness, the aim of the present paper is to provide a mathematical theory for how the brain defines the relationships underlying consciousness.
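As an aside, the claim above that the uniform distribution is the point of the probability simplex closest to the origin is easy to check numerically. The following sketch (assuming numpy is available) samples random probability distributions on n states and compares their Euclidean norms with that of the uniform distribution; for any distribution p, ||p||^2 >= 1/n by Cauchy-Schwarz, with equality exactly at the uniform distribution.

```python
import numpy as np

# The probability simplex for n states: points with non-negative
# entries summing to 1, i.e. the (n - 1)-dimensional subspace of
# Q-space described above. Uniform has norm 1/sqrt(n).
rng = np.random.default_rng(0)
n = 5
uniform = np.full(n, 1.0 / n)

# Sample random distributions and check none is nearer the origin.
samples = rng.dirichlet(np.ones(n), size=10000)
norms = np.linalg.norm(samples, axis=1)
assert norms.min() >= np.linalg.norm(uniform) - 1e-12
```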
If applicable to the visual cortex, the examples in Section 3 show that the perceived relationships between different colours, the perceived relationships between different brightnesses, and the perceived relationships between different points in a person's field of view (giving geometry) are all defined by the brain in a mutually dependent way. If we were to apply the theory to the auditory cortex then the resulting weighted relations might define how we perceive the relationships between the pitches of the chromatic scale. Of course, more work is required. Although an early example, when considering the scope of the theory of the present paper, readers may find the Definitive Player Problem to be of interest; see [1]. In short, when IIT leans in the direction of defining relationships, synergies start to emerge with the theory of the present paper.

Shannon entropy is notably used in the neuroscience of consciousness. The definition of float entropy (see Definitions 1.5 and 4.1) has some similarity to that of Boltzmann's entropy. Whilst not to be confused with Shannon entropy, expected float entropy, efe, does have some similarities with Shannon entropy. Indeed, efe is a measure (in bits per data element) of the expected amount of information needed to specify the state of the system beyond what is already known about the system from the weighted relations provided. Shannon entropy is a measure of information content in data. As data becomes more random, Shannon entropy increases because structure in data is actually a form of redundancy. By solving (1) for a given system we obtain a structure in the form of weighted relations defined by the system. Relative to these weighted relations, if the system were to become more random then the efe value for the system would increase.
In order to make the similarities between efe and Shannon entropy clearer, consider the summation

$$\sum_{S_i \in \Omega_{S,V}} P(S_i) \log_2\!\left(\frac{1}{P(S_i \mid A_{S_i})}\right), \qquad (3)$$

where $A_{S_i} := \{ S_j \in \Omega_{S,V} : d(R, R_{\{U,S_j\}}) \le d(R, R_{\{U,S_i\}}) \}$. The summation in (3) is similar in form to the definition of Shannon entropy. Furthermore, (3) can be written as

$$\sum_{S_i \in \Omega_{S,V}} P(S_i) \log_2\!\left(\frac{\sum_{S_j \in A_{S_i}} P(S_j)}{P(S_i)}\right), \qquad (4)$$

and, when the probabilities in the argument of the logarithm are comparable, this will give a value similar to efe(R, U, P). Finally, we can write (4) as

$$H + \sum_{S_i \in \Omega_{S,V}} P(S_i) \log_2\!\left(\sum_{S_j \in A_{S_i}} P(S_j)\right), \qquad (5)$$

where H is the Shannon entropy of the system and, with consideration of the log function, the second term has a negative value between −H and 0. As per Example 3.4, even when P is uniform over $\Omega_{S,V}$, the second term of (5) need not be equal to 0. However, for U and R the constant functions which everywhere take the value 1, (5) simplifies to H.

4.3 Conclusion

The present paper significantly extends the work introduced in [1] by further developing the theory and testing it using several informative examples. We have noted the following two facts. Firstly, conscious experience is awash with underlying relationships. Secondly, for various brain regions, such as the visual cortex, the probability distribution over the different possible states of the system is far from being uniform owing to the effect of learning rules that weaken or strengthen synapses. Hebb's principle says that what fires together wires together, and the BCM version of Hebbian theory is one of many such learning paradigms; see [2] and [3]. There is also evidence for the relevance of BCM theory regarding the hippocampus; see [4]. Furthermore, the probability distribution over the states of the system is a property of the system itself, allowing the system to define expected quantities.
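The identity between the summation (3) and the decomposition (5) can be verified numerically. In the sketch below a hypothetical ranking of the states stands in for the distances d(R, R_{U,S_j}), so that each A_i is the set of states ranked at or below state i; the toy distribution is illustrative only.

```python
import math

# Toy check that (3) equals (5): P(S_i | A_i) = P(S_i) / P(A_i), so
# each log term splits into -log2 P(S_i) plus log2 P(A_i).
P = [0.5, 0.25, 0.15, 0.10]        # a non-uniform toy distribution
order = [0, 1, 2, 3]               # hypothetical distance ranking of states
PA = [sum(P[j] for j in order[:k + 1]) for k in range(len(P))]  # P(A_i)

# Summation (3): sum_i P(S_i) log2( 1 / P(S_i | A_i) )
expr3 = sum(P[i] * math.log2(PA[i] / P[i]) for i in range(len(P)))

# Decomposition (5): Shannon entropy H plus a term between -H and 0.
H = -sum(p * math.log2(p) for p in P)
second = sum(P[i] * math.log2(PA[i]) for i in range(len(P)))
assert abs(expr3 - (H + second)) < 1e-12
assert -H <= second <= 0
```

The second term is 0 only when every P(A_i) equals 1, which is why, for U and R constant at 1 (all states equidistant from R), the expression collapses to H.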
The theory in the present paper provides a link between the above facts. Under the requirement of minimising expected (multi-relational) float entropy, the brain defines relationships; the theory represents relationships using weighted relations. It is proposed that when a brain state is interpreted in the context of all these weighted relations, defined by the brain, the brain state acquires meaning in the form of the relational content of the associated experience. The examples in the present paper provide evidence that supports the theory.

In Example 3.1, T was obtained from digital photographs having a four-shade grey scale. In this case, T has defined the correct relationships under the requirement that efe is minimised. Similarly, in Example 3.5, T was obtained from digital photographs having a nine-colour red/green palette. We note that, given the system involved, R and U in this example also appear to be favourable weighted relations, and appropriate as approximate solutions to (1). However, in this case R and U were guessed and judged appropriate from the efe-histogram; the actual solutions to (1) could be somewhat different. The results in these examples suggest that the perceived relationships between different colours, the perceived relationships between different brightnesses, and the perceived relationships between different points in a person's field of view (giving geometry) are all defined by the brain in a mutually dependent way. Hence, in this case, there is a connection between the relationships that underlie colour perception and our perception of the underlying geometry of the world around us.

If we were to apply the theory to the auditory cortex then the resulting weighted relations might define how we perceive the relationships between the pitches of the chromatic scale. Of course, more work is required.
Although an early example, when considering the scope of the theory, readers may find the Definitive Player Problem to be of interest; see [1].

In Example 3.4, we applied the theory to a system where the probability distribution P in Definition 1.1 is uniform over Ω_{S,V}. In this case the solutions to (1) vary greatly (instead of all being similar) and, hence, the system fails to define weighted relations that give a coherent interpretation of the states of the system. We found that the variation in the solutions to (1) is partly due to a type of symmetry within T; this is discussed in Example 3.4. Also, the associated efe-histogram is without a left tail. This example supports the claim that the theory may satisfy the empirical observation that not all systems appear to be capable of consciousness.

In Subsection 3.1.3, we investigated the effect of applying base changing operations. Typical data was obtained for seven systems from digital photographs. For each of the systems investigated, Figure 10 shows that, when changing back to the original base from the alternative base, the efe-histogram undergoes an increase in negative skewness and in mean minus minimum, as well as a right shift in location. Furthermore, for most of the systems, the minimum efe value observed when using the original base is to the left of the minimum efe value observed when using the alternative base. These results suggest that the bases maximising the length of the left tail of the efe-distribution (here approximated by an efe-histogram) are important for the theory presented in the present paper. However, instead of permitting all base changing operations, restricting the theory to the base branching structure may be necessary; see Subsection 3.1.2.

It is argued in [1] that the theory presented there provides a solution to the binding problem and avoids the homunculus fallacy. Those arguments also apply to the theory presented in the present paper.
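The two left-tail summary statistics used above, negative skewness and mean minus minimum, can be computed from a sample of efe values as follows. The sample here is hypothetical, chosen only to exhibit a long left tail; it is not data from the paper.

```python
import statistics

def left_tail_stats(efe_values):
    """Summary statistics for an efe-histogram: sample skewness
    (negative for a long left tail) and mean minus minimum
    (a simple measure of left-tail length)."""
    n = len(efe_values)
    mean = statistics.fmean(efe_values)
    sd = statistics.stdev(efe_values)
    skew = sum((x - mean) ** 3 for x in efe_values) / (n * sd ** 3)
    return skew, mean - min(efe_values)

# Hypothetical sample: most efe values cluster near 4, with a few
# solutions far below, giving a long left tail.
sample = [1.0, 2.0, 3.8, 3.9, 3.9, 4.0, 4.0, 4.0, 4.1, 4.1]
skew, tail = left_tail_stats(sample)
assert skew < 0 and tail > 0
```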
In particular, consciousness is not the output of some algorithmic process; instead, it may largely be the states of the system interpreted in the context of the weighted relations that minimise expected multi-relational float entropy; see Definition 4.1. The weighted relations that Definition 4.1 involves, in addition to U and R, are brought into play by increasing the number of conditions in Definition 1.5. The extra conditions utilise higher bases of the base branching structure. The findings of the present paper suggest that the conditions C_1, C_2, ... should be those that increase the length of the left tail of the efe-distribution.

In Subsection 3.2, we investigated joining and partitioning systems. Table 12 provides an example where the minimum efe of the joined system is greater than 3 whilst, after partitioning, the sum of the minimum efes is only 2.8.

In Subsection 3.3, we considered whether the theory presented in the present paper is independent of the choice of metric used in Definition 1.5 when the metric determines a total order on [0,1] in some natural way. In this case, the meaning of the values in weighted relation tables is determined by the metric being used. Example 3.6 and Lemma 3.2 provide some evidence of such independence. However, some more work is required.

Finally, in Subsection 4.2 we made some comparisons between the theory of the present paper, Integrated Information Theory, and Shannon entropy. The integration postulate of IIT says that only when the quantity of integrated information is positive can a mechanism contribute to consciousness. For comparison, the theory of the present paper suggests that, to contribute to consciousness, a mechanism will at least need an inherent probability distribution on its set of states that gives an efe-histogram with a long left tail.
The length of the left tail may turn out to be of great importance.

According to IIT, the shape of a quale in Q-space completely specifies the quality of the experience, and it is suggested in [12] that similarity in shape corresponds to similarity in experience. The theory in [12] also suggests a way in which relationships might be defined in Q-space by entangled q-arrows. For comparison, the theory of the present paper suggests that, under the requirement of minimising expected (multi-relational) float entropy, the brain defines relationships (represented in the theory by weighted relations) such that when a brain state is interpreted in the context of all these relationships the brain state acquires meaning in the form of the relational content of the associated experience.

Finally, in Subsection 4.2.3, we showed that efe is a measure (in bits per data element) of the expected amount of information needed to specify the state of the system beyond what is already known about the system from the weighted relations provided.

It is hoped that future research will someday determine the extent to which the word 'quasi' can be removed from the title of the present paper. Whilst rather different in content, readers may also find [15], [16], [17] and [18] to be of interest.

A Software
Table 14: Software used during the research for the present paper.
Software | Availability | Use in the present paper
GIMP 2.6 | Freeware | Used to posterise digital raster images, i.e. reduce the palette size to a small number of shades or colours.
RasterSampler 1.0 (Java) | From the author | Used to sample pixels and collate data.
URFinder 3.7 (Java) | From the author | Used to search for solutions to (1) and collect observations for efe-histograms.
Excel 2007 | Microsoft | Used to generate binary entry tables (such as those in Table 4), store outputs and perform statistical analysis.
Minitab 17 | Minitab Inc | Statistical analysis.
URFinder 3.7 can be used to implement the binary search algorithm, specified in Subsection 2.2, and for collecting observations from which efe-histograms can be produced. The author ran URFinder 3.7 on a desktop dual-core CPU machine, and is happy to distribute the software. The algorithm and machine were chosen for convenience and their performance (i.e. the maximum size of system that can practically be investigated) is far from what could potentially be achieved. Indeed, for a system with n = |S| and m = |V|, Step 2 of the binary search algorithm calculates 2(n(n − 1) + m(m − 1))/2 exact efe values. This is computationally expensive for all but quite small systems, particularly since the algorithm calculates exact efe values rather than estimates obtained by employing statistical methods.

For future investigations we could consider taking advantage of the continuing increase in power and affordability of multi-GPU machines and hybrid CPU-GPU machines. The use of GPUs can result in orders of magnitude improvement in speed over conventional processors. Furthermore, (1) is an optimisation problem and falls within a common general class of problems studied in optimisation theory for which a number of efficient algorithms are available. These include gradient methods, stochastic gradient methods and derivative-free optimisation; see [19], [20] and [21].

B Notation
Table 15: Notation (most of the formal definitions can be found in Subsections 1.1, 3.3 and 4.1).
Symbol | Description
a, b, c, ... | elements of S, but also used to denote elements of other sets when the meaning is clear from the context.
A | an element of 2^{Ω_{S,V}}.
B_n | the Bell number for |S| = n.
B(x) = exp(e^x − 1) | the generating function of B_n.
C_1, C_2, C_3, ... | conditions, involving weighted relations, in the definition of multi-relational float entropy.
d | a metric on the set of all weighted relations on S or, in places, a metric on [0,1]^n.
d_n | for n ∈ ℕ ∪ {∞}, a metric (on the set of all weighted relations on S) obtained from the corresponding p-norm, for p = n, on a finite dimensional vector space.
d_f | a metric on ℝ^n; a function f : ℝ^n → ℝ^n is used in its definition.
d_f | a metric on [0,1]^n; a function f : [0,1] → [0,1] is used in its definition.
≤_d | a total order on [0,1] determined by the metric d.
[·, ·]_d | an interval determined by the metric d.
efe(R, U, P) | the expected float entropy, relative to U and R, of the given system.
efe(R, U, T) | the mean approximation of efe(R, U, P).
fe(R, U, S_i) | the float entropy, relative to U and R, of the data element S_i.
fe(R_1, U_1, R_2, U_2, R_3, U_3, ..., S_i) | the multi-relational float entropy, relative to U_1, U_2, U_3, ... and R_1, R_2, R_3, ..., of the data element S_i.
f_i | the map f_i : S → V corresponding to the data element S_i.
node 1, node 2, node 3, ... | elements of S.
P | the probability distribution P : Ω_{S,V} → [0,1] of the random variable defined by the bias of the given system. P extends to a probability measure on 2^{Ω_{S,V}}.
R | an element of Ψ_S.
R_{U,S_i} | the element of Ψ_S given by the canonical definition R_{U,S_i}(a, b) := U(f_i(a), f_i(b)) for all a, b ∈ S.
S | a nonempty finite set; in most places S denotes the set of nodes of a system.
S_i | a data element for S, i.e. a system state given by the aggregate of the node states.
T | the typical data for the given system, i.e. T is a finite set of numbered observations of the given system.
t | the map t : {1, ..., |T|} → {i : S_i ∈ Ω_{S,V}} for which S_{t(k)} is the value of observation number k in T. t need not be injective.
U | an element of Ψ_V.
v_1, v_2, v_3, ... | elements of V.
V | the node repertoire, i.e. the set of node states for a given system.
Ψ_S | the set of all reflexive, symmetric weighted relations on S.
Ψ_V | the set of all reflexive, symmetric weighted relations on V.
Ω_{S,V} | the set of all data elements S_i, given S and V.
2^{Ω_{S,V}} | the power set of Ω_{S,V}.

Acknowledgment
The author is grateful to the anonymous referee for carefully reading this paper and providing helpful comments. The author is also grateful to the production editor for ensuring that the finished article is nicely presented.
References

[1] Jonathan W. D. Mason. Consciousness and the structuring property of typical data. Complexity, 18(3):28–37, 2013.
[2] E. L. Bienenstock, L. N. Cooper, and P. W. Munro. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2(1):32–48, 1982.
[3] A. Kirkwood, M. G. Rioult, and M. F. Bear. Experience-dependent modification of synaptic plasticity in visual cortex. Nature, 381(6582):526–528, 1996.
[4] S. M. Dudek and M. F. Bear. Homosynaptic long-term depression in area CA1 of hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proceedings of the National Academy of Sciences of the United States of America, 89(10):4363–4367, 1992.
[5] Johan Wagemans et al. A century of Gestalt psychology in visual perception: II. Conceptual and theoretical foundations. Psychological Bulletin, 138(6):1218–1252, 2012.
[6] Svetlana V. Shinkareva, Vicente L. Malave, Robert A. Mason, Tom M. Mitchell, and Marcel Adam Just. Commonality of neural representations of words and pictures. NeuroImage, 54(3):2418–2425, 2011.
[7] T. Mansour. Combinatorics of Set Partitions. Discrete Mathematics and its Applications. CRC Press, Boca Raton, FL, 2012.
[8] R. Quian Quiroga, G. Kreiman, C. Koch, and I. Fried. Sparse but not grandmother-cell coding in the medial temporal lobe. Trends in Cognitive Sciences, 12(3):87–91, 2008.
[9] D. J. Graham and D. J. Field. Natural images: coding efficiency. In Larry R. Squire, editor, Encyclopedia of Neuroscience, pages 19–27. Academic Press, Oxford, 2009.
[10] Giulio Tononi. Consciousness as integrated information: a provisional manifesto. Biological Bulletin, 215(3):216–242, 2008.
[11] Masafumi Oizumi, Larissa Albantakis, and Giulio Tononi. From the phenomenology to the mechanisms of consciousness: Integrated information theory 3.0. PLoS Computational Biology, 10(5):e1003588, 2014.
[12] David Balduzzi and Giulio Tononi. Qualia: the geometry of integrated information. PLoS Computational Biology, 5(8):e1000462, 2009.
[13] David Balduzzi and Giulio Tononi. Integrated information in discrete dynamical systems: motivation and theoretical framework. PLoS Computational Biology, 4(6):e1000091, 2008.
[14] Adam B. Barrett and Anil K. Seth. Practical measures of integrated information for time-series data. PLoS Computational Biology, 7(1):e1001052, 2011.
[15] G. M. Edelman and G. Tononi. A Universe of Consciousness: How Matter Becomes Imagination. Basic Books, 2000.
[16] G. A. Ascoli. The complex link between neuroanatomy and consciousness. Complexity, 6(1):20–26, 2000.
[17] O. Sporns. Network analysis, complexity, and brain function. Complexity, 8(1):56–60, 2002.
[18] Yoichi Miyawaki, Hajime Uchida, Okito Yamashita, Masa-aki Sato, Yusuke Morito, Hiroki C. Tanabe, Norihiro Sadato, and Yukiyasu Kamitani. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5):915–929, 2008.
[19] Natasa Krejic and Natasa Krklec Jerinkic. Stochastic gradient methods for unconstrained optimization. Pesquisa Operacional, 34:373–393, 2014.
[20] J. Nocedal and S. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer-Verlag, New York, 2nd edition, 2006.
[21] A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization. MPS-SIAM Series on Optimization. SIAM, Philadelphia, 2009.