CCHAOS GAME REPRESENTATION ∗ EUNICE Y. S. CHAN † AND
ROBERT M. CORLESS ‡ Abstract.
The chaos game representation (CGR) is an interesting method to visualize one-dimensional sequences. In this paper, we show how to construct a chaos game representation. Theapplications mentioned here are biological, in which CGR was able to uncover patterns in DNAor proteins that were previously unknown. We also show how CGR might be introduced in theclassroom, either in a modelling course or in a dynamical systems course. Some sequences thatare tested are taken from the Online Encyclopedia of Integer Sequences, and others are taken fromsequences that arose mainly from a course in experimental mathematics.
Key words.
Chaos game representation, iterated function systems, partial quotients of contin-ued fractions, sequences
AMS subject classifications.
1. Introduction.
Finding hidden patterns in long sequences can be both diffi-cult and valuable. Representing these sequence in a visual way can often help. Theso-called chaos game representation (CGR) of a sequence of integers is a particularlyuseful technique, that visualizes a one-dimensional sequence in a two-dimensionalspace. The CGR is presented as a scatter plot (most frequently square), in whicheach corner represents an element that appears in the sequence. The results of theseCGRs can look very fractal-like, but even so can be visually recognizable and distin-guishable. The distinguishing characteristics can be made quantitative, with a goodnotion of “distance between images.”Many applications, such as analysis of DNA sequences [12, 15, 16, 18] and proteinstructure [4, 10], have shown the usefulness of CGR; we will discuss these applica-tions briefly in section 4. Before that, in section 2, we will look at a random iterationalgorithm for creating fractals. In section 5, we will apply CGR to simple abstractmathematical sequences, including the digits of π and the partial quotients of contin-ued fractions. Additionally, we will look at certain “distances between images” thatare produced. Finally, we will explain how some of the patterns arise, and what thesedepictions can tell us about the sequences.One important pedagogical purpose of this module is to provoke a discussionabout randomness, or what it means to be random. In this paper, we at first use theword “random” very loosely, on purpose.“What is probability? I asked myself this question many years ago,and found that various authors gave different answers. I found thatthere were several main schools of thought with many variations.”— Richard W. Hamming [13]
2. Creating Fractals using Iterated Function Systems (IFS).
The ran-dom iteration algorithm for creating pictures of fractals uses the fixed attracting setof an iterated function system (IFS). Before defining IFS, we will first define somenecessary preliminaries. ∗ Submitted to the editors DATE. † Department of Anesthesia and Perioperative Medicine, MEDICI Centre, Western University,London, Ontario, Canada ([email protected]). ‡ Ontario Research Center for Computer Algebra, School of Mathematical and Statistical Sci-ences, and Rotman Institute of Philosophy, Western University, London, Ontario, Canada ([email protected]). 1 a r X i v : . [ m a t h . HO ] D ec E. Y. S. CHAN, AND R. M. CORLESS
Fig. 1: DNA of human beta globin region on chromosome 11 (HUMHBB)—73308 bps
One can write a two-dimensional affine transformation in the Euclidean plane w : R → R in the form(1) w ( x , x ) = ( ax + bx + e, cx + dx + f )where a , b , c , d , e , and f are real numbers [3]. We will use the following more compactnotation: w ( x ) = w (cid:18) x x (cid:19) = (cid:20) a bc d (cid:21) (cid:20) x x (cid:21) + (cid:20) ef (cid:21) (2) = Ax + t . The matrix A is a 2 × (cid:20) a bc d (cid:21) = (cid:20) r cos θ − r sin θ r sin θ r cos θ (cid:21) , where ( r , θ ) are the polar coordinates of the point ( a, c ) and ( r , ( θ + π / )) are thepolar coordinates of the point ( b, d ). An example of a linear transformation(4) (cid:20) x x (cid:21) → A (cid:20) x x (cid:21) in R is shown in figure 2. We can see from this figure that the original shape remainsas a parallelogram, but now has changed in size and in rotation. This is what wemean by linear transformation. An iterated function system (IFS) is a finiteset of contraction mappings on a metric space. These can be used to create pictures To be more specific, this is a hyperbolic iterated function system. However, in practice, theword “hyperbolic” is sometimes dropped in practice [3, p. 82].HAOS GAME REPRESENTATION (a) Shape before transformation (b) Shape after transformation Fig. 2: Transformation of a unit square using the transformation matrix A (as shownin equation (3)), where r = 2, r = 3, θ = π / , and θ = π / .of fractals. These pictures essentially show what is called the attractor of the IFS.We will be defining what an attractor is in Section 3. In this paper, we will only belooking at the form where each contraction mapping is an affine transformation onthe metric space R .We will illustrate the algorithms for an IFS as an example. Consider the followingmaps: w ( x, y ) = (cid:20) / / (cid:21) (cid:20) xy (cid:21) + (cid:20) (cid:21) ,w ( x, y ) = (cid:20) / / (cid:21) (cid:20) xy (cid:21) + (cid:20) / (cid:21) , (5) w ( x, y ) = (cid:20) / / (cid:21) (cid:20) xy (cid:21) + (cid:20) / / (cid:21) , each with the probability factor of / . These values can also be displayed as a table,where the order of the coefficients a through f corresponds to the order presented inequation (2.1) and p represents the probability factor. The corresponding table forequation (2.2) can be found in table 1.Table 1: Table representing the IFS from equation (2.2) w a b c d e f p / / / / / / / / / / / / To create a fractal picture using IFS, we first choose a starting point ( x , y ).We then randomly choose a map from our IFS and evaluate it at our starting point E. Y. S. CHAN, AND R. M. CORLESS ( x , y ) to get our next point ( x , y ). We do this again many times (determined bythe user) until a pattern (usually fractal) appears. The Matlab code shown in Listing1 computes and plots 50000 points corresponding to the IFS from table 1. To choosewhich map to use for each iteration, we used a uniform random number generator randi (built-in function in
Matlab ) to choose a number between (and including) 0and 2. Each value corresponds to a map in our IFS. The picture produced from Listing1 is shown in Figure 3. It is clear that this figure shows a fractal, which appears to bethe Sierpinski triangle [3, 9], where its three vertices are located at (0 , ,
1) and(1 , Listing 1
Matlab
Code for an IFS generating a Sierpinski triangle n = 50000;pts = zeros(2, n);rseq = randi([0, 2], 1, n-1);for i = 2:nif rseq(i-1) == 0pts(:, i) = [0.5, 0; 0, 0.5]*pts(:, i-1) + [0; 0];elseif rseq(i-1) == 1pts(:, i) = [0.5, 0; 0, 0.5]*pts(:, i-1) + [0; 0.5];else pts(:, i) = [0.5, 0; 0, 0.5]*pts(:, i-1) + [0.5; 0.5];endendplot(pts(1, :), pts(2, :), ’k.’, ’MarkerSize’, 1);axis(’square’);set(gca, ’xtick’, []);set(gca, ’ytick’, []);
Fig. 3: Results from Listing 1. We see visible regularity arising from a program thatuses randomness.Iterated function systems can be quite versatile; a set of several attractors can be
HAOS GAME REPRESENTATION w a b c d e f p .
00 0 .
00 0 .
00 0 .
16 0 .
00 0 .
00 0 .
012 0 .
85 0 . − .
04 0 .
85 0 .
00 1 .
60 0 .
853 0 . − .
26 0 .
23 0 .
22 0 .
00 1 .
60 0 . − .
15 0 .
28 0 .
26 0 .
24 0 .
00 0 .
44 0 . (a) Barnsley’s Fern (n = 50000) (b) Black spleenwort aa Fig. 4: Comparison of Barnsley’s Fern to the black spleenwort, which is what the IFSwas based on.
3. Chaos game.
CGR is based on a technique from chaotic dynamics called the“chaos game,” popularized by Barnsley [3] in 1988. The chaos game is an algorithmwhich allows one to produce pictures of fractal structures [15]. In its simplest form,this can be demonstrated with a piece of paper and pencil using the following steps(adapted from [15]):1. On a piece of paper, draw three points spaced out across the page; they donot necessarily need to be spaced out evenly to form an equilateral trianglethough they might be.2. Label one of the points with “1, 2,” the other point with “3, 4,” and the lastpoint with “5, 6.”3. Draw a point anywhere on the page: this will be the starting point.4. Roll a six-sided die. Draw an imaginary line from the current dot towards tothe point with the corresponding number, and put a dot half way on the line.This becomes the new “current dot.”5. Repeat step 4 (until you get bored).
E. Y. S. CHAN, AND R. M. CORLESS
Obviously, one would not want to plot a large number of points by hand ; instead,this can be done by computer; the code is shown in Listing 2. For this algorithm, weassigned (0 , ,
1) and (1 ,
1) as the vertices of the triangle; they each are labelledwith the value 0, 1, and 2, respectively. Instead of using a die to decide which vertexto choose, similarly to Listing 1 we use
Matlab ’s built-in function randi , which picksfrom the set { , , } .To calculate the new point (shown in Listing 2), we take the average between theprevious point and the corresponding vertex. For example, if the first random numbergenerated was 2, then we would take the average between our initial point (0, 0) andthe vertex that corresponds to 2, which is in our case is (1 , . , . Listing 2
MATLAB Code for CGR with three vertices using a random numbergenerator n = 50000;pts = zeros(2, n);rseq = randi([0, 2], 1, n-1);for i = 2:nif rseq(i-1) == 0pts(:, i) = 0.5*(pts(:, i-1) + [0; 0]);elseif rseq(i-1) == 1pts(:, i) = 0.5*(pts(:, i-1) + [0; 1]);else pts(:, i) = 0.5*(pts(:, i-1) + [1; 1]);endendplot(pts(1, :), pts(2, :), ’k.’, ’MarkerSize’, 1);axis(’square’);set(gca, ’xtick’, []);set(gca, ’ytick’, []);
One would expect that the outcome of Listing 2 would show that the dots wouldappear randomly; however, this is not the case: Figure 5a shows the result of thechaos game if the game was played 50000 times. We can see that this figure showsa fractal-like pattern (such as one of a Sierpinski triangle), and looks almost (if not)identical to the figure which resulted from the IFS.It is quite interesting that a sequence of random numbers could produce such adistinct pattern. Indeed, some bad pseudorandom number generators were identifiedas being bad precisely because they failed a similar-in-spirit visualization test [22].We will discuss this issue further in section 6. Figure 5b shows an example of thefirst five steps of the chaos game (overlaid on top of a CGR with 10000 points),which corresponds to the following values that were randomly generated: 2, 1, 0,0, 2. Displaying the CGR in this manner can give us some insight as to why thisparticular chaos game with three vertices gives us such a striking fractal: we can seefrom the figure that each of the points from the first five steps is located at a vertex(in this case the right-angled one) of one of the many triangles within the fractal. A We tried this, though not in class. We got up to 80 points on one figure, using the die roller onrandom.org. It’s kind of fun, in a physical “drawing and measuring” way, but it’s hard not to makemistakes. It could make a useful “active learning” exercise and we might try it in class in the future.HAOS GAME REPRESENTATION (a) Sierpinski triangle ( n = 50000) (b) First 5 steps of chaos game with threevertices. In this example, the first 5 randomintegers generated are 2, 1, 0, 0, 2. Fig. 5: CGR with three vertices of randomly generated integers.careful observer would notice that the triangle at which the point is located becomessmaller each iteration. This implies that within the next few iterations, the triangleassociated with the computed point would be too small to be seen in this illustration.[Note: if we zoom into that microscopic level, the dots would look random. Indeed,the random dots are ultimately dense on the Sierpinski triangle.] This sequence ofpoints generated by the chaos game is called the orbit of the seed and it is attracted tothe Sierpinski triangle, sometimes called a strange attractor . For another descriptionof why this chaos game creates the Sierpinski triangle, see [11].
We have already seen that a fractal patterncan appear when we follow the chaos game described above for a polygon with threevertices (triangle) using a random number generator, but we are not limited to onlythis. There are many variations for CGR, though some are more useful than others. Inthis section, we will look into what patterns arises when we perform CGR on differentpolygons, changing the placement of the m -th point relative to the ( m − Figure 6 shows the chaos game representation,where the m -th point is placed halfway between the ( m − m -th base, for various polygons, using 50,000 points. As wecan see from Figure 6a, there is no apparent pattern: the chaos game produced asquare uniformly and randomly filled with points. The other figures from Figure 6appear to exhibit some patterns. These patterns are due to several parts of their at-tractors overlapping, indicated by the the uneven distribution of points (where pointsdo appear); i.e. we can see some darker areas and some lighter areas in the figures.Therefore, for this particular chaos game (in which we say that the dividing rate r is0.5—we will talk about this in section 3.1.2), the square is the best choice to visualize E. Y. S. CHAN, AND R. M. CORLESS (a) Square (b) Pentagon (c) Hexagon(d) Heptagon (e) Octagon (f) Nonagon
Fig. 6: Chaos game of polygon with different number of verticesone-dimensional sequences. We will see in section 3.1.3 that we can use other polygons(we will be referring to them as n -gons) for CGR to uncover patterns in sequencesnot with 4 elements by using the appropriate dividing rate. In the previous subsection, we saw the different patternsappear for different shapes. We saw that for shapes with more than 4 vertices had sev-eral parts of the attractor overlapping. Here we will show that attractor overlappingalso happens when choosing a small dividing rate.The chaos game is not restricted to the “rule” in which the new point has to beplaced halfway between the current point and the corresponding vertex as depicted inthe previous examples. The new point can be placed anywhere within the line segmentcreated by the two points of reference. The placement of the new point affects howthe attractor of the CGR looks. We will see this in the following example. To quantifythe placement of the point, we take the proportion of which the distance between thenew point and current point is from the distance between the current point and thecorresponding vertex. This is called the dividing rate , r . Therefore, when the newpoint is placed halfway between the current point and the vertex, the dividing rateis r = 0 .
5, and if the new point is placed closer to the location of the current point,then the dividing rate r < .
5, whereas if the point is closer to the vertex, then thedividing rate r > . r for the square-shaped CGR of a sequenceof random integers. As we have seen in the previous subsection, the square-shaped HAOS GAME REPRESENTATION (a) r = 0 . r = 0 .
25 (c) r = 0 . r = 0 . r = 0 .
75 (f) r = 0 . Fig. 7: Illustration of different r affecting how a square CGR looks using 50000 points.CGR did not exhibit any fractal patterns when r = 0 .
5, so we wondered if we varied r ,would this change? Figure 7 shows the square CGRs for r = 0 . , . , . , . , . , and 0 .
9. From these figures, we see how r affects the visualization.For r < .
5, we can see that some pattern appear, especially those with a dividingrate close to r = 0 .
5. We can see for r = 0 .
6, the points cover the entire plot area,but they are unevenly-space (in comparison to what we have seen for r = 0 . r decreases,we can see that the CGR becomes more circular with the points spaced more closelytogether. From this, it infers that as r decreases, more of the attractors overlap.For r > .
5, we can see that a different pattern appears (compared to those with adividing ratio r < . r > . r > .
5, we can see that the points are spread apart fromeach other. This can possibly indicate that the attractors do not overlap, as well as donot touch each other. Although this can be used to distinguish any nonrandomness,it would be much more difficult and using the dividing rate r = 0 . E. Y. S. CHAN, AND R. M. CORLESS
CGR.
In [10], Fiser et al. generalized CGR to be ap-plicable for sequences of any number of elements. There are many ways of generalizingCGR, which are highlighted in [2], but the generalization the authors had chosen wasto use an n -sided polygon (which they refer to as an n -gon), where n was the numberof elements in a sequence that should be represented [10]. As we saw in Figure 6, theattractors of the n -gons for when n > n -gons for different values of n : r = (cid:16) (cid:16) πn (cid:17)(cid:17) − . Although the dividing ratio calculated from this formula does prevent attractors fromoverlapping, Almeida and Vinga noticed that the attractors are not optimally packedin [2]. For example, for n = 4, the solution gives a dividing ratio r = 0 . typical r = 0 .
5. In [2], the authors improved upon this formula for the dividingratio: r = 2 cos (cid:0) π (cid:0) − kn (cid:1)(cid:1) − (cid:0) π (cid:0) − n (cid:1)(cid:1) cos (cid:0) (2 k − π n (cid:1) (cid:18) tan ( (2 k − π n ) tan ( π − ( n +2 k − ( π n )) (cid:19) (cid:0) π (cid:0) − kn (cid:1)(cid:1) where k = round (cid:18) n + 14 (cid:19) . Figure 8 shows the generalized CGRs for various n -gons using the division formulafrom [10] (left) and [2] (right).In the next section, we will look at applications of CGR, mainly biological andmathematical. For the following examples, unless otherwise specified, assume that r = 0 .
4. CGR of biological sequences.
The chaos game representation has beenapplied to biological areas. In this section, we will look at how the chaos game has beenapplied the DNA and protein (amino acids) sequencing and visualizations. We alsowill give a brief overview of the
Map of Life , an extension of the visualization of DNAsequences. The Map of Life shows potential in allowing researchers to quantitativelyclassify species that was once unclear in its taxonomy.Life certainly has random elements, but it has deterministic elements as well. Thehope of CGR is to reveal some of the latter amidst the clutter of the former.
H. J. Jeffrey was the first to propose using thechaos game representation as a novel way of visualizing nucleotide sequences. Thisrevealed previously unknown patterns in certain proteins [15, 16]. The genetic se-quence is made up of four bases: adenine (A), guanine (G), cytosine (C), and eitherthymine (T) or uracil (U) for DNA or RNA, respectively. Using a square-shapedCGR, we label the four corners with the name of each base. In this paper, A is inthe bottom-left corner, C in the top-left, G in the top-right, and U/T in the bottomright. Rather than use a random number generator to determine which map to use
HAOS GAME REPRESENTATION r = 0 . r = 0 . r = 0 . r = 0 . r = 0 . r = 0 . r = 0 . r = 0 . r = 0 . r = 0 . r = 0 . r = 0 . Fig. 8: Graphical comparison of proposed CGR generalization using 50000 points.The left uses the dividing ratio formula from [10], and the right uses the dividingratio formula from [2].for each iteration as we did in Section 3, we follow the genetic sequence that we wantto create a CGR for.To demonstrate the chaos game for DNA sequences, let us walk through plottingthe first five bases of the DNA sequence HUMHBB (human beta globin region, chro-mosome 11), “GAATT,” shown in Figure 9. We first mark the center as our initialpoint. The first base in the sequence is “G” so, we plot a point half way between ourinitial point and the “G” corner. The next base in the sequence is “A” so we plot apoint half way between the point that we just plotted and the “A” corner. We leavethe reader to check the placement of the remaining steps.The finished CGR is shown in Figure 1, which certainly has an identifiable fractalpattern. The most prominent pattern is what is referred to as the “double scoop,”which actually appears in almost all vertebrate DNA sequences. This pattern is dueto the fact that there is a comparative sparseness of guanine following cytosine inthe gene sequence since CG dinucleotides are prone to methylation and subsequentlymutation.To fully understand the double scoop pattern, we must understand the biologicalmeaning of the CGR. Each point plotted in the CGR corresponds to a base, anddepending on where it is placed, we can trace back and figure out parts of the sequence2
E. Y. S. CHAN, AND R. M. CORLESS
Fig. 9: First five steps of producing CGR using HUMHBB DNAwe are examining [12]. Figure 10a shows the relationship corresponding area of theCGR and the DNA sequence. In reference to this figure, we can see that for anypoint that corresponds with base G, it will be located in the upper right quadrantof the CGR plot. To see what the previous base is, we can divide the quadrant intosub-quadrants (labeling them in the same order as the quadrants), and dependingon where the point is, we can determine what the previous base of the sequence is.We can repeat this step again and again to find the order in which the bases appearin the sequence. Figure 10b shows a CGR square where all the CG quadrants areunfilled. We can see from this figure that even though it is not a real CGR (sincewe did not use a sequence to produce it), we still get the same double scoop patternfound in Figure 1. By identifying regions of the CGR square in this way, it is possibleto identify features of DNA sequences that correspond to patterns of the CGR.Figure 11 shows other examples of CGRs for DNA sequences. We can see thatFigure 11a (CGR of DNA sequence of human herpesvirus strain) also exhibit thedouble scoop pattern that we have seen for HUMHBB. This agrees with what wasmentioned earlier in that it is common to see the lack of C and G dinucleotidestogether within the human genome. Figure 11b, on the other hand, shows the CGRof the DNA sequence of the chloroplast of quinoa plant, and does not show a doublescoop pattern. We can see from using the CGR of DNA sequences, we are able todistinguish between different species very easily.
Taking DNA sequence visualization one step further, theauthors of [19] proposed a novel combination of methods to create
Molecular DistanceMaps . Molecular distance maps visually illustrate the quantitative relationships andpatterns of proximities among the given genetic sequences and among the species theyrepresent. To compute and visually display relationships between DNA sequences,the three techniques that were used include chaos game representation, structuraldissimilarity index (DSSIM), and multi-dimensional scaling. In the paper [19], thismethod was applied to a variety of cases that have been historically controversial and
HAOS GAME REPRESENTATION (a) Correspondence between DNA se-quences and areas of the CGR of DNA se-quences (b) Explanation of the double scooppattern—plot of CGR square with all CGquadrants unfilled. Fig. 10: CGR quadrants to explain biological phenomenon such as double scooppatternwas there demonstrated to have the potential for taxonomical clarification. In thefollowing, we will discuss the methods used from this paper, more particularly thestructural dissimilarity index and the multi-dimensional scaling as we have alreadylooked into the chaos game representation for nucleotides.To understand the structural dissimilarity index, we will first have to explain whatthe structural similarity index is. The structural similarity (SSIM) index is a methodfor measuring the similarity between two images based on the computation of threeterms, namely luminance distortion, contrast distortion, and linear correlation [31]. Itwas designed to perform similarly to the human visual system, which is highly adaptedto extract structural information. The overall index is a multiplicative combinationof the three terms: SSIM( x, y ) = [ l ( x, y )] α · [ c ( x, y )] β · [ s ( x, y )] γ , where l ( x, y ) = 2 µ x µ y + C µ x + µ y + C (6) c ( x, y ) = 2 σ x σ y + C σ x + σ y + C (7) s ( x, y ) = σ xy + C σ x σ y + C , (8)where µ x , µ y , σ x , σ y and σ xy are the local means, standard deviations, and cross-covariance for images x and y . C , C , and C are the regularization constants forthe luminance, contrast, and structural terms, respectively. If α = β = γ = 1, and4 E. Y. S. CHAN, AND R. M. CORLESS (a) DNA of Human herpesvirus 5 strain TR,complete genome—235681 bp (b) DNA of Chenopodium quinoa cultivarReal Blanca chloroplast, complete sequence,whole genome shotgun sequence–152282 bp
Fig. 11: Examples of CGR of DNA sequences. From the bottom left corner, goingclockwise, the bases are A, C, G, T. C = C / , the index simplifies to:SSIM( x, y ) = (2 µ x µ y + C )(2 σ xy + C )( µ x + µ y + C )( σ x + σ y + C ) . The theoretical range of SSIM( x, y ) ∈ [ − , x . They refer to this as a distance matrix.Both global and local SSIM index can be computed in Matlab by using the function ssim . For our own experiments presented in this paper, we use the global SSIM valuerather than the local.Now that we know how to compute the SSIM index, we are able to compute thestructural dissimilarity (DSSIM) index:DSSIM( x, y ) = 1 − SSIM( x, y ) , whose theoretical range is [0 ,
2] with the distance being 0 between two identical images,and 2 if the two images are negatively correlated. An example of two images that arenegatively correlated would be if one is completely white, while the other is completelyblack. Typically, the range that DSSIM falls with is [0 , . HAOS GAME REPRESENTATION (a) CGR of human (Homo sapiens sapiens)mitochondrial DNA — 16,569 bp (b) CGR of neanderthal (Homo sapiens ne-anderthalensis) — 16,565 bp Fig. 12: CGR of human and neanderthal mitochondrial DNA.(12a) and of the neanderthal (12b). We can see that the two CGRs are quite similar(which is agrees with the fact that the two species are from the same genus) andtherefore, we expect the the DSSIM to be small. Using
Matlab ’s ssim function,the structural similarity index between the two images is 0 . − . . x, y ) = . . . . . . . . . . . . , where the first row/column represents humans, the second row/column represents theneanderthals, the third row/column represents the great spotted kiwi, and lastly, thefourth row/column represents the pearlfish. Obviously, the diagonal of the matrixwould be 0, since the DSSIM of two of the same images would produce a result of0. As we have seen previous, the DSSIM(human , neanderthal) = 0 . E. Y. S. CHAN, AND R. M. CORLESS (a) CGR of the great spotted kiwi (Apteryxhaastii) mitochondrial DNA — 16,980 bp (b) CGR of the pearlfish (Carapus bermu-densis) — 16,613 bp
Fig. 13: CGR of other species’ mitochondrial DNA.as cognitive science, information science, pychometrics, marketing, ecology, social sci-ence and other areas of study [5]. The goal of MDS is to find a spatial configuration ofobjects when all that is known is some measure of their general similarity or dissimilar-ity [32]. In our case, MDS takes in the distance matrix and outputs a two-dimensionalmap, where each item is represented by a point. In [19], the authors used a classicalMDS, which assumes that all the distances (from the distance matrix) are Euclidean.For the algorithm for classical MDS, refer to [32, p. 10].
Matlab has its own built-infunction that does MDS, called cmdscale .Figure 14 shows an example of a molecular distance map of 4844 animals fromvarious classes from the chordate phylum (meaning that all these animals have a ver-tebrate), coloured according to their class. The sequences were taken from NCBIReference Sequence Database (RefSeq). We can see from this figure that this method(Map of Life) does quite a good job in sorting species into different categories. Oneparticular aspect that we want to point out in this figure is where the points repre-senting the lungfish are—they are in the area of where the bony fish, cartilage fish,and amphibians touch. This is truly remarkable as lung fish has qualities of all theseclasses.
CGR has also been applied to visualizing and ana-lyzing both the primary and secondary structures of proteins. The primary structureof a protein is simply an amino acid sequence. To analyze the primary structure ofproteins, the authors from [10] using an 20-sided regular polygon, each representingan amino acid. Table 2 shows all 20 amino acids along with their 3-letter code, aswell as their 1-letter code. To avoid the attractor from overlapping itself, we will usethe dividing rate for a 20-gon shown from [2] in Figure 8, r = 0 . HAOS GAME REPRESENTATION α -helices and the β -sheets. Usingthe chaos game on the secondary structure can indicate any non-randomness of thestructural elements in proteins. One way to achieve this is to divide the 20 aminoacids into 4 groups based on different properties and assign each group to a corner.Some properties that have been considered for CGR include hydrophobicity, molecularweight, isoelectric point (pI), α propensity, and β propensity. For details, see [4, 10].
5. CGR of mathematical sequences.
In the first-year course we gave in2015 [6], we applied the chaos game representation to mathematical sequences found8
E. Y. S. CHAN, AND R. M. CORLESS
Table 2: Table of all 20 amino acidsName 3 letter code 1 letter codealanine ala Aarginine arg Rasparagine asn Naspartic acid asp Dcysteine cys Cglutamine gln Qglutamic acid glu Eglycine gly Ghistidine his Hisoleucine ile Ileucine leu Llysine lys Kmethionine met Mphenylalanine phe Fproline pro Pserine ser Sthreonine thr Ttryptophan trp Wtyrosine tyr Yvaline val Von the Online Encyclopedia of Integer Sequences (OEIS) [14]. In the following, weshow some CGR experiments that came from the first-year course. Some sequencesthat were used from the Online Encyclopedia of Integer Sequences (OEIS) includeddigits of π (A000796), Fibonacci Numbers (A000045), prime numbers (A000040), andsome from the continued fractions section (which were introduced at the beginningof the course). We will not only look at the visualizations created in the first-yearclass, but we will also determine whether these experiments are considered to be goodvisual representations. Note that for the following 4-vertex examples, the verticesare labeled from “0” to “3” starting from the bottom left corner of the plot goingclockwise (i.e. “0” corresponds to the coordinate point ( − , − − , We first applied CGR to the digits of π (A000796 from theOEIS ), since we believe that the sequence of the digits are random. Since we focusedon the 4-vertex CGR in the course, we na¨ıvely took the digits of π modulo 4 andapplied it to the 4-vertex CGR. The result of this is shown in Figure 16a. Since weassumed that the digits of π are random, this means that the visualization should besimilar to the square-shaped CGR of random integer sequences (Figure 6a). However,we can see from Figure 16a that this is not the case; instead of a uniform covering ofthe space, we can see a pattern of vertical lines occurring. Quantitatively, the DSSIMindex between Figures 16a and 6a is 0 . https://oeis.org/A000796HAOS GAME REPRESENTATION (a) cytochrome b (mito-chondrion) [Homo sapiens](GenBank: ASY00349.1) (b) cytochrome b (mito-chondrion) [Mus musculus](GenBank: NP 904340.1) (c) ribonuclease inhibitor[Homo sapien] (GenBank:NP 976317.1)(d) major vault protein[Rattus norvegicus] (Gen-Bank: NP 073206.2) (e) pertussis toxin sub-unit 1 [Bordetella pertus-sis Tohama I] (GenBank:NP 882282.1) (f) NADH dehydrogenasesubunit 5 (mitochondrion[Mus musculus] (GenBank:NP 904338.1) Fig. 15: Illustration of the chaos game representation of proteins (specifically usingtheir amino acids).why our CGR of the digits of π does not match the CGR of random integers eventhough we know for a fact that the digits of π are indeed random. Due to this reason,Figure 16a is not a good visual representation of the digits of π .Another CGR to visualize the digits of π is shown in Figure 16b. We use a 10-vertex CGR ( r = 0 . We also created a visualization of the FibonacciSequence (A000045 from the OEIS). One example of a visualization of this sequenceis shown in Figure 17. It is a 10-vertex CGR of the Fibonacci sequence modulo 10.We can see from the figure that most of the numbers of the Fibonacci sequence areeven. Other visualizations for the Fibonacci sequence can be done, such as takingthe modulo of the numbers using a different divisor and using a CGR with differentnumber of vertices.However, since the Fibonacci sequence grows very quickly, the students learnedabout the limitations of floating point since they were unable to take more than the0
E. Y. S. CHAN, AND R. M. CORLESS (a) 4 vertex-CGR of 50000 digits of π (b) 10 vertex-CGR of 100000 digits of π Fig. 16: CGR of the digits of π Fig. 17: 10 vertex-CGR of the first 3000 Fibonacci sequence modulus 10first 3000 Fibonacci numbers before the computer registering the number as infinity.
Following the visualization of the digits of π and theFibonacci sequence using the chaos game, we looked at another popular sequence: thesequence of prime numbers (A000040 from the OEIS ). We looked at four differentways of creating the CGR for prime numbers.We first visualized the prime numbers in a similar way as Figures 16b and 17 bytaking the sequence of prime numbers modulo 10. This is shown in Figure 18a. We https://oeis.org/A000040HAOS GAME REPRESENTATION k thprime numbers mod 4, a diagonal line which spans from the “1” corner to the “3”corner. As we all know, prime numbers larger than 2 will never be even, so the cornerpoints “0” and “2” which are representative “even” values would never occur, thuscreating a straight diagonal line. If we had reassigned the values the coordinate pointsrepresent, the plot would have turned out differently; instead of a diagonal line, itcould possibly be a horizontal (spanning from coordinate points ( − , −
1) to (1 , − − ,
1) to (1 , − ,
1) to ( − , −
1) or (1 , , − − , −
1) islabelled “1”, ( − ,
1) is labelled “3”, etc.), and plotted the points according to thechaos game ( r = 0 . We then explored visualizingthe sequences taken from the partial quotients of continued fractions, which weretaught at the beginning of the course. Continued fractions are fractions written inthe following general form [26, 7]:(9) a + 1 a + 1 a + 1 a + 1 a + 1. . . , which produces a sequence of numbers, a + [ a , a , a , a , · · · ], called the partialquotients of the continued fraction. This can be demonstrated with an example: wecan rewrite ⁄ in the form(10) 97 = 1 + 27 = 1 + 1 / = 1 + 13 + / = 1 + 13 + 11 + / . Here, the partial quotients of / are the elements of 1 + [3 , , √ e and π , where thesequences of their partial quotients were used in chaos game representations.2 E. Y. S. CHAN, AND R. M. CORLESS (a) All prime numbers less than 1000000taken mod 10 (b) All prime numbers between 7 and1000000 taken mod 10 where the verticesare 1, 3, 7, 9 clockwise from the bottom left(c) 103 + k th prime mod 4 (d) All prime numbers between 7 and1000000 taken mod 8 where the vertices are1, 3, 5, 7 clockwise from the bottom left Fig. 18: Examples of CGRs of prime numbersOne recurring theme of the experimental mathematics course was √
2, so withthis in mind, some students jumped on the opportunity to plot the chaos game rep-resentation of the quotients of the continued fraction of √
2. As we had seen earlierin the course, the sequence goes like1 + [2 , , , , . . . , . Because all elements of the sequence (apart from the first one) are 2’s, it is not
HAOS GAME REPRESENTATION e , whichis equal to(11) 2 + [1 , , , , , , , , · · · ][27], shown in figure 19b. What is seen here makes sense as the sequence mostlycontains 1’s and these alternates with even values, so in this case, either 0 or 2.Therefore, it is clear that there are no points in the lower right corner since thatrepresents the value 3. Unfortunately, these two plots do not look all that impressive;in fact, it is pretty underwhelming. Disappointed with this result, the students decidedto experiment with other sequences in which all four coordinates occur.The students then thought, “Why not take the partial quotients of the continuedfraction of π ? The results must be random.” As shown in figure 19c, unexpectedly,there is indeed a pattern, which shows that the sequence of partial quotients of thecontinued fraction of π is not as random as we thought at the beginning of the course.This showed the students that the distribution is not uniform. (a) Partial quotients of con-tinued fraction of √ e mod 4 (c) Partial quotients of con-tinued fraction of π mod 4 Fig. 19: Examples of CGR for partial quotients of continued fractions of √ e (19b), and π (19c).In fact, the distribution of partial quotients is very well known, owing to theresults of Khinchin [21]. As a very startling aside, recently, Bill Gosper has remarkedon an amazing identity(12) (cid:89) k ! ∆ ln k = K ln 2 ., where K is Khinchin’s constant. We did not inform our students of Khinchin’s re-markable results because CGR was only done at the end of the course. Next time,perhaps!
6. Random vs Pseudorandom.
The theory of probability arose much laterin the development of science than the theory of dynamical systems and of exacttrajectories, and therefore it may be supposed to be more difficult to grasp. There aremany different works on the fundamentals of probability which discuss this difficulty;see for instance [13], but also [17]. We believe that combining deterministic dynamicswith randomness is even more difficult; and that this, in fact, is the main business4
E. Y. S. CHAN, AND R. M. CORLESS nowadays of the applied mathematician. There have been, of course, millions of wordswritten on the topic. We believe that it is crucial that entering students see some ofthe discussions of the fundamentals; they need to have a chance to grasp the deeper,most practical aspects of the theory. We believe that this module offers the instructora chance to begin those discussions.One important early part of the discussion is whether or not what the computerproduces is “really” random. We alluded earlier to the fact that the poor qualityof some early bad random number generators was detected by patterns arising intwo-dimensional pictures. How are the patterns arising in the pictures in this paperdifferent? Would they arise if “really” random numbers were used instead of thepseudorandom numbers generated by computer? [The answer is yes.] And what isthe difference, anyway? There are knots here that the students (and professors) cantie themselves in: once a sequence of numbers is written down, however randomlyit was generated, how can it be random, any more? It’s now perfectly predictable!The most comforting words that we know about this come from Kolmogorov himself(quoted in full at the end of this paper) which we paraphrase as “it’s only a model.”Indeed see [23] for a brief discussion of a mathematical foundation for probability.This applies whether your sequence is “really” random, or only “pseudo” random.These discussions are important, in a situation such as that of this paper, wherewe are trying to tease out deterministic aspects of apparently random sequences (orof sequences such as that of DNA which is surely influenced both by random events(mutation, horizontal gene transfer) and very non-random events (selection). We haveused some “clearly” non-random (in some sense) sequences such as the digits of π toshow that we can detect a signature of randomness there; we have used the sametechniques to detect a signature of regularity. To be convinced, the students must beallowed to discuss these issues at some length. In particular, this may be the firsttime the students have encountered pseudorandom numbers.There are several high-quality methods for uniform pseudorandom number gener-ation: the most popular includes the multiple-recursive method and families formedby shift-register methods [8]. The multiple-recursive method is an extension of the lin-ear congruential method by replacing the first-order linear recursion by one of higherorder. One family in the shift-register method generates uniform pseudorandom num-bers by means of linear recurring sequences modulo 2. Another uses vector recursionsmodulo 2 of higher order; this family includes the very popular Mersenne twisterMT19937 [24], which is the default for MATLAB’s pseudorandom number generat-ing functions, which includes rand (uniformly distributed random number), randn (normally distributed random numbers), and randi (uniformly distributed pseudo-random integers) [25]. Other pseudorandom number generators in MATLAB includeSIMD-oriented Fast Mersenne Twister, Combined Multiple Recursive, MultiplicativeLagged Fibonacci and Legacy MATLAB generators. Users can use MATLAB’s rng function to control which generator to use, along with the seed for the pseudorandomnumber generator to produce a predictable sequence of numbers.As mentioned above, one of the pseudorandom number generators MATLAB’s rand function uses is the Mersenne twister MT19937 to generate uniform pseudoran-dom numbers. The MT19937 produces sequences of uniform pseudorandom numberswith period length 2 − HAOS GAME REPRESENTATION
7. Concluding Remarks.
This module can be used to teach students about auseful visualization of sequences, in which case the emphasis from the instructor couldbe on interpreting the results, perhaps by using structural similarity or distance maps.The module can also be used to teach elementary programming techniques, in whichcase the instructor can emphasize programming tools, correctness, efficiency, style, oranalysis.Using elementary sequences of integers makes the module accessible even to first-year students, avoiding difficult biological details: but even so, students are exposedto the deep concepts of pattern and randomness more or less straight away. Wehave only begun a discussion of what it means to be random, and what it means tohave a pattern. Randomness in a model can be a way to hide our ignorance, or itcan be a profound statement of our understanding. Modelling viral evolution usingrandomness [30], where mutations occur in large populations, is clearly warranted;similarly for the vast lengths of DNA sequences. For prime numbers, in one senseclearly not: primes cannot be random, even though they behave in some ways as ifthey are.“Such considerations may be repeated as often as we like, but it isclear that this procedure will never allow us to be free of the neces-sity, at the last stage, of referring to probabilities in the primitiveimprecise sense of this term.It would be quite wrong to think that difficulties of this kind arepeculiar in some way to the theory of probability. In the mathe-matical investigation of actual events, we always make a model ofthem. The discrepancies between the actual course of events and thetheoretical model can, in its turn, be made the subject of mathemat-ical investigation. But for these discrepancies we must construct amodel that we will use without formal mathematical analysis of thediscrepancies which again would arise in it in actual experiment.”— Andrey Kolmogorov [1]
Acknowledgments.
We would like to thank Lila Kari for introducing us to thechaos game representation. We also like to thank the students of the second run of thesenior experimental mathematics course for their input and ideas on this topic. Wewould like to acknowledge financial support from The University of Western Ontario(aka Western University), NSERC, and OGS.6
E. Y. S. CHAN, AND R. M. CORLESSREFERENCES[1]
A. D. Aleksandrov, M. A. Lavrent’ev, et al. , Mathematics: Its content, methods andmeaning , Courier Corporation, 1999.[2]
J. S. Almeida and S. Vinga , Biological sequences as pictures–a generic two dimensionalsolution for iterated maps , BMC bioinformatics, 10 (2009), p. 100.[3]
M. F. Barnsley , Fractals everywhere , Academic press, 2014.[4]
S. Basu, A. Pan, C. Dutta, and J. Das , Chaos game representation of proteins , Journal ofMolecular Graphics and Modelling, 15 (1997), pp. 279–289.[5]
I. Borg and P. J. F. Groenen , Modern multidimensional scaling: Theory and applications ,Springer Science & Business Media, 2005.[6]
E. Y. S. Chan and R. M. Corless , A random walk through experimental mathematics ,in Springer Proceedings in Mathematics & Statistics, Springer International Publish-ing, 2020, pp. 203–226, https://doi.org/10.1007/978-3-030-36568-4 14, https://doi.org/10.1007/978-3-030-36568-4 14.[7]
R. M. Corless , Continued fractions and chaos , The American mathematical monthly, 99(1992), pp. 203–215.[8]
M. R. Dennis, P. Glendinning, P. A. Martin, F. Santosa, and J. Tanner , The Princetoncompanion to applied mathematics , Princeton University Press, 2015.[9]
D. P. Feldman , Chaos and fractals: an elementary introduction , Oxford University Press,2012.[10]
A. Fiser, G. E. Tusnady, and I. Simon , Chaos game representation of protein structures ,Journal of molecular graphics, 12 (1994), pp. 302–304.[11]
Y. Fisher, M. McGuire, R. F. Voss, M. F. Barnsley, R. L. Devaney, and B. B. Mandel-brot , The science of fractal images , Springer Science & Business Media, 2012.[12]
N. Goldman , Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observedin chaos game representations of DNA sequences , Nucleic Acids Research, 21 (1993),pp. 2487–2491.[13]
R. W. Hamming , Art of Probability , Addison Wesley Publishing Company, 1991.[14]
O. F. Inc. , The on-line encyclopedia of integer sequences . https://oeis.org/, 2019.[15]
H. J. Jeffrey , Chaos game representation of gene structure , Nucleic Acids Research, 18 (1990),pp. 2163–2170.[16]
H. J. Jeffrey , Chaos game visualization of sequences , Computers & Graphics, 16 (1992),pp. 25–33.[17]
M. Kac , Probability and related topics in physical sciences , vol. 1, AMS, 1959.[18]
R. Karamichalis, L. Kari, S. Konstantinidis, and S. Kopecki , An investigation intointer-and intragenomic variations of graphic genomic signatures , BMC bioinformatics,16 (2015), p. 246.[19]
L. Kari, K. A. Hill, A. S. Sayem, N. Bryans, K. Davis, and N. S. Dattani , Map oflife: Measuring and visualizing species’ relatedness with” molecular distance maps” , arXivpreprint arXiv:1307.3755, (2013).[20]
L. Kari, K. A. Hill, A. S. Sayem, R. Karamichalis, N. Bryans, K. Davis, and N. S.Dattani , Mapping the space of genomic signatures , PloS one, 10 (2015), p. e0119815.[21]
A. Khinchin and T. Teichmann , Continued fractions , Physics Today, 17 (1964), p. 70.[22]
D. E. Knuth , Seminumerical Algorithms , vol. 2 of The Art of Computer Programming, Addi-son Wesley, Reading, MA, 3 ed., 1997.[23]
A. N. Kolmogorov , On logical foundations of probability theory , in Probability theory andmathematical statistics, Springer, 1983, pp. 1–5.[24]
M. Matsumoto and T. Nishimura , Mersenne twister: a 623-dimensionally equidistributeduniform pseudo-random number generator , ACM Transactions on Modeling and ComputerSimulation (TOMACS), 8 (1998), pp. 3–30.[25]
C. B. Moler , Random number generators, Mersenne Twiser . In
Cleve’s Corner: Cleve Moleron Mathematics and Computing . Available at https://blogs.mathworks.com/cleve/2015/04/17/random-number-generator-mersenne-twister/, 2015.[26]
C. D. Olds , Continued fractions , vol. 18, Random House New York, 1963.[27]
C. D. Olds , The simple continued fraction expansion of e , American Mathematical Monthly,(1970), pp. 968–974.[28]
I. Simon, L. Glasser, and H. A. Scheraga , Calculation of protein conformation as an as-sembly of stable overlapping segments: application to bovine pancreatic trypsin inhibitor. ,Proceedings of the National Academy of Sciences, 88 (1991), pp. 3661–3665.[29]
B. P. Stein , Dice rolls are not completely random [30] L. M. Wahl and T. Pattenden , Prophage provide a safe haven for adaptive exploration intemperate viruses , Genetics, 206 (2017), pp. 407–416.[31]
Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli , Image quality assessment: fromerror visibility to structural similarity , IEEE trans. image process., 13 (2004), pp. 600–612.[32]