Grammatical Evolution with Restarts for Fast Fractal Generation
Acceleration of a procedure to generate fractal curves of a given dimension through the probabilistic analysis of execution time
Manuel Cebrián, Manuel Alfonseca and Alfonso Ortega∗

Abstract
In a previous work, the authors proposed a Grammatical Evolution algorithm to automatically generate Lindenmayer Systems which represent fractal curves with a pre-determined fractal dimension. This paper gives strong statistical evidence that the probability distribution of the execution time of that algorithm exhibits a heavy tail with a hyperbolic probability decay for long executions, which explains the erratic performance of different executions of the algorithm. Three different restart strategies have been incorporated in the algorithm to mitigate the problems associated with heavy tail distributions: the first assumes full knowledge of the execution time probability distribution; the second and third assume no such knowledge. These strategies exploit the fact that the probability of finding a solution in short executions is non-negligible, and yield a severe reduction both in the expected execution time (up to one order of magnitude) and in its variance, which is reduced from an infinite to a finite value.
Keywords:
Fractal Generation, Grammatical Evolution, Randomized Algorithm, Heavy Tail Distribution, Restart Strategy.
1 Introduction

In the last decades, genetic algorithms, which emulate biological evolution in computer software, have been applied to ever wider fields of research and development and have given rise to a few astounding successes, together with a certain amount of disappointment, frequently related to the apparently inherent slowness of the procedure. This is not a surprise, as biological evolution, which serves as the source for most of the ideas used by the research in genetic algorithms, is an extremely slow process, difficult to experiment with, where actual processes require millions of years in many cases. This slowness is in part a consequence of the fact that randomness is a basic ingredient of the search performed by genetic algorithms. For this reason, the discovery and proposal of procedures to accelerate their execution time is one of the most interesting open questions in this field.

∗The authors are with the Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, 28049 Madrid, Spain, fax number: (+34) 914972235, e-mail: {manuel.cebrian, manuel.alfonseca, alfonso.ortega}@uam.es.

The procedure we propose in this paper has made it possible to increase by an order of magnitude the performance of at least one application of genetic algorithms: the use of grammatical evolution to generate fractal curves of a given dimension. It is probable that the application of the same procedure may be useful to accelerate many other applications of similar techniques, although there are cases where it cannot provide any improvement. The paper offers ways to predict the situations where this procedure may be useful, and to recognize those where it will not provide any improvement, by analyzing the statistical distributions of the execution time of the algorithms.
In fact, the family of heavy-tail distributions embodies those applications where the best improvement can be attained by the application of restart techniques, while another family (leptokurtic distributions) also offers a significant acceleration.

The remainder of this introduction contains a simple introduction to the three main fields affecting the experiment we have used as the template for the experimentation of the acceleration techniques: the family of genetic algorithms we are testing (grammatical evolution); fractal curves and their dimension; and L systems, which provide an easy way to represent such curves and make their computation straightforward.

Section 2 summarizes an algorithm we have developed and described in a previous publication, which makes it possible to compute the dimension of a fractal curve from its equivalent L system. Section 3 describes the concrete case we have used as the benchmark for our acceleration techniques: a genetic algorithm which generates a fractal curve with a given dimension. This algorithm has also been previously published in the scientific literature.

Section 4 describes the families of heavy tail and leptokurtic distributions, where the acceleration techniques proposed in this paper are most useful. Section 5 proves that the experiment described in section 3 gives rise to execution time distributions belonging to those families. Section 6 describes the restart strategy whose use significantly accelerates the execution time of our algorithm and all others with a distribution in the same families. Finally, section 7 offers the conclusions of the paper and proposes several lines of future work.

1.1 Grammatical evolution

Evolutionary Automatic Programming (EAP) refers to those systems that use evolutionary computation to automatically generate computer programs.
EAP techniques can be classified according to the way the programs are represented: tree-based systems, which work with the derivation trees of the programs, or string-based systems, which represent them as strings of symbols. The best known tree-based system is genetic programming (GP), proposed by Koza (1992), which automatically generates LISP programs to solve given tasks.

Tree-based systems do not make an explicit distinction between genotype and phenotype. String-based systems may do it. Grammatical evolution (GE, O'Neill & Ryan 2003) is the latest, most promising string-based approach. GE is an EAP algorithm based on strings, independent of the language used. Genotypes are represented by strings of integers (each of which is called a codon) and the context-free grammar of the target programming language is used to deterministically map each genotype into a syntactically correct phenotype (a program). In this way, GE avoids one of the main difficulties in EAP, as the results of applying genetic operators to the individuals in a population are guaranteed to be syntactically correct. The following scheme shows the way in which GE combines traditional genetic algorithms with genotype-to-phenotype mapping:

1. A random initial population of genotypes is generated.
2. Each member of the population is translated into its corresponding phenotype.
3. The genotype population is sorted by their fitness (computed from the phenotypes).
4. If the best individual is a solution, the process ends.
5. The next generation is created: the mating-pool is chosen with a fitness-proportional parent selection strategy; the genetically modified offspring is generated, and the worst individuals in the population are replaced by them.
6. Go to step 2.

Figure 1: Graphical scheme of a GE process

This procedure is similar in many respects to biological evolution. There are three different levels.
Figure 1 shows a graphical scheme of the process in the particular case studied in this paper: the automatic generation of fractal curves with a given dimension.

• The genotype (nucleic acids) is represented in GE by vectors of integers.
• The intermediate level (proteins) is represented in GE by words in a given alphabet, which in our case describe an L system (see below). The translation from the genotype to the intermediate level is performed by means of a fixed grammar (the equivalent of the fixed genetic code).

• The phenotype (organisms) is in our case represented by the fractal curves obtained from the intermediate-level words by means of a graphical interpretation.
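The generational scheme of steps 1-6 above can be sketched in a few lines of Python. The mini-grammar, the fixed derivation depth and the length-based toy fitness below are illustrative assumptions for the sketch, not the L-system grammar and fitness the paper uses later.

```python
import random

# Toy stand-in grammar: each codon selects one rewriting of the symbol F.
RULES = ["F", "F+F", "F-F", "FF"]  # hypothetical mini-grammar, 4 rules

def map_genotype(genotype):
    """Step 2 of the scheme: deterministically expand 'F' using the codons."""
    word, codons = "F", list(genotype)
    for _ in range(3):  # fixed number of derivation steps (an assumption)
        out = []
        for ch in word:
            if ch == "F" and codons:
                out.append(RULES[codons.pop(0) % len(RULES)])
            else:
                out.append(ch)
        word = "".join(out)
    return word

def fitness(genotype, target_len=12):
    # Toy fitness: phenotypes whose length is close to the target score higher.
    return -abs(len(map_genotype(genotype)) - target_len)

def evolve(pop_size=32, codon_len=8, generations=200, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population of integer genotypes.
    pop = [[rng.randrange(11) for _ in range(codon_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)          # step 3: sort by fitness
        if fitness(pop[0]) == 0:                     # step 4: solution found
            return pop[0]
        parents = pop[: pop_size // 2]               # step 5 (truncation stand-in)
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, codon_len)
            child = a[:cut] + b[cut:]                # one-point crossover
            child[rng.randrange(codon_len)] = rng.randrange(11)  # mutation
            children.append(child)
        pop = parents + children                     # step 6: next generation
    return pop[0]

best = evolve()
```

Note that the genetic operators act only on the integer vectors; syntactic correctness of the phenotype is guaranteed by the mapping, which is the point of the GE design.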
1.2 Fractal curves and their dimension

The concept of dimension is very old and seems easy and evident: sometimes it can be clearly and elegantly defined as the number of directions in which movement is allowed. With this interpretation, dimensions are consecutive integers: 0 (a point), 1 (a line), 2 (a surface), 3 (a volume), with no doubtful cases. This is called a topological dimension. However, as Mandelbrot & Wheeler (1983) describe in their seminal work, some doubtful cases exist: depending on the size of the observer, a ball of thread can be considered as a point (dimension 0, for a large observer), a sphere (dimension 3, for an observer comparable to the ball), a twisted line (dimension 1, for a smaller observer), a twisted cylinder (dimension 2, for an even smaller observer), and so forth.

There is a class of apparently one-dimensional curves for which the concept of dimension is tricky: in 1890, Giuseppe Peano defined a curve which goes through every point in a square, and therefore can be considered as two-dimensional. In 1904, Helge von Koch devised another, whose shape resembles a snowflake and whose length is infinite, although the surface it encloses is limited. Von Koch's snowflake does not fill a surface, therefore its dimension should be greater than 1 but less than 2. In 1919, Hausdorff proposed a new definition of dimension, applicable to such doubtful cases: curves such as those just described may have a fractional dimension, between 1 and 2. Peano's curve has a Hausdorff dimension of 2; Von Koch's snowflake has a Hausdorff dimension of 1.2618595071429... Other alternative definitions of dimension were proposed during the twentieth century, such as the Hausdorff-Besicovitch dimension, the Minkowski dimension, or the box-counting dimension (see Falconer 1990, Yamaguti, et al. 1997).
They differ only in details and are known as fractal dimensions.

The name fractal was introduced in 1975 by Mandelbrot and applies to objects with some special properties, such as a fractal dimension different from their integer topological dimension, self-similarity (containing copies of themselves), and/or non-differentiability at every point.

Fractal curves have been generated or represented by different means, such as fractional Brownian movements, recursive mathematical families of equations (such as those that generate the Mandelbrot set), and recursive transformations (generators) applied to an initial shape (the initiator). They have found applications in antenna design, the generation of natural-looking landscapes for artistic purposes, and many other fields. The generation of fractals with a given dimension can be useful for some of these applications.

This paper discusses only the initiator-generator family of fractals.

1.3 L systems
L systems, devised by Lindenmayer (1968), also called parallel-derivation grammars, differ from Chomsky grammars because derivation is not performed sequentially (a single rule is applied at every step) but in parallel (every symbol is replaced by a string at every step). L systems are appropriate to represent fractal curves obtained by means of recursive transformations (Culik II & Dube 1993). The initiator maps to the axiom of the L system; the generator becomes the set of production rules; recursive applications of the generator to the initiator correspond to subsequent derivations of the axiom. The fractal curve is the limit of the word derived from the axiom when the number of derivations tends to infinity.

Something else is needed: a graphic interpretation which makes it possible to convert the words generated by the L system into visible graphic objects. Two different families of graphic interpretations of L systems have been used: turtle graphics and vector graphics. In a previous paper, we have proved an equivalence theorem between two families of L systems, one associated with a turtle graphics interpretation, the other with vector graphics (Alfonseca & Ortega 1997). Our theorem makes it possible to focus only on turtle graphics without a significant loss of generality.

The turtle graphics interpretation was first proposed by Papert (1980) as the trail left by an invisible turtle, whose state at every instant is defined by its position and the direction in which it is looking. The state of the turtle changes as it moves a step forward or as it rotates by a given angle in the same position. Turtle graphics interpretations may exhibit different levels of complexity. We use here the following version:

• The angle step of the turtle is α = 2kπ/n, where k and n are two integers.
• The alphabet of the L system is expressed as the union of four disjoint subsets: N (non-graphic symbols), D (visible graphic symbols, which move the turtle one step forward, in the direction of its current angle, leaving a visible trail), M (invisible graphic symbols, which move the turtle one step forward, in the direction of its current angle, leaving no visible trail) and extra symbols such as {+, −}, which increase/decrease the turtle angle by α, or a parenthesis pair, which are used in conjunction with a stack to add branches to the images. These symbols are usually associated with trivial L system rules such as + ::= +. In the following, the trivial rules will be omitted but assumed to be present.

A string is said to be angle-invariant with a turtle graphics interpretation if the directions of the turtle at the beginning and the end of the string are the same. In this paper we only consider angle-invariant D0L systems (where D0L describes a deterministic context-free L system), i.e. the set of D0L systems such that the right-hand side of all of their rules is an angle-invariant string.

Summarizing: a fractal curve can be represented by means of two components: an L system and a turtle graphics interpretation, with a given angle step. The length of the moving step (the scale) is reduced at every derivation in the appropriate way, so that the curve always occupies the same space.

2 An algorithm to determine the dimension of a fractal curve from its equivalent L system
Several classic techniques make it possible to estimate the dimension of a fractal curve. All attempt to measure the ratio between how much the curve grows in length and how much it advances. The ruler dimension estimation computes the dimension of a fractal curve as a function of two measurements taken while walking the curve in a number of discrete steps. The first measurement is the pitch length (p_l), the length of the step used, which is constant during the walk. The second is the number of steps needed to reach the end of the walk by walking around the fractal curve, N(p_l). The fractal dimension, D_{p_l}, is defined as

D_{p_l} = \lim_{p_l \to 0^+} -\frac{\log N(p_l)}{\log p_l}    (1)

In a previous work (Alfonseca & Ortega 2001) we presented an algorithm that reaches the same result by computing directly from the L system that represents the fractal curve, without performing any graphical representation. The L system is assumed to be an angle-invariant D0L system with a single draw symbol. The production set consists of a single rule, apart from trivial rules for the symbols +, −, ( and ). Informally, the algorithm takes advantage of the fact that the right side of the only applicable rule provides a symbolic description of the fractal generator, which can be completely described by a single string. The algorithm computes two numbers: the length N of the visible walk followed by the fractal generator (equal in principle to the number of draw symbols in the generator string), and the distance d in a straight line from the start to the endpoint of the walk, measured in turtle step units (this number can also be deduced from the string). The fractal dimension is:

D = \frac{\log N}{\log d}    (2)

The scale is reduced at every derivation in such a way that the distance between the origin and the end of the graphical representation of the strings is always the same.
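Equation (2) can be computed directly from the generator string, as the algorithm describes: walk the string with a turtle, count the draw symbols N and measure the straight-line distance d between the endpoints. The function below is a minimal sketch (not the authors' implementation); it assumes a single draw symbol F and the {+, −} turn symbols, with the angle step given in degrees.

```python
import math

def ruler_dimension(generator, angle_deg):
    """Sketch of D = log N / log d for an angle-invariant generator string."""
    angle = 0.0
    x = y = 0.0
    n_draw = 0
    for symbol in generator:
        if symbol == "F":                      # visible draw symbol: one step forward
            x += math.cos(math.radians(angle))
            y += math.sin(math.radians(angle))
            n_draw += 1
        elif symbol == "+":                    # turn by the angle step
            angle += angle_deg
        elif symbol == "-":                    # turn the other way
            angle -= angle_deg
    d = math.hypot(x, y)                       # straight-line distance start -> end
    return math.log(n_draw) / math.log(d)

# Von Koch generator: N = 4 draw symbols, d = 3 turtle step units
print(ruler_dimension("F+F--F+F", 60))         # ≈ 1.2619 (= log 4 / log 3)
```

For the von Koch generator this reproduces the value derived in the text; any other angle-invariant generator with a single draw symbol can be fed to the same function.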
For instance, the D0L scheme associated with the rule

F ::= F + F − − F + F

with axiom F − − F − − F and a turtle graphic interpretation, where F is a visible graphic symbol and the step angle is 60 degrees, represents the fractal whose fifth derivation appears in figure 2.

The string F + F − − F + F describes the fractal generator. The number of steps along the walk (N) is the number of draw symbols in the string, 4 in this case. The distance d between the extreme points of the generator, computable from the string by applying the turtle interpretation, is 3. Therefore, the dimension is

D = \frac{\log 4}{\log 3} = 1.2618...    (3)

This is the same dimension obtained with other methods, as specified in (Mandelbrot & Wheeler 1983, p. 42).

Figure 2: Von Koch snowflake curve.

This algorithm can be easily extended to fractals whose L systems contain more than one draw symbol and more than one rule, if all the rules preserve the ratio between N and d in the previous expression. Most of the initiator-iterator fractals found in the literature can be represented by angle-invariant D0L systems whose draw symbols contribute to the dimension in this way. The algorithm was also refined to successfully include fractal curves which overlap, either in the generator itself, or after subsequent derivations. In those cases, the definition of the fractal dimension is replaced by

D = \lim \frac{\log N}{\log d}    (4)

where the limit is taken when the number of derivations goes to infinity. Our algorithm computes this case by computing the dimension of a certain number of derivations until the quotient converges.

3 A genetic algorithm to generate fractal curves with a given dimension

Designing fractal curves with a given dimension is relatively easy for certain values of the desired dimension (for instance, 1.261859... or log 4/log 3), but very difficult for others (the reader can try to hand-design a fractal curve with a dimension of 1.255). To do it, one has to find two integer numbers, a and b, such that 1.255 = log a/log b.
Then one has to design a geometrical iterator such that it would take a steps to advance a distance equal to b.

This problem can be solved automatically by means of grammatical evolution. Our genetic algorithm acts on genotypes consisting of vectors of integers and makes use of a fixed grammar to translate the genotypes into an intermediate level, which can be interpreted as a single rule for an L system which, together with a turtle graphic interpretation, generates the final phenotype: a fractal curve with a dimension as close as desired to the target value. The algorithm can be described as follows:

1. Generate a random population of 64 vectors of eight integers in the [0, …] interval.
2. Translate each genotype into a word in the alphabet {F, +, −}, as indicated below.
3. Using the algorithm described in section 2, compute the dimension of the fractal curve represented by the D0L system which uses the preceding word as a generator.
4. Compute the fitness of every genotype as (target − dimension)^{−1}.
5. Order the 64 genotypes from higher to lower fitness.
6. If the highest-fitness genotype has a fitness higher than the target fitness, stop and return its phenotype.
7. From the ordered list of 64 genotypes created in step 5, remove the 16 genotypes with least fitness (leaving 48) and take the 16 genotypes with most fitness. Pair these 16 genotypes randomly to make eight pairs. Each pair generates another pair, a copy of their parents, modified according to four genetic operations (see below). The new 16 genotypes are added to the remaining population of 48 to make again 64, and their fitness is computed as in steps 2 to 4.
8. Go to step 5.

The algorithm has three input parameters: the target dimension (used in step 4), the target minimum fitness (used in step 6) and the angle step for the turtle graphics interpretation (used in step 3).

In step 2, the following grammar is used to translate the genotype of one individual into its equivalent intermediate form (the generator for an L system representing a fractal curve):

0: F ::= F
1: F ::= FF
2: F ::= F+
3: F ::= F−
4: F ::= +F
5: F ::= −F
6: F ::= F+F
7: F ::= F−F
8: F ::= +
9: F ::= −
10: F ::= λ

The translation is performed according to the following developmental algorithm:

1. The axiom (the start word) of the grammar is assumed to be F.
2. As many elements are taken (and removed) from the left of the remainder of the genotype as the number of times the letter F appears in the current word. If there remain too few elements in the genotype, the required number is completed circularly.
3. Each F in the current word is replaced by the right-hand side of the rule with the same number as the integers obtained in the preceding step. With genetic code degeneracy, the remainder of each integer modulo 11 is used instead. In any derivation, the trivial rules + ::= + and − ::= − are also applied.
4. If the genotype is empty, the algorithm stops and returns the last derived word.
5. If the derived word does not contain a letter F, the whole word is replaced by the axiom.
6. Go to step 2.

The four genetic operations mentioned in step 7 of the genetic algorithm are the following:

• Recombination (applied to all the generated genotypes). Given a pair of genotypes, (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_m), a random integer i is generated in the interval [0, min(n, m)]. The resulting recombined genotypes are (x_1, x_2, ..., x_{i−1}, y_i, y_{i+1}, ..., y_m) and (y_1, y_2, ..., y_{i−1}, x_i, x_{i+1}, ..., x_n).

• Mutation, applied to n per cent of the generated genotypes if both parents are equal, and to n per cent if they are different. It consists of replacing a random element of the genotype vector by a random integer in the same interval.

• Fusion, applied to n per cent of the generated genotypes. The genotype is replaced by a catenation of itself with a piece randomly broken from either itself or its brother's genotype. (In some tests, the whole genotype was used, rather than a piece of it.)

• Elision, applied to 5 per cent of the generated genotypes. One integer in a random position of the vector is eliminated.

The last two operations make it possible to generate longer or shorter genotypes from the original eight-element vectors.
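The developmental algorithm above can be sketched in Python. Two details are interpretive choices of this sketch rather than the paper's specification: circular completion is implemented by wrapping an index over the genotype, and the restart-from-axiom rule (step 5) is checked at the start of each derivation.

```python
# Rules 0-10 of the grammar; rule 10 (lambda) is the empty string.
RULES = ["F", "FF", "F+", "F-", "+F", "-F", "F+F", "F-F", "+", "-", ""]

def develop(genotype):
    """Map an integer genotype to a generator word over {F, +, -}."""
    word = "F"                                   # step 1: the axiom
    pos = 0
    while pos < len(genotype):                   # step 4: stop when genotype is used up
        if "F" not in word:                      # step 5: restart from the axiom
            word = "F"
        n_f = word.count("F")
        # step 2: take one codon per F, completing circularly if needed
        taken = [genotype[(pos + j) % len(genotype)] for j in range(n_f)]
        pos += n_f
        # step 3: rewrite every F by the rule selected by its codon (mod 11)
        out, i = [], 0
        for ch in word:
            if ch == "F":
                out.append(RULES[taken[i] % 11])
                i += 1
            else:
                out.append(ch)
        word = "".join(out)
    return word

print(develop([6, 2]))   # F -> F+F -> F++F+F
```

Because every codon selects a production of the grammar, any genotype maps to a syntactically valid generator string, which is the property GE relies on.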
4 Heavy tail and leptokurtic distributions

Heavy tail distributions are probability distributions which exhibit an asymptotic hyperbolic decrease, usually represented as

\Pr\{|X| > x\} \sim C x^{-\alpha},    (5)

where α is a positive constant. Distributions with this property have been used to model random variables whose extreme values are observed with a relatively high probability.

Work on these probability distributions can be traced to Pareto's (1965) work on the earnings distribution or to Levy's (1957) work on the properties of stable distributions. A fundamental advance in the use of heavy tail distributions was provided by Mandelbrot's work (1960, 1963) on the application of fractal behavior and self-similarity to the modeling of real-world phenomena, which he used to introduce stable distributions to model price changes in the stock exchange. Heavy tail distributions have also been used in areas such as statistical physics, weather prediction, earthquake prediction, econometrics and risk theory (Embrechts, et al. 1997, Mandelbrot & Wheeler 1983). In more recent times, these distributions have been used to model waiting times in the World Wide Web (Willinger, et al. 1995) or the computational cost of random algorithms (Gomes 2003, Gomes, et al. 2000a, Gomes, et al. 1998, Gomes, et al. 1997).

For many purposes, the only relevant parameter of a heavy tail distribution is its characteristic exponent α, which determines the ratio of decrease of the tail and the probability of occurrence of extreme events. In this work we only consider heavy tail distributions where α belongs to the (0,
2) interval, withpositive support (Pr { ≤ X < ∞} = 1).The existence or inexistence of the different moments of a distribution is fullydetermined by the behavior of its tail: α can also be regarded as the exponentof the maximum finite moment of the distribution, in the sense that momentsof X of order less than α are finite, while moments of order equal or greater areinfinite. For instance, when α = 1 .
5, the distribution has a finite average and an infinite variance, while for α = 0.5 neither the average nor the variance is finite.

Many procedures have been used to estimate α (Hughey 1991, Adler, et al. 2000, Crato 2000). Two of them have received the most extensive usage. The first uses a maximum likelihood estimator; the second applies a simple regression method.

An important issue while estimating α is how to tackle censored observations, when extreme data are not available. Consider, for instance, physical phenomena such as wind velocity or earthquake magnitude, where heavy tail distributions have been considered appropriate. In these cases, extreme measures are non-observable, since very strong hurricanes or highly destructive earthquakes will damage the measuring instruments. In the processing of financial data, such as stock exchange rates, heavy tail models have also been used (de Lima 1997). In moments of high volatility, when extreme data usually appear, many stock exchange markets introduce rules to limit transactions, or even close the market, to prevent them from taking place. Consider finally the case of random algorithms: the computational costs of some problems are so high that the algorithms have no alternative but to interrupt the execution and start again with a different random seed. In those cases, computational costs are not observable beyond a certain threshold (Gomes, et al. 2000b). Thus the censorship of extreme values needs to be considered by the available estimators.

Let X_{n1} \le X_{n2} \le \ldots \le X_{nn} be the ordered statistics, i.e. the ordered values in the sample X_1, X_2, \ldots, X_n. Let r < n be the truncation value which separates normal from extreme observations.

The adapted Hill-Hall estimator for censored observations is:

\hat{\alpha}_{r,u} = \left( \frac{1}{r} \sum_{j=1}^{r} \ln X_{n,n-r+j} + \frac{u+1}{r} \ln X_{n,n} - \frac{u+r+1}{r} \ln X_{n,n-r} \right)^{-1}.    (6)

In this notation, n is the number of observed data, r + 1 is the number of larger observations selected and u is the number of non-observed extreme values.
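Equation (6) can be sketched as follows. The variable names follow the text (n observed values, the r + 1 largest observations used, u censored extremes); the Pareto sample used for the demonstration, with true α = 1.5 drawn by inverse-transform sampling, is an illustrative assumption.

```python
import math
import random

def hill_hall(sample, r, u=0):
    """Sketch of the adapted Hill-Hall estimator of equation (6)."""
    xs = sorted(sample)                  # ordered statistics X_{n,1} <= ... <= X_{n,n}
    n = len(xs)
    s = sum(math.log(xs[n - r + j - 1]) for j in range(1, r + 1)) / r
    s += (u + 1) / r * math.log(xs[n - 1])          # (u+1)/r * ln X_{n,n}
    s -= (u + r + 1) / r * math.log(xs[n - r - 1])  # (u+r+1)/r * ln X_{n,n-r}
    return 1.0 / s

# Pareto-distributed data with true alpha = 1.5 (inverse-transform sampling)
rng = random.Random(1)
data = [rng.random() ** (-1.0 / 1.5) for _ in range(20000)]
alpha_hat = hill_hall(data, r=1000)      # should land near 1.5
```

A quick scale-invariance check explains the coefficients: multiplying all observations by a constant adds the same amount, weighted 1 + (u+1)/r − (u+r+1)/r = 0, to the bracket, so the estimate is unchanged.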
If all the data are observable, u = 0 and equation (6) becomes the classic Hill-Hall estimator.

In heavy tail distributions, the ratio of decay of the estimated density is hyperbolic (slower than an exponential decay). Thus the one-complement of the accumulated distribution function, F(·), also shows a hyperbolic decay:

\bar{F}(x) = 1 - F(x) = \Pr\{X > x\} \sim C x^{-\alpha}.    (7)

Therefore, for a heavy-tail variable, a log-log graph of the frequency of observed values larger than x should show an approximately linear decay in the tail. Moreover, the slope of the linearly decaying graph is in itself an estimation of α. This can be contrasted with an exponentially decaying tail, where a log-log graph shows a faster-than-linear tail decay.

This simple property, besides giving visual evidence of the presence of a heavy tail, also gives rise to a natural regression estimator based on equation (7), the least-squares estimator (Adler et al. 2000), which can be expressed in terms of a selected number of extreme observations. Assume that we have a sample of k = n + u independent identically distributed random variables. Assume also that we only observe the n smallest values of the random variable X and therefore have the ordered statistics X_{n1} \le X_{n2} \le \ldots \le X_{nn}. Assume that, for X_{n,n-r} \le X \le X_{nn}, the tail distribution has a hyperbolic decay. The least-squares regression estimator for the α exponent is

\hat{\alpha} = - \frac{\sum l_i \log X_{ni} - \sum l_i \sum \log X_{ni} / (r+1)}{\sum (\log X_{ni})^2 - \left(\sum \log X_{ni}\right)^2 / (r+1)},    (8)

where l_i = \log \frac{n+u-i}{n+u} and the sums go from i = n - r to i = n. If all the values in the sample k = n + u can be observed, then u = 0 and k = n.
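Equation (8) is an ordinary least-squares slope: regress the log empirical tail frequencies l_i on the log order statistics over the r + 1 largest observed values, and negate the slope. The sketch below demonstrates it on a censored Pareto sample (true α = 1.5, top u values withheld); the sample sizes and the censoring level are illustrative assumptions.

```python
import math
import random

def lsq_alpha(sample, r, u):
    """Sketch of the least-squares tail estimator of equation (8)."""
    xs = sorted(sample)
    n = len(xs)
    idx = list(range(n - r, n + 1))                    # i = n-r, ..., n
    logx = [math.log(xs[i - 1]) for i in idx]          # log X_{ni} (1-indexed)
    l = [math.log((n + u - i) / (n + u)) for i in idx] # l_i = log((n+u-i)/(n+u))
    m = r + 1
    num = sum(a * b for a, b in zip(l, logx)) - sum(l) * sum(logx) / m
    den = sum(x * x for x in logx) - sum(logx) ** 2 / m
    return -num / den

# Censored Pareto demo (true alpha = 1.5): draw k values, censor the top u
rng = random.Random(7)
k, u = 20000, 50
full = sorted(rng.random() ** (-1.0 / 1.5) for _ in range(k))
observed = full[: k - u]                               # the n = k - u smallest values
alpha_hat = lsq_alpha(observed, r=1000, u=u)           # should land near 1.5
```

Note that with u > 0 the frequencies l_i stay finite even at i = n, which is how the censored extremes enter the estimator.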
The name heavy tail, applied to a class of distributions, expresses their main property: the large proportion of the total probability mass concentrated in the tail, which reflects its slow (hyperbolic) decay and is the reason why all the moments of a heavy tail distribution are infinite, starting at a given order.

The concept of kurtosis is also related to tail heaviness. The kurtosis of a distribution is the quantity \mu_4/\mu_2^2, where \mu_2 and \mu_4 are the second and fourth centralized moments (\mu_2 is the variance). The kurtosis is independent of the localization and scale parameters of a distribution. Kurtosis is high, in general, for a distribution with a high central peak and long tails.

The kurtosis of the standard normal distribution is 3. A distribution with a kurtosis higher than 3 is called leptokurtic, as opposed to platykurtic (see fig. 3). In a similar way to heavy tail distributions, a leptokurtic distribution has long tails with a considerable concentration of probability. However, the tail of a leptokurtic distribution decays faster than that of a heavy tail distribution: all the moments of a leptokurtic distribution can be finite, in strong contrast with a heavy tail distribution where, at most, the first two moments are finite.

Figure 3: Low kurtosis vs. high kurtosis. The probability density function on the right has a higher kurtosis than the one on the left: its center part has a higher peak and its tails are heavier.

5 Heavy-tailed behavior of the execution time

Randomized algorithms with a high execution time variability are suspect of hiding a heavy tail distribution. In the present section we provide empirical evidence that our GE algorithm for the automatic generation of fractal curves may exhibit a heavy tail behavior which can be exploited to improve its performance.

Ortega, et al. (2003) provides data about different executions of the same algorithm to generate fractal curves with the same dimensions, using different random seeds.
The numbers of generations needed to reach the target differ by up to two orders of magnitude (see table 1).

Figure 4 shows the empirical distribution of the number of generations needed to find a solution, i.e.

F(x) = \Pr\{\text{number of generations to reach a solution} \le x\}    (9)

for four different fractal dimensions: 1.3, 1.5, 1.8 and 2. The empirical distribution functions have been obtained by running 1,000 executions with 1,000 different independent random seeds. At the end of each execution, the number of generations needed to reach a solution is recorded. We took a censorship value equal to τ = 5,
000 generations, meaning that, if an execution needs over 5,000 generations, it is stopped and marked as non-observable.

Table 2 shows the percentages of non-observable executions in our experiments. This percentage is quite high, specially for dimensions 1.8 and 1.3. The empirical distribution functions may be used to test whether the distribution has a heavy tail.

Table 2: Percentage of non-observable executions for each dimension and angle (τ = 5,000 generations).

In the previous section (definition 5) we saw that a random variable has a heavy tail behavior if it shows an asymptotic hyperbolic decay, although that behavior can also be shown in its whole support. In the figures displayed in this section, only the extreme values are shown, therefore we had to choose a parameter r to truncate the non-extreme observations. Usually r takes values
in the [1%, …] interval, with r ∈ {…}, as recommended by Crato (2000).

Figure 5 shows the log-log graphs of the distribution tails for fractal dimensions 1.3, 1.5, 1.8 and 2. Notice the linear decay of the function log \bar{F}(x), in contrast with exponential decay distributions, where the decay of log \bar{F}(x) is faster than linear.

The averages for dimensions 1.3, 1.5 and 1.8 are E(X_{1.3}) = 1,…, E(X_{1.5}) = 1,
108 and E(X_{1.8}) = 1,… generations. Figure 6 shows the corresponding box-and-whisker graphs, which give rise to three remarkable conclusions:

• The median (the dashed line within the box in fig. 6) is much smaller than the average (the cross '+' within the box) for dimensions 1.3, 1.5 and 1.8. This suggests that the average of these distributions is biased by the size of the sample, which means that they may have an infinite asymptotic average, typical of heavy tail distributions.

• The sample distribution is strongly biased towards high execution times, indicating a right-hand-side heavy tail. This can be seen in the fact that the lower interquartile distance (the difference between the first quartile, the lower segment of the box, and the median) is shorter than the upper interquartile distance (the difference between the median and the third quartile, the upper segment of the box). Besides this, the distance between the minimum and the first quartile is much less than the distance between the maximum (the highest point of the graph) and the third quartile.

Figure 4: Empirical distribution function of the number of generations needed to reach a solution for several fractal dimensions: 1.3, 1.5, 1.8 and 2.
The preceding section provides visual evidence of heavy tail behavior for dimensions 1.3, 1.5 and 1.8. Evidence for this behavior is weaker for dimension 2, but it is also present in, for instance, the linear decay observed in figure 5. In this section we estimate the characteristic exponent of these distributions, using the estimators presented in section 4.

First we compute the Hill-Hall estimator adapted for censored observations (equation 6). Table 3 confirms that these distributions are heavy tailed, since all the values in the table are less than 2, the limit for heavy tail distributions.

For dimension 1.3, all the estimations (for all values of r) are less than 1, which means that this distribution has neither a finite average nor a finite variance. The same happens, even more strongly, for dimension 1.8, as the values of α are even smaller (all are below 0.7).

Figure 5: Log-log graph of the tail of the (r = 20%) distributions for dimensions 1.3, 1.5, 1.8 and 2.

Dimensions 1.5 and 2 provide examples of heavy tail distributions with a characteristic exponent α between 1 and 2. These distributions have a finite average but an infinite variance, indicating that their right heavy tail is lighter than in the other two distributions.

Figure 7 displays the erratic behavior of the sample average as a function of the sample size.

To verify the reliability of our characteristic exponent estimation, table 4 shows the estimations obtained using the regression estimator described in a previous section (equation 8), which is considered slightly less robust than the maximum likelihood estimator (the adapted Hill-Hall estimator). The results of this estimator are consistent with those of the adapted Hill-Hall estimator.

As mentioned before, in practice one has to select the maximum number of GE generations for especially difficult problems.
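As a reference point for the estimations above, the classical (uncensored) Hill estimator of the tail index can be sketched in a few lines. This is a simplified stand-in, not the adapted Hill-Hall estimator of equation 6, and the Pareto test data are synthetic:

```python
import math
import random

def hill_estimator(sample, k):
    """Classical Hill estimator of the tail index alpha from the k largest
    order statistics:
        alpha_hat = k / sum_{i=1}^{k} ln( X_(n-i+1) / X_(n-k) ),
    where X_(1) <= ... <= X_(n) are the order statistics."""
    xs = sorted(sample, reverse=True)   # descending: xs[0] is the maximum
    threshold = xs[k]                   # X_(n-k), the (k+1)-th largest value
    logs = [math.log(xs[i] / threshold) for i in range(k)]
    return k / sum(logs)

# Illustration on synthetic Pareto data with known alpha = 1.5:
# P(X > x) = x^(-alpha), sampled by inverse transform X = U^(-1/alpha).
random.seed(0)
alpha = 1.5
sample = [random.random() ** (-1.0 / alpha) for _ in range(100_000)]
print(hill_estimator(sample, k=1000))  # close to 1.5
```

The adapted Hill-Hall estimator used in the paper additionally corrects for the censored (truncated) observations produced by the cutoff τ.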
In other words, an appropriate censorship value τ must be chosen, so that the algorithm does not become stagnated in the extreme values of the distribution tail. As a consequence, the tail is truncated. The selection of the value of τ depends on the problem and the algorithm. Ideally, only a small portion of the tail should be truncated, but this may be prohibitive from the computational point of view.

Figure 6: Box-and-whisker type graphs for dimensions 1.3, 1.5, 1.8 and 2.

dimension \ r    1%       2.5%     5%       10%      15%      20%
1.3              0.7827   0.6796   0.8312   0.7953   0.7634   0.7084
1.5              1.1765   1.2400   1.0952   1.0595   0.9952   0.9418
1.8              0.3649   0.4855   0.6746   0.5759   0.5657   0.5705
2                0.7656   0.6043   1.0403   1.0463   0.7732   1.0309

Table 3: Estimations of α for dimensions 1.3, 1.5, 1.8 and 2 using the adapted Hill-Hall estimator.

If the truncation is set at a small number of generations, it will be harder to distinguish between heavy tail and leptokurtic distributions. From a practical point of view this is not a problem, provided there are strong indications that the tail exhibits at least one of the two behaviors. A heavy tail behavior is not a necessary condition to accelerate randomized search methods: in fact, it has been proved that the efficiency of the search in leptokurtic distributions can be improved by randomized backtracking (Gomes 2003). However, with a heavy tail distribution, long executions occur more frequently than with a leptokurtic distribution, making a higher potential acceleration possible.

Table 5 shows the kurtosis for the 4 fractal dimensions considered. Remember that, if this value is greater than 3 (the kurtosis of a normal distribution), the distribution is leptokurtic (with abrupt peaks and heavy tails); otherwise it is platykurtic (with smooth peaks and light tails). In our case, fractal dimensions 1.3, 1.5 and 2 turn out to be leptokurtic, while dimension 1.8 is platykurtic.

Figure 7: Evolution of the sample average as a function of the sample size for dimensions 1.3, 1.5, 1.8 and 2.

Figure 8 shows the histograms built from the execution samples for dimensions 1.3, 1.5, 1.8 and 2.

This section ends with the conclusion that there exists an application of GE, automatic fractal generation, whose distributions exhibit a heavy tail behavior, besides being leptokurtic in many cases.
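The kurtosis test just described is straightforward to reproduce. The following is a minimal sketch using moment-based sample kurtosis on synthetic data (the GE execution samples themselves are not reproduced here):

```python
import random

def kurtosis(xs):
    """Sample kurtosis m4 / m2^2, where m_j is the j-th central sample moment.
    It equals 3 for a normal distribution; larger values indicate a
    leptokurtic (peaked, heavy-tailed) shape, smaller ones a platykurtic one."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / m2 ** 2

random.seed(3)
normal = [random.gauss(0, 1) for _ in range(50_000)]
uniform = [random.random() for _ in range(50_000)]
print(kurtosis(normal))   # close to 3 (the normal reference value)
print(kurtosis(uniform))  # close to 1.8 (platykurtic)
```

Note that, for genuinely heavy-tailed data with α < 2, the theoretical kurtosis is infinite, so sample values such as the 197.1 observed for dimension 2 are sample-size dependent rather than convergent estimates.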
In the next section we show that it is possible to take advantage of this probabilistic characterization to increase the performance of GE and yield a fast fractal generation algorithm.

We have shown that our algorithm may give rise to computational efforts with a leptokurtic or heavy tail distribution. This may be due to the fact that the algorithm makes bad choices more frequently than expected, leading the search to a dead end in the search space, where no solution of the required fitness exists.

The algorithm seems to be more efficient at the beginning of the search, which suggests that a sequence of short executions, compared to a single long execution, may give rise to a better use of the computational resources. In this section we show that the algorithm may be accelerated by the use of several restart strategies.
dimension \ r    1%       2.5%     5%       10%      15%      20%
1.3              0.6952   0.7528   0.7715   0.7904   0.7692   0.7345
1.5              1.1318   1.3790   1.0886   0.9664   0.9786   0.9721
1.8              0.3310   0.5220   0.7285   0.6424   0.5762   0.5762
2                ≈        ≈

Table 4: Estimations of α for fractal dimensions 1.3, 1.5, 1.8 and 2, using the regression estimator.

dimension    1.3      1.5      1.8      2
kurtosis     3.4575   4.1642   2.4817   197.1335

Table 5: Kurtosis computation for dimensions 1.3, 1.5, 1.8 and 2.

Figure 9 displays the result of a restart strategy with a fixed threshold applied to the generation of a fractal curve with dimension 1.3. This is the simplest strategy: once the algorithm has been working for a predefined number of generations θ without reaching the desired goal, a new execution is started with a different random seed. As the figure shows, the failure rate after 500 generations is 70% (F(500) = 0.7) for the original algorithm, while it becomes negligible when restarting every θ = 10 generations.

Such an improvement is typical of heavy tail distributions. The fact that the experimental curve has been so dramatically moved towards the beginning of the support is a clear indication that the heavy tail character of the original distribution has disappeared in the modified algorithm.

Figures 10, 11 and 12 clearly show that the restarts make the tail of the distributions lighter, thus providing a mechanism to handle heavy tail and leptokurtic distributions.

Different fixed thresholds give rise to different average times needed to reach a solution. Table 6 and figure 13 show that the threshold value θ = 6 minimizes the expected cost for fractal dimension 1.3, making it the optimal threshold θ*. For threshold values larger than the optimal, the heavy tail behavior at the right of the median dominates the average cost, while below the optimal value the success percentage is too small and too many restarts are required.
Anyway, many non-optimal choices still provide a considerable acceleration of the algorithm.

It has been proven that the use of a fixed restart threshold θ with a heavy tail distribution eliminates this behavior, in such a way that all the moments of the new distribution become finite (Gomes et al. 2000a).

The idea of a fixed threshold comes from theoretical results by Luby et al. (1993), which describe optimal restart policies. It can be proven that, if the time distribution of the execution is completely known and therefore θ* can be calculated a priori, restarting every θ* generations yields the minimum average execution time.

Figure 8: Histograms with the execution samples obtained for fractal dimensions 1.3, 1.5, 1.8 and 2.

Luby et al. (1993) also provide a strategy (a universal strategy, applicable to every distribution) to minimize the expected cost of random procedures in the case where no a priori knowledge is available. It consists of sequences of executions whose thresholds are powers of two: after two repetitions of a sub-sequence with a given threshold, the threshold is doubled. Let t_i be the number of generations of the i-th execution; the universal strategy is defined as

    t_i = 2^{k-1},               if i = 2^k - 1,
    t_i = t_{i - 2^{k-1} + 1},   if 2^{k-1} ≤ i < 2^k - 1,

yielding strategies of the form

    (1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, 1, . . . ).

Luby et al. (1993) present two theorems which together prove the asymptotic optimality of this procedure for an unknown distribution.

Table 7 summarizes the results of the application of both strategies. The average time using restarts with the universal strategy is approximately twice the time needed using fixed restarts with the optimal threshold.
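The recursive definition of the universal sequence above translates directly into code. A minimal sketch (the function name `luby` is ours):

```python
def luby(i):
    """i-th term (1-indexed) of the Luby et al. universal restart sequence:
        t_i = 2^(k-1)               if i = 2^k - 1,
        t_i = t_(i - 2^(k-1) + 1)   if 2^(k-1) <= i < 2^k - 1."""
    k = 1
    while (1 << k) - 1 < i:          # smallest k with i <= 2^k - 1
        k += 1
    if i == (1 << k) - 1:
        return 1 << (k - 1)          # end of a block: emit 2^(k-1)
    return luby(i - (1 << (k - 1)) + 1)  # otherwise repeat the earlier prefix

print([luby(i) for i in range(1, 16)])
# [1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8]
```

Each threshold value appears before the next power of two is tried, which is what bounds the overhead of the universal strategy relative to the (unknown) optimal fixed threshold.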
Both yield a considerable acceleration with respect to the algorithm without restarts.

Figure 9: Function F(x) for several values of the restart threshold θ ∈ { , , , ∞} applied to fractal dimension 1.3.

In several problems whose execution times had heavy tail distributions, the universal strategy was found to grow 'too slowly': in those problems, the restart sequence takes too many iterations to reach a value near θ* (Cebrián & Cantador 2007, Kautz et al. 2002). A correction was proposed by Walsh (1999), with a new restart strategy which was applied successfully to constraint satisfaction problems. In this simple strategy, each new restart threshold is a constant factor γ greater than the preceding one:

    (1, γ, γ², γ³, . . . ),   γ > 1.    (10)

This strategy has a high probability of success when the restart value t_i = γ^{i-1} is near the optimal restart threshold. Increasing the restart threshold geometrically makes sure that the optimal value will be reached after a few restarts, and the solution is expected to be found shortly after the value of t_i has surpassed the optimal. This strategy has the advantage of being less sensitive to the actual distribution it is applied to.

Table 8 displays the average execution times using the Walsh strategy for several values of γ. It can be seen that the smallest tested value of γ yields the lowest average times for dimensions 1.3, 1.5 and 2.

Figure 10: Function F(x) for several values of the restart threshold θ ∈ { , , , ∞} applied to fractal dimension 1.5.

Heavy tail probability distributions have been used to model several real world phenomena, such as weather patterns or delays in large communication networks. In this paper we have shown that these distributions may also be suitable to model the execution time of an algorithm which uses Grammatical Evolution for automatic fractal generation.
Heavy tail distributions help to explain the erratic behavior of the mean and variance of this execution time, and the long tails exhibited by its distribution.

We have proved that restart strategies mitigate the inconveniences associated with heavy tail distributions and yield a considerable acceleration of the previous algorithm. These strategies exploit the non-negligible probability of finding a solution in short executions, thus reducing the variance of the execution time and the possibility that the algorithm fails, which improves the overall performance.

We have given evidence that several restart strategies are of practical value, even in scenarios with no a priori knowledge about the probability distribution of the execution time.

So far, we have considered situations of complete or nonexistent knowledge. In real situations, the execution time or the resources are bounded, so that some partial knowledge about the execution time is available. In this scenario, we suspect that our algorithm would take advantage of dynamic restart strategies based on predictive models, which have been used successfully to tackle decision and combinatorial problems (Horvitz et al. 2001, Kautz et al. 2002, Ruan et al. 2002). Further research along this line would focus on pinpointing the real-time knowledge about the behavior of the algorithm which would make it possible to build predictive models of its execution time, thus providing a further acceleration.

Figure 11: Function F(x) for several values of the restart threshold θ ∈ { , , , ∞} applied to fractal dimension 1.8.

Finding the conditions for the execution time of a particular Grammatical Evolution experiment to exhibit a heavy tail distribution would also make an interesting research line: is the fractal generation optimization exhibiting a typical behavior, or is it just an exception?

Acknowledgements
This work has been sponsored by the Spanish Ministry of Education and Science(MEC), project number TSI2005-08225-C07-06.
References
R. Adler, et al. (2000). A Practical Guide to Heavy Tails: Statistical Techniques for Analyzing Heavy Tailed Distributions. Birkhäuser.

Figure 12: Function F(x) for several values of the restart threshold θ ∈ { , , , ∞} applied to fractal dimension 2.

M. Alfonseca & A. Ortega (1997). 'A study of the representation of fractal curves by L systems and their equivalences'. IBM Journal of Research and Development (6):727–736.

M. Alfonseca & A. Ortega (2001). 'Determination of fractal dimensions from equivalent L systems'. IBM Journal of Research and Development (6):797–805.

M. Cebrián & I. Cantador (2007). 'Exploiting Heavy Tails in Training Times of Multilayer Perceptrons: A Case Study with the UCI Thyroid Disease Database'. arXiv:0704.2725.

N. Crato (2000). 'Estimation of the Maximal Moment Exponent with Censored Data'. Communications in Statistics - Simulation and Computation 29(4):1239–1254.

K. Culik II & S. Dube (1993). 'L-systems and mutually recursive function systems'.
Acta Informatica (3):279–302.

P. de Lima (1997). 'On the robustness of nonlinearity tests to moment condition failure'. Journal of Econometrics (1):251–280.

P. Embrechts, et al. (1997). Modelling Extremal Events for Insurance and Finance. Springer.

θ       % solved   average cost
2       100%       382.6740
4       100%       277.5730
8       100%       207.8240
16      100%       271.3980
32      100%       345.2680
64      100%       407.2460
128     100%       621.1770
256     99.8%      830.4220
512     98.5%      985
1024    96.4%      1,367
2048    93.7%      1,909

Table 6: Percentage solved and average cost for several threshold values in the fractal dimension 1.3 experiment.

Figure 13: The effect of restarts with fixed θ on the solution costs for fractal dimension 1.3.

dimension   no restart   optimal fixed threshold   universal
1.3         1,173        164.9655 (θ* = 6)         294.867
1.5         1,108        374.2069 (θ* = 10)        622.181
1.8         1,443        248.5263 (θ* = 17)        625.334
2           5.4360       1.1655 (θ* = 1)           1.1701

Table 7: A comparison between average execution times for each dimension without restarts, with an optimal fixed threshold strategy and with the universal strategy.

dimension   γ = 1.?     γ = 1.?     γ = 1.?     γ = 1.?     γ = 2
1.3         441.5138    639.0714    846.0743    773.4603    898.4630
1.5         654.5434    773.9020    845.0908    905.0780    938.3790
1.8         3,115       2,695       2,527       2,437       2,372
2           1.167       1.1827      1.1729      1.1979      1.1783

Table 8: Average execution times using the Walsh strategy for several values of γ.

K. Falconer (1990). Fractal Geometry: Mathematical Foundations and Applications. Wiley.

C. Gomes (2003).
Constraint and Integer Programming: Toward a Unified Methodology, chap. 'Complete randomized backtrack search', pp. 233–283. Kluwer Academic.

C. Gomes, et al. (1997). 'Heavy-tailed distributions in combinatorial search'.
Principles and Practice of Constraint Programming, pp. 121–135.

C. Gomes, et al. (2000a). 'Heavy-Tailed Phenomena in Satisfiability and Constraint Satisfaction Problems'.
Journal of Automated Reasoning (1–2):67–100.

C. Gomes, et al. (2000b). 'Heavy-Tailed Phenomena in Satisfiability and Constraint Satisfaction Problems'. Journal of Automated Reasoning (1):67–100.

C. Gomes, et al. (1998). 'Boosting combinatorial search through randomization'. In Proceedings of the 15th National Conference on Artificial Intelligence, pp. 431–437.

E. Horvitz, et al. (2001). 'A Bayesian approach to tackling hard computational problems'.
In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI-2001).

R. Hughey (1991). 'A survey and comparison of methods for estimating extreme right tail-area quantiles'.
Communications in Statistics - Theory and Methods (4):1463–1496.

H. Kautz, et al. (2002). 'Dynamic restart policies'. In Proceedings of the 18th National Conference on Artificial Intelligence, pp. 674–681.

J. Koza (1992).
Genetic Programming: On the Programming of Computers by Means of Natural Selection. Bradford Books.

P. Lévy (1957). Théorie de l'Addition des Variables Aléatoires.
Gauthier-Villars, Paris.

A. Lindenmayer (1968). 'Mathematical models for cellular interactions in development. I. Filaments with one-sided inputs'.
J. Theor. Biol. (3):280–99.

M. Luby, et al. (1993). 'Optimal speedup of Las Vegas algorithms'. In Proceedings of the 2nd Israel Symposium on Theory and Computing Systems, pp. 128–133.

B. Mandelbrot (1960). 'The Pareto-Lévy Law and the Distribution of Income'.
International Economic Review :79–106.

B. Mandelbrot (1963). 'The Variation of Certain Speculative Prices'. The Journal of Business (4):394–419.

B. Mandelbrot & J. Wheeler (1983). 'The Fractal Geometry of Nature'. American Journal of Physics :286.

M. O'Neill & C. Ryan (2003). Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language. Kluwer Academic Publishers.

A. Ortega, et al. (2003). 'Grammatical evolution to design fractal curves with a given dimension'.
IBM Journal of Research and Development (4):483–493.

S. Papert (1980). Mindstorms: Children, Computers, and Powerful Ideas. Basic Books, New York.

V. Pareto (1965). Écrits sur la courbe de la répartition de la richesse. Librairie Droz.

Y. Ruan, et al. (2002). 'Restart policies with dependence among runs: A dynamic programming approach'.
Proceedings of the Eighth International Conference on Principles and Practice of Constraint Programming (CP-2002), pp. 573–586.

T. Walsh (1999). 'Search in a Small World'. In
Proceedings of the 16th International Joint Conference on Artificial Intelligence, pp. 1172–1177.

W. Willinger, et al. (1995). 'Self-Similarity in High-Speed Packet Traffic: Analysis and Modeling of Ethernet Traffic Measurements'.