Applying machine learning to the problem of choosing a heuristic to select the variable ordering for cylindrical algebraic decomposition
Zongyan Huang, Matthew England, David Wilson, James H. Davenport, Lawrence C. Paulson, James Bridge
University of Cambridge Computer Laboratory, Cambridge CB3 0FD, U.K.
University of Bath, Department of Computer Science, Bath, BA2 7AY, U.K.
{zh242, lp15, jpb65}@cam.ac.uk, {J.H.Davenport, M.England, D.J.Wilson}@bath.ac.uk
Abstract.
Cylindrical algebraic decomposition (CAD) is a key tool in computational algebraic geometry, particularly for quantifier elimination over real-closed fields. When using CAD, there is often a choice for the ordering placed on the variables. This can be important, with some problems infeasible with one variable ordering but easy with another. Machine learning is the process of fitting a computer model to a complex function based on properties learned from measured data. In this paper we use machine learning (specifically a support vector machine) to select between heuristics for choosing a variable ordering, outperforming each of the separate heuristics.
Keywords: machine learning, support vector machine, symbolic com-putation, cylindrical algebraic decomposition, problem formulation
Cylindrical algebraic decomposition (CAD) is a key tool in real algebraic geometry. It was first introduced by Collins [18] to implement quantifier elimination over the reals, but has since been applied to problems including robot motion planning [49], programming with complex valued functions [22], optimisation [28] and epidemic modelling [15]. Decision methods for real closed fields are of great use in theorem proving [25].
MetiTarski [1], for example, decides the truth of statements about special functions using CAD and rational function bounds. When using CAD, we often have a choice over which variable ordering to use. It is well known that this choice is very important and can dramatically affect the feasibility of a problem. In fact, Brown and Davenport [14] presented a class of problems in which one variable ordering gave output of double exponential complexity in the number of variables, and another gave output of constant size. Heuristics have been developed to help with this choice, with Dolzmann et al. [23] giving the best known study. However, in CICM last year [8], it was shown that even the best known heuristic could be misled. Although that paper provided an alternative heuristic, this had its own shortcomings, and it now seems likely that no one heuristic is suitable for all problems.

Our thesis is that the best heuristic to use depends upon the problem considered. However, the relationship between the problems and the heuristics is far from obvious, and so we investigate whether machine learning can help with these choices. Machine learning is a branch of artificial intelligence. It uses statistical methods to infer information from supplied data, which is then used to make predictions for previously unseen data [2]. We have applied machine learning (specifically a support vector machine) to the problem of selecting a variable ordering for both CAD itself and quantifier elimination by CAD, using the nlsat dataset [50] of fully existentially quantified problems. Our results show that the choices made by machine learning are on average superior both to any individual heuristic and to picking a heuristic at random. The results also provide some new insight on the heuristics themselves.
This appears to be the first application of machine learning to problem formulation for computer algebra, although it follows recent applications to theorem proving [10, 31]. We conclude the introduction with background theory on CAD and machine learning. Then in Sections 2, 3 and 4 we describe our experiment, its results and how they may be extended in the future. Finally, in Section 5 we give our conclusions and ideas for future work.
Let Q_i ∈ {∃, ∀} be quantifiers and φ be some quantifier free formula. Then given

    Φ(x_1, . . . , x_k) := Q_{k+1} x_{k+1} . . . Q_n x_n φ(x_1, . . . , x_n),

quantifier elimination (QE) is the problem of producing a quantifier free formula ψ(x_1, . . . , x_k) equivalent to Φ. In the case k = 0 this reduces to the decision problem: is Φ true? Tarski proved that QE is possible for semi-algebraic formulae (polynomial equations and inequalities) over R [47]. However, the complexity of Tarski's method is non-elementary (indescribable as a finite tower of exponentials), and so CAD was a major breakthrough when introduced, despite having complexity doubly exponential in the number of variables. For some problems QE is possible through algorithms with better complexity (see for example the survey by Basu [5]), but CAD implementations remain the best general purpose approach.

Collins' algorithm [18] works in two stages. First, the projection stage calculates sets of projection polynomials S_i in the variables (x_1, . . . , x_i). This is achieved by repeatedly applying a projection operator to a set of polynomials, producing a set with one variable fewer. We start with the polynomials from φ and eliminate variables this way until we have the set of univariate polynomials S_1. Then, in the lifting stage, decompositions of real space in increasing dimensions are formed according to the real roots of those polynomials. First, the real line is decomposed according to the roots of the polynomials in S_1. Then, over each cell c in that decomposition, the bivariate polynomials in S_2 are evaluated at a sample point and a decomposition of c × R is produced according to their roots. Taking the union gives a decomposition of R^2, and we proceed this way to a decomposition of R^n. The decompositions are cylindrical (projections of any two cells onto their first i coordinates are either identical or disjoint) and each cell is a semi-algebraic set (described by polynomial relations).
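As a small worked instance of QE (the standard quadratic example, not drawn from this paper's dataset): eliminating the quantified variable from the statement that a monic quadratic has a real root gives

```latex
% Standard example: real solvability of a monic quadratic.
\exists x \,\bigl( x^2 + b x + c = 0 \bigr)
\;\Longleftrightarrow\;
b^2 - 4c \ge 0
```

A CAD of (b, c, x)-space in which x is projected first decomposes the (b, c)-plane by the discriminant b^2 − 4c, from which the right-hand side can be read off.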
Collins' original algorithm used a projection operator which guaranteed CADs of R^n on which the polynomials in φ had constant sign, and thus φ constant truth value, on each cell. Hence only a single sample point from each cell needed to be tested, and the equivalent quantifier free formula ψ could be generated from the semi-algebraic sets defining the cells in the CAD of R^k for which Φ is true.

Since the publication of the original algorithm, there have been numerous improvements, optimisations and extensions of CAD (with a summary of the first 20 years given by Collins [19]). Of great importance is the improvement of the projection operator used. Hong [29] proved that a refinement of Collins' operator was sufficient, and then McCallum [37] presented a further refinement which could only be used for input that was well-oriented, and which was in turn improved by Brown [11]. Further refinements are possible by removing the need for sign-invariance of the polynomials while maintaining truth-invariance of a formula, with McCallum [38] presenting an operator for use when an equational constraint is present (an equation logically implied by a formula) and Bradford et al. [7] extending this to the case of multiple formulae. Collins and Hong [20] described Partial CAD for QE, where lifting over a cell is aborted if there already exists sufficient information to determine the truth of φ on that cell. Other recent CAD developments of particular note include the use of symbolic-numeric techniques in the lifting stage [33, 45] and the alternative to projection and lifting offered by decompositions of complex space via regular chains technology [17].

When using CAD we have to assign an ordering to the variables (the labels i on the x_i in the discussion above). This dictates the order in which the variables are eliminated during projection, and thus the sub-spaces for which CADs are produced en route to a CAD of R^n.
For some applications this order is fixed, but for others there may be a free or constrained choice. When using CAD for QE we must project quantified variables before unquantified ones. Further, the quantified variables should be projected in the order they occur, unless successive ones have the same quantifier, in which case they may be swapped. The ordering can have a big effect on the output and performance of CAD [8, 14, 23].

Machine learning [2] deals with the design of programs that can learn rules from data. This is often a very attractive alternative to constructing them manually when the underlying functional relationship is very complex. Machine learning techniques have been widely used in many fields, such as web searching [6], text categorization [42], robotics [44], expert systems [27] and many others.

Various machine learning techniques have been developed. McCulloch and Pitts [39] created the first computational model for neural networks, called threshold logic. Following that, Rosenblatt [40] proposed the perceptron as an iterative algorithm for supervised classification of an input into one of several possible non-binary outputs. A later development was the decision tree [2], which is a simple representation for classifying examples; the main idea here is to apply serial classifications which refine the output state. At the same time as the decision tree was being developed, the multi-layer perceptron [30] was explored. It is a modification of the standard linear perceptron and can distinguish data that are not linearly separable. In the last decade, the use of machine learning has spread rapidly following the invention of the
Support Vector Machine (SVM) [41]. This was a development of the perceptron approach and gives a powerful and robust method for both classification and regression. Classification refers to the assignment of input examples into a given set of classes (the output being the class labels). Regression refers to a supervised pattern analysis in which the output is real-valued. The SVM technology can deal efficiently with high-dimensional data, and is flexible in modelling diverse sources of data. The standard SVM classifier takes a set of input data and predicts one of two possible classes from the input. Given a set of examples, each marked as belonging to one of two classes, an SVM training algorithm builds a model that assigns new examples to one of the classes. The examples used to fit the model are called training examples.

An important concept in SVM theory is the use of a kernel function [43], which maps data into a high dimensional kernel-defined feature space and then separates samples in the transformed space. Kernel functions enable operations in the feature space without ever computing the coordinates of the data in that space. Instead they simply compute the inner products between all pairs of data vectors. This operation is generally computationally cheaper than the explicit computation of the coordinates.

The machine learning experiment described in this paper uses SVM-Light (see Joachims [34]), which is an implementation of SVMs in C. The SVM-Light software consists of two programs: svm_learn and svm_classify. svm_learn fits the model parameters based on the training data and user inputs (such as the kernel function and the parameter values). svm_classify uses the generated model to classify new samples. It calculates a hyperplane of the n-dimensional transformed feature space, which is an affine subspace of dimension n − 1 dividing the space into two halves corresponding to the two distinct classes. svm_classify outputs margin values, which are a measure of how far a sample is from this separating hyperplane. Hence the margins are a measure of the confidence in a correct prediction: a large margin represents high confidence. The accuracy of the generated model is largely dependent on the selection of the kernel function and parameter values.
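The margin idea can be illustrated with a toy linear example (pure Python; the hyperplane here is hypothetical, and SVM-Light computes the analogous quantity in its kernel-induced feature space):

```python
# Toy illustration of the SVM margin: for a separating hyperplane
# w.x + b = 0, the signed margin of a sample x is (w.x + b)/||w||.
# Its sign gives the predicted class; its magnitude, the confidence.
import math

def margin(w, b, x):
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

w, b = [3.0, 4.0], -1.0           # hypothetical learned hyperplane
print(margin(w, b, [1.0, 2.0]))   # 2.0: confidently positive
print(margin(w, b, [0.1, 0.1]))   # about -0.06: weakly negative
```

The "most positive (or least negative) margin" rule used later in the paper simply compares these signed values across the three per-heuristic classifiers.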
For the machine learning experiment we decided to focus on a single CAD implementation, Qepcad [12]. We note that other CAD implementations are available, as discussed further in Section 4. Qepcad is an interactive command line program written in C for performing Quantifier Elimination with Partial CAD. It was chosen as it is a competitive implementation of both CAD and QE that also allows the user some control and information during its execution. We used Qepcad with its default settings, which implement McCallum's projection operator [37] and partial CAD [20]. It can also make use of an equational constraint automatically (via the projection operator [38]) when one is explicit in the formula (where explicit means the formula is a conjunction of the equational constraint with a sub-formula). In the experiment we used three existing heuristics for picking a CAD variable ordering:
Brown: This heuristic chooses a variable ordering according to the following criteria, starting with the first and breaking ties with successive ones:
(1) Eliminate a variable first if it has lower overall degree in the input.
(2) Eliminate a variable first if it has lower (maximum) total degree of those terms in the input in which it occurs.
(3) Eliminate a variable first if there is a smaller number of terms in the input which contain the variable.
It is labelled after Brown, who suggested it [13].

sotd: This heuristic constructs the full set of projection polynomials for each permitted ordering and selects the ordering whose corresponding set has the lowest sum of total degrees of the monomials in the polynomials. It is labelled sotd for sum of total degree and was suggested by Dolzmann, Seidl and Sturm [23], whose study found it to be a good heuristic for both CAD and QE by CAD.

ndrr: This heuristic constructs the full set of projection polynomials for each ordering and selects the ordering whose set has the lowest number of distinct real roots of the univariate polynomials within. It is labelled ndrr for number of distinct real roots and was suggested by Bradford et al. [8]. Ndrr was shown to assist with examples where sotd failed.

Brown's heuristic has the advantage of being very cheap, since it acts only on the input and checks only simple properties. The ndrr heuristic is the most expensive (requiring real root isolation), but is the only one to explicitly consider the real geometry of the problem (rather than the geometry in complex space).

All three heuristics may identify more than one variable ordering as a suitable choice. In this case we took the heuristic's choice to be the first of these after they had been ordered lexicographically. This final choice may depend on the convention used for displaying the variable ordering.
Qepcad and the notes where Brown introduces his heuristic [13] use the convention of ordering variables from left to right, so that the last one is projected first. On the other hand, Maple and the papers introducing sotd and ndrr [8, 23] use the opposite convention. The heuristics were implemented in Maple and so ties were broken by picking the first lexicographically on the second convention. This corresponds to picking the first under a reverse lexicographical order under the Qepcad convention. The important point is that all three heuristics had ties broken under the same convention and so were treated fairly.
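As an illustration of the cheapest of the three heuristics, Brown's criteria can be sketched in Python (a hypothetical term-level representation, not the authors' Maple implementation; ties here fall back to the order in which variables are supplied rather than a display convention):

```python
# Sketch of Brown's heuristic: each polynomial is a list of terms,
# each term a dict mapping a variable name to its exponent
# (coefficients are irrelevant to the criteria and are omitted).
def brown_order(polys, variables):
    terms = [t for p in polys for t in p]
    def key(v):
        return (
            # (1) overall degree of v in the input
            max((t.get(v, 0) for t in terms), default=0),
            # (2) max total degree of the terms containing v
            max((sum(t.values()) for t in terms if v in t), default=0),
            # (3) number of terms containing v
            sum(1 for t in terms if v in t),
        )
    # the variable with the smallest key is eliminated first;
    # sorted() is stable, so ties keep the supplied variable order
    return sorted(variables, key=key)

# e.g. x0^2*x1 + x2  and  x1^3 - 1:
polys = [[{"x0": 2, "x1": 1}, {"x2": 1}], [{"x1": 3}, {}]]
print(brown_order(polys, ["x0", "x1", "x2"]))  # ['x2', 'x0', 'x1']
```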
2.2 Problem data

Problems were taken from the nlsat dataset [50], chosen over more traditional CAD problem sets (such as Wilson et al. [48]) as these did not have sufficient numbers of problems for machine learning. 7001 three-variable CAD problems were extracted for our experiment. The number of variables was restricted for two reasons: first, to make it feasible to test all possible variable orderings, and second, to avoid the possibility that
Qepcad will produce errors or warnings related to well-orientedness with the McCallum projection [37].

Two experiments were undertaken, applying machine learning to CAD itself and to QE by CAD. QE is clearly very important throughout engineering and the sciences, but increasingly CAD has been applied outside of this context, as discussed in the introduction. We performed separate experiments since for quantified problems Qepcad can use the partial CAD techniques to stop the lifting process early if the outcome is already determined, while the full process is completed for unquantified ones, and the two outputs can be quite different. The problems from the nlsat dataset are all fully existential (satisfiability or SAT problems). A second set of problems for the quantifier free experiment was obtained by simply removing all quantifiers. An example of the Qepcad input for a SAT problem is given in Figure 1, with the corresponding input for the unquantified problem in Figure 2. Of course, for such quantified problems there are better alternatives to building a CAD (see for example the work of Jovanovic and de Moura [36]). However, our decision to use only SAT problems was based on availability of data rather than it being a requirement of the technology, and so we focus on CAD only here and discuss how we might generalise our data in Section 4. For both experiments, the problems were randomly split into training sets (3545 problems in each), validation sets (1735 problems in each) and test sets (1721 problems in each).

Since each problem has three variables and all the quantifiers are the same, all six possible variable orderings are admissible. For each ordering we had
Qepcad build a CAD and measured the number of cells. The best ordering was defined as the one resulting in the smallest cell count (if more than one ordering gives the minimal count then each is considered best). The decision to focus on cell counts (rather than, say, computation time) was made so that our experiment could validate the use of machine learning for CAD theory, rather than just the Qepcad implementation. Further, it is usually the case that cell counts and timings are strongly correlated. The heuristics (Brown, sotd and ndrr) have been implemented in
Maple (as part of the freely available ProjectionCAD package [26]), and for each problem the orderings suggested by the heuristics were recorded and compared to the cell counts produced by Qepcad. (The data is available at ∼zh242/data.) Note that none of the three heuristics discriminates on the structure of the quantifiers. As discussed above, some heuristics are more expensive than others. However, since none of the costs were prohibitive for our data set, they are not considered here.

Machine learning was applied to predict which of the three heuristics will give an optimal variable ordering for a given problem, where optimal means the lowest cell count of the selected CADs. Note that in the quantified case Qepcad can collapse stacks when sufficient truth values for the constituent cells have been discovered to determine a truth value for the base cell. Hence, since our problems are all fully existential, the output for all quantified problems is always a single cell: true or false. Therefore, in these cases it was not the number of cells in the output that was used, but instead the number of cells constructed during the process (hence the statistics commands in Figures 1 and 2 differ).

Fig. 1: Sample Qepcad input for a quantified problem.

(x0,x1,x2)
0
(Ex0)(Ex1)(Ex2)[[((x0 x0) + ((x1 x1) + (x2 x2))) = 1]].
go
go
go
d-stat
go
finish

Fig. 2: Sample Qepcad input for a quantifier free problem.

(x0,x1,x2)
3
[[((x0 x0) + ((x1 x1) + (x2 x2))) = 1]].
go
go
d-proj-factors
d-proj-polynomials
go
d-fpc-stat
go
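The labelling procedure just described can be sketched as follows (the cell counts below are illustrative placeholders, not values from the dataset):

```python
# For each problem, Qepcad cell counts are gathered for all six
# orderings of the three variables; a heuristic's choice is labelled
# +1 if it attains the minimal cell count, -1 otherwise.
from itertools import permutations

def label(cell_counts, heuristic_choice):
    """cell_counts maps each ordering (a tuple of variables) to its count."""
    best = min(cell_counts.values())
    return 1 if cell_counts[heuristic_choice] == best else -1

orderings = list(permutations(("x0", "x1", "x2")))
assert len(orderings) == 6  # one quantifier block, so all are admissible

# hypothetical counts; note two orderings tie for the minimum of 57
counts = dict(zip(orderings, [57, 103, 57, 411, 235, 89]))
print(label(counts, orderings[0]))  # prints 1: ties for the minimum
print(label(counts, orderings[3]))  # prints -1
```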
To apply machine learning, we need to identify features of the CAD problems that might be relevant to the correct choice of heuristic. A feature is an aspect or measure of the problem that may be expressed numerically. (When comparing, care must be taken when changing between the different variable ordering conventions discussed above.) Table 1 shows the 11 features that we identified, where (x0, x1, x2) are the three variable labels used in all our problems. The number of features is quite small compared to other machine learning experiments. They were chosen as easily computable features of the problems which could affect the performance of the heuristics. Other features were considered (such as the maximum coefficient and the proportion of constraints that were equations) but were not found to be useful. Further investigation into feature selection may be a topic of future work.

Table 1: Description of the features used. The proportion of a variable occurring in polynomials is the number of polynomials containing the variable divided by the total number of polynomials. The proportion of a variable occurring in monomials is the number of terms containing the variable divided by the total number of terms in the polynomials.

Feature number  Description
1   Number of polynomials.
2   Maximum total degree of polynomials.
3   Maximum degree of x0 among all polynomials.
4   Maximum degree of x1 among all polynomials.
5   Maximum degree of x2 among all polynomials.
6   Proportion of x0 occurring in polynomials.
7   Proportion of x1 occurring in polynomials.
8   Proportion of x2 occurring in polynomials.
9   Proportion of x0 occurring in monomials.
10  Proportion of x1 occurring in monomials.
11  Proportion of x2 occurring in monomials.

Each feature vector in the training set was associated with a label, +1 (positive examples) or −1 (negative examples), indicating in which of two classes it was placed. To take Brown's heuristic as an example, a corresponding training set was derived with each problem labelled +1 if Brown's heuristic suggested a variable ordering with the lowest number of cells, or −1 otherwise.

The features could all be easily calculated from the problem input using Maple. For example, feature 1 is simply the number of polynomials in the input, while features 6 to 8 are obtained by counting, for each variable, the polynomials in which it appears and dividing by the total number of polynomials. After the feature generation process, the training data (feature vectors) were normalized so that each feature had zero mean and unit variance across the set. The same normalization was then also applied to the validation and test sets.
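As an illustration (hypothetical Python, not the authors' Maple code; polynomials are represented at the level of terms, with coefficients ignored since only degrees and occurrences matter), the 11 features of Table 1 can be computed as:

```python
# Sketch of the 11 features of Table 1. Each polynomial is a list of
# terms; each term is a dict mapping a variable name to its exponent.
def features(polys, variables=("x0", "x1", "x2")):
    terms = [t for p in polys for t in p]
    f = [float(len(polys)),                              # 1: number of polynomials
         float(max(sum(t.values()) for t in terms))]     # 2: max total degree
    for v in variables:                                  # 3-5: max degree of each variable
        f.append(float(max(t.get(v, 0) for t in terms)))
    for v in variables:                                  # 6-8: proportion of polynomials containing v
        f.append(sum(any(v in t for t in p) for p in polys) / len(polys))
    for v in variables:                                  # 9-11: proportion of terms containing v
        f.append(sum(1 for t in terms if v in t) / len(terms))
    return f

# e.g. x0^2 + x1*x2  and  x1^3 - 1 (coefficients omitted):
polys = [[{"x0": 2}, {"x1": 1, "x2": 1}], [{"x1": 3}, {}]]
print(features(polys))
# [2.0, 3.0, 2.0, 3.0, 1.0, 0.5, 1.0, 0.5, 0.25, 0.5, 0.25]
```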
2.5 Parameter Optimization

SVM-Light was used to do the classification for this experiment. As stated in Section 1.2, SVMs use kernel functions to map the data into higher dimensional spaces where the data may be more easily separated. SVM-Light has four standard kernel functions: linear, polynomial, sigmoid tanh and radial basis function. For each kernel function there are associated parameters which must be set. An earlier experiment applying machine learning to an automated theorem prover [9] found the radial basis function (RBF) kernel performed well in finding a relation between the simple algebraic features and the best heuristic choice. Hence the same kernel was selected for this experiment (other kernel functions may be tested in future work). The RBF kernel is defined as

    K(x, x') = exp(−γ ||x − x'||^2),

where K is the kernel function and x and x' are feature vectors. There is a single parameter γ in the RBF kernel function. Besides γ, two other parameters are involved in the SVM fitting process. The parameter C governs the trade-off between margin and training error, and the cost factor j is used to correct imbalance in the training set; we set it equal to the ratio between negative and positive samples. Given a training set, we can easily compute the value of the parameter j by looking at the signs of the samples. However, it is not so trivial to find the optimal values of γ and C.

In machine learning, the Matthews correlation coefficient (MCC) [4] is often used to evaluate the performance of binary classifications. It takes into account true and false positives and negatives:

    MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)).

In this equation, TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives. The denominator is set to 1 if any sum term is zero. This measure has the value 1 if perfect prediction is attained, 0 if the classifier is performing as a random classifier, and −1 if the classifier exactly disagrees with the data.

A grid-search optimisation procedure was used with the training and validation sets, involving a search over a range of (γ, C) values to find the pair which would maximize MCC. We tested a commonly used range of values of γ and C, each varied over an exponentially spaced grid, in our grid-search process [32]. Following the completion of the grid-search, the values for the kernel function and model parameters giving optimal MCC results were selected for each individual CAD heuristic classifier. We also performed a similar calculation, selecting parameters to maximise the F-score [35], but the results using MCC were superior.

The classifiers with optimal (γ, C) were applied to the test set to output the margin values [21]. In an ideal case, only one classifier would return a positive result for any problem, and selecting a best heuristic would just be a case of observing which classifier returns a positive result. However, in practice more than one classifier will return a positive result for some problems, while no classifier may return a positive result for others. Thus, instead, we used the relative magnitudes of the classifiers' margins in our experiment: the classifier with the most positive (or least negative) margin was selected to indicate the best heuristic.
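The MCC defined above, including the zero-denominator convention, is straightforward to compute:

```python
# Matthews correlation coefficient from a binary confusion matrix,
# with the denominator set to 1 when any sum term is zero.
import math

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        denom = 1.0
    return (tp * tn - fp * fn) / denom

print(mcc(50, 50, 0, 0))    # 1.0: perfect prediction
print(mcc(25, 25, 25, 25))  # 0.0: no better than random
print(mcc(0, 0, 50, 50))    # -1.0: complete disagreement
```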
We use the number of prob-lems for which a selected variable ordering is optimal to measure the efficacy ofeach heuristic separately, and of the heuristic selected by machine learning.Table 2 breaks down the results into a set of mutually exclusive outcomesthat describe all possibilities. The column headed ‘Machine Learning’ indicatesthe heuristic selected by the machine learned model with the next three columnsindicating each of the fixed heuristics tested. For each of these four heuristics, wemay ask the question “Did this heuristic select the optimal variable ordering?” A‘Y’ in the table indicates yes and an ‘N’ indicates no, with each of the 13 caseslisted covering all possibilities. Note that at least one of the fixed heuristics musthave a ‘Y’ since, by definition, the optimal ordering is obtained by at least oneheuristic while if they all have a Y it is not possible for machine learning tofail. For each of these cases we list the number of problems for which this caseoccurred for both the quantifier free and quantified experiments.
Table 2:
Categorising the problems into a set of mutually exclusive cases characterised by which heuristics were successful.

Case  Machine Learning  sotd  ndrr  Brown  Quantifier Free  Quantified
1     Y                 Y     Y     Y      399              573
2     Y                 Y     Y     N      146              96
3     N                 Y     Y     N      39               24
4     Y                 Y     N     Y      208              232
5     N                 Y     N     Y      35               43
6     Y                 N     Y     Y      64               57
7     N                 N     Y     Y      7                11
8     Y                 Y     N     N      106              66
9     N                 Y     N     N      106              75
10    Y                 N     Y     N      159              101
11    N                 N     Y     N      58               89
12    Y                 N     N     Y      230              208
13    N                 N     N     Y      164              146

For many problems more than one heuristic selects the optimal variable ordering, and the probability of a randomly selected heuristic giving the optimal ordering depends on how many pick it. For example, a random selection would be successful 1/3 of the time if one heuristic gives the optimal ordering, or 2/3 of the time if two heuristics do so.

In Table 2, case 1 is where machine learning cannot make any difference, as all heuristics are equally optimal. We compare the remaining cases pairwise. For each pair, the behaviour of the fixed heuristics is identical and the difference is whether or not machine learning picked a winning heuristic (one of those with a 'Y'). We see that in each case machine learning succeeds far more often than it fails. For each pair we can compare with a random heuristic selection. For example, consider cases 2 and 3, where sotd and ndrr are successful heuristics and Brown is not. A random selection would be successful 2/3 of the time. For the quantifier free examples, the machine learned selection is successful 146/(146 + 39) of the time, or approximately 79%, which is significantly better.
We repeated this calculation for the quantified case and the other pairs, as shown in Table 3. In each case the values have been compared to the chance of success when picking a random heuristic, and so there are two distinct sets in Table 3: those where only one heuristic was optimal and those where two were. We see that machine learning did better for some classes of problems than others. For example, in the quantifier free examples, when only one heuristic is optimal machine learning does considerably better if that one is ndrr, while when only one heuristic is not optimal machine learning does worst if that one is Brown. Nevertheless, the machine learning selection is better than random in every case in both experiments.

Table 3:
Proportion of examples where machine learning picks a successful heuristic.

sotd  ndrr  Brown  Quantifier Free  Quantified
Y     Y     N      79% (>67%)       80% (>67%)
Y     N     Y      86% (>67%)       84% (>67%)
N     Y     Y      90% (>67%)       84% (>67%)
Y     N     N      50% (>33%)       47% (>33%)
N     Y     N      73% (>33%)       53% (>33%)
N     N     Y      58% (>33%)       59% (>33%)
By summing the numbers in Table 2 in which a 'Y' appears in a row for the machine learned selection and each individual heuristic, we get Table 4. This compares, for both the quantifier free and quantified problem sets, the learned selection with each of the CAD heuristics on their own. Of the three heuristics, Brown's seems to be the best, albeit by a small margin. Its performance is a little surprising, both because the Brown heuristic is not so well known (having never been formally published) and because it requires little computation (taking only simple measurements on the input).

Table 4:
Total number of problems for which each heuristic picks the best ordering.

                 Machine Learning  sotd  ndrr  Brown
Quantifier free  1312              1039  872   1107
Quantified       1333              1109  951   1270
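From the counts in Tables 2 and 4, the overall success rates for the quantifier free problems can be checked directly (all numbers are taken from the tables; the rounding is ours):

```python
# Success rates on the 1721 quantifier free test problems.
total = 1721
n_all, n_two, n_one = 399, 499, 823  # problems where 3, 2 or 1 heuristics were optimal
assert n_all + n_two + n_one == total

# a random heuristic succeeds with probability 1, 2/3 or 1/3 respectively
random_rate = (n_all + n_two * 2 / 3 + n_one * 1 / 3) / total
ml_rate = 1312 / total      # machine learning, from Table 4
brown_rate = 1107 / total   # best single heuristic (Brown), from Table 4

print(round(random_rate, 2), round(ml_rate, 2), round(brown_rate, 2))
# prints 0.58 0.76 0.64: machine learning beats both baselines
```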
For the quantifier free problems there were 399 problems where every heuristic picked the optimal ordering, 499 where two did and 823 where one did. Hence, for this problem set, the chance of picking a successful heuristic at random is

    (399 + 499 × (2/3) + 823 × (1/3)) / 1721 ≃ 58%,

which compares with 1312/1721 ≃ 76% for machine learning. For the quantified problems the corresponding figures are 64% and 77%. Hence machine learning performs significantly better than a random choice in both cases. Further, if we were to use only the heuristic that performed best on this data, the Brown heuristic, then we would pick a successful ordering for approximately 64% of the quantifier free problems and 74% of the quantified problems. So we see that a machine learned choice is also superior to using any one heuristic.

Although a large data set of real world problems was used, we note that in some ways the data was quite uniform. A key area of future work is experimentation on a wider data set to see if these results, both the benefit of machine learning and the superiority of Brown's heuristic, are verified more generally. An initial extension would be to relax the parameters used to select problems from the nlsat dataset, for example by allowing problems with more variables.

One key restriction with this dataset is that all problems have one block of existential quantifiers. Note that our restriction to this case followed the availability of data rather than any technical limitation of the machine learning. Possible ways to generalise the data include randomly applying quantifiers to the existing problems, or randomly generating whole problems. However, this would mean the problems no longer originate from real applications, and it has been noted in the past that random problems for CAD can be unrepresentative.

We do not suggest SVM as the only suitable machine learning method for this experiment, but overall an SVM with the RBF kernel worked well here. It would be interesting to see if other machine learning methods could offer similar or even better selections. Further improvements may also come from more work on feature selection. The features used here were all derived from the polynomials involved in the input.
One possible extension would be to consider also the type of relations present and how they are connected logically (likely to be particularly beneficial if problems with more variables or more varied quantifiers are allowed).

A key extension for future work will be the testing of other heuristics: for example, the greedy sotd heuristic [23], which chooses an ordering one variable at a time based on the sotd of the new projection polynomials, or combined heuristics (where we narrow the selection with one heuristic and then break the tie with another). We also note that there are other questions of CAD problem formulation besides variable ordering [8] for which machine learning might be of benefit.

Finally, we note that there are other CAD implementations. In addition to
Qepcad there is ProjectionCAD [26], RegularChains [17] and SyNRAC [33] in Maple, Mathematica [46], and Redlog [24] in Reduce. Each implementation has its own intricacies and often different underlying theory, so it would be interesting to test whether machine learning can assist with these as it does with
Qepcad.

Conclusions

We have investigated the use of machine learning for making the choice of which heuristic to use when selecting a variable ordering for CAD, and for quantifier elimination by CAD. The experimental results confirmed our thesis, drawn from personal experience, that no one heuristic is superior for all problems and that the correct choice will depend on the problem. Each of the three heuristics tested had a substantial set of problems for which it was superior to the others, and so the problem was a suitable application for machine learning.

Using machine learning to select the best CAD heuristic yielded better results than choosing one heuristic at random, or just using any of the individual heuristics in isolation, indicating that there is a relation between the simple algebraic features and the best heuristic choice. This could lead to the development of a new individual heuristic in the future.

The experiments involved testing heuristics on 1721 CAD problems, certainly the largest such experiment that the authors are aware of. For comparison, the best known previous study on such heuristics [23] tested with six examples. We observed that Brown's heuristic is the most competitive for our example set, and this despite it involving less computation than the others. This heuristic was presented during an ISSAC tutorial in 2004 (see Brown [13]), but does not seem to have been formally published. It certainly deserves to be better known.

Finally, we note that CAD is certainly not unique amongst computer algebra algorithms in requiring the user to make such a choice of problem formulation. More generally, computer algebra systems (CASs) often have a choice of possible algorithms to use when solving a problem. Since a single formulation or algorithm is rarely the best for the entire problem space, CASs usually use meta-algorithms to make such choices, where decisions are based on some numerical parameters [16].
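To make the idea of such a meta-algorithm concrete, the sketch below dispatches between variable-ordering heuristics using simple numerical parameters of the input. The feature names, thresholds and dispatch rules are purely illustrative assumptions, not taken from any existing CAS or from the experiments in this paper:

```python
# Hypothetical meta-algorithm: choose a variable-ordering heuristic from
# simple numerical parameters of the input polynomials. The features and
# thresholds here are invented for illustration only; a real system would
# tune such choices, e.g. against measured CAD timings as the machine-learned
# selector in this paper does.

def choose_heuristic(num_polynomials: int, max_total_degree: int) -> str:
    """Return the name of a variable-ordering heuristic to apply."""
    if max_total_degree <= 2:
        return "brown"   # cheap heuristic suffices on low-degree input (assumed rule)
    if num_polynomials > 20:
        return "ndrr"    # assumed rule for inputs with many polynomials
    return "sotd"        # fall back to sum-of-total-degrees

print(choose_heuristic(num_polynomials=5, max_total_degree=2))  # → brown
```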
These are often not as well documented as the base algorithms, and may be rather primitive. To the best of our knowledge, the present paper appears to be the first to apply machine learning to problem formulation for computer algebra. The positive results should encourage investigation of similar applications in the field of symbolic computation.

Acknowledgements
This work was supported by EPSRC grant EP/J003247/1 and the China Scholarship Council (CSC). The authors thank the anonymous referees for useful comments which improved the paper.
References
1. B. Akbarpour and L. Paulson. MetiTarski: An automatic theorem prover for real-valued special functions. Journal of Automated Reasoning, 44(3):175–205, 2010.
2. E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
3. D. Arnon, G. Collins, and S. McCallum. Cylindrical algebraic decomposition I: The basic algorithm. SIAM Journal of Computing, 13:865–877, 1984.
4. P. Baldi, S. Brunak, Y. Chauvin, C. A. Andersen, and H. Nielsen. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5):412–424, 2000.
5. S. Basu. Algorithms in real algebraic geometry: A survey. Available from: ∼sbasu/raag_survey2011_final.pdf, 2011.
6. J. Boyan, D. Freitag, and T. Joachims. A machine learning architecture for optimizing web search engines. In AAAI Workshop on Internet Based Information Systems, pages 1–8, 1996.
7. R. Bradford, J. Davenport, M. England, S. McCallum, and D. Wilson. Cylindrical algebraic decompositions for boolean combinations. In Proc. ISSAC '13, pages 125–132. ACM, 2013.
8. R. Bradford, J. Davenport, M. England, and D. Wilson. Optimising problem formulations for cylindrical algebraic decomposition. In Intelligent Computer Mathematics (LNCS 7961), pages 19–34. Springer Berlin Heidelberg, 2013.
9. J. P. Bridge. Machine learning and automated theorem proving. University of Cambridge Computer Laboratory Technical Report UCAM-CL-TR-792, 2010. Available from: .
10. J. Bridge, S. Holden, and L. Paulson. Machine learning for first-order theorem proving. Journal of Automated Reasoning, pages 1–32, 2014.
11. C. Brown. Improved projection for cylindrical algebraic decomposition. Journal of Symbolic Computation, 32(5):447–465, 2001.
12. C. Brown. QEPCAD B: A program for computing with semi-algebraic sets using CADs. ACM SIGSAM Bulletin, 37(4):97–108, 2003.
13. C. Brown. Companion to the tutorial: Cylindrical algebraic decomposition, presented at ISSAC '04. Available from: , 2004.
14. C. Brown and J. Davenport. The complexity of quantifier elimination and cylindrical algebraic decomposition. In Proc. ISSAC '07, pages 54–60. ACM, 2007.
15. C. Brown, M. E. Kahoui, D. Novotni, and A. Weber. Algorithmic methods for investigating equilibria in epidemic modelling. Journal of Symbolic Computation, 41:1157–1173, 2006.
16. J. Carette. Understanding expression simplification. In Proc. ISSAC '04, pages 72–79. ACM, 2004.
17. C. Chen, M. M. Maza, B. Xia, and L. Yang. Computing cylindrical algebraic decomposition via triangular decomposition. In Proc. ISSAC '09, pages 95–102. ACM, 2009.
18. G. Collins. Quantifier elimination for real closed fields by cylindrical algebraic decomposition. In Proc. 2nd GI Conference on Automata Theory and Formal Languages, pages 134–183. Springer-Verlag, 1975.
19. G. Collins. Quantifier elimination by cylindrical algebraic decomposition – 20 years of progress. In Quantifier Elimination and Cylindrical Algebraic Decomposition, Texts & Monographs in Symbolic Computation, pages 8–23. Springer-Verlag, 1998.
20. G. Collins and H. Hong. Partial cylindrical algebraic decomposition for quantifier elimination. Journal of Symbolic Computation, 12:299–328, 1991.
21. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
22. J. Davenport, R. Bradford, M. England, and D. Wilson. Program verification in the presence of complex numbers, functions with branch cuts etc. In Proc. SYNASC '12, pages 83–88. IEEE, 2012.
23. A. Dolzmann, A. Seidl, and T. Sturm. Efficient projection orders for CAD. In Proc. ISSAC '04, pages 111–118. ACM, 2004.
24. A. Dolzmann and T. Sturm. REDLOG: Computer algebra meets computer logic. SIGSAM Bulletin, 31(2):2–9, 1997.
25. A. Dolzmann, T. Sturm, and V. Weispfenning. Real quantifier elimination in practice. In Algorithmic Algebra and Number Theory, pages 221–247. Springer, 1998.
26. M. England. An implementation of CAD in Maple utilising problem formulation, equational constraints and truth-table invariance. University of Bath Department of Computer Science Technical Report 2013-04, 2013. Available from: http://opus.bath.ac.uk/35636/.
27. R. Forsyth and R. Rada. Machine Learning: Applications in Expert Systems and Information Retrieval. Halsted Press, 1986.
28. I. Fotiou, P. Parrilo, and M. Morari. Nonlinear parametric optimization using cylindrical algebraic decomposition. In Decision and Control / 2005 European Control Conference (CDC-ECC '05), pages 3735–3740, 2005.
29. H. Hong. An improvement of the projection operator in cylindrical algebraic decomposition. In Proc. ISSAC '90, pages 261–264. ACM, 1990.
30. K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
31. Z. Huang and L. Paulson. An application of machine learning to RCF decision procedures. In Proc. 20th Automated Reasoning Workshop, 2013.
32. C. Hsu, C. Chang, and C. Lin. A practical guide to support vector classification. 2003.
33. H. Iwane, H. Yanami, H. Anai, and K. Yokoyama. An effective implementation of a symbolic-numeric cylindrical algebraic decomposition for quantifier elimination. In Proc. SNC '09, pages 55–64, 2009.
34. T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods – Support Vector Learning, pages 169–184. MIT Press, 1999.
35. T. Joachims. A support vector method for multivariate performance measures. In Proc. 22nd Intl. Conf. on Machine Learning, pages 377–384. ACM, 2005.
36. D. Jovanovic and L. de Moura. Solving non-linear arithmetic. In Automated Reasoning: 6th International Joint Conference (IJCAR) (LNCS 7364), pages 339–354. Springer, 2012.
37. S. McCallum. An improved projection operation for cylindrical algebraic decomposition. In Quantifier Elimination and Cylindrical Algebraic Decomposition, Texts & Monographs in Symbolic Computation, pages 242–268. Springer-Verlag, 1998.
38. S. McCallum. On projection in CAD-based quantifier elimination with equational constraint. In Proc. ISSAC '99, pages 145–149. ACM, 1999.
39. W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
40. F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
41. B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004.
42. F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.
43. J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
44. P. Stone and M. Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, 2000.
45. A. Strzeboński. Cylindrical algebraic decomposition using validated numerics. Journal of Symbolic Computation, 41(9):1021–1038, 2006.
46. A. Strzeboński. Solving polynomial systems over semialgebraic sets represented by cylindrical algebraic formulas. In Proc. ISSAC '12, pages 335–342. ACM, 2012.
47. A. Tarski. A decision method for elementary algebra and geometry. In Quantifier Elimination and Cylindrical Algebraic Decomposition, Texts and Monographs in Symbolic Computation, pages 24–84. Springer-Verlag, 1998.
48. D. Wilson, R. Bradford, and J. Davenport. A repository for CAD examples. ACM Communications in Computer Algebra, 46(3):67–69, 2012.
49. D. Wilson, J. Davenport, M. England, and R. Bradford. A "piano movers" problem reformulated. In Proc. SYNASC '13. IEEE, 2013.
50. The benchmarks used in solving nonlinear arithmetic. New York University, 2012. Available from: http://cs.nyu.edu/∼∼