Improved cross-validation for classifiers that make algorithmic choices to minimise runtime without compromising output correctness
Dorian Florescu and Matthew England
Faculty of Engineering, Environment and Computing, Coventry University, Coventry, CV1 5FB, UK
{Dorian.Florescu, Matthew.England}@coventry.ac.uk

Abstract.
Our topic is the use of machine learning to improve software by making choices which do not compromise the correctness of the output, but do affect the time taken to produce such output. We are particularly concerned with computer algebra systems (CASs), and in particular, our experiments are for selecting the variable ordering to use when performing a cylindrical algebraic decomposition of n-dimensional real space with respect to the signs of a set of polynomials. In our prior work we explored the different ML models that could be used, and how to identify suitable features of the input polynomials. In the present paper we both repeat our prior experiments on problems which have more variables (and thus exponentially more possible orderings), and examine the metric which our ML classifiers target. The natural metric is computational runtime, with classifiers trained to pick the ordering which minimises this. However, this leads to the situation where models do not distinguish between any of the non-optimal orderings, whose runtimes may still vary dramatically. In this paper we investigate a modification to the cross-validation algorithms of the classifiers so that they do distinguish these cases, leading to improved results.

Keywords: machine learning; cross-validation; computer algebra; symbolic computation; cylindrical algebraic decomposition
1 Introduction

Machine Learning (ML), that is statistical techniques to give computer systems the ability to learn rules from data, is a topic that has found great success in a diverse range of fields over recent years. ML is most attractive when the underlying functional relationship to be modelled is complex or not well understood. With regards to the creation of software itself, while ML has a history of use for testing and security analysis [26] it is less often used in the actual algorithms. On the surface, this would be especially true for software that prizes mathematical correctness, such as computer algebra systems (CASs). Here, a thorough understanding of the underlying relationships would seem to be a pre-requisite.
However, CAS developers would acknowledge that their software actually comes with a range of options that, while having no effect on the correctness of the end result, can have a great effect on how long it takes to produce the result and exactly what form that result takes. These choices range from the low level (in what order to perform a search that may terminate early) to the high (which of a set of competing exact algorithms to use for this problem instance).

A well-known example is the choice of monomial ordering for a Gröbner Basis. This choice is actually quite abnormal in that there has been much study devoted to it and there are some clear pieces of advice to follow (e.g. that degrevlex ordering is the easiest to compute, and that if a lex ordering is needed it would be best to first compute a degrevlex basis and then convert). A better example of the choices we consider would be the underlying variable order that is required to define any monomial ordering, for which there exists no such clear advice.

In practice these less understood choices are usually either left entirely to the user, taken by human-made heuristics based on some experimentation (e.g. [19]), or made according to magic constants where crossing a single threshold changes system behaviour [11]. Our main thesis is that many of these decisions could be improved by allowing ML algorithms to analyse the data.
Our experiments concern variable orderings for another prominent symbolic computation algorithm: Cylindrical Algebraic Decomposition (CAD). CAD is an expensive procedure, with the choice of ordering affecting not only computation time but often the tractability of even considering a problem. We introduce the necessary background on CAD and its orderings in Section 2. We describe our prior work using ML to make this choice [29], [28], [23], [25] in Section 3, which includes experimenting with a range of ML models, and developing techniques to generate suitable features from the input data. This prior work was all conducted on a large dataset of 3-variable problems (a choice from 6 orderings).

The new contributions of the present paper are two-fold. First, we have applied our prior methodology to a dataset of 4-variable problems (choice from 24 orderings) and we report on how it handled this increased complexity. Secondly, we examine and improve the training goal of our ML classifiers. The natural metric for this problem is runtime, and our old classifiers are trained to pick the ordering which minimises this for a given CAD input. However, this meant our training did not distinguish between any of the non-optimal orderings even though the difference between these could be huge. In Section 4 we report on a new cross-validation approach for our classifiers which aims to make them aware of these different shades of wrong, and thus make choices which reduce the overall runtime even if the number of problems where the classifiers pick the absolute best runtime is unchanged.

In Sections 5 and 6 we describe the methodology and results respectively for our new experiments on choosing the variable ordering for 4-variable CAD problems, both with and without the new cross-validation approach. We also compare against the best known human-made heuristics.
2 Background

2.1 Cylindrical Algebraic Decomposition

A Cylindrical Algebraic Decomposition (CAD) is a decomposition of ordered R^n space into cells arranged cylindrically: the projections of any pair of cells with respect to the variable ordering are either equal or disjoint. I.e. the projections all lie within cylinders over the cells of an induced CAD of the lower dimensional space. All these cells are (semi)-algebraic, meaning each can be described with a finite sequence of polynomial constraints.

A CAD is usually produced to be truth-invariant for a logical formula, meaning the formula is either true or false on each cell. Such a decomposition can then be used to analyse the formula, and for example, perform Quantifier Elimination (QE) over the reals. I.e. given a quantified Tarski formula in prenex normal form we can find an equivalent quantifier free formula over the reals by building a CAD for the quantifier-free part of the formula, querying a finite number of sample points (one from each cell), and then using the corresponding cell descriptions. For example, QE could transform ∃x [ax^2 + bx + c = 0 ∧ a ≠ 0] to the equivalent unquantified statement b^2 − 4ac ≥ 0 ∧ a ≠ 0 (working with the polynomials in the variables x, a, b, c). In practice, the quantifier free equivalent would come as the disjunction of several parts (one from each cell) which logically simplify to the stated result.

CAD was introduced by Collins in 1975 [16] and works relative to a set of polynomials. Collins' CAD produces a decomposition so that each polynomial has constant sign on each cell (thus truth-invariant for any formula built with those polynomials). The algorithm first projects the polynomials into smaller and smaller dimensions; and then uses these projections to lift, incrementally building decompositions of larger and larger spaces according to the polynomials at that level.
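As a sanity check of this QE example, the quantified statement and its quantifier-free equivalent can be compared numerically. The following sketch is our own illustration (not part of the paper's toolchain); it tests for a real root via numpy's companion-matrix root finder, which is independent of the discriminant formula.

```python
import numpy as np

def exists_real_root(a, b, c, tol=1e-6):
    """Numerically check ∃x. a*x^2 + b*x + c = 0 (assuming a != 0),
    via the eigenvalues of the companion matrix (numpy.roots)."""
    return bool(any(abs(r.imag) < tol for r in np.roots([a, b, c])))

def qe_result(a, b, c):
    """The quantifier-free equivalent produced by QE: b^2 - 4ac >= 0."""
    return b * b - 4 * a * c >= 0

# Away from the boundary b^2 = 4ac the two predicates agree on every sample,
# illustrating that QE preserved the solution set in (a, b, c)-space.
rng = np.random.default_rng(0)
for a, b, c in rng.uniform(-5, 5, size=(500, 3)):
    if abs(a) > 0.1 and abs(b * b - 4 * a * c) > 1e-3:
        assert exists_real_root(a, b, c) == qe_result(a, b, c)
```

The guard skipping samples near the boundary avoids spurious disagreements caused by floating-point noise in the root finder.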
There have been a great many developments in the theory and implementation of CAD since Collins' original work which we do not describe here. The collection [12] summarises the work up to the mid-90s while the second author's journal articles [5], [21] attempt summaries of CAD progress since in their introduction and background sections. CAD is the backbone of all QE implementations as it is the only implemented complete procedure for the problem. QE has numerous applications throughout science and engineering [38] (recently even economics [35], [36]), which would in turn benefit from faster CAD. Our work also speeds up independent applications of CAD, such as reasoning with multi-valued functions [18], motion planning [40], and identifying multistationarity in biological networks [3], [4].

2.2 The Variable Ordering

The definition of cylindricity and both stages of the CAD algorithm are relative to an ordering of the variables. For example, given polynomials in variables ordered as x_n ≻ x_{n−1} ≻ · · · ≻ x_2 ≻ x_1, we first project away x_n and so on until we are left with polynomials univariate in x_1. We then start lifting by decomposing the x_1-axis, and then the (x_1, x_2)-plane and so on. The cylindricity condition refers to projections of cells in R^n onto a space (x_1, . . . , x_m) where m < n. As noted above there have been numerous advances to CAD since its inception but the need for a fixed variable ordering remains.

Depending on the application, the variable ordering may be determined, constrained, or free. QE requires that quantified variables are eliminated first and that variables are eliminated in the order in which they are quantified. However, variables in blocks of the same quantifier (and the free variables) can be swapped, so there is partial freedom. In the example discussed in Section 2.1 we may use any variable ordering that projects the quantified variable x first to perform the QE and discover the discriminant.
A CAD for the quadratic polynomial under ordering a ≺ b ≺ c has only 27 cells, but we need 115 for the reverse ordering.

This choice of variable ordering can have a great effect on the time and memory use of CAD, and the number of cells in the output (how coarse or fine the decomposition is). In fact, Brown and Davenport presented a class of problems in which one variable ordering gave output of double exponential complexity in the number of variables and another output of a constant size [10].

Heuristics have been developed to choose a variable ordering, with Dolzmann et al. [19] giving the best known study. After analysing a variety of metrics they proposed a heuristic, sotd, which constructs the full set of projection polynomials for each permitted ordering and selects the ordering whose corresponding set has the lowest sum of total degrees for each of the monomials in each of the polynomials. The second author demonstrated examples for which that heuristic could be misled in [6]; and then later showed that tailoring to an implementation could improve performance [22]. These heuristics all involved potentially costly projection operations on the input polynomials.

Another human-made heuristic was proposed by Brown in his ISSAC 2004 tutorial notes [9]. This chooses a variable ordering according to the following criteria, starting with the first and breaking ties with successive ones:

1) Eliminate a variable first if it appears with the lowest overall individual degree in the input.
2) For each variable calculate the maximum total degree (i.e. sum of the individual degrees) for the set of terms in the input in which it occurs. Eliminate first the variable for which this is lowest.
3) Eliminate a variable first if there is a smaller number of terms in the input which contain the variable.

The Brown heuristic is far cheaper than the sotd heuristic (because the latter performs projections before measuring degrees).
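Brown's three criteria are simple enough to sketch directly. The sparse dictionary representation of monomials below is our own choice for illustration; any ties remaining after all three criteria are left to the (stable) sort.

```python
def brown_ordering(polys, variables):
    """Sketch of Brown's heuristic [9] for choosing a CAD variable ordering.

    `polys` is a list of polynomials, each a list of monomials, and each
    monomial a dict mapping a variable name to its exponent.
    Returns the variables in elimination order (project away the first one first).
    """
    terms = [m for p in polys for m in p]

    def key(v):
        occurs = [m for m in terms if m.get(v, 0) > 0]
        overall_degree = max((m[v] for m in occurs), default=0)              # criterion 1
        max_total_degree = max((sum(m.values()) for m in occurs), default=0) # criterion 2
        n_terms = len(occurs)                                                # criterion 3
        return (overall_degree, max_total_degree, n_terms)

    # The lowest key is eliminated first; later criteria only break ties.
    return sorted(variables, key=key)

# Example: f = x^2*y + z and g = x + y^3 in the sparse representation.
f = [{"x": 2, "y": 1}, {"z": 1}]
g = [{"x": 1}, {"y": 3}]
# z has the lowest degree and fewest terms, so it is eliminated first.
assert brown_ordering([f, g], ["x", "y", "z"]) == ["z", "x", "y"]
```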
Surprisingly, our experiments on CAD problems in 3 variables all suggest that the Brown heuristic makes better choices than sotd (even before one considers the time taken to run the heuristic itself). This counter-intuitive finding does not generalise to our 4-variable problem set, as discussed later.

3 Prior Work Using ML for the CAD Variable Ordering

The first application of ML for choosing a CAD variable ordering was [29], which used a support vector machine to select which of three human-made heuristics to follow. The SVM considered 11 simple algebraic features of the input polynomials (mostly different measures of degree and variable occurrence). The experiments were on 3-variable CAD problems and although the Brown heuristic was found to make the best choices on average, the experiments identified substantial subsets of examples for which each of the three heuristics outperformed the others. The key conclusion was that the machine learned choice did significantly better than any one heuristic overall.
The present authors revisited these experiments earlier this year in [23]. We used the same dataset but this time ML was used to predict directly the variable ordering for CAD, rather than choosing a heuristic. The motivation for picking a heuristic in [29] was that if the methodology were applied to problems with more variables it would still mean making a choice from 3 possibilities rather than an exponentially growing number. However, upon investigation there were many problems where none of the human-made heuristics made good choices and so savings could be made by considering all possible orderings.

In [23] we also considered a more diverse selection of ML methods than [29]. We experimented with four common ML classifiers: K-Nearest Neighbours (KNN); Multi-Layer Perceptron (MLP); Decision Tree (DT); and Support Vector Machine (SVM) with RBF kernel, all using the same set of 11 features from [29]. The results showed that all three of the new models performed substantially better than the SVM (the only classifier to be tried before); and that all four classifiers outperformed the human-made heuristics.
(Of course, this methodology will have to be changed to deal with higher numbers of variables, but since CAD is rarely tractable with more than 5 variables this is not a particularly pressing concern. We note that there are several meta-algorithms that may be applicable to sample the possible orderings without evaluating them all. For example, a Monte Carlo tree search was used in [33] to sample the possible multivariate Horner schemes and pick an optimal one in the CAS FORM.)

We next considered how to extract further information from the input data. The 11 features used in [29], [23] were inspired by Brown's heuristic [9] (e.g. measures of variable degree and frequency of occurrence). In particular, they can all be cheaply extracted from polynomials.

In [25] a new feature generation procedure was presented, based on the observation that the original features can be formalised mathematically using a small number of basic functions (average, sign, maximum) evaluated on the degrees of the variables in either one polynomial or the whole system. Considering all possible combinations of these functions led to 78 useful and independent features for our 3-variable dataset. The experiments were repeated with these, with the results showing that all four ML classifiers improved their predictions.

Using these new features the choices of the best performing classifier allowed CAD to solve all problems in the testing set with a runtime only 6% more than the best possible (i.e. the time taken if the optimal ordering were used for every problem). Using only the original features, the choices of the best ML classifier led to 14% more than the minimum runtime. Following the choices of Brown's heuristic led to runtimes 27% more than the minimum.
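The feature generation idea, composing an aggregation over the monomials of one polynomial with an aggregation over the whole system, can be sketched as follows. This is our own simplified illustration, not the exact construction of [25]: the real scheme also uses the sign function and yields many more combinations.

```python
from itertools import product

def generate_features(polys, variables):
    """For each variable, apply an inner aggregation over the degrees of that
    variable in the monomials of each polynomial, then an outer aggregation
    over the per-polynomial values, giving one feature per combination."""
    agg = {
        "max": lambda xs: max(xs) if xs else 0,
        "avg": lambda xs: sum(xs) / len(xs) if xs else 0.0,
    }
    features = {}
    for v, iname, oname in product(variables, agg, agg):
        per_poly = [agg[iname]([m.get(v, 0) for m in p]) for p in polys]
        features["%s_%s_deg_%s" % (oname, iname, v)] = agg[oname](per_poly)
    return features

# Example system: f = x^2*y + z and g = x + y^3 in a sparse representation.
f = [{"x": 2, "y": 1}, {"z": 1}]
g = [{"x": 1}, {"y": 3}]
feats = generate_features([f, g], ["x", "y", "z"])
# 3 variables x 2 inner x 2 outer aggregations = 12 features.
```

Constant or duplicated features produced by such exhaustive composition carry no information and are discarded before training (as described in Section 5).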
The work described above is the only published work on ML for choosing a CAD variable ordering. There are only a handful of other examples of ML within CASs: [27], [28] on the question of whether to precondition CAD with Gröbner Bases; [31] on deciding the order of sub-formulae solving for a QE procedure; and [33] on choosing a multivariate Horner scheme. Other areas of mathematical software have made more use of ML. For example, in the mathematical logic community the ML-selected portfolio SAT solver SATZilla [41] is well-known, while more recently MapleSAT views solver branching as an optimisation problem to be tackled with ML [34]. There are also several examples of ML within the automated reasoning community (see e.g. [39], [32], [8]). A survey on ML for mathematical software was presented at ICMS 2018 [20].
4 Improved Cross-Validation

4.1 Motivation

In all of the authors' previous ML experiments for CAD [29], [23], [25], the models were optimised simply to predict which of the possible variable orderings leads to the smallest computing time for CAD. This is not an ideal approach:

– First, runtimes for CAD, like all software, will contain a degree of noise from various hardware and software factors. While it is common for a given CAD problem to have a wide range of possible runtimes depending on the ordering, that does not mean that all orderings give runtimes distinct from the others. The runtimes commonly appear in clusters. Thus it is often the case that the smallest runtime is only slightly lower than the second smallest, and that difference could well be down to noise. Thus when training to target only the very quickest runtime we risk exaggerating the effects of such noise.

– Second, during training, when a model makes an incorrect prediction this could mean selecting an ordering that produces a runtime very close to the optimal, or another that is significantly larger. The training would not distinguish between these cases: there is no distinction between picking an "almost good" ordering and a "very bad" ordering. However, from the point of view of a user judging these selections there is a big difference!

One of the traditional metrics used to evaluate an ML classifier is accuracy, defined as the number of test examples for which the classifier makes the correct choice. In our context, correct means picking the optimal variable ordering from the n! possibilities.
We recognised that for our application this definition of accuracy is not sufficient to judge the classifiers and so in our prior work we also presented the total CAD runtime for the testing set when using the variable orderings of a classifier (which we referenced in the summary above).

The anonymous referees of our earlier papers commented that perhaps accuracy could be redefined into something more appropriate for our application. For example, judge a classifier as being correct for a problem instance if it picks an ordering which produces a runtime within x% of the minimum runtime that can be achieved for that instance. This led us to consider whether the training algorithms could be adapted to take account of this more nuanced definition of accuracy. We decided to introduce this in the stage of the methodology where cross-validation is used for hyperparameter selection: a single technique that is used for all of the different ML classifiers we work with.

4.2 The Standard Cross-Validation Procedure

We describe first the typical procedure of cross-validation used when preparing a ML classifier, which sets the parameters and hyperparameters of a model. The parameters are variables that can be fine-tuned so that the prediction error reaches a local or global minimum in the parameter space; for example, the weights in an artificial neural network or the support vectors in an SVM. The hyperparameters are model configurations selected before training. They are often specified by the practitioner based on experience or a heuristic, e.g. the number of layers in a neural network or the value of k in a k-nearest neighbour model. The connection between the hyperparameters and the model prediction is more complex, and thus, typically, these are tuned using grid search in the hyperparameter space to minimise the prediction error.

To prevent the situation where the model returns poor results on new datasets not used in training, also known as overfitting, the hyperparameters and parameters are tuned on different datasets.
(In Section 5 we use x = 20, but we are still debating the most appropriate value.)

The typical approach is cross-validation. In G-fold cross-validation (see for example the introduction of [1]), the data is split into G groups of equal size M:

    D_1 = { f^(k_{11}), . . . , f^(k_{M1}) },
    . . .                                                          (1)
    D_G = { f^(k_{1G}), . . . , f^(k_{MG}) },

where each group entry is a vector of features for a problem instance:

    f^(k_{mg}) = [ f_1^(k_{mg}), . . . , f_{n_f}^(k_{mg}) ],   g = 1, . . . , G,   m = 1, . . . , M.

Each entry in such a vector is a scalar number and n_f denotes the number of features we derive for each instance. See [25] for details of the features we use and how they are generated from the polynomials.

Let c^(k_{mg}) denote the target class corresponding to data point f^(k_{mg}). An ML classifier with parameters θ is modelled as a function M_θ : R^{n_f} → {1, 2, . . . , n_c}, where n_c denotes the number of classes. In our context the number of classes is the number of CAD variable orderings acceptable for the underlying application.

Typically, the classifier also depends on a number of hyperparameters that can each take a finite number of values. Here, we will denote by H the number of all possible hyperparameter combinations, such that M_{θ_h}, h = 1, . . . , H, denotes the classifier with parameters θ and hyperparameters defined by index h. The typical cross-validation procedure trains the parameters θ of the classifiers {M_{θ_h}}, h = 1, . . . , H, on each combination of G − 1 of the groups, giving G · H trained models.

Let ĉ_h^(k_{mg}) denote M_{θ_h^g}( f^(k_{mg}) ), the class prediction of a classifier whose parameters were trained on the dataset D_1 ∪ · · · ∪ D_{g−1} ∪ D_{g+1} ∪ · · · ∪ D_G. Then the optimal h_opt is computed by maximising the following quantity:

    h_opt = argmax_h ( (1/G) Σ_{g=1}^{G} score_{gh} ),             (2)

where score_{gh} = score( ĉ_h^(k_{mg}), c^(k_{mg}) ), and score(·, ·) denotes the F1-score of group g for the model prediction [15].
In other words, the typical cross-validation procedure identifies the hyperparameters that maximise the performance of the model at predicting the very best ordering. I.e. it does not take into account the actual computing time of the prediction, just whether it was the quickest.

4.3 Cross-Validation Using Runtimes

Our change to the cross-validation procedure is to instead calculate h_opt as

    h_opt = argmax_h ( (1/G) Σ_{g=1}^{G} − ctime_{gh} ),           (3)

where ctime_{gh} = (1/M) Σ_m ctime( k_{mg}, ĉ^(k_{mg}) ), and ctime( k_{mg}, ĉ^(k_{mg}) ) denotes the recorded time for computing CAD on data point f^(k_{mg}) using the variable ordering given by class prediction ĉ^(k_{mg}). By evaluating the computing time for all data points, this cross-validation method penalises the variable orderings leading to very large computing times, but does not penalise the ones close to the optimum. Thus we do not expect the change to affect how often a classifier chooses the optimal ordering, but it should improve the choices made in cases where the optimum is missed.

5 Methodology

We describe a ML experiment to choose the variable ordering for CAD. The methodology used is similar to that of our recent paper [25] except that (a) we use a dataset of 4-variable problems instead of 3-variable ones; and (b) we ran the classifiers with both the original and the adapted cross-validation procedure.
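To illustrate concretely how the two selection rules, (2) and (3), can pick different hyperparameters, consider the following toy sketch. All data here is made up for illustration, and plain accuracy stands in for the F1-score.

```python
import numpy as np

# Hypothetical toy data: 4 problems, 3 orderings; runtimes[i, o] is the CAD
# time (seconds) for problem i under ordering o.
runtimes = np.array([
    [1.0, 1.1, 9.0],
    [8.0, 2.0, 2.1],
    [1.0, 1.1, 9.0],
    [3.0, 3.1, 30.0],
])
best = runtimes.argmin(axis=1)          # the target classes c^(k): [0, 1, 0, 0]

# Held-out predictions of two hyperparameter settings h = 0, 1.
preds = {
    0: np.array([0, 0, 0, 0]),          # h=0: optimal on 3 of 4 problems
    1: np.array([1, 1, 1, 1]),          # h=1: optimal once, but always close
}

def accuracy_based(preds, best):
    """Selection in the spirit of (2): reward predicting the very best class."""
    return max(preds, key=lambda h: np.mean(preds[h] == best))

def runtime_based(preds, runtimes):
    """Selection as in (3): maximise minus the mean CAD time of predictions."""
    n = len(runtimes)
    return max(preds, key=lambda h: -np.mean(runtimes[np.arange(n), preds[h]]))

# h=0 wins on accuracy, yet its one mistake (problem 1 -> 8.0 s) costs more
# than all of h=1's slightly sub-optimal choices combined, so rule (3)
# prefers h=1.
```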
We are working with the nlsat dataset produced to evaluate the work in [30]; thus the problems are all fully existentially quantified. Although there are CAD algorithms that reduce what is being computed based on the quantification structure (e.g. Partial CAD [17]), the conclusions we draw are likely to generalise. We selected the 2080 problems with 4 variables, meaning each has a choice of 24 different variable orderings. We extracted only the polynomials involved, and randomly divided them into two datasets for training (1546) and testing (534). Only the former is used to tune the ML model parameters and hyperparameters.

We work with the CAD routine
CylindricalAlgebraicDecompose, part of the RegularChains Library for Maple. It builds decompositions first of C^n before refining to a CAD of R^n [14], [13], [2]. We ran the code in Maple 2018 but used an updated version of the RegularChains Library. Feature extraction was performed with code using the sympy package v1.3 for Python 2.7. The sotd heuristic was implemented in Maple as part of the ProjectionCAD package [24]. Training and evaluation of the ML models was done using the scikit-learn package [37] v0.20.2 for Python 2.7. In order to implement our adapted cross-validation procedure we had to rewrite a number of the standard commands within the package to both use the redefined h_opt in (3), and to access the data it requires during the cross-validation.

Each individual CAD was constructed by a Maple script called separately from Python (to avoid any Maple caching of results). The target variable ordering for ML was defined as the one that minimises the computing time for a given problem. All CAD function calls included a time limit. For the training dataset an initial time limit of 16 s was used, which was doubled if all orderings timed out (a target variable ordering could be assigned for all problems using time limits no bigger than 32 s). The problems in the testing dataset were all processed with a single larger time limit of 64 s for all orderings, with any problems that timed out having their runtime recorded as 64 s.

(The nlsat dataset is freely available from http://cs.nyu.edu/~dejan/nonlinear/.)

We computed algorithmically the set of features for 4 variables { f^(i) }, i = 1, . . . , n_f, where n_f = 1440, using the procedure introduced in [25]. Given the set of problems { Pr_1, . . . , Pr_N }, N = 1546, some of the features f^(i) turn out to be constant, i.e. f^(i)(Pr_1) = f^(i)(Pr_2) = · · · = f^(i)(Pr_N). Such features will have no benefit for ML and are removed. Further, other features may be repetitive, i.e. f^(i)(Pr_n) = f^(j)(Pr_n), ∀n = 1, . . . , N, and are merged into one single feature. After this step, we are left with 105 features.

We choose commonly used deterministic ML models for this experiment (for details on the methods see e.g. the textbook [1]):
– The K-Nearest Neighbours (KNN) classifier;
– The Decision Tree (DT) classifier;
– The Multi-Layer Perceptron (MLP) classifier;
– The Support Vector Machine (SVM) classifier with Radial Basis Function (RBF) kernel.

The ML models will be compared on two metrics:
– Accuracy, defined as the percentage of problems where a model's predicted variable ordering led to a computing time within 20% of the time taken by the optimal ordering; and
– Time, defined as the total time taken to evaluate all problems in the test set using that model's predictions for variable ordering.

We note that Accuracy is defined differently in our prior work [23], [25], where we measured only how often a heuristic picked the very best ordering.

We will also test the two best-known human-constructed heuristics [9], [19] described in Section 2.2. Unlike the ML models, these can end up predicting several variable orderings (when they cannot discriminate). In practice if this were to happen the heuristic would select one randomly (or perhaps lexicographically), however that final pick is not meaningful. To accommodate this we evaluate these heuristics as follows:
Table 1.
The ML hyperparameters optimised on the training dataset using the standard cross-validation (CV) routine and the new CV routine.

Model          Hyperparameter               Value (standard CV)     Value (new CV)
Decision Tree  Criterion                    Entropy                 Gini impurity
               Maximum tree depth           6                       14
K-Nearest      Train instances weighting    Inversely proportional  Inversely proportional
Neighbours                                  to distance             to distance
               Algorithm                    Ball Tree               Ball Tree
               Number of neighbours         13                      14
SVM            Regularization para. C       0.41                    1.…
               γ                            …                       …
MLP            α                            …·10^−…                 …·10^−…

– For each problem, the prediction accuracy of such a heuristic is judged to be the percentage of its predicted variable orderings that are also target orderings (i.e. within 20% of the minimum). The average of this percentage over all problems in the testing dataset represents the prediction accuracy.
– Similarly, the computing time for such methods is assessed as the average computing time over all predicted orderings, and it is this that is summed up for all problems in the testing dataset.
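The evaluation scheme in these two bullet points can be sketched as follows; the data layout is our own illustration.

```python
import numpy as np

def evaluate_heuristic(predicted_orderings, runtimes, tol=0.20):
    """Score a heuristic that may return several tied orderings per problem.

    `predicted_orderings[i]` is the list of orderings the heuristic proposes
    for problem i, and `runtimes[i, o]` is the CAD time for problem i under
    ordering o. Returns (Accuracy in %, total Time in seconds) as defined
    in the text."""
    accuracies, times = [], []
    for i, picks in enumerate(predicted_orderings):
        target = runtimes[i].min() * (1 + tol)   # within 20% of the minimum
        good = [o for o in picks if runtimes[i, o] <= target]
        accuracies.append(len(good) / len(picks))               # fraction of good picks
        times.append(np.mean([runtimes[i, o] for o in picks]))  # average over picks
    # Accuracy is averaged over problems; Time is summed over problems.
    return 100 * np.mean(accuracies), float(np.sum(times))
```

For a single-prediction ML model the same function reduces to the plain within-20% accuracy and total runtime metrics.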
6 Results

The results are presented in Table 2. Each ML model appears twice in the top table via its acronym with each of the following appended:

– O: for one trained with the original (and typical) ML cross-validation method based on (2), as was used in our prior work [23], [25].
– N: for one trained by the new cross-validation approach described in Section 4.3, which is based on computing time as in (3).

The bottom table details the two human-constructed heuristics along with the outcome of a random choice between the 24 orderings. We might expect a random choice to be correct once in every 24 times, but the figure is higher as for some problems there were multiple variable orderings with equally fast timings.

The minimum total computing time, achieved if we select an optimal ordering for every problem, is 2,177 s (the Virtual Best Heuristic). Choosing at random would take 8,291 s. The maximum total time, selecting the worst ordering for every problem (the Virtual Worst Heuristic), is 22,735 s, far higher than that of any of the heuristics. The prediction time for the heuristics was 286 s for sotd and 23 s for Brown. In contrast, the total time taken by the ML to make predictions was less than one second for all models.
For each ML model the performance when trained with the new cross-validation was better (measured using either of our metrics) than when trained with the original procedure. The scale of the improvement varied: the timings of the decision tree reduced by 9.8% but those of the KNN classifier only by 1.6%.

Thus we can conclude the new methodology to be beneficial. However, we note that it is still the case that our two metrics do not agree on the best model: DT-N achieved the lowest times but KNN-N the highest accuracy. The latter is better at picking a good (within 20% of the minimal) ordering but when it fails to do so it makes mistakes of greater magnitude. So there is scope for further work to make our ML models take into account the full range of possibilities. It may be that this requires a tailored approach to the training of parameters in each different classifier.
Of the two human-made heuristics, Brown performed far worse than sotd. This is the opposite of the findings in [29], [23], [25] for 3-variable problems. This is not necessarily in conflict: the added information taken by sotd will grow in size exponentially with the number of variables, and thus we would expect the predictive information it carries to be more valuable. However, the cost of sotd will also be increasing rapidly, denting this value. The time taken by sotd to make all the predictions is 286 s, while the time for Brown is less than 10% of that at 23 s. For this dataset at least, it is well worth paying the price of sotd as the savings over Brown's heuristic are far more substantial.

Table 2.
Performance on the testing dataset of the ML classifiers (using both the standard and new cross-validation routines), the human-made heuristics, and a random choice. The virtual best and worst solvers show the range of possibilities.

           DT-O   DT-N   KNN-O  KNN-N  MLP-O  MLP-N  SVM-O  SVM-N
Accuracy   ….7%   54.3%  53.9%  54.5%  53.6%  56.9%  53.9%  54.…%
Time (s)   4,022  3,627  3,808  3,748  3,972  3,784  3,795  3,…

           Virtual Best  Virtual Worst  Random  Brown   sotd
Accuracy   –             –              ….0%    20.1%   47.…%
Time (s)   2,177         22,735         8,291   8,292   4,…

All heuristics (ML and human-made) are further away from the optimum on this 4-variable dataset than they were on the 3-variable one, to be expected given we are choosing from 24 rather than 6 orderings. Our best performing model achieves timings 67% greater than the minimum (it was 6% for 3-variable problems). However, the best human-made heuristic had timings 98% greater.

In fact, every ML model outperformed both the human-constructed heuristics in regards to both metrics, and when using either the original or the new cross-validation approach. So we can easily conclude that our ML methodology generalises to 4-variable problems. However, it is also clear that there is much more scope for future improvement.
We have demonstrated that our methodology of ML for choosing a CAD variableordering may be applied to 4-variable problems where it continues its dominanceover human-made heuristics. We have also presented an addition to the MLtraining methodology to better reflect our application domain and demonstratedthe benefit of this experimentally. This new methodology could be applied to anyML application which seeks to make a choice to minimise computational runtime.
Acknowledgements
This work is funded by EPSRC Project EP/R019622/1: Embedding Machine Learning within Quantifier Elimination Procedures.

References
1. Bishop, C.: Pattern Recognition and Machine Learning. Springer (2006)
2. Bradford, R., Chen, C., Davenport, J., England, M., Moreno Maza, M., Wilson, D.: Truth table invariant cylindrical algebraic decomposition by regular chains. In: Gerdt, V., Koepf, W., Seiler, W., Vorozhtsov, E. (eds.) Computer Algebra in Scientific Computing, Lecture Notes in Computer Science, vol. 8660, pp. 44–58. Springer International Publishing (2014), https://doi.org/10.1007/978-3-319-10515-4_4
3. Bradford, R., Davenport, J., England, M., Errami, H., Gerdt, V., Grigoriev, D., Hoyt, C., Košta, M., Radulescu, O., Sturm, T., Weber, A.: A case study on the parametric occurrence of multiple steady states. In: Proceedings of the 2017 ACM International Symposium on Symbolic and Algebraic Computation. pp. 45–52. ISSAC ’17, ACM (2017), https://doi.org/10.1145/3087604.3087622
4. Bradford, R., Davenport, J., England, M., Errami, H., Gerdt, V., Grigoriev, D., Hoyt, C., Košta, M., Radulescu, O., Sturm, T., Weber, A.: Identifying the parametric occurrence of multiple steady states for some biological networks. Journal of Symbolic Computation, 84–119 (2020), https://doi.org/10.1016/j.jsc.2019.07.008
5. Bradford, R., Davenport, J., England, M., McCallum, S., Wilson, D.: Truth table invariant cylindrical algebraic decomposition. Journal of Symbolic Computation, 1–35 (2016), https://doi.org/10.1016/j.jsc.2015.11.002
6. Bradford, R., Davenport, J., England, M., Wilson, D.: Optimising problem formulations for cylindrical algebraic decomposition. In: Carette, J., Aspinall, D., Lange, C., Sojka, P., Windsteiger, W. (eds.) Intelligent Computer Mathematics, Lecture Notes in Computer Science, vol. 7961, pp. 19–34. Springer Berlin Heidelberg (2013), https://doi.org/10.1007/978-3-642-39320-4_2
7. Bridge, J.: Machine learning and automated theorem proving. Tech. Rep. UCAM-CL-TR-792, University of Cambridge, Computer Laboratory (2010)
8. Bridge, J., Holden, S., Paulson, L.: Machine learning for first-order theorem proving. Journal of Automated Reasoning 53(2), 141–172 (2014)
18. Davenport, J., Bradford, R., England, M., Wilson, D.: Program verification in the presence of complex numbers, functions with branch cuts etc. In: 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. pp. 83–88. SYNASC ’12, IEEE (2012), https://doi.org/10.1109/SYNASC.2012.68
19. Dolzmann, A., Seidl, A., Sturm, T.: Efficient projection orders for CAD. In: Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation. pp. 111–118. ISSAC ’04, ACM (2004), https://doi.org/10.1145/1005285.1005303
20. England, M.: Machine learning for mathematical software. In: Davenport, J., Kauers, M., Labahn, G., Urban, J. (eds.) Mathematical Software – Proc. ICMS 2018. Lecture Notes in Computer Science, vol. 10931, pp. 165–174. Springer International Publishing (2018), https://doi.org/10.1007/978-3-319-96418-8_20
21. England, M., Bradford, R., Davenport, J.: Cylindrical algebraic decomposition with equational constraints. Journal of Symbolic Computation, In Press (2019), https://doi.org/10.1016/j.jsc.2019.07.019
22. England, M., Bradford, R., Davenport, J., Wilson, D.: Choosing a variable ordering for truth-table invariant cylindrical algebraic decomposition by incremental triangular decomposition. In: Hong, H., Yap, C. (eds.) Mathematical Software – ICMS 2014. Lecture Notes in Computer Science, vol. 8592, pp. 450–457. Springer Heidelberg (2014), https://doi.org/10.1007/978-3-662-44199-2_68
23. England, M., Florescu, D.: Comparing machine learning models to choose the variable ordering for cylindrical algebraic decomposition. In: Kaliszyk, C., Brady, E., Kohlhase, A., Sacerdoti Coen, C. (eds.) Intelligent Computer Mathematics. Lecture Notes in Computer Science, vol. 11617, pp. 93–108. Springer International Publishing (2019), https://doi.org/10.1007/978-3-030-23250-4_7
24. England, M., Wilson, D., Bradford, R., Davenport, J.: Using the Regular Chains Library to build cylindrical algebraic decompositions by projecting and lifting. In: Hong, H., Yap, C. (eds.) Mathematical Software – ICMS 2014. Lecture Notes in Computer Science, vol. 8592, pp. 458–465. Springer Heidelberg (2014), https://doi.org/10.1007/978-3-662-44199-2_69
25. Florescu, D., England, M.: Algorithmically generating new algebraic features of polynomial systems for machine learning. In: Abbott, J., Griggio, A. (eds.) Proceedings of the 4th Workshop on Satisfiability Checking and Symbolic Computation (SC²) (2019)
26. Ghaffarian, S.M., Shahriari, H.R.: Software vulnerability analysis and discovery using machine-learning and data-mining techniques: A survey. ACM Computing Surveys 50(4) (2017), https://doi.org/10.1145/3092566
27. Huang, Z., England, M., Davenport, J., Paulson, L.: Using machine learning to decide when to precondition cylindrical algebraic decomposition with Groebner bases. In: 18th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC ’16). pp. 45–52. IEEE (2016), https://doi.org/10.1109/SYNASC.2016.020
28. Huang, Z., England, M., Wilson, D., Bridge, J., Davenport, J., Paulson, L.: Using machine learning to improve cylindrical algebraic decomposition. Mathematics in Computer Science 13(4), 461–488 (2019), https://doi.org/10.1007/s11786-019-00394-8
29. Huang, Z., England, M., Wilson, D., Davenport, J., Paulson, L., Bridge, J.: Applying machine learning to the problem of choosing a heuristic to select the variable ordering for cylindrical algebraic decomposition. In: Watt, S., Davenport, J., Sexton, A., Sojka, P., Urban, J. (eds.) Intelligent Computer Mathematics, Lecture Notes in Artificial Intelligence, vol. 8543, pp. 92–107. Springer International (2014), https://doi.org/10.1007/978-3-319-08434-3_8
30. Jovanovic, D., de Moura, L.: Solving non-linear arithmetic. In: Gramlich, B., Miller, D., Sattler, U. (eds.) Automated Reasoning: 6th International Joint Conference (IJCAR), Lecture Notes in Computer Science, vol. 7364, pp. 339–354. Springer (2012), https://doi.org/10.1007/978-3-642-31365-3_27
31. Kobayashi, M., Iwane, H., Matsuzaki, T., Anai, H.: Efficient subformula orders for real quantifier elimination of non-prenex formulas. In: Kotsireas, I.S., Rump, S.M., Yap, C.K. (eds.) Mathematical Aspects of Computer and Information Sciences (MACIS ’15). Lecture Notes in Computer Science, vol. 9582, pp. 236–251. Springer International Publishing (2016), https://doi.org/10.1007/978-3-319-32859-1_21
32. Kühlwein, D., Blanchette, J., Kaliszyk, C., Urban, J.: MaSh: Machine learning for Sledgehammer. In: Blazy, S., Paulin-Mohring, C., Pichardie, D. (eds.) Interactive Theorem Proving, Lecture Notes in Computer Science, vol. 7998, pp. 35–50. Springer Berlin Heidelberg (2013), https://doi.org/10.1007/978-3-642-39634-2_6
33. Kuipers, J., Ueda, T., Vermaseren, J.: Code optimization in FORM. Computer Physics Communications, 1–19 (2015), https://doi.org/10.1016/j.cpc.2014.08.008
34. Liang, J., Hari Govind, V., Poupart, P., Czarnecki, K., Ganesh, V.: An empirical study of branching heuristics through the lens of global learning rate. In: Gaspers, S., Walsh, T. (eds.) Theory and Applications of Satisfiability Testing – SAT 2017, Lecture Notes in Computer Science, vol. 10491, pp. 119–135. Springer International Publishing (2017), https://doi.org/10.1007/978-3-319-66263-3_8
35. Mulligan, C., Bradford, R., Davenport, J., England, M., Tonks, Z.: Non-linear real arithmetic benchmarks derived from automated reasoning in economics. In: Bigatti, A., Brain, M. (eds.) Proceedings of the 3rd Workshop on Satisfiability Checking and Symbolic Computation (SC²) (2018)