A machine learning based software pipeline to pick the variable ordering for algorithms with polynomial inputs
Dorian Florescu and Matthew England
Faculty of Engineering, Environment and Computing, Coventry University, Coventry, CV1 5FB, UK
{Dorian.Florescu, Matthew.England}@coventry.ac.uk

Abstract.
We are interested in the application of Machine Learning (ML) technology to improve mathematical software. It may seem that the probabilistic nature of ML tools would invalidate the exact results prized by such software; however, the algorithms which underpin the software often come with a range of choices which are good candidates for ML application. We refer to choices which have no effect on the mathematical correctness of the software, but do impact its performance.
In the past we experimented with one such choice: the variable ordering to use when building a Cylindrical Algebraic Decomposition (CAD). We used the Python library Scikit-Learn (sklearn) to experiment with different ML models, and developed new techniques for feature generation and hyper-parameter selection.
These techniques could easily be adapted for making decisions other than our immediate application of CAD variable ordering. Hence in this paper we present a software pipeline to use sklearn to pick the variable ordering for an algorithm that acts on a polynomial system. The code described is freely available online.
Keywords: machine learning; scikit-learn; mathematical software; cylindrical algebraic decomposition; variable ordering
Mathematical Software, i.e. tools for effectively computing mathematical objects, is a broad discipline: the objects in question may be expressions such as polynomials or logical formulae, algebraic structures such as groups, or even mathematical theorems and their proofs. In recent years there have been examples of software that acts on such objects being improved through the use of artificial intelligence techniques. For example, [21] uses a Monte-Carlo tree search to find the representation of polynomials that is most efficient to evaluate; [22] uses a machine-learned branching heuristic in a SAT-solver for formulae in Boolean logic; [18] uses pattern matching to determine whether a pair of elements from a specified group are conjugate; and [1] uses deep neural networks for premise selection in an automated theorem prover. See the survey article [12] in the proceedings of ICMS 2018 for more examples.
Machine Learning (ML), that is, statistical techniques to give computer systems the ability to learn rules from data, may seem unsuitable for use in mathematical software since ML tools can only offer probabilistic guidance, when such software prizes exactness. However, none of the examples above risked the correctness of the end-result in their software. They all used ML techniques to make non-critical choices or guide searches: the decisions of the ML carried no risk to correctness, but did offer substantial increases in computational efficiency. All mathematical software, no matter the mathematical domain, will likely involve such choices, and our thesis is that in many cases an ML technique could make a better choice than a human user, a so-called magic constant [6], or a traditional human-designed heuristic.
Contribution and outline
In Section 2 we briefly survey our recent work applying ML to improve an algorithm in a computer algebra system which acts on sets of polynomials. We describe how we proposed a more appropriate definition of model accuracy and used this to improve the selection of hyper-parameters for ML models; and a new technique for identifying features of the input polynomials suitable for ML.
These advances can be applied beyond our immediate application: the feature identification to any situation where the input is a set of polynomials, and the hyper-parameter selection to any situation where we are seeking to take a choice that minimises a computation time. Hence we saw value in packaging our techniques into a software pipeline so that they may be used more widely. Here, by pipeline we refer to a succession of computing tasks that can be run as one task. The software is freely available as a Zenodo repository here: https://doi.org/10.5281/zenodo.3731703
We describe the software pipeline and its functionality in Section 3. Then in Section 4 we describe its application on a dataset we had not previously studied.
Our recent work has been using ML to select the variable ordering to use for calculating a cylindrical algebraic decomposition relative to a set of polynomials.
A Cylindrical Algebraic Decomposition (CAD) is a decomposition of ordered R^n space into cells arranged cylindrically, meaning the projections of cells all lie within cylinders over a CAD of a lower dimensional space. All these cells are (semi)-algebraic, meaning each can be described with a finite sequence of polynomial constraints. A CAD is produced for either a set of polynomials, or a logical formula whose atoms are polynomial constraints. It may be used to analyse these objects by finding a finite sample of points to query and thus understand the behaviour over all R^n. The most important application of CAD is to perform Quantifier Elimination (QE) over the reals, i.e. given a quantified formula, a CAD may be used to find an equivalent quantifier-free formula.
CAD was introduced in 1975 [10] and is still an active area of research. The collection [7] summarises the work up to the mid-90s, while the background section of [13], for example, includes a summary of progress since. QE has numerous applications in science [2], engineering [25], and even the social sciences [23].
CAD requires an ordering of the variables. QE imposes that the ordering match the quantification of variables, but variables in blocks of the same quantifier and the free variables can be swapped. The ordering can have a great effect on the time / memory use of CAD, the number of cells, and even the underlying complexity [5]. Human-designed heuristics have been developed to make the choice [11], [4], [3], [14] and are used in most implementations.
The first application of ML to the problem was in 2014, when a support vector machine was trained to choose which of these heuristics to follow [20], [19].
The machine-learned choice did significantly better than any one heuristic overall. The present authors revisited these experiments in [15], but this time using ML to predict the ordering directly (because there were many problems where none of the human-made heuristics made good choices; and although the number of orderings increases exponentially, the current scope of CAD application means this is not restrictive). We also explored a more diverse selection of ML methods available in the Python library scikit-learn (sklearn) [24]. All the models tested outperformed the human-made heuristics.
The ML models learn not from the polynomials directly, but from features: properties which evaluate to a floating point number for a specific polynomial set. In [20] and [15] only a handful of features were used (measures of degree and frequency of occurrence for variables). In [16] we developed a new feature generation procedure which used combinations of basic functions (average, sign, maximum) evaluated on the degrees of the variables in either one polynomial or the whole system. This allowed for substantially more features and improved the performance of all ML models. The new features could be used for any ML application where the input is a set of polynomials.
The natural metric for judging a CAD variable ordering is the corresponding CAD runtime: in the work above, models were trained to pick the ordering which minimises this for a given input. However, this meant the training did not distinguish between any non-optimal orderings, even though the difference between these could be huge. This led us to a new definition of accuracy in [17]: picking an ordering which leads to a runtime within x% of the minimum possible.
Footnote 1: E.g. QE would transform ∃x. ax² + bx + c = 0 ∧ a ≠ 0 into the equivalent quantifier-free formula b² − 4ac ≥ 0.
Footnote 2: For the formula in Footnote 1 we must decompose (x, a, b, c)-space with x last, but the other variables can be in any order. Using a ≺ b ≺ c requires 27 cells but c ≺ b ≺ a requires 115.
We then wrote a new version of the sklearn procedure which uses cross-validation to select model hyper-parameters to minimise the total CAD runtime of its choices, rather than maximise the number of times the minimal ordering is chosen. This also improved the performance of all ML models in the experiments of [17]. The new definition and procedure are suitable for any situation where we are seeking to take a choice that minimises a computation time.
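The idea of selecting hyper-parameters by total runtime rather than accuracy can be illustrated with sklearn's support for callable scorers. The sketch below is not the pipeline's own code: the data, the sample-ID-column trick, and the model are placeholders, and the runtime matrix is random stand-in data. It only shows how RandomizedSearchCV can be driven by a runtime-based score.

```python
# Sketch (not the authors' code): hyper-parameter search scored by the total
# runtime of the chosen variable orderings, not by plain prediction accuracy.
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_samples, n_features, n_orderings = 200, 10, 6

X = rng.random((n_samples, n_features))
runtimes = rng.random((n_samples, n_orderings))  # stand-in for measured CAD times
y = runtimes.argmin(axis=1)                      # optimal ordering per sample

# Prepend a sample-ID column so the scorer can look up each sample's runtimes.
X_id = np.hstack([np.arange(n_samples).reshape(-1, 1), X])

class IdStrippingTree(DecisionTreeClassifier):
    """Decision tree that ignores the leading sample-ID column."""
    def fit(self, X, y, **kw):
        return super().fit(X[:, 1:], y, **kw)
    def predict(self, X):
        return super().predict(X[:, 1:])

def neg_total_runtime(estimator, X, y):
    # Higher score = less total CAD time for the estimator's choices.
    ids = X[:, 0].astype(int)
    chosen = estimator.predict(X)
    return -runtimes[ids, chosen].sum()

search = RandomizedSearchCV(
    IdStrippingTree(random_state=0),
    param_distributions={"max_depth": [2, 4, 8, None]},
    n_iter=4, cv=3, scoring=neg_total_runtime, random_state=0)
search.fit(X_id, y)
print(search.best_params_)
```

The same scorer works with any classifier, which is why the technique transfers to other choose-to-minimise-runtime settings.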
The input to our pipeline is given by two distinct datasets, used for training and testing respectively. An individual entry in a dataset is a set of polynomials that represents an input to a symbolic computation algorithm, in our case CAD. The output is a corresponding sequence of variable ordering suggestions, one for each set of polynomials in the testing dataset.
The pipeline is fully automated: it generates and uses the CAD runtimes for each set of polynomials under each admissible variable ordering; uses the runtimes from the training dataset to select the hyper-parameters with cross-validation and tune the parameters of the model; and evaluates the performance of those classifiers (along with some other heuristics for the problem) on the sets of polynomials in the testing dataset.
We describe the key steps in the pipeline below. Each of the numbered stages can be individually marked for execution or not in a run of the pipeline (avoiding duplication of existing computation). The code for this pipeline, written entirely in Python, is freely available at: https://doi.org/10.5281/zenodo.3731703.
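The stage-selection mechanism can be pictured as follows. This is a hypothetical sketch of the idea, not the repository's actual code; the stage names and registration interface are ours.

```python
# Hypothetical sketch: each numbered stage of the pipeline can be switched on
# or off independently, so completed work (e.g. already-measured CAD runtimes)
# is not recomputed on later runs.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Pipeline:
    stages: list = field(default_factory=list)

    def register(self, name: str, func: Callable):
        self.stages.append((name, func))

    def run(self, enabled: set):
        for name, func in self.stages:
            if name in enabled:
                print(f"running {name}")
                func()
            else:
                print(f"skipping {name}")

pipe = Pipeline()
pipe.register("I(a) runtimes", lambda: None)   # measure CAD runtimes
pipe.register("I(b) parsing", lambda: None)    # polynomial data parsing
pipe.register("I(c) features", lambda: None)   # feature generation
pipe.register("I(d) training", lambda: None)   # classifier training

# Re-run only the ML stages, reusing stored runtimes and parsed data.
pipe.run(enabled={"I(c) features", "I(d) training"})
```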
I. Generating a model using the training dataset

(a) Measuring the CAD runtimes:
The CAD routine is run for each set of polynomials in the training dataset. The runtimes for all possible variable orderings are stored in a different file for each set of polynomials. If the runtime exceeds a pre-defined timeout, the value of the timeout is stored instead.

(b) Polynomial data parsing:
The training dataset is first converted to a format that is easier to process into features. For this purpose, we chose the format given by the terms() method of the Poly class in the sympy package for symbolic computation in Python.
Here, each monomial is defined by a tuple, containing another tuple with the degrees of each variable, and a value defining the monomial coefficient. The polynomials are then defined by lists of monomials given in this format, and a point in the training dataset consists of a list of polynomials. For example, an entry such as the set {x1 + 42x2, x1^2 x3 − 1} is represented as [[((1, 0, 0), 1), ((0, 1, 0), 42)], [((2, 0, 1), 1), ((0, 0, 0), −1)]].
All the data points in the training dataset are then collected into a single file called terms_train.txt after being placed into this format. Subsequently, the file y_train.txt is created, storing the index of the variable ordering with the minimum computing time for each set of polynomials, using the runtimes measured in Step I(a).

(c) Feature generation:
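The terms() representation from Step I(b) can be reproduced directly with sympy; the two polynomials below are illustrative, not taken from the paper's dataset.

```python
# Reproducing the nested (degree-tuple, coefficient) format via sympy's
# Poly.terms() method. The polynomials here are illustrative examples.
from sympy import symbols, Poly

x1, x2, x3 = symbols("x1 x2 x3")
polys = [x1 + 42 * x2, x1**2 * x3 - 1]

entry = [Poly(p, x1, x2, x3).terms() for p in polys]
print(entry)
# Each monomial is a (degree-tuple, coefficient) pair:
# [[((1, 0, 0), 1), ((0, 1, 0), 42)], [((2, 0, 1), 1), ((0, 0, 0), -1)]]
```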
Here each set of polynomials in the training dataset is processed into a fixed-length sequence of floating point numbers, called features, which are the actual data used to train the ML models in sklearn. This is done with the following steps:

i. Raw feature generation
We systematically consider applying all meaningful combinations of the functions average, sign, maximum, and sum to polynomials with a given number of variables. This generates a large set of feature descriptions, as proposed in [16]. The new format used to store the data described above allows for an easy evaluation of these features. An example of computing such a feature is given in Figure 1. In [16] we described how the method provides 1728 possible features for polynomials constructed with three variables, for example. This step generates the full set of feature descriptions, saved in a file called features_descriptions.txt, and the corresponding values of the features on the training dataset, saved in a file called features_train_raw.txt.

Fig. 1. Generating the feature av_p(max_m(d_{m,p})) from data stored in the format of Step I(b). Here d_{m,p} denotes the degree of a given variable in polynomial number p and monomial number m, and av_p denotes the average function computed over all polynomials [16].

ii. Feature simplification
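The Figure 1 feature can be evaluated on the nested format in a few lines. The data below is illustrative, not from the dataset, and the helper name is ours.

```python
# Sketch: evaluating the Figure 1 feature av_p(max_m(d_{m,p})) -- the average,
# over polynomials, of the maximum degree of a chosen variable -- directly on
# the nested terms() format of Step I(b). Illustrative data.
entry = [
    [((1, 0, 0), 1), ((0, 1, 0), 42)],   # x1 + 42*x2
    [((2, 0, 1), 1), ((0, 0, 0), -1)],   # x1**2*x3 - 1
]

def feature(entry, var=0):
    # Max over monomials of the chosen variable's degree, averaged over polys.
    per_poly = [max(degrees[var] for degrees, _coeff in poly) for poly in entry]
    return sum(per_poly) / len(per_poly)

print(feature(entry))  # (1 + 2) / 2 = 1.5
```

Systematically substituting other base functions (sign, sum, ...) for the two aggregations is what generates the large raw feature set.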
After computing the numerical values of the features in Step I(c)i, this step removes those features that are constant or repetitive for the dataset in question, as described in [16]. The descriptions of the remaining features are saved in a new file called features_descriptions_final.txt.

iii. Final feature generation
The final set of features is computed by evaluating the descriptions in features_descriptions_final.txt for the training dataset. Even though these were already evaluated in Step I(c)i, we repeat the evaluation for the final set of feature descriptions. This is to allow the possibility of users entering alternative features manually and skipping steps i and ii. As noted above, any of the named steps in the pipeline can be selected or skipped for execution in a given run. The final values of the features are saved in a new file called features_train.txt.

(d) Machine learning classifier training:

i. Fitting the model hyperparameters by cross-validation
The pipeline can apply four of the most commonly used deterministic ML models (see [15] for details), using the implementations in sklearn [24]:
– The K-Nearest Neighbors (KNN) classifier
– The Multi-Layer Perceptron (MLP) classifier
– The Decision Tree (DT) classifier
– The Support Vector Machine (SVM) classifier
Of course, additional models in sklearn and its extensions could be included with relative ease. The pipeline can use two different methods for fitting the hyperparameters via a cross-validation procedure on the training set, as described in [17]:
– Standard cross-validation: maximizing the prediction accuracy (i.e. the number of times the model picks the optimum variable ordering).
– Time-based cross-validation: minimizing the CAD runtime (i.e. the time taken to compute CADs with the model's choices).
Both methods tune the hyperparameters with cross-validation using the routine RandomizedSearchCV from the sklearn package in Python (for the latter, an adapted version that we wrote). The cross-validation results (i.e. the choice of hyperparameters) are saved in a file hyperpar_D**_**_T**_**.txt, where D**_** is the date and T**_** denotes the time when the file was generated.

ii. Fitting the parameters
The parameters of each model are subsequently fitted using the standard sklearn algorithms for each chosen set of hyperparameters. These are saved in a file called par_D**_**_T**_**.txt.

II. Predicting the CAD variable orderings using the testing dataset
The models in Step I are then evaluated according to their choices of variable orderings for the sets of polynomials in the testing dataset. The steps below are listed without detailed description, as they are performed similarly to Step I but for the testing dataset.

(a) Polynomial data parsing:
The values generated are saved in a new file called terms_test.txt.

(b) Feature generation: The final set of features is computed by evaluating the descriptions from Step I(c)ii for the testing dataset. These values are saved in a new file called features_test.txt.

(c) Predictions using ML: Predictions on the testing dataset are generated using the models computed in Step I(d). The models are run with the data from Step II(b), and the predictions are stored in a file called y_D**_**_T**_**_test.txt.

(d) Predictions using human-made heuristics: In our prior papers [15], [16], [17] we compared the performance of the ML models with the human-designed heuristics in [4] and [11]. For details on how these are applied see [15]. Their choices are saved in two files entitled y_brown_test.txt and y_sotd_test.txt, respectively.

(e) Comparative results:
Finally, in order to compare the performance of the proposed pipeline, we must measure the actual CAD runtimes on the testing dataset. The results of the comparison are saved in a file with the template: comparative_results_D**_**_T**_**.txt.

Adapting the pipeline to other algorithms
The pipeline above was developed for choosing the variable ordering for the CAD implementation in Maple's Regular Chains Library [8], [9]. But it could be used to pick the variable ordering for other procedures which take sets of polynomials as input, by changing the calls to CAD in Steps I(a) and II(e) to those of another implementation / algorithm. Step II(d) would also have to be edited to call an appropriate competing heuristic.
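The swap described above only requires a timing wrapper around the target routine. The following is a hypothetical sketch: the function names are ours, and a real implementation would interrupt the call on timeout rather than merely cap the recorded value.

```python
# Hypothetical sketch: the timing step only needs a callable that runs the
# target algorithm on a polynomial set under one variable ordering, so the
# CAD call can be replaced by any such procedure.
import time
from itertools import permutations

def time_algorithm(run, polys, ordering, timeout=30.0):
    """Time run(polys, ordering). NOTE: this caps the recorded value at the
    timeout but does not interrupt the call, unlike a production pipeline."""
    start = time.perf_counter()
    run(polys, ordering)              # e.g. a CAD call, or any replacement
    elapsed = time.perf_counter() - start
    return min(elapsed, timeout)

def runtimes_for_all_orderings(run, polys, variables):
    # One measurement per admissible variable ordering.
    return {order: time_algorithm(run, polys, list(order))
            for order in permutations(variables)}

# A trivial stand-in algorithm, just to show the interface.
dummy = lambda polys, ordering: sorted(ordering)
print(sorted(runtimes_for_all_orderings(dummy, [], ["x1", "x2", "x3"])))
```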
The pipeline described in the previous section makes it easy for us to repeat our past experiments (described in Section 2) for a new dataset. All that is needed is to replace the files storing the polynomials and run the pipeline.
To demonstrate this we test the proposed pipeline on a new dataset of randomly generated polynomials. We are not suggesting that it is appropriate to test classifiers on random data: we simply mean to demonstrate the ease with which the experiments in [15], [16], [17], which originally took many man-hours, can be repeated with just a single code execution.
The randomly generated parameters are: the degrees of the three variables in each polynomial term, the coefficient of each term, the number of terms in a polynomial, and the number of polynomials in a set. The means and standard deviations of these parameters were extracted from the problems in the nlsat dataset (https://cs.nyu.edu/~dejan/nonlinear/), which was used in our previous work [15], so that the dataset is of a comparable scale. We generated 7500 sets of random polynomials, where 5000 were used for training and the remaining 2500 for testing.
The results of the proposed processing pipeline, including the comparison with the existing human-made heuristics, are given in Table 1. The prediction time is the time taken for the classifier or heuristic to make its predictions for the problems in the testing set. The total time adds to this the time for the actual CAD computations using the suggested orderings. We do not report the training time of the ML as this is a cost paid only once in advance. The virtual solvers are those which always make the best/worst choice for a problem (in zero prediction time) and are useful to show the range of possible outcomes. We note that further details on our experimental methodology are given in [15], [16], [17].

Table 1. The comparative performance of DT, KNN, MLP, SVM, and the Brown and sotd heuristics on the testing data for our randomly generated dataset. A random prediction, and the virtual best (VB) and virtual worst (VW) predictions are also included. [The table reports two rows, Prediction time (s) and Total time (s), for each of DT, KNN, MLP, SVM, Brown, sotd, rand, VB and VW; numerical entries omitted.]

As with the tests on the original dataset [15], [16], the ML classifiers outperformed the human-made heuristics, but for this dataset the difference compared to the Brown heuristic was marginal. We used a lower CAD timeout, which may benefit the Brown heuristic, as past analysis shows that when it makes sub-optimal choices these tend to be much worse. We also note that the relative performance of the Brown heuristic fell significantly when used on problems with more than three variables in [17]. The results for the sotd heuristic are poor because it had a particularly long prediction time on this random dataset. We note that there is scope to parallelize sotd, which may make it more competitive.
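The random generation step can be sketched as follows. This is a hypothetical generator, not the authors' code: the parameter values are placeholders rather than the moments actually extracted from the nlsat dataset.

```python
# Sketch (not the authors' generator): drawing random polynomial sets whose
# size parameters follow prescribed means / standard deviations, emitted in
# the same nested (degree-tuple, coefficient) format as Step I(b).
import numpy as np

rng = np.random.default_rng(0)

def random_poly_set(n_vars=3, mean_polys=3.0, sd_polys=1.0,
                    mean_terms=4.0, sd_terms=2.0, max_deg=4):
    # Placeholder moments; the paper fits these to the nlsat problems.
    n_polys = max(1, int(round(float(rng.normal(mean_polys, sd_polys)))))
    polys = []
    for _ in range(n_polys):
        n_terms = max(1, int(round(float(rng.normal(mean_terms, sd_terms)))))
        poly = [(tuple(int(d) for d in rng.integers(0, max_deg + 1, n_vars)),
                 int(rng.integers(-50, 51)))
                for _ in range(n_terms)]
        polys.append(poly)
    return polys

dataset = [random_poly_set() for _ in range(5)]
print(len(dataset))
```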
We presented our software pipeline for training and testing ML classifiers that select the variable ordering to use for CAD, and described the results of an experiment applying it to a new dataset.
The purpose of the experiment in Section 4 was to demonstrate that the pipeline can easily train classifiers that are competitive on a new dataset with almost no additional human effort, at least for a dataset of a similar scale (we note that the code is designed to work on higher degree polynomials but has only been tested on datasets of 3 and 4 variables so far). The pipeline makes it possible for a user to easily tune the CAD variable ordering choice classifiers to their particular application area.
Further, with only a little modification, as noted at the end of Section 3, the pipeline could be used to select the variable ordering for alternative algorithms that act on sets of polynomials and require a variable ordering. We thus expect the pipeline to be a useful basis for future research and plan to experiment with its use on such alternative algorithms in the near future.
Acknowledgements
This work is funded by EPSRC Project EP/R019622/1: Embedding Machine Learning within Quantifier Elimination Procedures. We thank the anonymous referees for their comments.
References
1. Alemi, A., Chollet, F., Een, N., Irving, G., Szegedy, C., Urban, J.: DeepMath – Deep sequence models for premise selection. In: Proc. NIPS '16, pp. 2243–2251 (2016), https://dl.acm.org/doi/10.5555/3157096.3157347
2. Bradford, R., Davenport, J., England, M., Errami, H., Gerdt, V., Grigoriev, D., Hoyt, C., Košta, M., Radulescu, O., Sturm, T., Weber, A.: Identifying the parametric occurrence of multiple steady states for some biological networks. J. Symbolic Computation, pp. 74–93 (2016), https://doi.org/10.1016/j.jsc.2015.11.008
10. Collins, G.: Quantifier elimination for real closed fields by cylindrical algebraic decomposition. In: Proc. 2nd GI Conf. Automata Theory and Formal Languages, pp. 134–183 (1975). Reprinted in [7]. https://doi.org/10.1007/3-540-07407-4_17
11. Dolzmann, A., Seidl, A., Sturm, T.: Efficient projection orders for CAD. In: Proc. ISSAC '04, pp. 111–118. ACM (2004), https://doi.org/10.1145/1005285.1005303
12. England, M.: Machine learning for mathematical software. In: Mathematical Software (LNCS 10931), pp. 165–174. Springer (2018), https://doi.org/10.1007/978-3-319-96418-8_20
13. England, M., Bradford, R., Davenport, J.: Cylindrical algebraic decomposition with equational constraints. J. Symbolic Computation, pp. 38–71 (2020), https://doi.org/10.1016/j.jsc.2019.07.019
14. England, M., Bradford, R., Davenport, J., Wilson, D.: Choosing a variable ordering for truth-table invariant cylindrical algebraic decomposition by incremental triangular decomposition. In: Mathematical Software (LNCS 8592), pp. 450–457. Springer (2014), http://dx.doi.org/10.1007/978-3-662-44199-2_68
15. England, M., Florescu, D.: Comparing machine learning models to choose the variable ordering for cylindrical algebraic decomposition. In: Intelligent Computer Mathematics (LNCS 11617), pp. 93–108. Springer (2019), https://doi.org/10.1007/978-3-030-23250-4_7
16. Florescu, D., England, M.: Algorithmically generating new algebraic features of polynomial systems for machine learning. In: Proc. SC² '19. CEUR-WS 2460 (2019), http://ceur-ws.org/Vol-2460/
17. Florescu, D., England, M.: Improved cross-validation for classifiers that make algorithmic choices to minimise runtime without compromising output correctness. In: Mathematical Aspects of Computer and Information Sciences (LNCS 11989), pp. 341–356. Springer (2020), https://doi.org/10.1007/978-3-030-43120-4_27
18. Gryak, J., Haralick, R., Kahrobaei, D.: Solving the conjugacy decision problem via machine learning. Experimental Mathematics 29(1), pp. 66–78 (2020), https://doi.org/10.1080/10586458.2018.1434704
19. Huang, Z., England, M., Wilson, D., Bridge, J., Davenport, J., Paulson, L.: Using machine learning to improve cylindrical algebraic decomposition. Mathematics in Computer Science (4), pp. 461–488 (2019), https://doi.org/10.1007/s11786-019-00394-8
20. Huang, Z., England, M., Wilson, D., Davenport, J., Paulson, L., Bridge, J.: Applying machine learning to the problem of choosing a heuristic to select the variable ordering for cylindrical algebraic decomposition. In: Intelligent Computer Mathematics (LNAI 8543), pp. 92–107. Springer (2014), http://dx.doi.org/10.1007/978-3-319-08434-3_8
21. Kuipers, J., Ueda, T., Vermaseren, J.: Code optimization in FORM. Comp. Phys. Comm., pp. 1–19 (2015), https://doi.org/10.1016/j.cpc.2014.08.008
22. Liang, J., Ganesh, V., Poupart, P., Czarnecki, K.: Learning rate based branching heuristic for SAT solvers. In: Proc. SAT '16 (LNCS 9710), pp. 123–140. Springer (2016)
23. Mulligan, C., Davenport, J., England, M.: TheoryGuru: A Mathematica package to apply quantifier elimination technology to economics. In: Mathematical Software (LNCS 10931), pp. 369–378. Springer (2018), https://doi.org/10.1007/978-3-319-96418-8_44
24. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. J. Machine Learning Research 12