High Dimensional Human Guided Machine Learning
Eric Holloway, Robert Marks II
Dept. of Electrical & Computer Engineering, Baylor University, Waco, Texas
email: first last @ baylor.edu
Abstract
Have you ever looked at a machine learning classification model and thought, "I could have made that"? Well, that is what we test in this project, comparing XGBoost trained on human engineered features to XGBoost trained directly on the data. The human engineered features do not outperform XGBoost trained directly on the data, but they are comparable. This project contributes a novel method for utilizing human created classification models on high dimensional datasets.
Why Human Guided?
In the artificial intelligence, machine learning, and human computation fields there is little research into the effectiveness of human generated models. Existing work includes human guided simple search (Anderson et al. 2000) and human guided tabu search (Klau et al. 2002). Humans outperform state of the art algorithms when solving complex visual problems, such as the travelling salesman problem (Krolak, Felts, and Marble 1971; Dry et al. 2006; Acuña and Parada 2010). Numerous machine learning algorithms are NP-Complete or harder, such as the set cover machine (SCM) (Marchand and Taylor 2003). Breakthroughs have been achieved by including humans-in-the-loop for hard optimization and combinatorial problems (Le Bras et al. 2014; Khatib et al. 2011). With these promising results, there is need for further investigation into human guided machine learning.

Machine learning algorithms typically work with high dimensional datasets, which a human cannot visualize in their entirety. But the high dimensionality of a dataset is not an insurmountable obstacle to effectively using a human-in-the-loop.
Approach and Implementation
In this project we use a dimension subset approach to test out human effectiveness in creating classification models. Instead of having a human attempt high dimensional visualization, we have humans design models on pairs of dimensions. These models are then used to transform the dataset into a feature space.

XGBoost (Chen and Guestrin 2016), short for eXtreme Gradient Boosting, is a popular machine learning library that has been used to win multiple Kaggle competitions. An XGBoost model is trained on the transformed data, and the results are compared to training XGBoost on the untransformed data. We are not restricted to only using XGBoost; other machine learning approaches also work, and we have tested linear perceptrons, linear regression, and support vector machines.
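As a concrete illustration of the comparison, a minimal harness might look like the following sketch. This is not the authors' code: fit_and_score is a hypothetical helper, the hyperparameters shown are placeholders (the actual values are chosen by cross validation, as described in the results section), and X_raw (the worker-used dimensions), X_feat (the transformed feature matrix, constructed below), and y are assumed to be prepared as described in the rest of this section.

import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def fit_and_score(X, y, seed=0):
    """Train an XGBoost classifier on one split and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    clf = xgb.XGBClassifier(n_estimators=100, max_depth=5, learning_rate=0.1)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# acc_data  = fit_and_score(X_raw, y)   # XGBoost directly on the data
# acc_human = fit_and_score(X_feat, y)  # XGBoost on worker-generated features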
The following process is used to create each model.

1. A pair of dimensions is selected, and the training dataset (X_train) is centered and normalized for those dimensions. The training dataset contains about 100 samples. The pairs are selected based on low correlation between the two dimensions; if there is low correlation, then it is easier to identify clusters of data points. Not all dimensions in a dataset are used by the workers.

2. The worker is given a scatterplot of the two dimensions and proceeds to draw polygons (ρ) to separate the data into classification regions. Each polygon classifies a sample to one class (ρ_class). For simplicity, the polygon is a rectangle, making the models similar to those produced by the SCM (Marchand and Taylor 2003).

3. The collection of polygons drawn by the worker on a pair of dimensions is a single model (Φ). An example of a model is shown in Figure 1.

4. The model is evaluated on a test dataset (X_test), producing an accuracy score for the classification regions (Φ_acc); see Equation 1. Only samples contained by a polygon in the model contribute to the accuracy score, where X_ρ^valid denotes the samples contained by polygon ρ and X_Φ^valid the samples contained by any polygon in Φ. The test dataset contains about 200 samples.

\Phi_{acc} = \frac{\sum_{\rho \in \Phi} \sum_{x \in X^{valid}_{\rho}} [x_{class} = \rho_{class}]}{|X^{valid}_{\Phi}|}    (1)

The sample transformation function is shown in Equation 2, which weights membership in a model's polygons by the model's accuracy.

f(x, \Phi) = [x \in \Phi] \cdot \Phi_{acc}    (2)

[Figure 1: Example of polygons drawn by worker.]

Then, for M samples and N models, we have the following M × N feature matrix:

\begin{bmatrix}
f(x_1, \Phi_1) & f(x_1, \Phi_2) & \cdots & f(x_1, \Phi_N) \\
f(x_2, \Phi_1) & f(x_2, \Phi_2) & \cdots & f(x_2, \Phi_N) \\
\vdots & \vdots & \ddots & \vdots \\
f(x_M, \Phi_1) & f(x_M, \Phi_2) & \cdots & f(x_M, \Phi_N)
\end{bmatrix}

XGBoost is trained on a subset M' of the M samples, and then used to classify the remaining samples. To perform a fair comparison, only the dimensions used by the workers are included in the untransformed samples. For example, if the dataset has D dimensions, but only D' dimensions are used, the XGBoost model is trained on an M' × D' matrix. Thus, one XGBoost model is trained on the untransformed samples (an M' × D' data matrix), and another on the transformed samples (an M' × N feature matrix).
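The rectangle models and the transform can be made concrete with a short sketch. This is our own minimal reconstruction of Equations 1 and 2, not the authors' code; the names (Rect, Model, feature_matrix) are hypothetical, and the rectangles are assumed axis-aligned as stated above.

import numpy as np

class Rect:
    """Axis-aligned rectangle (rho) labeling one class on a dimension pair."""
    def __init__(self, x_min, x_max, y_min, y_max, label):
        self.x_min, self.x_max = x_min, x_max
        self.y_min, self.y_max = y_min, y_max
        self.label = label  # rho_class

    def contains(self, p):
        """True if 2-D point p falls inside the rectangle."""
        return (self.x_min <= p[0] <= self.x_max and
                self.y_min <= p[1] <= self.y_max)

class Model:
    """A worker's model Phi: a dimension pair plus the drawn rectangles."""
    def __init__(self, dims, rects):
        self.dims = list(dims)  # the two dataset columns shown to the worker
        self.rects = rects
        self.acc = 0.0          # Phi_acc, set by score()

    def score(self, X_test, y_test):
        """Equation 1: accuracy over test samples covered by some rectangle."""
        P = X_test[:, self.dims]
        covered = np.zeros(len(P), dtype=bool)
        correct = 0
        for r in self.rects:
            inside = np.array([r.contains(p) for p in P])
            covered |= inside
            correct += int(np.sum(inside & (y_test == r.label)))
        self.acc = correct / max(int(covered.sum()), 1)
        return self.acc

    def transform(self, x):
        """Equation 2: f(x, Phi) = [x in Phi] * Phi_acc."""
        p = x[self.dims]
        return self.acc if any(r.contains(p) for r in self.rects) else 0.0

def feature_matrix(X, models):
    """The M x N matrix of f(x_i, Phi_j) values fed to XGBoost."""
    return np.array([[m.transform(x) for m in models] for x in X])

Calling score on each model with the test dataset fixes Φ_acc, after which feature_matrix produces the M × N matrix above.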
The Amazon online service Mechanical Turk (AMT) is used to gather human produced models.

1. The AMT job directs the worker to a website where they can perform the classification task.

2. A scatterplot shows X_train plotted according to the randomly chosen dimension pair, and the worker draws boxes on the scatterplot.

3. A progress bar gives feedback on the accuracy of the model. Accuracy is calculated on a validation dataset X_valid. The validation dataset contains about 100 samples. Only models that achieve an accuracy above 50% are accepted, to provide quality control.

4. Once the model has been accepted, the website gives the worker a job completion code.

5. Back at the AMT job posting, the worker submits the code for payment.

We use five datasets with binary classification tasks. The datasets consist of one synthetic clustering task; the rest are real world datasets from Kaggle. Most of the datasets are highly unbalanced, so we balance the datasets to have an equal number of samples from both classes (a sketch of this balancing step is given after Table 2). Additionally, with the exception of the synthetic dataset, the dimensions consist of a mixture of nominal, integer, and continuous variables. A summary of the datasets is in Table 1, and the dataset sources are the following.

• Madelon (Mad.) (Guyon et al. 2004)
• Carvana (Car.) (Carvana 2011)
• Homesite (Home.) (Homesite 2015)
• Melbourne (Mel.) (of Melbourne 2010)
• Credit (Kaggle 2011)

Name    Nom.  Int.  Cont.  Note
Mad.    0     500   0      hyper-XOR problem
Car.    18    0     14     car auction
Home.   295   0     1      real estate
Mel.    178   61    11     grant applications
Credit  0     6     4      credit risks

Table 1: Datasets and their characteristics. Nom. = nominal. Int. = integer. Cont. = continuous.

Name    M'    M-M'   D'   Data   N    Features
Mad.    2000  600    73   0.650  320  0.655
Car.    2000  2230   7    0.521  194  0.481
Home.   2000  1806   43   0.795  194  0.723
Mel.    2000  2500   9    0.542  64   0.512
Credit  2000  18052  8    0.762  156  0.717

Table 2: Accuracy results of training XGBoost directly on data (Data column) and on features produced by AMT workers (Features column). M is the total number of samples. M' is the number of samples in the training dataset. M-M' is the number of samples in the test dataset. D' is the number of dimensions used by the workers. N is the number of models the workers created, and the number of features generated.
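The balancing step mentioned above can be implemented by downsampling the majority class, as in the following minimal sketch. It is our own reconstruction; the paper does not specify the balancing method, so downsampling is an assumption, and binary labels 0/1 are assumed.

import numpy as np

def balance(X, y, seed=0):
    """Downsample the majority class so both classes have equal counts.
    Assumes binary labels 0/1; the exact method is not stated in the paper."""
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, n, replace=False),
                           rng.choice(idx1, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]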
Results and Conclusion
Table 2 shows the results from training XGBoost directly on the data, as well as on the features generated by the AMT workers. The XGBoost model is parameterized by cross validation. The parameters are learning rate (0.01, 0.05, 0.1, 0.3), max tree depth (2, 5, 10, 15), and number of boosting rounds (50, 100, 200, 400, 800); a sketch of this search is given at the end of this section.

We have shown that human guided machine learning can be crowdsourced through workers drawing polygons on scatterplots. These models do not outperform standard algorithmic approaches, but they are comparable. The contribution of this project is human model creation on high dimensional datasets.

Future research will discover if and when human produced models outperform purely algorithmic approaches. In this research, human produced models likely did not outperform algorithmic approaches due to loss of information: transforming the data using the models reduces the data granularity. A way ahead is to find a way to preserve granularity while using the human produced models.
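The cross validation described above can be reproduced with a standard grid search. The sketch below is our reconstruction using scikit-learn's GridSearchCV, not the authors' code; the grid values come from this section, while the fold count (cv=5) and accuracy scoring are assumptions.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Grid taken from the values listed above; n_estimators plays the role
# of the number of boosting rounds.
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [2, 5, 10, 15],
    "n_estimators": [50, 100, 200, 400, 800],
}
search = GridSearchCV(xgb.XGBClassifier(), param_grid,
                      scoring="accuracy", cv=5)  # cv=5 is an assumption
# search.fit(X_train, y_train)   # with the training split described earlier
# best_model = search.best_estimator_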
Acknowledgements

The researchers thank the AMT workers who contributed their valuable insight.
References

[Acuña and Parada 2010] Acuña, D. E., and Parada, V. 2010. People efficiently explore the solution space of the computationally intractable traveling salesman problem to find near-optimal tours. PLoS ONE.

[Anderson et al. 2000] Anderson, D.; Anderson, E.; Lesh, N.; Marks, J.; Mirtich, B.; Ratajczak, D.; and Ryall, K. 2000. Human-guided simple search. In AAAI/IAAI, 209–216.

[Carvana 2011] Carvana. 2011. Don't get kicked! Kaggle competition.

[Chen and Guestrin 2016] Chen, T., and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. arXiv preprint arXiv:1603.02754.

[Dry et al. 2006] Dry, M.; Lee, M. D.; Vickers, D.; and Hughes, P. 2006. Human performance on visually presented traveling salesperson problems with varying numbers of nodes. The Journal of Problem Solving.

[Guyon et al. 2004] Guyon, I.; Gunn, S.; Ben-Hur, A.; and Dror, G. 2004. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, 545–552.

[Homesite 2015] Homesite. 2015. Homesite quote conversion. Kaggle competition.

[Kaggle 2011] Kaggle. 2011. Give me some credit. Kaggle competition.

[Khatib et al. 2011] Khatib, F.; DiMaio, F.; Cooper, S.; Kazmierczyk, M.; Gilski, M.; Krzywda, S.; Zabranska, H.; Pichova, I.; Thompson, J.; Popović, Z.; et al. 2011. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nature Structural & Molecular Biology.

[Klau et al. 2002] Klau, G. W.; Lesh, N.; Marks, J.; and Mitzenmacher, M. 2002. Human-guided tabu search. In AAAI/IAAI, 41–47.

[Krolak, Felts, and Marble 1971] Krolak, P.; Felts, W.; and Marble, G. 1971. A man-machine approach toward solving the traveling salesman problem. Communications of the ACM.

[Le Bras et al. 2014] Le Bras, R.; et al. 2014. In Second AAAI Conference on Human Computation and Crowdsourcing.

[Marchand and Taylor 2003] Marchand, M., and Taylor, J. S. 2003. The set covering machine. The Journal of Machine Learning Research.

[of Melbourne 2010] University of Melbourne. 2010. Predict grant applications. Kaggle competition.