Combining Machine Learning Models using combo Library
Yue Zhao, Xuejian Wang, Cheng Cheng, Xueying Ding
H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213
[email protected], [email protected], [email protected], [email protected]
Abstract
Model combination, often regarded as a key sub-field of ensemble learning, has been widely used in both academic research and industry applications. To facilitate this process, we propose and implement an easy-to-use Python toolkit, combo, to aggregate models and scores under various scenarios, including classification, clustering, and anomaly detection. In a nutshell, combo provides a unified and consistent way to combine both raw and pre-trained models from popular machine learning libraries, e.g., scikit-learn, XGBoost, and LightGBM. With accessibility and robustness in mind, combo is designed with detailed documentation, interactive examples, continuous integration, code coverage, and maintainability checks; it can be installed easily through the Python Package Index (PyPI) or https://github.com/yzhao062/combo.

Introduction
Recently, model combination has gained much attention in many real-world tasks and has served as the winning solution in numerous data science competitions such as Kaggle (2007). It is considered a sub-field of ensemble learning, which aims to achieve better prediction performance (2012). Despite that, its use often extends beyond machine learning; it has been applied in other domains, such as experimental design in clinical trials. Generally speaking, model combination has two key uses: stability improvement and performance boost. For instance, practitioners may run independent trials and then average the results to eliminate built-in randomness and uncertainty, so that more reliable results are obtained. Additionally, even in a non-ideal scenario, base models may make independent but complementary errors; the combined model can therefore yield better performance than any constituent model.

Although model combination is crucial for all sorts of learning tasks, dedicated Python libraries are absent. A few packages partly fulfill this purpose, but established libraries either exist as single-purpose tools like PyOD (2019) and pycobra (2018), or as part of general-purpose libraries like scikit-learn (2011).
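The stability gain from averaging independent trials can be illustrated with a small simulation; the estimator and noise model below are illustrative assumptions for this sketch, not taken from the paper:

```python
import random
import statistics

random.seed(0)

def noisy_estimate():
    """One independent trial: the true value 1.0 plus zero-mean Gaussian
    noise (an illustrative stand-in for a randomized model's output)."""
    return 1.0 + random.gauss(0, 0.5)

# Spread of single trials vs. the average of 10 independent trials.
singles = [noisy_estimate() for _ in range(1000)]
averaged = [statistics.mean(noisy_estimate() for _ in range(10))
            for _ in range(1000)]

# Averaging independent trials shrinks the spread of the estimate.
print(statistics.stdev(singles) > statistics.stdev(averaged))
```

The averaged estimates concentrate much more tightly around the true value, which is exactly the "stability improvement" use case described above.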
>>> from combo.models.classifier_dcs import DCS
Code Snippet 1: Demo of combo API with DCS

combo can fill this gap with four key advantages. Firstly, combo contains more than 15 combination algorithms, including both classical algorithms like dynamic classifier selection (DCS) (1997) and recent advancements like LSCP (2019). It can handle the combination operation for all sorts of tasks, such as classification, clustering, and anomaly detection. Secondly, combo works with both raw and pre-trained learning models from major libraries like scikit-learn, XGBoost, and LightGBM, given certain conditions are met. Thirdly, the models in combo are designed with unified APIs, detailed documentation, and interactive examples for easy use. Lastly, all combo models come with unit tests and are checked by continuous integration tools for robustness; code coverage and maintainability checks are also enabled for performance and sustainability. To the best of our knowledge, this is the first comprehensive framework for combining learning models and scores in Python, which is valuable for data practitioners, machine learning researchers, and data competition participants.

https://pycombo.readthedocs.io
https://mybinder.org/v2/gh/yzhao062/combo/master

Figure 1: Comparison of Selected Classifier Combination on Simulated Data

Core Scenarios

combo models for classification, clustering, and anomaly detection share unified APIs. Inspired by scikit-learn's API design, the models in combo all come with the following key methods: (i) the fit function processes the train data and gets the model ready for prediction; (ii) the predict function generates labels for the unknown test data once the model is fitted; (iii) predict_proba generates predictions in probability instead of the discrete labels given by predict; and (iv) fit_predict calls the fit function on the input data and then predicts on it (applicable to unsupervised models only).
Code Snippet 1 shows the use of the above APIs on DCS. Notably, fitted (pre-trained) models can be used directly by setting the pre_fitted flag; the fit process will then be skipped.
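The four-method pattern can be sketched with a minimal probability-averaging combiner. This is a toy stand-in, not combo's actual implementation; the class and the dummy base model below are illustrative:

```python
# A toy probability-averaging combiner mirroring the combo/scikit-learn
# API pattern (fit / predict / predict_proba). Illustrative only.
class AveragingCombiner:
    def __init__(self, base_estimators, pre_fitted=False):
        self.base_estimators = base_estimators
        self.pre_fitted = pre_fitted  # skip training for pre-trained models

    def fit(self, X, y):
        if not self.pre_fitted:
            for clf in self.base_estimators:
                clf.fit(X, y)
        return self

    def predict_proba(self, X):
        # Average the class-probability outputs of all base models.
        probas = [clf.predict_proba(X) for clf in self.base_estimators]
        n = len(probas)
        return [[sum(p[i][j] for p in probas) / n
                 for j in range(len(probas[0][0]))]
                for i in range(len(X))]

    def predict(self, X):
        # Pick the class with the highest averaged probability.
        return [row.index(max(row)) for row in self.predict_proba(X)]

class ConstantModel:
    """Dummy base model returning fixed class probabilities."""
    def __init__(self, proba):
        self.proba = proba
    def fit(self, X, y):
        return self
    def predict_proba(self, X):
        return [self.proba for _ in X]

combiner = AveragingCombiner(
    [ConstantModel([0.9, 0.1]), ConstantModel([0.2, 0.8]),
     ConstantModel([0.1, 0.9])])
combiner.fit([[0], [1]], [0, 1])
print(combiner.predict([[0]]))  # averaged probas [0.4, 0.6] -> [1]
```

Setting pre_fitted=True in this sketch skips the training loop, mirroring how pre-trained models are plugged in.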
Classifier Combination aims to aggregate multiple base supervised classifiers in either a parallel or sequential manner. Selected classifier combination methods implemented in combo include stacking (meta-learning), dynamic classifier selection, dynamic ensemble selection, and a group of heuristic aggregation methods like averaging and majority vote. Fig. 1 shows how different frameworks behave on a simulated dataset with 300 points. The leftmost one is a simple kNN model (k=15), and the other three are combinations of five kNN models with different values of k.

Different from classifier combination, Cluster Combination is usually done in an unsupervised manner. The focus is on how to align the predicted labels generated by base clusterings, as cluster labels are categorical instead of ordinal. For instance, [0, 0, 1, 2, 2] and [1, 1, 0, 2, 2] are equivalent with appropriate alignment. Two classical clustering combination methods are therefore implemented to handle this: clustering combination using evidence accumulation (EAC) (2005) and Clusterer Ensemble (2006).

Anomaly Detection concentrates on identifying anomalous objects that deviate from the general data distribution (2019). The challenges of combining multiple outlier detectors lie in its unsupervised nature and extreme data imbalance. Two recent combination frameworks, LSCP (2019) and XGBOD (2018), are included in combo for unsupervised and semi-supervised detector combination.
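The label-alignment problem for cluster combination can be made concrete with a short sketch. This brute-force matcher over label permutations is an illustration of the alignment idea, not combo's EAC or Clusterer Ensemble code, and is only practical for small numbers of clusters:

```python
from itertools import permutations

def best_alignment(labels_a, labels_b):
    """Relabel labels_b so it agrees with labels_a on as many points as
    possible, trying every permutation of cluster ids (fine for small k)."""
    ids = sorted(set(labels_b))
    best, best_score = list(labels_b), -1
    for perm in permutations(ids):
        mapping = dict(zip(ids, perm))
        relabeled = [mapping[l] for l in labels_b]
        score = sum(a == b for a, b in zip(labels_a, relabeled))
        if score > best_score:
            best, best_score = relabeled, score
    return best

# Two partitions that differ only in how cluster ids are named:
print(best_alignment([0, 0, 1, 2, 2], [1, 1, 0, 2, 2]))  # -> [0, 0, 1, 2, 2]
```

After alignment the two partitions agree on every point, showing why categorical cluster labels must be matched before any vote or average is taken.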
Score Combination comes with more flexibility than the above tasks, as it only requires the output from multiple models, whether from a group of classifiers or outlier detectors. As a general-purpose task, score combination methods are easy to use without the need to initialize a dedicated class. Each aggregation method, e.g., average of maximum (AOM), can be invoked directly.
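As an illustration of such a functional aggregation method, the sketch below implements average-of-maximum in plain Python. combo ships its own implementation; this standalone version reflects the commonly described AOM scheme (split the detectors into random subgroups, take the maximum score within each subgroup, then average those maxima) and its signature is an assumption, not combo's API:

```python
import random

def aom(scores, n_buckets=2, seed=0):
    """Average of Maximum (AOM): randomly split the detectors into
    n_buckets subgroups, take each sample's max score within every
    subgroup, then average those maxima.

    scores: list of rows, one per sample; each row holds one score
    per detector.
    """
    n_detectors = len(scores[0])
    rng = random.Random(seed)
    idx = list(range(n_detectors))
    rng.shuffle(idx)
    size = n_detectors // n_buckets
    buckets = [idx[i * size:(i + 1) * size] for i in range(n_buckets)]
    buckets[-1].extend(idx[n_buckets * size:])  # leftovers join last bucket

    combined = []
    for row in scores:
        maxima = [max(row[j] for j in bucket) for bucket in buckets]
        combined.append(sum(maxima) / n_buckets)
    return combined

# Four detectors' outlier scores for two samples, combined with 2 buckets:
scores = [[0.1, 0.9, 0.2, 0.4],
          [0.8, 0.3, 0.7, 0.2]]
print(aom(scores, n_buckets=2))
```

Because AOM only consumes a score matrix, the same call works no matter which library produced the detectors, which is the flexibility the paragraph above describes.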
Conclusion and Future Directions

combo is a comprehensive Python library to combine models from major machine learning libraries. It supports four types of combination scenarios (classification, clustering, anomaly detection, and raw scores) with unified APIs, detailed documentation, and interactive examples. As avenues for future work, we will add combination frameworks for customized deep learning models (from TensorFlow, PyTorch, and MXNet), enable GPU acceleration and parallelization for scalability, and expand to more task scenarios such as imbalanced learning and regression.
References

[2007] Bell, R. M., and Koren, Y. 2007. Lessons from the Netflix prize challenge. SIGKDD Explorations.
[2005] Fred, A. L., and Jain, A. K. 2005. Combining multiple clusterings using evidence accumulation. PAMI.
[2018] Guedj, B., and Srinivasa Desikan, B. 2018. Pycobra: A Python toolbox for ensemble learning and visualisation. JMLR.
[2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; et al. 2011. Scikit-learn: Machine learning in Python. JMLR.
[1997] Woods, K.; Kegelmeyer, W. P.; and Bowyer, K. 1997. Combination of multiple classifiers using local accuracy estimates. PAMI.
[2018] Zhao, Y., and Hryniewicki, M. K. 2018. XGBOD: Improving supervised outlier detection with unsupervised representation learning. In IJCNN.
[2019] Zhao, Y.; Nasrullah, Z.; Hryniewicki, M. K.; and Li, Z. 2019. LSCP: Locally selective combination in parallel outlier ensembles. In SDM, 585–593. SIAM.
[2019] Zhao, Y.; Nasrullah, Z.; and Li, Z. 2019. PyOD: A Python toolbox for scalable outlier detection. JMLR.
[2012] Zhou, Z.-H. 2012. Ensemble Methods: Foundations and Algorithms. CRC Press.
[2006] Zhou, Z.-H., and Tang, W. 2006. Clusterer ensemble. KBS.