Botao Wang
Northeastern University
Publication
Featured research published by Botao Wang.
Neurocomputing | 2015
Botao Wang; Shan Huang; Junhao Qiu; Yu Liu; Guoren Wang
In this age of big data, analyzing massive data sets is a challenging problem. MapReduce is a simple, scalable and fault-tolerant data processing framework that enables us to process very large volumes of data. Many machine learning algorithms have been designed based on MapReduce, but only a few works address parallel extreme learning machine (ELM), a fast and accurate learning algorithm. Online sequential extreme learning machine (OS-ELM) is an improved ELM algorithm that supports online sequential learning efficiently. In this paper, we first analyze the dependency relationships among the matrix calculations of OS-ELM, and then propose a parallel online sequential extreme learning machine (POS-ELM) based on MapReduce. POS-ELM is evaluated with real and synthetic data containing up to 1280K training records and up to 128 attributes. The experimental results show that the training and testing accuracy of POS-ELM are at the same level as those of OS-ELM and ELM, and that it scales well with both the number of training records and the number of attributes. Whereas the original ELM and OS-ELM are bounded by the resources of a single processing unit, POS-ELM can handle much larger scale data, and its speedup grows with the number of training records. It can be concluded that POS-ELM is more capable than both ELM and OS-ELM for large scale learning.
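For context, the recursion below is a minimal single-machine sketch of the standard OS-ELM update that POS-ELM parallelizes; the variable names and this sequential form are illustrative assumptions, not the authors' MapReduce implementation.

```python
import numpy as np

def os_elm_update(P, beta, H_chunk, T_chunk):
    """One OS-ELM update for a new chunk: H_chunk is the hidden-layer output
    matrix of the chunk, T_chunk its targets (recursive least-squares form)."""
    n = H_chunk.shape[0]
    # K has the size of the chunk, so each update stays cheap
    K = np.linalg.inv(np.eye(n) + H_chunk @ P @ H_chunk.T)
    # update the covariance-like matrix P, then the output weights beta
    P = P - P @ H_chunk.T @ K @ H_chunk @ P
    beta = beta + P @ H_chunk.T @ (T_chunk - H_chunk @ beta)
    return P, beta
```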
Neurocomputing | 2016
Shan Huang; Botao Wang; Junhao Qiu; Jitao Yao; Guoren Wang; Ge Yu
In this era of big data, analyzing large scale data efficiently and accurately has become a challenging problem. As one of the ELM variants, online sequential extreme learning machine (OS-ELM) provides a method to analyze incremental data, and ensemble methods provide a way to learn from data more accurately. MapReduce, which provides a simple, scalable and fault-tolerant framework, can be utilized for large scale learning. In this paper, we first propose an ensemble OS-ELM framework which supports any combination of bagging, subspace partitioning and cross validation. Then we design a parallel ensemble of online sequential extreme learning machine (PEOS-ELM) algorithm based on MapReduce for large scale learning. The PEOS-ELM algorithm is evaluated with real and synthetic data containing up to 5120K training records and up to 512 attributes. Its speedup reaches 40 on a cluster with up to 80 cores. The accuracy of PEOS-ELM is at the same level as that of ensemble OS-ELM executing on a single machine, which is higher than that of the original OS-ELM.
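As a rough illustration of how bagging and subspace partitioning could be combined to feed an ensemble of OS-ELM learners, the sketch below builds one bootstrap sample plus a random feature subspace per learner; the function name and this particular combination are assumptions, not the PEOS-ELM implementation.

```python
import numpy as np

def build_ensemble_inputs(X, y, n_learners, subspace_size, seed=0):
    """Return a list of (X_sub, y_sub, feature_indices) training tasks."""
    rng = np.random.default_rng(seed)
    tasks = []
    for _ in range(n_learners):
        rows = rng.integers(0, X.shape[0], size=X.shape[0])               # bagging: bootstrap rows
        cols = rng.choice(X.shape[1], size=subspace_size, replace=False)  # random feature subspace
        tasks.append((X[np.ix_(rows, cols)], y[rows], cols))
    return tasks
```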
Neurocomputing | 2010
Guoren Wang; Yuhai Zhao; Xiangguo Zhao; Botao Wang; Baiyou Qiao
Extensive studies have shown that mining gene expression data is important for both bioinformatics research and biomedical applications. However, most existing studies focus only on either co-regulated gene clusters or emerging patterns. In fact, another analysis scheme, simultaneously mining phenotypes and diagnostic genes, is also biologically significant but has received relatively little attention so far. In this paper, we explore a novel concept, the local conserved gene cluster (LC-Cluster), to address this problem. Specifically, an LC-Cluster contains a subset of genes and a subset of conditions such that the genes show steady expression values (rather than the coherent rising-and-falling patterns defined in some previous work) on that subset of conditions, but not necessarily across all given conditions. To avoid exponential growth of the subspace search, we further present two efficient algorithms, FALCONER and E-FALCONER, which mine the complete set of maximal LC-Clusters from gene expression data sets based on an enumeration tree. Extensive experiments conducted on both real and synthetic gene expression data sets show that: (1) our approaches are efficient and effective; (2) they outperform existing enumeration tree based algorithms; and (3) they can discover a number of LC-Clusters that are potentially of high biological significance.
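To make the "steady expression" notion concrete, the check below tests whether every gene in a candidate cluster varies by at most a tolerance delta over the selected conditions; this criterion and the tolerance are illustrative assumptions, not the paper's formal LC-Cluster definition.

```python
import numpy as np

def is_steady_cluster(expr, gene_idx, cond_idx, delta):
    """expr: genes x conditions matrix; True if each selected gene's expression
    range over the selected conditions is within delta."""
    sub = expr[np.ix_(gene_idx, cond_idx)]
    return bool(np.all(sub.max(axis=1) - sub.min(axis=1) <= delta))
```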
International Conference on Data Mining | 2014
Shan Huang; Botao Wang; Jingyang Zhu; Guoren Wang; Ge Yu
It has become a challenge to organize and process large scale multi-dimensional data. In this paper, we present R-HBase, a multi-dimensional indexing framework for cloud computing environments. The R-HBase framework consists of a storage layer, which sustains high write throughput, and an index layer, which answers queries efficiently. R-HBase is evaluated with synthetic data: it can handle tens of thousands of inserts per second while efficiently processing multi-dimensional queries. The results also show that R-HBase is faster than MD-HBase while supporting multi-dimensional queries with more than three dimensions.
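One common building block of R-tree-style index layers like the one described here is pruning subtrees whose minimum bounding rectangle (MBR) cannot intersect the query range; the generic test below is a sketch of that idea, not R-HBase's actual index structure.

```python
def mbr_intersects(mbr_low, mbr_high, q_low, q_high):
    """Per-dimension bounds as equal-length sequences; True if the node's MBR
    overlaps the query box in every dimension (otherwise the subtree is pruned)."""
    return all(ml <= qh and ql <= mh
               for ml, mh, ql, qh in zip(mbr_low, mbr_high, q_low, q_high))
```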
Memetic Computing | 2017
Shan Huang; Botao Wang; Yuemei Chen; Guoren Wang; Ge Yu
In this era of big data, more and more models need to be trained to mine useful knowledge from large scale data. Training multiple models accurately and efficiently, so as to make full use of limited computing resources, has become a challenging problem. As one of the ELM variants, online sequential extreme learning machine (OS-ELM) provides a method to learn from incremental data. MapReduce, which provides a simple, scalable and fault-tolerant framework, can be utilized for large scale learning. In this paper, we propose an efficient parallel method for batched online sequential extreme learning machine (BPOS-ELM) training using MapReduce. Map execution time is estimated from historical statistics using a regression method and inverse distance weighted interpolation, while Reduce execution time is estimated based on complexity analysis and a regression method. Based on these estimations, BPOS-ELM generates a Map execution plan and a Reduce execution plan. Finally, BPOS-ELM launches one MapReduce job to train multiple OS-ELM models according to the generated execution plan, and collects execution information to further improve estimation accuracy. Our proposal is evaluated with real and synthetic data. The experimental results show that the accuracy of BPOS-ELM is at the same level as those of OS-ELM and parallel OS-ELM (POS-ELM), while its training efficiency is higher.
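As a small sketch of inverse distance weighted (IDW) interpolation of the kind mentioned above for estimating Map execution time from historical runs: the feature choice, power parameter and this exact form are assumptions for illustration.

```python
import numpy as np

def idw_estimate(history_feats, history_times, query_feat, power=2.0, eps=1e-9):
    """Estimate execution time for query_feat from historical (features, time) pairs."""
    d = np.linalg.norm(history_feats - query_feat, axis=1)
    if np.any(d < eps):                      # exact match found in the history
        return float(history_times[np.argmin(d)])
    w = 1.0 / d ** power                     # closer historical runs weigh more
    return float(np.sum(w * history_times) / np.sum(w))
```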
World Wide Web | 2015
Botao Wang; Pingping Liu; Guoren Wang; Xiangguo Zhao
The number of cycle matchings increases exponentially with the number of subscriptions and the maximum length of cycle matchings, which requires a large amount of space to store intermediate results. Approximate cycle matching aims to store only a small portion of the intermediate results while finding as many cycle matchings as possible. The existing solution prunes intermediate results by a threshold on the probability of a subscription being matched, neglecting how dispersed those probabilities are. In this paper, we propose an approximate dynamic cycle matching algorithm that classifies intermediate results using extreme learning machine (ELM). We first introduce a method for incorporating probability information into the feature vector, and then propose the approximate cycle matching algorithm. Further, considering that the data distribution of subscriptions may change over time, we propose a dynamic classification strategy. The proposed approximate cycle matching algorithm and the dynamic classification strategy are evaluated in a simulated environment. The results show that, compared with approximate cycle matching based on a probability threshold, approximate cycle matching based on ELM classification is faster, and the dynamic classification strategy is more efficient and convenient. ELM is also more suitable than SVM for approximate dynamic cycle matching with regard to response time.
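For readers unfamiliar with the classifier used here, the sketch below trains a basic ELM (random hidden layer, output weights by pseudo-inverse); the activation, sizes and random seed are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def train_elm(X, T, n_hidden, seed=0):
    """X: samples x features, T: samples x outputs; returns (W, b, beta)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights
    b = rng.standard_normal(n_hidden)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # sigmoid hidden-layer output
    beta = np.linalg.pinv(H) @ T                      # output weights via pseudo-inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```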
IEEE Transactions on Systems, Man, and Cybernetics | 2017
Shizhuo Deng; Botao Wang; Shan Huang; Chuncheng Yue; Jianpeng Zhou; Guoren Wang
In this era of big data, stream data classification, one of the typical data stream applications, has become increasingly significant and challenging. In these applications, data classification is much more frequent than model training, and the rate at which stream data arrive for classification is high and time-varying, so classifying stream data efficiently with high throughput is an important problem. In this paper, we first analyze and categorize current data stream machine learning algorithms according to their data structures. Then, we propose a stream data classification topology (SDC-Topology) on Storm. For classification algorithms based on matrix computations, we propose a self-adaptive stream data classification framework (SASDC-Framework) for efficient stream data classification on Storm. In SASDC-Framework, all the data sets arriving within the same unit of time are partitioned into subsets of a nearly optimal partition size and processed in parallel. To select this partition size efficiently, we adopt a bisection strategy and an inverse distance weighted strategy. Extreme learning machine, a fast and accurate machine learning method based on matrix computations, is used to test the efficiency of our proposals. According to the evaluation results, the throughputs based on SASDC-Framework are 8–35 times higher than those based on SDC-Topology, and the best throughput exceeds 40000 prediction requests per second in our environment.
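The snippet below sketches one way a bisection search over partition sizes could work, assuming the per-batch processing cost is monotone in the partition size; the cost model, budget and stopping rule are assumptions for illustration, not the SASDC-Framework algorithm.

```python
def pick_partition_size(cost, lo, hi, budget, tol=1):
    """Return the largest size in [lo, hi] whose estimated cost stays within budget.
    `cost` is a callable mapping a partition size to an estimated processing cost."""
    best = lo
    while hi - lo > tol:
        mid = (lo + hi) // 2
        if cost(mid) <= budget:
            best, lo = mid, mid      # feasible: try larger partitions
        else:
            hi = mid                 # too costly: shrink the search range
    return best
```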
International Conference on Big Data | 2016
Shan Huang; Botao Wang; Shizhuo Deng; Kaili Zhao; Guoren Wang; Ge Yu
With the development of cloud computing, more and more large scale multi-dimensional data are stored on cloud platforms, and multi-dimensional indexes are an efficient technique for processing such data. Designing a multi-dimensional index that supports efficient concurrent access by multiple users has become a challenging problem. In this paper, we propose a multi-version R-tree based on HBase (HMVR-tree) to support concurrent access. The HMVR-tree maintains the newest version of the tree while keeping all old versions of the nodes, enabling efficient concurrent update and query access to different nodes. The evaluation results show that the HMVR-tree has good scalability and, compared to the original R-tree on HBase, achieves much higher update throughput with query throughput at the same level.
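To illustrate the multi-version idea described above, the sketch below stores each node update as a new version and lets readers pick the newest version at or before their snapshot, so updates and queries on different nodes do not block each other; this storage layout is an assumption for illustration, not the HMVR-tree schema.

```python
from collections import defaultdict

class VersionedNodes:
    """Toy multi-version node store: writes add versions, reads use a snapshot."""
    def __init__(self):
        self.versions = defaultdict(dict)   # node_id -> {version: node_payload}

    def write(self, node_id, version, payload):
        self.versions[node_id][version] = payload

    def read(self, node_id, snapshot):
        usable = [v for v in self.versions[node_id] if v <= snapshot]
        return self.versions[node_id][max(usable)] if usable else None
```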
Archive | 2016
Shan Huang; Botao Wang; Yuemei Chen; Guoren Wang; Ge Yu
With the development of technology and the widespread use of machine learning, more and more models need to be trained to mine useful knowledge from large scale data. Training multiple models accurately and efficiently, so as to make full use of limited computing resources, has become a challenging problem. As one of the ELM variants, online sequential extreme learning machine (OS-ELM) provides a method to learn from incremental data. MapReduce, which provides a simple, scalable and fault-tolerant framework, can be utilized for large scale learning. In this paper, we propose an efficient batch parallel online sequential extreme learning machine (BPOS-ELM) algorithm for training multiple models. BPOS-ELM estimates Map and Reduce execution times from historical statistics, generates an execution plan, and then launches one MapReduce job to train multiple OS-ELM models according to that plan. BPOS-ELM is evaluated with real and synthetic data. Its accuracy is at the same level as those of OS-ELM and POS-ELM, and its speedup reaches 10 on a cluster with up to 32 cores.
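As a companion to the IDW sketch earlier, the snippet below fits a simple linear model of execution time against job features (e.g. data size, number of attributes) from historical runs, in the spirit of the regression-based estimation mentioned above; the feature set and linear form are assumptions for illustration.

```python
import numpy as np

def fit_time_model(history_feats, history_times):
    """Least-squares fit of execution time on job features, with an intercept."""
    A = np.hstack([history_feats, np.ones((history_feats.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, history_times, rcond=None)
    return coef

def predict_time(coef, feats):
    """Predict execution time for a new job described by `feats`."""
    return float(np.append(feats, 1.0) @ coef)
```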
Archive | 2012
Botao Wang; Bin Wang; Junchang Xin; Chao Wang