Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Joshua Zhexue Huang is active.

Publication


Featured research published by Joshua Zhexue Huang.


Information Sciences | 2016

Fuzzy nonlinear regression analysis using a random weight network

Yulin He; Xizhao Wang; Joshua Zhexue Huang

Modeling a fuzzy-in fuzzy-out system, where both inputs and outputs are uncertain, is of practical and theoretical importance. Fuzzy nonlinear regression (FNR) is one of the most widely used approaches to model such systems. In this study, we propose the use of a Random Weight Network (RWN) to develop an FNR model, called FNRRWN, in which both the inputs and outputs are triangular fuzzy numbers. Unlike existing FNR models based on back-propagation (BP) and radial basis function (RBF) networks, FNRRWN does not require iterative adjustment of the network weights and biases. Instead, the input-layer weights and hidden-layer biases of FNRRWN are selected randomly. The output-layer weights of FNRRWN are calculated analytically from a derived updating rule that minimizes the integrated squared error between the α-cut sets of the predicted and target fuzzy outputs. In FNRRWN, the integrated squared error is approximated using Riemann integration. The experimental results show that the proposed FNRRWN method can effectively approximate a fuzzy-in fuzzy-out system, obtaining better prediction accuracy with lower computational time than existing FNR models based on BP and RBF networks.
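The core computational idea of an RWN, random hidden weights plus an analytic least-squares solve for the output layer, can be sketched on crisp data (the paper's fuzzy α-cut machinery and triangular fuzzy numbers are omitted here; the data, network size, and activation choice below are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy crisp regression problem: learn y = sin(x) from noisy samples.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)

n_hidden = 50
# Input-layer weights and hidden biases are drawn randomly and never trained.
W = rng.standard_normal((1, n_hidden))
b = rng.standard_normal(n_hidden)

# Hidden activations; a sigmoid is one common choice for RWNs.
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))

# Output-layer weights are solved analytically via least squares,
# instead of by iterative back-propagation.
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

y_hat = H @ beta
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
```

The absence of any training loop is the point: one matrix solve replaces the iterative weight updates of BP- and RBF-based models.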


Neurocomputing | 2016

Incremental density-based ensemble clustering over evolving data streams

Imran Khan; Joshua Zhexue Huang; Kamen Ivanov

The recent advances in smart meter technology have made it possible to collect information about customer power consumption in real time. The measurements are generated continuously, and in some cases, e.g. industrial smart metering, the data exchange rates fluctuate strongly. Storing, querying, and mining such smart meter data streams, with their large numbers of missing and sparse values, are computationally challenging tasks. To address these issues, we propose a new method called incremental density-based ensemble clustering (IDEStream) for the incremental segmentation of various kinds of factories based on their electricity consumption data. It exploits a gamma mixture model to suppress the influence of sparse data units in the data streams that sequentially arrive within a time window, and then generates a clustering from the processed data of that window. IDEStream uses a unique incremental ensemble approach to aggregate the clusterings of subsequent time windows. Experimental results on data streams collected by smart meters from manufacturing factories in Guangdong province, China, show that the proposed algorithm outperforms several state-of-the-art data stream clustering algorithms. The obtained segmentation has numerous applications, one example being defining customer rates in a flexible way.
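The windowed ensemble idea, cluster each arriving window, then accumulate agreement across windows, can be illustrated with a deliberately simplified sketch (the gap-based window clustering, the 1-D data, and the co-association aggregation below are stand-ins chosen for brevity, not IDEStream's actual density estimator or gamma mixture step):

```python
import numpy as np

def cluster_window(x, threshold=1.0):
    """Toy density-style clustering of 1-D points: sort and cut at large gaps."""
    order = np.argsort(x)
    labels = np.empty(len(x), dtype=int)
    current = 0
    labels[order[0]] = 0
    for prev, cur in zip(order[:-1], order[1:]):
        if x[cur] - x[prev] > threshold:
            current += 1
        labels[cur] = current
    return labels

rng = np.random.default_rng(1)
n = 20
# Two stable consumption groups observed over several time windows.
base = np.concatenate([np.zeros(10), np.full(10, 5.0)])

coassoc = np.zeros((n, n))
n_windows = 5
for _ in range(n_windows):
    window = base + 0.2 * rng.standard_normal(n)   # a stream window arrives
    labels = cluster_window(window)
    # Incremental aggregation: count how often two points share a cluster.
    coassoc += (labels[:, None] == labels[None, :])

coassoc /= n_windows
# Consensus segmentation: points co-clustered in most windows belong together.
consensus = coassoc > 0.5
```

Only the running co-association matrix is kept between windows, which is what makes the aggregation incremental rather than a re-clustering of all past data.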


Journal of Data Science | 2016

Big data analytics on Apache Spark

Salman Salloum; Ruslan Dautov; Xiaojun Chen; Patrick Xiaogang Peng; Joshua Zhexue Huang

Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. As a rapidly evolving open source project, with an increasing number of contributors from both academia and industry, it is difficult for researchers to comprehend the full body of development and research behind Apache Spark, especially those who are beginners in this area. In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics.
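Spark's central abstraction, lazy transformations recorded as a lineage and replayed only when an action is called, can be mimicked in a few lines of plain Python (this toy class is a conceptual sketch, not Spark's actual RDD API; it ignores partitioning, fault tolerance, and distribution entirely):

```python
class ToyRDD:
    """A minimal, lazily evaluated dataset mimicking Spark's RDD idea:
    transformations only record a lineage; actions trigger evaluation."""

    def __init__(self, data, lineage=None):
        self._data = data
        self._lineage = lineage or []

    def map(self, f):        # transformation: recorded, not executed
        return ToyRDD(self._data, self._lineage + [("map", f)])

    def filter(self, f):     # transformation: recorded, not executed
        return ToyRDD(self._data, self._lineage + [("filter", f)])

    def collect(self):       # action: replay the lineage over the data
        out = list(self._data)
        for kind, f in self._lineage:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = rdd.collect()   # → [0, 4, 16, 36, 64]
```

In real Spark the same lineage additionally enables recomputation of lost partitions, which is the basis of its fault tolerance.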


international conference on data mining | 2011

A New Markov Model for Clustering Categorical Sequences

Tengke Xiong; Shengrui Wang; Qingshan Jiang; Joshua Zhexue Huang

Clustering categorical sequences remains an open and challenging task due to the lack of an inherently meaningful measure of pairwise similarity between sequences. Model initialization is an unsolved problem in model-based clustering algorithms for categorical sequences. In this paper, we propose a simple and effective Markov model to approximate the conditional probability distribution (CPD) model, and use it to design a novel two-tier Markov model to represent a sequence cluster. Furthermore, we design a novel divisive hierarchical algorithm for clustering categorical sequences based on the two-tier Markov model. The experimental results on data sets from three different domains demonstrate the promising performance of our models and clustering algorithm.
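The basic mechanics of model-based sequence clustering, estimate a Markov model per cluster and assign new sequences by likelihood, can be sketched as follows (this uses a plain first-order model with Laplace smoothing; the paper's two-tier model and divisive hierarchy are not reproduced, and the two-state example data are invented for illustration):

```python
import numpy as np

def transition_matrix(seqs, n_states, alpha=1.0):
    """Estimate a first-order Markov transition matrix with Laplace smoothing."""
    counts = np.full((n_states, n_states), alpha)
    for s in seqs:
        for a, b in zip(s[:-1], s[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(seq, T):
    """Log-probability of a sequence's transitions under model T."""
    return sum(np.log(T[a, b]) for a, b in zip(seq[:-1], seq[1:]))

# Two hypothetical clusters of categorical sequences over states {0, 1}:
# cluster A alternates states, cluster B repeats them.
cluster_a = [[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0]]
cluster_b = [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]]

Ta = transition_matrix(cluster_a, 2)
Tb = transition_matrix(cluster_b, 2)

# A new alternating sequence is assigned to the model that explains it best.
new_seq = [0, 1, 0, 1, 0]
assigned = "A" if log_likelihood(new_seq, Ta) > log_likelihood(new_seq, Tb) else "B"
```

Comparing likelihoods under per-cluster models sidesteps the need for an explicit pairwise similarity between sequences, which is exactly the difficulty the abstract names.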


pacific-asia conference on knowledge discovery and data mining | 2014

Extensions to Quantile Regression Forests for Very High-Dimensional Data

Nguyen Thanh Tung; Joshua Zhexue Huang; Imran Khan; Mark Junjie Li; Graham J. Williams

This paper describes new extensions to the state-of-the-art regression random forests Quantile Regression Forests (QRF) for applications to high-dimensional data with thousands of features. We propose a new subspace sampling method that randomly samples a subset of features from two separate feature sets, one containing important features and the other containing less important features. The two feature sets partition the features of the input data based on their importance measures. The partition is generated by first using feature permutation to produce raw feature importance scores and then applying p-value assessment to separate important features from the less important ones. The new subspace sampling method enables generating trees from bagged sample data with smaller regression errors. For point regression, we choose the prediction value of Y from the range between the two quantiles Q0.05 and Q0.95 instead of the conditional mean used in regression random forests. Our experimental results show that random forests with these extensions outperformed regression random forests and quantile regression forests in reducing root mean square residuals.
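The two-group subspace sampling step can be sketched as follows (the p-values here are simulated rather than produced by actual feature permutation, and the 0.05 cut-off, `mtry`, and 50/50 mixing fraction are illustrative assumptions, not the paper's tuned values):

```python
import numpy as np

rng = np.random.default_rng(2)

n_features = 1000
# Hypothetical p-values from a permutation-based importance test; in the
# paper these come from permuting each feature and scoring the change.
p_values = rng.uniform(0, 1, n_features)
p_values[:30] = rng.uniform(0, 0.001, 30)   # 30 genuinely important features

important = np.flatnonzero(p_values < 0.05)
less_important = np.flatnonzero(p_values >= 0.05)

def sample_subspace(mtry, frac_important=0.5):
    """Draw a per-node feature subspace from the two groups, biasing the
    draw toward the important set instead of sampling uniformly."""
    k = min(int(mtry * frac_important), len(important))
    return np.concatenate([
        rng.choice(important, size=k, replace=False),
        rng.choice(less_important, size=mtry - k, replace=False),
    ])

subspace = sample_subspace(mtry=40)
```

Guaranteeing that each node sees some important features keeps trees accurate in high dimensions, while the less-important draw preserves the diversity the forest needs.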


International Journal of Machine Learning and Cybernetics | 2017

Query ranking model for search engine query recommendation

JianGuo Wang; Joshua Zhexue Huang; Jiafeng Guo; Yanyan Lan

In this paper, we propose a query ranking model to select and order queries for search engine query recommendations. In contrast to existing similarity-based query recommendation methods (Agglomerative clustering of a search engine query log, 2000; The query-flow graph: model and applications, 2008), this model is based on utility, and ranks a query based on the joint probability of events whereby a query is selected by the user, the search results of the query are selected by the user, and the chosen search results satisfy the user’s information needs. We thus define three utilities in our model: a query-level utility representing the attractiveness of a query to the user, a perceived utility measuring the user’s actions given the search results, and a posterior utility measuring the user’s satisfaction with the chosen search results. We propose methods to compute these three utilities from query log data. In experiments involving real query log data, our proposed query ranking model outperformed seven other baseline methods in generating useful recommendations.
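Ranking by the joint probability of the three events reduces, under independence, to a product of the three utilities; a minimal sketch (the candidate queries and the per-query probability estimates below are invented, and in the paper each utility is estimated from query log data, not hard-coded):

```python
# Hypothetical per-query estimates: the probability the user selects the
# recommended query, clicks one of its results, and is satisfied by it.
candidates = {
    "python tutorial": {"query": 0.6, "perceived": 0.7, "posterior": 0.8},
    "learn python":    {"query": 0.5, "perceived": 0.9, "posterior": 0.9},
    "python download": {"query": 0.8, "perceived": 0.4, "posterior": 0.5},
}

def utility(u):
    # Joint probability of the three events: query selected, result
    # selected, and information need satisfied.
    return u["query"] * u["perceived"] * u["posterior"]

ranked = sorted(candidates, key=lambda q: utility(candidates[q]), reverse=True)
```

Note how the ordering differs from ranking by attractiveness alone: the most attractive query ("python download") ranks last because its results rarely satisfy the user.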


Applied Soft Computing | 2017

Random weight network-based fuzzy nonlinear regression for trapezoidal fuzzy number data

Yulin He; Chenghao Wei; Hao Long; Rana Aamir Raza Ashfaq; Joshua Zhexue Huang

This paper proposes a random weight network (RWN)-based fuzzy nonlinear regression (FNR) model, abbreviated as TraFNR-RWN, to solve the FNR problem in which both inputs and outputs are trapezoidal fuzzy numbers. TraFNR-RWN is a special single-hidden-layer feed-forward neural network that does not require any iterative process to train the network weights. The input-layer weights of TraFNR-RWN are randomly assigned, and its output-layer weights are analytically determined by solving a constrained optimization problem. In addition, a new strategy is used to construct the fuzzy membership function of the predicted fuzzy output based on the derived output-layer weights of TraFNR-RWN. A fuzzification method is developed to fuzzify the crisp numbers of data sets into trapezoidal fuzzy numbers. Twelve fuzzified data sets were used in experiments to compare the performance of TraFNR-RWN with five different FNR models. The experimental results show that TraFNR-RWN obtained better prediction performance with less training time because it does not require time-consuming weight learning and parameter tuning.
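The trapezoidal representation and the fuzzification step can be illustrated with a small sketch (the symmetric spreads below are one plausible fuzzification rule chosen for illustration, not the paper's actual method, and the membership function is the standard trapezoidal one):

```python
def fuzzify(x, spread=0.1, core=0.05):
    """Turn a crisp number x into a trapezoidal fuzzy number (a, b, c, d):
    support [a, d], and full membership (the core) on [b, c].
    The symmetric spreads are an illustrative assumption."""
    return (x - spread, x - core, x + core, x + spread)

def membership(t, x):
    """Membership degree of x in trapezoidal fuzzy number t = (a, b, c, d)."""
    a, b, c, d = t
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    # Linear ramps on either side of the core.
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

t = fuzzify(2.0)   # trapezoid centred on the crisp value 2.0
```

Triangular fuzzy numbers (as in FNRRWN above) are the special case b = c, which is why the trapezoidal model is the more general of the two.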


International Journal of Machine Learning and Cybernetics | 2017

Ensemble subspace clustering of text data using two-level features

He Zhao; Salman Salloum; Joshua Zhexue Huang

This paper proposes a new integrated method for ensemble subspace clustering of high-dimensional sparse text data. Our method employs a two-level feature representation of text data (words and topics) to generate clusters from subspaces. We also use ensemble clustering to increase the robustness of the clusters. The method relies on topic modeling to obtain the two-level feature representation of text data and to generate different ensemble components. By using both topics and words to cluster text data, we obtain more interpretable clusters, as we can measure the weight of words and topics in each cluster. To evaluate the proposed method, we conducted several experiments on seven real-life data sets. While some of these data sets are easy to cluster, others are hard, and some contain unbalanced data. Experimental results on this diverse collection of data sets show that our method outperforms other ensemble clustering methods.


pacific-asia conference on knowledge discovery and data mining | 2015

A New Feature Sampling Method in Random Forests for Predicting High-Dimensional Data

Thanh-Tung Nguyen; He Zhao; Joshua Zhexue Huang; Thuy Thi Nguyen; Mark Junjie Li

Random Forests (RF) models have been proven to perform well in both classification and regression. However, with the randomizing mechanism in both bagging samples and feature selection, the performance of RF can deteriorate when applied to high-dimensional data. In this paper, we propose a new approach to feature sampling for RF to deal with high-dimensional data. We first apply p-value assessment to the feature importances to find a cut-off between informative and less informative features. The set of informative features is then further partitioned into two groups, highly informative and informative features, using some statistical measures. When sampling the feature subspace for learning RFs, features from the three groups are taken into account. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. In addition, quantile regression is employed to obtain predictions in the regression problem for robustness against outliers. The experimental results demonstrate that the proposed approach for learning random forests significantly reduced prediction errors and outperformed most existing random forests when dealing with high-dimensional data.


Pattern Recognition | 2018

TWCC: Automated Two-way Subspace Weighting Partitional Co-Clustering

Xiaojun Chen; Min Yang; Joshua Zhexue Huang; Zhong Ming

A two-way subspace weighting partitional co-clustering method, TWCC, is proposed. In this method, two types of subspace weights are introduced to simultaneously weight the data in two ways: columns on row clusters and rows on column clusters. An objective function that uses the two types of weights in the distance function to determine the co-clusters is defined, and an iterative TWCC co-clustering algorithm to optimize the objective function is proposed, in which the two types of subspace weights are automatically computed. A series of experiments on both synthetic and real-life data were conducted to investigate the properties of TWCC, compare its two-way clustering results with those of eight co-clustering algorithms, and compare its one-way clustering results with those of six clustering algorithms. The results show that TWCC is robust and effective for large high-dimensional data.

Collaboration


Dive into Joshua Zhexue Huang's collaboration network.

Top Co-Authors

He Zhao

Chinese Academy of Sciences

Imran Khan

University of Science and Technology

Min Yang

Chinese Academy of Sciences
