
Huidong Jin

Australian National University

Publication

Featured research published by Huidong Jin.

knowledge discovery and data mining | 2009

A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data

Ke Zhang; Marcus Hutter; Huidong Jin

Detecting outliers that are grossly different from or inconsistent with the remaining dataset is a major challenge in real-world KDD applications. Existing outlier detection methods are ineffective on scattered real-world datasets due to implicit data patterns and parameter setting issues. We define a novel Local Distance-based Outlier Factor (LDOF) to measure the outlier-ness of objects in scattered datasets, which addresses these issues. LDOF uses the relative location of an object to its neighbours to determine the degree to which the object deviates from its neighbourhood. We present theoretical bounds on LDOF's false-detection probability. Experimentally, LDOF compares favorably to classical KNN and LOF based outlier detection. In particular, it is less sensitive to parameter values.
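As a rough illustration of the idea in this abstract, the sketch below scores each point by the ratio of its average distance to its k nearest neighbours over the average pairwise distance among those neighbours. This ratio is one reading of "relative location of an object to its neighbours"; the function name `ldof`, the Euclidean metric, and the brute-force neighbour search are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def ldof(X, k):
    """Illustrative LDOF-style score: kNN distance / kNN inner distance.

    Assumes k >= 2, more than k points, and no k identical neighbours
    (otherwise the inner distance would be zero).
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Brute-force pairwise Euclidean distance matrix (fine for small n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.empty(n)
    for p in range(n):
        order = np.argsort(D[p])
        nbrs = order[order != p][:k]           # k nearest neighbours, excluding p
        d_knn = D[p, nbrs].mean()              # average distance from p to neighbours
        inner = D[np.ix_(nbrs, nbrs)]          # distances among the neighbours
        d_inner = inner.sum() / (k * (k - 1))  # average over ordered pairs, diagonal is 0
        scores[p] = d_knn / d_inner
    return scores

# A 3x3 grid of points plus one far-away point at index 9.
grid = [(x, y) for x in range(3) for y in range(3)]
points = grid + [(10, 10)]
scores = ldof(points, k=3)
```

Scores near or below 1 suggest a point sits inside its neighbourhood, while a score well above 1 suggests it lies outside it; in the toy data the far-away point receives the largest score.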

european conference on machine learning | 2010

A segmented topic model based on the two-parameter Poisson-Dirichlet process

Lan Du; Wray L. Buntine; Huidong Jin

Documents come naturally with structure: a section contains paragraphs, which themselves contain sentences; a blog page contains a sequence of comments and links to related blogs. Structure, of course, implies something about shared topics. In this paper we take the simplest form of structure, a document consisting of multiple segments, as the basis for a new form of topic model. To make this computationally feasible, and to allow the form of collapsed Gibbs sampling that has worked well to date with topic models, we use the marginalized posterior of a two-parameter Poisson-Dirichlet process (or Pitman-Yor process) to handle the hierarchical modelling. Experiments using either paragraphs or sentences as segments show that, on the held-out perplexity measure, the method significantly outperforms both standard topic models applied to whole documents or to segments and previous segmented models.

Applied Intelligence | 2010

PutMode: prediction of uncertain trajectories in moving objects databases

Shaojie Qiao; Changjie Tang; Huidong Jin; Teng Long; Shucheng Dai; Yungchang Ku; Michael Chau

Objective: Prediction of moving objects with uncertain motion patterns is emerging rapidly as a new exciting paradigm and is important for law enforcement applications such as criminal tracking analysis. However, existing algorithms for prediction in spatio-temporal databases focus on discovering frequent trajectory patterns from historical data. Moreover, these methods overlook the effect of some important factors, such as speed and moving direction. This lacks generality, as moving objects may follow dynamic motion patterns in real life. Methods: We propose a framework for predicting uncertain trajectories in moving objects databases. Based on Continuous Time Bayesian Networks (CTBNs), we develop a trajectory prediction algorithm, called PutMode (Prediction of uncertain trajectories in Moving objects databases). It comprises three phases: (i) construction of TCTBNs (Trajectory CTBNs), which obey the Markov property and consist of states combining three important variables: street identifier, speed, and direction; (ii) trajectory clustering for clearing up outlying trajectories; (iii) predicting the motion behaviors of moving objects in order to obtain the possible trajectories based on TCTBNs. Results: Experimental results show that PutMode can predict the possible motion curves of objects in an accurate and efficient manner in distinct trajectory data sets with an average accuracy higher than 80%. Furthermore, we illustrate the crucial role of trajectory clustering, which improves both prediction time and prediction accuracy.

international conference on data mining | 2010

Sequential Latent Dirichlet Allocation: Discover Underlying Topic Structures within a Document

Lan Du; Wray L. Buntine; Huidong Jin

Understanding how topics within a document evolve over its structure is an interesting and important problem. In this paper, we address this problem by presenting a novel variant of Latent Dirichlet Allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, i.e., a document consists of multiple segments (e.g., chapters, paragraphs), each of which is correlated to its previous and subsequent segments. In our model, a document and its segments are modelled as random mixtures of the same set of latent topics, each of which is a distribution over words. The topic distribution of each segment depends on that of its previous segment, and that of the first segment depends on the document topic distribution. The progressive dependency is captured by using the nested two-parameter Poisson-Dirichlet process (PDP). We develop an efficient collapsed Gibbs sampling algorithm to sample from the posterior of the PDP. Our experimental results on patent documents show that by taking into account the sequential structure within a document, our SeqLDA model has higher fidelity than LDA in terms of perplexity (a standard measure of dictionary-based compressibility). The SeqLDA model also yields a nicer sequential topic structure than LDA, as we show in experiments on books such as Melville's The Whale.
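For readers unfamiliar with the perplexity measure mentioned in this abstract, here is a minimal sketch of its usual definition: the exponential of the negative average log-likelihood per held-out token. The function name `perplexity` and the input format (a list of natural-log token probabilities) are assumptions for illustration; how the papers estimate those probabilities for their models is not shown here.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of held-out text."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token")
    # exp of the negative mean log-likelihood per token
    return math.exp(-sum(token_log_probs) / n)

# A model that assigns probability 1/4 to every held-out token
# is as uncertain as a uniform choice among 4 words.
uniform_4 = perplexity([math.log(0.25)] * 10)
```

Lower perplexity means the model assigns higher probability to the held-out text, which is why it is used to compare topic models.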

Knowledge and Information Systems | 2012

Sequential latent Dirichlet allocation

Lan Du; Wray L. Buntine; Huidong Jin; Changyou Chen

Understanding how topics within a document evolve over the structure of the document is an interesting and potentially important problem in exploratory and predictive text analytics. In this article, we address this problem by presenting a novel variant of latent Dirichlet allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, i.e. a document consists of multiple segments (e.g. chapters, paragraphs), each of which is correlated to its antecedent and subsequent segments. Such progressive sequential dependency is captured by using the hierarchical two-parameter Poisson–Dirichlet process (HPDP). We develop an efficient collapsed Gibbs sampling algorithm to sample from the posterior of the SeqLDA based on the HPDP. Our experimental results on patent documents show that by considering the sequential structure within a document, our SeqLDA model has a higher fidelity over LDA in terms of perplexity (a standard measure of dictionary-based compressibility). The SeqLDA model also yields a nicer sequential topic structure than LDA, as we show in experiments on several books such as Melville’s ‘Moby Dick’.

Applied Intelligence | 2010

KISTCM: knowledge discovery system for traditional Chinese medicine

Shaojie Qiao; Changjie Tang; Huidong Jin; Jing Peng; Darren Davis; Nan Han

Objective: Traditional Chinese Medicine (TCM) provides an alternative method for achieving and maintaining good health. Due to the increasing prevalence of TCM and the large volume of TCM data accumulated through thousands of years, there is an urgent need to efficiently and effectively explore this information and its hidden rules with knowledge discovery in database (KDD) techniques. This paper describes the design and development of a knowledge discovery system for TCM as well as the newly proposed KDD techniques integrated in this system. Methods: A novel Knowledge dIscovery System for TCM (KISTCM) is developed by incorporating several data mining techniques, primarily including a medicine dependency relationship discovery algorithm, an efficacy dimension reduction algorithm based on neural networks, a method for exploring the relationships between formulae and syndromes using gene expression programming (GEP), and an approach for discovering the properties in terms of nature, taste and meridian based on the herbal dosage by employing the effect degree function to calculate the effect of each property. Results: Representative experimental cases are used to evaluate the system performance. Encouraging results are obtained, including rules previously unknown to algorithm designers and experiment runners. Experiments demonstrate that KISTCM has powerful knowledge discovery and data analysis capabilities, and is a useful tool for discovering the underlying rules in formulae. Our proposed techniques successfully discover hidden knowledge from TCM data, which is a new direction in knowledge discovery. From TCM experts’ perspective, the accuracy of data analysis for KISTCM is an improvement, and these results compare favorably to other existing TCM data mining techniques.
The system could be expected to be useful in the practice of TCM, e.g., assisting TCM physicians in prescribing formulae or automatically distinguishing between minister and assistant herbs in a formula.

intelligence and security informatics | 2008

Constrained k-closest pairs query processing based on growing window in crime databases

Shaojie Qiao; Changjie Tang; Huidong Jin; Shucheng Dai; Xingshu Chen

Spatial analysis in crime databases has recently been an active research topic. To solve the problem of finding the closest pairs of objects within a given spatial region, as required in crime geo-data applications, this paper proposes an efficient constrained k-closest pairs query processing algorithm based on a growing window. It expands the window gradually instead of searching the whole workspace for multiple types of spatial objects. It employs a density-based range estimation approach to calculate the square query range and an optimized R-tree to store the index entities. In addition, a distance threshold T for the closest pair of objects is introduced to prune tree nodes. Experiments evaluate the effect of three important factors, i.e., the portion of overlap between the workspaces of the two data sets, the value of k, and the size of the buffer. The results show that the new algorithm outperforms the heap-based approach.
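To make the query itself concrete, the sketch below is a brute-force baseline for a constrained k-closest pairs query: keep only the points of two data sets that fall inside a rectangular region, then return the k pairs with the smallest Euclidean distance. It is not the growing-window or R-tree algorithm from the paper; the function name, the rectangular `region` tuple, and the reading of "constrained" as "both points inside the region" are assumptions for illustration.

```python
import heapq
import math
from itertools import product

def constrained_k_closest_pairs(P, Q, region, k):
    """Brute-force baseline: k closest (p, q) pairs with both points in region.

    region is (xmin, ymin, xmax, ymax); points are (x, y) tuples.
    Returns a list of (distance, p, q) sorted by increasing distance.
    """
    xmin, ymin, xmax, ymax = region

    def inside(pt):
        return xmin <= pt[0] <= xmax and ymin <= pt[1] <= ymax

    P_in = [p for p in P if inside(p)]
    Q_in = [q for q in Q if inside(q)]
    # Enumerate all cross pairs lazily; nsmallest keeps only the best k.
    pairs = ((math.dist(p, q), p, q) for p, q in product(P_in, Q_in))
    return heapq.nsmallest(k, pairs)

P = [(0, 0), (5, 5), (100, 100)]
Q = [(1, 0), (5, 6), (100, 101)]
result = constrained_k_closest_pairs(P, Q, region=(0, 0, 10, 10), k=2)
```

This baseline examines every pair inside the region, so its cost grows with the product of the two filtered set sizes; avoiding that exhaustive scan is the motivation for index-based methods such as the one the abstract describes.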

intelligence and security informatics | 2008

A latent semantic indexing and WordNet based information retrieval model for digital forensics

Lan Du; Huidong Jin; O. de Vel; Nianjun Liu

It is well known that domain-specific or domain-independent knowledge has been adopted in information retrieval (IR) to improve retrieval performance. In this paper, we propose a novel IR model for digital forensics by using latent semantic indexing (LSI) and WordNet as an underlying reference ontology to retrieve suspicious emails according to the semantic meaning of an investigator's query. Our model incorporates corpus-independent knowledge from WordNet and corpus-dependent knowledge from LSI into query expansion and reduction; LSI is also adopted to simulate human meaning-based judgement of relatedness between an investigator's queries and emails. We compare the performance of the resulting LSI And WordNet based Information retrieval system (LAWIRS) with three other systems we implement, i.e. the LSI system, the Lucene system and the Lucene system with query expansion. Experimental results on several email datasets demonstrate that for short Boolean queries, LAWIRS can successfully capture their meaning and yield substantial improvements in the overall retrieval performance.
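The LSI component mentioned in this abstract can be illustrated with a minimal sketch of textbook latent semantic indexing: take a truncated SVD of a term-document matrix, fold the query into the reduced space, and rank documents by cosine similarity. The WordNet-based query expansion and reduction are not shown; the function name `lsi_similarities`, the toy vocabulary, and the raw-count weighting are assumptions for illustration, not the LAWIRS implementation.

```python
import numpy as np

def lsi_similarities(term_doc, query_vec, rank):
    """Cosine similarity between a query and each document in LSI space.

    term_doc: terms x documents count matrix; query_vec: term counts.
    Assumes the query and every document have a non-zero projection.
    """
    A = np.asarray(term_doc, dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_r, s_r, V_r = U[:, :rank], s[:rank], Vt[:rank].T  # rows of V_r are documents
    # Fold the query into the reduced space: q_hat = q^T U_r S_r^{-1}
    q_hat = (np.asarray(query_vec, dtype=float) @ U_r) / s_r
    num = V_r @ q_hat
    den = np.linalg.norm(V_r, axis=1) * np.linalg.norm(q_hat)
    return num / den

# Rows: car, automobile, engine, flower, petal
# Columns: d0 = "car engine", d1 = "automobile engine", d2 = "flower petal"
A = [[1, 0, 0],
     [0, 1, 0],
     [1, 1, 0],
     [0, 0, 1],
     [0, 0, 1]]
sims = lsi_similarities(A, query_vec=[1, 0, 0, 0, 0], rank=2)  # query: "car"
```

In this toy corpus the query "car" scores highly against d1 even though d1 never contains that word, because both documents share "engine"; this is the kind of meaning-based matching the abstract attributes to LSI.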

australasian joint conference on artificial intelligence | 2010

An Effective Pattern Based Outlier Detection Approach for Mixed Attribute Data

Ke Zhang; Huidong Jin

Detecting outliers in mixed attribute datasets is one of the major challenges in real-world applications. Existing outlier detection methods lack effectiveness for mixed attribute datasets, mainly due to their inability to consider interactions among different types of attributes, e.g., numerical and categorical attributes. To address this issue in mixed attribute datasets, we propose a novel Pattern based Outlier Detection approach (POD). A pattern in this paper is defined to describe the majority of the data as well as to capture interactions among different types of attributes. In POD, the more an object deviates from these patterns, the higher its outlier factor. We use logistic regression to learn patterns and then formulate the outlier factor in mixed attribute datasets. A series of experimental results illustrate that POD performs statistically significantly better than several classic outlier detection methods.

australasian joint conference on artificial intelligence | 2008

Knowledge Discovery from Honeypot Data for Monitoring Malicious Attacks

Huidong Jin; Olivier Y. de Vel; Ke Zhang; Nianjun Liu

Owing to the spread of worms and botnets, cyber attacks have significantly increased in volume, coordination and sophistication. For example, cheap rentable botnet services have resulted in sophisticated botnets becoming an effective and popular tool for committing online crime these days. Honeypots, as information system traps, monitor or deflect malicious attacks on the Internet. To understand the attack patterns generated by botnets through analysis of the data collected by honeypots, we propose an approach that integrates a clustering structure visualisation technique with outlier detection techniques. These techniques complement each other and provide end users both a big-picture view and actionable knowledge of high-dimensional data. We introduce KNOF (K-nearest Neighbours Outlier Factor) as the outlier definition technique to reach a trade-off between global and local outlier definitions, i.e., Kth-Nearest Neighbour (KNN) and Local Outlier Factor (LOF) respectively. We propose an algorithm to discover the most significant KNOF outliers. We implement these techniques in our hpdAnalyzer tool. The tool is successfully used to comprehend honeypot data. A series of experiments show that our proposed KNOF technique substantially outperforms LOF and, to a lesser degree, KNN for real-world honeypot data.
