Gillian Dobbie
University of Auckland
Publication
Featured research published by Gillian Dobbie.
Swarm and evolutionary computation | 2014
Shafiq Alam; Gillian Dobbie; Yun Sing Koh; Patricia Riddle; Saeed Ur Rehman
Optimization-based pattern discovery has emerged as an important field in knowledge discovery and data mining (KDD), and has been used to enhance the efficiency and accuracy of clustering, classification, association rules and outlier detection. Cluster analysis, which identifies groups of similar data items in large datasets, is one of its recent beneficiaries. The increasing complexity and large amounts of data in the datasets have seen data clustering emerge as a popular focus for the application of optimization-based techniques. Different optimization techniques have been applied to investigate the optimal solution for clustering problems. Swarm intelligence (SI) is one such optimization technique whose algorithms have successfully been demonstrated as solutions for different data clustering domains. In this paper we investigate the growth of literature in SI and its algorithms, particularly Particle Swarm Optimization (PSO). This paper makes two major contributions. Firstly, it provides a thorough literature overview focusing on some of the most cited techniques that have been used for PSO-based data clustering. Secondly, we analyze the reported results and highlight the performance of different techniques against contemporary clustering techniques. We also provide a brief overview of our PSO-based hierarchical clustering approach (HPSO-clustering) and compare the results with traditional hierarchical agglomerative clustering (HAC), K-means, and PSO clustering.
conference on information and knowledge management | 2001
Ying Guang Li; Stéphane Bressan; Gillian Dobbie; Zoé Lacroix; Mong Li Lee; Ullas Nambiar; Bimlesh Wadhwa
If XML is to play the critical role of the lingua franca for Internet data interchange that many predict, it is necessary to start designing and adopting benchmarks allowing the comparative performance analysis of the tools being developed and proposed. The effectiveness of existing XML query languages has been studied by many, with a focus on the comparison of linguistic features, implicitly reflecting the fact that most XML tools exist only on paper. In this paper, with a focus on efficiency and concreteness, we propose a pragmatic first step toward the systematic benchmarking of XML query processing platforms with an initial focus on the data (versus document) point of view. We propose XOO7, an XML version of the OO7 benchmark. We discuss the applicability of XOO7, its strengths, limitations and the extensions we are considering. We illustrate its use by presenting and discussing the performance comparison against XOO7 of three different query processing platforms for XML.
web information and data management | 2003
Jacky Wan; Gillian Dobbie
Data mining is generally considered the extraction and analysis of information from databases. With the rapid growth of XML data available online, mining XML data from the web is becoming important. In support of this trend, several encouraging attempts at developing methods for mining XML data have been proposed. However, efficiency and simplicity are still barriers to further development. In this paper, we show that any XML document can be mined for association rules using only the query language XQuery without any pre-processing or post-processing.
ieee swarm intelligence symposium | 2008
Shafiq Alam; Gillian Dobbie; Patricia Riddle
Clustering is an important data mining task and has been explored extensively by a number of researchers for different application areas such as finding similarities in images, text data and bio-informatics data. Various optimization techniques have been proposed to improve the performance of clustering algorithms. In this paper we propose a novel clustering algorithm, which we call the evolutionary particle swarm optimization (EPSO)-clustering algorithm, based on PSO. The proposed algorithm is based on the evolution of swarm generations, where the particles are initially uniformly distributed in the input data space and, after a specified number of iterations, a new generation of the swarm evolves. The swarm tries to dynamically adjust itself after each generation to optimal positions. The paper describes the new algorithm and its initial implementation, and presents tests performed on real clustering benchmark data. The proposed method is compared with k-means clustering, a benchmark clustering technique, and a simple particle swarm clustering algorithm. The results show that the algorithm is efficient and produces compact clusters.
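For orientation, the textbook PSO update that PSO-based clustering methods build on can be sketched as follows. This is the canonical velocity and position rule only, not the EPSO-clustering algorithm itself, whose generation mechanism is not specified in the abstract. The function name `pso_step`, the coefficient names `w`, `c1`, `c2`, and the choice to pass the random factors `r1` and `r2` explicitly (so the step is reproducible) are assumptions made for illustration. In a clustering setting, a particle's position is commonly read as one or more candidate centroids.

```python
from typing import List, Tuple

Vector = List[float]


def pso_step(
    position: Vector,
    velocity: Vector,
    personal_best: Vector,
    global_best: Vector,
    r1: float,
    r2: float,
    w: float = 0.7,   # inertia weight (illustrative default)
    c1: float = 1.5,  # cognitive coefficient (illustrative default)
    c2: float = 1.5,  # social coefficient (illustrative default)
) -> Tuple[Vector, Vector]:
    """Apply one canonical PSO update to a single particle.

    r1 and r2 are the random factors, normally drawn uniformly from [0, 1];
    they are passed in here so the caller controls the randomness.
    """
    new_velocity = [
        w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
        for x, v, pb, gb in zip(position, velocity, personal_best, global_best)
    ]
    # The particle moves by its updated velocity.
    new_position = [x + nv for x, nv in zip(position, new_velocity)]
    return new_position, new_velocity
```

In practice `r1` and `r2` would be drawn afresh on each call, for example with `random.random()`, and the personal and global bests would be refreshed from a clustering fitness measure after every step.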
data warehousing and olap | 2010
M. Asif Naeem; Gillian Dobbie; Gerald Weber; Shafiq Alam
To fulfill the increasing demand of business for the latest information, current data integration approaches are moving towards real-time updates. One important element in real-time data integration is the join of a continuous incoming data stream with a disk-based relation. In this paper we investigate a stream-based join algorithm, called mesh join (MESHJOIN), and propose an improved version called reduced MESHJOIN (R-MESHJOIN). Both algorithms tune the memory, allocating parts of the memory to key components. In MESHJOIN there is a dependency between the size of partitions in an internal queue for the stream data and the number of iterations required to bring the disk-based relation into memory. This dependency hampers the optimal distribution of memory among the join components. In particular, the size of the disk-buffer varies with the size of the disk-based relation, which is unnecessary. The R-MESHJOIN algorithm, on the other hand, removes this dependency. This enables an optimal distribution of available memory among the join components. In R-MESHJOIN a change in the size of the disk-based relation does not affect the size of the disk-buffer. An experimental study is conducted in order to validate the arguments.
Information Sciences | 2013
Russel Pears; Yun Sing Koh; Gillian Dobbie; Wai K. Yeap
Association rule mining is an important data mining task that discovers relationships among items in a transaction database. Classical association rule mining approaches make the implicit assumption that an item's importance is determined by its support. In contrast, Weighted Association Rule Mining (WARM) attempts to provide a notion of importance, or weight, for individual items that is not based solely on item support. Previous approaches to Weighted Association Rule Mining assign item weights in a subjective manner, based on a user's specialized knowledge of the underlying domain. Such approaches are infeasible when millions of items are present in a dataset, or when domain knowledge is unavailable. Furthermore, even when such domain information is available, a weight assignment based on subjective information constrains the knowledge discovered to fit with the weights assigned, thus inhibiting the discovery of new trends in the data. In this research we automate the process of weight assignment by formulating a linear model that captures relationships between items. This approach extends prior research based on the Valency model. We extend the Valency model by expanding the field of interaction beyond immediate neighborhoods and show that this leads to significant improvements in performance on a number of different metrics that we use.
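To make the notion of support concrete, the classical measures that unweighted association rule mining relies on can be sketched as below. This illustrates only the standard definitions of support and confidence; it is not the Valency model or the weighting scheme proposed in the paper. The function names and the sample transactions are hypothetical.

```python
from typing import FrozenSet, Iterable, Sequence


def support(itemset: Iterable[str], transactions: Sequence[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(
    antecedent: Iterable[str],
    consequent: Iterable[str],
    transactions: Sequence[FrozenSet[str]],
) -> float:
    """Support of (antecedent and consequent together) divided by support of antecedent."""
    lhs = frozenset(antecedent)
    base = support(lhs, transactions)
    if base == 0.0:
        return 0.0  # the rule never fires, so report zero rather than divide by zero
    return support(lhs | frozenset(consequent), transactions) / base


# Hypothetical market-basket data for illustration only.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk", "eggs"}),
]
```

With this sample, `support({"bread"}, transactions)` is 0.75, so a classical miner would rank bread as more important than eggs purely because it occurs more often, which is the assumption that weighted approaches relax.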
web intelligence | 2010
Shafiq Alam; Gillian Dobbie; Patricia Riddle; M. Asif Naeem
Clustering, an important data mining task that groups data on the basis of similarities among the data, can be divided into two broad categories: partitional clustering and hierarchical clustering. We combine these two methods and propose a novel clustering algorithm called Hierarchical Particle Swarm Optimization (HPSO) data clustering. The proposed algorithm exploits the swarm intelligence of cooperating agents in a decentralized environment. The experimental results were compared with benchmark clustering techniques, which include K-means, PSO clustering, Hierarchical Agglomerative clustering (HAC) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The results are evidence of the effectiveness of swarm-based clustering and of its capability to perform clustering in a hierarchical agglomerative manner.
data warehousing and knowledge discovery | 2011
Sidney Tsang; Yun Sing Koh; Gillian Dobbie
Most association rule mining techniques concentrate on finding frequent rules. However, rare association rules are in some cases more interesting than frequent association rules since rare rules represent unexpected or unknown associations. All current algorithms for rare association rule mining use an Apriori level-wise approach which has computationally expensive candidate generation and pruning steps. We propose RP-Tree, a method for mining a subset of rare association rules using a tree structure, and an information gain component that helps to identify the more interesting association rules. Empirical evaluation using a range of real world datasets shows that RP-Tree itemset and rule generation is more time efficient than modified versions of FP-Growth and ARIMA, and discovers 92-100% of all the interesting rare association rules.
international acm sigir conference on research and development in information retrieval | 2014
Wei Zhou; Yun Sing Koh; Junhao Wen; Shafiq Alam; Gillian Dobbie
Recommender systems using Collaborative Filtering techniques are capable of making personalized predictions. However, these systems are highly vulnerable to profile injection attacks. Group attacks are attacks that target a group of items instead of one, where the targeted items share common attributes. Such profiles will have a good probability of being similar to a large number of user profiles, making them hard to detect. We propose a novel technique for identifying group attack profiles which uses an improved metric based on Degree of Similarity with Top Neighbors (DegSim) and Rating Deviation from Mean Agreement (RDMA). We also extend our work with a detailed analysis of target item rating patterns. Experiments show that the combined methods can improve detection rates in user-based recommender systems.
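As a rough illustration of one of the two base measures named above, RDMA is commonly defined in the attack-detection literature as the average, over the items a user has rated, of the absolute deviation from the item's mean rating divided by the number of ratings that item has received. The sketch below follows that common definition; the improved metric proposed in the paper is not described in the abstract and is not reproduced here. The function name `rdma`, the nested-dictionary layout, and the sample ratings are assumptions.

```python
from typing import Dict

Ratings = Dict[str, Dict[str, float]]  # user -> item -> rating (assumed layout)


def rdma(user: str, ratings: Ratings) -> float:
    """Rating Deviation from Mean Agreement for one user (common definition)."""
    item_sum: Dict[str, float] = {}
    item_count: Dict[str, int] = {}
    # Accumulate per-item rating totals and counts across all users.
    for profile in ratings.values():
        for item, value in profile.items():
            item_sum[item] = item_sum.get(item, 0.0) + value
            item_count[item] = item_count.get(item, 0) + 1

    profile = ratings[user]
    if not profile:
        return 0.0
    # Deviation from each item's mean, down-weighted by how often the item is rated.
    total = sum(
        abs(value - item_sum[item] / item_count[item]) / item_count[item]
        for item, value in profile.items()
    )
    return total / len(profile)


# Hypothetical ratings for illustration only.
ratings = {
    "a": {"x": 5.0, "y": 1.0},
    "b": {"x": 3.0},
    "c": {"x": 1.0, "y": 1.0},
}
```

Under this definition a profile that disagrees strongly with the consensus on sparsely rated items receives a high score, which is why the measure is used as a signal for injected profiles.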
british national conference on databases | 2011
Muhammad Asif Naeem; Gillian Dobbie; Gerald Weber
In order to make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up-to-date with respect to end user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm Mesh Join (MESHJOIN) has been proposed to amortize disk access over a fast stream. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, e.g., certain products are sold more frequently than the remainder of the products. The question arises of how much MESHJOIN loses in terms of performance by not adapting to data skew. In this paper we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against the performance of MESHJOIN. We take several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be used by non-adaptive approaches such as MESHJOIN.