Raymond T. Ng
University of British Columbia
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Raymond T. Ng.
international conference on management of data | 2000
Markus M. Breunig; Hans-Peter Kriegel; Raymond T. Ng; Jörg Sander
For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers, can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms we show that our approach of finding local outliers can be practical.
IEEE Transactions on Knowledge and Data Engineering | 2002
Raymond T. Ng; Jiawei Han
Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, it proposes a new clustering method called CLARANS, whose aim is to identify spatial structures that may be present in the data. Experimental results indicate that, when compared with existing clustering methods, CLARANS is very efficient and effective. Second, the paper investigates how CLARANS can handle not only point objects, but also polygon objects efficiently. One of the methods considered, called the IR-approximation, is very efficient in clustering convex and nonconvex polygon objects. Third, building on top of CLARANS, the paper develops two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes. Both algorithms can discover knowledge that is difficult to find with existing spatial data mining algorithms.
international conference on management of data | 1998
Raymond T. Ng; Laks V. S. Lakshmanan; Jiawei Han; Alex T. Pang
From the standpoint of supporting human-centered discovery of knowledge, the present-day model of mining association rules suffers from the following serious shortcomings: (i) lack of user exploration and control, (ii) lack of focus, and (iii) rigid notion of relationships. In effect, this model functions as a black-box, admitting little user interaction in between. We propose, in this paper, an architecture that opens up the black-box, and supports constraint-based, human-centered exploratory mining of associations. The foundation of this architecture is a rich set of constraint constructs, including domain, class, and SQL-style aggregate constraints, which enable users to clearly specify what associations are to be mined. We propose constrained association queries as a means of specifying the constraints to be satisfied by the antecedent and consequent of a mined association. In this paper, we mainly focus on the technical challenges in guaranteeing a level of performance that is commensurate with the selectivities of the constraints in an association query. To this end, we introduce and analyze two properties of constraints that are critical to pruning: anti-monotonicity and succinctness. We then develop characterizations of various constraints into four categories, according to these properties. Finally, we describe a mining algorithm called CAP, which achieves a maximized degree of pruning for all categories of constraints. Experimental results indicate that CAP can run much faster, in some cases as much as 80 times, than several basic algorithms. This demonstrates how important the succinctness and anti-monotonicity properties are, in delivering the performance guarantee.
very large data bases | 2000
Edwin M. Knorr; Raymond T. Ng; Vladimir Tucakov
Abstract. This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB (distance-based) outliers. Specifically, we show that (i) outlier detection can be done efficiently for large datasets, and for k-dimensional datasets with large values of k (e.g.,
American Journal of Human Genetics | 2007
Kendy K. Wong; Ronald J. deLeeuw; Nirpjit S. Dosanjh; Lindsey R. Kimm; Ze Cheng; Douglas E. Horsman; Calum MacAulay; Raymond T. Ng; Carolyn J. Brown; Evan E. Eichler; Wan L. Lam
k \ge 5
IEEE Transactions on Software Engineering | 2004
Annie T.T. Ying; Gail C. Murphy; Raymond T. Ng; Mark C. Chu-Carroll
); and (ii), outlier detection is a meaningful and important knowledge discovery task.First, we present two simple algorithms, both having a complexity of
Information & Computation | 1992
Raymond T. Ng; V. S. Subrahmanian
O(k \: N^2)
international conference on management of data | 2004
Yuhan Cai; Raymond T. Ng
, k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we present an optimized cell-based algorithm that has a complexity that is linear with respect to N, but exponential with respect to k. We provide experimental results indicating that this algorithm significantly outperforms the two simple algorithms for
Journal of Clinical Investigation | 2011
Yann Malato; Syed Naqvi; Nina Schürmann; Raymond T. Ng; Bruce Wang; Joan Zape; Mark A. Kay; Dirk Grimm; Holger Willenbring
k \leq 4
international conference on management of data | 1999
Laks V. S. Lakshmanan; Raymond T. Ng; Jiawei Han; Alex T. Pang
. Third, for datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most three passes over a dataset. Again, experimental results show that this algorithm is by far the best for