Baoying Wang
North Dakota State University
Publication
Featured research published by Baoying Wang.
international conference on data mining | 2004
Dongmei Ren; Baoying Wang; William Perrizo
Outlier detection can lead to the discovery of unexpected and interesting knowledge, which is critically important in areas such as monitoring criminal activity in electronic commerce and detecting credit card fraud. In this paper, we develop an efficient density-based outlier detection method for large datasets. Our contributions are: a) we introduce a relative density factor (RDF); b) based on RDF, we propose an RDF-based outlier detection method that efficiently prunes the data points lying deep inside clusters and detects outliers only within the remaining small subset of the data; c) the performance of our method is further improved by means of a vertical data representation, P-trees. We tested our method with NHL and NBA data. Our method shows an order of magnitude speed improvement compared to contemporary approaches.
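The abstract does not define RDF, so the following is only a rough sketch of the general idea of scoring a point by the density of its neighbors relative to its own density. The function names, the k-nearest-neighbor density estimate, and the threshold are all assumptions made for illustration; the brute-force search here uses neither the paper's pruning step nor P-trees.

```python
import math

def knn_density(points, k):
    """For each point, return (density, neighbor_indices).

    Density is the inverse of the mean distance to the k nearest
    neighbors (an assumed estimate, not the paper's definition).
    """
    info = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        nearest = dists[:k]
        mean_d = sum(d for d, _ in nearest) / len(nearest)
        # Small epsilon avoids division by zero for duplicate points.
        info.append((1.0 / (mean_d + 1e-12), [j for _, j in nearest]))
    return info

def relative_density_scores(points, k=3):
    """Score = mean density of a point's neighbors / the point's own density.

    Scores near 1 suggest the point sits inside a cluster; large scores
    suggest the point is much sparser than its surroundings.
    """
    info = knn_density(points, k)
    scores = []
    for density, neighbors in info:
        neighbor_mean = sum(info[j][0] for j in neighbors) / len(neighbors)
        scores.append(neighbor_mean / density)
    return scores

def find_outliers(points, k=3, threshold=2.0):
    """Return indices whose relative density score exceeds the threshold."""
    scores = relative_density_scores(points, k)
    return [i for i, s in enumerate(scores) if s > threshold]
```

For example, calling `find_outliers` on five points packed into the unit square plus one point at (10, 10) flags only the distant point. A pruning variant would discard points whose score is close to 1 before any finer analysis, which is the step the abstract credits for its speedup.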
international conference on data mining | 2007
Baoying Wang; Imad Rahal
Market-basket data analysis is an important problem that has been well addressed in the literature, especially in the context of finding associations among items in large groups of transactions. Recently, there have been many attempts at clustering market-basket data. However, most of those market-basket clustering methods belong to partitional clustering, which requires at least one input parameter (e.g., the minimum intra-cluster similarity or the desired number of clusters). In this paper, we propose WC-clustering, a hierarchical clustering approach using vertical data structures. In order to minimize the impact of low support items, we devise a weighted confidence (WC) affinity function to calculate the similarity between clusters (or itemsets). Our experimental results show that WC-clustering produces much more compact results than Apriori and that the proposed weighted confidence affinity measure is more accurate than other contemporary affinity measures in the literature.
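As a rough illustration of what a vertical layout for market-basket data can look like, the sketch below stores one bitmap per item and derives support and confidence from bitwise ANDs. The abstract does not give the weighted confidence formula, so `mean_confidence_affinity` is a hypothetical stand-in (the plain average of the two directional confidences), not the WC function; all names here are assumptions.

```python
def build_vertical(transactions):
    """Map each item to an int bitmap; bit t is set when
    transaction t contains the item."""
    bitmaps = {}
    for t, basket in enumerate(transactions):
        for item in basket:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << t)
    return bitmaps

def support(bitmaps, itemset):
    """Number of transactions containing every item of a non-empty
    itemset, computed by ANDing the item bitmaps."""
    items = list(itemset)
    acc = bitmaps.get(items[0], 0)
    for item in items[1:]:
        acc &= bitmaps.get(item, 0)
    return bin(acc).count("1")

def confidence(bitmaps, antecedent, consequent):
    """Standard rule confidence: support(A and C) / support(A)."""
    base = support(bitmaps, antecedent)
    if base == 0:
        return 0.0
    return support(bitmaps, set(antecedent) | set(consequent)) / base

def mean_confidence_affinity(bitmaps, a, b):
    """Hypothetical symmetric affinity: average of both directional
    confidences. Illustrative only; not the paper's WC function."""
    return (confidence(bitmaps, a, b) + confidence(bitmaps, b, a)) / 2
```

With this layout, the cost of a support count depends on the number of items in the itemset rather than on scanning each transaction record, which is the usual motivation for vertical structures.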
Archive | 2014
Baoying Wang; Ruowang Li; William Perrizo
As technology evolves and electronic data becomes more complex, digital medical record management and analysis become a challenge. In order to discover patterns and make relevant predictions based on large data sets, researchers and medical professionals must find new methods to analyze and extract relevant health information. Big Data Analytics in Bioinformatics and Healthcare merges the fields of biology, technology, and medicine in order to present a comprehensive study on the emerging information processing applications necessary in the field of electronic medical record management. Complete with interdisciplinary research resources, this publication is an essential reference source for researchers, practitioners, and students interested in the fields of biological computation, database management, and health information technology, with a special focus on the methodologies and tools to manage massive and complex electronic information.
international conference on data mining | 2008
Baoying Wang; Qin Ding; Imad Rahal
Data clustering has proven to be a promising data mining technique. Recently, there have been many attempts at clustering market-basket data. In this paper, we propose a parallelized hierarchical clustering approach on market-basket data (PH-Clustering), which is implemented using MPI. Based on an analysis of the major clustering steps, we adopt a partially local, partially global approach to decrease computation time while keeping communication time to a minimum. Load balancing is considered throughout, especially at the data partitioning stage. Our experimental results demonstrate that PH-Clustering speeds up sequential clustering by a large margin. The larger the data size, the more significant the speedup when the number of processors is large. Our results also show that the number of items has more impact on the performance of PH-Clustering than the number of transactions.
european conference on principles of data mining and knowledge discovery | 2003
Fei Pan; Baoying Wang; Yi Zhang; Dongmei Ren; Xin Hu; William Perrizo
Data mining for spatial data has become increasingly important as more and more organizations are exposed to spatial data from sources such as remote sensing, geographical information systems, astronomy, computer cartography, environmental assessment and planning, etc. Recently, density-based clustering methods, such as DENCLUE, DBSCAN, and OPTICS, have been published and recognized as powerful clustering methods for data mining. These approaches have a run time complexity of \(O(n \log n)\) when using spatial index techniques such as R+-trees and grid cells. However, these methods are known to lack scalability with respect to dimensionality. In this paper, a unique approach to efficient neighborhood search and a new efficient density-based clustering algorithm using EIN-rings are developed. Our approach exploits compressed vertical data structures, Peano Trees (P-trees), and fast P-tree logical operations to accelerate the calculation of the density function within EIN-rings. This approach stands in contrast to the ubiquitous approach of vertically scanning horizontal data structures (records). The average run time complexity of our algorithm for d-dimensional spatial data is \(O(dn\sqrt{n})\). Our proposed method has cardinality scalability comparable to other density methods for small and medium-sized data sets, but superior speed and dimensional scalability.
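To make the contrast between vertical structures and record scanning more concrete, here is a minimal sketch of an uncompressed bit-sliced column, where a selection is answered with logical operations on whole bitmaps. It is an assumption-laden simplification: P-trees as described above are compressed, tree-structured versions of such bit vectors, and neither the compression nor the EIN-ring neighborhood search is shown. All function names are hypothetical.

```python
def bit_slices(values, n_bits):
    """Split a column of non-negative ints into n_bits vertical slices.

    slices[b] is an int bitmap whose bit r is set when bit b of
    values[r] is 1.
    """
    slices = [0] * n_bits
    for r, v in enumerate(values):
        for b in range(n_bits):
            if (v >> b) & 1:
                slices[b] |= 1 << r
    return slices

def rows_equal(slices, n_rows, target):
    """Return a bitmap of rows whose value equals target, using only
    AND and NOT on the slices (no per-record comparison)."""
    if target >> len(slices):
        return 0  # target does not fit in the stored bit width
    all_rows = (1 << n_rows) - 1
    acc = all_rows
    for b, s in enumerate(slices):
        # Keep rows whose bit b matches bit b of the target.
        acc &= s if (target >> b) & 1 else (all_rows & ~s)
    return acc

def count_rows(mask):
    """Count the rows selected by a bitmap."""
    return bin(mask).count("1")
```

For the column `[5, 3, 5, 7, 0]` stored in 3 bits, `rows_equal` with target 5 selects rows 0 and 2. Counting the points that fall in a region then reduces to a handful of bitmap operations per dimension, which is the kind of work a density computation repeats many times.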
International Journal of Data Mining, Modelling and Management | 2011
Baoying Wang; Imad Rahal; Aijuan Dong
There have been many attempts at clustering categorical data such as market basket datasets. However, most categorical clustering approaches belong to partitional clustering, which requires at least one input parameter (e.g., the minimum intra-cluster similarity or the desired number of clusters). In this paper, we propose a parallelised hierarchical clustering approach for categorical data (PH-clustering) using vertical data structures. In order to minimise the impact of low support items, we devise a weighted confidence (WC) affinity function to compute the similarity between clusters. Based on our analysis of the major clustering steps, we adopt a partially local, partially global approach to reduce computation time while keeping network communication to a minimum. Load balance issues are addressed, especially during the data partitioning phase. Our experimental results on standardised market basket data show that the proposed weighted confidence affinity measure is more accurate than other contemporary affinity measures in the literature and that our parallel clustering approach provides orders of magnitude of time improvement over sequential clustering, especially for larger data sizes. Our results also indicate that the number of items/attributes in the dataset has a more drastic impact on performance than the number of transactions/tuples.
International Journal of Bioinformatics Research and Applications | 2008
Imad Rahal; Riad M. Rahhal; Baoying Wang; William Perrizo
This paper analyses annotated genome data by applying a very central data-mining technique known as Association Rule Mining (ARM) with the aim of discovering rules and hypotheses capable of yielding deeper insights into this type of data. In the literature, ARM has been noted for producing an overwhelming number of rules. This work proposes a new technique capable of using domain knowledge in the form of queries in order to efficiently mine only the subset of the associations that are of interest to investigators in an incremental and interactive manner.
acm symposium on applied computing | 2005
Baoying Wang; William Perrizo
Data clustering methods have proven to be a successful data mining technique in the analysis of gene expression data and many other types of data. However, some concerns and challenges still remain, e.g., in gene expression clustering. In this paper, we propose an efficient clustering method using attractor trees. The combination of the density-based approach and the similarity-based approach accommodates clusters with diverse shapes, densities, and sizes. Experiments on gene expression datasets demonstrate that our approach is efficient and scalable with competitive accuracy.
Journal of Information & Knowledge Management | 2009
Imad Rahal; Baoying Wang; Riad M. Rahhal
In this day and age, conducting a biological experiment is presumably a very expensive procedure largely owing to the highly sophisticated and expensive equipment necessitated by the process. Conceivably, being capable of isolating and focusing on a smaller set of imperative genes or gene products that are of high relevance to the experiment, pathway, or biological system under investigation is very desirable largely owing to the potential savings in experimental costs. In this work, we propose an intelligent information system capable of generating a ranked list of genes and gene products that are most pertinent to a given biological pathway, experiment or system (referred to as a biological context henceforth). We assume that the biological context of interest can be described by various textual query terms and phrases from the biological domain which, in turn, relate to various molecular functions, biological processes and cellular components of genes and their products. Intelligent text-based analyses and mining are utilised for this purpose by using the published literature, in the form of publication abstracts downloaded from PubMed, with the intention of ranking genes and gene products having identified relationships to the specified description terms based on the gene ontology (GO) standard. At this stage, our approach is capable of producing promising results given all surrounding restrictions, one of which is the lack of similar work in the literature. For demonstration purposes, we report experimental results on the molting regulation pathway in Drosophila melanogaster (fruit fly).
Journal of Biomedical Informatics | 2004
Fei Pan; Baoying Wang; Xin Hu; William Perrizo