Byron J. Gao
Texas State University
Publications
Featured research published by Byron J. Gao.
International Conference on Management of Data | 2009
AnHai Doan; Jeffrey F. Naughton; Raghu Ramakrishnan; Akanksha Baid; Xiaoyong Chai; Fei Chen; Ting Chen; Eric Chu; Pedro DeRose; Byron J. Gao; Chaitanya Gokhale; Jiansheng Huang; Warren Shen; Ba-Quy Vuong
Over the past few years, we have been trying to build an end-to-end system at Wisconsin to manage unstructured data, using extraction, integration, and user interaction. This paper describes the key information extraction (IE) challenges that we have run into and sketches our solutions. We discuss in particular developing a declarative IE language, optimizing for this language, generating IE provenance, incorporating user feedback into the IE process, developing a novel wiki-based user interface for feedback, supporting best-effort IE, pushing IE into RDBMSs, and more. Our work suggests that IE in managing unstructured data can open up many interesting research challenges, and that these challenges can greatly benefit from the wealth of work on managing structured data that has been carried out by the database community.
IEEE Transactions on Knowledge and Data Engineering | 2015
Peng Zhang; Chuan Zhou; Peng Wang; Byron J. Gao; Xingquan Zhu; Li Guo
Ensemble learning is a common tool for data stream classification, mainly because of its inherent advantages of handling large volumes of stream data and concept drifting. Previous studies, to date, have been primarily focused on building accurate ensemble models from stream data. However, a linear scan of a large number of base classifiers in the ensemble during prediction incurs significant costs in response time, preventing ensemble learning from being practical for many real-world time-critical data stream applications, such as Web traffic stream monitoring, spam detection, and intrusion detection. In these applications, data streams usually arrive at a speed of GB/second, and it is necessary to classify each stream record in a timely manner. To address this problem, we propose a novel Ensemble-tree (E-tree for short) indexing structure to organize all base classifiers in an ensemble for fast prediction. On one hand, E-trees treat ensembles as spatial databases and employ an R-tree like height-balanced structure to reduce the expected prediction time from linear to sub-linear complexity. On the other hand, E-trees can be automatically updated by continuously integrating new classifiers and discarding outdated ones, well adapting to new trends and patterns underneath data streams. Theoretical analysis and empirical studies on both synthetic and real-world data streams demonstrate the performance of our approach.
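The core idea, touching only the base classifiers relevant to a query instead of linearly scanning the whole ensemble, can be illustrated with a minimal sketch. This is not the paper's E-tree: the `Stump` and `IntervalIndex` classes below are hypothetical stand-ins that index one-dimensional classifiers by the feature interval on which they are active, so a prediction consults only candidate classifiers found via binary search.

```python
import bisect

class Stump:
    """A base classifier active only on an interval of one feature."""
    def __init__(self, lo, hi, threshold, weight=1.0):
        self.lo, self.hi = lo, hi      # interval of x where this stump votes
        self.threshold = threshold     # vote +weight if x >= threshold, else -weight
        self.weight = weight

    def covers(self, x):
        return self.lo <= x <= self.hi

    def predict(self, x):
        return self.weight if x >= self.threshold else -self.weight

class IntervalIndex:
    """Index stumps by the left endpoint of their interval so a query
    inspects only candidate stumps rather than the whole ensemble."""
    def __init__(self, stumps):
        self.stumps = sorted(stumps, key=lambda s: s.lo)
        self.los = [s.lo for s in self.stumps]

    def predict(self, x):
        # only stumps with lo <= x can possibly cover x; among that prefix,
        # keep the ones whose interval actually contains x
        end = bisect.bisect_right(self.los, x)
        score = sum(s.predict(x) for s in self.stumps[:end] if s.covers(x))
        return 1 if score >= 0 else -1

ensemble = IntervalIndex([Stump(0, 5, 2), Stump(4, 10, 7), Stump(8, 12, 9)])
```

A query at x = 3 evaluates only the first stump; the real E-tree generalizes this pruning to multidimensional regions with an R-tree-like, height-balanced structure and supports inserting and retiring classifiers online.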
ACM Transactions on Knowledge Discovery From Data | 2008
Rong Ge; Martin Ester; Byron J. Gao; Zengjian Hu; Binay K. Bhattacharya; Boaz Ben-Moshe
Attribute data and relationship data are two principal types of data, representing the intrinsic and extrinsic properties of entities. While attribute data have been the main source of data for cluster analysis, relationship data such as social networks or metabolic networks are becoming increasingly available. It is also common to observe that both data types carry complementary information, as in market segmentation and community identification, which calls for a joint cluster analysis of both data types so as to achieve better results. In this article, we introduce the novel Connected k-Center (CkC) problem, a clustering model that takes into account attribute data as well as relationship data. We analyze the complexity of the problem and prove its NP-hardness. Given this hardness, we analyze the approximability of the problem and present a constant-factor approximation algorithm. For the special case of the CkC problem where the relationship data form a tree structure, we propose a dynamic programming method that gives an optimal solution in polynomial time. We further present NetScan, a heuristic algorithm that is efficient and effective for large real databases. Our extensive experimental evaluation on real datasets demonstrates the meaningfulness and accuracy of the NetScan results.
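For the attribute-data side of k-center clustering, the classic farthest-point heuristic of Gonzalez gives a 2-approximation and makes a useful point of reference; the sketch below shows that baseline only, not the CkC algorithm or NetScan, which additionally enforce connectivity over the relationship data.

```python
import math

def k_center_greedy(points, k):
    """Gonzalez's farthest-point heuristic: repeatedly pick the point
    farthest from its nearest chosen center. A classic 2-approximation
    for the plain (unconnected) k-center problem on attribute data."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers
```

CkC can be viewed as this problem with an extra constraint: each cluster must induce a connected subgraph of the relationship data, which is what makes the joint problem NP-hard in general.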
Knowledge Discovery and Data Mining | 2011
Peng Zhang; Jun Li; Peng Wang; Byron J. Gao; Xingquan Zhu; Li Guo
Ensemble learning has become a common tool for data stream classification, being able to handle large volumes of stream data and concept drifting. Previous studies focus on building accurate prediction models from stream data. However, a linear scan of a large number of base classifiers in the ensemble during prediction incurs significant costs in response time, preventing ensemble learning from being practical for many real-world time-critical data stream applications, such as Web traffic stream monitoring, spam detection, and intrusion detection. In these applications, data streams usually arrive at a speed of GB/second, and it is necessary to classify each stream record in a timely manner. To address this problem, we propose a novel Ensemble-tree (E-tree for short) indexing structure to organize all base classifiers in an ensemble for fast prediction. On one hand, E-trees treat ensembles as spatial databases and employ an R-tree like height-balanced structure to reduce the expected prediction time from linear to sub-linear complexity. On the other hand, E-trees can automatically update themselves by continuously integrating new classifiers and discarding outdated ones, well adapting to new trends and patterns underneath data streams. Experiments on both synthetic and real-world data streams demonstrate the performance of our approach.
International Conference on Data Mining | 2011
Peng Zhang; Byron J. Gao; Xingquan Zhu; Li Guo
Lazy learning, such as k-nearest neighbor learning, has been widely applied to many applications. Known for well capturing data locality, lazy learning can be advantageous for highly dynamic and complex learning environments such as data streams. Yet its high memory consumption and low prediction efficiency have made it less favorable for stream-oriented applications. Specifically, traditional lazy learning stores all the training data and the inductive process is deferred until a query appears, whereas in stream applications, data records flow continuously in large volumes and the prediction of class labels needs to be made in a timely manner. In this paper, we provide a systematic solution that overcomes the memory and efficiency limitations and enables fast lazy learning for concept drifting data streams. In particular, we propose a novel Lazy-tree (L-tree for short) indexing structure that dynamically maintains compact high-level summaries of historical stream records. L-trees are M-tree-like [5], height-balanced, and can greatly reduce memory consumption while achieving sub-linear time complexity for prediction. Moreover, L-trees continuously absorb new stream records and discard outdated ones, so they can naturally adapt to the dynamically changing concepts in data streams for accurate prediction. Extensive experiments on real-world and synthetic data streams demonstrate the performance of our approach.
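The memory-saving idea, answering k-NN queries against compact summaries of historical records rather than the raw stream, can be sketched in miniature. This is not the L-tree: the grid-cell summarization below is a hypothetical, much cruder stand-in that simply collapses nearby records into a centroid with a majority label.

```python
import math
from collections import defaultdict

def summarize(stream, grid=1.0):
    """Collapse historical (x, y, label) records into per-cell summaries
    (centroid + majority label) -- a crude stand-in for the compact
    high-level summaries an L-tree maintains."""
    cells = defaultdict(list)
    for x, y, label in stream:
        cells[(round(x / grid), round(y / grid))].append((x, y, label))
    summaries = []
    for pts in cells.values():
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        labels = [p[2] for p in pts]
        summaries.append((cx, cy, max(set(labels), key=labels.count)))
    return summaries

def knn_predict(summaries, query, k=3):
    """A query scans only the summaries, never the raw stream records."""
    nearest = sorted(summaries,
                     key=lambda s: math.dist((s[0], s[1]), query))[:k]
    labels = [s[2] for s in nearest]
    return max(set(labels), key=labels.count)
```

The L-tree additionally keeps the summaries in a height-balanced index (for sub-linear search instead of the sort above) and ages out stale entries as concepts drift.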
Knowledge Discovery and Data Mining | 2006
Byron J. Gao; Obi L. Griffith; Martin Ester; Steven J.M. Jones
Order-preserving submatrices (OPSMs) have been accepted as a biologically meaningful subspace cluster model, capturing the general tendency of gene expression across a subset of conditions. In an OPSM, the expression levels of all genes induce the same linear ordering of the conditions. OPSM mining is reducible to a special case of the sequential pattern mining problem, in which a pattern and its supporting sequences uniquely specify an OPSM cluster. Those small twig clusters, specified by long patterns with naturally low support, incur explosive computational costs and would be completely pruned off by most existing methods for massive datasets containing thousands of conditions and hundreds of thousands of genes, which are common in today's gene expression analysis. However, it is of particular interest to biologists to reveal such small groups of genes that are tightly coregulated under many conditions, and some pathways or processes might require only two genes to act in concert. In this paper, we introduce the KiWi mining framework for massive datasets, which exploits two parameters k and w to perform biased testing on a bounded number of candidates, substantially reducing the search space and problem scale while targeting highly promising seeds that lead to significant clusters and twig clusters. Extensive biological and computational evaluations on real datasets demonstrate that KiWi can effectively mine biologically meaningful OPSM subspace clusters with good efficiency and scalability.
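The defining OPSM condition, that every selected gene induces the same linear ordering of the selected conditions, is easy to state in code. The membership check below is only an illustration of the cluster model, not the KiWi mining algorithm, and `matrix[r][c]` is assumed to hold the expression level of gene r under condition c.

```python
def is_opsm(matrix, rows, cols):
    """Check whether the selected rows (genes) induce the same linear
    ordering of the selected columns (conditions), i.e. form an OPSM."""
    orderings = set()
    for r in rows:
        # sort the chosen conditions by this gene's expression level
        order = tuple(sorted(cols, key=lambda c: matrix[r][c]))
        orderings.add(order)
    return len(orderings) == 1
```

Mining is the hard direction: finding row/column subsets satisfying this check is what reduces to sequential pattern mining, and long patterns with few supporting rows (the twig clusters) are exactly the expensive cases KiWi targets.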
ACM Transactions on Intelligent Systems and Technology | 2014
Shuaiqiang Wang; Jiankai Sun; Byron J. Gao; Jun Ma
Collaborative filtering (CF) is an effective technique addressing the information overload problem. CF approaches generally fall into two categories: rating-based and ranking-based. The former makes recommendations based on historical rating scores of items and the latter based on their rankings. Ranking-based CF has demonstrated advantages in recommendation accuracy, being able to capture the preference similarity between users even if their rating scores differ significantly. In this study, we propose VSRank, a novel framework that seeks accuracy improvement of ranking-based CF through adaptation of the vector space model. In VSRank, we consider each user as a document and his or her pairwise relative preferences as terms. We then use a novel degree-specialty weighting scheme resembling TF-IDF to weight the terms. Extensive experiments on benchmarks in comparison with the state-of-the-art approaches demonstrate the promise of our approach.
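The users-as-documents analogy can be sketched concretely: extract pairwise preference "terms" from each user's ratings, weight them IDF-style so rare preferences count more, and compare users by weighted cosine similarity. This is a simplified stand-in for VSRank's degree-specialty scheme, whose exact weighting is not reproduced here.

```python
import math
from itertools import combinations

def preference_terms(ratings):
    """Turn a user's {item: score} ratings into pairwise-preference
    'terms': (i, j) means the user prefers item i over item j."""
    terms = set()
    for i, j in combinations(sorted(ratings), 2):
        if ratings[i] > ratings[j]:
            terms.add((i, j))
        elif ratings[j] > ratings[i]:
            terms.add((j, i))
    return terms

def idf_weighted_similarity(user_a, user_b, all_users):
    """Cosine similarity over preference terms with an IDF-like weight,
    a rough analogue of weighting rare (specialized) preferences higher."""
    docs = [preference_terms(u) for u in all_users]
    n = len(docs)
    def idf(t):
        df = sum(1 for d in docs if t in d)
        return math.log((n + 1) / (df + 1)) + 1
    ta, tb = preference_terms(user_a), preference_terms(user_b)
    num = sum(idf(t) ** 2 for t in ta & tb)
    den = (math.sqrt(sum(idf(t) ** 2 for t in ta))
           * math.sqrt(sum(idf(t) ** 2 for t in tb)))
    return num / den if den else 0.0
```

Note how two users with very different rating scores but the same relative ordering of items come out maximally similar, which is exactly the advantage of ranking-based CF the abstract describes.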
IEEE Transactions on Knowledge and Data Engineering | 2012
Byron J. Gao; Obi L. Griffith; Martin Ester; Hui Xiong; Qiang Zhao; Steven J.M. Jones
Order-preserving submatrix (OPSM) has been widely accepted as a biologically meaningful cluster model, capturing the general tendency of gene expression across a subset of experiments. In an OPSM, the expression levels of all genes induce the same linear ordering of the experiments. The OPSM problem is to discover those statistically significant OPSMs from a given data matrix. The problem is reducible to a special case of the sequential pattern mining problem, where a pattern and its supporting sequences uniquely specify an OPSM. Unfortunately, existing methods do not scale well to massive data sets containing thousands of experiments and hundreds of thousands of genes, which are common in today's gene expression analysis. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs in their discovery and are completely pruned off by existing methods. However, it is of particular interest to biologists to determine small groups of genes that are tightly coregulated across many experiments, and some pathways or processes may require as few as two genes to act in concert. In this paper, we study the discovery of deep OPSMs from massive data sets. We propose Kiwi, a novel best-effort mining framework that exploits two parameters k and w to bound the available computational resources, explore a selected search space, and find as many deep OPSMs as possible. Extensive biological and computational evaluations on real data sets demonstrate the validity and importance of the deep OPSM problem, and the efficiency and effectiveness of the Kiwi mining framework.
Neurocomputing | 2012
Peng Zhang; Byron J. Gao; Ping Liu; Yong Shi; Li Guo
Data stream classification has drawn increasing attention from the data mining community in recent years. Relevant applications include network traffic monitoring, sensor network data analysis, Web click stream mining, power consumption measurement, and dynamic tracing of stock fluctuations, to name a few. Data stream classification in such real-world applications is typically subject to three major challenges: concept drifting, large volumes, and partial labeling. As a result, training examples in data streams can be very diverse and it is very hard to learn accurate models efficiently. In this paper, we propose a novel framework that first categorizes diverse training examples into four types and assigns learning priorities to them. Then, we derive four learning cases based on the proportion and priority of the different types of training examples. Finally, for each learning case, we employ one of four SVM-based training models: classical SVM, semi-supervised SVM, transfer semi-supervised SVM, and relational k-means transfer semi-supervised SVM. We perform comprehensive experiments on real-world data streams that validate the utility of our approach.
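The first step, sorting incoming examples into types before choosing a training model, can be illustrated with a hypothetical categorization. The two axes used below (labeled vs. unlabeled, and whether the current model still explains the example) are an assumption for illustration only; the paper's actual four types and priorities are not reproduced here.

```python
def categorize(x, label, predict, confidence, tau=0.7):
    """Hypothetical four-way split of a stream example:
    - labeled examples: consistent with the current model, or drifting;
    - unlabeled examples: confidently predicted, or uncertain.
    `predict` and `confidence` stand in for the current model."""
    if label is not None:
        return ("labeled-consistent" if predict(x) == label
                else "labeled-drifting")
    return ("unlabeled-confident" if confidence(x) >= tau
            else "unlabeled-uncertain")
```

A dispatcher could then, for instance, route streams dominated by labeled-consistent examples to a classical SVM and streams rich in unlabeled examples to a semi-supervised variant, mirroring the four learning cases at a very coarse level.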
International Conference on Data Mining | 2006
Byron J. Gao; Martin Ester
The ultimate goal of data mining is to extract knowledge from massive data. Knowledge is ideally represented as human-comprehensible patterns from which end-users can gain intuitions and insights. Yet not all data mining methods produce such readily understandable knowledge; e.g., most clustering algorithms output sets of points as clusters. In this paper, we perform a systematic study of cluster description that generates interpretable patterns from clusters. We introduce and analyze novel description formats with greater expressive power, and motivate and define novel description problems that specify different trade-offs between interpretability and accuracy. We also present effective heuristic algorithms together with their empirical evaluations.
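The simplest interpretable description of a cluster is a bounding hyper-rectangle, and the interpretability/accuracy trade-off shows up immediately: a tight, readable box may also cover points from other clusters. The sketch below illustrates that baseline only; it is not one of the paper's richer description formats or heuristics.

```python
def describe_cluster(points):
    """Describe a cluster of d-dimensional points by its bounding
    hyper-rectangle: one (min, max) interval per dimension."""
    dims = len(points[0])
    return [(min(p[d] for p in points), max(p[d] for p in points))
            for d in range(dims)]

def accuracy(description, cluster, others):
    """Fraction of the points covered by the description that truly
    belong to the cluster -- the accuracy side of the trade-off."""
    def covered(p):
        return all(lo <= p[d] <= hi
                   for d, (lo, hi) in enumerate(description))
    inside = [p for p in cluster + others if covered(p)]
    return (sum(1 for p in inside if p in cluster) / len(inside)
            if inside else 1.0)
```

Richer formats (e.g., unions of boxes) raise accuracy at the cost of longer, less interpretable descriptions, which is precisely the spectrum of trade-offs the paper formalizes.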