Xiaoying Gao

Victoria University of Wellington

Publication

Featured research published by Xiaoying Gao.

web intelligence | 2005

Improving Web Clustering by Cluster Selection

Daniel Crabtree; Xiaoying Gao; Peter Andreae

Web page clustering is a technology that puts semantically related Web pages into groups and is useful for categorizing, organizing, and refining search results. When clustering using only textual information, suffix tree clustering (STC) outperforms other clustering algorithms by making use of phrases and allowing clusters to overlap. One problem with STC and similar algorithms is how to select a small set of clusters to display to the user from a very large set of generated clusters. The cluster selection method used in STC is flawed in that it does not handle overlapping clusters appropriately. This paper introduces a new cluster scoring function and a new cluster selection algorithm to overcome the problems with overlapping clusters. These are combined with STC to form a new clustering algorithm, ESTC. The paper's experiments show that ESTC significantly outperforms STC and that, even with less data, ESTC performs similarly to a commercial clustering search engine.

systems man and cybernetics | 2007

A New Crossover Operator in Genetic Programming for Object Classification

Mengjie Zhang; Xiaoying Gao; Weijun Lou

The crossover operator has been considered "the centre of the storm" in genetic programming (GP). However, many existing GP approaches to object recognition suggest that the standard GP crossover is not sufficiently powerful in producing good child programs because the crossover points are chosen entirely at random. To deal with this problem, this paper introduces an approach with a new crossover operator in GP for object recognition, particularly object classification. In this approach, a local hill-climbing search is used in constructing good building blocks, a weight called looseness is introduced to identify the good building blocks in individual programs, and the looseness values are used as heuristics in choosing appropriate crossover points to preserve good building blocks. This approach is examined and compared with the standard crossover operator and the headless chicken crossover (HCC) method on a sequence of object classification problems. The results suggest that this approach outperforms the HCC, the standard crossover, and the standard crossover operator with hill climbing on all of these problems in terms of classification accuracy. Although this approach takes slightly longer than the standard crossover operator, it significantly improves the system efficiency over the HCC method.

knowledge discovery and data mining | 2007

Exploiting underrepresented query aspects for automatic query expansion

Daniel Crabtree; Peter Andreae; Xiaoying Gao

Users attempt to express their search goals through web search queries. When a search goal has multiple components or aspects, documents that represent all the aspects are likely to be more relevant than those that represent only some aspects. Current web search engines often produce result sets whose top-ranking documents represent only a subset of the query aspects. By expanding the query using the right keywords, the search engine can find documents that represent more query aspects, and performance improves. This paper describes AbraQ, an approach for automatically finding the right keywords to expand the query. AbraQ identifies the aspects in the query, identifies which aspects are underrepresented in the result set of the original query, and finally, for any particularly underrepresented aspect, identifies keywords that would enhance that aspect's representation and automatically expands the query using the best one. The paper presents experiments that show AbraQ significantly increases the precision of hard queries, whereas traditional automatic query expansion techniques have not improved precision. AbraQ also compared favourably against a range of interactive query expansion techniques that require user involvement, including clustering, web-log analysis, relevance feedback, and pseudo relevance feedback.
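The notion of an underrepresented aspect can be illustrated with a minimal sketch. It assumes that each aspect is a single keyword and that representation is the fraction of top results mentioning it; the abstract does not say how AbraQ actually measures representation, so the function names and the `threshold` parameter below are hypothetical.

```python
from typing import Dict, List


def aspect_representation(aspects: List[str], results: List[str]) -> Dict[str, float]:
    """Fraction of result snippets that mention each aspect (case-insensitive).

    Assumption: substring matching stands in for whatever aspect
    detection a real system would use.
    """
    if not results:
        return {aspect: 0.0 for aspect in aspects}
    lowered = [text.lower() for text in results]
    return {
        aspect: sum(aspect.lower() in text for text in lowered) / len(lowered)
        for aspect in aspects
    }


def underrepresented(aspects: List[str], results: List[str],
                     threshold: float = 0.5) -> List[str]:
    """Aspects whose share of the result set falls below a hypothetical threshold."""
    rep = aspect_representation(aspects, results)
    return [aspect for aspect in aspects if rep[aspect] < threshold]


if __name__ == "__main__":
    snippets = [
        "Jaguar car review",
        "jaguar car dealers",
        "Jaguar habitat in the rainforest",
        "jaguar car prices",
    ]
    print(underrepresented(["jaguar", "habitat"], snippets))
```

An expansion step would then look for keywords that raise the share of the flagged aspect; that step is not sketched here because the abstract gives no detail on how keywords are chosen.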

web intelligence | 2006

Query Directed Web Page Clustering

Daniel Crabtree; Peter Andreae; Xiaoying Gao

Web page clustering methods categorize and organize search results into semantically meaningful clusters that assist users with search refinement, but finding clusters that are semantically meaningful to users is difficult. In this paper, we describe a new Web page clustering algorithm, QDC, which uses the user's query as part of a reliable measure of cluster quality. The new algorithm has five key innovations: a new query directed cluster quality guide that uses the relationship between clusters and the query, an improved cluster merging method that generates semantically coherent clusters by using cluster description similarity in addition to cluster overlap, a new cluster splitting method that fixes the cluster chaining or cluster drifting problem, an improved heuristic for cluster selection that uses the query directed cluster quality guide, and a new method of improving clusters by ranking the pages by relevance to the cluster. We evaluate QDC by comparing its clustering performance against that of four other algorithms on eight data sets (four use full text data and four use snippet data) using eleven different external evaluation measurements. We also evaluate QDC by informally analysing its real world usability and performance through comparison with six other algorithms on four data sets. QDC provides a substantial performance improvement over other Web page clustering algorithms.

web intelligence | 2003

Learning information extraction patterns from tabular Web pages without manual labelling

Xiaoying Gao; Mengjie Zhang; Peter Andreae

We describe a domain independent approach to automatically constructing information extraction patterns for semistructured Web pages. The approach was tested on three corpora containing a series of tabular Web sites from different domains and achieved a success rate of at least 80%. A significant strength of the system is that it can infer extraction patterns from a single training page and does not require any manual labeling of the training page.

australasian joint conference on artificial intelligence | 2005

Combining contents and citations for scientific document classification

Minh Duc Cao; Xiaoying Gao

This paper introduces a classification system that exploits content information as well as citation structure for scientific paper classification. The system first applies a content-based statistical classification method similar to general text classification. We investigate several classification methods, including K-nearest neighbours, nearest centroid, naive Bayes and decision trees. Among these methods, K-nearest neighbours is found to outperform the others, while the rest perform comparably. Using phrases in addition to words, together with a good feature selection strategy such as information gain, can improve system accuracy and reduce training time in comparison with using words only. To exploit citation links for classification, the system uses an iterative method that updates the labellings of classified instances using citation links. Our results show that combining contents and citations significantly improves system performance.
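The iterative use of citation links can be sketched as follows. This is a minimal illustration and not the paper's method: it assumes each document already has per-label scores from a content classifier, and that a label's combined score is its content score plus a weighted share of cited or citing neighbours currently carrying that label. The function name, the `weight` parameter, and the update rule are all assumptions.

```python
from typing import Dict, List


def relabel_with_citations(
    content_scores: Dict[str, Dict[str, float]],
    links: Dict[str, List[str]],
    weight: float = 0.5,
    max_iters: int = 10,
) -> Dict[str, str]:
    """Iteratively revise labels using neighbours' current labels.

    content_scores: document -> {label: score from a content-only classifier}
    links: document -> list of linked documents (citations in either direction)
    """
    # Start from the content-only prediction.
    labels = {doc: max(scores, key=scores.get) for doc, scores in content_scores.items()}

    for _ in range(max_iters):
        updated = {}
        for doc, scores in content_scores.items():
            neighbours = [n for n in links.get(doc, []) if n in labels]
            combined = {}
            for label, score in scores.items():
                votes = sum(labels[n] == label for n in neighbours)
                share = votes / len(neighbours) if neighbours else 0.0
                combined[label] = score + weight * share
            updated[doc] = max(combined, key=combined.get)
        if updated == labels:  # converged: no label changed in this pass
            break
        labels = updated
    return labels


if __name__ == "__main__":
    scores = {
        "a": {"ml": 0.9, "db": 0.1},
        "b": {"ml": 0.8, "db": 0.2},
        "c": {"ml": 0.45, "db": 0.55},  # borderline on content alone
    }
    citations = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
    print(relabel_with_citations(scores, citations))
```

In this toy input, document `c` is labelled `db` on content alone but is linked only to `ml` documents, so the citation term moves it to `ml`.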

web intelligence | 2005

Standardized Evaluation Method for Web Clustering Results

Daniel Crabtree; Xiaoying Gao; Peter Andreae

Web clustering assists users of a search engine by presenting search results as clusters of related pages. Many clustering algorithms with different characteristics have been developed, but the lack of a standardized Web clustering evaluation method that can evaluate clusterings with different characteristics has prevented effective comparison of algorithms. The paper solves this by introducing a new structure for defining general ideal clusterings and new measurements for evaluating clusterings with different characteristics by comparing them against the general ideal clustering.

knowledge discovery and data mining | 2007

QC4: a clustering evaluation method

Daniel Crabtree; Peter Andreae; Xiaoying Gao

Many clustering algorithms have been developed and researchers need to be able to compare their effectiveness. For some clustering problems, like web page clustering, different algorithms produce clusterings with different characteristics: coarse vs fine granularity, disjoint vs overlapping, flat vs hierarchical. The lack of a clustering evaluation method that can evaluate clusterings with different characteristics has led to incomparable research and results. QC4 solves this by providing a new structure for defining general ideal clusterings and new measurements for evaluating clusterings with different characteristics with respect to a general ideal clustering. The paper describes QC4 and evaluates it within the web clustering domain by comparison to existing evaluation measurements on synthetic test cases and on real world web page clustering tasks. The synthetic test cases show that only QC4 can cope correctly with overlapping clusters, hierarchical clusterings, and all the difficult boundary cases. In the real world tasks, which represent simple clustering situations, QC4 is mostly consistent with the existing measurements and makes better conclusions in some cases.

congress on evolutionary computation | 2014

Multi-view clustering of web documents using multi-objective genetic algorithm

Abdul Wahid; Xiaoying Gao; Peter Andreae

Clustering ensembles are a common approach to clustering problems: they combine a collection of clusterings into a superior solution. The key issues are how to generate different candidate solutions and how to combine them. The common approach to generating candidate clustering solutions ignores the multiple representations of the data (i.e., multiple views), and the standard approach of simply selecting the best solution from the candidates ignores the fact that a set of clusters drawn from different candidate solutions may form a better clustering solution. This paper presents a new clustering method that exploits multiple views to generate different clustering solutions and then selects a combination of clusters to form a final clustering solution. Our method is based on the Nondominated Sorting Genetic Algorithm (NSGA-II), a multi-objective optimization approach. Our new method is compared with five existing algorithms on three data sets of increasing difficulty. The results show that our method significantly outperforms the other methods.
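The core idea behind nondominated sorting can be shown with a short sketch that extracts the first Pareto front from a set of candidates scored on several objectives. It covers only the dominance test, not the full NSGA-II procedure (later fronts, crowding distance, selection and variation are omitted), and it assumes all objectives are to be maximised. The objective values in the example are made up for illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives maximised)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_front(points: List[Sequence[float]]) -> List[int]:
    """Indices of candidates that no other candidate dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


if __name__ == "__main__":
    # Hypothetical (objective 1, objective 2) scores for five candidate clusterings.
    candidates = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.9, 0.1)]
    print(non_dominated_front(candidates))
```

Candidates on the front represent different trade-offs between the objectives; a multi-objective method keeps all of them rather than collapsing the objectives into one score.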

web intelligence | 2006

Data Extraction from Semi-structured Web Pages by Clustering

Le Phong Bao Vuong; Xiaoying Gao; Mengjie Zhang

This paper introduces an approach to the use of clustering for data extraction from semi-structured Web pages. A variant of the hierarchical agglomerative clustering (HAC) algorithm, K-neighbours-HAC, is developed, which uses the similarities of the data format (HTML tags) and the data content (text string values) to group similar text tokens into clusters. Using these clusters, similar text tokens are identified as data fields and extracted as target information. The approach is examined and compared with a number of existing information extraction systems on two different sets of Web pages. The results suggest that the new approach is effective for Web information extraction and that it outperforms all of the existing approaches on these Web sites.
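Grouping text tokens by both format and content can be sketched with plain single-link agglomerative clustering. This is not the K-neighbours-HAC variant, whose details the abstract does not give. The similarity measure (Jaccard overlap of tag names plus a character-shape match), the equal weighting, and the stopping threshold are all illustrative assumptions.

```python
import re
from itertools import combinations
from typing import List, Tuple

Token = Tuple[str, str]  # (tag path such as "table/tr/td/b", text value)


def shape(text: str) -> str:
    """Collapse letter runs to 'a' and digit runs to '9', e.g. '$12.99' -> '$9.9'."""
    return re.sub(r"\d+", "9", re.sub(r"[A-Za-z]+", "a", text))


def similarity(t1: Token, t2: Token, tag_weight: float = 0.5) -> float:
    """Weighted mix of format similarity (tags) and content similarity (shape)."""
    tags1, tags2 = set(t1[0].split("/")), set(t2[0].split("/"))
    tag_sim = len(tags1 & tags2) / len(tags1 | tags2)
    content_sim = 1.0 if shape(t1[1]) == shape(t2[1]) else 0.0
    return tag_weight * tag_sim + (1 - tag_weight) * content_sim


def single_link_hac(tokens: List[Token], threshold: float = 0.75) -> List[List[int]]:
    """Merge the closest pair of clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(tokens))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a, b in combinations(range(len(clusters)), 2):
            # Single link: cluster similarity is that of the closest member pair.
            link = max(
                similarity(tokens[i], tokens[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if link > best:
                best, pair = link, (a, b)
        if best < threshold:
            break
        a, b = pair  # a < b, so deleting index b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    page_tokens = [
        ("table/tr/td/b", "Blue Widget"),
        ("table/tr/td/i", "$12.99"),
        ("table/tr/td/b", "Red Gadget"),
        ("table/tr/td/i", "$5.00"),
    ]
    print(single_link_hac(page_tokens))
```

On this toy page the product names and the prices end up in separate clusters, each of which could then be treated as one data field.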