Dell Zhang
Birkbeck, University of London
Publication
Featured research published by Dell Zhang.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2003
Dell Zhang; Wee Sun Lee
Question classification is very important for question answering. This paper presents our research work on automatic question classification through machine learning approaches. We have experimented with five machine learning algorithms: Nearest Neighbors (NN), Naive Bayes (NB), Decision Tree (DT), Sparse Network of Winnows (SNoW), and Support Vector Machines (SVM), using two kinds of features: bag-of-words and bag-of-ngrams. The experimental results show that with only surface text features the SVM outperforms the other four methods for this task. Further, we propose to use a special kernel function called the tree kernel to enable the SVM to take advantage of the syntactic structures of questions. We describe how the tree kernel can be computed efficiently by dynamic programming. The performance of our approach is promising when tested on the questions from the TREC QA track.
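As a rough illustration of the dynamic-programming idea, the Python sketch below computes a common-subtree kernel in the style of Collins and Duffy over parse trees written as nested tuples. The tuple encoding, the function names, and the decay parameter `lam` are assumptions made for this example; the abstract does not specify the exact kernel used in the paper, which may differ.

```python
from functools import lru_cache

# Hypothetical encoding: a tree node is (label, child, child, ...);
# a leaf word is a plain string.

def production(node):
    """Return the grammar production at a node, e.g. S -> NP VP."""
    children = tuple(("word", c) if isinstance(c, str) else c[0] for c in node[1:])
    return (node[0], children)

def nodes(tree):
    """Yield every non-leaf node of the tree."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from nodes(child)

def tree_kernel(t1, t2, lam=1.0):
    """Weighted count of tree fragments shared by t1 and t2."""
    @lru_cache(maxsize=None)
    def common(n1, n2):
        # Fragments rooted at n1 and n2 can only match if the productions match.
        if production(n1) != production(n2):
            return 0.0
        score = lam
        for c1, c2 in zip(n1[1:], n2[1:]):
            if not isinstance(c1, str):
                # Memoised recursion: each node pair is evaluated once.
                score *= 1.0 + common(c1, c2)
        return score

    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))

q1 = ("S", ("NP", ("WP", "what")), ("VP", ("VBZ", "is")))
q2 = ("S", ("NP", ("WP", "who")), ("VP", ("VBZ", "is")))
print(tree_kernel(q1, q1), tree_kernel(q1, q2))
```

Memoising `common` over node pairs is what keeps the cost proportional to the product of the two tree sizes instead of enumerating the fragments explicitly.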
Asia-Pacific Web Conference | 2004
Dell Zhang; Yisheng Dong
We propose a Semantic, Hierarchical, Online Clustering (SHOC) approach to automatically organizing Web search results into groups. SHOC combines the power of two novel techniques, key phrase discovery and orthogonal clustering, to generate clusters which are both reasonable and readable. Moreover, SHOC can work for multiple languages: not only English but also oriental languages like Chinese. The main contributions of this paper are as follows. (1) The benefits of using key phrases as Web document features are discussed. A key phrase discovery algorithm based on the suffix array is presented. This algorithm is highly effective and efficient no matter how large the language’s alphabet is. (2) The concept of orthogonal clustering is proposed for general clustering problems. We rigorously prove why matrix Singular Value Decomposition (SVD) provides a solution to orthogonal clustering. Orthogonal clustering has a solid mathematical foundation and many advantages over traditional heuristic clustering algorithms.
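To make the suffix-array idea concrete, here is a minimal Python sketch that finds repeated word sequences by sorting all suffixes of a token list and comparing neighbours. It is a simplification under stated assumptions: the function name `repeated_phrases` and the `min_len` threshold are hypothetical, the suffix array is built naively by sorting, and the paper's actual key phrase criteria are not given in the abstract and are not reproduced here.

```python
def repeated_phrases(tokens, min_len=2):
    """Collect repeated phrases found as shared prefixes of adjacent sorted suffixes."""
    # Naive suffix array: suffix start positions, sorted by the suffix they begin.
    suffix_array = sorted(range(len(tokens)), key=lambda i: tokens[i:])
    phrases = set()
    for a, b in zip(suffix_array, suffix_array[1:]):
        # Length of the longest common prefix of two neighbouring suffixes.
        k = 0
        while (a + k < len(tokens) and b + k < len(tokens)
               and tokens[a + k] == tokens[b + k]):
            k += 1
        if k >= min_len:
            phrases.add(tuple(tokens[a:a + k]))
    return phrases

snippet = "web search results web search engine".split()
print(repeated_phrases(snippet))  # {('web', 'search')}
```

Because the comparison works on arbitrary tokens, the same routine applies to character sequences, which is one way to read the claim that alphabet size does not matter; a production version would use a linear-time or O(n log n) suffix array construction instead of the naive sort.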
Proceedings of the First International Workshop on Issues of Sentiment Discovery and Opinion Mining | 2012
Andrius Mudinas; Dell Zhang; Mark Levene
In this paper, we present the anatomy of pSenti, a concept-level sentiment analysis system that seamlessly integrates lexicon-based and learning-based approaches to opinion mining. Compared with pure lexicon-based systems, it achieves significantly higher accuracy in sentiment polarity classification as well as sentiment strength detection. Compared with pure learning-based systems, it offers more structured and readable results with aspect-oriented explanation and justification, while being less sensitive to the writing style of text. Our extensive experiments on two real-world datasets (CNET software reviews and IMDB movie reviews) confirm the superiority of the proposed hybrid approach over state-of-the-art systems like SentiStrength.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2005
Dell Zhang; Xi Chen; Wee Sun Lee
Support Vector Machines (SVMs) have been very successful in text classification. However, the intrinsic geometric structure of text data has been ignored by the standard kernels commonly used in SVMs. It is natural to assume that the documents are on the multinomial manifold, which is the simplex of multinomial models furnished with the Riemannian structure induced by the Fisher information metric. We prove that the Negative Geodesic Distance (NGD) on the multinomial manifold is conditionally positive definite (cpd), and thus can be used as a kernel in SVMs. Experiments show the NGD kernel on the multinomial manifold to be effective for text classification, significantly outperforming standard kernels on the ambient Euclidean space.
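The kernel value itself is simple to evaluate. Under the Fisher information metric, the geodesic distance between two multinomial parameter vectors p and q has the closed form 2·arccos(Σ_i √(p_i·q_i)), and the NGD kernel is its negative. The Python sketch below assumes each document is embedded in the simplex by L1-normalised term frequencies; that embedding and the function names are illustrative assumptions, not details taken from the abstract.

```python
import math
from collections import Counter

def to_simplex(tokens):
    """Map a token list to a point on the simplex (assumed embedding: relative term frequency)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def ngd_kernel(p, q):
    """Negative geodesic distance between two multinomial distributions."""
    # Only terms present in both documents contribute to the sum of square roots.
    overlap = sum(math.sqrt(p[t] * q[t]) for t in p.keys() & q.keys())
    # Clamp to [0, 1] so rounding error cannot push acos out of its domain.
    overlap = min(1.0, max(0.0, overlap))
    return -2.0 * math.acos(overlap)

doc_a = to_simplex("svm text kernel text".split())
doc_b = to_simplex("kernel text manifold".split())
doc_c = to_simplex("football match".split())
print(ngd_kernel(doc_a, doc_b), ngd_kernel(doc_a, doc_c))
```

Identical documents give 0, the largest possible value, and documents with no shared terms give −π, the smallest.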
International World Wide Web Conferences | 2000
Dell Zhang; Yisheng Dong
How to rank Web resources is critical to Web Resource Discovery (Search Engine). This paper not only points out the weakness of current approaches, but also presents in-depth analysis of the multidimensionality and subjectivity of rank algorithms. From a dynamics viewpoint, this paper abstracts a user's Web surfing action as a Markov model. Based on this model, we propose a new rank algorithm. The result of our rank algorithm, which synthesizes the relevance, authority, integrativity and novelty of each Web resource, can be computed efficiently not by iteration but by solving a system of linear equations.
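The abstract does not state the paper's equations, so the following Python sketch only illustrates the general point that a Markov surfing model can be scored by a direct linear solve instead of iteration. It assumes a generic random-surfer model with a damping factor, where the score vector x satisfies (I − d·M)x = (1 − d)/n; the model, the `damping` parameter, and the function names are assumptions for illustration and are not the authors' algorithm.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def rank_by_linear_solve(links, damping=0.85):
    """Score pages 0..n-1; links[j] lists the pages that page j points to.

    Assumes every page has at least one out-link.
    """
    n = len(links)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j, targets in links.items():
        for i in targets:
            # The surfer on page j follows each out-link with equal probability.
            a[i][j] -= damping / len(targets)
    b = [(1.0 - damping) / n] * n
    return solve(a, b)

scores = rank_by_linear_solve({0: [1, 2], 1: [2], 2: [0]})
print(scores)
```

For a small graph the direct solve is exact up to rounding; for Web-scale graphs a sparse solver would be needed in place of the dense elimination shown here.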
International World Wide Web Conferences | 2004
Dell Zhang; Wee Sun Lee
We address the problem of integrating objects from a source taxonomy into a master taxonomy. This problem is not only currently pervasive on the web, but also important to the emerging semantic web. A straightforward approach to automating this process would be to train a classifier for each category in the master taxonomy, and then classify objects from the source taxonomy into these categories. In this paper we attempt to use a powerful classification method, the Support Vector Machine (SVM), to attack this problem. Our key insight is that the availability of the source taxonomy data could help build better classifiers in this scenario; it would therefore be beneficial to do transductive learning rather than inductive learning, i.e., learning to optimize classification performance on a particular set of test examples. Noticing that the categorizations of the master and source taxonomies often have some semantic overlap, we propose a method, Cluster Shrinkage (CS), to further enhance the classification by exploiting such implicit knowledge. Our experiments with real-world web data show substantial improvements in the performance of taxonomy integration.
Computer Networks | 2002
Dell Zhang; Yisheng Dong
Web usage mining can be very useful to search engines. This paper proposes a novel and effective approach to exploiting the relationships among users, queries and resources based on the search engine's log. How this method can be applied is illustrated by a Chinese image search engine.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2017
Jun Wang; Lantao Yu; Weinan Zhang; Yu Gong; Yinghui Xu; Benyou Wang; Peng Zhang; Dell Zhang
This paper provides a unified account of two schools of thinking in information retrieval modelling: the generative retrieval focusing on predicting relevant documents given a query, and the discriminative retrieval focusing on predicting relevancy given a query-document pair. We propose a game-theoretic minimax game to iteratively optimise both models. On the one hand, the discriminative model, aiming to mine signals from labelled and unlabelled data, provides guidance to train the generative model towards fitting the underlying relevance distribution over documents given the query. On the other hand, the generative model, acting as an attacker to the current discriminative model, generates difficult examples for the discriminative model in an adversarial way by minimising its discrimination objective. With the competition between these two models, we show that the unified framework takes advantage of both schools of thinking: (i) the generative model learns to fit the relevance distribution over documents via the signals from the discriminative model, and (ii) the discriminative model is able to exploit the unlabelled data selected by the generative model to achieve a better estimation for document ranking. Our experimental results have demonstrated significant performance gains of as much as 23.96% on Precision@5 and 15.50% on MAP over strong baselines in a variety of applications including web search, item recommendation, and question answering.
Knowledge Discovery and Data Mining | 2006
Dell Zhang; Wee Sun Lee
In many text classification applications, it is appealing to take every document as a string of characters rather than a bag of words. Previous research studies in this area mostly focused on different variants of generative Markov chain models. Although discriminative machine learning methods like Support Vector Machine (SVM) have been quite successful in text classification with word features, it is neither effective nor efficient to apply them straightforwardly taking all substrings in the corpus as features. In this paper, we propose to partition all substrings into statistical equivalence groups, and then pick those groups which are important (in the statistical sense) as features (named key-substring-group features) for text classification. In particular, we propose a suffix tree based algorithm that can extract such features in linear time (with respect to the total number of characters in the corpus). Our experiments on English, Chinese and Greek datasets show that SVM with key-substring-group features can achieve outstanding performance for various text classification tasks.
International World Wide Web Conferences | 2012
Long Chen; Dell Zhang; Mark Levene
Community Question Answering (CQA) services, such as Yahoo! Answers, are specifically designed to address the innate limitation of Web search engines by helping users obtain information from a community. Understanding the user intent of questions would enable a CQA system to identify similar questions, find relevant answers, and recommend potential answerers more effectively and efficiently. In this paper, we propose to classify questions into three categories according to their underlying user intent: subjective, objective, and social. In order to identify the user intent of a new question, we build a predictive model through machine learning based on both text and metadata features. Our investigation reveals that these two types of features are conditionally independent and each of them is sufficient for prediction. Therefore they can be exploited as two views in co-training, a semi-supervised learning framework, to make use of a large amount of unlabelled questions, in addition to the small set of manually labelled questions, for enhanced question classification. The preliminary experimental results show that co-training works significantly better than simply pooling these two types of features together.