Yulan He
Aston University
Publication
Featured research published by Yulan He.
Conference on Information and Knowledge Management | 2009
Chenghua Lin; Yulan He
Sentiment analysis or opinion mining aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text. This paper proposes a novel probabilistic modeling framework based on Latent Dirichlet Allocation (LDA), called the joint sentiment/topic model (JST), which detects sentiment and topic simultaneously from text. Unlike other machine learning approaches to sentiment classification, which often require labeled corpora for classifier training, the proposed JST model is fully unsupervised. The model has been evaluated on the movie review dataset to classify review sentiment polarity, and minimum prior information has also been explored to further improve the sentiment classification accuracy. Preliminary experiments have shown promising results achieved by JST.
IEEE Transactions on Knowledge and Data Engineering | 2012
Chenghua Lin; Yulan He; Richard M. Everson; Stefan M. Rüger
Sentiment analysis or opinion mining aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text. This paper proposes a novel probabilistic modeling framework called joint sentiment-topic (JST) model based on latent Dirichlet allocation (LDA), which detects sentiment and topic simultaneously from text. A reparameterized version of the JST model called Reverse-JST, obtained by reversing the sequence of sentiment and topic generation in the modeling process, is also studied. Although JST is equivalent to Reverse-JST without a hierarchical prior, extensive experiments show that when sentiment priors are added, JST performs consistently better than Reverse-JST. Besides, unlike supervised approaches to sentiment classification which often fail to produce satisfactory performance when shifting to other domains, the weakly supervised nature of JST makes it highly portable to other domains. This is verified by the experimental results on data sets from five different domains where the JST model even outperforms existing semi-supervised approaches in some of the data sets despite using no labeled documents. Moreover, the topics and topic sentiment detected by JST are indeed coherent and informative. We hypothesize that the JST model can readily meet the demand of large-scale sentiment analysis from the web in an open-ended fashion.
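The generative order described in this abstract (sentiment first, then topic, then word) can be illustrated with a minimal sampling sketch. It is not the authors' implementation: the function names, the symmetric Dirichlet hyperparameters `gamma`, `alpha`, and `beta`, and the use of Python's standard `random` module are all assumptions made for illustration. Swapping the two sampling steps inside the word loop would give the Reverse-JST ordering.

```python
import random
from typing import List, Tuple


def sample_dirichlet(rng: random.Random, alpha: List[float]) -> List[float]:
    """Draw from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(rng: random.Random, probs: List[float]) -> int:
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_docs: int, n_words: int, n_sentiments: int,
                    n_topics: int, vocab_size: int,
                    gamma: float = 1.0, alpha: float = 1.0,
                    beta: float = 0.5, seed: int = 0
                    ) -> List[List[Tuple[int, int, int]]]:
    """Sample (sentiment, topic, word) triples in a JST-style order.

    Hyperparameter names and values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # One word distribution per (sentiment, topic) pair, shared by all documents.
    phi = [[sample_dirichlet(rng, [beta] * vocab_size)
            for _ in range(n_topics)]
           for _ in range(n_sentiments)]
    corpus = []
    for _ in range(n_docs):
        # Per-document distribution over sentiment labels.
        pi = sample_dirichlet(rng, [gamma] * n_sentiments)
        # Per-document topic distribution for each sentiment label.
        theta = [sample_dirichlet(rng, [alpha] * n_topics)
                 for _ in range(n_sentiments)]
        doc = []
        for _ in range(n_words):
            sentiment = sample_categorical(rng, pi)             # sentiment first
            topic = sample_categorical(rng, theta[sentiment])   # then topic
            word = sample_categorical(rng, phi[sentiment][topic])
            doc.append((sentiment, topic, word))
        corpus.append(doc)
    return corpus
```

Inference (recovering sentiment and topic assignments from observed words) is the hard part of such a model and is not shown here.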
Computer Speech & Language | 2005
Yulan He; Steve J. Young
This paper discusses semantic processing using the Hidden Vector State (HVS) model. The HVS model extends the basic discrete Markov model by encoding context in each state as a vector. State transitions are then factored into a stack shift operation similar to that of a push-down automaton, followed by a push of a new preterminal semantic category label. The key feature of the model is that it can capture hierarchical structure without the use of treebank data for training. Experiments have been conducted in the travel domain using the relatively simple ATIS corpus and the more complex DARPA Communicator Task. The results show that the HVS model can be robustly trained from only minimally annotated corpus data. Furthermore, when measured by its ability to extract attribute-value pairs from natural language queries in the travel domain, the HVS model outperforms a conventional finite-state semantic tagger by 4.1% in F-measure for ATIS and by 6.6% in F-measure for Communicator, suggesting that the benefit of the HVS model's ability to encode context increases as the task becomes more complex.
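The factored state transition described above (a stack shift followed by a push of one preterminal label) can be sketched as a small function. This is an illustration only: the tuple representation, the function name `hvs_transition`, and the example labels are hypothetical, and the probabilities that govern how many labels to pop and which label to push are not modelled.

```python
from typing import Tuple

# A vector state is modelled as a stack of semantic labels, top of stack first.
State = Tuple[str, ...]


def hvs_transition(state: State, n_pop: int, new_label: str) -> State:
    """Pop n_pop labels off the stack, then push one new preterminal label."""
    if not 0 <= n_pop <= len(state):
        raise ValueError("n_pop must be between 0 and the current stack depth")
    remaining = state[n_pop:]          # stack shift: discard the top n_pop labels
    return (new_label,) + remaining    # push the new preterminal label


# Hypothetical travel-domain labels, chosen only for illustration.
state = ("CITY", "TOLOC", "FLIGHT", "SS")
next_state = hvs_transition(state, n_pop=2, new_label="DATE")
print(next_state)  # ('DATE', 'FLIGHT', 'SS')
```

Because each transition pushes exactly one label, the set of reachable states is far smaller than in an unrestricted stack model.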
Computer Methods and Programs in Biomedicine | 2007
Suryani Lukman; Yulan He; Siu Cheung Hui
Traditional Chinese Medicine (TCM) has been actively researched through various approaches, including computational techniques. A review of the basic elements of TCM is provided to illuminate various challenges and progress in its study using computational methods. Information on various TCM formulations, in particular resources on databases of TCM formulations and their integration with Western medicine, is analyzed in several facets, such as TCM classifications, types of databases, and mining tools. Aspects of computational TCM diagnosis, namely inspection, auscultation, and pulse analysis, as well as TCM expert systems, are reviewed in terms of their benefits and drawbacks. Various approaches to exploring relationships among TCM components and finding genes/proteins relating to TCM symptom complexes are also studied. This survey provides a summary of advances in computational approaches for TCM and will be useful for future knowledge discovery in this area.
Information Processing and Management | 2016
Hassan Saif; Yulan He; Miriam Fernández; Harith Alani
We propose a semantic sentiment representation of words called SentiCircle. SentiCircle captures the contextual semantics of words from their co-occurrences. SentiCircle updates the sentiment of words based on their contextual semantics. SentiCircle can be used to perform entity-level and tweet-level sentiment analysis. Sentiment analysis on Twitter has attracted much attention recently due to its wide applications in both the commercial and public sectors. In this paper we present SentiCircles, a lexicon-based approach for sentiment analysis on Twitter. Different from typical lexicon-based approaches, which offer fixed and static prior sentiment polarities of words regardless of their context, SentiCircles takes into account the co-occurrence patterns of words in different contexts in tweets to capture their semantics and update their pre-assigned strength and polarity in sentiment lexicons accordingly. Our approach allows for the detection of sentiment at both entity level and tweet level. We evaluate our proposed approach on three Twitter datasets using three different sentiment lexicons to derive word prior sentiments. Results show that our approach significantly outperforms the baselines in accuracy and F-measure for entity-level subjectivity (neutral vs. polar) and polarity (positive vs. negative) detection. For tweet-level sentiment detection, our approach performs better than the state-of-the-art SentiStrength by 4-5% in accuracy on two datasets, but falls marginally behind by 1% in F-measure on the third dataset.
IEEE Automatic Speech Recognition and Understanding Workshop | 2003
Yulan He; Steve J. Young
The paper presents a purely data-driven spoken language understanding (SLU) system. It consists of three major components: a speech recognizer, a semantic parser, and a dialog act decoder. A novel feature of the system is that the understanding components are trained directly from data without using explicit semantic grammar rules or fully annotated corpus data. Despite this, the system is able to capture hierarchical structure in user utterances and handle long-range dependencies. Experiments have been conducted on the ATIS corpus, and utterance understanding error rates of 16.1% and 12.6% were obtained for spoken input using the ATIS-3 1993 and 1994 test sets, respectively. These results show that our system is comparable to existing SLU systems which rely on either handcrafted semantic grammar rules or statistical models trained on fully annotated training corpora, but it has a greatly reduced build cost.
Information Processing and Management | 2011
Yulan He; Deyu Zhou
Sentiment analysis concerns automatically identifying the sentiment or opinion expressed in a given piece of text. Most prior work either uses prior lexical knowledge defined as the sentiment polarity of words, or views the task as a text classification problem and relies on labeled corpora to train a sentiment classifier. While lexicon-based approaches do not adapt well to different domains, corpus-based approaches require expensive manual annotation effort. In this paper, we propose a novel framework where an initial classifier is learned by incorporating prior information extracted from an existing sentiment lexicon, with preferences on expectations of sentiment labels of those lexicon words being expressed using generalized expectation criteria. Documents classified with high confidence are then used as pseudo-labeled examples for automatic domain-specific feature acquisition. The word-class distributions of such self-learned features are estimated from the pseudo-labeled examples and are used to train another classifier by constraining the model's predictions on unlabeled instances. Experiments on both the movie-review data and the multi-domain sentiment dataset show that our approach attains comparable or better performance than existing weakly-supervised sentiment classification methods despite using no labeled documents.
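The pseudo-labeling step in this abstract (keep only confidently classified documents, then estimate word-class distributions from them) can be sketched roughly as follows. The function name, the confidence threshold of 0.9, and the reading of "word-class distribution" as the share of each class among a word's occurrences are assumptions made for illustration; the generalized expectation training itself is not shown.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def word_class_distributions(
    docs: List[List[str]],
    predictions: List[Tuple[str, float]],
    threshold: float = 0.9,  # illustrative value, not taken from the paper
) -> Dict[str, Dict[str, float]]:
    """Estimate, for each word, the share of its occurrences in each class,
    using only documents whose predicted label has confidence >= threshold."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for tokens, (label, confidence) in zip(docs, predictions):
        if confidence >= threshold:
            counts[label].update(tokens)  # treat the document as pseudo-labeled

    # Total occurrences of each word across all pseudo-labeled documents.
    word_totals: Counter = Counter()
    for class_counts in counts.values():
        word_totals.update(class_counts)

    return {
        word: {label: counts[label][word] / total for label in counts}
        for word, total in word_totals.items()
    }


# Toy input: the third document is dropped because its confidence is too low.
docs = [["great", "plot"], ["dull", "plot"], ["great", "cast"]]
predictions = [("pos", 0.95), ("neg", 0.92), ("pos", 0.60)]
distributions = word_class_distributions(docs, predictions)
print(distributions["plot"])  # {'pos': 0.5, 'neg': 0.5}
```

In practice some smoothing would likely be needed so that rare words do not receive extreme distributions from one or two occurrences.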
Information Processing and Management | 2002
Yulan He; Siu Cheung Hui
Author co-citation analysis (ACA) has been widely used in bibliometrics as an analytical method for analyzing the intellectual structure of science studies. It can be used to identify authors from the same or similar research fields. However, this method of analysis relies heavily on statistical tools and requires human interpretation. The Web Citation Database is a data warehouse used for storing citation indices of Web publications. In this paper, we propose a mining process to automate ACA based on the Web Citation Database. The mining process uses agglomerative hierarchical clustering (AHC) as the mining technique for author clustering and multidimensional scaling (MDS) for displaying author cluster maps. The clustering results and author cluster map have been incorporated into a citation-based retrieval system known as PubSearch to support author retrieval of Web publications.
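As a rough illustration of the clustering step, the sketch below applies agglomerative hierarchical clustering to a small author co-citation matrix. The abstract does not state which linkage criterion is used, so average linkage is an assumption here, as are the function name and the toy co-citation counts. The MDS display step is not shown.

```python
from typing import List


def agglomerative_clusters(sim: List[List[float]],
                           n_clusters: int) -> List[List[int]]:
    """Repeatedly merge the two most similar clusters (average linkage,
    an assumed choice) until n_clusters clusters remain."""
    if not 1 <= n_clusters <= len(sim):
        raise ValueError("n_clusters must be between 1 and the number of authors")

    clusters = [[i] for i in range(len(sim))]  # start with one author per cluster

    def average_similarity(a: List[int], b: List[int]) -> float:
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best_score, best_x, best_y = None, -1, -1
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = average_similarity(clusters[x], clusters[y])
                if best_score is None or score > best_score:
                    best_score, best_x, best_y = score, x, y
        merged = clusters[best_x] + clusters[best_y]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (best_x, best_y)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Toy symmetric co-citation counts for four hypothetical authors.
cocitations = [
    [0, 9, 1, 0],
    [9, 0, 2, 1],
    [1, 2, 0, 8],
    [0, 1, 8, 0],
]
print(agglomerative_clusters(cocitations, n_clusters=2))  # [[0, 1], [2, 3]]
```

Authors 0 and 1 are frequently cited together, as are authors 2 and 3, so the two pairs end up in separate clusters.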
Speech Communication | 2006
Yulan He; Steve J. Young
The Hidden Vector State (HVS) Model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. State transitions are factored into a stack shift operation similar to that of a push-down automaton, followed by the push of a new preterminal category label. When used as a semantic parser, the model can capture hierarchical structure without the use of treebank data for training, and it can be trained automatically using expectation-maximization (EM) from only lightly annotated training data. When deployed in a system, the model can be continually refined as more data becomes available. In this paper, the practical application of the model in a spoken language understanding (SLU) system is described. Through a sequence of experiments, the issues of robustness to noise and portability to similar and extended domains are investigated. The end-to-end performance obtained from experiments in the ATIS domain shows that the system is comparable to existing SLU systems which rely on either hand-crafted semantic grammar rules or statistical models trained on fully annotated training corpora. Experiments using data which have been artificially corrupted with varying levels of additive noise show that the HVS-based parser is relatively robust, and experiments using data sets from other domains indicate that the overall framework allows adaptation to related domains and scaling to cover enlarged domains. In summary, it is argued that constrained statistical parsers such as the HVS model allow robust spoken dialogue systems to be built at relatively low cost, and that such systems can be automatically adapted as new data is acquired, both to improve performance and to extend coverage.
Journal of Biomolecular Structure & Dynamics | 2015
Ruifeng Xu; Jiyun Zhou; Bin Liu; Yulan He; Quan Zou; Xiaolong Wang; Kuo Chen Chou
DNA-binding proteins are crucial for various cellular processes and hence have become an important target for both basic research and drug development. With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to establish an automated method for rapidly and accurately identifying DNA-binding proteins based on their sequence information alone. Owing to the fact that all biological species have developed beginning from a very limited number of ancestral species, it is important to take into account the evolutionary information in developing such a high-throughput tool. In view of this, a new predictor was proposed by incorporating the evolutionary information into the general form of pseudo amino acid composition via the top-n-gram approach. It was observed by comparing the new predictor with the existing methods via both jackknife test and independent data-set test that the new predictor outperformed its counterparts. It is anticipated that the new predictor may become a useful vehicle for identifying DNA-binding proteins. It has not escaped our notice that the novel approach to extract evolutionary information into the formulation of statistical samples can be used to identify many other protein attributes as well.