
Publication


Featured research published by Minglei Li.


Conference on Intelligent Text Processing and Computational Linguistics | 2015

Web Person Disambiguation Using Hierarchical Co-reference Model

Jian Xu; Qin Lu; Minglei Li; Wenjie Li

As one of the entity disambiguation tasks, Web Person Disambiguation (WPD) identifies different persons who share the same name by grouping search results into per-person clusters. Most current research uses clustering methods for WPD. These approaches require tuning thresholds that are biased towards the training data and may not work well on different datasets. In this paper, we propose a novel approach that uses pairwise co-reference modeling for WPD without any threshold tuning. Because person names are named entities, they can be disambiguated with semantic measures based on the so-called co-reference resolution criterion across documents. The algorithm first forms a forest with person names as observable leaf nodes. It then stochastically builds an entity hierarchy by merging names into a sub-tree as a latent entity group if they have a co-referential relationship across documents. Because the joining and partition of nodes is based on comparative co-reference values, our method is independent of training data and requires no parameter tuning. Experiments show that this semantics-based method achieves performance comparable to the top two state-of-the-art systems without using any training data. The stochastic approach also gives our algorithm near-linear processing time, much more efficient than HAC-based clustering methods. Because our model allows a small number of upper-level entity nodes to summarize a large number of name mentions, it has much higher semantic representation power and is much more scalable over large collections of name mentions than HAC-based algorithms.
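The core grouping step can be illustrated with a toy sketch. Unlike the paper's stochastic, threshold-free hierarchy construction, this simplified version uses a fixed context-overlap cutoff and a flat union-find, purely to show how pairwise co-reference decisions induce person clusters; `context_overlap` is a stand-in signal, not the paper's co-reference model.

```python
from itertools import combinations

def context_overlap(a, b):
    """Toy co-reference signal: Jaccard overlap of two context term sets."""
    return len(a & b) / len(a | b)

def group_mentions(mentions, min_overlap=0.2):
    """Greedy union-find grouping: merge two name mentions into one
    latent entity whenever their contexts look co-referential."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if context_overlap(mentions[i], mentions[j]) >= min_overlap:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```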


Empirical Methods in Natural Language Processing | 2017

A Cognition Based Attention Model for Sentiment Analysis

Yunfei Long; Lu Qin; Rong Xiang; Minglei Li; Chu-Ren Huang

Attention models are used in sentiment analysis because some words are more important than others. However, most existing methods use either local-context-based text information or user preference information. In this work, we propose a novel attention model trained on cognition-grounded eye-tracking data. A reading prediction model is first built using eye-tracking data as the dependent variable and other contextual features as independent variables. The predicted reading time is then used to build a cognition-based attention (CBA) layer for neural sentiment analysis. As a comprehensive model, it can capture the attention of words in sentences as well as of sentences in documents. Different attention mechanisms can also be incorporated to capture other aspects of attention. Evaluations show that the CBA-based method significantly outperforms state-of-the-art local-context-based attention methods. This offers insight into how cognition-grounded data can be brought into NLP tasks.
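A minimal sketch of the attention idea, assuming per-word reading times have already been predicted: turn the times into softmax attention weights and form a weighted sentence vector. In the paper the layer is trained jointly inside a neural model; this is only the standalone weighting step.

```python
import numpy as np

def cognition_attention(word_vecs, reading_times):
    """Turn predicted per-word reading times into softmax attention
    weights, then return (weights, attention-weighted sentence vector)."""
    t = np.asarray(reading_times, dtype=float)
    weights = np.exp(t - t.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights, weights @ np.asarray(word_vecs)
```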


IEEE Transactions on Affective Computing | 2017

Inferring Affective Meanings of Words from Word Embedding

Minglei Li; Qin Lu; Yunfei Long; Lin Gui

An affective lexicon is one of the most important resources in affective computing for text. Manually constructed affective lexicons have limited scale and thus limited use in practical systems. In this work, we propose a regression-based method to automatically infer multi-dimensional affective representations of words from their word embeddings, using a set of seed words. The method exploits the rich semantic information in word embeddings to extract meanings in a specific semantic space. It rests on the assumption that different features in a word embedding contribute differently to a particular affective dimension, and that a particular feature contributes differently to different affective dimensions. Evaluation on various affective lexicons shows that our method outperforms state-of-the-art methods on all the lexicons under different evaluation metrics by large margins. We also explore different regression models and conclude that Ridge regression, Bayesian Ridge regression, and Support Vector Regression with a linear kernel are the most suitable. Compared to other state-of-the-art methods, our method also has a computational advantage. Experiments on a sentiment analysis task show that lexicons extended by our method achieve better results than publicly available sentiment lexicons on eight sentiment corpora. The extended lexicons are publicly available.
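The regression setup can be sketched as follows, with synthetic stand-in data (real usage would fit on seed-word embeddings paired with human-rated affective scores). Closed-form ridge is used here since the abstract names Ridge regression as one of the best-performing models.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X'X + aI)^{-1} X'Y.
    Maps word-embedding features (rows of X) to affective dimensions."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

# Hypothetical seed lexicon: 200 seed words with 50-d embeddings and
# two affective dimensions (e.g. valence, arousal), generated synthetically.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
W_true = rng.normal(size=(50, 2))
Y = X @ W_true                      # synthetic affective ratings

W = fit_ridge(X, Y, alpha=0.1)
new_word = rng.normal(size=50)      # embedding of an unseen word
affect = new_word @ W               # predicted [valence, arousal]
```

Once fitted, scoring an unseen word is a single matrix product, which is where the computational advantage over per-word similarity-based methods comes from.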


Knowledge Science, Engineering and Management | 2017

Representation Learning of Multiword Expressions with Compositionality Constraint

Minglei Li; Qin Lu; Yunfei Long

Representations of multiword expressions (MWEs) are currently learned either from context external to the MWE, based on the distributional hypothesis, or from the representations of component words combined by some composition function, based on the compositional hypothesis. However, a distributional method treats an MWE as a non-divisible unit without considering its component words, and also suffers from data sparseness, which is especially severe for MWEs. A compositional method, on the other hand, can fail if an MWE is non-compositional. In this paper, we propose a hybrid method that learns MWE representations from both external context and component words under a compositionality constraint. Instead of simply combining the two kinds of information, we use a compositionality measure from lexical semantics as the constraint. The main idea is to learn MWE representations as a weighted linear combination of external context and component words, where the weight is based on the compositionality of the MWE. Evaluation on three datasets shows that this hybrid method is more robust and improves the representations.
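The weighted combination can be sketched like this. Here the compositionality weight is taken as the cosine similarity between the distributional MWE vector and the composed component vector, and composition is simple averaging; both are plausible illustrative choices, not necessarily the paper's exact measure or composition function.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def mwe_embedding(v_external, component_vecs):
    """Hybrid MWE vector: blend the distributional (external-context)
    vector with the composed component vector, weighted by an
    automatically measured compositionality score in [0, 1]."""
    v_comp = np.mean(component_vecs, axis=0)       # additive composition
    alpha = max(0.0, cosine(v_external, v_comp))   # compositionality weight
    return alpha * v_comp + (1.0 - alpha) * v_external
```

For a highly compositional MWE the result leans on the component words (helping with sparse external context); for a non-compositional one it falls back to the distributional vector.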


International Conference on Asian Language Processing | 2016

A regression approach to valence-arousal ratings of words from word embedding

Minglei Li; Yunfei Long; Qin Lu

Traditional affective lexicons are mainly based on discrete classes, such as positive, happiness, and sadness, which may limit their expressive power compared to dimensional representations, in which affective meanings are expressed as continuous numerical values on multiple dimensions such as valence-arousal. Traditional methods for acquiring dimensional lexicons rely mainly on time-consuming manual annotation. In this paper, we propose a regression-based method to automatically infer the valence-arousal ratings of words from their word embeddings. The method is based on the assumption that different features in a word embedding contribute differently to different affective meanings. Experiments on three valence-arousal lexicons show that our method outperforms the state-of-the-art method on all the lexicons under four different evaluation metrics. Our model also has a clear computational advantage over the state-of-the-art model.


Conference on Information and Knowledge Management | 2015

A Novel Class Noise Estimation Method and Application in Classification

Lin Gui; Qin Lu; Ruifeng Xu; Minglei Li; Qikang Wei

Noise in the class labels of any training set can lead to poor classification results no matter which machine learning method is used. In this paper, we first present the problem of binary classification in the presence of random noise on the class labels, which we call class noise. To model class noise, a class noise rate is normally defined as a small independent probability of the class labels being inverted over the whole training set. We then propose a method to estimate the class noise rate at the level of individual samples in real data. Based on this estimate, we propose two approaches to handling class noise: the first modifies a given surrogate loss function; the second eliminates class noise by sampling. Furthermore, we prove that with both approaches the optimal hypothesis on the noisy distribution can approximate the optimal hypothesis on the clean distribution. Our methods achieve over 87% accuracy on a synthetic non-separable dataset even when 40% of the labels are inverted. Comparisons with other algorithms show that our methods outperform state-of-the-art approaches on several benchmark datasets across different domains and noise rates.
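One well-known way to modify a surrogate loss under class-label noise is the unbiased loss correction, shown below as a generic illustration (not necessarily the paper's exact construction): the loss on the observed label is reweighted against the loss on the flipped label so that its expectation under the noise equals the clean loss.

```python
def corrected_loss(loss, y, score, rho_pos, rho_neg):
    """Noise-corrected surrogate loss for labels y in {+1, -1}.
    rho_pos / rho_neg: probabilities that a true +1 / -1 label was
    flipped. In expectation over the noise this equals the clean loss,
    so minimizing it on noisy data approximates the clean objective."""
    rho_y = rho_pos if y == 1 else rho_neg       # flip rate of y's class
    rho_not_y = rho_neg if y == 1 else rho_pos   # flip rate of the other class
    denom = 1.0 - rho_pos - rho_neg
    return ((1.0 - rho_not_y) * loss(score, y)
            - rho_y * loss(score, -y)) / denom
```

With zero noise rates the correction reduces to the original loss, as expected.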


Knowledge-Based Systems | 2018

Phrase embedding learning based on external and internal context with compositionality constraint

Minglei Li; Qin Lu; Dan Xiong; Yunfei Long

Different methods have been proposed to learn phrase embeddings, falling mainly into two strands. The first strand, based on the distributional hypothesis, treats a phrase as one non-divisible unit and learns its embedding from external context, similarly to how word embeddings are learned. However, distributional methods cannot use the information embedded in component words and also face a data sparseness problem. The second strand, based on the principle of compositionality, infers a phrase embedding from the embeddings of its component words. Compositional methods give erroneous results if a phrase is non-compositional. In this paper, we propose a hybrid method that linearly combines the distributional component and the compositional component under an individualized phrase compositionality constraint. The phrase compositionality is automatically computed from the distributional embedding of the phrase and those of its component words. Evaluation on five phrase-level semantic tasks shows that our proposed method has the best overall performance. Most importantly, it is more robust, being less sensitive to the choice of dataset.


Workshop on Chinese Lexical Semantics | 2016

Named Entity Recognition for Chinese Novels in the Ming-Qing Dynasties

Yunfei Long; Dan Xiong; Qin Lu; Minglei Li; Chu-Ren Huang

This paper presents a Named Entity Recognition (NER) system for Chinese classic novels in the Ming and Qing dynasties using the Conditional Random Fields (CRFs) method. An annotated corpus of four influential vernacular novels produced during this period is used as both training and testing data. In the experiment, three novels are used as training data and one novel is used as the testing data. Three sets of features are proposed for the CRFs model: (1) baseline feature set, that is, word/POS and bigram for different window sizes, (2) dependency head and dependency relationship, and (3) Wikipedia categories. The F-measures for these four books range from 67% to 80%. Experiments show that using the dependency head and relationship as well as Wikipedia categories can improve the performance of the NER system. Compared with the second feature set, the third one can produce greater improvement.
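The three feature sets can be sketched as a per-token feature extractor in the style commonly used with CRF toolkits. The input names (`dep_heads`, `dep_rels`, `wiki_cats`) and the dictionary format are illustrative assumptions, not the paper's actual data format.

```python
def token_features(sent, i, dep_heads, dep_rels, wiki_cats):
    """Feature dict for token i of sent (a list of (word, POS) pairs),
    covering the three described feature sets: (1) word/POS window,
    (2) dependency head and relation, (3) Wikipedia category."""
    word, pos = sent[i]
    feats = {"w0": word, "p0": pos}
    if i > 0:                                   # left context
        feats["w-1"], feats["p-1"] = sent[i - 1]
    if i < len(sent) - 1:                       # right context
        feats["w+1"], feats["p+1"] = sent[i + 1]
    feats["dep_head"] = dep_heads[i]            # feature set 2
    feats["dep_rel"] = dep_rels[i]
    feats["wiki_cat"] = wiki_cats.get(word, "NONE")  # feature set 3
    return feats
```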


International Conference on Big Data | 2016

Domain-specific user preference prediction based on multiple user activities

Yunfei Long; Qin Lu; Yue Xiao; Minglei Li; Chu-Ren Huang

Inferring latent user preferences from both structured and unstructured data is an important social computing task. In this paper, we propose a user preference representation based on user activities embedded in unstructured data, to better encode the homophily theory. The representation of an individual user is learned with an embedding-based method that integrates latent user preferences in social media. The method can integrate a variety of activity-based cues: user comments, the user's social network (i.e., follower/followee connections), and the topics a user has participated in, which indicate their interests. Experiments evaluate the prediction of each user's favorite team, as one aspect of user preference, on a dataset collected from the Hupu basketball discussion forum. Results clearly indicate that our proposed user representation outperforms baseline user representations, and that integrating the social network and interested topics with user comments improves the overall performance of user preference prediction.
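A minimal sketch of combining the three activity views into one user vector, assuming each view has already been embedded separately; plain concatenation of per-view averages is an illustrative simplification of the integration, not the paper's learned method.

```python
import numpy as np

def user_vector(comment_vecs, network_vec, topic_vecs):
    """Concatenate three activity-based views of a user: the mean of
    the user's comment embeddings, a social-network embedding, and the
    mean embedding of topics the user participated in."""
    return np.concatenate([
        np.mean(comment_vecs, axis=0),
        np.asarray(network_vec, dtype=float),
        np.mean(topic_vecs, axis=0),
    ])
```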


China National Conference on Chinese Computational Linguistics | 2016

Towards Scalable Emotion Classification in Microblog Based on Noisy Training Data

Minglei Li; Qin Lu; Lin Gui; Yunfei Long

The availability of labeled corpora is of great importance for emotion classification tasks. Because manual labeling is too time-consuming, hashtags have been used as natural annotations to obtain large amounts of labeled training data from microblogs. However, inconsistency and noise in such annotations can degrade data quality and thus classifier performance. In this paper, we propose a classification framework that allows naturally annotated data to be used as additional training data and employs a k-NN-graph-based data cleaning method to remove noise once a certain amount of noisy data has accumulated. Evaluation on the NLP&CC2013 Chinese Weibo emotion classification dataset shows that our approach performs 15.8% better than directly using the noisy data without noise filtering. After the filtered hashtag data is added to an existing high-quality training set, performance increases by 3.7% compared to using the high-quality training data alone.
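The cleaning step can be sketched as a simple neighbourhood-agreement filter: a hashtag-labeled sample is kept only if its label agrees with most of its k nearest neighbours. This brute-force version is a simplified stand-in for the paper's k-NN-graph method; `k` and `min_agree` are illustrative parameters.

```python
import numpy as np

def knn_filter(X_noisy, y_noisy, k=5, min_agree=0.6):
    """Return indices of samples whose label agrees with at least
    min_agree of their k nearest neighbours (Euclidean distance)."""
    X = np.asarray(X_noisy, dtype=float)
    y = np.asarray(y_noisy)
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                       # exclude the sample itself
        nn = np.argsort(dists)[:k]
        if np.mean(y[nn] == y[i]) >= min_agree:
            keep.append(i)
    return keep
```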

Collaboration


Dive into Minglei Li's collaborations.

Top Co-Authors

Qin Lu (Hong Kong Polytechnic University)
Yunfei Long (Hong Kong Polytechnic University)
Chu-Ren Huang (Hong Kong Polytechnic University)
Dan Xiong (Hong Kong Polytechnic University)
Lin Gui (Harbin Institute of Technology Shenzhen Graduate School)
Jian Xu (Hong Kong Polytechnic University)
Wenjie Li (Hong Kong Polytechnic University)
Qikang Wei (Harbin Institute of Technology Shenzhen Graduate School)
Ruifeng Xu (Harbin Institute of Technology Shenzhen Graduate School)