Maosong Sun
Tsinghua University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Maosong Sun.
ACM Transactions on Intelligent Systems and Technology | 2011
Zhiyuan Liu; Yuzhou Zhang; Edward Y. Chang; Maosong Sun
Previous methods of distributed Gibbs sampling for LDA run into either memory or communication bottlenecks. To improve scalability, we propose four strategies: data placement, pipeline processing, word bundling, and priority-based scheduling. Experiments show that our strategies significantly reduce the unparallelizable communication bottleneck and achieve good load balancing, and hence improve scalability of LDA.
empirical methods in natural language processing | 2009
Zhiyuan Liu; Peng Li; Yabin Zheng; Maosong Sun
Keyphrases are widely used as a brief summary of documents. Since manual assignment is time-consuming, various unsupervised ranking methods based on importance scores are proposed for keyphrase extraction. In practice, the keyphrases of a document should not only be statistically important in the document, but also have a good coverage of the document. Based on this observation, we propose an unsupervised method for keyphrase extraction. Firstly, the method finds exemplar terms by leveraging clustering techniques, which guarantees the document to be semantically covered by these exemplar terms. Then the keyphrases are extracted from the document using the exemplar terms. Our method outperforms sate-of-the-art graph-based ranking methods (TextRank) by 9.5% in F1-measure.
empirical methods in natural language processing | 2014
Xinxiong Chen; Zhiyuan Liu; Maosong Sun
Most word representation methods assume that each word owns a single semantic vector. This is usually problematic because lexical ambiguity is ubiquitous, which is also the problem to be resolved by word sense disambiguation. In this paper, we present a unified model for joint word sense representation and disambiguation, which will assign distinct representations for each word sense. 1 The basic idea is that both word sense representation (WSR) and word sense disambiguation (WSD) will benefit from each other: (1) highquality WSR will capture rich information about words and senses, which should be helpful for WSD, and (2) high-quality WSD will provide reliable disambiguated corpora for learning better sense representations. Experimental results show that, our model improves the performance of contextual word similarity compared to existing WSR methods, outperforms stateof-the-art supervised methods on domainspecific WSD, and achieves competitive performance on coarse-grained all-words WSD.
meeting of the association for computational linguistics | 2016
Shiqi Shen; Yong Cheng; Zhongjun He; Wei He; Hua Wu; Maosong Sun; Yang Liu
We propose minimum risk training for end-to-end neural machine translation. Unlike conventional maximum likelihood estimation, minimum risk training is capable of optimizing model parameters directly with respect to arbitrary evaluation metrics, which are not necessarily differentiable. Experiments show that our approach achieves significant improvements over maximum likelihood estimation on a state-of-the-art neural machine translation system across various languages pairs. Transparent to architectures, our approach can be applied to more neural networks and potentially benefit more NLP tasks.
empirical methods in natural language processing | 2015
Yankai Lin; Zhiyuan Liu; Huanbo Luan; Maosong Sun; Siwei Rao; Song Liu
Representation learning of knowledge bases aims to embed both entities and relations into a low-dimensional space. Most existing methods only consider direct relations in representation learning. We argue that multiple-step relation paths also contain rich inference patterns between entities, and propose a path-based representation learning model. This model considers relation paths as translations between entities for representation learning, and addresses two key challenges: (1) Since not all relation paths are reliable, we design a path-constraint resource allocation algorithm to measure the reliability of relation paths. (2) We represent relation paths via semantic composition of relation embeddings. Experimental results on real-world datasets show that, as compared with baselines, our model achieves significant and consistent improvements on knowledge base completion and relation extraction from text. The source code of this paper can be obtained from https://github.com/mrlyk423/ relation_extraction.
meeting of the association for computational linguistics | 1998
Maosong Sun; Dayang Shen; Benjamin K. Tsou
Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese texts without making use of any lexicon and hand-crafted linguistic resource. The statistical data required by the algorithm, that is, mutual information and the difference of t-score between characters, is derived automatically from raw Chinese corpora. The preliminary experiment shows that the segmentation accuracy of our algorithm is acceptable. We hope the gaining of this approach will be beneficial to improving the performance (especially in ability to cope with unknown words and ability to adapt to various domains) of the existing segmenters, though the algorithm itself can also be utilized as a stand-alone segmenter in some NLP applications.
meeting of the association for computational linguistics | 2016
Yankai Lin; Shiqi Shen; Zhiyuan Liu; Huanbo Luan; Maosong Sun
Distant supervised relation extraction has been widely used to find novel relational facts from text. However, distant supervision inevitably accompanies with the wrong labelling problem, and these noisy data will substantially hurt the performance of relation extraction. To alleviate this issue, we propose a sentence-level attention-based model for relation extraction. In this model, we employ convolutional neural networks to embed the semantics of sentences. Afterwards, we build sentence-level attention over multiple instances, which is expected to dynamically reduce the weights of those noisy instances. Experimental results on real-world datasets show that, our model can make full use of all informative sentences and effectively reduce the influence of wrong labelled instances. Our model achieves significant and consistent improvements on relation extraction as compared with baselines. The source code of this paper can be obtained from https: //github.com/thunlp/NRE.
Frontiers of Computer Science in China | 2012
Zhiyuan Liu; Xinxiong Chen; Maosong Sun
Microblogging provides a new platform for communicating and sharing information among Web users. Users can express opinions and record daily life using microblogs. Microblogs that are posted by users indicate their interests to some extent. We aim to mine user interests via keyword extraction from microblogs. Traditional keyword extraction methods are usually designed for formal documents such as news articles or scientific papers. Messages posted by microblogging users, however, are usually noisy and full of new words, which is a challenge for keyword extraction. In this paper, we combine a translation-based method with a frequency-based method for keyword extraction. In our experiments, we extract keywords for microblog users from the largest microblogging website in China, Sina Weibo. The results show that our method can identify users’ interests accurately and efficiently.
international conference natural language processing | 2007
Jun Li; Maosong Sun
Machine learning method in text classification has expanded from topic identification to more challenging tasks such as sentiment classification, and it is valuable to explore, compare methods applied in sentiment classification and investigate relevant influence factors. The chief aim of the present work is to compare four machine learning methods to sentiment classification of Chinese review. The corpus is made up of 16000 reviews from website. We investigate the factors which affect the performance: namely feature representation via Word-Based Unigram (WBU), Bigram (WBB) and Chinese Character-Based Bigram (CBB), Trigram (CBT); feature weighting schemes and feature dimensionality. Experimental evaluations show that performance depends on different settings. As a result, we draw a conclusion that Naive Bayes (NB) classifier obtains the best averaging performance when using WBB, CBT as features with bool weighting under different dimensionality to the task.
Computational Linguistics | 2009
Zhongguo Li; Maosong Sun
We present a Chinese word segmentation model learned from punctuation marks which are perfect word delimiters. The learning is aided by a manually segmented corpus. Our method is considerably more effective than previous methods in unknown word recognition. This is a step toward addressing one of the toughest problems in Chinese word segmentation.