Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining | 2019

CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling

Abstract

In this paper, we advance the state of the art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information. The novel contributions of our solution include: (i) the introduction of a novel data representation for topic modeling based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space and (ii) the proposal of a new TF-IDF-based strategy, developed specifically to weight the CluWords. In our extensive experimental evaluation, covering 12 datasets and 8 state-of-the-art baselines, we outperform the baselines (with a few ties) in almost all cases, with gains of more than 50% against the best baselines (achieving up to 80% against some runners-up). Finally, we show that our method is able to improve document representation for the task of automatic text classification.
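
The meta-word idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity measure, the `threshold` parameter, the toy two-dimensional vectors, and the function names `build_cluwords` and `cluword_tf` are all assumptions made for illustration. The sketch groups each vocabulary word with its nearest neighbours in an embedding space and then scores a document against each group. The TF-IDF-based weighting mentioned in the abstract is not shown, since the abstract gives no formula for it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_cluwords(embeddings, threshold):
    """For each word, collect the words whose similarity to it is >= threshold.

    Returns {center_word: {neighbour_word: similarity}}. The threshold is an
    assumed hyperparameter, not a value taken from the paper.
    """
    cluwords = {}
    for center, center_vec in embeddings.items():
        neighbours = {}
        for word, vec in embeddings.items():
            sim = cosine(center_vec, vec)
            if sim >= threshold:
                neighbours[word] = sim
        cluwords[center] = neighbours
    return cluwords


def cluword_tf(doc_tokens, cluwords):
    """Score each meta-word by summing the similarities of the document's tokens
    that fall inside that meta-word's neighbourhood."""
    tf = {}
    for center, neighbours in cluwords.items():
        score = sum(neighbours.get(tok, 0.0) for tok in doc_tokens)
        if score > 0.0:
            tf[center] = score
    return tf


# Toy embeddings, invented for this example only.
embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.6],
    "car": [0.0, 1.0],
}
cluwords = build_cluwords(embeddings, threshold=0.7)
doc_tf = cluword_tf(["cat", "cat", "car"], cluwords)
```

In this toy run the document never contains "dog", yet the "dog" meta-word still receives a non-zero score because "cat" lies in its neighbourhood. That is the sense in which meta-words can add semantic information to a plain bag-of-words representation.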

DOI 10.1145/3289600.3291032

Language English

Journal Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining