
Publication


Featured research published by Wim De Smet.


Information Processing and Management | 2015

Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications

Ivan Vulić; Wim De Smet; Jie Tang; Marie-Francine Moens

Probabilistic topic models are unsupervised generative models which model document content as a two-step generation process: documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested in transferring the probabilistic topic modeling concept from monolingual to multilingual settings, and novel topic models have been designed to work with parallel and comparable texts. We define multilingual probabilistic topic modeling (MuPTM) and present the first full overview of the current research, methodology, advantages, and limitations in MuPTM. As a representative example, we choose a natural extension of the omnipresent LDA model to multilingual settings called bilingual LDA (BiLDA), and provide a thorough overview of this representative multilingual model, from its high-level modeling assumptions down to its mathematical foundations. We demonstrate how to use the data representation produced by a multilingual probabilistic topic model, namely the output sets of (i) per-topic word distributions and (ii) per-document topic distributions, in various real-life cross-lingual tasks involving different languages, without any external language-pair-dependent translation resource: (1) cross-lingual event-centered news clustering, (2) cross-lingual document classification, (3) cross-lingual semantic similarity, and (4) cross-lingual information retrieval. We also briefly review several other applications present in the relevant literature, and introduce and illustrate two related modeling concepts: topic smoothing and topic pruning. In summary, this article encompasses the current research in multilingual probabilistic topic modeling. By presenting a series of potential applications, we reveal the importance of the language-independent and language-pair-independent data representations obtained by means of MuPTM, and provide clear directions for future research in the field through a systematic overview of how to link and transfer aspect knowledge across corpora written in different languages via the shared space of latent cross-lingual topics, that is, how to effectively employ the learned per-topic word distributions and per-document topic distributions of any multilingual probabilistic topic model in various cross-lingual applications.
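
The BiLDA model the abstract singles out admits a compact implementation: aligned document pairs share a single topic mixture theta, while each language keeps its own per-topic word distributions phi. Below is a minimal collapsed-Gibbs sketch on toy data; the corpus, vocabulary sizes, topic count, and hyperparameters are illustrative assumptions, not the setup used in the article.

import numpy as np

rng = np.random.default_rng(0)

K, alpha, beta = 4, 0.5, 0.01            # topics and symmetric Dirichlet priors (assumed)
V = [50, 60]                             # per-language vocabulary sizes (assumed)
# toy corpus: 10 aligned document pairs, each side a list of word ids
corpus = [[rng.integers(0, V[l], size=20).tolist() for l in range(2)]
          for _ in range(10)]
D = len(corpus)

# count tables: theta is shared per pair, phi is kept per language
ndk = np.zeros((D, K))                                  # pair-topic counts
nkw = [np.zeros((K, V[l])) for l in range(2)]           # topic-word counts per language
nk = [np.zeros(K) for l in range(2)]                    # topic totals per language
z = [[np.zeros(len(corpus[d][l]), dtype=int) for l in range(2)] for d in range(D)]

for d in range(D):                       # random initialisation
    for l in range(2):
        for i, w in enumerate(corpus[d][l]):
            k = rng.integers(K)
            z[d][l][i] = k
            ndk[d, k] += 1; nkw[l][k, w] += 1; nk[l][k] += 1

for _ in range(200):                     # collapsed Gibbs sweeps
    for d in range(D):
        for l in range(2):
            for i, w in enumerate(corpus[d][l]):
                k = z[d][l][i]           # remove the current assignment
                ndk[d, k] -= 1; nkw[l][k, w] -= 1; nk[l][k] -= 1
                p = (ndk[d] + alpha) * (nkw[l][:, w] + beta) / (nk[l] + V[l] * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][l][i] = k           # resample and restore counts
                ndk[d, k] += 1; nkw[l][k, w] += 1; nk[l][k] += 1

theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
phi = [(nkw[l] + beta) / (nkw[l] + beta).sum(axis=1, keepdims=True) for l in range(2)]

The two output sets named in the abstract fall out directly: theta gives the per-document topic distributions usable for cross-lingual clustering, classification, similarity, and retrieval, and phi gives one set of per-topic word distributions per language.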


Knowledge Discovery and Data Mining | 2011

Knowledge transfer across multilingual corpora via latent topics

Wim De Smet; Jie Tang; Marie-Francine Moens

This paper explores bridging the content of two different languages via latent topics. Specifically, we propose a unified probabilistic model to simultaneously model latent topics from bilingual corpora that discuss comparable content and use the topics as features in a cross-lingual, dictionary-less text categorization task. Experimental results on multilingual Wikipedia data show that the proposed topic model effectively discovers the topic information from the bilingual corpora, and the learned topics successfully transfer classification knowledge to other languages, for which no labeled training data are available.
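
The transfer mechanism described above can be pictured with a short sketch: once documents in both languages are represented as mixtures over the same latent topics, a classifier trained on labeled source-language mixtures applies unchanged to the target language. The topic mixtures and labels below are random stand-ins assumed only for illustration; any inference procedure producing aligned per-document topic distributions would slot in.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
K = 4                                             # number of shared latent topics (assumed)
theta_en = rng.dirichlet(np.ones(K), size=200)    # labeled English mixtures (stand-in)
labels_en = (theta_en.argmax(axis=1) < 2).astype(int)   # toy binary categories
theta_nl = rng.dirichlet(np.ones(K), size=50)     # unlabeled Dutch mixtures (stand-in)

# the classifier only ever sees topic features, so it is language-agnostic
clf = LogisticRegression().fit(theta_en, labels_en)
pred_nl = clf.predict(theta_nl)                   # labels transferred without a dictionary

Because the feature space is the shared set of K topics, no bilingual dictionary or translation step is touched at any point, which is the dictionary-less setting the paper targets.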


Social Web Search and Mining | 2009

Cross-language linking of news stories on the web using interlingual topic modelling

Wim De Smet; Marie-Francine Moens

We have studied the problem of linking event information across different languages without the use of translation systems or dictionaries. The linking is based on interlingual information obtained through probabilistic topic models trained on comparable corpora written in two languages (in our case English and Dutch). To achieve this, we expand the Latent Dirichlet Allocation model to process documents in two languages. We demonstrate the validity of the learned interlingual topics in a document clustering task, where the evaluation is performed on Google News.
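
The linking step reduces to clustering in the shared topic space: every story, English or Dutch, is a point on the same topic simplex, so a single clustering over both languages yields cross-lingual event clusters. A minimal sketch, assuming random stand-in topic mixtures in place of ones inferred by the expanded LDA model:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
K = 10                                            # number of interlingual topics (assumed)
theta_en = rng.dirichlet(np.ones(K), size=30)     # English story mixtures (stand-in)
theta_nl = rng.dirichlet(np.ones(K), size=30)     # Dutch story mixtures (stand-in)

# once stories live in the shared topic space, language no longer matters
all_theta = np.vstack([theta_en, theta_nl])
events = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(all_theta)
en_events, nl_events = events[:30], events[30:]   # stories sharing a cluster id are linked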


Meeting of the Association for Computational Linguistics | 2011

Identifying Word Translations from Comparable Corpora Using Latent Topic Models

Ivan Vulić; Wim De Smet; Marie-Francine Moens


Information Retrieval | 2013

Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora

Ivan Vulić; Wim De Smet; Marie-Francine Moens


Computational Linguistics in the Netherlands | 2009

An aspect based document representation for event clustering

Wim De Smet; Marie-Francine Moens


Data Mining and Knowledge Discovery | 2013

Representations for multi-document event clustering

Wim De Smet; Marie-Francine Moens


Lecture Notes in Computer Science | 2011

Cross-language information retrieval with latent topic models trained on a comparable corpus

Ivan Vulić; Wim De Smet; Marie-Francine Moens


Neural Information Processing Systems | 2012

Probabilistic topic modeling in multilingual settings: a short overview of its methodology with applications

Ivan Vulić; Wim De Smet; Jie Tang; Marie-Francine Moens


Proceedings of SIM 2009 (Joint SRL, ILP, and MLG Conference) | 2009

Does Google own YouTube? Entity relationship extraction with minimal supervision

Jan De Belder; Wim De Smet; Raquel Mochales Palau; Marie-Francine Moens

Collaboration


Dive into Wim De Smet's collaborations.

Top Co-Authors

Marie-Francine Moens, Katholieke Universiteit Leuven

Ivan Vulić, Katholieke Universiteit Leuven

Jan De Belder, Katholieke Universiteit Leuven

Raquel Mochales Palau, Katholieke Universiteit Leuven