Publication


Featured research published by Luís Marujo.


Empirical Methods in Natural Language Processing | 2015

Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

Wang Ling; Chris Dyer; Alan W. Black; Isabel Trancoso; Ramon Fermandez; Silvio Amir; Luís Marujo; Tiago Luís

We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs. Relative to traditional word representation models that have independent vectors for each word type, our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form‐function relationship in language, our “composed” word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages (e.g., Turkish).
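
The core idea of composing a word vector from its characters with a bidirectional LSTM can be pictured with a short sketch. This is only an illustration of the general mechanism, not the authors' implementation: the class name, dimensions, and the use of PyTorch are our assumptions, and the paper combines the forward and backward states with learned projections rather than the simple concatenation used here.

import torch
import torch.nn as nn

class CharToWord(nn.Module):
    """Compose a word representation from its characters (illustrative sketch).
    Hyper-parameters and the concatenation of final states are assumptions."""
    def __init__(self, n_chars, char_dim=50, word_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # one LSTM reads the characters left-to-right, the other right-to-left
        self.bilstm = nn.LSTM(char_dim, word_dim // 2,
                              bidirectional=True, batch_first=True)

    def forward(self, char_ids):                     # char_ids: (batch, word_len)
        h, _ = self.bilstm(self.char_emb(char_ids))  # h: (batch, word_len, word_dim)
        fwd = h[:, -1, : h.size(-1) // 2]            # final state of the forward pass
        bwd = h[:, 0, h.size(-1) // 2 :]             # final state of the backward pass
        return torch.cat([fwd, bwd], dim=-1)         # one composed vector per word

A single module like this replaces a large word-embedding lookup table: only the character embeddings and the LSTM parameters need to be stored.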


Knowledge-Based Systems | 2016

Exploring events and distributed representations of text in multi-document summarization

Luís Marujo; Wang Ling; Ricardo Ribeiro; Anatole Gershman; Jaime G. Carbonell; David Martins de Matos; João Paulo Neto

We explore an event detection framework to improve multi-document summarization. We use distributed representations of text to address different lexical realizations. Summarization is based on the hierarchical combination of single-document summaries. We performed an automatic evaluation and a human study of the generated summaries. Quantitative and qualitative results show clear improvements over the state of the art.

In this article, we explore an event detection framework to improve multi-document summarization. Our approach is based on a two-stage single-document method that extracts a collection of key phrases, which are then used in a centrality-as-relevance passage retrieval model. We explore how to adapt this single-document method for multi-document summarization methods that are able to use event information. The event detection method is based on Fuzzy Fingerprints, a supervised method trained on documents with annotated event tags. To cope with the possible usage of different terms to describe the same event, we explore distributed representations of text in the form of word embeddings, which contributed to improving the summarization results. The proposed summarization methods are based on the hierarchical combination of single-document summaries. The automatic evaluation and human study performed show that these methods improve upon current state-of-the-art multi-document summarization systems on two mainstream evaluation datasets, DUC 2007 and TAC 2009. We show a relative improvement in ROUGE-1 scores of 16% for TAC 2009 and of 17% for DUC 2007.
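
As a rough illustration of how word embeddings can bridge different lexical realizations of the same event, consider the sketch below. It is not the paper's method: the threshold, the exact-match fallback, and the function names are our assumptions.

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def same_event_term(w1, w2, embeddings, threshold=0.6):
    """Treat two surface terms as realizations of the same event concept when
    their embedding vectors are close enough (threshold chosen arbitrarily)."""
    if w1 not in embeddings or w2 not in embeddings:
        return w1 == w2                      # fall back to exact string match
    return cosine(embeddings[w1], embeddings[w2]) >= threshold

# embeddings is assumed to be a dict {word: np.ndarray} of pre-trained vectors,
# e.g. loaded from word2vec or GloVe files.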


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2013

Self reinforcement for important passage retrieval

Ricardo Ribeiro; Luís Marujo; David Martins de Matos; João Paulo Neto; Anatole Gershman; Jaime G. Carbonell

In general, centrality-based retrieval models treat all elements of the retrieval space equally, which may reduce their effectiveness. In the specific context of extractive summarization (or important passage retrieval), this means that these models do not take into account that information sources often contain lateral issues, which are hardly as important as the description of the main topic, or are composed of mixtures of topics. We present a new two-stage method that starts by extracting a collection of key phrases, which are then used to help a centrality-as-relevance retrieval model. We explore several approaches to integrating the key phrases into the centrality model. The proposed method is evaluated using datasets that vary in noise (noisy vs. clean) and language (Portuguese vs. English). Results show that the best variant achieves relative performance improvements of about 31% on clean data and 18% on noisy data.
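
A minimal sketch of the two-stage idea, key phrase extraction followed by key-phrase-biased centrality, might look as follows. The TF-IDF representation, the multiplicative boost, and the function name are our assumptions; the paper explores several integration strategies rather than this particular one.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_passages(passages, key_phrases, boost=1.5):
    """Score passages by centrality (average similarity to all other passages)
    and reward passages that contain previously extracted key phrases."""
    X = TfidfVectorizer().fit_transform(passages)
    centrality = cosine_similarity(X).mean(axis=1)
    scored = []
    for passage, c in zip(passages, centrality):
        hits = sum(kp.lower() in passage.lower() for kp in key_phrases)
        scored.append((c * (boost ** hits), passage))
    return [p for _, p in sorted(scored, key=lambda x: x[0], reverse=True)]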


Workshop on Statistical Machine Translation | 2014

Crowdsourcing High-Quality Parallel Data Extraction from Twitter

Wang Ling; Luís Marujo; Chris Dyer; Alan W. Black; Isabel Trancoso

High-quality parallel data is crucial for a range of multilingual applications, from tuning and evaluating machine translation systems to cross-lingual annotation projection. Unfortunately, automatically obtained parallel data (which is available in relative abundance) tends to be quite noisy. To obtain high-quality parallel data, we introduce a crowdsourcing paradigm in which workers with only basic bilingual proficiency identify translations from an automatically extracted corpus of parallel microblog messages. For less than $350, we obtained over 5000 parallel segments in five language pairs. Evaluated against expert annotations, the quality of the crowdsourced corpus is significantly better than existing automatic methods: it obtains performance comparable to expert annotations when used in MERT tuning of a microblog MT system, and training a parallel sentence classifier with it also leads to improved results. The crowdsourced corpora will be made available at http://www.cs.cmu.edu/~lingwang/microtopia/.


Text, Speech and Dialogue | 2012

Key Phrase Extraction of Lightly Filtered Broadcast News

Luís Marujo; Ricardo Ribeiro; David Martins de Matos; João Paulo Neto; Anatole Gershman; Jaime G. Carbonell

This paper explores the impact of light filtering on automatic key phrase extraction (AKE) applied to Broadcast News (BN). Key phrases are words and expressions that best characterize the content of a document, and are often used to index the document or as features in further processing, which makes improvements in AKE accuracy particularly important. We hypothesized that filtering out marginally relevant sentences from a document would improve AKE accuracy, and our experiments confirmed this hypothesis: eliminating as little as 10% of the document sentences leads to a 2% improvement in AKE precision and recall. Our AKE system is built on top of the MAUI toolkit, which follows a supervised learning approach. We trained and tested the method on a gold standard of 8 BN programs containing 110 manually annotated news stories. The experiments were conducted within a Multimedia Monitoring Solution (MMS) system for TV and radio news/programs, running daily and monitoring 12 TV and 4 radio channels.
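
The light filtering step can be sketched as dropping the sentences that are least similar to the document as a whole before running key phrase extraction. The centroid-similarity criterion below is our assumption; the paper only reports that removing roughly 10% of the sentences improves AKE precision and recall.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lightly_filter(sentences, drop_fraction=0.10):
    """Keep the sentences most similar to the document centroid, in their
    original order, discarding the bottom `drop_fraction` of sentences."""
    X = TfidfVectorizer().fit_transform(sentences)
    centroid = np.asarray(X.mean(axis=0))
    sims = cosine_similarity(X, centroid).ravel()
    keep = max(1, int(round(len(sentences) * (1 - drop_fraction))))
    ranked = sorted(range(len(sentences)), key=lambda i: sims[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]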


Joint Conference on Lexical and Computational Semantics | 2015

Extending a Single-Document Summarizer to Multi-Document: a Hierarchical Approach

Luís Marujo; Ricardo Ribeiro; David Martins de Matos; João Paulo Neto; Anatole Gershman; Jaime G. Carbonell

The increasing amount of online content motivated the development of multi-document summarization methods. In this work, we explore straightforward approaches to extend single-document summarization methods to multi-document summarization. The proposed methods are based on the hierarchical combination of single-document summaries and achieve state-of-the-art results.
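
The hierarchical combination itself is simple to picture: summarize each document independently, then summarize the concatenation of those intermediate summaries. The sketch below assumes a generic single_doc_summarize(text, n) helper, which is a placeholder rather than anything from the paper.

def hierarchical_summary(documents, single_doc_summarize, n_sentences=10):
    """Two-level summarization: per-document summaries first, then a final
    summary of the merged intermediate summaries."""
    per_doc = [single_doc_summarize(doc, n_sentences) for doc in documents]
    merged = " ".join(per_doc)
    return single_doc_summarize(merged, n_sentences)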


International Joint Conference on Natural Language Processing | 2015

Automatic Keyword Extraction on Twitter

Luís Marujo; Wang Ling; Isabel Trancoso; Chris Dyer; Alan W. Black; Anatole Gershman; David Martins de Matos; João Paulo Neto; Jaime G. Carbonell

In this paper, we build a corpus of tweets annotated with keywords using crowdsourcing methods. We identify key differences between this domain and other domains on which keyword extraction has been studied, such as news, which prevent existing approaches for automatic keyword extraction from generalizing well to Twitter data. These differences include the small amount of content in each tweet, the frequent usage of lexical variants, and the high variance in the number of keywords per tweet. We propose methods for addressing these issues, which lead to solid improvements on this dataset for this task.


Pattern Recognition Letters | 2016

Summarization of films and documentaries based on subtitles and scripts

Marta Aparício; Paulo Figueiredo; Francisco Raposo; David Martins de Matos; Ricardo Ribeiro; Luís Marujo

We study the behavior of automatic summarization for films and documentaries. Well-known extractive summarization algorithms are ranked for this task. Assessment of strategies for effective extractive summarization in these domains. Quantitative results are presented for relevant experiments. Qualitative assessment is also provided (concerning the best approaches).

We assess the performance of generic text summarization algorithms applied to films and documentaries, using extracts from news articles produced by reference models of extractive summarization. We use three datasets: (i) news articles, (ii) film scripts and subtitles, and (iii) documentary subtitles. Standard ROUGE metrics are used for comparing generated summaries against news abstracts, plot summaries, and synopses. We show that the best performing algorithms are LSA, for news articles and documentaries, and LexRank and Support Sets, for films. Despite the different nature of films and documentaries, their relative behavior is in accordance with that obtained for news articles.
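
ROUGE-1, used in this kind of evaluation, reduces to clipped unigram overlap; a minimal re-implementation is sketched below for illustration. Real evaluations would rely on a full ROUGE toolkit with stemming, stop-word handling, and multiple references.

from collections import Counter

def rouge_1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate,
    with counts clipped to the reference frequencies."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(1, sum(ref.values()))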


IEEE Conference on Intelligent Systems | 2015

Textual Event Detection Using Fuzzy Fingerprints

Luís Marujo; João Paulo Carvalho; Anatole Gershman; Jaime G. Carbonell; João Paulo Neto; David Martins de Matos

In this paper we present a method to improve the automatic detection of events in short sentences in the presence of a large number of event classes. In contrast to standard classification techniques such as Support Vector Machines or Random Forests, the proposed Fuzzy Fingerprints method is able to detect all the event classes present in the ACE 2005 Multilingual Corpus, and it largely improves the obtained G-Mean value.
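
A fuzzy fingerprint classifier can be sketched as follows: each event class is represented by its top-k terms, each carrying a fuzzy membership that decays with rank, and a sentence is assigned to the class whose fingerprint it matches best. The linear decay, the value of k, and the scoring function are our assumptions, not the exact formulation used in the paper.

from collections import Counter

def build_fingerprint(class_texts, k=50):
    """Fingerprint of an event class: its k most frequent terms, with a
    membership value that decays linearly with rank."""
    counts = Counter(w for text in class_texts for w in text.lower().split())
    top = [w for w, _ in counts.most_common(k)]
    return {w: 1.0 - i / k for i, w in enumerate(top)}

def fingerprint_score(sentence, fingerprint):
    """Normalized sum of memberships of fingerprint terms found in the sentence."""
    words = set(sentence.lower().split())
    total = sum(fingerprint.values())
    return sum(mu for w, mu in fingerprint.items() if w in words) / total

def detect_event(sentence, fingerprints):
    # fingerprints: {event_class: fingerprint dict}; pick the best-matching class
    return max(fingerprints, key=lambda c: fingerprint_score(sentence, fingerprints[c]))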


Computational Linguistics | 2016

Mining parallel corpora from Sina Weibo and Twitter

Wang Ling; Luís Marujo; Chris Dyer; Alan W. Black; Isabel Trancoso


Collaboration


Dive into Luís Marujo's collaborations.

Top Co-Authors

Anatole Gershman, Carnegie Mellon University
Isabel Trancoso, Instituto Superior Técnico
Wang Ling, Carnegie Mellon University
Alan W. Black, Carnegie Mellon University
Chris Dyer, Carnegie Mellon University