Matthias Gallé | Researchain

Publication

Featured researches published by Matthias Gallé.

international acm sigir conference on research and development in information retrieval | 2013

The bag-of-repeats representation of documents

Matthias Gallé

n-gram representations of documents may improve over a simple bag-of-words representation by relaxing the independence assumption between words and introducing context. However, this comes at the cost of adding non-descriptive features and increasing the dimension of the vector space model exponentially. We present new representations that avoid both pitfalls. They are based on sound theoretical notions of stringology, and can be computed in optimal asymptotic time with algorithms using data structures from the suffix family. While maximal repeats have been used in the past for similar tasks, we show how another equivalence class of repeats, largest-maximal repeats, obtains similar or better results with only a fraction of the features. This class acts as a minimal generative basis of all repeated substrings. We also report their use for topic modeling, where they yield models that are easier to interpret.

language and automata theory and applications | 2014

On Context-Diverse Repeats and Their Incremental Computation

Matthias Gallé; Matias D. Tealdi

The context in which a substring appears is an important notion for identifying, for example, its semantic meaning. However, existing classes of repeats fail to take this into account directly. We present here xkcd-repeats, a new family of repeats characterized by the number of different symbols at the left and right of their occurrences. These repeats include maximal and super-maximal repeats as special extreme cases. We give a necessary and sufficient condition to bound their number linearly in the size of the sequence, and show an optimal algorithm that, given a suffix array, computes them in linear time independently of the size of the alphabet, as well as two other algorithms that are faster in practice. Additionally, we provide an independent and general framework for computing these and other repeats incrementally, extending the application space of repeats to a streaming setting.
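As an illustration of the idea of counting distinct left and right symbols around the occurrences of a repeat, here is a minimal brute-force sketch. It is not the linear-time suffix-array algorithm from the paper. The function names, the thresholds `x` and `y`, and the choice to treat the start and end of the text as one extra context symbol each are assumptions made for this example.

```python
from collections import defaultdict


def context_counts(text):
    """Map each repeated substring to (#distinct left symbols, #distinct right symbols).

    Brute force over all substrings (cubic time), for illustration only.
    The text boundaries count as one extra context symbol (None) on each side.
    """
    left = defaultdict(set)
    right = defaultdict(set)
    occurrences = defaultdict(int)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = text[i:j]
            occurrences[w] += 1
            left[w].add(text[i - 1] if i > 0 else None)
            right[w].add(text[j] if j < n else None)
    return {
        w: (len(left[w]), len(right[w]))
        for w, count in occurrences.items()
        if count >= 2  # keep only substrings that actually repeat
    }


def context_diverse_repeats(text, x, y):
    """Repeats with at least x distinct left and y distinct right context symbols.

    The thresholds x and y are hypothetical parameters for this sketch.
    """
    return sorted(
        w for w, (l, r) in context_counts(text).items() if l >= x and r >= y
    )


print(context_diverse_repeats("xabcyabcz", 2, 2))
print(context_diverse_repeats("xabcyabcz", 1, 1))
```

With both thresholds set to 2, the filter keeps the repeats that cannot be extended to the left or to the right without losing an occurrence, which is the usual characterization of maximal repeats.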

european conference on information retrieval | 2012

Full and mini-batch clustering of news articles with Star-EM

Matthias Gallé; Jean-Michel Renders

We present a new threshold-based clustering algorithm for news articles. The algorithm consists of two phases: in the first, a local optimum of a score function that captures the quality of a clustering is found with an Expectation-Maximization approach. In the second phase, the algorithm reduces the number of clusters and, in particular, is able to build non-spherically shaped clusters. We also give a mini-batch version which allows efficient dynamic processing of data points as they arrive in groups. Our experiments on the TDT5 benchmark collection show the superiority of both versions of this algorithm compared to other state-of-the-art alternatives.

international world wide web conferences | 2016

Enriching How-to Guides by Linking Actionable Phrases

Alexandr Chernov; Nikolaos Lagos; Matthias Gallé; Ágnes Sándor

The World Wide Web contains a large amount of community-created knowledge of an instructional nature. Similarly, in a commercial setting, databases of instructions are used by customer-care providers to guide clients in the resolution of issues. Most of these instructions are expressed in natural language. Knowledge Bases including such information are valuable through the sum of their single entries. However, as each entry is created mostly independently, users (e.g. other community members) cannot take advantage of the accumulated knowledge that can be developed via the aggregation of related entries. In this paper we consider the problem of inter-linking Knowledge Base entries, in order to get relevant information from other parts of the Knowledge Base. To achieve this, we propose to detect actionable phrases (text fragments that describe how to perform a certain action) and link them to other entries. The extraction method that we implement achieves an F-score of 67.35%. We also show that using actionable phrases results in better linking quality than using coarser-grained spans of text, as proposed in the literature. Besides the evaluation of both steps, we also include a detailed error analysis and release our annotation to the community.

knowledge discovery and data mining | 2015

Reconstructing Textual Documents from n-grams

Matthias Gallé; Matías Tealdi

We analyze the problem of reconstructing documents when we only have access to the n-grams (for n fixed) and their counts from the original documents. Formally, we are interested in recovering the longest contiguous substrings of whose presence in the original documents we are certain. We map this problem onto a de Bruijn graph, where the n-grams form the edges and where every Eulerian cycle gives a plausible reconstruction. We define two rules that reduce this graph, preserving all possible reconstructions while at the same time increasing the length of the edge labels. From a theoretical perspective we prove that the iterative application of these rules gives an irreducible graph equivalent to the original one. We then apply this to data from the Gutenberg project to measure the number and size of the obtained longest substrings. Moreover, we analyze how the n-gram corpus could be noised to prevent reconstruction, showing empirically that removing low-frequency n-grams has little impact. Instead, we propose another method that strategically adds fictitious n-grams, and show that a corpus noised in this way is much harder to reconstruct while only slightly increasing the perplexity of a language model obtained from it.
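The de Bruijn graph view can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions, not the reduction rules of the paper: it builds the multigraph whose nodes are (n-1)-grams and whose edges are the n-grams (repeated according to their counts), then walks one Eulerian path with Hierholzer's algorithm to obtain one plausible reconstruction. The function names are hypothetical, n is assumed to be at least 2, and the counts are assumed to come from a single token sequence, so that an Eulerian path exists.

```python
from collections import Counter, defaultdict


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def de_bruijn_graph(counts):
    """Nodes are (n-1)-grams; each n-gram is an edge, repeated by its count."""
    graph = defaultdict(list)
    for gram, count in sorted(counts.items()):
        graph[gram[:-1]].extend([gram[1:]] * count)
    return graph


def one_reconstruction(counts):
    """Return one token sequence whose n-gram counts equal `counts`.

    Uses Hierholzer's algorithm to find an Eulerian path. Other Eulerian
    paths may exist, each giving a different plausible reconstruction.
    """
    graph = de_bruijn_graph(counts)
    out_degree = {node: len(targets) for node, targets in graph.items()}
    in_degree = Counter(t for targets in graph.values() for t in targets)
    nodes = set(out_degree) | set(in_degree)
    # A path starts at the node with one more outgoing than incoming edge;
    # if there is none, the graph has an Eulerian cycle and any node works.
    starts = [u for u in nodes if out_degree.get(u, 0) - in_degree.get(u, 0) == 1]
    start = starts[0] if starts else min(nodes)

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph.get(node):
            stack.append(graph[node].pop())  # follow and consume an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    path.reverse()
    return list(path[0]) + [node[-1] for node in path[1:]]


original = "a b a c a b".split()
counts = ngram_counts(original, 2)
rebuilt = one_reconstruction(counts)
print(rebuilt)
print(ngram_counts(rebuilt, 2) == counts)
```

The rebuilt sequence need not equal the original one; it is only guaranteed to produce the same n-gram counts, which is why the paper focuses on the substrings that are shared by all possible reconstructions.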

international world wide web conferences | 2015

Roles for the Boys?: Mining Cast Lists for Gender and Role Distributions over Time

William Radford; Matthias Gallé

Film and television play an important role in popular culture. However, studies that require watching and annotating video are time-consuming and expensive to run at scale. We use information mined from media database cast lists to explore the evolution of different roles over time. We focus on the gender distribution of those roles and how this changes over time. Finally, we compare real-life census gender distributions to our web-mediated onscreen gender data. We propose that these methodologies are a useful adjunct to traditional analysis, allowing researchers to explore the relationship between online and onscreen gender depictions.

european conference on information retrieval | 2014

Boilerplate Detection and Recoding

Matthias Gallé; Jean-Michel Renders

Many information access applications have to tackle natural language texts that contain a large proportion of repeated and mostly invariable patterns (called boilerplates), such as automatic templates, headers, signatures and table formats. These domain-specific standard formulations are usually much longer than traditional collocations or standard noun phrases and typically cover one or more sentences. Such motifs clearly have a non-compositional meaning and an ideal document representation should reflect this phenomenon. We propose here a method that detects such motifs automatically and in an unsupervised way, and enriches the document representation by including specific features for these motifs. We experimentally show that this document recoding strategy leads to improved classification on different collections.

international world wide web conferences | 2013

Who broke the news?: an analysis on first reports of news events

Matthias Gallé; Jean-Michel Renders; Eric Karstens

We present a data-driven study on which sources were the first to report on news events. For this, we implemented a news aggregator that included a large number of established news sources and covered one year of data. We present a novel framework that is able to retrieve a large number of events and not only the most salient ones, while at the same time making sure that they are not exclusively of local impact. Our analysis then focuses on different aspects of the news cycle. In particular we analyze which sources break most of the news. By looking at when certain events become bursty, we are able to perform a finer analysis of those events and the associated sources that dominate the global news attention. Finally we study the time it takes news outlets to report on these events and how this reflects different strategies for choosing which news to report. A general finding of our study is that big news agencies remain an important threshold to cross to bring global attention to particular news, but it also shows the importance of focused (by region or topic) outlets.

robot and human interactive communication | 2017

Context-aware selection of multi-modal conversational fillers in human-robot dialogues

Matthias Gallé; Ekaterina Kynev; Nicolas Monet; Christophe Legras

We study the problem of handling the inter-turn pauses in a human-robot dialogue. In order to reduce the impression of elapsed time while the robot transcribes, understands and starts uttering a response, we propose to automatically generate conversational fillers to fill the silences. These fillers combine verbal utterances with body movements. We propose a Bayesian model that samples a filler whose production time is close to the expected computational time needed by the robot. To increase the sensation of engagement, the fillers also include contextual information gathered during the dialogue (such as the name of the interlocutor), if this information is present with high confidence. We evaluate this approach with an indirect user study measuring time perception, comparing three different strategies to overcome the inter-turn time (silence, static filler and our approach). The results show that users prefer the dynamic fillers, even when the conversation is objectively shorter with one of the other strategies.

human robot interaction | 2017

Robots as Conversational Mediators

Ekaterina Kynev; Nicolas Monet; Christophe Legras; Matthias Gallé

We propose a system in which a humanoid robot acts as a broker of a human-robot-human conversation, with one of the robot-human communications performed remotely. The robot plays the role of a mediator, summarizing relevant information, translating across languages or points of view and even censoring some information.
