Publication


Featured research published by Joaquim Ferreira da Silva.


portuguese conference on artificial intelligence | 1999

Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units

Joaquim Ferreira da Silva; Gaël Dias; Sylvie Guilloré; José Gabriel Pereira Lopes

The availability of contiguous and non-contiguous multiword lexical units (MWUs) in Natural Language Processing (NLP) lexica enhances parsing precision, helps attachment decisions, improves indexing in information retrieval (IR) systems, reinforces information extraction (IE) and text mining, among other applications. Unfortunately, their acquisition has long been a significant problem in NLP, IR and IE. In this paper we propose two new association measures, the Symmetric Conditional Probability (SCP) and the Mutual Expectation (ME), for the extraction of contiguous and non-contiguous MWUs. Both measures are used by a new algorithm, the LocalMaxs, that requires neither empirically obtained thresholds nor complex linguistic filters. We assess the results obtained by both measures by comparing them with reference association measures (Specific Mutual Information, φ², Dice and Log-Likelihood coefficients) over a multilingual parallel corpus. An additional experiment has been carried out over a part-of-speech tagged Portuguese corpus for extracting contiguous compound verbs.
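
As a rough illustration of the SCP measure named in the abstract above, the sketch below scores adjacent word pairs. It assumes the commonly cited bigram form SCP(x, y) = p(x, y)² / (p(x) · p(y)) and estimates all probabilities with the same normalizer, so the score reduces to a ratio of counts. The function name `scp_bigrams` and the sample sentence are hypothetical. This is not the authors' implementation, and it covers neither n-grams longer than two words nor the LocalMaxs selection step.

```python
from collections import Counter


def scp_bigrams(tokens):
    """Return {(w1, w2): SCP score} for every adjacent word pair in tokens.

    Assumed definition: SCP(x, y) = p(x, y)^2 / (p(x) * p(y)).
    With one shared normalizer this equals count(x, y)^2 / (count(x) * count(y)).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), pair_count in bigrams.items():
        scores[(w1, w2)] = pair_count ** 2 / (unigrams[w1] * unigrams[w2])
    return scores


# Hypothetical toy input: "new york" always co-occurs, "is big" only sometimes.
tokens = "new york is big and new york is old".split()
scores = scp_bigrams(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Pairs whose words almost always appear together score close to 1, while pairs formed by chance score lower, which is the kind of cohesion signal an extraction algorithm can then compare across neighboring n-grams.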


computational intelligence for modelling, control and automation | 2006

Identification of Document Language is Not yet a Completely Solved Problem

Joaquim Ferreira da Silva; Gabriel Pereira Lopes

Existing Language Identification (LID) approaches do reach 100% precision in most common situations, when dealing with documents written in just one language and when those documents are large enough. However, LID approaches do not provide a reliable solution for some situations, such as when there is a need to discriminate the correct variant of the language used in a text, for example, European or Brazilian variants of Portuguese, UK or USA English variants, or any other language variants. Another hard context occurs with small touristic advertisements on the web, addressing foreigners but using the local language to name most local entities. In this paper, we present a fully statistics-based LID approach which learns the most discriminant information according to each context, and identifies the correct language or language variant a text is written in. This methodology is shown to be correct for normal texts and maintains its robustness in hard LID contexts.


international conference on data mining | 2001

Document clustering and cluster topic extraction in multilingual corpora

Joaquim Ferreira da Silva; João T. Mexia; A. Coelho; Gabriel Pereira Lopes

A statistics-based approach for clustering documents and for extracting cluster topics is described. Relevant (meaningful) expressions (REs) automatically extracted from corpora are used as clustering base features. These features are transformed and their number is strongly reduced in order to obtain a small set of document classification features. This is achieved on the basis of principal components analysis. Model-based clustering analysis finds the best number of clusters. Then, the most important REs are extracted from each cluster and taken as document cluster topics.


international conference on conceptual structures | 2012

Mining concepts from texts

João Ventura; Joaquim Ferreira da Silva

The extraction of multi-word relevant expressions has been an increasingly hot topic in the last few years. Relevant expressions are applicable in diverse areas such as Information Retrieval, document clustering, or classification and indexing of documents. However, relevant single-words, which represent much of the knowledge in texts, have been a relatively dormant field. In this paper we present a statistical language-independent approach to extract concepts formed by relevant single and multi-word units. By achieving promising precision/recall values, it can be an alternative both to language dependent approaches and to extractors that deal exclusively with multi-words.


Archive | 2008

Ranking and Extraction of Relevant Single Words in Text

João Ventura; Joaquim Ferreira da Silva

The extraction of keywords is currently a very important technique used in several applications, for instance, the characterization of document topics. In this case, by extracting the right keywords on a query, one could easily know what documents should be read and what documents should be put aside. However, while the automatic extraction of multiword units has been an active research field in the scientific community, the automatic extraction of single words, or unigrams, has been basically ignored due to its intrinsic difficulty. Meanwhile, it is easy to demonstrate that in a process of keyword extraction, leaving unigrams out impoverishes, to a certain extent, the quality of the final result. Take the following example:


portuguese conference on artificial intelligence | 2001

Multilingual Document Clustering, Topic Extraction and Data Transformations

Joaquim Ferreira da Silva; João T. Mexia; Carlos A. Coelho; José Gabriel Pereira Lopes

This paper describes a statistics-based approach for clustering documents and for extracting cluster topics. Relevant Expressions (REs) are extracted from corpora and used as clustering base features. These features are transformed and then, by using an approach based on Principal Components Analysis, a small set of document classification features is obtained. The best number of clusters is found by Model-Based Clustering Analysis. Data transformations to approximate a normal distribution are applied and results are discussed. The most important REs are extracted from each cluster and taken as cluster topics.


portuguese conference on artificial intelligence | 2009

A Document Descriptor Extractor Based on Relevant Expressions

Joaquim Ferreira da Silva; Gabriel Pereira Lopes

People are often asked to associate keywords with documents to enable applications to access the summarized core content of documents. This fact was the main motivation to work on an approach that may contribute to moving from this manual procedure to an automatic one. Since Relevant Expressions (REs) or multi-word term expressions can be automatically extracted using the LocalMaxs algorithm, the most relevant ones can be used to describe the core content of each document. In this paper we present a language-independent approach for automatic generation of document descriptors. Results are shown for three different European languages and comparisons are made concerning different metrics for selecting the most informative REs of each document.


portuguese conference on artificial intelligence | 2007

Detection of strange and wrong automatic part-of-speech tagging

Vitor Rocio; Joaquim Ferreira da Silva; Gabriel Pereira Lopes

Automatic morphosyntactic tagging of corpora is usually imperfect. Wrong or strange tagging may be automatically repeated following some patterns. It is usually hard to manually detect all these errors, as corpora may contain millions of tags. This paper presents an approach to detect sequences of part-of-speech tags that have an internal cohesiveness in corpora. Some sequences correspond to syntactic chunks or correct sequences, but some are strange or incorrect, usually due to systematically wrong tagging. The amount of time spent in separating incorrect bigrams and trigrams from correct ones is very small, but it allows us to detect 70% of all tagging errors in the corpus.


portuguese conference on artificial intelligence | 2007

New techniques for relevant word ranking and extraction

João Ventura; Joaquim Ferreira da Silva

In this paper we first propose two new metrics to rank the relevance of words in a text. The metrics presented are purely statistical and language independent, and are based on the analysis of each word's neighborhood. Typically, a relevant word is more strongly connected to some of its neighbors than to others. We also present a new technique based on syllable analysis and show that, although it can be a metric by itself, it can also improve the quality of the proposed methods and greatly improve the quality of other existing methods (such as Tf-idf). Finally, based on the rankings previously obtained and using another neighborhood analysis, we present a new method to decide about the relevance of words on a yes/no basis.


portuguese conference on artificial intelligence | 2013

Automatic Extraction of Explicit and Implicit Keywords to Build Document Descriptors

João Ventura; Joaquim Ferreira da Silva

Keywords are single and multiword terms that describe the semantic content of documents. They are useful in many applications, such as document searching and indexing, or to be read by humans. Keywords can be explicit, by occurring in documents, or implicit, since, although not explicitly written in documents, they are semantically related to their contents. This paper presents a statistical approach to build document descriptors with explicit and implicit keywords automatically extracted from the documents. Our approach is language-independent and we show comparative results for three different European languages.

Collaboration

Top Co-Authors

João Ventura
Universidade Nova de Lisboa

João Casteleiro
Universidade Nova de Lisboa

Sofia Cavaco
Universidade Nova de Lisboa

Pablo Gamallo
University of Santiago de Compostela

Carlos A. Coelho
Universidade Nova de Lisboa

Gaël Dias
Universidade Nova de Lisboa