Marcos Garcia
University of Santiago de Compostela
Publication
Featured research published by Marcos Garcia.
international conference on computational linguistics | 2014
Pablo Gamallo; Marcos Garcia
This article describes a strategy based on a Naive Bayes classifier for detecting the polarity of English tweets. Experiments show that the best performance is achieved with a binary classifier that distinguishes just two sharp polarity categories: positive and negative. In addition, to separate tweets with polarity from those without, the system uses a very basic rule that searches for polarity words within the analysed tweets. When the classifier is provided with a polarity lexicon and multiwords, it achieves a 63% F-score.
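The approach in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' system: the training tweets, the `POLARITY_LEXICON` set, and the function names are all hypothetical, and add-one smoothing is an assumption the abstract does not mention.

```python
import math
from collections import Counter

# Toy data and lexicon: illustrative assumptions, not the paper's resources.
TRAIN = [
    ("i love this great phone", "positive"),
    ("what a wonderful happy day", "positive"),
    ("i hate this awful service", "negative"),
    ("terrible sad boring movie", "negative"),
]
POLARITY_LEXICON = {
    "love", "great", "wonderful", "happy",
    "hate", "awful", "terrible", "sad", "boring",
}


def train_naive_bayes(examples):
    """Collect per-class word counts, class counts, and the vocabulary."""
    word_counts = {"positive": Counter(), "negative": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["positive"]) | set(word_counts["negative"])
    return word_counts, class_counts, vocab


def classify(text, model, lexicon=POLARITY_LEXICON):
    """Return 'positive', 'negative', or 'neutral' for one tweet."""
    word_counts, class_counts, vocab = model
    tokens = text.lower().split()
    # Basic rule: a tweet with no lexicon word is treated as having no polarity.
    if not any(tok in lexicon for tok in tokens):
        return "neutral"
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("positive", "negative"):
        # Log prior plus add-one-smoothed log likelihoods.
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


MODEL = train_naive_bayes(TRAIN)
print(classify("i love this happy day", MODEL))
print(classify("the meeting starts at noon", MODEL))
```

The lexicon check runs before the binary classifier, so the classifier itself only ever chooses between the two sharp categories.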
portuguese conference on artificial intelligence | 2015
Pablo Gamallo; Marcos Garcia
Open Information Extraction (OIE) is a recent unsupervised strategy for extracting large numbers of basic propositions (verb-based triples) from massive text corpora, and it scales to Web-size document collections. We propose a multilingual rule-based OIE method that takes as input dependency parses in the CoNLL-X format, identifies argument structures within the dependency parses, and extracts a set of basic propositions from each argument structure. Our method requires no training data and, according to experimental studies, obtains higher recall and higher precision than existing approaches relying on training data. Experiments were performed in three languages: English, Portuguese, and Spanish.
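As a rough sketch of the idea in this abstract, the snippet below reads one sentence in the 10-column CoNLL-X layout and emits a (subject, verb, object) triple for each verb. The dependency labels `nsubj` and `dobj`, the coarse tag `VERB`, the sample sentence, and all function names are assumptions made for illustration; the method's actual rules for identifying argument structures are not given in the abstract.

```python
def parse_conllx(block):
    """Parse one CoNLL-X sentence (10 tab-separated columns per token)."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),    # ID
            "form": cols[1],       # FORM
            "cpos": cols[3],       # CPOSTAG
            "head": int(cols[6]),  # HEAD
            "deprel": cols[7],     # DEPREL
        })
    return tokens


def subtree_ids(tokens, root_id):
    """Return the ids of root_id and all of its descendants, in order."""
    ids = {root_id}
    changed = True
    while changed:
        changed = False
        for tok in tokens:
            if tok["head"] in ids and tok["id"] not in ids:
                ids.add(tok["id"])
                changed = True
    return sorted(ids)


def phrase(tokens, root_id):
    """Join the word forms of the subtree rooted at root_id."""
    by_id = {t["id"]: t for t in tokens}
    return " ".join(by_id[i]["form"] for i in subtree_ids(tokens, root_id))


def extract_triples(tokens, subj_label="nsubj", obj_label="dobj"):
    """Emit one triple per (subject, object) pair attached to each verb."""
    triples = []
    for verb in tokens:
        if verb["cpos"] != "VERB":
            continue
        deps = [t for t in tokens if t["head"] == verb["id"]]
        subjects = [t for t in deps if t["deprel"] == subj_label]
        objects = [t for t in deps if t["deprel"] == obj_label]
        for subj in subjects:
            for obj in objects:
                triples.append((phrase(tokens, subj["id"]),
                                verb["form"],
                                phrase(tokens, obj["id"])))
    return triples


# Hypothetical sample sentence, built with explicit tabs.
SAMPLE_ROWS = [
    ("1", "The", "the", "DET", "DT", "_", "2", "det", "_", "_"),
    ("2", "parser", "parser", "NOUN", "NN", "_", "3", "nsubj", "_", "_"),
    ("3", "extracts", "extract", "VERB", "VBZ", "_", "0", "root", "_", "_"),
    ("4", "basic", "basic", "ADJ", "JJ", "_", "5", "amod", "_", "_"),
    ("5", "propositions", "proposition", "NOUN", "NNS", "_", "3", "dobj", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS)

print(extract_triples(parse_conllx(SAMPLE)))
```

Because the rules operate on dependency labels rather than on words, the same extraction step could in principle be reused across languages whose parsers share a label set, which is the property the abstract relies on.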
portuguese conference on artificial intelligence | 2011
Pablo Gamallo; Marcos Garcia
We propose a resource-based Named Entity Classification (NEC) system, which combines named entity extraction with simple language-independent heuristics. Large lists (gazetteers) of named entities are automatically extracted using semi-structured information from Wikipedia, namely infoboxes and category trees. Language-independent heuristics are used to disambiguate and classify entities that have already been identified (or recognized) in text. We compare the performance of our resource-based system with that of a supervised NEC module implemented for the FreeLing suite, which was the winning system in the CoNLL-2002 competition. Experiments were performed over Portuguese text corpora, taking into account several domains and genres.
Tetrahedron | 2002
María José Figueira; Olga Caamaño; Franco Fernández; Xerardo García-Mera; Marcos Garcia
Aminoalcohol 3, a compound of interest for the synthesis of carbocyclic analogs of nucleosides, was prepared from (±)-(2endo,3exo)bicyclo[2.2.1]hept-5-ene-2,3-dimethanol. In the key step, oxidative degradation of a carboxamide was efficiently achieved by treatment of amidoester 10 with lead tetraacetate in tert-butanol.
symposium on languages applications and technologies | 2015
Marcos Garcia; Pablo Gamallo
This paper presents the current development of a multilingual suite for Natural Language Processing. It consists of a sentence chunker, a tokenizer, a PoS-tagger, a dictionary-based lemmatizer and a Named Entity Recognizer (for both enamex and numex expressions). The architecture of the pipeline and the main resources used for its development are described. In addition, the PoS-tagger and the Named Entity Recognizer are evaluated against several state-of-the-art systems. The experiments performed in Portuguese and English show that, in spite of its simplicity, our system competes with some well-known NLP tools. It is entirely written in Perl and distributed under a GPL license.
Linguamática | 2017
Pablo Gamallo; Marcos Garcia
This article presents LinguaKit, a multilingual suite of tools for linguistic analysis, extraction, annotation and correction. LinguaKit supports tasks as diverse as lemmatization, part-of-speech tagging and syntactic parsing (among others), and also includes applications for sentiment analysis (or opinion mining), multiword term extraction, and conceptual annotation and linking to encyclopedic resources such as DBpedia. Most modules work for four language varieties: Portuguese, Spanish, English and Galician. LinguaKit is written in Perl, and the code is available under the free GPLv3 license.
Natural Language Engineering | 2015
Marcos Garcia; Pablo Gamallo
Machine learning techniques have been implemented to extract instances of semantic relations using diverse features based on linguistic knowledge, such as tokens, lemmas, PoS-tags, or dependency paths. However, there has been little work on determining which of these features work best in the relation extraction task, and even less in languages other than English. In this paper, various features representing different levels of linguistic knowledge are systematically evaluated for biographical relation extraction. The effectiveness of these features was measured by training several supervised classifiers that differ only in the type of linguistic knowledge used to define their features. The experiments performed in this paper show that some basic linguistic knowledge (provided by lemmas and their combination in bigrams) performs better than more complex features, such as those based on syntactic analysis. Furthermore, some feature combinations using different levels of analysis are proposed in order (i) to avoid feature overlapping and (ii) to evaluate the use of computationally inexpensive and widespread tools such as tokenization and lemmatization. This paper also describes two new freely available corpora for biographical relation extraction in Portuguese and Spanish, built by means of a distant-supervision strategy. Experiments were performed with five semantic relations and two languages, using these corpora.
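The lemma and lemma-bigram features mentioned in this abstract could be built along the following lines. This is only a sketch under assumptions: the function name, the feature-key format, and the choice of binary values are hypothetical, and the paper's exact feature definitions are not stated here.

```python
def lemma_features(lemmas):
    """Build binary features from lemmas and from adjacent lemma pairs."""
    features = {f"lemma={lemma}": 1 for lemma in lemmas}
    for first, second in zip(lemmas, lemmas[1:]):
        features[f"bigram={first}_{second}"] = 1
    return features


# Hypothetical lemmatized context between two entity mentions.
print(lemma_features(["be", "bear", "in"]))
```

A dictionary of this shape can be passed to any supervised classifier that accepts sparse feature maps, and it needs only tokenization and lemmatization, the inexpensive tools the abstract highlights.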
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, August 3-4, 2017, Vancouver, Canada, ISBN 978-1-945626-70-8, pp. 274-282 | 2017
Marcos Garcia; Pablo Gamallo
This article describes MetaRomance, a rule-based cross-lingual parser for Romance languages submitted to the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. The system is an almost delexicalized parser which does not need training data to analyze Romance languages. It contains linguistically motivated rules based on PoS-tag patterns. The rules included in MetaRomance were developed in about 12 hours by one expert with no prior knowledge of Universal Dependencies, and can be easily extended using a transparent formalism. In this paper we compare the performance of MetaRomance with other supervised systems participating in the competition, paying special attention to the parsing of different treebanks of the same language. We also compare our system with a delexicalized parser for Romance languages, and take advantage of the harmonized annotation of Universal Dependencies to propose a language ranking based on the syntactic distance each variety has from Romance languages.
processing of the portuguese language | 2012
Marcos Garcia; Isaac J. González
Automatic phonetic transcription tools usually perform phonetic transcriptions directly from orthographic representations. Although these approaches often achieve good results, theoretical studies suggest that including morphophonological knowledge allows those systems to improve their performance. Following this idea, we developed a tool which first obtains an underlying representation of each word, using small lexica and dedicated lemmatizers. For each representation, a phonological derivation generates the phonetic transcription by applying linguistically motivated rules. Since most of these rules are added as optional parameters, the system can generate dialect-specific transcriptions. This system is not only a grapheme-to-phone tool: it also obtains phonological representations and evaluates several linguistic processes occurring during the derivation. Preliminary experiments emulating a phonological system of Galician (using as input words spelled in European Portuguese) show that the underlying representation of most words can be obtained using small lexica and also that the derivation produces high-quality phonetic transcriptions.
processing of the portuguese language | 2016
Marcos Garcia
Relation extraction is a subtask of information extraction that aims at obtaining instances of semantic relations present in texts. This information can be arranged in machine-readable formats, useful for several applications that need structured semantic knowledge. The work presented in this paper explores different strategies to automate the extraction of semantic relations from texts in Portuguese, Galician and Spanish. Both machine learning (distant-supervised and supervised) and rule-based techniques are investigated, and the impact of the different levels of linguistic knowledge is analyzed for the various approaches. Regarding domains, the experiments focus on the extraction of encyclopedic knowledge, by means of the development of biographical relation classifiers (in a closed domain) and the evaluation of an open information extraction tool. To implement the extraction systems, several natural language processing tools have been built for the three research languages, from sentence splitting and tokenization modules to part-of-speech taggers, named entity recognizers and coreference resolution systems. Furthermore, several lexica and corpora have been compiled and enriched with different levels of linguistic annotation, which are useful for both training and testing probabilistic and symbolic models. As a result of this work, new resources and tools are available for the automated processing of texts in Portuguese, Galician and Spanish.