Paolo Rosso

Polytechnic University of Valencia

Publication

Featured research published by Paolo Rosso.

language resources and evaluation | 2013

A multidimensional approach for detecting irony in Twitter

Antonio Reyes; Paolo Rosso; Tony Veale

Irony is a pervasive aspect of many online texts, one made all the more difficult by the absence of face-to-face contact and vocal intonation. As our media increasingly become more social, the problem of irony detection will become even more pressing. We describe here a set of textual features for recognizing irony at a linguistic level, especially in short texts created via social media such as Twitter postings or “tweets”. Our experiments concern four freely available data sets that were retrieved from Twitter using content words (e.g. “Toyota”) and user-generated tags (e.g. “#irony”). We construct a new model of irony detection that is assessed along two dimensions: representativeness and relevance. Initial results are largely positive, and provide valuable insights into the figurative issues facing tasks such as sentiment analysis, assessment of online reputations, or decision making.

language resources and evaluation | 2011

Cross-language plagiarism detection

Martin Potthast; Alberto Barrón-Cedeño; Benno Stein; Paolo Rosso

Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (1) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (2) state-of-the-art solutions for two important subtasks are reviewed, (3) retrieval models for the assessment of cross-language similarity are surveyed, and, (4) the three models CL-CNG, CL-ESA and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents which are selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on “exact” translations but does not generalize well.
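To make the idea behind CL-CNG more concrete, the following is a minimal sketch of a character n-gram similarity between two texts. It is not the authors' implementation: the function names, the normalization step, the choice of n = 3, and the use of raw counts with cosine similarity are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lightly normalized text."""
    # Assumption: lowercase the text and replace every run of non-word
    # characters (punctuation, whitespace) with a single space.
    cleaned = re.sub(r"[^\w]+", " ", text.lower()).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cl_cng_similarity(doc_a: str, doc_b: str, n: int = 3) -> float:
    """Hypothetical helper: compare two texts by their character n-grams."""
    return cosine_similarity(char_ngrams(doc_a, n), char_ngrams(doc_b, n))


if __name__ == "__main__":
    english = "The international organization approved the new regulation."
    spanish = "La organización internacional aprobó la nueva regulación."
    print(cl_cng_similarity(english, spanish))
```

Because languages with shared vocabulary also share many character sequences, a comparison of this kind needs no translation resources, which fits the abstract's observation that the approach suits syntactically related languages.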

IEEE Transactions on Knowledge and Data Engineering | 2010

Automatic Ontology Matching via Upper Ontologies: A Systematic Evaluation

Viviana Mascardi; Angela Locoro; Paolo Rosso

“Ontology matching” is the process of finding correspondences between entities belonging to different ontologies. This paper describes a set of algorithms that exploit upper ontologies as semantic bridges in the ontology matching process and presents a systematic analysis of the relationships among features of matched ontologies (number of simple and composite concepts, stems, concepts at the top level, common English suffixes and prefixes, and ontology depth), matching algorithms, used upper ontologies, and experiment results. This analysis allowed us to state under which circumstances the exploitation of upper ontologies gives significant advantages with respect to traditional approaches that do not use them. We ran experiments with SUMO-OWL (a restricted version of SUMO), OpenCyc, and DOLCE. The experiments demonstrate that when our “structural matching method via upper ontology” uses an upper ontology large enough (OpenCyc, SUMO-OWL), the recall is significantly improved while preserving the precision obtained without upper ontologies. Instead, our “nonstructural matching method” via OpenCyc and SUMO-OWL improves the precision and maintains the recall. The “mixed method” that combines the results of structural alignment without using upper ontologies and structural alignment via upper ontologies improves the recall and maintains the F-measure independently of the used upper ontology.

international conference on computational linguistics | 2009

ANERsys: An Arabic Named Entity Recognition System Based on Maximum Entropy

Yassine Benajiba; Paolo Rosso; José Miguel Benedí Ruiz

The task of Named Entity Recognition (NER) is to identify proper names, as well as temporal and numeric expressions, in open-domain text. NER systems have proved to be very important for many tasks in Natural Language Processing (NLP), such as Information Retrieval and Question Answering. Unfortunately, the main efforts to build reliable NER systems for the Arabic language have been made in a commercial setting, and neither the approach used nor the accuracy achieved is known. In this paper, we present ANERsys: a NER system built exclusively for Arabic texts and based on n-grams and maximum entropy. Furthermore, we present both the specific Arabic language-dependent heuristic and the gazetteers we used to boost our system. We developed our own training and test corpora (ANERcorp) and gazetteers (ANERgazet) to train, evaluate and boost the implemented technique. A major effort was made to ensure that all the experiments were carried out in the same framework as the CoNLL 2002 conference. We carried out several experiments, and the preliminary results showed that this approach successfully tackles the problem of NER for the Arabic language.

international conference on computational linguistics | 2011

Wikipedia vandalism detection: combining natural language, metadata, and reputation features

B. Thomas Adler; Luca de Alfaro; Santiago Moisés Mola-Velasco; Paolo Rosso; Andrew G. West

Wikipedia is an online encyclopedia which anyone can edit. While most edits are constructive, about 7% are acts of vandalism. Such behavior is characterized by modifications made in bad faith, introducing spam and other inappropriate content. In this work, we present the results of an effort to integrate three of the leading approaches to Wikipedia vandalism detection: a spatio-temporal analysis of metadata (STiki), a reputation-based system (WikiTrust), and natural language processing features. The performance of the resulting joint system improves on the state of the art set by all previous methods and establishes a new baseline for Wikipedia vandalism detection. We examine in detail the contribution of the three approaches, both for the task of discovering fresh vandalism, and for the task of locating vandalism in the complete set of Wikipedia revisions.

decision support systems | 2012

Making objective decisions from subjective data: Detecting irony in customer reviews

Antonio Reyes; Paolo Rosso

The research described in this work focuses on identifying key components for the task of irony detection. By means of analyzing a set of customer reviews, which are considered ironic both in social and mass media, we try to find hints about how to deal with this task from a computational point of view. Our objective is to gather a set of discriminating elements to represent irony, in particular, the kind of irony expressed in such reviews. To this end, we built a freely available data set with ironic reviews collected from Amazon. Such reviews were posted on the basis of an online viral effect; i.e. contents that trigger a chain reaction in people. The findings were assessed employing three classifiers. Initial results are largely positive, and provide valuable insights into the subjective issues of language facing tasks such as sentiment analysis, opinion mining and decision making.

empirical methods in natural language processing | 2008

Arabic Named Entity Recognition using Optimized Feature Sets

Yassine Benajiba; Mona T. Diab; Paolo Rosso

The Named Entity Recognition (NER) task has been garnering significant attention in NLP as it helps improve the performance of many natural language processing applications. In this paper, we investigate the impact of using different sets of features in two discriminative machine learning frameworks, namely, Support Vector Machines and Conditional Random Fields using Arabic data. We explore lexical, contextual and morphological features on eight standardized data-sets of different genres. We measure the impact of the different features in isolation, rank them according to their impact for each named entity class and incrementally combine them in order to infer the optimal machine learning approach and feature set. Our system yields an F-measure (β = 1) of 83.5 on ACE 2003 Broadcast News data.
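For readers unfamiliar with the reported score, the standard definition of the F-measure is given below. The symbols P (precision) and R (recall) are not named in the abstract and are introduced here only for illustration.

```latex
F_{\beta} = (1 + \beta^{2}) \, \frac{P \cdot R}{\beta^{2} P + R},
\qquad
F_{\beta = 1} = \frac{2 \, P \, R}{P + R}
```

With β = 1 the measure is the harmonic mean of precision and recall, so a high value requires both to be high at the same time.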

Computational Linguistics | 2013

Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection

Alberto Barrón-Cedeño; Marta Vila; Maria Antònia Martí; Paolo Rosso

Although paraphrasing is the linguistic mechanism underlying many plagiarism cases, little attention has been paid to its analysis in the framework of automatic plagiarism detection. Therefore, state-of-the-art plagiarism detectors find it difficult to detect cases of paraphrase plagiarism. In this article, we analyze the relationship between paraphrasing and plagiarism, paying special attention to which paraphrase phenomena underlie acts of plagiarism and which of them are detected by plagiarism detection systems. With this aim in mind, we created the P4P corpus, a new resource that uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection. The results of the Second International Competition on Plagiarism Detection were analyzed in the light of this annotation. The presented experiments show that (i) more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, (ii) lexical substitutions are the paraphrase mechanisms used the most when plagiarizing, and (iii) paraphrase mechanisms tend to shorten the plagiarized text. For the first time, the paraphrase mechanisms behind plagiarism have been analyzed, providing critical insights for the improvement of automatic plagiarism detection systems.

iberoamerican congress on pattern recognition | 2006

Authorship attribution using word sequences

Rosa María Coyotl-Morales; Luis Villaseñor-Pineda; Manuel Montes-y-Gómez; Paolo Rosso

Authorship attribution is the task of identifying the author of a given text. The main concern of this task is to define an appropriate characterization of documents that captures the writing style of authors. This paper proposes a new method for authorship attribution based on the idea that a proper identification of authors must consider both stylistic and topic features of texts. This method characterizes documents by a set of word sequences that combine functional and content words. The experimental results on poem classification demonstrated that this method outperforms most current state-of-the-art approaches, and that it is appropriate for handling the attribution of short documents.
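As a rough illustration of what characterizing a document by word sequences can look like, the sketch below counts contiguous word sequences and keeps those that recur. The abstract does not specify how the paper selects its sequences, so the function names, the whitespace tokenization, the maximum length of 3, and the frequency threshold are all assumptions, not the authors' procedure.

```python
from collections import Counter
from typing import Dict


def word_sequences(text: str, max_len: int = 3) -> Counter:
    """Count all contiguous word sequences of length 1..max_len."""
    # Assumption: lowercase and split on whitespace. Function words and
    # content words are both kept, so sequences can mix the two.
    words = text.lower().split()
    counts: Counter = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def frequent_sequences(text: str, min_count: int = 2, max_len: int = 3) -> Dict[str, int]:
    """Keep only the sequences that occur at least min_count times."""
    return {
        seq: count
        for seq, count in word_sequences(text, max_len).items()
        if count >= min_count
    }


if __name__ == "__main__":
    print(frequent_sequences("the sea and the sky and the sea"))
```

The resulting counts could serve as a feature vector for any standard classifier; which classifier the paper uses is not stated in the abstract.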

IEEE Transactions on Audio, Speech, and Language Processing | 2009

Arabic Named Entity Recognition: A Feature-Driven Study

Yassine Benajiba; Mona T. Diab; Paolo Rosso

The named entity recognition task aims at identifying and classifying named entities within an open-domain text. This task has been garnering significant attention recently as it has been shown to help improve the performance of many natural language processing applications. In this paper, we investigate the impact of using different sets of features in three discriminative machine learning frameworks, namely, support vector machines, maximum entropy and conditional random fields for the task of named entity recognition. Our language of interest is Arabic. We explore lexical, contextual and morphological features on nine data-sets of different genres and annotations. We measure the impact of the different features in isolation and incrementally combine them in order to evaluate the robustness to noise of each approach. We achieve the highest performance using a combination of 15 features in conditional random fields using broadcast news data (F-measure with β = 1: 83.34).
