Publication


Featured research published by Helena Gómez-Adorno.


Sensors | 2016

Automatic Authorship Detection Using Textual Patterns Extracted from Integrated Syntactic Graphs

Helena Gómez-Adorno; Grigori Sidorov; David Pinto; Darnes Vilariño; Alexander F. Gelbukh

We apply the integrated syntactic graph feature extraction methodology to the task of automatic authorship detection. This graph-based representation allows integrating different levels of language description into a single structure. We extract textual patterns based on features obtained from shortest path walks over integrated syntactic graphs and apply them to determine the authors of documents. On average, our method outperforms the state-of-the-art approaches and, unlike existing methods, gives consistently high results across different corpora. Our results show that our textual patterns are useful for the task of authorship attribution.
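As a rough sketch of the underlying idea (not the authors' implementation), the toy example below collects label sequences along shortest paths from the root of a small graph; the graph, node labels, and function names are all invented for illustration:

```python
from collections import deque

def shortest_path(graph, source, target):
    """BFS shortest path in an unweighted, directed graph given as
    {node: [neighbors]}; returns the node sequence or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

def path_patterns(graph, labels):
    """Collect label sequences along shortest paths from the root to
    every other node; such sequences act as textual patterns."""
    patterns = []
    for target in graph:
        if target == "ROOT":
            continue
        path = shortest_path(graph, "ROOT", target)
        if path:
            patterns.append("-".join(labels[n] for n in path))
    return patterns

# Toy stand-in for an integrated syntactic graph of "the cat sleeps",
# with POS-like node labels.
graph = {"ROOT": ["sleeps"], "sleeps": ["cat"], "cat": ["the"], "the": []}
labels = {"ROOT": "ROOT", "sleeps": "VBZ", "cat": "NN", "the": "DT"}
print(path_patterns(graph, labels))
# -> ['ROOT-VBZ', 'ROOT-VBZ-NN', 'ROOT-VBZ-NN-DT']
```

In the paper the graphs integrate several description levels (lexical, syntactic, morphological); here a single label per node keeps the sketch short.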


Computational Intelligence and Neuroscience | 2016

Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts

Helena Gómez-Adorno; Ilia Markov; Grigori Sidorov; Juan Pablo Posadas-Durán; Miguel A. Sanchez-Perez; Liliana Chanona-Hernández

We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. The resource includes dictionaries of slang words, contractions, abbreviations, and emoticons commonly used in social media. Each of the dictionaries was built for the English, Spanish, Dutch, and Italian languages. The resource is freely available.
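A minimal sketch of how such a resource can be applied during preprocessing; the dictionary entries below are illustrative stand-ins, not the actual resource, which is far larger and covers four languages:

```python
# Illustrative entries only; the real resource contains full dictionaries
# of slang, contractions, abbreviations, and emoticons.
SLANG = {"u": "you", "gr8": "great", "b4": "before"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not"}
EMOTICONS = {":)": "<smile>", ":(": "<sad>"}

def normalize(text):
    """Replace emoticons, contractions, and slang token by token."""
    out = []
    for tok in text.lower().split():
        tok = EMOTICONS.get(tok, tok)
        tok = CONTRACTIONS.get(tok, tok)
        tok = SLANG.get(tok, tok)
        out.append(tok)
    return " ".join(out)

print(normalize("u can't be gr8 :)"))
# -> "you cannot be great <smile>"
```

The normalized text can then be fed to any downstream feature representation, such as the neural network model the paper describes.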


North American Fuzzy Information Processing Society | 2015

Computing text similarity using Tree Edit Distance

Grigori Sidorov; Helena Gómez-Adorno; Ilia Markov; David Pinto; Nahun Loya

In this paper, we propose the application of the Tree Edit Distance (TED) to calculate the similarity between syntactic n-grams for the further detection of soft similarity between texts. The computation of text similarity is a basic task for many natural language processing problems, and it is an open research field. Syntactic n-grams are text features for Vector Space Model construction, extracted from dependency trees. Soft similarity is the application of the Vector Space Model that takes into account the similarity of features. First, we discuss the advantages of applying the TED to syntactic n-grams. Then, we present a procedure based on the TED and syntactic n-grams for calculating the soft similarity between texts.
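The soft similarity computation can be sketched as follows; the feature-similarity matrix is assumed to be given, e.g. derived from TED scores between pairs of syntactic n-grams (one common choice is 1 / (1 + TED)):

```python
import math

def soft_cosine(a, b, sim):
    """Soft cosine similarity between feature vectors a and b, where
    sim[i][j] is the similarity between features i and j (e.g. computed
    from the Tree Edit Distance between two syntactic n-grams)."""
    def dot(x, y):
        return sum(sim[i][j] * x[i] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two texts share no feature exactly, yet their features are related.
a = [1, 0]          # text 1 uses feature 0
b = [0, 1]          # text 2 uses feature 1
sim = [[1.0, 0.5],  # features 0 and 1 are half-similar
       [0.5, 1.0]]
print(soft_cosine(a, b, sim))   # -> 0.5 (plain cosine would give 0.0)
```

With the identity matrix as sim, the function reduces to the ordinary cosine similarity, which is the point of the "soft" generalization.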


Ibero-American Conference on Artificial Intelligence | 2014

Content and Style Features for Automatic Detection of Users’ Intentions in Tweets

Helena Gómez-Adorno; David Pinto; Manuel Montes; Grigori Sidorov; Rodrigo Alfaro

The aim of this paper is to evaluate the use of content and style features in the automatic classification of the intentions of tweets. For this, we propose different style features and evaluate them using a machine learning approach. We found that although the style features by themselves are useful for identifying the intentions of tweets, it is better to combine them with content features. We present a set of experiments in which the combination of content and style features improved the overall classification performance by 9.46% compared with content features alone.


Soft Computing | 2017

Application of the distributed document representation in the authorship attribution task for small corpora

Juan Pablo Posadas-Durán; Helena Gómez-Adorno; Grigori Sidorov; Ildar Z. Batyrshin; David Pinto; Liliana Chanona-Hernández

Distributed word representation in a vector space (word embeddings) is a novel technique that allows representing words in terms of the elements in their neighborhood. Distributed representations can be extended to larger language structures like phrases, sentences, paragraphs, and documents. The capability to encode semantic information of texts and the ability to handle high-dimensional datasets are the reasons why this representation is widely used in various natural language processing tasks such as text summarization, sentiment analysis, and syntactic parsing. In this paper, we propose to use the distributed representation at the document level to solve the task of authorship attribution. The proposed method learns distributed vector representations at the document level and then uses an SVM classifier to perform the automatic authorship attribution. We also propose to use word n-grams (instead of words) as the input data type for learning the distributed representation model. We conducted experiments over six datasets used in state-of-the-art works and, for the majority of the datasets, obtained comparable or better results. Our best results were obtained using the combination of words and n-grams of words as the input data types. The method remained effective even though the training data were relatively scarce.


Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) | 2017

Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words

Helena Gómez-Adorno; Ilia Markov; Jorge Baptista; Grigori Sidorov; David Pinto

This paper presents the cic_ualg’s system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms – Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
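To illustrate the distinction between untyped and typed character n-grams, here is a simplified sketch: each n-gram is labeled by its position within a word. The published typing scheme distinguishes more categories (e.g. affix and punctuation subtypes); the function name and categories below are a reduced illustration:

```python
def typed_char_ngrams(text, n=3):
    """Character n-grams labeled by position within a word:
    'prefix', 'suffix', 'whole', or 'mid'. A simplified version of
    typed n-grams; untyped n-grams would drop the labels."""
    grams = []
    for word in text.split():
        if len(word) <= n:
            grams.append(("whole", word))
            continue
        for i in range(len(word) - n + 1):
            g = word[i:i + n]
            if i == 0:
                grams.append(("prefix", g))
            elif i == len(word) - n:
                grams.append(("suffix", g))
            else:
                grams.append(("mid", g))
    return grams

print(typed_char_ngrams("similar languages"))
```

Typing lets the classifier treat, say, a suffix "lar" differently from the same characters occurring mid-word, which carries stylistic and morphological signal that untyped n-grams collapse.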


Computing | 2018

Document embeddings learned on various types of n-grams for cross-topic authorship attribution

Helena Gómez-Adorno; Juan Pablo Posadas-Durán; Grigori Sidorov; David Pinto

Recently, document embedding methods have been proposed that aim at capturing hidden properties of texts. These methods represent documents as fixed-length, continuous, and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words, using the recently proposed Paragraph Vector method. These n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We considered the task of cross-topic authorship attribution and carried out experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.
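A sketch of how documents can be turned into n-gram token streams before embedding; the Paragraph Vector training itself is omitted, the helper names are invented, and the POS tags are assumed to come from an external tagger:

```python
def word_ngrams(tokens, n):
    """Word (or POS-tag) n-grams joined with '_' to act as pseudo-words."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Overlapping character n-grams of the raw text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Each document yields several token streams (n = 1..5 in the paper);
# a Paragraph Vector model is then trained on each stream separately.
doc = "the cat sat"
pos = ["DT", "NN", "VBD"]   # assumed output of a POS tagger
streams = {
    "word_2grams": word_ngrams(doc.split(), 2),
    "char_3grams": char_ngrams(doc, 3),
    "pos_2grams": word_ngrams(pos, 2),
}
print(streams["word_2grams"])   # -> ['the_cat', 'cat_sat']
```

Because each n-gram is serialized into a single token, any off-the-shelf Paragraph Vector implementation can consume these streams unchanged.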


Cross-Language Evaluation Forum | 2017

Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus

Miguel A. Sanchez-Perez; Ilia Markov; Helena Gómez-Adorno; Grigori Sidorov

We compare the performance of character n-gram features (n = 3-8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5-8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1-2 for words and n = 3-8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.


International Journal of Pattern Recognition and Artificial Intelligence | 2018

Plagiarism Detection with Genetic-Based Parameter Tuning

Miguel A. Sanchez-Perez; Alexander F. Gelbukh; Grigori Sidorov; Helena Gómez-Adorno

A crucial step in plagiarism detection is text alignment. This task consists of finding similar text fragments between two given documents. We introduce an optimization methodology based on genetic algorithms to improve the performance of a plagiarism detection model by optimizing its input parameters. The implementation of the genetic algorithm is based on nonbinary representation of individuals, elitism selection, uniform crossover, and a high mutation rate. The obtained parameter settings allow the plagiarism detection model to achieve better results than the state-of-the-art approaches.
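A minimal genetic algorithm with the stated design choices can be sketched as follows; the toy objective stands in for the detector's evaluation score, and all parameter values and names are illustrative, not those used in the paper:

```python
import random

def evolve(fitness, bounds, pop_size=20, gens=40,
           elite=2, mut_rate=0.3, seed=0):
    """Minimal GA mirroring the design choices above: real-valued
    (nonbinary) genes, elitism, uniform crossover, and a high
    per-gene mutation rate."""
    rng = random.Random(seed)
    new_ind = lambda: [rng.uniform(lo, hi) for lo, hi in bounds]
    pop = [new_ind() for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        nxt = pop[:elite]                                  # elitism
        while len(nxt) < pop_size:
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)    # select parents
            child = [g1 if rng.random() < 0.5 else g2      # uniform crossover
                     for g1, g2 in zip(p1, p2)]
            child = [rng.uniform(lo, hi) if rng.random() < mut_rate else g
                     for g, (lo, hi) in zip(child, bounds)]  # mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# Toy objective standing in for detector performance; in the real setup
# each fitness evaluation would run the text-alignment model with the
# candidate parameters. The optimum here is at (0.5, 0.2).
f = lambda p: -((p[0] - 0.5) ** 2 + (p[1] - 0.2) ** 2)
best = evolve(f, bounds=[(0.0, 1.0), (0.0, 1.0)])
print(best)
```

In the paper each individual encodes the detector's input parameters (e.g. similarity thresholds), so a fitness evaluation is expensive; the small population and generation counts here are only for the demo.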


North American Chapter of the Association for Computational Linguistics | 2016

CICBUAPnlp at SemEval-2016 Task 4-A: Discovering Twitter Polarity using Enhanced Embeddings

Helena Gómez-Adorno; Darnes Vilariño; Grigori Sidorov; David Pinto Avendaño

This paper presents our approach for SemEval 2016 Task 4: Sentiment Analysis in Twitter. We participated in Subtask A: Message Polarity Classification, whose aim is to classify Twitter messages into positive, neutral, and negative polarity. We used a lexical resource for preprocessing social media data and trained a neural network model for feature representation. Our resource includes dictionaries of slang words, contractions, abbreviations, and emoticons commonly used in social media. For the classification process, we pass the features obtained in an unsupervised manner to an SVM classifier.

Collaboration

Helena Gómez-Adorno's top co-authors:

Grigori Sidorov (Instituto Politécnico Nacional)
David Pinto (Benemérita Universidad Autónoma de Puebla)
Ilia Markov (Benemérita Universidad Autónoma de Puebla)
Alexander F. Gelbukh (Instituto Politécnico Nacional)
Miguel A. Sanchez-Perez (Instituto Politécnico Nacional)
Darnes Vilariño Ayala (Benemérita Universidad Autónoma de Puebla)
Ildar Z. Batyrshin (Instituto Politécnico Nacional)