Publications


Featured research published by Liliana Chanona-Hernández.


Expert Systems with Applications | 2014

Syntactic N-grams as machine learning features for natural language processing

Grigori Sidorov; Francisco Velasquez; Efstathios Stamatatos; Alexander F. Gelbukh; Liliana Chanona-Hernández

In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. For sn-grams, neighbors are determined by following syntactic relations in syntactic trees, not by taking words as they appear in a text; that is, sn-grams are constructed by following paths in syntactic trees. In this way, sn-grams bring syntactic knowledge into machine learning methods, although prior parsing is required for their construction. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As baselines we used traditional n-grams of words, part-of-speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the tree classifier J48. Sn-grams give better results with the SVM classifier.
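As a minimal sketch (not the authors' implementation), sn-grams can be read off a dependency parse by following head-to-dependent arcs, with longer sn-grams following longer paths:

```python
from collections import defaultdict

def sn_grams(heads, words, n):
    """Extract syntactic n-grams of size n: paths of n nodes in the
    dependency tree, followed along head -> dependent arcs.
    heads[i] is the index of word i's head (-1 for the root)."""
    children = defaultdict(list)
    for i, h in enumerate(heads):
        children[h].append(i)

    grams = []

    def walk(node, path):
        path = path + [node]
        if len(path) == n:
            grams.append(tuple(words[i] for i in path))
            return
        for child in children[node]:
            walk(child, path)

    for start in range(len(words)):  # a path may start at any node
        walk(start, [])
    return grams
```

For the sentence "dogs chase cats quickly" with root "chase", linear bigrams would include ("cats", "quickly"), while the sn-grams ("chase", "dogs"), ("chase", "cats"), and ("chase", "quickly") pair each word with its syntactic head instead.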


Mexican International Conference on Artificial Intelligence | 2012

Syntactic dependency-based n-grams as classification features

Grigori Sidorov; Francisco Velasquez; Efstathios Stamatatos; Alexander F. Gelbukh; Liliana Chanona-Hernández

In this paper we introduce the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in which elements are considered neighbors. For sn-grams, neighbors are determined by following syntactic relations in syntactic trees, not by taking the words as they appear in the text. Dependency trees fit this idea directly, while for constituency trees some simple additional steps are needed. Sn-grams can be applied in any NLP task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution, using an SVM classifier for several profile sizes. As baselines we used traditional n-grams of words, POS tags, and characters. The results obtained with sn-grams are better.


Applications of Natural Language to Data Bases | 2010

Automatic term extraction using log-likelihood based comparison with general reference corpus

Alexander F. Gelbukh; Grigori Sidorov; Eduardo Lavin-Villa; Liliana Chanona-Hernández

In this paper we present a method for extracting single-word terms for a specific domain. At the next stage, these terms can be used as candidates for multi-word term extraction. The proposed method is based on comparison with a general reference corpus using the log-likelihood measure. We also cluster the extracted terms using the k-means algorithm with the cosine similarity measure. We conducted experiments on texts from the computer science domain. The obtained term list is analyzed in detail.
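The comparison can be sketched with Dunning's log-likelihood statistic, computed from a word's frequency in the domain corpus versus a general reference corpus (a hedged sketch of the standard formula; the paper's exact setup and thresholds may differ):

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning's log-likelihood (G2) for a word that occurs
    a times in a domain corpus of c tokens and
    b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected domain count
    e2 = d * (a + b) / (c + d)  # expected reference count
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll
```

A word used proportionally in both corpora scores near zero, while a word overrepresented in the domain corpus scores high and is thus a good single-word term candidate.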


International Conference on Computational Linguistics | 2013

Syntactic dependency-based n-grams: more evidence of usefulness in classification

Grigori Sidorov; Francisco Velasquez; Efstathios Stamatatos; Alexander F. Gelbukh; Liliana Chanona-Hernández

This paper introduces and discusses the concept of syntactic n-grams (sn-grams), which can be applied instead of traditional n-grams in many NLP tasks. Sn-grams are constructed by following paths in syntactic trees, so they bring syntactic knowledge into machine learning methods, although prior parsing is required for their construction. We applied sn-grams to the task of authorship attribution on corpora of three and seven authors, with very promising results.


International Conference on Computational Linguistics | 2002

Compilation of a Spanish Representative Corpus

Alexander F. Gelbukh; Grigori Sidorov; Liliana Chanona-Hernández

Due to Zipf's law, even a very large corpus contains very few occurrences (tokens) of the majority of its distinct words (types). Only a corpus containing enough occurrences of even rare words can provide the statistical information necessary for studying the contextual usage of words. We call such a corpus representative and suggest using the Internet for its compilation. The corresponding algorithm and its application to Spanish are described. Different concepts of a representative corpus are discussed.
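The Zipf-law premise is easy to check on simulated data: even with 100,000 tokens drawn over a 10,000-word vocabulary, most observed types occur only a handful of times (an illustrative simulation, not the paper's data):

```python
import random
from collections import Counter

random.seed(42)
# Simulate a Zipf-distributed corpus: the word of rank r appears with
# probability proportional to 1/r.
vocab = [f"w{r}" for r in range(1, 10001)]
weights = [1.0 / r for r in range(1, 10001)]
corpus = random.choices(vocab, weights=weights, k=100_000)

counts = Counter(corpus)
# Fraction of types with fewer than 5 tokens: the "long tail".
rare_fraction = sum(1 for c in counts.values() if c < 5) / len(counts)
```

Here `rare_fraction` comes out well above one half, which is why a corpus must be far larger than naive size estimates suggest before rare words have usable context statistics.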


Computational Intelligence and Neuroscience | 2016

Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts

Helena Gómez-Adorno; Ilia Markov; Grigori Sidorov; Juan Pablo Posadas-Durán; Miguel A. Sanchez-Perez; Liliana Chanona-Hernández

We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. The resource includes dictionaries of slang words, contractions, abbreviations, and emoticons commonly used in social media. Each of the dictionaries was built for the English, Spanish, Dutch, and Italian languages. The resource is freely available.
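A dictionary-based normalizer of this kind might look as follows; the entries here are hypothetical examples, not taken from the actual resource:

```python
# Hypothetical entries; the real resource covers slang, contractions,
# abbreviations, and emoticons in English, Spanish, Dutch, and Italian.
SLANG = {"u": "you", "gr8": "great", "b4": "before"}
EMOTICONS = {":)": "happy", ":(": "sad"}

def normalize(text):
    """Replace slang tokens and emoticons before feature extraction,
    so the embedding model sees standard-language tokens."""
    out = []
    for tok in text.split():
        tok = EMOTICONS.get(tok, SLANG.get(tok.lower(), tok))
        out.append(tok)
    return " ".join(out)
```

Normalizing before building the neural feature representation reduces vocabulary sparsity, which is the effect the paper measures on the PAN corpora.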


Polibits | 2014

Modelo computacional del diálogo basado en reglas aplicado a un robot guía móvil (A rule-based computational dialogue model applied to a mobile guide robot)

Grigori Sidorov; Irina Kobozeva; Anton Zimmerling; Liliana Chanona-Hernández; Olga Kolesnikova

This paper presents a formal detailed description of the dialogue management module for a mobile robot functioning as a guide. The module includes a propositional dialogue model, specification of speech acts and speech blocks as well as the inventory of speech patterns corresponding to all speech acts of the model. The dialogue model is implemented as a network of states and transitions between states conditioned by rules, which include verbal and visual factors. The architecture of the module is language independent and can be adapted to any natural language.
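A network of states with rule-conditioned transitions can be sketched as follows; the states, rules, and responses are illustrative, not the module's actual inventory:

```python
class DialogueManager:
    """Minimal sketch of a rule-based dialogue model: a network of
    states with transitions guarded by conditions on the user input."""

    def __init__(self):
        self.state = "greeting"
        # Rules as (state, condition, next_state, response) tuples.
        self.rules = [
            ("greeting", lambda u: "hello" in u,
             "offer_tour", "Hello! Shall I give you a tour?"),
            ("offer_tour", lambda u: "yes" in u,
             "touring", "Great, follow me."),
            ("offer_tour", lambda u: "no" in u,
             "idle", "Okay, let me know if you need help."),
        ]

    def step(self, utterance):
        u = utterance.lower()
        for state, cond, nxt, resp in self.rules:
            if state == self.state and cond(u):
                self.state = nxt
                return resp
        return "Sorry, I did not understand."
```

In the actual module the conditions combine verbal and visual factors, and the responses are drawn from an inventory of speech patterns per speech act; the structure, however, is the same state-transition network.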


Soft Computing | 2017

Application of the distributed document representation in the authorship attribution task for small corpora

Juan Pablo Posadas-Durán; Helena Gómez-Adorno; Grigori Sidorov; Ildar Z. Batyrshin; David Pinto; Liliana Chanona-Hernández

Distributed word representation in a vector space (word embeddings) is a technique that represents words in terms of the elements of their neighborhood. Distributed representations can be extended to larger language structures such as phrases, sentences, paragraphs, and documents. The ability to encode the semantic information of texts and to handle high-dimensional datasets is the reason this representation is widely used in various natural language processing tasks such as text summarization, sentiment analysis, and syntactic parsing. In this paper, we propose to use the distributed representation at the document level to solve the authorship attribution task. The proposed method learns distributed vector representations at the document level and then uses an SVM classifier to perform automatic authorship attribution. We also propose to use word n-grams (instead of single words) as the input data type for learning the distributed representation model. We conducted experiments on six datasets used in state-of-the-art works and, for the majority of them, obtained comparable or better results. Our best results were obtained using the combination of words and word n-grams as input data types. The training data are relatively scarce, but this did not affect the distributed representation.
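The paper's idea of feeding word n-grams (rather than single words) into the document-embedding model can be sketched as a simple preprocessing step (a hedged illustration; the helper name is ours):

```python
def ngram_tokens(words, n):
    """Turn a token sequence into overlapping word n-gram 'tokens'.
    These composite tokens can then be fed to a document-embedding
    model (e.g. a doc2vec-style learner) in place of single words."""
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Each composite token preserves a little local word order, which single-word input loses; the paper reports its best results when combining plain words with such n-gram tokens.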


Applications of Natural Language to Data Bases | 2008

Division of Spanish Words into Morphemes with a Genetic Algorithm

Alexander F. Gelbukh; Grigori Sidorov; Diego Lara-Reyes; Liliana Chanona-Hernández

We discuss an unsupervised technique for determining the morpheme structure of words in an inflective language, with Spanish as a case study. For this, we use global optimization (implemented with a genetic algorithm), while most previous works are based on heuristics calculated using conditional probabilities of word parts. Thus, we deal with the complete space of solutions and do not reduce it at the risk of eliminating some correct solutions beforehand. We also work at the derivational level, in contrast with the more traditional grammatical level, which is interested only in inflections. The algorithm works as follows. The input is a wordlist built from a large dictionary or corpus in the given language, and the output is the same wordlist with each word divided into morphemes. First, we build a redundant list of all strings that might possibly be prefixes, suffixes, and stems of the words in the wordlist. Then, we detect possible paradigms in this set and filter out all items from the lists of possible prefixes and suffixes (though not stems) that do not participate in such paradigms. Finally, a subset of these lists of possible prefixes, stems, and suffixes is chosen using the genetic algorithm. The fitness function is based on the idea of minimum description length, i.e., we choose the minimum number of elements necessary for covering all the words. The obtained subset is used for dividing the words in the wordlist. The algorithm's parameters are presented, and a preliminary evaluation of experimental results for a Spanish dictionary is given.
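The minimum-description-length idea behind the fitness function can be sketched as follows: score a candidate inventory by its size, plus a heavy penalty for any word it cannot segment (an illustration of the principle, not the paper's exact fitness function):

```python
def covers(word, prefixes, stems, suffixes):
    """True if word = prefix + stem + suffix for some items in the
    given sets (the empty string allows a missing prefix/suffix)."""
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            if (word[:i] in prefixes and word[i:j] in stems
                    and word[j:] in suffixes):
                return True
    return False

def fitness(words, prefixes, stems, suffixes):
    """MDL-flavoured score (lower is better): inventory size plus a
    large penalty per word the inventory cannot segment."""
    uncovered = sum(1 for w in words
                    if not covers(w, prefixes, stems, suffixes))
    return len(prefixes) + len(stems) + len(suffixes) + 1000 * uncovered
```

For a toy wordlist like ["cantar", "cantas", "saltar", "saltas"], the inventory {cant, salt} + {ar, as} (5 elements) scores better than storing each word as its own stem (6 elements), which is exactly the pressure that drives the genetic algorithm toward shared morphemes.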


Programming and Computer Software | 2017

Measuring similarity between Karel programs using character and word n-grams

Grigori Sidorov; M. Ibarra Romero; Ilia Markov; R. Guzman-Cabrera; Liliana Chanona-Hernández; Francisco Velasquez

We present a method for measuring similarity between source codes. We approach this task from the machine learning perspective, using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of latent semantic analysis to this task. We developed a corpus to evaluate the proposed approach, consisting of around 10,000 source codes written in the Karel programming language that solve 100 different tasks. The results show that the highest classification accuracy is achieved when using the support vector machines classifier, applying latent semantic analysis, and selecting word trigrams as features.
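Character n-gram features and a similarity score between two Karel programs can be sketched as follows (an illustration of the feature type only; the paper's full pipeline additionally applies LSA and an SVM classifier):

```python
import math
from collections import Counter

def char_ngrams(code, n=3):
    """Bag of overlapping character n-grams of a source-code string."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Two programs that reorder the same Karel commands share most of their character trigrams and thus score close to 1, which is what makes this cheap representation effective for grouping solutions to the same task.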

Collaboration


Top co-authors of Liliana Chanona-Hernández (all at the Instituto Politécnico Nacional):

Grigori Sidorov
Alexander F. Gelbukh
Francisco Velasquez
Ilia Markov
Eduardo Cendejas
Grettel Barceló
Helena Gómez-Adorno
Olga Kolesnikova