Ilia Markov
Instituto Politécnico Nacional
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Ilia Markov.
Computational Intelligence and Neuroscience | 2016
Helena Gómez-Adorno; Ilia Markov; Grigori Sidorov; Juan Pablo Posadas-Durán; Miguel A. Sanchez-Perez; Liliana Chanona-Hernández
We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. The resource includes dictionaries of slang words, contractions, abbreviations, and emoticons commonly used in social media. Each of the dictionaries was built for the English, Spanish, Dutch, and Italian languages. The resource is freely available.
north american fuzzy information processing society | 2015
Grigori Sidorov; Helena Gómez-Adorno; Ilia Markov; David Pinto; Nahun Loya
In this paper, we propose the application of the Tree Edit Distance (TED) for calculation of similarity between syntactic n-grams for further detection of soft similarity between texts. The computation of text similarity is the basic task for many natural language processing problems, and it is an open research field. Syntactic n-grams are text features for Vector Space Model construction extracted from dependency trees. Soft similarity is application of Vector Space Model taking into account similarity of features. First, we discuss the advantages of the application of the TED to syntactic n-grams. Then, we present a procedure based on the TED and syntactic n-grams for calculating soft similarity between texts.
international conference on computational linguistics | 2017
Ilia Markov; Efstathios Stamatatos; Grigori Sidorov
The effectiveness of character n-gram features for representing the stylistic properties of a text has been demonstrated in various independent Authorship Attribution (AA) studies. Moreover, it has been shown that some categories of character n-grams perform better than others both under single and cross-topic AA conditions. In this work, we present an improved algorithm for cross-topic AA. We demonstrate that the effectiveness of character n-grams representation can be significantly enhanced by performing simple pre-processing steps and appropriately tuning the number of features, especially in cross-topic conditions.
Programming and Computer Software | 2017
Grigori Sidorov; M. Ibarra Romero; Ilia Markov; R. Guzman-Cabrera; Liliana Chanona-Hernández; Francisco Velasquez
We present a method for measuring similarity between source codes. We approach this task from the machine learning perspective using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of the latent semantic analysis in this task. We developed a corpus in order to evaluate the proposed approach. The corpus consists of around 10,000 source codes written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved when using Support Vector Machines classifier, applying the latent semantic analysis, and selecting as features trigrams of words.
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) | 2017
Helena Gómez-Adorno; Ilia Markov; Jorge Baptista; Grigori Sidorov; David Pinto
This paper presents the cic_ualg’s system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms – Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
symposium on languages, applications and technologies | 2014
Ilia Markov; Nuno J. Mamede; Jorge Baptista
In this paper, we improve the extraction of semantic relations between textual elements as it is currently performed by STRING, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese, by targeting whole-part relations (meronymy), that is, a semantic relation between an entity that is perceived as a constituent part of another entity, or a member of a set. In this case, we focus on the type of meronymy involving human entities and body-part nouns. 1998 ACM Subject Classification I.2.7 Natural Language Processing
processing of the portuguese language | 2014
Jorge Baptista; Nuno J. Mamede; Ilia Markov
This paper describes the integration of verbal idioms into an Natural Language Processing (NLP) system, adopting a construction approach, which is based on the prior parsing stage, so that these Multi-Word Expressions (MWE) can be taken into account in subsequent tasks, such as semantic role labeling or whole-part relation extraction. The paper focuses on body-part nouns, which are often part of many verbal idioms, and uses a manually annotated corpus to evaluate its parsing strategy. Results showed a precision of 0.92, 0.83 recall, 0.87 f-measure and an accuracy 0.99.
cross language evaluation forum | 2017
Miguel A. Sanchez-Perez; Ilia Markov; Helena Gómez-Adorno; Grigori Sidorov
We compare the performance of character n-gram features (\(n=3{-}8\)) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (\(n=5{-}8\)) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (\(n=1{-}2\) for words and \(n=3{-}8\) for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.
mexican international conference on artificial intelligence | 2016
Ilia Markov; Helena Gómez-Adorno; Juan Pablo Posadas-Durán; Grigori Sidorov; Alexander F. Gelbukh
To determine author demographics of texts in social media such as Twitter, blogs, and reviews, we use doc2vec document embeddings to train a logistic regression classifier. We experimented with age and gender identification on the PAN author profiling 2014–2016 corpora under both single- and cross-genre conditions. We show that under certain settings the neural network-based features outperform the traditional features when using the same classifier. Our method outperforms existing state of the art under some settings, though the current state-of-the-art results on those tasks have been quite weak.
processing of the portuguese language | 2014
Ilia Markov; Nuno J. Mamede; Jorge Baptista
In this paper, we target the extraction of whole-part relations involving human entities and body-part nouns occurrences in texts using STRING, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese. Whole-part relation is a semantic relation between an entity that is perceived as a constituent part of another entity, or a member of a set.