Ilia Markov

Instituto Politécnico Nacional

Publication

Featured research published by Ilia Markov.

Computational Intelligence and Neuroscience | 2016

Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts

Helena Gómez-Adorno; Ilia Markov; Grigori Sidorov; Juan Pablo Posadas-Durán; Miguel A. Sanchez-Perez; Liliana Chanona-Hernández

We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. The resource includes dictionaries of slang words, contractions, abbreviations, and emoticons commonly used in social media. Each of the dictionaries was built for the English, Spanish, Dutch, and Italian languages. The resource is freely available.
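As a rough illustration of how such dictionaries might be applied during preprocessing, the sketch below replaces each token found in a dictionary with a standardized form. The dictionary entries, the `<smile>`-style placeholders, and the `normalize` function are hypothetical and are not taken from the published resource.

```python
# Hypothetical miniature dictionaries; the published resource is far larger
# and covers English, Spanish, Dutch, and Italian.
SLANG = {"u": "you", "gr8": "great"}
CONTRACTIONS = {"can't": "cannot", "i'm": "i am"}
ABBREVIATIONS = {"btw": "by the way"}
EMOTICONS = {":)": "<smile>", ":(": "<sad>"}


def normalize(text):
    """Lowercase the text and replace every token found in a dictionary."""
    lookup = {**SLANG, **CONTRACTIONS, **ABBREVIATIONS, **EMOTICONS}
    tokens = text.lower().split()
    return " ".join(lookup.get(token, token) for token in tokens)
```

Whitespace tokenization is used here only to keep the sketch short; a tokenizer designed for social media text would be a more realistic choice.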

North American Fuzzy Information Processing Society | 2015

Computing text similarity using Tree Edit Distance

Grigori Sidorov; Helena Gómez-Adorno; Ilia Markov; David Pinto; Nahun Loya

In this paper, we propose applying the Tree Edit Distance (TED) to calculate the similarity between syntactic n-grams, which in turn allows soft similarity between texts to be detected. Computing text similarity is a basic task in many natural language processing problems, and it remains an open research field. Syntactic n-grams are text features extracted from dependency trees and used to construct a Vector Space Model. Soft similarity is an application of the Vector Space Model that takes the similarity between features into account. First, we discuss the advantages of applying the TED to syntactic n-grams. Then, we present a procedure based on the TED and syntactic n-grams for calculating soft similarity between texts.
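One common way to take feature similarity into account in a Vector Space Model is the soft cosine measure, sketched below. The abstract does not name the exact formula, so its use here is an assumption. The `toy_sim` function is a hypothetical stand-in for a TED-based similarity between syntactic n-grams, which are represented as plain tuples of words.

```python
import math


def soft_cosine(a, b, sim):
    """Soft cosine between sparse vectors a and b (dicts: feature -> weight).

    sim(f, g) returns a similarity in [0, 1], with sim(f, f) == 1.
    """
    def soft_dot(x, y):
        return sum(wx * wy * sim(f, g)
                   for f, wx in x.items()
                   for g, wy in y.items())

    denominator = math.sqrt(soft_dot(a, a)) * math.sqrt(soft_dot(b, b))
    return soft_dot(a, b) / denominator if denominator else 0.0


def toy_sim(f, g):
    """Hypothetical placeholder for a TED-based similarity between n-grams."""
    if f == g:
        return 1.0
    # Treat n-grams that share their first element as partially similar.
    return 0.5 if f[0] == g[0] else 0.0
```

With an identity similarity (1 for equal features, 0 otherwise), the measure reduces to the ordinary cosine, so two texts with no features in common score 0. With a graded similarity, they can still receive a nonzero score.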

International Conference on Computational Linguistics | 2017

Improving Cross-Topic Authorship Attribution: The Role of Pre-Processing

Ilia Markov; Efstathios Stamatatos; Grigori Sidorov

The effectiveness of character n-gram features for representing the stylistic properties of a text has been demonstrated in various independent Authorship Attribution (AA) studies. Moreover, it has been shown that some categories of character n-grams perform better than others under both single-topic and cross-topic AA conditions. In this work, we present an improved algorithm for cross-topic AA. We demonstrate that the effectiveness of the character n-gram representation can be significantly enhanced by performing simple pre-processing steps and appropriately tuning the number of features, especially in cross-topic conditions.

Programming and Computer Software | 2017

Measuring similarity between Karel programs using character and word n-grams

Grigori Sidorov; M. Ibarra Romero; Ilia Markov; R. Guzman-Cabrera; Liliana Chanona-Hernández; Francisco Velasquez

We present a method for measuring similarity between source code files. We approach this task from the machine learning perspective, using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of latent semantic analysis to this task. We developed a corpus in order to evaluate the proposed approach. The corpus consists of around 10,000 source code files written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved when using a Support Vector Machines classifier, applying latent semantic analysis, and selecting word trigrams as features.

Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) | 2017

Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words.

Helena Gómez-Adorno; Ilia Markov; Jorge Baptista; Grigori Sidorov; David Pinto

This paper presents the cic_ualg system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms: Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
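The difference between the binary and raw term frequency representations, together with a frequency threshold, can be sketched as follows. This is a minimal illustration over pre-tokenized documents. The function names and the way the threshold is applied (total count over the corpus) are assumptions, not details taken from the paper.

```python
from collections import Counter


def build_vocab(token_lists, min_freq):
    """Keep features whose total corpus frequency reaches the threshold."""
    totals = Counter(tok for tokens in token_lists for tok in tokens)
    return sorted(tok for tok, count in totals.items() if count >= min_freq)


def vectorize(tokens, vocab, binary=False):
    """Map one document to a vector over vocab (binary or raw counts)."""
    counts = Counter(tokens)
    if binary:
        return [1 if counts[tok] > 0 else 0 for tok in vocab]
    return [counts[tok] for tok in vocab]
```

The same two functions work unchanged for character n-grams if each document is first converted to a list of n-grams instead of a list of words.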

Symposium on Languages, Applications and Technologies | 2014

Automatic Identification of Whole-Part Relations in Portuguese

Ilia Markov; Nuno J. Mamede; Jorge Baptista

In this paper, we improve the extraction of semantic relations between textual elements as it is currently performed by STRING, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese, by targeting whole-part relations (meronymy), that is, a semantic relation that holds between an entity and another entity of which it is perceived as a constituent part, or between a member and its set. In this case, we focus on the type of meronymy involving human entities and body-part nouns.

1998 ACM Subject Classification: I.2.7 Natural Language Processing

Processing of the Portuguese Language | 2014

Integrating Verbal Idioms into an NLP System

Jorge Baptista; Nuno J. Mamede; Ilia Markov

This paper describes the integration of verbal idioms into a Natural Language Processing (NLP) system, adopting a construction approach, which is based on the prior parsing stage, so that these Multi-Word Expressions (MWE) can be taken into account in subsequent tasks, such as semantic role labeling or whole-part relation extraction. The paper focuses on body-part nouns, which are part of many verbal idioms, and uses a manually annotated corpus to evaluate its parsing strategy. Results showed a precision of 0.92, a recall of 0.83, an F-measure of 0.87, and an accuracy of 0.99.
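The reported figures are mutually consistent if the F-measure is the balanced F1 score, that is, the harmonic mean of precision and recall. That reading is an assumption, since the abstract does not state the weighting. A quick check:

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


print(round(f_measure(0.92, 0.83), 2))  # 0.87
```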

Cross Language Evaluation Forum | 2017

Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus

Miguel A. Sanchez-Perez; Ilia Markov; Helena Gómez-Adorno; Grigori Sidorov

We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5–8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1–2 for words and n = 3–8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.
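The combined feature set described above (word n-grams of orders 1 to 2 and character n-grams of orders 3 to 8) can be sketched as follows. The function names and the `("char", ...)` / `("word", ...)` key scheme are illustrative assumptions. Named-entity replacement and the SVM classifier are left out.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """All character n-grams of the text for n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, n_min=1, n_max=2):
    """All word n-grams (whitespace tokens) for n_min <= n <= n_max."""
    words = text.split()
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]


def combined_features(text):
    """Count character and word n-grams in one feature space."""
    features = Counter()
    # Prefix each key with its type so "sol" as a word and "sol" as a
    # character trigram remain distinct features.
    features.update(("char", gram) for gram in char_ngrams(text))
    features.update(("word", gram) for gram in word_ngrams(text))
    return features
```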

Mexican International Conference on Artificial Intelligence | 2016

Author Profiling with Doc2vec Neural Network-Based Document Embeddings

Ilia Markov; Helena Gómez-Adorno; Juan Pablo Posadas-Durán; Grigori Sidorov; Alexander F. Gelbukh

To determine author demographics of texts in social media such as Twitter, blogs, and reviews, we use doc2vec document embeddings to train a logistic regression classifier. We experimented with age and gender identification on the PAN author profiling 2014–2016 corpora under both single- and cross-genre conditions. We show that under certain settings the neural network-based features outperform the traditional features when using the same classifier. Our method outperforms existing state of the art under some settings, though the current state-of-the-art results on those tasks have been quite weak.

Processing of the Portuguese Language | 2014

Body part nouns and Whole-Part Relations in Portuguese

Ilia Markov; Nuno J. Mamede; Jorge Baptista

In this paper, we target the extraction of whole-part relations involving human entities and occurrences of body-part nouns in texts using STRING, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese. A whole-part relation is a semantic relation that holds between an entity and another entity of which it is perceived as a constituent part, or between a member and its set.