Grigori Sidorov | Researchain

Publication

Featured research published by Grigori Sidorov.

Expert Systems With Applications | 2014

Syntactic N-grams as machine learning features for natural language processing

Grigori Sidorov; Francisco Velasquez; Efstathios Stamatatos; Alexander F. Gelbukh; Liliana Chanona-Hernández

In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. In sn-grams, neighbors are taken by following syntactic relations in syntactic trees rather than by taking words as they appear in a text; that is, sn-grams are constructed by following paths in syntactic trees. In this way, sn-grams bring syntactic knowledge into machine learning methods, although prior parsing is necessary for their construction. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As a baseline we used traditional n-grams of words, part-of-speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the tree classifier J48. Sn-grams give better results with the SVM classifier.
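The construction described in this abstract (following paths in a syntactic tree instead of surface word order) can be illustrated with a minimal sketch. It assumes a dependency tree given as a list of head indices; the function name `sn_grams`, the input format, and the example sentence are hypothetical choices for illustration, not the authors' implementation, and the parse is written by hand rather than produced by a parser.

```python
from typing import List, Optional, Tuple


def sn_grams(words: List[str], heads: List[Optional[int]], n: int) -> List[Tuple[str, ...]]:
    """Return all head-to-dependent paths of n words in a dependency tree.

    heads[i] is the index of the head of words[i]; the root has head None.
    (This input format is an assumption made for the sketch.)
    """
    # Build the list of dependents for every word.
    children: List[List[int]] = [[] for _ in words]
    for i, h in enumerate(heads):
        if h is not None:
            children[h].append(i)

    def paths_from(node: int, length: int) -> List[Tuple[str, ...]]:
        # A path of one word is the word itself.
        if length == 1:
            return [(words[node],)]
        # Otherwise extend the path through each dependent of this node.
        return [
            (words[node],) + rest
            for child in children[node]
            for rest in paths_from(child, length - 1)
        ]

    result: List[Tuple[str, ...]] = []
    for start in range(len(words)):
        result.extend(paths_from(start, n))
    return result


# Hand-written parse of "the cat eats fish": "eats" is the root,
# "cat" and "fish" depend on "eats", and "the" depends on "cat".
words = ["the", "cat", "eats", "fish"]
heads = [1, 2, None, 2]
bigrams = sn_grams(words, heads, 2)
trigrams = sn_grams(words, heads, 3)
```

Here the syntactic bigrams are (cat, the), (eats, cat), and (eats, fish), whereas the traditional word bigrams would be (the, cat), (cat, eats), and (eats, fish).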

international conference on computational linguistics | 2001

Zipf and Heaps Laws' Coefficients Depend on Language

Alexander F. Gelbukh; Grigori Sidorov

We observed that the coefficients of two important empirical statistical laws of language, Zipf's law and Heaps' law, differ across languages, as we illustrate with English and Russian examples. This may have both theoretical and practical implications. On the one hand, the reasons for this may shed light on the nature of language. On the other hand, these two laws are important in, say, full-text database design, where they allow predicting the index size.
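As a rough illustration of what such coefficients are: in their usual formulations, Zipf's law says that the frequency of the word of rank r is approximately C / r^z, and Heaps' law says that the number of distinct words among the first n tokens is approximately K * n^b. The sketch below estimates the exponents z and b with an ordinary least-squares fit in log-log space. The function names and the fitting procedure are assumptions for illustration; the abstract does not say how the coefficients were estimated.

```python
import math
from collections import Counter
from typing import Sequence


def loglog_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of log(y) against log(x).

    Requires at least two distinct x values and strictly positive inputs.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


def zipf_exponent(tokens: Sequence[str]) -> float:
    """Estimate z in f(r) ~ C / r**z from word frequencies sorted by rank."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return -loglog_slope(ranks, freqs)


def heaps_exponent(tokens: Sequence[str]) -> float:
    """Estimate b in V(n) ~ K * n**b from vocabulary growth over the text."""
    seen = set()
    vocab_sizes = []
    for token in tokens:
        seen.add(token)
        vocab_sizes.append(len(seen))
    return loglog_slope(range(1, len(tokens) + 1), vocab_sizes)


# Synthetic data whose frequencies are exactly 12 / rank, so z should be 1.
toy_tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
z = zipf_exponent(toy_tokens)
```

Running the same two functions on tokenized corpora in different languages would give one exponent pair per language, which is the kind of comparison the abstract reports.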

mexican international conference on artificial intelligence | 2012

Empirical study of machine learning based approach for opinion mining in tweets

Grigori Sidorov; Sabino Miranda-Jiménez; Francisco Viveros-Jiménez; Alexander F. Gelbukh; Noé Alejandro Castro-Sánchez; Francisco Velasquez; Ismael Díaz-Rangel; Sergio Suárez-Guerra; Alejandro Treviño; Juan Gordon

Opinion mining deals with determining the sentiment orientation (positive, negative, or neutral) of a (short) text. Recently, it has attracted great interest in both academia and industry because of its potential applications. One of the most promising applications is the analysis of opinions in social networks. In this paper, we examine how classifiers perform when doing opinion mining over Spanish Twitter data. We explore how different settings (n-gram size, corpus size, number of sentiment classes, balanced vs. unbalanced corpus, various domains) affect the precision of the machine learning algorithms. We experimented with Naive Bayes, Decision Tree, and Support Vector Machines. We also describe language-specific preprocessing of tweets, in our case for Spanish. The paper presents the best parameter settings for practical applications of opinion mining in Spanish Twitter. We also present a novel resource for the analysis of emotions in texts: the Spanish Emotion Lexicon, a dictionary of 2,036 words marked with the probability of expressing one of the six basic emotions (Probability Factor of Affective use, PFA).

mexican international conference on artificial intelligence | 2012

Syntactic dependency-based n-grams as classification features

Grigori Sidorov; Francisco Velasquez; Efstathios Stamatatos; Alexander F. Gelbukh; Liliana Chanona-Hernández

In this paper we introduce the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in which elements are considered neighbors. In sn-grams, neighbors are taken by following syntactic relations in syntactic trees, not by taking the words as they appear in the text. Dependency trees fit directly into this idea, while constituency trees require some simple additional steps. Sn-grams can be applied in any NLP task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. An SVM classifier was used for several profile sizes. As a baseline we used traditional n-grams of words, POS tags, and characters. The results obtained with sn-grams are better.

text speech and dialogue | 1999

Use of a Weighted Topic Hierarchy for Document Classification

Alexander F. Gelbukh; Grigori Sidorov; Adolfo Guzmán-Arenas

A statistical method of document classification driven by a hierarchical topic dictionary is proposed. The method uses a dictionary with a simple structure and is insensitive to inaccuracies in the dictionary. Two kinds of weights of dictionary entries, namely relevance and discrimination weights, are discussed. Weights of the first type are associated with the links between words and topics and between the nodes in the tree, while weights of the second type depend on the user database. A common-sense-compliant way of assigning these weights to the topics is presented. A text classification system, Classifier, based on the discussed method is described.

applications of natural language to data bases | 2010

Automatic term extraction using log-likelihood based comparison with general reference corpus

Alexander F. Gelbukh; Grigori Sidorov; Eduardo Lavin-Villa; Liliana Chanona-Hernández

In this paper we present a method for extracting single-word terms for a specific domain. At the next stage these terms can be used as candidates for multi-word term extraction. The proposed method is based on comparison with a general reference corpus using log-likelihood similarity. We also cluster the extracted terms using the k-means algorithm and the cosine similarity measure. We conducted experiments using texts from the computer science domain. The obtained term list is analyzed in detail.
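A comparison of this kind can be sketched as follows. The sketch uses the widely used two-corpus log-likelihood statistic (observed versus expected frequencies in a domain corpus and a reference corpus); whether the paper uses exactly this formulation is an assumption. The function names, the toy token lists, and the threshold of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) are illustrative choices, not taken from the abstract.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Two-corpus log-likelihood for one word.

    a, b: frequency of the word in the domain and reference corpus.
    c, d: total number of tokens in the domain and reference corpus.
    """
    expected_domain = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    ll = 0.0
    # A zero observed frequency contributes nothing (0 * log 0 is taken as 0).
    if a > 0:
        ll += a * math.log(a / expected_domain)
    if b > 0:
        ll += b * math.log(b / expected_reference)
    return 2 * ll


def candidate_terms(
    domain_tokens: Sequence[str],
    reference_tokens: Sequence[str],
    threshold: float = 3.84,  # assumed cut-off, not from the paper
) -> List[Tuple[str, float]]:
    """Rank words that are over-represented in the domain corpus.

    Both token sequences must be non-empty.
    """
    domain_counts = Counter(domain_tokens)
    reference_counts = Counter(reference_tokens)
    c, d = len(domain_tokens), len(reference_tokens)
    scored = []
    for word, a in domain_counts.items():
        b = reference_counts[word]  # Counter returns 0 for unseen words
        # Keep only words relatively more frequent in the domain corpus.
        if a / c > b / d:
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpora: "parser" occurs only in the domain text, "the" in both.
domain = ["parser"] * 10 + ["the"] * 10
reference = ["the"] * 10 + ["house"] * 10
terms = candidate_terms(domain, reference)
```

In this toy input only "parser" is returned, because "the" has the same relative frequency in both corpora and "house" never occurs in the domain text.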

international conference on computational linguistics | 2013

Syntactic dependency-based n-grams: more evidence of usefulness in classification

Grigori Sidorov; Francisco Velasquez; Efstathios Stamatatos; Alexander F. Gelbukh; Liliana Chanona-Hernández

The paper introduces and discusses a concept of syntactic n-grams (sn-grams) that can be applied instead of traditional n-grams in many NLP tasks. Sn-grams are constructed by following paths in syntactic trees, so sn-grams allow bringing syntactic knowledge into machine learning methods. Still, previous parsing is necessary for their construction. We applied sn-grams in the task of authorship attribution for corpora of three and seven authors with very promising results.

mexican international conference on artificial intelligence | 2005

A domain independent natural language interface to databases capable of processing complex queries

Rodolfo A. Pazos Rangel; O Joaquín Pérez; B. Juan Javier González; Alexander F. Gelbukh; Grigori Sidorov; Myriam Rodríguez

We present a method for creating natural language interfaces to databases (NLIDB) that translate natural language queries into SQL. The method is domain independent, i.e., it avoids the tedious process of configuring the NLIDB for a given domain. We automatically generate the domain dictionary for query translation using semantic metadata of the database. Our semantic representation of a query is a graph that includes information from database metadata. The query is translated taking into account the parts of speech of its words (obtained with some linguistic processing). Specifically, unlike most existing NLIDBs, we treat auxiliary words (prepositions and conjunctions) as set theory operators, which allows for processing more complex queries. Experimental results (obtained on two Spanish databases from different domains) show that the treatment of auxiliary words improves correctness of translation by 12.1%. With the developed NLIDB, 82% of queries were correctly translated (and thus answered). Reconfiguring the NLIDB from one domain to the other took only ten minutes.

international conference natural language processing | 2005

On some optimization heuristics for lesk-like WSD algorithms

Alexander F. Gelbukh; Grigori Sidorov; Sang-Yong Han

For most English words, dictionaries give various senses: e.g., “bank” can stand for a financial institution, shore, set, etc. Automatic selection of the sense intended in a given text is crucial in many applications of text processing, such as information retrieval or machine translation: e.g., “(my account in the) bank” is to be translated into Spanish as “(mi cuenta en el) banco” whereas “(on the) bank (of the lake)” as “(en la) orilla (del lago).” To choose the optimal combination of the intended senses of all words, Lesk suggested considering the global coherence of the text, by which we mean the average relatedness between the chosen senses for all words in the text. Due to the high dimensionality of the search space, heuristics must be used to find a near-optimal configuration. In this paper, we discuss several such heuristics that differ in terms of complexity and quality of the results. In particular, we introduce a dimensionality reduction algorithm that reduces the complexity of computationally expensive approaches such as genetic algorithms.
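The objective described here, choosing the combination of senses with the highest average pairwise relatedness, can be sketched with an exhaustive search over a toy example. Relatedness is taken to be gloss word overlap, in the spirit of Lesk's original proposal; the glosses, sense labels, and function names below are invented for illustration. The exhaustive search is exactly the part that becomes infeasible on real texts, since the number of combinations is the product of the sense counts of all words, which is why the paper turns to heuristics.

```python
from itertools import combinations, product
from typing import Dict, List, Set, Tuple


def relatedness(gloss_a: Set[str], gloss_b: Set[str]) -> int:
    """Lesk-style relatedness: number of words shared by two glosses."""
    return len(gloss_a & gloss_b)


def coherence(glosses: List[Set[str]]) -> float:
    """Average relatedness over all pairs of chosen senses."""
    pairs = list(combinations(glosses, 2))
    if not pairs:
        return 0.0
    return sum(relatedness(a, b) for a, b in pairs) / len(pairs)


def best_senses(candidates: List[Dict[str, Set[str]]]) -> Tuple[str, ...]:
    """Exhaustively pick one sense per word, maximizing global coherence.

    candidates[i] maps each sense label of the i-th word to its gloss words.
    """
    best: Tuple[str, ...] = ()
    best_score = -1.0
    for combo in product(*(senses.items() for senses in candidates)):
        score = coherence([gloss for _, gloss in combo])
        if score > best_score:
            best = tuple(label for label, _ in combo)
            best_score = score
    return best


# Invented toy glosses for the words "bank" and "lake".
bank = {
    "bank/finance": {"money", "institution", "deposit"},
    "bank/shore": {"land", "river", "lake", "water"},
}
lake = {"lake/water": {"water", "body", "land"}}
chosen = best_senses([bank, lake])
```

With these glosses the shore sense of "bank" is selected, because it shares two words with the gloss of "lake" while the financial sense shares none.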

international conference on computational linguistics | 2002

Compilation of a Spanish Representative Corpus

Alexander F. Gelbukh; Grigori Sidorov; Liliana Chanona-Hernández

Due to Zipf's law, even a very large corpus contains very few occurrences (tokens) of the majority of its different words (types). Only a corpus containing enough occurrences of even rare words can provide the statistical information necessary for studying the contextual usage of words. We call such a corpus representative and suggest using the Internet for its compilation. The corresponding algorithm and its application to Spanish are described. Different concepts of a representative corpus are discussed.
