Esaú Villatoro-Tello
Universidad Autónoma Metropolitana
Publication
Featured research published by Esaú Villatoro-Tello.
Expert Systems With Applications | 2013
Fernando Sánchez-Vega; Esaú Villatoro-Tello; Manuel Montes-y-Gómez; Luis Villaseñor-Pineda; Paolo Rosso
An important task in plagiarism detection is determining and measuring similar text portions between a given pair of documents. One of the main difficulties of this task lies in the fact that reused text is commonly modified with the aim of covering or camouflaging the plagiarism. Another difficulty is that not all similar text fragments are examples of plagiarism, since thematic coincidences also tend to produce portions of similar text. In order to tackle these problems, we propose a novel method for detecting likely portions of reused text. This method is able to detect common actions performed by plagiarists, such as word deletion, insertion and transposition, which makes it possible to obtain plausible portions of reused text. We also propose representing the identified reused text by means of a set of features that denote its degree of plagiarism, relevance and fragmentation. This new representation aims to facilitate the recognition of plagiarism by considering diverse characteristics of the reused text during the classification phase. Experimental results employing a supervised classification strategy showed that the proposed method is able to outperform traditionally used approaches.
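As a rough illustration of finding shared text despite word insertions and deletions, the sketch below aligns two word sequences with Python's standard `difflib.SequenceMatcher`. This is a stand-in technique, not the method proposed in the paper, and it does not handle transposition; the function name `shared_fragments` and the sample sentences are assumptions made for the example.

```python
from difflib import SequenceMatcher


def shared_fragments(suspicious, source):
    """Return the word runs that appear, in the same order, in both texts."""
    a = suspicious.lower().split()
    b = source.lower().split()
    # autojunk=False keeps every word eligible for matching.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    fragments = []
    for block in matcher.get_matching_blocks():
        if block.size > 0:  # the last block is a zero-length sentinel
            fragments.append(" ".join(a[block.a:block.a + block.size]))
    return fragments


# "quick" was deleted and "quickly" inserted in the second sentence.
fragments = shared_fragments(
    "the quick brown fox jumps",
    "the brown fox quickly jumps",
)
print(fragments)
```

The surviving runs could then be described by features such as their length or how scattered they are, in the spirit of the fragmentation feature mentioned in the abstract.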
flexible query answering systems | 2009
Maya Carrillo; Esaú Villatoro-Tello; Aurelio López-López; Chris Eliasmith; Manuel Montes-y-Gómez; Luis Villaseñor-Pineda
The bag of words representation (BoW), which is widely used in information retrieval (IR), represents documents and queries as word lists that do not express anything about context information. When we look for information, we find that not everything is explicitly stated in a document, so context information is needed to understand its content. This paper proposes the use of bag of concepts (BoC) and holographic reduced representations (HRR) in IR. These representations go beyond BoW by incorporating context information into document representations. Both HRR and BoC are produced using a vector space methodology known as Random Indexing, and allow expressing additional knowledge from different sources. Our experiments have shown the feasibility of the representations, which improved the mean average precision by up to 7% compared with the traditional vector space model.
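A minimal sketch of a bag-of-concepts representation built with Random Indexing follows. It is an illustration under stated assumptions, not the paper's implementation: the dimensionality `DIM`, the number of non-zero entries `NONZERO`, the unweighted sums, and all function names are choices made for the example.

```python
import math
import random
from collections import defaultdict

DIM = 64      # assumed dimensionality of the reduced space
NONZERO = 4   # assumed number of +1/-1 entries per index vector


def index_vector(rng):
    """Sparse ternary vector: mostly zeros plus a few random +1/-1 entries."""
    vec = [0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        vec[pos] = rng.choice((-1, 1))
    return vec


def add(u, v):
    return [x + y for x, y in zip(u, v)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def document_vectors(docs, seed=0):
    """Build one bag-of-concepts vector per document."""
    rng = random.Random(seed)
    # Step 1: every document gets its own random index vector.
    doc_index = [index_vector(rng) for _ in docs]
    # Step 2: a term's context vector sums the index vectors of the
    # documents in which the term occurs.
    context = defaultdict(lambda: [0] * DIM)
    for d, text in enumerate(docs):
        for term in set(text.split()):
            context[term] = add(context[term], doc_index[d])
    # Step 3: a document is the sum of the context vectors of its terms.
    vectors = []
    for text in docs:
        vec = [0] * DIM
        for term in text.split():
            vec = add(vec, context[term])
        vectors.append(vec)
    return vectors


docs = ["river bank water", "bank loan money", "river water flow"]
vectors = document_vectors(docs)
print(cosine(vectors[0], vectors[2]), cosine(vectors[0], vectors[1]))
```

Because terms that co-occur share index vectors, documents can end up close in this space even when they share few literal words, which is the context information a plain BoW list lacks.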
mexican international conference on artificial intelligence | 2014
Gabriela Ramírez-de-la-Rosa; Esaú Villatoro-Tello; Héctor Jiménez-Salazar; Christian Sánchez-Sánchez
Online communities are filled with comments from loyal readers and first-time viewers, who are constantly creating and sharing information at an unprecedented level, resulting in millions of messages containing the opinions, ideas, needs and beliefs of Internet users. Therefore, businesses are very interested in finding influential users and encouraging them to create positive influence. Influential users are users with the ability to influence individuals' attitudes in a desired way with relative frequency. We present an empirical analysis of the influential user identification problem on Twitter. Our proposed approach considers that the level of influence of users can be detected from their communication patterns, by means of particular writing style features as well as behavioral features. Experiments performed on more than 7000 user profiles indicate that it is possible to automatically identify influential users among the members of a social networking community, and that the approach obtains competitive results against several state-of-the-art methods.
forum for information retrieval evaluation | 2014
Enrique Flores; Paolo Rosso; Lidia Moreno; Esaú Villatoro-Tello
This paper summarizes the goals, organization and results of the first SOCO competitive evaluation campaign for systems that automatically detect source code re-use. The detection of source code re-use is an important research field for both the software industry and academia. Accordingly, the PAN@FIRE track named SOurce COde Re-use (SOCO) focused on the detection of re-used source code in the C/C++ and Java programming languages. Participant systems were asked to annotate whether or not several source code files represent cases of source code re-use. In total, five teams submitted 17 runs. The training set consisted of annotations made by several experts, a feature that turns the SOCO 2014 collection into a useful data set for future evaluations and, at the same time, establishes a standard evaluation framework for future research on the posed shared task.
mexican international conference on artificial intelligence | 2009
Esaú Villatoro-Tello; Luis Villaseñor-Pineda; Manuel Montes-y-Gómez
Recent evaluation results from Geographic Information Retrieval (GIR) indicate that current information retrieval methods are effective at retrieving relevant documents for geographic queries, but have severe difficulties generating a pertinent ranking of them. Motivated by these results, in this paper we present a novel re-ranking method, which employs information obtained through a relevance feedback process to perform a ranking refinement. Experiments show that the proposed method improves the ranking generated by a traditional IR machine, as well as the results of traditional re-ranking strategies such as query expansion via relevance feedback.
Information Retrieval Journal | 2018
Debasis Ganguly; Gareth J. F. Jones; Aarón Ramírez-de-la-Cruz; Gabriela Ramírez-de-la-Rosa; Esaú Villatoro-Tello
Automatic detection of source code plagiarism is an important research field for both the commercial software industry and within the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well for large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective. We retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation model for such documents is likely to result in many false positives being retrieved, because of the use of identical programming language specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised in nature. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) on the top ranked retrieved documents. We report experiments on the SOCO-2014 dataset comprising 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., the AST based approach is able to identify a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier, trained on features extracted from sample plagiarized source code pairs, is shown to effectively filter and thus further improve the ranked list of retrieved candidate plagiarized documents.
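The contrast between token-level and AST-level views of source code can be sketched as follows. The paper works with Java files; this example uses Python's built-in `ast` module purely as a convenient stand-in, and the function name `node_types` and both code snippets are assumptions made for illustration.

```python
import ast


def node_types(source):
    """List the AST node type names of a program, ignoring identifier names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


# The second snippet is the first one with every identifier renamed.
original = "def total(values):\n    return sum(values) + 1\n"
renamed = "def acc(xs):\n    return sum(xs) + 1\n"

same_tokens = original.split() == renamed.split()
same_structure = node_types(original) == node_types(renamed)
print(same_tokens, same_structure)
```

A renaming disguise changes the token view but leaves the sequence of node types intact, which is one reason a structural representation may be more robust than BoW for this task.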
Expert Systems With Applications | 2017
Hugo Jair Escalante; Esaú Villatoro-Tello; Sara E. Garza; A. Pastor López-Monroy; Manuel Montes-y-Gómez; Luis Villaseñor-Pineda
E-communication represents a major threat to users, who are exposed to a number of risks and potential attacks. Detecting these risks with as much anticipation as possible is crucial for prevention. However, much research so far has focused on forensic tools that can be applied only once an attack has been performed. This paper proposes a novel and effective methodology for the early detection of threats in written social media. The goal is to recognize a potential attack before it is consummated, using a minimal amount of information. The proposed approach considers the use of profile-based representations (PBRs) for this goal. PBRs have multiple benefits, including non-sparsity, low dimensionality, and proven discriminative power. Moreover, representations for partial documents can be derived naturally with PBRs, which makes them suitable for the addressed problem. Results include empirical evidence on the usefulness of PBRs in the early recognition setting for two tasks in which anticipation is critical: sexual predator detection and aggressive text identification. These results reveal, on the one hand, that PBRs achieve state-of-the-art performance when using full-length documents (i.e., the classical task), and, on the other hand, that the proposed methodology outperforms previous work on early recognition of sexual predators by a considerable margin, while obtaining state-of-the-art performance in aggressive text identification. To the best of our knowledge, these are the best results reported on early recognition for the approached problems. We foresee this work will pave the way for the development of novel methodologies for the problem and will motivate further research from the intelligent systems and text mining communities.
cross language evaluation forum | 2008
Esaú Villatoro-Tello; Manuel Montes-y-Gómez; Luis Villaseñor-Pineda
This paper focuses on the problem of ranking documents for Geographic Information Retrieval. It aims to demonstrate that by using some query-related example texts it is possible to improve the final ranking of the retrieved documents. Experimental results indicated that our approach could improve the MAP of some sets of retrieved documents using only two example texts.
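For readers unfamiliar with the MAP measure mentioned above, the sketch below computes mean average precision in its standard form. The function names and the document identifiers are made up for the example and do not come from the paper.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank holding a relevant doc."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AP = 1/2
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))  # 0.6667
```

Moving a relevant document higher in a ranked list raises its precision term, so a re-ranking step can improve MAP without retrieving any new documents.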
Journal of Intelligent and Fuzzy Systems | 2018
Miguel A. Álvarez-Carmona; Luis Pellegrin; Manuel Montes-y-Gómez; Fernando Sánchez-Vega; Hugo Jair Escalante; A. Pastor López-Monroy; Luis Villaseñor-Pineda; Esaú Villatoro-Tello
The goal of Author Profiling (AP) is to identify demographic aspects (e.g., age, gender) of a given set of authors by analyzing their written texts. Recently, the AP task has gained interest in many problems related to computer forensics, psychology and marketing, but especially in those related to social media exploitation. Social media data is shared through a wide range of modalities (e.g., text, images and audio), which represent valuable information for extracting insights about users. Nevertheless, most of the current work in AP using social media data has been devoted to analyzing textual information only, and very few works have started exploring gender identification using visual information. In contrast, this paper focuses on exploiting the visual modality to perform both age and gender identification in social media, specifically on Twitter. Our goal is to evaluate the pertinence of using visual information in solving the AP task. Accordingly, we have extended the Twitter corpus from PAN 2014 by incorporating posted images from all the users, making a distinction between tweeted and retweeted images. The experiments provide interesting evidence on the usefulness of visual information in comparison with traditional textual representations for the AP task.
Pattern Analysis and Applications | 2017
Fernando Sánchez-Vega; Esaú Villatoro-Tello; Manuel Montes-y-Gómez; Paolo Rosso; Efstathios Stamatatos; Luis Villaseñor-Pineda
Several methods have been proposed for determining plagiarism between pairs of sentences, passages or even full documents. However, the majority of these methods fail to reliably detect paraphrase plagiarism due to the high complexity of the task, even for human beings. Paraphrase plagiarism identification consists of automatically recognizing document fragments that contain reused text, which is intentionally hidden by means of rewording practices such as semantic equivalences, discursive changes and morphological or lexical substitutions. Our main hypothesis is that the original author's writing style fingerprint prevails in the plagiarized text even when paraphrases occur. Thus, in this paper we propose a novel text representation scheme that gathers both content and style characteristics of texts, represented by means of character-level features. As an additional contribution, we describe the methodology followed for the construction of an appropriate corpus for the task of paraphrase plagiarism identification, which represents a new valuable resource to the NLP community for future research work in this field.
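Character-level features of the kind mentioned above are often extracted as character n-grams. The sketch below is an assumption about what such features could look like, not the paper's representation scheme; the choice of n = 3, the Jaccard overlap, and the function names are illustrative.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after normalizing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two texts."""
    grams_a = set(char_ngrams(a, n))
    grams_b = set(char_ngrams(b, n))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


profile = char_ngrams("abcd")
print(profile)
print(ngram_overlap("the authors rewrote it", "the author rewrites it"))
```

Character n-grams capture affixes, punctuation habits and function-word fragments, so two texts can overlap substantially at this level even when whole words have been substituted.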