Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Juan Carlos Gomez is active.

Publication


Featured research published by Juan Carlos Gomez.


Computational Statistics & Data Analysis | 2012

PCA document reconstruction for email classification

Juan Carlos Gomez; Marie-Francine Moens

This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier which uses Principal Component Analysis Document Reconstruction (PCADR), where the idea is that principal component analysis (PCA) can optimally compress only the kind of documents (in our experiments, email classes) that are used to compute the principal components (PCs), and that for other kinds of documents the compression will not perform well using only a few components. Thus, the classifier computes the PCA separately for each document class; when a new instance arrives to be classified, it is projected onto each class's set of computed PCs and then reconstructed using those same PCs. The reconstruction error is computed and the classifier assigns the instance to the class with the smallest error, i.e. the smallest divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g. spam from ham, or phishing from ham). The experiments show that PCADR obtains very good results on the different validation datasets employed, reaching better performance than the popular Support Vector Machine classifier.
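
As a rough illustration of the per-class reconstruction idea, the following sketch assumes TF-IDF-style document vectors and uses scikit-learn's PCA; the class names, dimensions and random data are placeholders rather than the paper's actual setup.

import numpy as np
from sklearn.decomposition import PCA

def fit_class_pcas(X_by_class, n_components=10):
    """Fit one PCA per class, using only that class's training documents."""
    return {label: PCA(n_components=n_components).fit(X) for label, X in X_by_class.items()}

def classify(x, pcas):
    """Assign x to the class whose PCA reconstructs it with the smallest error."""
    errors = {}
    for label, pca in pcas.items():
        z = pca.transform(x.reshape(1, -1))       # project onto the class's principal components
        x_hat = pca.inverse_transform(z)          # reconstruct in the original feature space
        errors[label] = np.linalg.norm(x - x_hat.ravel())
    return min(errors, key=errors.get)

# Illustrative usage: random vectors stand in for TF-IDF features of ham and spam emails.
rng = np.random.default_rng(0)
pcas = fit_class_pcas({"ham": rng.normal(size=(100, 50)), "spam": rng.normal(size=(100, 50))})
print(classify(rng.normal(size=50), pcas))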


Knowledge and Information Systems | 2012

Highly discriminative statistical features for email classification

Juan Carlos Gomez; Erik Boiy; Marie-Francine Moens

This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first is a classic classification scenario using 10-fold cross-validation on several corpora, including four standard ground-truth corpora (Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus) and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models on two proprietary datasets, formed by phishing and spam emails sorted by date, and on the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods in the context of email classification on different benchmark corpora, and the evidence that biased discriminant analysis in particular offers better discriminative features for the classification, gives stable classification results regardless of the number of features chosen, and robustly retains the discriminative value of its features over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules are built from a limited number of features for filtering emails.
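
The evaluation pipeline of the first schema can be sketched roughly as below; scikit-learn's LinearDiscriminantAnalysis stands in for the biased discriminant analysis studied in the paper, and the synthetic data is only a placeholder for real email content features.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for email content features (class 0 = ham, class 1 = spam).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 300)), rng.normal(0.3, 1.0, (200, 300))])
y = np.array([0] * 200 + [1] * 200)

# Dimensionality reduction followed by a linear classifier, scored with 10-fold cross-validation.
pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=1), LinearSVC())
print(cross_val_score(pipe, X, y, cv=10).mean())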


Expert Systems With Applications | 2014

Minimizer of the Reconstruction Error for multi-class document categorization

Juan Carlos Gomez; Marie-Francine Moens

In this article we introduce and validate an approach for single-label multi-class document categorization based on text content features. The approach exploits a statistical property of Principal Component Analysis, which minimizes the reconstruction error of the training documents used to compute a low-rank category transformation matrix. This matrix transforms the original training documents of a given category to a new low-rank space and then optimally reconstructs them in the original space with a minimum reconstruction error. The proposed method, called the Minimizer of the Reconstruction Error (mRE) classifier, uses this property and extends and applies it to new unseen test documents. Several experiments on four multi-class text categorization datasets are conducted, showing the stable and generally better performance of the proposed approach in comparison with other popular classification methods.
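
A minimal numpy sketch of the reconstruction-error scoring: a low-rank basis is computed per category via a truncated SVD, which is one way to realize the low-rank transformation described above; the rank, category count and random data are chosen purely for illustration.

import numpy as np

def category_bases(X_by_cat, rank=10):
    """Per-category mean and low-rank basis (top right singular vectors of the centered data)."""
    bases = {}
    for cat, X in X_by_cat.items():
        mean = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        bases[cat] = (mean, Vt[:rank])
    return bases

def predict(x, bases):
    """Assign x to the category that reconstructs it with the minimum error."""
    errors = {}
    for cat, (mean, V) in bases.items():
        x_hat = mean + (x - mean) @ V.T @ V       # project onto the low-rank space and back
        errors[cat] = np.linalg.norm(x - x_hat)
    return min(errors, key=errors.get)

# Illustrative usage: four categories of random 40-dimensional "documents".
rng = np.random.default_rng(1)
X_by_cat = {c: rng.normal(loc=i, size=(80, 40)) for i, c in enumerate("ABCD")}
print(predict(rng.normal(loc=2.0, size=40), category_bases(X_by_cat)))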


Information Retrieval Facility Conference | 2012

Hierarchical classification of web documents by stratified discriminant analysis

Juan Carlos Gomez; Marie-Francine Moens

In this work we present and evaluate a methodology for classifying web documents into a predefined hierarchy using the textual content of the documents. Hierarchical classification with taxonomies containing thousands of categories is a hard task due to the scarcity of training data: despite the large amount of available data, as more documents become available, more classes are also added to the hierarchy. This leads to a lack of training data for most of the categories, which produces poor individual classification models and tends to bias the classification toward dense categories. Here we propose a novel feature extraction technique called Stratified Discriminant Analysis (sDA) that reduces the dimensionality of the text-content features of the web documents along the different levels of the hierarchy. The sDA model is intended to reduce the effects of data scarcity by better grouping and identifying the categories with few training examples, leading to more robust classification models for those categories. The results of classifying web pages from the Kids&Teens branch of the DMOZ directory show that our model extracts features that are well suited for category grouping of web pages and for representing categories with few training examples.
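
The exact formulation of sDA is not reproduced here; the sketch below only guesses at the "stratified" idea by applying one discriminant projection per hierarchy level and concatenating the results, using scikit-learn's LinearDiscriminantAnalysis and synthetic labels.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stratified_features(X, labels_per_level, n_components=2):
    """One discriminant projection per hierarchy level, concatenated into a single representation."""
    parts = []
    for y_level in labels_per_level:
        lda = LinearDiscriminantAnalysis(n_components=n_components).fit(X, y_level)
        parts.append(lda.transform(X))
    return np.hstack(parts)

# Illustrative data: 300 documents with labels at two hierarchy levels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))
level1 = rng.integers(0, 3, 300)     # e.g. top-level categories
level2 = rng.integers(0, 6, 300)     # e.g. second-level categories
print(stratified_features(X, [level1, level2]).shape)   # (300, 4): two components per level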


Conference on Multimedia Modeling | 2016

Cross-Modal Fashion Search

Susana Zoghbi; Geert Heyman; Juan Carlos Gomez; Marie-Francine Moens

In this demo we focus on cross-modal (visual and textual) e-commerce search within the fashion domain. Particularly, we demonstrate two tasks: (1) given a query image (without any accompanying text), we retrieve textual descriptions that correspond to the visual attributes in the visual query; and (2) given a textual query that may express an interest in specific visual characteristics, we retrieve relevant images (without leveraging textual meta-data) that exhibit the required visual attributes. The first task is especially useful for online stores that want to automatically organize and mine predominantly visual items according to their attributes without human input. The second task is useful for users who want to find items with specific visual characteristics when there is no text available describing the target image. We use state-of-the-art visual and textual features, as well as a state-of-the-art latent variable model to bridge between textual and visual data: bilingual latent Dirichlet allocation. Unlike traditional search engines, we demonstrate a truly cross-modal system, where we can directly bridge between visual and textual content without relying on pre-annotated meta-data.
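
The bilingual topic model itself is not sketched here; assuming both modalities have already been mapped to distributions over a shared set of latent topics, the retrieval step could look roughly as follows (topic counts and data are illustrative stand-ins).

import numpy as np

def retrieve(query_topics, candidate_topics, k=3):
    """Rank items from the other modality by cosine similarity in the shared topic space."""
    q = query_topics / np.linalg.norm(query_topics)
    C = candidate_topics / np.linalg.norm(candidate_topics, axis=1, keepdims=True)
    return np.argsort(C @ q)[::-1][:k]            # indices of the k best-matching candidates

# Illustrative 20-topic distributions: one image query against 100 textual descriptions.
rng = np.random.default_rng(0)
image_query = rng.dirichlet(np.ones(20))
text_candidates = rng.dirichlet(np.ones(20), size=100)
print(retrieve(image_query, text_candidates))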


International Journal of Computer and Electrical Engineering | 2016

Fashion meets computer vision and NLP at e-commerce search

Susana Zoghbi; Geert Heyman; Juan Carlos Gomez; Marie-Francine Moens

In this paper, we focus on cross-modal (visual and textual) e-commerce search within the fashion domain. Particularly, we investigate two tasks: 1) given a query image, we retrieve textual descriptions that correspond to the visual attributes in the query; and 2) given a textual query that may express an interest in specific visual product characteristics, we retrieve relevant images that exhibit the required visual attributes. To this end, we introduce a new dataset that consists of 53,689 images coupled with textual descriptions. The images contain fashion garments that display a great variety of visual attributes, such as different shapes, colors and textures. Unlike previous datasets, the accompanying natural-language text provides only a rough and noisy description of the item in the image. We extensively analyze this dataset in the context of cross-modal e-commerce search. We investigate two state-of-the-art latent variable models to bridge between textual and visual data: bilingual latent Dirichlet allocation and canonical correlation analysis. We use state-of-the-art visual and textual features and report promising results.
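
As a rough sketch of the canonical correlation analysis variant, the following uses scikit-learn's CCA to project synthetic visual and textual features into a common space and ranks images for a textual query by cosine similarity; the feature dimensions and data are placeholders, not the paper's.

import numpy as np
from sklearn.cross_decomposition import CCA

# Synthetic stand-ins: 500 items with 128-d visual and 300-d textual features sharing latent structure.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 10))
X_visual = latent @ rng.normal(size=(10, 128)) + 0.1 * rng.normal(size=(500, 128))
X_text = latent @ rng.normal(size=(10, 300)) + 0.1 * rng.normal(size=(500, 300))

# Project both modalities into a common correlated space.
cca = CCA(n_components=10)
V_c, T_c = cca.fit_transform(X_visual, X_text)

# Text-to-image retrieval: rank images by cosine similarity to one textual query.
q = T_c[0] / np.linalg.norm(T_c[0])
imgs = V_c / np.linalg.norm(V_c, axis=1, keepdims=True)
print(np.argsort(imgs @ q)[::-1][:5])             # indices of the top-5 retrieved images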


Mexican International Conference on Artificial Intelligence | 2013

Flame Classification through the Use of an Artificial Neural Network Trained with a Genetic Algorithm

Juan Carlos Gomez; Fernando Hernandez; Carlos A. Coello Coello; Guillermo Ronquillo; Antonio Trejo

This paper introduces a Genetic Algorithm (GA) for training Artificial Neural Networks (ANNs) on the electromagnetic spectrum signal of a combustion process for flame pattern classification. Combustion requires identification systems that provide information about the state of the process in order to make combustion more efficient and clean. Combustion is complex to model using conventional deterministic methods, which motivates the use of heuristics in this domain. ANNs have been successfully applied to combustion classification systems; however, traditional ANN training methods often get trapped in local minima of the error function and are inefficient on multimodal and non-differentiable functions. A GA is used here to overcome these problems. The proposed GA finds the weights of an ANN that best fit the training patterns, yielding the highest classification rate.
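
A toy version of the idea, evolving the weight vector of a small one-hidden-layer network with a simple generational GA; the network size, GA operators and random stand-in data are illustrative assumptions, not the configuration used in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for spectral flame features: 200 samples, 8 features, 3 flame classes.
X = rng.normal(size=(200, 8))
y = rng.integers(0, 3, 200)
N_IN, N_HID, N_OUT = 8, 10, 3
N_W = N_IN * N_HID + N_HID * N_OUT          # weights of a one-hidden-layer network (biases omitted)

def forward(w, X):
    """Feed-forward pass of the network encoded by the flat weight vector w."""
    W1 = w[:N_IN * N_HID].reshape(N_IN, N_HID)
    W2 = w[N_IN * N_HID:].reshape(N_HID, N_OUT)
    return np.tanh(X @ W1) @ W2

def fitness(w):
    """Classification rate on the training patterns (the GA's objective)."""
    return np.mean(forward(w, X).argmax(axis=1) == y)

# Small generational GA: elitism, size-2 tournaments, uniform crossover, Gaussian mutation.
pop = rng.normal(size=(50, N_W))
for gen in range(100):
    scores = np.array([fitness(w) for w in pop])
    new_pop = [pop[scores.argmax()].copy()]                     # elitism: keep the best network
    while len(new_pop) < len(pop):
        t1, t2 = pop[rng.choice(50, 2)], pop[rng.choice(50, 2)]
        p1 = t1[np.argmax([fitness(w) for w in t1])]            # tournament selection
        p2 = t2[np.argmax([fitness(w) for w in t2])]
        mask = rng.random(N_W) < 0.5                            # uniform crossover
        child = np.where(mask, p1, p2) + 0.1 * rng.normal(size=N_W)  # Gaussian mutation
        new_pop.append(child)
    pop = np.array(new_pop)
print(max(fitness(w) for w in pop))                             # best classification rate found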


Advanced Data Mining and Applications | 2013

Automatic Labeling of Forums Using Bloom’s Taxonomy

Vanessa Echeverria; Juan Carlos Gomez; Marie-Francine Moens

Labeling discussion forums with the cognitive levels of Bloom’s taxonomy is a time-consuming and very expensive task, due to the large amount of information that needs to be labeled and the need for an expert in the educational field to apply the taxonomy to the messages of the forums. In this paper we present a framework to automatically label messages from discussion forums with the categories of Bloom’s taxonomy. Several models were created using three kinds of machine learning approaches: linear, rule-based and combined classifiers. The models are evaluated using accuracy, the F1-measure and the area under the ROC curve. Additionally, the statistical significance of the results is assessed with McNemar's test in order to validate them. The results show that the combination of a linear classifier with a rule-based classifier yields very good and promising results for this difficult task.
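
A minimal sketch of one way to combine a rule-based and a linear classifier, with the rules taking precedence and the linear model as back-off; the keyword rules, labels and training messages are invented for illustration and are not the paper's actual rules.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Illustrative keyword rules per Bloom level; real rules would be built from the taxonomy's verb lists.
RULES = {"remember": ["define", "list"], "apply": ["solve", "use"], "create": ["design", "build"]}

def rule_label(text):
    """Return a Bloom level if a keyword rule fires (naive substring matching), otherwise None."""
    low = text.lower()
    for level, words in RULES.items():
        if any(w in low for w in words):
            return level
    return None

def combined_predict(texts, vectorizer, linear_clf):
    """Rule-based decision when a rule fires, linear classifier as back-off."""
    linear = linear_clf.predict(vectorizer.transform(texts))
    return [rule_label(t) or lin for t, lin in zip(texts, linear)]

# Tiny invented training set of forum messages.
msgs = ["Please define the term recursion", "Design a new sorting method", "I think this is wrong"]
labels = ["remember", "create", "evaluate"]
vec = TfidfVectorizer().fit(msgs)
clf = LinearSVC().fit(vec.transform(msgs), labels)
print(combined_predict(["Can you list the steps?", "Why is this argument flawed?"], vec, clf))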


IEEE Transactions on Information Forensics and Security | 2010

Identifying and Resolving Hidden Text Salting

Marie-Francine Moens; Jan De Beer; Erik Boiy; Juan Carlos Gomez

Hidden salting in digital media is the intentional addition or distortion of content patterns with the purpose of circumventing content filtering. We propose a method to detect portions of a digital text source which are invisible to the end user when they are rendered on a visual medium (such as a computer monitor). The method consists of “tapping” into the rendering process and analyzing the rendering commands to identify portions of the source text (plaintext) which will be invisible to a human reader, using criteria based on text and background colors, font size, overlapping characters, etc. Moreover, the text deemed visible (covertext) is reconstructed from the rendering commands, and the character reading order is identified, which may differ from the rendering order. The detection and resolution of hidden salting is evaluated on two e-mail corpora, and the effectiveness of the method in a spam filtering task is assessed. We provide a solution to a relevant open problem in content filtering applications, namely the presence of tricks aimed at circumventing automatic filters.
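
A simplified illustration of the invisibility criteria; the RenderCmd record is a hypothetical stand-in for the output of tapping a real rendering engine, and the thresholds are arbitrary.

from dataclasses import dataclass

@dataclass
class RenderCmd:
    """Hypothetical record of one text-rendering command (a real system taps the rendering engine)."""
    text: str
    fg: tuple        # foreground RGB
    bg: tuple        # background RGB
    font_size: float
    x: float
    y: float

def is_hidden(cmd, page_w=600, page_h=800, min_font=4, min_contrast=30):
    """Invisibility heuristics of the kind described above: low contrast, tiny fonts, off-page text."""
    contrast = sum(abs(a - b) for a, b in zip(cmd.fg, cmd.bg))
    off_page = not (0 <= cmd.x <= page_w and 0 <= cmd.y <= page_h)
    return contrast < min_contrast or cmd.font_size < min_font or off_page

cmds = [RenderCmd("Meeting at 10am", (0, 0, 0), (255, 255, 255), 12, 50, 100),
        RenderCmd("salted junk text", (255, 255, 255), (255, 254, 255), 12, 50, 120),
        RenderCmd("more salting", (0, 0, 0), (255, 255, 255), 1, 50, 140)]
print(" ".join(c.text for c in cmds if not is_hidden(c)))   # reconstructed covertext, salting dropped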


Professional Search in the Modern World | 2014

A Survey of Automated Hierarchical Classification of Patents

Juan Carlos Gomez; Marie-Francine Moens

In this era of “big data”, hundreds or even thousands of patent applications arrive every day at patent offices around the world. One of the first tasks of the professional analysts in patent offices is to assign classification codes to those patents based on their content. Such classification codes are usually organized in hierarchical structures of concepts. Traditionally the classification task has been done manually by professional experts. However, given the large number of documents, the patent professionals are becoming overwhelmed. Since the classification hierarchies are moreover very complex (containing thousands of categories), reliable, fast and scalable methods and algorithms are needed to help the experts in patent classification tasks. This chapter describes, analyzes and reviews systems that, based on the textual content of patents, automatically classify such patents into a hierarchy of categories. It focuses especially on the classification of patents into the International Patent Classification (IPC) hierarchy. The IPC is the most widely used structure for organizing patents: it is recognized worldwide, and several other classification structures use it or are based on it to ensure interoperability between offices.
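
As a generic illustration of top-down hierarchical classification over an IPC-like hierarchy (not a system from the survey), a section-level classifier can route a patent to a class-level classifier; the toy documents, codes and models below are purely illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy patents labeled with an IPC-like "section/class" path; the real IPC has thousands of nodes.
docs = ["rotating shaft bearing assembly", "electric motor winding insulation",
        "pharmaceutical composition for asthma", "herbicide formulation for crops"]
paths = [("F", "F16"), ("H", "H02"), ("A", "A61"), ("A", "A01")]

vec = TfidfVectorizer().fit(docs)
X = vec.transform(docs)
top = LogisticRegression().fit(X, [p[0] for p in paths])      # section-level classifier
children = {"A": LogisticRegression().fit(X[[2, 3]], ["A61", "A01"])}   # class-level model where data allows

def classify(text):
    """Top-down pass: predict the section, then the class within that section if a model exists."""
    x = vec.transform([text])
    section = top.predict(x)[0]
    return section, (children[section].predict(x)[0] if section in children else None)

print(classify("herbicide spray for fruit crops"))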

Collaboration


Dive into Juan Carlos Gomez's collaborations.

Top Co-Authors

Marie-Francine Moens, Katholieke Universiteit Leuven
Olac Fuentes, University of Texas at El Paso
Susana Zoghbi, Katholieke Universiteit Leuven
Geert Heyman, Katholieke Universiteit Leuven
Erik Boiy, Katholieke Universiteit Leuven
Vanessa Echeverria, Escuela Superior Politecnica del Litoral