Eric Sadit Tellez
Universidad Michoacana de San Nicolás de Hidalgo
Publications
Featured research published by Eric Sadit Tellez.
Information Systems | 2015
Edgar Chávez; Mario Graff; Gonzalo Navarro; Eric Sadit Tellez
Proximity searching is the problem of retrieving, from a given database, those objects closest to a query. To avoid exhaustive searching, data structures called indexes are built on the database prior to serving queries. The curse of dimensionality is a well-known problem for indexes: in spaces with sufficiently concentrated distance histograms, no index outperforms an exhaustive scan of the database. In recent years, a number of indexes for approximate proximity searching have been proposed. These are able to cope with the curse of dimensionality in exchange for returning an answer that might be slightly different from the correct one. In this paper we show that many of those recent indexes can be understood as variants of a simple general model based on K-nearest reference signatures. A set of references is chosen from the database, and the signature of each object consists of the K references nearest to the object. At query time, the signature of the query is computed and the search examines only the objects whose signature is close enough to that of the query. Many known and novel indexes are obtained by considering different ways to determine how much detail the signature records (e.g., just the set of nearest references, or also their proximity order to the object, or also their distances to the object, and so on), how the similarity between signatures is defined, and how the parameters are tuned. In addition, we introduce a space-efficient representation for those families of indexes, making it possible to search very large databases in main memory. Small indexes are cache friendly, inducing faster queries. We perform exhaustive experiments comparing several known and new indexes that derive from our framework, evaluating their time performance, memory usage, and quality of approximation. The best indexes outperform the state of the art, offering an attractive balance between all these aspects, and turn out to be excellent choices in many scenarios.
Our framework gives high flexibility to design new indexes. Highlights: a general framework to understand and analyze many recent indexes; a space-efficient representation based on succinct data structures; an exhaustive experimentation comparing several known and new indexes derived from the framework; the possibility to design new indexes, which can be implemented in a similar way using our framework.
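The signature model described above can be sketched in a few lines. This is a minimal illustration, assuming vectors under the Euclidean distance and the plainest signature variant (an unordered set of reference indices); the function names, the overlap-based signature similarity, and the parameters are illustrative choices, and a real index would precompute all database signatures rather than recompute them per query.

```python
import numpy as np

def knr_signature(obj, references, K):
    """K-nearest-reference signature: the indices of the K references
    closest to obj, kept as an unordered set (the least-detailed variant)."""
    dists = np.linalg.norm(references - obj, axis=1)
    return set(np.argsort(dists)[:K].tolist())

def knr_search(query, db, references, K, min_overlap):
    """Examine only objects whose signature shares at least min_overlap
    references with the query's signature; return their indices."""
    q_sig = knr_signature(query, references, K)
    return [i for i, obj in enumerate(db)
            if len(q_sig & knr_signature(obj, references, K)) >= min_overlap]
```

The surviving candidates would then be ranked by their true distance to the query; recording the proximity order or the distances in the signature, as the paper discusses, yields the more detailed variants.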
Similarity Search and Applications | 2013
Guillermo Ruiz; Edgar Chávez; Karina Figueroa; Eric Sadit Tellez
Pivot tables are popular for exact metric indexing. It is well known that a large pivot table produces faster indexes. The rule of thumb is to use as many pivots as the available memory allows for a given application. To further speed up searches, redundant pivots can be eliminated, or the scope of the pivots (the number of database objects covered by a pivot) can be reduced. In this paper, we apply a different technique to speed up searches. We assign objects to pivots while, at the same time, enforcing proper coverage of the database objects. This increases the discarding power of pivots and in turn leads to faster searches. The central idea is to select a set of essential pivots, without redundancy, covering the entire database. We call our technique extreme pivoting (EP). A nice additional property of EP is that it balances performance and memory usage. For example, using the same amount of memory, EP is faster than the List of Clusters and the Spatial Approximation Tree. Moreover, EP is faster than LAESA even when it uses less memory. The EP technique was formally modeled, allowing performance prediction without an actual implementation. Performance and memory usage depend on two parameters of EP, which are characterized with a wide range of experiments. Also, we provide automatic selection of one parameter, fixing the other. The formal model was empirically tested with real-world and synthetic datasets, finding high consistency between the predicted and the actual performance.
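The discarding rule that pivot tables build on can be sketched as follows. This is a minimal LAESA-style filter, not the EP pivot-selection strategy itself; the distance function, the pivot choice, and the function names are illustrative assumptions.

```python
import numpy as np

def build_pivot_table(db, pivot_ids, dist):
    """Precompute d(p, x) for every pivot p and database object x."""
    return np.array([[dist(db[p], x) for x in db] for p in pivot_ids])

def range_search(query, radius, db, pivot_ids, table, dist):
    """Exact range search: by the triangle inequality, x can be discarded
    without computing d(q, x) whenever |d(q, p) - d(p, x)| > radius for
    some pivot p; the surviving objects are verified directly."""
    qp = np.array([dist(db[p], query) for p in pivot_ids])
    results = []
    for i, x in enumerate(db):
        if np.any(np.abs(qp - table[:, i]) > radius):
            continue  # safely discarded by some pivot
        if dist(query, x) <= radius:
            results.append(i)
    return results
```

EP improves on this baseline by choosing which pivots cover which objects, so that each stored distance has high discarding power for the objects it covers.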
Similarity Search and Applications | 2010
Eric Sadit Tellez; Edgar Chávez
Modeling proximity search problems as a metric space provides a general framework usable in many areas, like pattern recognition, web search, clustering, data mining, knowledge management, and textual and multimedia information retrieval, to name a few. Metric indexes have been improved over the years and many instances of the problem can be solved efficiently. However, when very large or high-dimensional metric databases are indexed, exact approaches are not yet capable of solving the problem efficiently; in these circumstances performance degrades to almost sequential search. To overcome this limitation, non-exact proximity searching algorithms can be used to give answers that are close to the exact result, either in probability or within an approximation factor. Approximation is acceptable in many contexts, especially when human judgement about closeness is involved. In vector spaces, on the other hand, there is a very successful approach dubbed Locality Sensitive Hashing, which consists of building a succinct representation of the objects. This succinct representation is relatively insensitive to small variations of the locality. Unfortunately, the hashing functions have to be carefully designed, tied closely to the data model, and different functions are used when objects come from different domains. In this paper we give a new scheme to encode objects in a general metric space within a uniform framework, independent of the data model. Finally, we provide experimental support for our claims using several real-life databases with different data models and distance functions, obtaining excellent results in both speed and recall, especially for large databases.
Iberoamerican Congress on Pattern Recognition | 2009
Eric Sadit Tellez; Edgar Chávez; Antonio Camarena-Ibarrola
Many pattern recognition tasks can be modeled as proximity searching. Here the common task is to quickly find all the elements close to a given query without sequentially scanning a very large database. A recent shift in the searching paradigm has been established by using permutations instead of distances to predict proximity. Every object in the database records how the set of reference objects (the permutants) is seen, i.e., only the relative positions are used. When a query arrives, the relative displacements of the permutants between the query and a particular object are measured. This approach turned out to be the most efficient and scalable, at the expense of losing recall in the answers. The permutation of every object is represented with k short integers in practice, producing bulky indexes of 16kn bits. In this paper we show how to represent the permutation as a binary vector, using just one bit for each permutant (instead of log k in the plain representation). The Hamming distance on the binary signature is then used to predict proximity between objects in the database. We tested this approach with many real-life metric databases, obtaining faster queries with recall close to that of Spearman ρ while using 16 times less space.
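A sketch of the binary-signature idea, under assumptions: vectors with the Euclidean distance, and bit i set when permutant i falls in the closer half of the object's permutation. That binarization rule is one common choice for compressing permutations to one bit per permutant, not necessarily the paper's exact encoding.

```python
import numpy as np

def brief_permutation(obj, permutants):
    """Binary signature with one bit per permutant: bit i is 1 when
    permutant i lies in the closer half of the object's permutation."""
    order = np.argsort(np.linalg.norm(permutants - obj, axis=1))
    bits = np.zeros(len(permutants), dtype=np.uint8)
    bits[order[:len(permutants) // 2]] = 1
    return bits

def hamming(a, b):
    """Hamming distance between two binary signatures."""
    return int(np.count_nonzero(a != b))
```

At query time, objects are ranked by the Hamming distance between their signature and the query's, and only the top-ranked ones are verified with the true distance.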
Iberoamerican Congress on Pattern Recognition | 2009
Antonio Camarena-Ibarrola; Edgar Chávez; Eric Sadit Tellez
Monitoring media broadcast content has received a lot of attention lately from both academia and industry due to the technical challenge involved and its economic importance (e.g., in advertising). The problem poses a unique challenge from the pattern recognition point of view because a very high recognition rate is needed under non-ideal conditions. The problem consists of comparing a small audio sequence (the commercial ad) with a large audio stream (the broadcast), searching for matches. In this paper we present a solution with the Multi-Band Spectral Entropy Signature (MBSES), which is very robust to degradations commonly found in amplitude-modulated (AM) radio. Using the MBSES we obtained perfect recall (all audio ad occurrences were accurately found with no false positives) in 95 hours of audio from five different AM radio broadcasts. Our system is able to scan one hour of audio in 40 seconds if the audio is already fingerprinted (e.g., with a separate slave computer), and it totaled five minutes per hour including the fingerprint extraction, using a single-core off-the-shelf desktop computer with no parallelization.
Similarity Search and Applications | 2011
Eric Sadit Tellez; Edgar Chávez; Gonzalo Navarro
In this paper we present a novel technique for nearest neighbor searching dubbed neighborhood approximation. The central idea is to divide the database into compact regions, each represented by a single object called the reference. To search for nearest neighbors, a set of candidate references is first obtained and later enriched with the database objects associated to those references. This approach can be implemented with an inverted index, which in turn can be represented in a succinct way, spending just a few bits per object. As a consequence it is possible to store the index in main memory, even for relatively large databases. The speed/compression/recall tradeoff achieved is excellent. To obtain 92% recall in 30-nearest-neighbor searches, the index reviews less than 0.6% of the database, in time ranging from 0.35 to 2.67 seconds using from 93 to 24 Mbytes for a database of ten million objects. The tradeoff comes from using different compression techniques. The uncompressed index requires 0.17 seconds and 267 Mbytes of space. A quality measure complementary to the recall is the ratio between the covering radius of the actual nearest neighbors and that of the near neighbors reported by the algorithm. Under this measure, our results are within a small constant factor of the exact results.
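The region-based search described above can be sketched with a plain inverted index. This is a minimal illustration assuming Euclidean vectors; the function names and parameters are hypothetical, and the paper's contribution of representing the index succinctly (a few bits per object) is not shown here.

```python
import numpy as np
from collections import defaultdict

def build_index(db, refs):
    """Partition the database into compact regions: each object is listed
    under its nearest reference, forming an inverted index."""
    index = defaultdict(list)
    for i, x in enumerate(db):
        index[int(np.argmin(np.linalg.norm(refs - x, axis=1)))].append(i)
    return index

def search(query, db, refs, index, n_regions, k):
    """Probe the n_regions references nearest to the query, gather the
    objects stored in those regions, and rank them by true distance."""
    probe = np.argsort(np.linalg.norm(refs - query, axis=1))[:n_regions]
    cand = [i for r in probe for i in index.get(int(r), [])]
    return sorted(cand, key=lambda i: float(np.linalg.norm(db[i] - query)))[:k]
```

Increasing n_regions trades search time for recall: more regions probed means more candidates verified and fewer true neighbors missed.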
Mexican Conference on Pattern Recognition | 2011
Eric Sadit Tellez; Edgar Chávez; Mario Graff
Efficiently searching for patterns in very large collections of objects is a very active area of research. Over the last few years a number of indexes have been proposed to speed up the searching procedure. In this paper, we introduce a novel framework (the K-nearest references) in which several approximate proximity indexes can be analyzed and understood. The search spaces where the analyzed indexes work span from vector spaces and general metric spaces up to general similarity spaces. The proposed framework clarifies the principles behind the search complexity and allows us to propose a number of novel indexes with high recall, low search time, and a linear storage requirement as salient characteristics.
Mexican Conference on Pattern Recognition | 2010
Edgar Chávez; Eric Sadit Tellez
Nearest neighbor queries can be satisfied, in principle, with a greedy algorithm under a proximity graph. Each object in the database is represented by a node, and proximal nodes in this graph share an edge. To find the nearest neighbor the idea is quite simple: we start at a random node and get iteratively closer to the nearest neighbor, following only adjacent edges in the proximity graph. Every node reachable from the current vertex is reviewed, and only the node closest to the query is expanded in the next round. The algorithm stops when none of the neighbors of the current node is closer to the query. The number of reviewed objects will be proportional to the diameter of the graph times the average degree of the nodes. Unfortunately, the degree of a proximity graph is unbounded for a general metric space [1], and hence the number of inspected objects can be linear in the size of the database, which is the same as no indexing at all. In this paper we introduce a quasi-proximity graph induced by the all-k-nearest-neighbor graph. The degree of this graph is bounded, but the greedy algorithm then faces local minima, which boils down to having false positives in the queries. We show experimental results for high-dimensional spaces. We report a recall greater than 90% for most configurations, which is very good for many proximity searching applications, while reviewing just a tiny portion of the database. The space requirement for the index is linear in the database size, and the construction time is quadratic in the worst case. Relaxations of our method are sketched to obtain practical subquadratic implementations.
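The greedy walk described above can be sketched in a few lines. This is a minimal illustration assuming Euclidean vectors and a brute-force all-k-nearest-neighbor graph construction (quadratic, matching the worst case stated above); names and parameters are illustrative.

```python
import numpy as np

def knn_graph(db, k):
    """All-k-nearest-neighbor graph: node i is linked to its k nearest
    objects, giving a bounded degree (built here by brute force)."""
    d = np.linalg.norm(db[:, None, :] - db[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return {i: np.argsort(d[i])[:k].tolist() for i in range(len(db))}

def greedy_search(query, db, graph, start=0):
    """Greedy walk: repeatedly move to the neighbor closest to the query;
    stop at a local minimum, reported as the (approximate) nearest neighbor."""
    current, best = start, float(np.linalg.norm(db[start] - query))
    while True:
        nxt, nxt_d = current, best
        for nb in graph[current]:
            d = float(np.linalg.norm(db[nb] - query))
            if d < nxt_d:
                nxt, nxt_d = nb, d
        if nxt == current:
            return current  # no neighbor is closer: local minimum reached
        current, best = nxt, nxt_d
```

Restarting the walk from several random nodes and keeping the best result is a standard way to mitigate the local minima the paragraph mentions.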
Latin American Web Congress | 2006
Eric Sadit Tellez; Edgar Chávez; Juan Contreras-Castillo
Remote object management is a key element in distributed and collaborative information retrieval, peer-to-peer systems, and agent-oriented programming. In existing implementations, the communication and parsing overhead represents a significant fraction of the overall latency time in information retrieval tasks. Furthermore, existing architectures are composed of several software layers with potential version conflicts. In this paper, we present SPyRO (Simple Python Remote Objects), a Python remote object management system developed to provide transparent and translucent remote object access. The transparent mode is designed to easily create distributed applications supporting code mobility (Fuggetta et al., 1998) in the Python programming language, whilst the translucent mode is designed to provide total control over remote calls and to allow access from other programming languages. To lower the communication latency, the connection is stateless: local objects and remote calls are not aware of the connection state. The protocol uses several marshal formats to communicate between peers, trying to maximize homogeneity in a heterogeneous network. To support our claims we present results showing performance improvements of about 10 times compared with state-of-the-art marshalling formats based on XML.
Expert Systems With Applications | 2017
Eric Sadit Tellez; Sabino Miranda-Jiménez; Mario Graff; Daniela Moctezuma; Oscar S. Siordia; Elio A. Villaseñor
Highlights: a review of popular techniques to model short texts written in an informal style; an analysis of the configurations that produce the top-k sentiment classifiers; an analysis oriented to performance in both accuracy and computing time; a simple method to create fast and accurate sentiment analysis systems. Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., its positiveness or negativeness. Recently, it has received a lot of attention given the interest in opinion mining in micro-blogging platforms. These new forms of textual expression present new challenges for analyzing text because of the use of slang, and orthographic and grammatical errors, among others. Along with these challenges, a practical sentiment classifier should be able to handle large workloads efficiently. The aim of this research is to identify, within a large set of combinations, which text transformations (lemmatization, stemming, entity removal, among others), tokenizers (e.g., word n-grams), and token-weighting schemes make the most impact on the accuracy of a classifier (Support Vector Machine) trained on two Spanish datasets. The methodology used is to exhaustively analyze all combinations of text transformations and their respective parameters to find out what common characteristics the best-performing classifiers have. Furthermore, we introduce a novel approach based on the combination of word-based n-grams and character-based q-grams. The results show that this novel combination of words and characters produces a classifier that outperforms the traditional word-based combination by 11.17% and 5.62% on the INEGI and TASS15 datasets, respectively.
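The word/character combination studied above can be sketched as a tokenizer that emits both kinds of tokens into one stream. The n and q values below are illustrative, not the paper's tuned settings, and the function name is hypothetical.

```python
def mixed_tokens(text, word_n=(1, 2), char_q=(3,)):
    """Combine word n-grams and character q-grams into a single token
    stream; character q-grams help absorb the slang and orthographic
    errors common in micro-blogging text."""
    s = text.lower()
    words = s.split()
    tokens = []
    for n in word_n:
        tokens += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    for q in char_q:
        tokens += [s[i:i + q] for i in range(len(s) - q + 1)]
    return tokens
```

The resulting tokens would then be weighted (e.g., with TF-IDF) and fed to the Support Vector Machine, as in the pipeline the paragraph describes.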