Rares Vernica | Researchain

Publication

Featured research published by Rares Vernica.

international conference on management of data | 2010

Efficient parallel set-similarity joins using MapReduce

Rares Vernica; Michael J. Carey; Chen Li

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
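As a rough illustration of what a set-similarity self-join computes, the sketch below groups record ids by token (a map-like step) and verifies candidate pairs within each group (a reduce-like step). It is a minimal single-machine sketch, not the paper's 3-stage algorithm: the Jaccard measure, the function names `jaccard` and `self_join`, and the sample records are all assumptions made for illustration, and the threshold is assumed to be greater than zero.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def self_join(records, threshold):
    """Return (id1, id2, similarity) for pairs with Jaccard >= threshold.

    records: dict mapping record id -> list of tokens (hypothetical layout).
    Assumes threshold > 0, so pairs sharing no token can be ignored.
    """
    # Map-like step: route each record id to one group per distinct token.
    groups = defaultdict(set)
    for rid, tokens in records.items():
        for tok in set(tokens):
            groups[tok].add(rid)

    # Reduce-like step: pairs that share a token are candidates; verify each once.
    seen = set()
    result = []
    for rids in groups.values():
        for x, y in combinations(sorted(rids), 2):
            if (x, y) in seen:
                continue
            seen.add((x, y))
            sim = jaccard(records[x], records[y])
            if sim >= threshold:
                result.append((x, y, sim))
    return sorted(result)


records = {1: ["a", "b", "c"], 2: ["a", "b", "d"], 3: ["x", "y"]}
pairs = self_join(records, 0.5)
```

Grouping by every token replicates each record many times; the abstract's emphasis on balanced partitioning and minimal replication addresses exactly that cost, which this sketch does not attempt to reduce.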

Distributed and Parallel Databases | 2011

ASTERIX: towards a scalable, semistructured data platform for evolving-world models

Alexander Behm; Vinayak R. Borkar; Michael J. Carey; Raman Grover; Chen Li; Nicola Onose; Rares Vernica; Alin Deutsch; Yannis Papakonstantinou; Vassilis J. Tsotras

ASTERIX is a new data-intensive storage and computing platform project spanning UC Irvine, UC Riverside, and UC San Diego. In this paper we provide an overview of the ASTERIX project, starting with its main goal—the storage and analysis of data pertaining to evolving-world models. We describe the requirements and associated challenges, and explain how the project is addressing them. We provide a technical overview of ASTERIX, covering its architecture, its user model for data and queries, and its approach to scalable query processing and data management. ASTERIX utilizes a new scalable runtime computational platform called Hyracks that is also discussed at an overview level; we have recently made Hyracks available in open source for use by other interested parties. We also relate our work on ASTERIX to the current state of the art and describe the research challenges that we are currently tackling as well as those that lie ahead.

international conference on management of data | 2009

Efficient top-k algorithms for fuzzy search in string collections

Rares Vernica; Chen Li

An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function such as Jaccard, cosine, and edit distance. Answering approximate queries efficiently is important in many applications such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. We answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and doing early termination. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.
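The idea of answering a ranking query from gram-based inverted lists can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithms: it ranks by Jaccard similarity over 2-grams only (string importance is ignored), it performs no pruning, skipping, or early termination, and the names `grams`, `build_index`, and `top_k` are hypothetical.

```python
import heapq
from collections import defaultdict


def grams(s, q=2):
    """Set of q-grams of s, padded with '#' and '$' at the ends."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_index(strings, q=2):
    """Inverted lists: gram -> ids of the strings that contain it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in grams(s, q):
            index[g].append(sid)
    return index


def top_k(query, strings, index, k, q=2):
    """Return the k most similar strings as (similarity, string) pairs."""
    query_grams = grams(query, q)
    # Traverse the query's inverted lists and count shared grams per string id.
    shared = defaultdict(int)
    for g in query_grams:
        for sid in index.get(g, ()):
            shared[sid] += 1
    scored = []
    for sid, common in shared.items():
        union = len(query_grams) + len(grams(strings[sid], q)) - common
        scored.append((common / union, strings[sid]))
    return heapq.nlargest(k, scored)


strings = ["smith", "smyth", "jones"]
index = build_index(strings)
best = top_k("smith", strings, index, k=2)
```

Strings that share no gram with the query never appear in any traversed list, so they are never scored; that is the basic saving an inverted index provides before any further pruning.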

knowledge discovery and data mining | 2008

Entity categorization over large document collections

Venkatesh Ganti; Arnd Christian König; Rares Vernica

Extracting entities (such as people, movies) from documents and identifying the categories (such as painter, writer) they belong to enable structured querying and data analysis over unstructured document collections. In this paper, we focus on the problem of categorizing extracted entities. Most prior approaches developed for this task only analyzed the local document context within which entities occur. In this paper, we significantly improve the accuracy of entity categorization by (i) considering an entity's context across multiple documents containing it, and (ii) exploiting existing large lists of related entities (e.g., lists of actors, directors, books). These approaches introduce computational challenges because (a) the context of entities has to be aggregated across several documents and (b) the lists of related entities may be very large. We develop techniques to address these challenges. We present a thorough experimental study on real data sets that demonstrates the increase in accuracy and the scalability of our approaches.

Journal of Chemical Information and Modeling | 2012

Speeding up chemical searches using the inverted index: the convergence of chemoinformatics and text search methods

Ramzi Nasr; Rares Vernica; Chen Li; Pierre Baldi

In ligand-based screening, retrosynthesis, and other chemoinformatics applications, one often seeks to search large databases of molecules in order to retrieve molecules that are similar to a given query. With the expanding size of molecular databases, the efficiency and scalability of data structures and algorithms for chemical searches are becoming increasingly important. Remarkably, both the chemoinformatics and information retrieval communities have converged on similar solutions whereby molecules or documents are represented by binary vectors, or fingerprints, indexing their substructures such as labeled paths for molecules and n-grams for text, with the same Jaccard-Tanimoto similarity measure. As a result, similarity search methods from one field can be adapted to the other. Here we adapt recent, state-of-the-art, inverted index methods from information retrieval to speed up similarity searches in chemoinformatics. Our results show a several-fold speed-up improvement over previous methods for both threshold searches and top-K searches. We also provide a mathematical analysis that allows one to predict the level of pruning achieved by the inverted index approach and validate the quality of these predictions through simulation experiments. All results can be replicated using data freely downloadable from http://cdb.ics.uci.edu/ .
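To make the Jaccard-Tanimoto measure on binary fingerprints concrete, here is a minimal threshold search. It does not implement the inverted index method of the paper; instead it uses a simpler, well-known bit-count bound (the similarity of two fingerprints cannot exceed the ratio of the smaller bit count to the larger) to skip hopeless candidates. Fingerprints are modelled as Python integers, and all names and sample values are assumptions for illustration.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer fingerprint."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Jaccard-Tanimoto similarity of two binary fingerprints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def threshold_search(query, fingerprints, t):
    """Return (index, similarity) for fingerprints with similarity >= t."""
    query_bits = popcount(query)
    hits = []
    for idx, fp in enumerate(fingerprints):
        fp_bits = popcount(fp)
        lo, hi = min(query_bits, fp_bits), max(query_bits, fp_bits)
        # Upper bound: |A & B| <= lo and |A | B| >= hi, so similarity <= lo / hi.
        if hi and lo / hi < t:
            continue
        sim = tanimoto(query, fp)
        if sim >= t:
            hits.append((idx, sim))
    return hits


fingerprints = [0b1111, 0b0111, 0b0001, 0b11110000]
hits = threshold_search(0b1111, fingerprints, 0.7)
```

In this example the third fingerprint is rejected by the bound alone, while the fourth passes the bound but fails the exact check, which shows why a bound only reduces, and never replaces, verification.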

very large data bases | 2012

On the optimization of schedules for MapReduce workloads in the presence of shared scans

Joel L. Wolf; Andrey Balmin; Deepak Rajan; Kirsten Hildrum; Rohit Khandekar; Sujay Parekh; Kun-Lung Wu; Rares Vernica

We consider MapReduce clusters designed to support multiple concurrent jobs, concentrating on environments in which the number of distinct datasets is modest relative to the number of jobs. In such scenarios, many individual datasets are likely to be scanned concurrently by multiple Map phase jobs. As has been noticed previously, this scenario provides an opportunity for Map phase jobs to cooperate, sharing the scans of these datasets, and thus reducing the costs of such scans. Our paper has three main contributions over previous work. First, we present a novel and highly general method for sharing scans and thus amortizing their costs. This concept, which we call cyclic piggybacking, has a number of advantages over the more traditional batching scheme described in the literature. Second, we notice that the various subjobs generated in this manner can be assumed in an optimal schedule to respect a natural chain precedence ordering. Third, we describe a significant but natural generalization of the recently introduced FLEX scheduler for optimizing schedules within the context of this cyclic piggybacking paradigm, which can be tailored to a variety of cost metrics. Such cost metrics include average response time, average stretch, and any minimax-type metric—a total of 11 separate and standard metrics in all. Moreover, most of this carries over in the more general case of overlapping rather than identical datasets as well, employing what we will call semi-shared scans. In such scenarios, chain precedence is replaced by arbitrary precedence, but we can still handle 8 of the original 11 metrics. The overall approach, including both cyclic piggybacking and the FLEX scheduling generalization, is called CIRCUMFLEX. We describe some practical implementation strategies. And we evaluate the performance of CIRCUMFLEX via a variety of simulation and real benchmark experiments.

very large data bases | 2008

SEPIA: estimating selectivities of approximate string predicates in large databases

Liang Jin; Chen Li; Rares Vernica

Many database applications have the emerging need to support approximate queries that ask for strings that are similar to a given string, such as “name similar to smith” and “telephone number similar to 412-0964”. Query optimization needs the selectivity of such an approximate predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of approximate string predicates. We develop a novel technique, called Sepia, to solve the problem. Given a bag of strings, our technique groups the strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance metric. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of approximate string predicates.
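The quantity being estimated can be made concrete by computing it exactly on a tiny collection. The sketch below is a brute-force baseline, not the Sepia technique: it computes the edit distance from the query to every string and reports the fraction within a distance bound. The function names, the bound `k`, and the sample strings are assumptions for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_selectivity(query, strings, k):
    """Fraction of strings whose edit distance to query is at most k."""
    if not strings:
        return 0.0
    matches = sum(1 for s in strings if edit_distance(query, s) <= k)
    return matches / len(strings)


names = ["smith", "smyth", "smithe", "jones"]
selectivity = exact_selectivity("smith", names, k=1)
```

This exact scan touches every string, which is what a query optimizer cannot afford at planning time; a histogram-based estimator such as the one described above trades that exactness for speed.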

Operating Systems Review | 2012

CIRCUMFLEX: a scheduling optimizer for MapReduce workloads with shared scans

Joel L. Wolf; Andrey Balmin; Deepak Rajan; Kirsten Hildrum; Rohit Khandekar; Sujay Parekh; Kun-Lung Wu; Rares Vernica

engineering interactive computing system | 2015

To print or not to print: hybrid learning with METIS learning platform

Joshua M. Hailpern; Rares Vernica; Molly Bullock; Udi Chatow; Jian Fan; Georgia Koutrika; Jerry Liu; Lei Liu; Steven J. Simske; Shanchan Wu

As part of the explosion in educational software, online tools, and open educational resources there has been a rapid devaluation of printed textbooks. While digital texts have advantages, printed textbooks still provide irreplaceable value over online media. Therefore technology should enhance, rather than eliminate printed text. To this end, this paper presents METIS, a hybrid learning software/service platform that is designed to support active reading. METIS provides easy digital-to-print-to-digital usage, simple creation of Cheat Sheets & FlexNotes for personal note taking and organization, and a custom flexible rendering & publishing engine for education called Aero. METIS was designed based on lessons learned from a formative study of 523 students at SJSU, and validated through focus groups involving 32 educators and students at both high school and college levels.

document engineering | 2015

AERO: An Extensible Framework for Adaptive Web Layout Synthesis

Rares Vernica; Niranjan Damera Venkata

We present AERO, an extensible framework for adaptive web layout synthesis. The goal is to provide an underlying software architecture to allow general adaptive layout behaviors. The framework consists of 1) a suite of templates specified in HTML/CSS, 2) a hierarchical, highly customizable scoring function specification, and 3) an evaluation engine that leverages native browser rendering to rapidly render content and apply the scoring functions. Unlike current responsive layout frameworks for the web (e.g., Twitter Bootstrap) that have pre-configured grid layouts that adapt in a manually pre-encoded, content-independent manner, AERO allows layout to adapt automatically based on multiple content-dependent criteria such as aesthetic quality, cropability of individual images, layout A/B testing results, and ad placement.
