Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Florin Rusu is active.

Publication


Featured research published by Florin Rusu.


Science | 2017

Illuminating gravitational waves: A concordant picture of photons from a neutron star merger

Mansi M. Kasliwal; Ehud Nakar; L. P. Singer; David L. Kaplan; David O. Cook; A. Van Sistine; Ryan M. Lau; C. Fremling; O. Gottlieb; Jacob E. Jencson; S. M. Adams; U. Feindt; Kenta Hotokezaka; S. Ghosh; Daniel A. Perley; Po-Chieh Yu; Tsvi Piran; J. R. Allison; G. C. Anupama; A. Balasubramanian; Keith W. Bannister; John Bally; J. Barnes; Sudhanshu Barway; Eric C. Bellm; V. Bhalerao; D. Bhattacharya; N. Blagorodnova; J. S. Bloom; P. R. Brady

GROWTH observations of GW170817. The gravitational wave event GW170817 was caused by the merger of two neutron stars (see the Introduction by Smith). In three papers, teams associated with the GROWTH (Global Relay of Observatories Watching Transients Happen) project present their observations of the event at wavelengths from x-rays to radio waves. Evans et al. used space telescopes to detect GW170817 in the ultraviolet and place limits on its x-ray flux, showing that the merger generated a hot explosion known as a blue kilonova. Hallinan et al. describe radio emissions generated as the explosion slammed into the surrounding gas within the host galaxy. Kasliwal et al. present additional observations in the optical and infrared and formulate a model for the event involving a cocoon of material expanding at close to the speed of light, matching the data at all observed wavelengths. Science, this issue p. 1565, p. 1579, p. 1559; see also p. 1554.

Observations of a binary neutron star merger at multiple wavelengths can be explained by an off-axis relativistic cocoon model. Merging neutron stars offer an excellent laboratory for simultaneously studying strong-field gravity and matter in extreme environments. We establish the physical association of an electromagnetic counterpart (EM170817) with gravitational waves (GW170817) detected from merging neutron stars. By synthesizing a panchromatic data set, we demonstrate that merging neutron stars are a long-sought production site forging heavy elements by r-process nucleosynthesis. The weak gamma rays seen in EM170817 are dissimilar to classical short gamma-ray bursts with ultrarelativistic jets. Instead, we suggest that breakout of a wide-angle, mildly relativistic cocoon engulfing the jet explains the low-luminosity gamma rays, the high-luminosity ultraviolet-optical-infrared, and the delayed radio and x-ray emission. We posit that all neutron star mergers may lead to a wide-angle cocoon breakout, sometimes accompanied by a successful jet and sometimes by a choked jet.


international conference on management of data | 2012

GLADE: big data analytics made easy

Yu Cheng; Chengjie Qin; Florin Rusu

We present GLADE, a scalable distributed system for large scale data analytics. GLADE takes analytical functions expressed through the User-Defined Aggregate (UDA) interface and executes them efficiently on the input data. The entire computation is encapsulated in a single class which requires the definition of four methods. The runtime takes the user code and executes it right near the data by taking full advantage of the parallelism available inside a single machine as well as across a cluster of computing nodes. The demonstration has two goals. First, it presents the architecture of GLADE and how processing is done by using a series of analytical functions. Second, it compares GLADE with two different classes of systems for data analytics: a relational database (PostgreSQL) enhanced with UDAs and Map-Reduce (Hadoop). We show how the analytical functions are coded into each of these systems (for Map-Reduce, we use both Java code as well as Pig Latin) and compare their expressiveness, scalability, and running time efficiency.
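
To make the four-method contract concrete, here is a minimal sketch in Python. The method names (accumulate, merge, terminate, plus the constructor as initialization) follow the conventional UDA interface; GLADE's actual C++ signatures may differ.

```python
# Hypothetical sketch of the four-method UDA contract described above.
# Method names are the conventional UDA ones, not GLADE's exact API.

class AverageUDA:
    def __init__(self):            # 1. initialize the aggregate state
        self.count = 0
        self.total = 0.0

    def accumulate(self, value):   # 2. fold one input tuple into the state
        self.count += 1
        self.total += value

    def merge(self, other):        # 3. combine partial states computed in
        self.count += other.count  #    parallel (threads or cluster nodes)
        self.total += other.total

    def terminate(self):           # 4. produce the final result
        return self.total / self.count if self.count else None

# The runtime runs one instance per thread/node near the data,
# then merges the partial states and calls terminate once.
parts = [AverageUDA(), AverageUDA()]
for v in [1.0, 2.0]: parts[0].accumulate(v)
for v in [3.0, 4.0]: parts[1].accumulate(v)
parts[0].merge(parts[1])
print(parts[0].terminate())  # 2.5
```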


international conference on management of data | 2007

Statistical analysis of sketch estimators

Florin Rusu; Alin Dobra

Sketching techniques can provide approximate answers to aggregate queries either for data-streaming or distributed computation. Small space summaries that have linearity properties are required for both types of applications. The prevalent method for analyzing sketches uses moment analysis and distribution independent bounds based on moments. This method produces clean, easy to interpret, theoretical bounds that are especially useful for deriving asymptotic results. However, the theoretical bounds obscure fine details of the behavior of various sketches and they are mostly not indicative of which type of sketches should be used in practice. Moreover, no significant empirical comparison between various sketching techniques has been published, which makes the choice even harder. In this paper, we take a close look at the sketching techniques proposed in the literature from a statistical point of view with the goal of determining properties that indicate the actual behavior and producing tighter confidence bounds. Interestingly, the statistical analysis reveals that two of the techniques, Fast-AGMS and Count-Min, provide results that are in some cases orders of magnitude better than the corresponding theoretical predictions. We conduct an extensive empirical study that compares the different sketching techniques in order to corroborate the statistical analysis with the conclusions we draw from it. The study indicates the expected performance of various sketches, which is crucial if the techniques are to be used by practitioners. The overall conclusion of the study is that Fast-AGMS sketches are, for the full spectrum of problems, either the best, or close to the best, sketching technique. This makes Fast-AGMS sketches the preferred choice irrespective of the situation.
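
As a concrete reference for one of the two techniques singled out above, here is a minimal Count-Min sketch in Python; the width and depth parameters and the 2-universal hash family are illustrative choices, not the configuration used in the paper.

```python
import random

class CountMin:
    """Minimal Count-Min sketch for frequency estimation.
    Width/depth and the 2-universal hash family are illustrative."""
    P = (1 << 61) - 1  # Mersenne prime for the hash family

    def __init__(self, width=2048, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.ab = [(rng.randrange(1, self.P), rng.randrange(self.P))
                   for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _h(self, row, x):
        a, b = self.ab[row]
        return ((a * x + b) % self.P) % self.width

    def update(self, x, c=1):
        for r in range(self.depth):
            self.table[r][self._h(r, x)] += c

    def estimate(self, x):
        # point query: minimum over rows; for nonnegative counts this
        # can only overestimate, never underestimate
        return min(self.table[r][self._h(r, x)] for r in range(self.depth))
```

Note that Count-Min point queries never underestimate nonnegative counts, a structural property that distribution-independent moment bounds do not capture.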


ACM Transactions on Database Systems | 2008

Sketches for size of join estimation

Florin Rusu; Alin Dobra

Sketching techniques provide approximate answers to aggregate queries both for data-streaming and distributed computation. Small space summaries that have linearity properties are required for both types of applications. The prevalent method for analyzing sketches uses moment analysis and distribution-independent bounds based on moments. This method produces clean, easy to interpret, theoretical bounds that are especially useful for deriving asymptotic results. However, the theoretical bounds obscure fine details of the behavior of various sketches and they are mostly not indicative of which type of sketches should be used in practice. Moreover, no significant empirical comparison between various sketching techniques has been published, which makes the choice even harder. In this article we take a close look at the sketching techniques proposed in the literature from a statistical point of view with the goal of determining properties that indicate the actual behavior and producing tighter confidence bounds. Interestingly, the statistical analysis reveals that two of the techniques, Fast-AGMS and Count-Min, provide results that are in some cases orders of magnitude better than the corresponding theoretical predictions. We conduct an extensive empirical study that compares the different sketching techniques in order to corroborate the statistical analysis with the conclusions we draw from it. The study indicates the expected performance of various sketches, which is crucial if the techniques are to be used by practitioners. The overall conclusion of the study is that Fast-AGMS sketches are, for the full spectrum of problems, either the best, or close to the best, sketching technique. We apply the insights obtained from the statistical study and the experimental results to design effective algorithms for sketching interval data. We show how the two basic methods for sketching interval data, DMAP and fast range-summation, can be improved significantly with respect to the update time without a significant loss in accuracy. The gain in update time can be as large as two orders of magnitude, thus making the improved methods practical. The empirical study suggests that DMAP is preferable when update time is the critical requirement and fast range-summation is desirable for better accuracy.
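
To illustrate the basic Fast-AGMS estimator discussed here: each key is routed to one bucket per row by a hash function and multiplied by a ±1 sign, and the size of join is estimated as the median over rows of the inner products of the two sketches. The degree-3 polynomial hashes below stand in for the 4-wise independent families the analysis assumes; this is a sketch, not the paper's implementation.

```python
import random

P = (1 << 61) - 1  # Mersenne prime field for the polynomial hashes

def poly_hash(coeffs, x):
    v = 0
    for c in coeffs:               # Horner evaluation over GF(P)
        v = (v * x + c) % P
    return v

class FastAGMS:
    """Illustrative Fast-AGMS sketch: per row, one bucket hash h
    and one ±1 sign hash, both degree-3 polynomials (4-wise)."""
    def __init__(self, buckets=512, rows=7, seed=0):
        rng = random.Random(seed)
        self.buckets, self.rows = buckets, rows
        self.hc = [[rng.randrange(P) for _ in range(4)] for _ in range(rows)]
        self.sc = [[rng.randrange(P) for _ in range(4)] for _ in range(rows)]
        self.S = [[0] * buckets for _ in range(rows)]

    def update(self, key, count=1):
        for r in range(self.rows):
            b = poly_hash(self.hc[r], key) % self.buckets
            # low-bit sign extraction: a small-bias (O(1/P)) shortcut
            sign = 1 if poly_hash(self.sc[r], key) & 1 else -1
            self.S[r][b] += sign * count

def join_size(sk1, sk2):
    # median over rows of the inner products of the sketch vectors
    ests = sorted(sum(a * b for a, b in zip(r1, r2))
                  for r1, r2 in zip(sk1.S, sk2.S))
    return ests[len(ests) // 2]

r = FastAGMS(seed=42); s = FastAGMS(seed=42)  # must share hash functions
for k in [1, 2, 2, 3]: r.update(k)
for k in [2, 3, 3, 5]: s.update(k)
print(join_size(r, s))  # true size of join is 2*1 + 1*2 = 4
```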


Distributed and Parallel Databases | 2014

PF-OLA: a high-performance framework for parallel online aggregation

Chengjie Qin; Florin Rusu

Online aggregation provides estimates to the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. This allows for the interactive data exploration of the largest datasets. In this paper we introduce the first framework for parallel online aggregation in which the estimation virtually does not incur any overhead on top of the actual execution. We define a generic interface to express any estimation model that abstracts completely the execution details. We design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive 8 TB TPC-H instance, the estimator provides accurate confidence bounds early in the execution even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size and without incurring overhead.
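
Independent of PF-OLA's specific estimator, the core mechanism can be sketched as follows: process tuples in random order, maintain a running estimate scaled from the sample mean, and stop when a CLT-style confidence bound is tight enough. The function name, stopping rule, and the explicit shuffle below are illustrative assumptions, not the framework's interface.

```python
import math, random

def online_sum(values, z=1.96, target_rel_error=0.01, seed=0):
    """Online aggregation sketch for SUM over a finite table.
    Assumes a random processing order (here: an explicit shuffle)."""
    data = list(values)
    random.Random(seed).shuffle(data)
    N = len(data)
    s = s2 = 0.0
    for n, v in enumerate(data, start=1):
        s += v
        s2 += v * v
        mean = s / n
        est = N * mean                      # unbiased estimate of the SUM
        var = max(s2 / n - mean * mean, 0.0)
        # CLT half-width with finite-population correction (no replacement)
        half = z * N * math.sqrt(var / n) * math.sqrt((N - n) / max(N - 1, 1))
        if n > 30 and half <= target_rel_error * abs(est):
            return est, half, n             # stop early: estimate is tight
    return s, 0.0, N                        # fell through: exact answer

est, half, used = online_sum(range(1_000_000))
print(f"estimate {est:.0f} +/- {half:.0f} after {used} of 1000000 tuples")
```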


Operating Systems Review | 2012

GLADE: a scalable framework for efficient analytics

Florin Rusu; Alin Dobra

In this paper we introduce GLADE, a scalable distributed framework for large scale data analytics. GLADE consists of a simple user interface to define Generalized Linear Aggregates (GLA), the fundamental abstraction at the core of GLADE, and a distributed runtime environment that executes GLAs by using parallelism extensively. GLAs are derived from User-Defined Aggregates (UDA), a relational database extension that allows the user to add specialized aggregates to be executed inside the query processor. GLAs extend the UDA interface with methods to Serialize/Deserialize the state of the aggregate required for distributed computation. As a significant departure from UDAs, which can be invoked only through SQL, GLAs give the user direct access to the state of the aggregate, thus allowing for the computation of significantly more complex aggregate functions. The GLADE runtime is an execution engine optimized for the GLA computation. The runtime takes the user-defined GLA code, compiles it inside the engine, and executes it right near the data by taking advantage of parallelism both inside a single machine as well as across a cluster of computers. This results in maximum possible execution time performance (all our experimental tasks are I/O-bound) and linear scaleup.
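
A minimal sketch of the GLA extension described above: the UDA contract plus Serialize/Deserialize so partial states can cross machine boundaries. Method names and the byte layout are assumptions, not GLADE's actual C++ interface.

```python
import struct

class AverageGLA:
    """Sketch of a GLA: the UDA contract plus Serialize/Deserialize so
    partial aggregate states can be shipped between cluster nodes."""
    def __init__(self):
        self.count, self.total = 0, 0.0

    def accumulate(self, v):
        self.count += 1
        self.total += v

    def merge(self, other):
        self.count += other.count
        self.total += other.total

    def serialize(self):                 # state -> bytes for the network
        return struct.pack("<qd", self.count, self.total)

    @classmethod
    def deserialize(cls, buf):           # bytes -> state on the receiver
        g = cls()
        g.count, g.total = struct.unpack("<qd", buf)
        return g

    def terminate(self):
        return self.total / self.count if self.count else None

# a remote node ships its partial state; the coordinator merges it
remote = AverageGLA(); remote.accumulate(10.0)
local = AverageGLA(); local.accumulate(2.0)
local.merge(AverageGLA.deserialize(remote.serialize()))
print(local.terminate())  # 6.0
```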


international conference on management of data | 2014

Parallel in-situ data processing with speculative loading

Yu Cheng; Florin Rusu

Traditional databases incur a significant data-to-query delay due to the requirement to load data inside the system before querying. Since this is not acceptable in many domains generating massive amounts of raw data, e.g., genomics, databases are entirely discarded. External tables, on the other hand, provide instant SQL querying over raw files. Their performance across a query workload is limited though by the speed of repeated full scans, tokenizing, and parsing of the entire file. In this paper, we propose SCANRAW, a novel database physical operator for in-situ processing over raw files that integrates data loading and external tables seamlessly while preserving their advantages: optimal performance across a query workload and zero time-to-query. Our major contribution is a parallel super-scalar pipeline implementation that allows SCANRAW to take advantage of the current many- and multi-core processors by overlapping the execution of independent stages. Moreover, SCANRAW overlaps query processing with loading by speculatively using the additional I/O bandwidth arising during the conversion process for storing data into the database such that subsequent queries execute faster. As a result, SCANRAW makes optimal use of the available system resources -- CPU cycles and I/O bandwidth -- by switching dynamically between tasks to ensure that optimal performance is achieved. We implement SCANRAW in a state-of-the-art database system and evaluate its performance across a variety of synthetic and real-world datasets. Our results show that SCANRAW with speculative loading achieves optimal performance for a query sequence at any point in the processing. Moreover, SCANRAW maximizes resource utilization for the entire workload execution while speculatively loading data and without interfering with normal query processing.
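
A toy rendering of the speculative-loading idea: a pipeline parses raw chunks to answer the current query and, whenever the I/O path would otherwise be idle, writes the already-converted chunks into a binary store so later queries skip the parse. Everything below (the chunking, the in-memory store, the idle test) is a simplification of SCANRAW's super-scalar pipeline, not its implementation.

```python
from concurrent.futures import ThreadPoolExecutor

binary_store = {}   # chunk_id -> parsed rows (stand-in for native storage)

def parse_chunk(text):
    # tokenize + parse: the expensive raw-file work SCANRAW tries to pay once
    return [tuple(map(int, line.split(','))) for line in text.splitlines()]

def scan_raw(chunks, predicate, io_idle=lambda: True):
    """Answer one query over raw chunks, speculatively storing parsed
    chunks in binary_store when spare I/O bandwidth is available
    (a toy stand-in for SCANRAW's dynamic task scheduling)."""
    out = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {cid: (None if cid in binary_store
                         else pool.submit(parse_chunk, text))
                   for cid, text in enumerate(chunks)}
        for cid, fut in futures.items():
            rows = binary_store[cid] if fut is None else fut.result()
            if fut is not None and io_idle():
                binary_store[cid] = rows   # speculative load
            out.extend(r for r in rows if predicate(r))
    return out

chunks = ["1,10\n2,20", "3,30\n4,40"]
print(scan_raw(chunks, lambda r: r[1] >= 20))  # first query parses raw data
print(scan_raw(chunks, lambda r: r[0] == 1))   # second reads binary_store
```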


Proceedings of the Second Workshop on Data Analytics in the Cloud | 2013

Scalable I/O-bound parallel incremental gradient descent for big data analytics in GLADE

Chengjie Qin; Florin Rusu

Incremental gradient descent is a general technique to solve a large class of convex optimization problems arising in many machine learning tasks. GLADE is a parallel infrastructure for big data analytics providing a generic task specification interface. In this paper, we present a scalable and efficient parallel solution for incremental gradient descent in GLADE. We provide empirical evidence that our solution is limited only by the physical hardware characteristics, uses effectively the available resources, and achieves maximum scalability. When deployed in the cloud, our solution has the potential to dramatically reduce the cost of complex analytics over massive datasets.
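
As a baseline for what the paper parallelizes inside GLADE, here is a serial incremental gradient descent for least-squares regression: one gradient step per example. The parallel, I/O-bound version partitions the data across GLAs and merges per-partition models, which this sketch omits.

```python
import random

def igd_least_squares(data, dim, epochs=50, step=0.01, seed=0):
    """Serial incremental gradient descent for least squares."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(data)                 # visit examples in random order
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i in range(dim):          # w <- w - step * err * x
                w[i] -= step * err * x[i]
    return w

# recover y = 2*x0 + 3*x1 from noiseless examples
data = [((a, b), 2 * a + 3 * b) for a in range(5) for b in range(5)]
print(igd_least_squares(data, dim=2))     # approximately [2.0, 3.0]
```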


ACM Transactions on Database Systems | 2007

Pseudo-random number generation for sketch-based estimations

Florin Rusu; Alin Dobra

The exact computation of aggregate queries, like the size of join of two relations, usually requires large amounts of memory (constrained in data-streaming) or communication (constrained in distributed computation) and large processing times. In this situation, approximation techniques with provable guarantees, like sketches, are one possible solution. The performance of sketches depends crucially on the ability to generate particular pseudo-random numbers. In this article we investigate both theoretically and empirically the problem of generating k-wise independent pseudo-random numbers and, in particular, that of generating 3- and 4-wise independent pseudo-random numbers that are fast range-summable (i.e., they can be summed in sublinear time). Our specific contributions are: (a) we provide a thorough comparison of the various pseudo-random number generating schemes; (b) we study both theoretically and empirically the fast range-summation property of 3- and 4-wise independent generating schemes; (c) we provide algorithms for the fast range-summation of two 3-wise independent schemes, BCH and extended Hamming; and (d) we show convincing theoretical and empirical evidence that the extended Hamming scheme performs as well as any 4-wise independent scheme for estimating the size of join of two relations using AMS sketches, even though it is only 3-wise independent. We use this scheme to generate estimators that significantly outperform state-of-the-art solutions for two problems, namely, size of spatial joins and selectivity estimation.
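
The baseline construction underlying the comparison: a random polynomial of degree k-1 over a prime field yields k-wise independent outputs, from which ±1 seeds for AMS sketches are derived. The BCH and extended Hamming schemes and their sublinear range-summation algorithms are more specialized than this sketch, which only shows the generic polynomial generator and the linear-time range sum it improves on.

```python
import random

P = (1 << 61) - 1  # Mersenne prime, arithmetic in GF(P)

def kwise_generator(k, seed=0):
    """Return f(x): a k-wise independent family via a random
    degree-(k-1) polynomial over GF(P)."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(P) for _ in range(k)]
    def f(x):
        v = 0
        for c in coeffs:        # Horner evaluation of the polynomial
            v = (v * x + c) % P
        return v
    return f

def pm1(f, x):
    # derive a +/-1 seed from the low bit (bias O(1/P), negligible)
    return 1 if f(x) & 1 else -1

g4 = kwise_generator(4)          # 4-wise independent, as AMS sketches need

def naive_range_sum(f, a, b):
    # linear-time baseline; fast range-summable schemes (BCH, extended
    # Hamming) compute this sum in sublinear time
    return sum(pm1(f, x) for x in range(a, b + 1))
```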


international conference on data engineering | 2009

Sketching Sampled Data Streams

Florin Rusu; Alin Dobra

Sampling is used as a universal method to reduce the running time of computations -- the computation is performed on a much smaller sample and then the result is scaled to compensate for the difference in size. Sketches are a popular approximation method for data streams and they proved to be useful for estimating frequency moments and aggregates over joins. A possibility to further improve the time performance of sketches is to compute the sketch over a sample of the stream rather than the entire data stream. In this paper we analyze the behavior of the sketch estimator when computed over a sample of the stream, not the entire data stream, for the size of join and the self-join size problems. Our analysis is developed for a generic sampling process. We instantiate the results of the analysis for all three major types of sampling -- Bernoulli sampling which is used for load shedding, sampling with replacement which is used to generate i.i.d. samples from a distribution, and sampling without replacement which is used by online aggregation engines -- and compare these particular results with the results of the basic sketch estimator. Our experimental results show that the accuracy of the sketch computed over a small sample of the data is, in general, close to the accuracy of the sketch estimator computed over the entire data even when the sample size is only
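
A minimal instance of the analyzed setting for the self-join size problem: Bernoulli-sample the stream, measure the sample's second frequency moment, and unbias. The correction below is the textbook Bernoulli adjustment, and the sample's F2 is computed exactly here (in the paper it is itself estimated with a sketch); it is illustrative, not the paper's exact estimator.

```python
import random
from collections import Counter

def sampled_selfjoin_size(stream, p, seed=0):
    """Estimate the self-join size F2 = sum_i f_i^2 from a Bernoulli
    sample of the stream at rate p."""
    rng = random.Random(seed)
    freq = Counter(x for x in stream if rng.random() < p)
    n = sum(freq.values())                       # number of sampled items
    f2_sample = sum(c * c for c in freq.values())
    # E[f'_i^2] = p^2 f_i^2 + p(1-p) f_i, so remove the linear term
    return (f2_sample - (1 - p) * n) / (p * p)

rng = random.Random(1)
stream = [rng.randrange(1000) for _ in range(200_000)]
true_f2 = sum(c * c for c in Counter(stream).values())
print(true_f2, round(sampled_selfjoin_size(stream, p=0.1)))
```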

Collaboration


Dive into Florin Rusu's collaborations.

Top Co-Authors

Chengjie Qin, University of California
Weijie Zhao, University of California
Yu Cheng, University of California
Kesheng Wu, Lawrence Berkeley National Laboratory
Peter E. Nugent, Lawrence Berkeley National Laboratory
Bin Dong, Lawrence Berkeley National Laboratory
Martin Torres, University of California
A. Y. Q. Ho, California Institute of Technology
A. Smeu, Technical University of Cluj-Napoca