Is this you? Create Your Porfile

Benjamin Arai

University of California, Riverside

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Benjamin Arai is active.

Explore More

Publication

Featured researches published by Benjamin Arai.

knowledge discovery and data mining | 2009

On burstiness-aware search for document sequences

Theodoros Lappas; Benjamin Arai; Manolis Platakis; Dimitrios Kotsakos; Dimitrios Gunopulos

As the number and size of large timestamped collections (e.g. sequences of digitized newspapers, periodicals, blogs) increase, the problem of efficiently indexing and searching such data becomes more important. Term burstiness has been extensively researched as a mechanism to address event detection in the context of such collections. In this paper, we explore how burstiness information can be further utilized to enhance the search process. We present a novel approach to model the burstiness of a term, using discrepancy theory concepts. This allows us to build a parameter-free, linear-time approach to identify the time intervals of maximum burstiness for a given term. Finally, we describe the first burstiness-driven search framework and thoroughly evaluate our approach in the context of different scenarios.

international conference on data engineering | 2006

Approximating Aggregation Queries in Peer-to-Peer Networks

Benjamin Arai; Gautam Das; Dimitrios Gunopulos; Vana Kalogeraki

Peer-to-peer databases are becoming prevalent on the Internet for distribution and sharing of documents, applications, and other digital media. The problem of answering large scale, ad-hoc analysis queries ― e.g., aggregation queries ― on these databases poses unique challenges. Exact solutions can be time consuming and difficult to implement given the distributed and dynamic nature of peer-to-peer databases. In this paper we present novel sampling-based techniques for approximate answering of ad-hoc aggregation queries in such databases. Computing a high-quality random sample of the database efficiently in the P2P environment is complicated due to several factors ― the data is distributed (usually in uneven quantities) across many peers, within each peer the data is often highly correlated, and moreover, even collecting a random sample of the peers is difficult to accomplish. To counter these problems, we have developed an adaptive two-phase sampling approach, based on random walks of the P2P graph as well as block-level sampling techniques. We present extensive experimental evaluations to demonstrate the feasibility of our proposed solutio

IEEE Transactions on Knowledge and Data Engineering | 2007

Efficient Approximate Query Processing in Peer-to-Peer Networks

Benjamin Arai; Gautam Das; Dimitrios Gunopulos; Vana Kalogeraki

Peer-to-peer (P2P) databases are becoming prevalent on the Internet for distribution and sharing of documents, applications, and other digital media. The problem of answering large-scale ad hoc analysis queries, for example, aggregation queries, on these databases poses unique challenges. Exact solutions can be time consuming and difficult to implement, given the distributed and dynamic nature of P2P databases. In this paper, we present novel sampling-based techniques for approximate answering of ad hoc aggregation queries in such databases. Computing a high-quality random sample of the database efficiently in the P2P environment is complicated due to several factors: the data is distributed (usually in uneven quantities) across many peers, within each peer, the data is often highly correlated, and, moreover, even collecting a random sample of the peers is difficult to accomplish. To counter these problems, we have developed an adaptive two-phase sampling approach based on random walks of the P2P graph, as well as block-level sampling techniques. We present extensive experimental evaluations to demonstrate the feasibility of our proposed solution.

international conference on data engineering | 2008

Region Sampling: Continuous Adaptive Sampling on Sensor Networks

Song Lin; Benjamin Arai; Dimitrios Gunopulos; Gautam Das

Satisfying energy constraints while meeting performance requirements is a primary concern when a sensor network is being deployed. Many recent proposed techniques offer error bounding solutions for aggregate approximation but cannot guarantee energy spending. Inversely, our goal is to bound the energy consumption while minimizing the approximation error. In this paper, we propose an online algorithm, region sampling, for computing approximate aggregates while satisfying a pre-defined energy budget. Our algorithm is distinguished by segmenting a sensor network into partitions of non-overlapping regions and performing sampling and local aggregation for each region. The sampling energy cost rate and sampling statistics are collected and analyzed to predict the optimal sampling plan. Comprehensive experiments on real-world data sets indicate that our approach is at a minimum of 10% more accurate compared with the previously proposed solutions.

very large data bases | 2009

Anytime measures for top-k algorithms on exact and fuzzy data sets

Benjamin Arai; Gautam Das; Dimitrios Gunopulos; Nick Koudas

Top-k queries on large multi-attribute data sets are fundamental operations in information retrieval and ranking applications. In this article, we initiate research on the anytime behavior of top-k algorithms on exact and fuzzy data. In particular, given specific top-k algorithms (TA and TA-Sorted) we are interested in studying their progress toward identification of the correct result at any point during the algorithms’ execution. We adopt a probabilistic approach where we seek to report at any point of operation of the algorithm the confidence that the top-k result has been identified. Such a functionality can be a valuable asset when one is interested in reducing the runtime cost of top-k computations. We present a thorough experimental evaluation to validate our techniques using both synthetic and real data sets.

international conference on data mining | 2007

Efficient Data Sampling in Heterogeneous Peer-to-Peer Networks

Benjamin Arai; Song Lin; Dimitrios Gunopulos

Performing data-mining tasks such as clustering, classification, and prediction on large datasets is an arduous task and, many times, it is an infeasible task given current hardware limitations. The distributed nature of peer-to-peer databases further complicates this issue by introducing an access overhead cost in addition to the cost of sending individual tuples over the network. We propose a two-level sampling approach focusing on peer-to-peer databases for maximizing sample quality given a user-defined communication budget. Given that individual peers may have varying cardinality we propose an algorithm for determining the optimal sample rate (the percentage of tuples to sample from a peer) for each peer. We do this by analyzing the variance of individual peers, ultimately minimizing the total variance of the entire sample. By performing local optimization of individual peer sample rates we maximize approximation accuracy of the samples. We also offer several techniques for sampling in peer-to-peer databases given various amounts of known and unknown information about the network and its peers.

statistical and scientific database management | 2007

Reliable Hierarchical Data Storage in Sensor Networks

Song Lin; Benjamin Arai; Dimitrios Gunopulos

The ability to provide reliable in-network storage while balancing the energy consumption of individual sensors is a primary concern when deploying a sensor network. The main concern with data-centric storage in sensor networks is the ability to provide reliable and load balanced storage. Energy and wireless range constraints make centralized approaches for storage impractical, and in-network data-centric solutions can be used to reduce the number of messages sent over the network. However, these solutions quickly become expensive when combined with fault- tolerance, load balancing and routing. In this paper, we present a novel data-centric storage and query routing mechanism for sensor networks. The routing mechanism is constructed upon the neighborhood information of individual sensors and is completely independent of geographical information. Our data resilient algorithm is capable of recovering from multiple simultaneous failures in the network while adaptively adjusting the load distribution of the newly generated sensor data. Comprehensive experiments on both real-world and synthetic data sets indicate that our approach is more effective and efficient than the previously proposed solutions.

very large data bases | 2010

An access cost-aware approach for object retrieval over multiple sources

Benjamin Arai; Gautam Das; Dimitrios Gunopulos; Vagelis Hristidis; Nick Koudas

Source and object selection and retrieval from large multi-source data sets are fundamental operations in many applications. In this paper, we initiate research on efficient source (e.g., database) and object selection algorithms on large multi-source data sets. Specifically, in order to acquire a specified number of satisfying objects with minimum cost over multiple databases, the query engine needs to determine the access overhead for individual data sources, the overhead of retrieving objects from each source, and possibly other statistics such as estimating the frequency of finding a satisfying object in order to determine how many objects to retrieve from each data source. We adopt a probabilistic approach to source selection utilizing a cost structure and a dynamic programming model for computing the optimal number of objects to retrieve from each data source. Such a structure can be a valuable asset where there is a monetary or time related cost associated with accessing large distributed databases. We present a thorough experimental evaluation to validate our techniques using real-world data sets.

very large data bases | 2007