Mohamed A. Soliman | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Mohamed A. Soliman is active.

Explore More

Publication

Featured researches published by Mohamed A. Soliman.

ACM Computing Surveys | 2008

A survey of top- k query processing techniques in relational database systems

Ihab F. Ilyas; George Beskales; Mohamed A. Soliman

Efficient processing of top-k queries is a crucial requirement in many interactive environments that involve massive amounts of data. In particular, efficient top-k processing in domains such as the Web, multimedia search, and distributed systems has shown a great impact on performance. In this survey, we describe and classify top-k processing techniques in relational databases. We discuss different design dimensions in the current techniques including query models, data access methods, implementation levels, data and query certainty, and supported scoring functions. We show the implications of each dimension on the design of the underlying techniques. We also discuss top-k queries in XML domain, and show their connections to relational approaches.

very large data bases | 2008

Efficient search for the top-k probable nearest neighbors in uncertain databases

George Beskales; Mohamed A. Soliman; Ihab F. Ilyas

Uncertainty pervades many domains in our lives. Current real-life applications, e.g., location tracking using GPS devices or cell phones, multimedia feature extraction, and sensor data management, deal with different kinds of uncertainty. Finding the nearest neighbor objects to a given query point is an important query type in these applications. In this paper, we study the problem of finding objects with the highest marginal probability of being the nearest neighbors to a query object. We adopt a general uncertainty model allowing for data and query uncertainty. Under this model, we define new query semantics, and provide several efficient evaluation algorithms. We analyze the cost factors involved in query evaluation, and present novel techniques to address the trade-offs among these factors. We give multiple extensions to our techniques including handling dependencies among data objects, and answering threshold queries. We conduct an extensive experimental study to evaluate our techniques on both real and synthetic data.

very large data bases | 2011

Automatic wrappers for large scale web extraction

Nilesh N. Dalvi; Ravi Kumar; Mohamed A. Soliman

We present a generic framework to make wrapper induction algorithms tolerant to noise in the training data. This enables us to learn wrappers in a completely unsupervised manner from automatically and cheaply obtained noisy training data, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based techniques require, we are able to perform information extraction at web-scale, with accuracy unattained with existing unsupervised extraction techniques. Our system is used in production at Yahoo! and powers live applications.

ACM Transactions on Database Systems | 2008

Probabilistic top- k and ranking-aggregate queries

Mohamed A. Soliman; Ihab F. Ilyas; Kevin Chen Chuan Chang

Ranking and aggregation queries are widely used in data exploration, data analysis, and decision-making scenarios. While most of the currently proposed ranking and aggregation techniques focus on deterministic data, several emerging applications involve data that is unclean or uncertain. Ranking and aggregating uncertain (probabilistic) data raises new challenges in query semantics and processing, making conventional methods inapplicable. Furthermore, uncertainty imposes probability as a new ranking dimension that does not exist in the traditional settings. In this article we introduce new probabilistic formulations for top-k and ranking-aggregate queries in probabilistic databases. Our formulations are based on marriage of traditional top-k semantics with possible worlds semantics. In the light of these formulations, we construct a generic processing framework supporting both query types, and leveraging existing query processing and indexing capabilities in current RDBMSs. The framework encapsulates a state space model and efficient search algorithms to compute query answers. Our proposed techniques minimize the number of accessed tuples and the size of materialized search space to compute query answers. Our experimental study shows the efficiency of our techniques under different data distributions with orders of magnitude improvement over naïve methods.

very large data bases | 2010

Supporting ranking queries on uncertain and incomplete data

Mohamed A. Soliman; Ihab F. Ilyas; Shalev Ben-David

Large databases with uncertain information are becoming more common in many applications including data integration, location tracking, and Web search. In these applications, ranking records with uncertain attributes introduces new problems that are fundamentally different from conventional ranking. Specifically, uncertainty in records’ scores induces a partial order over records, as opposed to the total order that is assumed in the conventional ranking settings. In this paper, we present a new probabilistic model, based on partial orders, to encapsulate the space of possible rankings originating from score uncertainty. Under this model, we formulate several ranking query types with different semantics. We describe and analyze a set of efficient query evaluation algorithms. We show that our techniques can be used to solve the problem of rank aggregation in partial orders under two widely adopted distance metrics. In addition, we design sampling techniques based on Markov chains to compute approximate query answers. Our experimental evaluation uses both real and synthetic data. The experimental study demonstrates the efficiency and effectiveness of our techniques under various configurations.

international conference on management of data | 2011

Ranking with uncertain scoring functions: semantics and sensitivity measures

Mohamed A. Soliman; Ihab F. Ilyas; Davide Martinenghi; Marco Tagliasacchi

Ranking queries report the top-K results according to a user-defined scoring function. A widely used scoring function is the weighted summation of multiple scores. Often times, users cannot precisely specify the weights in such functions in order to produce the preferred order of results. Adopting uncertain/incomplete scoring functions (e.g., using weight ranges and partially-specified weight preferences) can better capture users preferences in this scenario. In this paper, we study two aspects in uncertain scoring functions. The first aspect is the semantics of ranking queries, and the second aspect is the sensitivity of computed results to refinements made by the user. We formalize and solve multiple problems under both aspects, and present novel techniques that compute query results efficiently to comply with the interactive nature of these problems.

very large data bases | 2009

Modeling and querying possible repairs in duplicate detection

George Beskales; Mohamed A. Soliman; Ihab F. Ilyas; Shai Ben-David

One of the most prominent data quality problems is the existence of duplicate records. Current duplicate elimination procedures usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. Furthermore, replacing the input dirty data with one possible clean instance may result in unrecoverable errors, for example, identification and merging of possible duplicate records in health care systems. In this paper, we treat duplicate detection procedures as data processing tasks with uncertain outcomes. We concentrate on a family of duplicate detection algorithms that are based on parameterized clustering. We propose a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. We show how to efficiently support relational queries under our model, and to allow new types of queries on the set of possible repairs. We give an experimental study illustrating the scalability and the efficiency of our techniques in different configurations.

international conference on management of data | 2014

Orca: a modular query optimizer architecture for big data

Mohamed A. Soliman; Lyublena Antova; Venkatesh Raghavan; Amr El-Helw; Zhongxian Gu; Entong Shen; George Constantin Caragea; Carlos Garcia-Alvarado; Foyzur Rahman; Michalis Petropoulos; Florian Waas; Sivaramakrishnan Narayanan; Konstantinos Krikellas; Rhonda Baldwin

The performance of analytical query processing in data management systems depends primarily on the capabilities of the systems query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer. In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with own original research resulting in a modular and portable optimizer architecture. In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.

international conference on management of data | 2007

URank: formulation and efficient evaluation of top-k queries in uncertain databases

Mohamed A. Soliman; Ihab F. Ilyas; Kevin Chen Chuan Chang

Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between query scores and data uncertainty makes traditional techniques inapplicable. We introduce URank, a system that processes new probabilistic formulations of top-k queries inuncertain databases. The new formulations are based on marriage of traditional top-k semantics with possible worlds semantics. URank encapsulates a new processing framework that leverages existing query processing capabilities, and implements efficient search strategies that integrate ranking on scores with ranking on probabilities, to obtain meaningful answers for top-k queries.

Synthesis Lectures on Data Management | 2011

Probabilistic Ranking Techniques in Relational Databases

Ihab F. Ilyas; Mohamed A. Soliman

Ranking queries are widely used in data exploration, data analysis and decision making scenarios. While most of the currently proposed ranking techniques focus on deterministic data, several emerging applications involve data that are imprecise or uncertain. Ranking uncertain data raises new challenges in query semantics and processing, making conventional methods inapplicable. Furthermore, the interplay between ranking and uncertainty models introduces new dimensions for ordering query results that do not exist in the traditional settings. This lecture describes new formulations and processing techniques for ranking queries on uncertain data. The formulations are based on marriage of traditional ranking semantics with possible worlds semantics under widely-adopted uncertainty models. In particular, we focus on discussing the impact of tuple-level and attribute-level uncertainty on the semantics and processing techniques of ranking queries. Under the tuple-level uncertainty model, we describe new processing techniques leveraging the capabilities of relational database systems to recognize and handle data uncertainty in score-based ranking. Under the attribute-level uncertainty model, we describe new probabilistic ranking models and a set of query evaluation algorithms, including sampling-based techniques. We also discuss supporting rank join queries on uncertain data, and we show how to extend current rank join methods to handle uncertainty in scoring attributes. Table of Contents: Introduction / Uncertainty Models / Query Semantics / Methodologies / Uncertain Rank Join / Conclusion

Explore More