Mohamed Yakout
Purdue University
Publication
Featured research published by Mohamed Yakout.
very large data bases | 2011
Mohamed Yakout; Ahmed K. Elmagarmid; Jennifer Neville; Mourad Ouzzani; Ihab F. Ilyas
In this paper we present GDR, a Guided Data Repair framework that incorporates user feedback in the cleaning process to enhance and accelerate existing automatic repair techniques while minimizing user involvement. GDR consults the user on the updates that are most likely to be beneficial in improving data quality. GDR also uses machine learning methods to identify and apply the correct updates directly to the database without involving the user in those specific updates. To rank potential updates for consultation by the user, we first group these repairs and quantify the utility of each group using the decision-theory concept of value of information (VOI). We then apply active learning to order updates within a group based on their ability to improve the learned model. User feedback is used to repair the database and to adaptively refine the training set for the model. We empirically evaluate GDR on a real-world dataset and show a significant improvement in data quality using our user-guided repairing process. We also assess the trade-off between user effort and the resulting data quality.
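A minimal sketch of the kind of ranking the abstract describes, under simplifying assumptions: candidate repairs are grouped (here, by the attribute they would change) and each group is scored by a probability-weighted estimate of its data-quality benefit as a crude proxy for the value-of-information computation. The `Repair` record and its probability and gain fields are hypothetical stand-ins, not GDR's actual data structures, and the active-learning ordering within a group is omitted.

```python
# Illustrative sketch of VOI-style ranking of candidate repair groups.
# `Repair`, `p_correct`, and `quality_gain` are hypothetical stand-ins;
# GDR's actual VOI computation and active-learning ordering are richer.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Repair:
    tuple_id: int
    attribute: str          # attribute the repair would change
    new_value: str
    p_correct: float        # learned probability the suggested value is right
    quality_gain: float     # estimated quality improvement if applied

def rank_repair_groups(repairs):
    """Group repairs by attribute and order groups by expected benefit."""
    groups = defaultdict(list)
    for r in repairs:
        groups[r.attribute].append(r)
    # Expected utility of consulting the user on a group: sum of
    # probability-weighted quality gains (a simple proxy for VOI).
    scored = [(sum(r.p_correct * r.quality_gain for r in g), attr, g)
              for attr, g in groups.items()]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(attr, g) for _, attr, g in scored]
```

The highest-scoring groups would be shown to the user first; feedback on those repairs then updates both the database and the training data for the learned model.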
IEEE Transactions on Visualization and Computer Graphics | 2010
Ross Maciejewski; Stephen Rudolph; Ryan P. Hafen; Ahmad M. Abusalah; Mohamed Yakout; Mourad Ouzzani; William S. Cleveland; Shaun J. Grannis; David S. Ebert
As data sources become larger and more complex, the ability to effectively explore and analyze patterns among varying sources becomes a critical bottleneck in analytic reasoning. Incoming data contain multiple variables, high signal-to-noise ratio, and a degree of uncertainty, all of which hinder exploration, hypothesis generation/exploration, and decision making. To facilitate the exploration of such data, advanced tool sets are needed that allow the user to interact with their data in a visual environment that provides direct analytic capability for finding data aberrations or hotspots. In this paper, we present a suite of tools designed to facilitate the exploration of spatiotemporal data sets. Our system allows users to search for hotspots in both space and time, combining linked views and interactive filtering to provide users with contextual information about their data and allow the user to develop and explore their hypotheses. Statistical data models and alert detection algorithms are provided to help draw user attention to critical areas. Demographic filtering can then be further applied as hypotheses generated become fine tuned. This paper demonstrates the use of such tools on multiple geospatiotemporal data sets.
international conference on data engineering | 2009
Mohamed Yakout; Mikhail J. Atallah; Ahmed K. Elmagarmid
Record linkage is the computation of the associations among records of multiple databases. It arises in contexts like the integration of such databases, online interactions and negotiations, and many others. The autonomous entities who wish to carry out the record matching computation are often reluctant to fully share their data. In such a framework where the entities are unwilling to share data with each other, the problem of carrying out the linkage computation without full data exchange has been called private record linkage. Previous private record linkage techniques have made use of a third party. We provide efficient techniques for private record linkage that improve on previous work in that (i) they make no use of a third party; (ii) they achieve much better performance than that of previous schemes in terms of execution time and quality of output (i.e., practically without false negatives and minimal false positives). Our software implementation provides experimental validation of our approach and the above claims.
international conference on data engineering | 2010
Nilothpal Talukder; Mourad Ouzzani; Ahmed K. Elmagarmid; Hazem Elmeleegy; Mohamed Yakout
The increasing popularity of social networks, such as Facebook and Orkut, has raised several privacy concerns. Traditional ways of safeguarding the privacy of personal information by hiding sensitive attributes are no longer adequate. Research shows that probabilistic classification techniques can effectively infer such private information. The disclosed sensitive information of friends, group affiliations, and even participation in activities, such as tagging and commenting, are considered background knowledge in this process. In this paper, we present a privacy protection tool, called Privometer, that measures the amount of sensitive information leakage in a user profile and suggests self-sanitization actions to regulate the amount of leakage. In contrast to previous research, where inference techniques use publicly available profile information, we consider an augmented model where a potentially malicious application installed in the user's friends' profiles can access substantially more information. In our model, merely hiding the sensitive information is not sufficient to protect the user's privacy. We present an implementation of Privometer in Facebook.
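To make the leakage idea concrete, here is a hedged sketch of one way such a measure could be computed: aggregate the values disclosed by a user's friends with a naive-Bayes-style posterior over the user's hidden sensitive attribute and report the confidence of the most likely value. All names and probability tables are hypothetical illustrations, not Privometer's actual inference model.

```python
# Hypothetical illustration of an attribute-leakage score, NOT Privometer's model.
import math

def leakage_score(friend_values, prior, cond_prob):
    """
    Crude posterior over a user's hidden sensitive attribute given values
    disclosed by friends, via a naive-Bayes-style aggregation.
    prior: {value: P(value)}
    cond_prob: {(friend_value, value): P(friend_value | value)}
    Returns the probability of the most likely value, a proxy for leakage.
    """
    log_post = {v: math.log(max(p, 1e-12)) for v, p in prior.items()}
    for fv in friend_values:
        for v in log_post:
            log_post[v] += math.log(cond_prob.get((fv, v), 1e-6))
    # Normalize in a numerically stable way.
    m = max(log_post.values())
    weights = {v: math.exp(lp - m) for v, lp in log_post.items()}
    z = sum(weights.values())
    return max(w / z for w in weights.values())
```

A self-sanitization suggestion could then be framed as finding the disclosed friend value whose removal reduces this score the most, though the paper's actual recommendation logic is not shown here.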
BMC Medical Informatics and Decision Making | 2009
Ryan P. Hafen; David Anderson; William S. Cleveland; Ross Maciejewski; David S. Ebert; Ahmad M. Abusalah; Mohamed Yakout; Mourad Ouzzani; Shaun J. Grannis
Background: Public health surveillance is the monitoring of data to detect and quantify unusual health events. Monitoring pre-diagnostic data, such as emergency department (ED) patient chief complaints, enables rapid detection of disease outbreaks. There are many sources of variation in such data; statistical methods need to accurately model them as a basis for timely and accurate disease outbreak detection methods.
Methods: Our new methods for modeling daily chief complaint counts are based on a seasonal-trend decomposition procedure based on loess (STL) and were developed using data from the 76 EDs of the Indiana surveillance program from 2004 to 2008. Square-root counts are decomposed into inter-annual, yearly-seasonal, day-of-the-week, and random-error components. Using this decomposition method, we develop a new synoptic-scale (days to weeks) outbreak detection method and carry out a simulation study to compare detection performance with four well-known methods for nine outbreak scenarios.
Results: The components of the STL decomposition reveal insights into the variability of the Indiana ED data. Day-of-the-week components tend to peak on Sunday or Monday, fall steadily to a minimum on Thursday or Friday, and then rise back to the peak. Yearly-seasonal components show seasonal influenza, some with bimodal peaks. Some inter-annual components increase slightly due to increasing patient populations. A new outbreak detection method based on the decomposition modeling performs well with 90 days or more of data. Control limits were set empirically so that all methods had a specificity of 97%. STL had the largest sensitivity in all nine outbreak scenarios. The STL method also exhibited a well-behaved false positive rate when run on the data with no outbreaks injected.
Conclusion: The STL decomposition method for chief complaint counts leads to a rapid and accurate detection method for disease outbreaks, and requires only 90 days of historical data to be put into operation. The visualization tools that accompany the decomposition and outbreak methods provide much insight into patterns in the data, which is useful for surveillance operations.
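The following is a simplified sketch of the overall workflow the abstract describes: decompose square-root daily counts with STL and flag days whose remainder exceeds an empirical control limit. It is a deliberate simplification, using a single STL pass with a weekly period rather than the paper's separate inter-annual, yearly-seasonal, and day-of-the-week components, and the input series and the 0.97 quantile choice are assumptions for illustration.

```python
# Sketch of STL-based monitoring of daily ED chief-complaint counts.
# Simplified: one STL pass with a weekly period; the paper's method also
# separates yearly-seasonal and inter-annual components. `counts` is a
# hypothetical pandas Series of daily counts indexed by date.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

def flag_unusual_days(counts: pd.Series, spec: float = 0.97):
    y = np.sqrt(counts)                      # variance-stabilizing transform
    res = STL(y, period=7, robust=True).fit()
    remainder = res.resid
    limit = np.quantile(remainder, spec)     # empirical control limit
    return remainder[remainder > limit].index  # days flagged for review
```

With roughly 90 days of history the weekly component stabilizes, which is consistent with the abstract's note that the method needs only 90 days of data to be put into operation.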
visual analytics science and technology | 2008
Ross Maciejewski; Stephen Rudolph; Ryan P. Hafen; Ahmad M. Abusalah; Mohamed Yakout; Mourad Ouzzani; William S. Cleveland; Shaun J. Grannis; Michael Wade; David S. Ebert
When analyzing syndromic surveillance data, health care officials look for areas with unusually high numbers of syndrome cases. Unfortunately, many outbreaks are difficult to detect because their signal is obscured by statistical noise. Consequently, many detection algorithms have a high false positive rate. While many false alerts can be easily filtered by trained epidemiologists, others require health officials to drill down into the data, analyzing specific segments of the population and historical trends over time and space. Furthermore, the ability to accurately recognize meaningful patterns in the data becomes more challenging as these data sources increase in volume and complexity. To facilitate more accurate and efficient event detection, we have created a visual analytics tool that provides analysts with linked geo-spatiotemporal and statistical analytic views. We model syndromic hotspots by applying kernel density estimation to the population sample. When an analyst selects a syndromic hotspot, temporal statistical graphs of the hotspot are created. Similarly, regions in the statistical plots may be selected to generate geospatial features specific to the current time period. Demographic filtering can then be combined to determine if certain populations are more affected than others. These tools allow analysts to perform real-time hypothesis testing and evaluation.
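As a rough illustration of the density-estimation step, the sketch below builds a kernel density surface over case locations with SciPy's Gaussian KDE and returns a grid suitable for a heat map. The actual tool also normalizes against the population sample and links the resulting map to temporal statistical views; the array layout and grid size here are assumptions.

```python
# Sketch of kernel-density hotspot modeling over case locations.
# `cases` is assumed to be an (N, 2) array of lon/lat points; the paper's
# tool additionally normalizes against a population sample and links the
# density view to temporal plots, which this sketch does not show.
import numpy as np
from scipy.stats import gaussian_kde

def hotspot_density(cases: np.ndarray, grid_size: int = 200):
    kde = gaussian_kde(cases.T)              # gaussian_kde expects shape (dims, N)
    lon = np.linspace(cases[:, 0].min(), cases[:, 0].max(), grid_size)
    lat = np.linspace(cases[:, 1].min(), cases[:, 1].max(), grid_size)
    xx, yy = np.meshgrid(lon, lat)
    density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
    return lon, lat, density                 # grid values for a heat-map view
```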
Journal of Data and Information Quality | 2012
Mohamed Yakout; Mikhail J. Atallah; Ahmed K. Elmagarmid
Record linkage is used to associate entities from multiple data sources. For example, two organizations contemplating a merger may want to know how much their customer bases overlap so that they may better assess the benefits of the merger. As another example, a database of people who are forbidden from a certain activity by regulators may need to be compared to a list of people engaged in that activity. The autonomous entities who wish to carry out the record matching computation are often reluctant to fully share their data: they fear losing control over its subsequent dissemination and usage, they want to ensure privacy because the data is proprietary or confidential, or they are cautious simply because privacy laws forbid its disclosure or regulate the form of that disclosure. In such cases, the problem of carrying out the linkage computation without full data exchange has been called private record linkage. Previous private record linkage techniques have made use of a third party. We provide efficient techniques for private record linkage that improve on previous work in that (1) our techniques make no use of a third party, and (2) they achieve much better performance than previous schemes in terms of execution time while maintaining acceptable quality of output compared to the non-private setting. Our protocol consists of two phases. The first phase produces candidate record pairs for matching by carrying out a very fast (but not accurate) matching between pairs of records. The second phase is a novel protocol for efficiently computing distances between each candidate pair without any expensive cryptographic operations such as modular exponentiations. Our experimental evaluation of our approach validates these claims.
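The sketch below illustrates only the two-phase structure the abstract describes: a fast, coarse pass that produces candidate pairs, followed by a more careful distance computation on the survivors. It deliberately omits the privacy machinery, i.e., how the two parties exchange only protected representations of their records, so it is a structural illustration rather than the paper's protocol; the prefix-blocking key and the similarity threshold are arbitrary choices for the example.

```python
# Structural sketch of a two-phase matching flow (candidate generation, then
# refinement). The privacy-preserving exchange between the two parties is
# intentionally omitted; this is not the paper's protocol.
from difflib import SequenceMatcher

def phase1_candidates(records_a, records_b, key=lambda r: r[:3].lower()):
    """Cheap blocking on a string prefix (a hypothetical blocking key)."""
    index = {}
    for j, b in enumerate(records_b):
        index.setdefault(key(b), []).append(j)
    return [(i, j) for i, a in enumerate(records_a)
            for j in index.get(key(a), [])]

def phase2_matches(records_a, records_b, candidates, threshold=0.85):
    """Refine candidate pairs with an edit-similarity style score."""
    return [(i, j) for i, j in candidates
            if SequenceMatcher(None, records_a[i], records_b[j]).ratio() >= threshold]
```

The point of the split is that the expensive comparison in phase two runs only on the small candidate set produced by phase one, which is what makes the overall computation fast.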
international conference on data engineering | 2010
Mohamed Yakout; Ahmed K. Elmagarmid; Jennifer Neville
Improving data quality is a time-consuming, labor-intensive, and often domain-specific operation. A recent principled approach for repairing a dirty database is to use data quality rules in the form of database constraints to identify dirty tuples and then use the rules to derive data repairs. Most existing data repair approaches focus on providing fully automated solutions, which could be risky to depend upon, especially for critical data. To guarantee that optimal-quality repairs are applied to the database, users should be involved to confirm each repair. This highlights the need for an interactive approach that combines the best of both: automatically generating repairs while efficiently employing the user's effort to verify them. In such an approach, the user guides an online repairing process that incrementally generates repairs. A key challenge in this approach is the response time within the user's interactive sessions, because generating the repairs is time consuming due to the large search space of possible repairs. To this end, we present in this paper a mechanism to continuously generate repairs only for the current top-k most important violated data quality rules. Moreover, the repairs are grouped and ranked such that those most beneficial in terms of improving data quality come first when consulting the user for verification and feedback. Our experiments on a real-world dataset demonstrate the effectiveness of our ranking mechanism in providing a fast response time for the user while improving data quality as quickly as possible.
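A small sketch of the top-k idea described above, under assumptions: rules are represented as hypothetical predicates that flag a violating tuple, rules are ranked simply by their violation counts, and repair generation in each interactive round is restricted to tuples violating the k most-violated rules. The paper's ranking also weighs the expected data-quality benefit, which this sketch does not model.

```python
# Sketch of restricting repair generation to the top-k most violated rules.
# Rules here are hypothetical predicates over a tuple returning True when
# the tuple violates them; the paper's ranking is more elaborate.
from collections import Counter

def top_k_violated_rules(tuples, rules, k=5):
    """rules: {name: violates(tuple) -> bool}. Returns the k most-violated rule names."""
    violations = Counter()
    for t in tuples:
        for name, violates in rules.items():
            if violates(t):
                violations[name] += 1
    return [name for name, _ in violations.most_common(k)]

def candidate_repair_scope(tuples, rules, k=5):
    """Tuples violating at least one top-k rule: the only ones repaired this round."""
    top = set(top_k_violated_rules(tuples, rules, k))
    return [t for t in tuples if any(rules[name](t) for name in top)]
```

Limiting each round to this scope is what keeps the interactive response time low while the most damaging violations are addressed first.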
international conference on management of data | 2012
Mohamed Yakout; Kris Ganjam; Kaushik Chakrabarti; Surajit Chaudhuri
very large data bases | 2010
Mohamed Yakout; Ahmed K. Elmagarmid; Hazem Elmeleegy; Mourad Ouzzani; Alan Qi