Kris Ganjam
Microsoft
Publications
Featured research published by Kris Ganjam.
international conference on management of data | 2003
Surajit Chaudhuri; Kris Ganjam; Venkatesh Ganti; Rajeev Motwani
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
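The following is a minimal sketch of the lookup pattern the abstract describes: matching an incoming tuple against a reference relation when no exact match exists. The reference data, field names, and the similarity measure (a character-level ratio from the standard library) are illustrative assumptions, not the paper's IDF-weighted similarity function or its indexing techniques.

```python
# Hedged sketch: fuzzy match of an incoming tuple against a reference table.
# The similarity measure below is NOT the paper's proposed function; it is a
# stand-in to illustrate the overall lookup pattern.
from difflib import SequenceMatcher

# Hypothetical reference relation of (product name, description) tuples.
reference = [
    ("Contoso Coffee Maker", "12-cup programmable coffee maker"),
    ("Fabrikam Toaster", "2-slice stainless steel toaster"),
]

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(incoming, threshold=0.7):
    """Return the best-matching reference tuple above the threshold, else None."""
    best, best_score = None, 0.0
    for ref in reference:
        # Score each field and average; real systems weight fields by importance.
        score = sum(similarity(x, y) for x, y in zip(incoming, ref)) / len(ref)
        if score > best_score:
            best, best_score = ref, score
    return best if best_score >= threshold else None

# A misspelled incoming record still resolves to the correct reference tuple.
print(fuzzy_match(("Contoso Cofee Makr", "12 cup programable coffee maker")))
```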
international conference on management of data | 2005
Surajit Chaudhuri; Kris Ganjam; Venkatesh Ganti; Rahul Kapoor; Vivek R. Narasayya; Theo Vassilakis
When collecting and combining data from various sources into a data warehouse, ensuring high data quality and consistency becomes a significant, often expensive, challenge. Common data quality problems include inconsistent data conventions amongst sources such as different abbreviations or synonyms; data entry errors such as spelling mistakes; missing, incomplete, outdated or otherwise incorrect attribute values. These data defects generally manifest themselves as foreign-key mismatches and approximately duplicate records, both of which make further data mining and decision support analyses either impossible or suspect. We demonstrate two new data cleansing operators, Fuzzy Lookup and Fuzzy Grouping, which address these problems in a scalable and domain-independent manner. These operators are implemented within Microsoft SQL Server 2005 Integration Services. Our demo will explain their functionality and highlight multiple real-world scenarios in which they can be used to achieve high data quality.
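To make the Fuzzy Grouping idea concrete, here is a minimal sketch of grouping approximately duplicate records by pairwise similarity. The records, threshold, and single-pass clustering scheme are illustrative assumptions only; the shipped operator in SQL Server 2005 Integration Services is scalable and domain-independent in ways this toy loop is not.

```python
# Hedged sketch: group near-duplicate records under a canonical representative.
from difflib import SequenceMatcher

records = ["Jon Smith", "John Smith", "J. Smith", "Mary Jones", "Marie Jones"]

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

groups = []
for rec in records:
    # Assign each record to the first group whose representative it resembles,
    # otherwise start a new group with this record as the canonical value.
    for group in groups:
        if similar(rec, group[0]):
            group.append(rec)
            break
    else:
        groups.append([rec])

print(groups)
# [['Jon Smith', 'John Smith', 'J. Smith'], ['Mary Jones', 'Marie Jones']]
```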
very large data bases | 2015
Yeye He; Kris Ganjam; Xu Chu
Join is a powerful operator that combines records from two or more tables, and is of fundamental importance in relational databases. However, traditional join processing mostly relies on string equality comparisons. Given the growing demand for ad-hoc data analysis, we have seen an increasing number of scenarios where the desired join relationship is not equi-join. For example, in a spreadsheet environment, a user may want to join one table with a subject column country-name with another table with a subject column country-code. Traditional equi-join cannot handle such joins automatically, and the user typically has to manually find an intermediate mapping table in order to perform the desired join. We develop a SEMA-JOIN approach that is a first step toward allowing users to perform semantic joins automatically, with the click of a button. Our main idea is to utilize a data-driven method that leverages a big table corpus with over 100 million tables to determine statistical correlation between cell values at both row-level and column-level. Using the intuition that the correct join mapping is the one that maximizes aggregate pairwise correlation, we formulate the join prediction problem as an optimization problem. We develop a linear program relaxation and a rounding argument to obtain a 2-approximation algorithm in polynomial time. Our evaluation using both public tables from the Web and proprietary Enterprise tables from a large company shows that the proposed approach can perform automatic semantic joins with high precision for a variety of common join scenarios.
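The sketch below illustrates only the core intuition from this abstract: using corpus co-occurrence statistics to map country names to country codes. The co-occurrence counts are made up, and the independent greedy choice per value stands in for the paper's actual formulation, which maximizes aggregate pairwise row- and column-level correlation and solves it via an LP relaxation with a 2-approximation guarantee.

```python
# Hedged sketch of the SEMA-JOIN intuition: join values by statistical
# co-occurrence in a table corpus rather than by string equality.
# The counts below are fabricated for illustration.
cooccurrence = {
    ("Germany", "DE"): 120, ("Germany", "GE"): 3,
    ("Georgia", "GE"): 80,  ("Georgia", "DE"): 1,
    ("France",  "FR"): 150,
}

left_values = ["Germany", "Georgia", "France"]   # subject column: country-name
right_values = ["DE", "GE", "FR"]                # subject column: country-code

def predicted_join(lefts, rights, counts):
    """Map each left value to the right value with the highest corpus co-occurrence."""
    return {l: max(rights, key=lambda r: counts.get((l, r), 0)) for l in lefts}

print(predicted_join(left_values, right_values, cooccurrence))
# {'Germany': 'DE', 'Georgia': 'GE', 'France': 'FR'}
```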
international conference on management of data | 2018
Yeye He; Kris Ganjam; Kukjin Lee; Yue Wang; Vivek R. Narasayya; Surajit Chaudhuri; Xu Chu; Yudian Zheng
Business analysts and data scientists today increasingly need to clean, standardize and transform diverse data sets, such as names, addresses, date times, phone numbers, etc., before they can perform analysis. These ad-hoc transformation problems are typically solved with one-off scripts, an approach that is both difficult and time-consuming. Our observation is that these domain-specific transformation problems have long been solved by developers with code libraries, which are often shared in places like GitHub. We thus develop an extensible data transformation system called Transform-Data-by-Example (TDE) that can leverage rich transformation logic in source code, DLLs, web services and mapping tables, so that end-users only need to provide a few (typically 3) input/output examples, and TDE can synthesize desired programs using relevant transformation logic from these sources. The beta version of TDE was released in the Office Store for Excel.
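Here is a minimal sketch of the by-example idea behind TDE: search a library of candidate transformation functions for one consistent with all of the user's input/output examples. The tiny hand-written library and the example data are illustrative assumptions; TDE itself synthesizes compositions of transformation logic mined from source code, DLLs, web services and mapping tables.

```python
# Hedged sketch: pick a transformation consistent with a few I/O examples.
from datetime import datetime

# Hypothetical, hand-written candidate transforms (TDE mines these from code).
candidate_transforms = {
    "upper": str.upper,
    "strip_dashes": lambda s: s.replace("-", ""),
    "us_date": lambda s: datetime.strptime(s, "%Y-%m-%d").strftime("%m/%d/%Y"),
}

def synthesize(examples):
    """Return the name of a transform that maps every example input to its output."""
    for name, fn in candidate_transforms.items():
        try:
            if all(fn(inp) == out for inp, out in examples):
                return name
        except ValueError:
            continue  # transform not applicable to these inputs
    return None

# Three input/output examples, as a user would provide in TDE.
examples = [
    ("2018-06-01", "06/01/2018"),
    ("2017-12-31", "12/31/2017"),
    ("2016-01-15", "01/15/2016"),
]
print(synthesize(examples))  # us_date
```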
Archive | 2003
Surajit Chaudhuri; Kris Ganjam; Venkatesh Ganti; Rajeev Motwani
international conference on management of data | 2012
Mohamed Yakout; Kris Ganjam; Kaushik Chakrabarti; Surajit Chaudhuri
international world wide web conferences | 2015
Chi Wang; Kaushik Chakrabarti; Yeye He; Kris Ganjam; Zhimin Chen; Philip A. Bernstein
Archive | 2012
Kris Ganjam; Kaushik Chakrabarti; Mohamed Yakout; Surajit Chaudhuri
international conference on management of data | 2008
Arvind Arasu; Surajit Chaudhuri; Kris Ganjam; Raghav Kaushik
international conference on management of data | 2015
Xu Chu; Yeye He; Kaushik Chakrabarti; Kris Ganjam