Publication


Featured research published by Daisy Zhe Wang.


Very Large Data Bases | 2012

The MADlib analytics library: or MAD skills, the SQL

Joseph M. Hellerstein; Christopher Ré; Florian Schoppmann; Daisy Zhe Wang; Eugene Fratkin; Aleksander Gorajek; Kee Siong Ng; Caleb E. Welton; Xixuan Feng; Kun Li; Arun Kumar

MADlib is a free, open-source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems similar to that of the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind. In this paper we introduce the MADlib project, including the background that led to its beginnings and the motivation for its open-source nature. We provide an overview of the library's architecture and design patterns, and describe various statistical methods in that context. We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals. MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods and ports to additional database platforms.
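
To make the in-database idea concrete, below is a minimal sketch (not taken from the paper) of invoking one MADlib routine from Python via psycopg2. It uses the linear-regression example table from the MADlib documentation; the connection string is a placeholder for a PostgreSQL or Greenplum database with MADlib installed.

```python
# Minimal sketch: calling an in-database MADlib routine from Python via psycopg2.
# The 'houses' table and its columns follow the MADlib documentation example;
# the connection string is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=analytics")  # placeholder connection string
with conn, conn.cursor() as cur:
    # Train a linear regression model entirely inside the database engine;
    # no data is exported to an external analytics tool.
    cur.execute("""
        SELECT madlib.linregr_train(
            'houses',                     -- source table
            'houses_linregr',             -- output model table
            'price',                      -- dependent variable
            'ARRAY[1, tax, bath, size]'   -- independent variables
        );
    """)
    cur.execute("SELECT coef FROM houses_linregr;")
    print(cur.fetchone())  # fitted coefficients, stored as a table row
conn.close()
```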


Very Large Data Bases | 2008

BayesStore: managing large, uncertain data repositories with probabilistic graphical models

Daisy Zhe Wang; Eirinaios Michelakis; Minos N. Garofalakis; Joseph M. Hellerstein

Several real-world applications need to effectively manage and reason about large amounts of data that are inherently uncertain. For instance, pervasive computing applications must constantly reason about volumes of noisy sensory readings for a variety of reasons, including motion prediction and human behavior modeling. Such probabilistic data analyses require sophisticated machine-learning tools that can effectively model the complex spatio-temporal correlation patterns present in uncertain sensory data. Unfortunately, to date, most existing approaches to probabilistic database systems have relied on somewhat simplistic models of uncertainty that can be easily mapped onto existing relational architectures: probabilistic information is typically associated with individual data tuples, with only limited or no support for effectively capturing and reasoning about complex data correlations. In this paper, we introduce BayesStore, a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. Adopting a machine-learning view, BayesStore employs concise statistical relational models to effectively encode the correlation patterns between uncertain data, and promotes probabilistic inference and statistical model manipulation as part of the standard DBMS operator repertoire to support efficient and sound query processing. We present BayesStore's uncertainty model based on a novel, first-order statistical model, and we redefine traditional query processing operators to manipulate the data and the probabilistic models of the database in an efficient manner. Finally, we validate our approach by demonstrating the value of exploiting data correlations during query processing, and by evaluating a number of optimizations which significantly accelerate query processing.


International Conference on Management of Data | 2014

Knowledge expansion over probabilistic knowledge bases

Yang Chen; Daisy Zhe Wang

Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer missing facts in a scalable, probabilistic, and principled manner using a relational DBMS. The novel contributions we make to achieve scalability and high quality are: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches; 2) We implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.
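
As a rough illustration of what applying inference rules "in batches" means, here is a small Python sketch (my own, not ProbKB's SQL implementation) that applies a single Horn rule to a set of triples with one hash join; the predicates and facts are hypothetical.

```python
# Batch application of the rule p(x, y) AND q(y, z) => r(x, z) over a triple set,
# done as a single hash join rather than fact-by-fact chaining.
from collections import defaultdict

facts = {
    ("Alice", "bornIn", "Paris"),
    ("Paris", "cityOf", "France"),
    ("Bob", "bornIn", "Lyon"),
    ("Lyon", "cityOf", "France"),
}

def apply_rule(facts, p, q, r):
    by_subject = defaultdict(list)          # index q-facts by their subject y
    for (s, pred, o) in facts:
        if pred == q:
            by_subject[s].append(o)
    inferred = set()
    for (x, pred, y) in facts:
        if pred == p:
            for z in by_subject.get(y, []):
                inferred.add((x, r, z))
    return inferred - facts                  # keep only genuinely new facts

print(apply_rule(facts, "bornIn", "cityOf", "nationality"))
# {('Alice', 'nationality', 'France'), ('Bob', 'nationality', 'France')}
```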


International Conference on Management of Data | 2011

Hybrid in-database inference for declarative information extraction

Daisy Zhe Wang; Michael J. Franklin; Minos N. Garofalakis; Joseph M. Hellerstein; Michael L. Wick

In the database community, work on information extraction (IE) has centered on two themes: how to effectively manage IE tasks, and how to manage the uncertainties that arise in the IE process in a scalable manner. Recent work has proposed a probabilistic database (PDB) based declarative IE system that supports a leading statistical IE model, and an associated inference algorithm to answer top-k-style queries over the probabilistic IE outcome. Still, the broader problem of effectively supporting general probabilistic inference inside a PDB-based declarative IE system remains open. In this paper, we explore in-database implementations of a wide variety of inference algorithms suited to IE, including two Markov chain Monte Carlo algorithms, the Viterbi algorithm, and the sum-product algorithm. We describe rules for choosing appropriate inference algorithms based on the model, the query, and the text, considering the trade-off between accuracy and runtime. Based on these rules, we describe a hybrid approach that optimizes the execution of a single probabilistic IE query by employing different inference algorithms for different records. We show that our techniques can achieve up to 10-fold speedups compared to the non-hybrid solutions proposed in the literature.


International Conference on Management of Data | 2016

Ontological Pathfinding

Yang Chen; S. Goldberg; Daisy Zhe Wang; Soumitra Siddharth Johri

Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBPedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to limitations of human knowledge and information extraction algorithms, these knowledge bases are still far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding algorithm (OP) that scales to web-scale knowledge bases via a series of parallelization and optimization techniques: a relational knowledge base model to apply inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale.
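
For intuition about the counting at the core of the rule mining step, the sketch below (an illustration under the paper's general setup, not the OP implementation) scores a single candidate rule by its support and confidence over a fact set; in OP this kind of join runs in batch over partitioned, pruned data.

```python
# Score the candidate rule p(x, y) AND q(y, z) => r(x, z) by support and confidence.
from collections import defaultdict

def score_rule(facts, p, q, r):
    q_index = defaultdict(set)                 # y -> {z} for q-facts
    r_index = set()                            # known r(x, z) facts
    for (s, pred, o) in facts:
        if pred == q:
            q_index[s].add(o)
        elif pred == r:
            r_index.add((s, o))
    body, support = 0, 0
    for (x, pred, y) in facts:
        if pred != p:
            continue
        for z in q_index.get(y, ()):
            body += 1                          # body instantiation p(x,y), q(y,z)
            if (x, z) in r_index:
                support += 1                   # also confirmed by an existing r-fact
    confidence = support / body if body else 0.0
    return support, confidence

facts = {
    ("Alice", "bornIn", "Paris"), ("Paris", "cityOf", "France"),
    ("Alice", "nationality", "France"), ("Bob", "bornIn", "Lyon"),
    ("Lyon", "cityOf", "France"),
}
print(score_rule(facts, "bornIn", "cityOf", "nationality"))  # (1, 0.5)
```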


International Conference on Data Engineering | 2010

Probabilistic declarative information extraction

Daisy Zhe Wang; Eirinaios Michelakis; Michael J. Franklin; Minos N. Garofalakis; Joseph M. Hellerstein

Unstructured text represents a large fraction of the world's data. It often contains snippets of structured information (e.g., people's names and zip codes). Information Extraction (IE) techniques identify such structured information in text. In recent years, database research has pursued IE on two fronts: declarative languages and systems for managing IE tasks, and probabilistic databases for querying the output of IE. In this paper, we take a first step toward merging these two directions, without loss of statistical robustness, by implementing a state-of-the-art statistical IE model - Conditional Random Fields (CRF) - in the setting of a Probabilistic Database that treats statistical models as first-class data objects. We show that the Viterbi algorithm for CRF inference can be specified declaratively in recursive SQL. We also show the performance benefits relative to a standalone open-source Viterbi implementation. This work opens up optimization opportunities for queries involving both inference and relational operators over IE models.
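
For reference, this is a compact Viterbi decoder for a linear-chain model written as a plain Python/NumPy sketch; the paper expresses the same dynamic program declaratively in recursive SQL inside the probabilistic database, so this only illustrates the recurrence, using made-up score matrices.

```python
import numpy as np

def viterbi(emission, transition):
    """Most likely label sequence for a linear-chain model.

    emission:   (T, K) array of per-token label scores (log-potentials)
    transition: (K, K) array of label-to-label scores
    """
    T, K = emission.shape
    score = np.zeros((T, K))
    backptr = np.zeros((T, K), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        # candidate scores: rows index the previous label, columns the current label
        cand = score[t - 1][:, None] + transition + emission[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # trace back the highest-scoring label sequence
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Illustrative scores for a 4-token sentence and 3 labels (e.g., B, I, O).
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```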


International Conference on Data Engineering | 2011

Selectivity estimation for extraction operators over text data

Daisy Zhe Wang; Long Wei; Yunyao Li; Frederick R. Reiss; Shivakumar Vaithyanathan

Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple extraction and join operators, resulting in many possible query plans. However, there has been little research on building the selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose to develop document synopses over a text corpus, from which the selectivity can be estimated. We first adapt the language models in the Natural Language Processing literature to form the top-k n-gram synopsis as the baseline document synopsis. Then we develop two classes of novel document synopses: the stratified bloom filter synopsis and the roll-up synopsis. We also develop techniques to decompose a complicated regular expression into subparts to achieve more effective and accurate estimation. We conduct experiments over the Enron email corpus using both real-world and synthetic workloads to compare the accuracy of the selectivity estimation over different classes and variations of synopses. The results show that the top-k stratified bloom filter synopsis and the roll-up synopsis are the most accurate for dictionary and regular expression selectivity estimation, respectively.
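
As a toy illustration of the baseline synopsis, the sketch below (my simplification; the paper's stratified bloom filter and roll-up synopses are considerably more refined) keeps counts for the k most frequent word trigrams of a corpus and uses them to roughly estimate how many documents a dictionary of three-word phrases would select. The corpus and dictionary contents are placeholders.

```python
# Toy top-k n-gram synopsis and a crude dictionary-selectivity estimate built on it.
from collections import Counter

def build_synopsis(documents, n=3, k=10000):
    """Count word n-grams over the corpus and keep only the k most frequent."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return dict(counts.most_common(k))

def estimate_selectivity(synopsis, num_docs, dictionary, n=3):
    """Estimate the fraction of documents matching any n-word dictionary phrase.

    Unseen n-grams (pruned from the top-k synopsis) are treated as zero, and
    occurrences are assumed to fall in distinct documents, so this is only a
    coarse estimate.
    """
    matches = sum(
        synopsis.get(tuple(phrase.lower().split()), 0)
        for phrase in dictionary
        if len(phrase.split()) == n
    )
    return min(1.0, matches / max(1, num_docs))

docs = ["please review the attached report", "the attached report is final"]
synopsis = build_synopsis(docs)
print(estimate_selectivity(synopsis, len(docs), ["the attached report"]))  # 1.0
```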


Proceedings of the Second Workshop on Data Analytics in the Cloud | 2013

GPText: Greenplum parallel statistical text analysis framework

Kun Li; Christan Grant; Daisy Zhe Wang; Sunny Khatri; George Chitouras

Many companies keep large amounts of text data inside of relational databases. Several challenges exist in using state-of-the-art systems to perform analysis on such datasets. First, an expensive data transfer cost must be paid up front to move data between databases and analytics systems. Second, many popular text analytics packages do not scale up to production-sized datasets. In this paper, we introduce GPText, a Greenplum parallel statistical text analysis framework that addresses the above problems by supporting statistical inference and learning algorithms natively in a massively parallel processing database system. GPText seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADlib, an open-source library for scalable in-database analytics that can be installed on PostgreSQL and Greenplum. In addition, we developed and contributed a linear-chain conditional random field (CRF) module to MADlib to enable information extraction tasks such as part-of-speech tagging and named entity recognition. We show the performance and scalability of the parallel CRF implementation. Finally, we describe an eDiscovery application built on the GPText framework.


Very Large Data Bases | 2016

ScaLeKB: scalable learning and inference over large knowledge bases

Yang Chen; Daisy Zhe Wang; S. Goldberg

Recent years have seen a drastic rise in the construction of web knowledge bases (e.g., Freebase, YAGO, DBPedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge, web corpora, and information extraction algorithms, the knowledge bases are still far from complete. To infer the missing knowledge, we propose the Ontological Pathfinding (OP) algorithm to mine first-order inference rules from these web knowledge bases. The OP algorithm scales up via a series of optimization techniques, including a new parallel rule-mining algorithm, a pruning strategy to eliminate unsound and inefficient rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale. Based on the mining algorithm and the optimizations, we develop an efficient inference engine. As a result, we infer 0.9 billion new facts from Freebase in 17.19 hours. We use cross-validation to evaluate the inferred facts and estimate a degree of expansion by 0.6 over Freebase, with a precision approaching 1.0. Our approach outperforms state-of-the-art mining algorithms and inference engines in terms of both performance and quality.


International Conference on Machine Learning and Applications | 2012

A Machine Learning Based Topic Exploration and Categorization on Surveys

Clint P. George; Daisy Zhe Wang; Joseph N. Wilson; Liana M. Epstein; Philip Garland; Annabell Suh

This paper describes an automatic topic extraction, categorization, and relevance ranking model for multilingual surveys and questions that exploits machine learning algorithms such as topic modeling and fuzzy clustering. Automatically generated question and survey categories are used to build question banks and category-specific survey templates. First, we describe the different pre-processing steps we considered for removing noise from the multilingual survey text. Second, we explain our strategy to automatically extract survey categories from surveys based on topic models. Third, we describe different methods to cluster questions under survey categories and group them based on relevance. Last, we describe our experimental results on a large group of unique, real-world survey datasets in German, Spanish, French, and Portuguese, and our refining methods to determine meaningful and sensible categories for building question banks. We conclude with possible enhancements to the current system and its potential impact in the business domain.
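
The topic-extraction step can be pictured with a short scikit-learn sketch (my own, not the paper's pipeline, which also handles multiple languages and fuzzy clustering); the survey strings are placeholders.

```python
# Fit LDA over a handful of survey questions and print the top words per topic.
# Requires scikit-learn 1.0+ for get_feature_names_out().
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

surveys = [
    "How satisfied are you with the checkout process?",
    "Rate the ease of finding products on our website.",
    "How likely are you to recommend our support team?",
    "Was the support agent able to resolve your issue?",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(surveys)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {topic_id}: {', '.join(top)}")
```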

Collaboration


Dive into Daisy Zhe Wang's collaborations.

Top Co-Authors

Kun Li, University of Florida
Yang Chen, University of Florida
Yang Peng, University of Florida
Minos N. Garofalakis, Technical University of Crete
Dihong Gong, Chinese Academy of Sciences