
Publication


Featured research published by Douglas Burdick.


International Conference on Data Engineering | 2001

MAFIA: a maximal frequent itemset algorithm for transactional databases

Douglas Burdick; Manuel Calimlim; Johannes Gehrke

We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with an efficient relative bitmap compression scheme. In a thorough experimental analysis of our algorithm on real data, we isolate the effect of the individual components of the algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of three to five.
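To make the core data structure concrete, here is a minimal Python sketch of the vertical bitmap representation and AND-based support counting described above, using a toy four-transaction database; the function names and example are illustrative, not the paper's implementation.

```python
def vertical_bitmaps(transactions, num_items):
    """One bitmap per item; bit t is set iff transaction t contains the item."""
    bitmaps = [0] * num_items
    for t, itemset in enumerate(transactions):
        for item in itemset:
            bitmaps[item] |= 1 << t
    return bitmaps

def support(bitmaps, itemset):
    """Support of an itemset is the popcount of the AND of its item bitmaps."""
    bm = None
    for item in itemset:
        bm = bitmaps[item] if bm is None else bm & bitmaps[item]
    return 0 if bm is None else bin(bm).count("1")

# Toy database: transaction 0 = {0,1,2}, and so on.
transactions = [{0, 1, 2}, {0, 2}, {1, 2}, {0, 1, 2, 3}]
bitmaps = vertical_bitmaps(transactions, 4)
print(support(bitmaps, [0, 2]))  # 3: {0,2} occurs in transactions 0, 1, 3
```

Because support counting reduces to bitwise ANDs and popcounts, long itemsets remain cheap to test, which is what makes a depth-first search over long patterns practical.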


IEEE Transactions on Knowledge and Data Engineering | 2005

MAFIA: a maximal frequent itemset algorithm

Douglas Burdick; Manuel Calimlim; Jason Flannick; Johannes Gehrke; Tomi Yiu

We present a new algorithm for mining maximal frequent itemsets from a transactional database. The search strategy of the algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms that significantly improve mining performance. Our implementation for support counting combines a vertical bitmap representation of the data with an efficient bitmap compression scheme. In a thorough experimental analysis, we isolate the effects of individual components of MAFIA including search space pruning techniques and adaptive compression. We also compare our performance with previous work by running tests on very different types of data sets. Our experiments show that MAFIA performs best when mining long itemsets and outperforms other algorithms on dense data by a factor of three to thirty.
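For intuition about the search itself, the following simplified sketch combines a depth-first traversal of the itemset lattice with two of the prunings named above: superset pruning against already-found maximal sets, and the head-union-tail look-ahead. It omits MAFIA's other optimizations (e.g., adaptive compression and parent-equivalence pruning), and all names are illustrative, not the paper's code.

```python
from functools import reduce

def mafia_sketch(transactions, num_items, minsup):
    """Depth-first search for maximal frequent itemsets, with superset
    pruning and the head-union-tail (HUT) look-ahead."""
    # Vertical bitmaps: bit t of bitmaps[i] is set iff transaction t has item i.
    bitmaps = [0] * num_items
    for t, itemset in enumerate(transactions):
        for item in itemset:
            bitmaps[item] |= 1 << t

    def support(itemset):
        masks = [bitmaps[i] for i in itemset]
        return bin(reduce(lambda a, b: a & b, masks)).count("1") if masks else 0

    maximal = []

    def dfs(head, tail):
        hut = head | set(tail)  # head union tail
        # Superset pruning: this subtree adds nothing if hut lies inside
        # an already-discovered maximal frequent itemset.
        if any(hut <= m for m in maximal):
            return
        # Look-ahead: if hut itself is frequent, it is the subtree's only
        # candidate maximal itemset, so the subtree need not be explored.
        if support(hut) >= minsup:
            maximal.append(hut)
            return
        extended = False
        for i, item in enumerate(tail):
            new_head = head | {item}
            if support(new_head) >= minsup:
                extended = True
                dfs(new_head, tail[i + 1:])
        if head and not extended and not any(head <= m for m in maximal):
            maximal.append(set(head))

    dfs(set(), list(range(num_items)))
    return maximal

transactions = [{0, 1, 2}, {0, 2}, {1, 2}, {0, 1, 2, 3}]
print(mafia_sketch(transactions, num_items=4, minsup=2))  # [{0, 1, 2}]
```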


Very Large Data Bases | 2007

OLAP over uncertain and imprecise data

Douglas Burdick; Prasad M. Deshpande; T. S. Jayram; Raghu Ramakrishnan; Shivakumar Vaithyanathan

We extend the OLAP data model to represent data ambiguity, specifically imprecision and uncertainty, and introduce an allocation-based approach to the semantics of aggregation queries over such data. We identify three natural query properties and use them to shed light on alternative query semantics. While there is much work on representing and querying ambiguous data, to our knowledge this is the first paper to handle both imprecision and uncertainty in an OLAP setting.
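As a toy illustration of the allocation-based semantics, the sketch below spreads an imprecise fact (one whose dimension attribute covers several base cells) across those cells with weights summing to one, and lets SUM consume the weighted fractions. Uniform weights are just one possible allocation policy, and the example data is invented.

```python
from collections import defaultdict

# Each fact lists the base cells it could belong to; a precise fact has
# exactly one. The (state, model) cells and measures are made up.
facts = [
    ([("NY", "Civic")], 100.0),                   # precise fact
    ([("NY", "Civic"), ("MA", "Civic")], 80.0),   # imprecise: state unknown
]

allocated = defaultdict(float)
for cells, measure in facts:
    weight = 1.0 / len(cells)      # uniform allocation; weights sum to 1
    for cell in cells:
        allocated[cell] += weight * measure

print(dict(allocated))
# {('NY', 'Civic'): 140.0, ('MA', 'Civic'): 40.0}  -> expected SUM per cell
```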


IEEE Data Engineering Bulletin | 2015

Extracting, Linking and Integrating Data from Public Sources: A Financial Case Study

Douglas Burdick; Mauricio A. Hernández; Howard Ho; Georgia Koutrika; Rajasekar Krishnamurthy; Lucian Popa; Ioana Stanoi; Shivakumar Vaithyanathan; Sanjiv Ranjan Das

We present Midas, a system that uses complex data processing to extract and aggregate facts from a large collection of structured and unstructured documents into a set of unified, clean entities and relationships. Midas focuses on data for financial companies and is based on periodic filings with the U.S. Securities and Exchange Commission (SEC) and Federal Deposit Insurance Corporation (FDIC). We show that, by using data aggregated by Midas, we can provide valuable insights about financial institutions either at the whole system level or at the individual company level. The key technology components that we implemented in Midas and that enable the various financial applications are: information extraction, entity resolution, mapping and fusion, all on top of a scalable infrastructure based on Hadoop. We describe our experience in building the Midas system and also outline the key research questions that remain to be addressed towards building a generic, high-level infrastructure for large-scale data integration from public sources.
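To give a flavor of one Midas component, entity resolution, here is a deliberately crude Python sketch that clusters company-name records from different filings by a normalized key; the normalization rules and records are toy assumptions standing in for much richer matching logic.

```python
import re

LEGAL_SUFFIXES = re.compile(
    r"\b(incorporated|inc|corporation|corp|company|co|llc)\b\.?"
)

def normalize(name):
    """Crude canonical key: lowercase, drop legal suffixes and punctuation."""
    key = LEGAL_SUFFIXES.sub("", name.lower())
    key = re.sub(r"[^a-z0-9 ]", "", key)
    return " ".join(key.split())

# Toy records standing in for entities extracted from SEC and FDIC filings.
records = [
    {"name": "JPMorgan Chase & Co.", "source": "SEC"},
    {"name": "JPMORGAN CHASE CO", "source": "FDIC"},
    {"name": "Bank of America Corp", "source": "SEC"},
]

clusters = {}
for rec in records:
    clusters.setdefault(normalize(rec["name"]), []).append(rec)

for key, recs in clusters.items():
    print(key, "->", [r["source"] for r in recs])
# jpmorgan chase -> ['SEC', 'FDIC']
# bank of america -> ['SEC']
```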


Very Large Data Bases | 2014

Hybrid parallelization strategies for large-scale machine learning in SystemML

Matthias Boehm; Shirish Tatikonda; Berthold Reinwald; Prithviraj Sen; Yuanyuan Tian; Douglas Burdick; Shivakumar Vaithyanathan

SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables, in contrast to existing large-scale machine learning libraries, automatic optimization. SystemML's primary focus is on data parallelism, but many ML algorithms inherently exhibit opportunities for task parallelism as well. A major challenge is how to efficiently combine both types of parallelism for arbitrary ML scripts and workloads. In this paper, we present a systematic approach for combining task and data parallelism for large-scale machine learning on top of MapReduce. We employ a generic Parallel FOR construct (ParFOR) as known from high-performance computing (HPC). Our core contributions are (1) complementary parallelization strategies for exploiting multi-core and cluster parallelism, as well as (2) a novel cost-based optimization framework for automatically creating optimal parallel execution plans. Experiments on a variety of use cases show that this approach achieves both efficiency and scalability due to automatic adaptation to ad-hoc workloads and unknown data characteristics.
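The sketch below illustrates the ParFOR idea in miniature: independent loop iterations are dispatched to one of two backends chosen by a cost heuristic. The threshold, the backends (a thread pool vs. a process pool standing in for local vs. remote execution), and the function names are invented for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def parfor(body, iterations, data_size_mb, remote_threshold_mb=1024):
    """Run body(i) for all i, choosing a backend from a toy cost model."""
    if data_size_mb > remote_threshold_mb:
        # Stand-in for a remote/cluster ParFOR backend: heavyweight
        # process workers, one task per iteration.
        with ProcessPoolExecutor() as pool:
            return list(pool.map(body, iterations))
    # Stand-in for local ParFOR: multi-threaded execution in one process.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(body, iterations))

def square(i):  # loop body; iterations must be independent of each other
    return i * i

if __name__ == "__main__":
    print(parfor(square, range(8), data_size_mb=10))    # local threads
    print(parfor(square, range(8), data_size_mb=4096))  # "remote" processes
```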


International Conference on Data Engineering | 2015

Efficient sample generation for scalable meta learning

Sebastian Schelter; Juan Soto; Volker Markl; Douglas Burdick; Berthold Reinwald; Alexandre V. Evfimievski

Meta learning techniques such as cross-validation and ensemble learning are crucial for applying machine learning to real-world use cases. These techniques first generate samples from input data, and then train and evaluate machine learning models on these samples. For meta learning on large datasets, the efficient generation of samples becomes problematic, especially when the data is stored in a distributed, block-partitioned representation and processed on a shared-nothing cluster. We present a novel, parallel algorithm for efficient sample generation from large, block-partitioned datasets in a shared-nothing architecture. This algorithm executes in a single pass over the data and minimizes inter-machine communication. The algorithm supports a wide variety of sample generation techniques through an embedded user-defined sampling function. We illustrate how to implement distributed sample generation for popular meta learning techniques such as hold-out tests, k-fold cross-validation, and bagging using our algorithm, and present an experimental evaluation on datasets with billions of data points.
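A minimal sketch of the single-pass idea, assuming a toy block-partitioned dataset: each block is scanned once, and a pluggable user-defined sampling function routes every row to its sample, so no second scan or global shuffle is required. The blocking scheme and all names are illustrative, not the paper's algorithm.

```python
from collections import defaultdict

def kfold_udf(k, seed):
    """UDF mapping a global row id to a fold, deterministic for a given seed."""
    def assign(row_id):
        return hash((seed, row_id)) % k
    return assign

def generate_samples(blocks, udf):
    """Single pass over each block: the UDF decides each row's sample."""
    samples = defaultdict(list)
    for block_id, rows in blocks.items():
        for row_id, row in rows:
            samples[udf(row_id)].append(row)
    return samples

# Two toy "blocks" of (row_id, row) pairs standing in for a partitioned matrix.
blocks = {0: [(0, "a"), (1, "b")], 1: [(2, "c"), (3, "d")]}
print(dict(generate_samples(blocks, kfold_udf(k=2, seed=7))))
```

Swapping in a different UDF yields hold-out splits; bagging would additionally require letting the UDF assign a row to several samples, in the spirit of the paper's embedded sampling function.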


Proceedings of the International Workshop on Data Science for Macro-Modeling | 2014

Data Science Challenges in Real Estate Asset and Capital Markets

Douglas Burdick; Michael J. Franklin; Paulo Issler; Rajasekar Krishnamurthy; Lucian Popa; Louiqa Raschid; Richard Stanton

The real estate financial markets are complex supply chains. Understanding their behavior is limited by a lack of data that would capture the richly interconnected networks of financial institutions and complex financial products, e.g., asset-backed securities. This lack of transparency is further compounded by limited knowledge of the contractual rules that control the flow of funds from mortgage pools to securities, as well as the financial events that regulate these flows. In this project, we will use the IBM Midas framework and tools to extract entities, relationships, events, contractual rules, and risk profiles for financial institutions. Our source of information will be the MBS prospectus documents that are public and are filed with the Securities and Exchange Commission. We describe the data management needs of the Haas Real Estate and Financial Markets (REFM) Lab and present some recent REFM analytics that highlight the importance of these markets and their impact on systemic risk. We use excerpts extracted from the prospectus of a mortgage-backed security (MBS) to illustrate the information extraction challenges and outline our approach to addressing them.


Symposium on Cloud Computing | 2013

Compiling machine learning algorithms with SystemML

Matthias Boehm; Douglas Burdick; Alexandre V. Evfimievski; Berthold Reinwald; Prithviraj Sen; Shirish Tatikonda; Yuanyuan Tian

Analytics on big data range from passenger volume prediction in transportation to customer satisfaction in automotive diagnostic systems, and from correlation analysis in social media data to log analysis in manufacturing. Expressing and running these analytics for varying data characteristics and at scale is challenging. To address these challenges, SystemML implements a declarative, high-level language with an R-like syntax extended with machine-learning-specific constructs, which is compiled to a MapReduce runtime [2]. The language is rich enough to express a wide class of statistical, predictive modeling, and machine learning algorithms (Fig. 1). We chose robust algorithms that scale to large and potentially sparse data with many features.
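As a rough, hypothetical illustration of the lowering step only, the Python sketch below walks a tiny matrix-expression AST (the normal equations of linear regression) and emits an ordered operator plan. The operator names and the trivial post-order lowering are invented for illustration and are not SystemML's compiler.

```python
# AST for the normal equations: beta = solve(t(X) %*% X, t(X) %*% y)
ast = ("solve",
       ("matmult", ("transpose", "X"), "X"),
       ("matmult", ("transpose", "X"), "y"))

def lower(node, plan):
    """Post-order walk: lower the children first, then append a job for
    this operator; returns the name holding the node's result."""
    if isinstance(node, str):
        return node                      # a named input matrix
    op, *args = node
    inputs = [lower(arg, plan) for arg in args]
    target = f"_t{len(plan)}"
    plan.append((target, op, inputs))
    return target

plan = []
result = lower(ast, plan)
for target, op, inputs in plan:
    print(f"{target} <- {op}({', '.join(inputs)})")
# _t0 <- transpose(X)
# _t1 <- matmult(_t0, X)
# _t2 <- transpose(X)
# _t3 <- matmult(_t2, y)
# _t4 <- solve(_t1, _t3)
```

Note the duplicated transpose(X): a real optimizer would fold it via common-subexpression elimination, one of the rewrites that a declarative specification makes possible.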


In Search of Elegance in the Theory and Practice of Computation | 2013

High-Level Rules for Integration and Analysis of Data: New Challenges

Bogdan Alexe; Douglas Burdick; Mauricio A. Hernández; Georgia Koutrika; Rajasekar Krishnamurthy; Lucian Popa; Ioana Stanoi; Ryan Wisnesky

Data integration remains a perennially difficult task. The need to access, integrate, and make sense of large amounts of data has, in fact, intensified in recent years. There are now many publicly available sources of data that can provide valuable information in various domains. Concrete examples of public data sources include: bibliographic repositories (DBLP, Cora, Citeseer), online movie databases (IMDB), knowledge bases (Wikipedia, DBpedia, Freebase), and social media data (Facebook, Twitter, blogs). Additionally, a number of more specialized public data repositories are starting to play an increasingly important role. These include, for example, U.S. federal government data, congress and census data, as well as financial reports archived by the U.S. Securities and Exchange Commission (SEC).


International Conference on Management of Data | 2016

resMBS: Constructing a Financial Supply Chain from Prospectus

Douglas Burdick; Soham De; Louiqa Raschid; Mingchao Shao; Zheng Xu; Elena Zotkina

Understanding the behavior of complex financial supply chains is usually difficult due to a lack of data capturing the interactions between financial institutions (FIs) and the roles that they play in financial contracts (FCs). resMBS is an example supply chain corresponding to the US residential mortgage-backed securities that were central to the 2008 US financial crisis. In this paper, we describe the process of creating the resMBS graph dataset from financial prospectuses. We use the SystemT rule-based text extraction platform to develop two tools, ORG NER and Dict NER, for named entity recognition of financial institution (FI) names. The resMBS graph comprises a set of FC nodes (one per prospectus) and the corresponding FI nodes that are extracted from the prospectus. A Role-FI extractor matches a role keyword, such as originator, sponsor, or servicer, with FI names. We study the performance of the Role-FI extractor, ORG NER, and Dict NER in constructing the resMBS dataset. We also present preliminary results of a clustering-based analysis to identify financial communities and their evolution in the resMBS financial supply chain.
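As a rough, hypothetical illustration of the Role-FI extraction step, the single regular expression below pairs a role keyword with a following run of capitalized tokens. It is a toy stand-in for the SystemT (AQL) rules the paper actually uses, and the example sentence is invented.

```python
import re

# Role keyword, an optional ':' or ',', then a run of capitalized tokens.
ROLE_FI = re.compile(
    r"\b(originator|sponsor|servicer|trustee)\b[:,]?\s+"
    r"([A-Z][\w&.]*(?:\s+[A-Z][\w&.]*)*)"
)

text = ("The sponsor, Bear Stearns Asset Backed Securities I LLC, "
        "appointed the servicer EMC Mortgage Corporation.")

for role, fi in ROLE_FI.findall(text):
    print(role, "->", fi.rstrip("."))
# sponsor -> Bear Stearns Asset Backed Securities I LLC
# servicer -> EMC Mortgage Corporation
```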
