Craig Statchuk | Researchain


Publication

Featured research published by Craig Statchuk.

ACM Computing Surveys | 2016

The Six Pillars for Building Big Data Analytics Ecosystems

Shadi Khalifa; Yehia Elshater; Kiran Sundaravarathan; Aparna Balachandra Bhat; Patrick Martin; Fahim T. Imam; Dan Rope; Mike McRoberts; Craig Statchuk

With almost everything now online, organizations look to the Big Data they collect for insights that can improve their services. Deriving such insights requires experimenting with and integrating different analytics techniques while handling Big Data's high arrival velocity and large volumes. Existing solutions cover only parts of the analytics process, leaving organizations to assemble their own ecosystem or buy an off-the-shelf ecosystem that may include components they do not need. We build on this point by dividing the Big Data Analytics problem into six main pillars. We characterize each pillar and show examples of solutions designed for it. We then integrate the six pillars into a taxonomy that provides an overview of possible state-of-the-art analytics ecosystems. In the process, we highlight a number of ecosystems that meet organizations' different needs. Finally, we identify possible areas of research for building future Big Data Analytics Ecosystems.

International Congress on Big Data | 2015

A Study of Data Locality in YARN

Yehia Elshater; Patrick Martin; Dan Rope; Mike McRoberts; Craig Statchuk

Co-locating computation as close as possible to the data is an important consideration in current data-intensive systems. This is known as the data locality problem. In this paper, we analyze the impact of data locality on YARN, which is the new version of Hadoop. We investigate the behavior of the YARN delay scheduler with respect to data locality for a variety of workloads and configurations. We address three problems related to data locality. First, we study the trade-off between data locality and job completion time. Second, we observe an imbalance in resource allocation when data locality is considered, which may under-utilize the cluster. Third, we address the redundant I/O operations that occur when different YARN containers request input data blocks on the same node. Additionally, we propose the YARN Locality Simulator (YLocSim), a tool that simulates the interactions between YARN components in a real cluster and reports data locality percentages in real time. We validate YLocSim against a real cluster setup and use it in our study.
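The trade-off between data locality and completion time mentioned in this abstract can be illustrated with a toy model of delay scheduling. The sketch below is not YLocSim and not the authors' method. It assumes a deliberately simplified, hypothetical setup: each task has one input block on one node, nodes offer one free slot per heartbeat in round-robin order, and a task at the head of the queue may be skipped up to `max_skips` times before it is launched non-locally. The function name `simulate` and all of its parameters are illustrative only.

```python
def simulate(task_nodes, num_nodes, max_skips):
    """Toy delay scheduling; returns (locality_fraction, heartbeats_used).

    task_nodes[i] is the node holding task i's input block (assumed model:
    one replica per block, one slot offered per heartbeat, round-robin).
    """
    if not task_nodes:
        return 1.0, 0
    pending = [[node, 0] for node in task_nodes]  # [preferred node, skips so far]
    local = 0
    heartbeats = 0
    while pending:
        offering = heartbeats % num_nodes  # node offering a free slot now
        heartbeats += 1
        # Prefer any pending task whose data lives on the offering node.
        idx = next((i for i, (node, _) in enumerate(pending) if node == offering), None)
        if idx is not None:
            pending.pop(idx)
            local += 1
        elif pending[0][1] >= max_skips:
            pending.pop(0)  # head task has waited long enough: run it non-locally
        else:
            pending[0][1] += 1  # skip this offer and keep waiting for a local slot
    return local / len(task_nodes), heartbeats


# Skewed input: all four blocks live on node 1 of a four-node cluster.
print(simulate([1, 1, 1, 1], num_nodes=4, max_skips=0))  # (0.25, 4)
print(simulate([1, 1, 1, 1], num_nodes=4, max_skips=3))  # (1.0, 14)
```

In this toy model, allowing more skips raises the share of node-local tasks from 25% to 100% but stretches scheduling from 4 heartbeats to 14, which is the kind of locality versus completion-time tension the abstract describes. Real YARN schedulers also account for rack locality, replication, and task runtimes, none of which are modeled here.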

International Congress on Big Data | 2016

QDrill: Query-Based Distributed Consumable Analytics for Big Data

Shadi Khalifa; Patrick Martin; Dan Rope; Mike McRoberts; Craig Statchuk

Consumable analytics attempt to address the shortage of skilled data analysts in many organizations by offering analytic functionality in a form more familiar to in-house experts. Providing consumable analytics for Big Data faces three main challenges. The first is making the analytics algorithms run in a distributed fashion so that Big Data can be analyzed in a timely manner. The second is providing an easy interface that lets in-house experts run these algorithms in a distributed fashion while minimizing the learning curve and rewrites of existing code. The third is running the analytics on data of different formats stored in heterogeneous data stores. In this paper, we address these challenges with the proposed QDrill. We introduce the Analytics Adaptor extension for Apache Drill, a schema-free SQL query engine for non-relational storage. The Analytics Adaptor introduces the Distributed Analytics Query Language for invoking data mining algorithms from within standard Drill SQL query statements. The adaptor allows any sequential single-node data mining library (e.g., WEKA) to be used and makes its algorithms run in a distributed fashion without rewriting them. We evaluate QDrill against Apache Mahout. The evaluation shows that QDrill outperforms Mahout in the Updatable model training and scoring phases while maintaining almost the same performance for Non-Updatable model training. QDrill is more scalable and offers an easier interface, no storage overhead, and the whole WEKA algorithm repository, with the ability to extend to algorithms from other data mining libraries.

Archive | 2011