Publication


Featured research published by Kevin Wilkinson.


International World Wide Web Conference | 2004

Jena: implementing the semantic web recommendations

Jeremy J. Carroll; Ian Dickinson; Chris Dollin; Dave Reynolds; Andy Seaborne; Kevin Wilkinson

The new Semantic Web recommendations for RDF, RDFS and OWL have, at their heart, the RDF graph. Jena2, a second-generation RDF toolkit, is similarly centered on the RDF graph. RDFS and OWL reasoning are seen as graph-to-graph transforms, producing graphs of virtual triples. Rich APIs are provided. The Model API includes support for other aspects of the RDF recommendations, such as containers and reification. The Ontology API includes support for RDFS and OWL, including advanced OWL Full support. Jena includes the de facto reference RDF/XML parser, and provides RDF/XML output using the full range of the rich RDF/XML grammar. N3 I/O is supported. RDF graphs can be stored in-memory or in databases. Jena's query language, RDQL, and the Web API are both offered for the next round of standardization.
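As a sketch of the Model API described above, the following builds a small RDF graph and serializes it as N3. It uses the current Apache Jena package names (the 2004 paper describes Jena2, whose classes lived under com.hp.hpl.jena) and invented example URIs.

    import org.apache.jena.rdf.model.*;

    public class JenaSketch {
        public static void main(String[] args) {
            // An in-memory RDF graph, the core abstraction of the toolkit.
            Model model = ModelFactory.createDefaultModel();

            // Illustrative URIs; any IRIs would do.
            Resource paper = model.createResource("http://example.org/jena2-paper");
            Property creator = model.createProperty("http://purl.org/dc/elements/1.1/creator");

            // Statements are (subject, predicate, object) triples.
            paper.addProperty(creator, "Kevin Wilkinson");

            // Serialize using one of the supported syntaxes (RDF/XML, N3, ...).
            model.write(System.out, "N3");
        }
    }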


Extending Database Technology | 2009

Data integration flows for business intelligence

Umeshwar Dayal; Malu Castellanos; Alkis Simitsis; Kevin Wilkinson

Business Intelligence (BI) refers to technologies, tools, and practices for collecting, integrating, analyzing, and presenting large volumes of information to enable better decision making. Today's BI architecture typically consists of a data warehouse (or one or more data marts), which consolidates data from several operational databases, and serves a variety of front-end querying, reporting, and analytic tools. The back-end of the architecture is a data integration pipeline for populating the data warehouse by extracting data from distributed and usually heterogeneous operational sources; cleansing, integrating and transforming the data; and loading it into the data warehouse. Since BI systems have been used primarily for off-line, strategic decision making, the traditional data integration pipeline is a one-way, batch process, usually implemented by extract-transform-load (ETL) tools. The design and implementation of the ETL pipeline is largely a labor-intensive activity, and typically consumes a large fraction of the effort in data warehousing projects. Increasingly, as enterprises become more automated, data-driven, and real-time, the BI architecture is evolving to support operational decision making. This imposes additional requirements and tradeoffs, resulting in even more complexity in the design of data integration flows. These include reducing the latency so that near real-time data can be delivered to the data warehouse, extracting information from a wider variety of data sources, extending the rigidly serial ETL pipeline to more general data flows, and considering alternative physical implementations. We describe the requirements for data integration flows in this next generation of operational BI systems, the limitations of current technologies, the research challenges in meeting these requirements, and a framework for addressing these challenges. The goal is to facilitate the design and implementation of optimal flows to meet business requirements.
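To make the extract-cleanse-transform-load sequence concrete, here is a minimal, hypothetical pass in Java; the CSV source, three-field row layout, and cleansing rule are all invented for illustration, and a real ETL tool would bulk-load the result into the warehouse instead of printing it.

    import java.nio.file.*;
    import java.util.*;
    import java.util.stream.*;

    public class EtlSketch {
        public static void main(String[] args) throws Exception {
            // Extract: read raw rows from an operational source (here, a CSV file).
            List<String> raw = Files.readAllLines(Path.of("orders.csv"));

            // Cleanse + transform: drop malformed rows, convert the amount to cents.
            List<String[]> clean = raw.stream()
                .map(line -> line.split(","))
                .filter(f -> f.length == 3 && f[2].matches("\\d+(\\.\\d+)?"))
                .map(f -> new String[]{ f[0], f[1],
                        String.valueOf(Math.round(Double.parseDouble(f[2]) * 100)) })
                .collect(Collectors.toList());

            // Load: stand-in for a bulk insert into the data warehouse.
            clean.forEach(f -> System.out.println(String.join("|", f)));
        }
    }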


International Conference on Management of Data | 2012

Optimizing analytic data flows for multiple execution engines

Alkis Simitsis; Kevin Wilkinson; Malu Castellanos; Umeshwar Dayal

Next-generation business intelligence involves data flows that span different execution engines, contain complex functionality like data/text analytics and machine learning operations, and need to be optimized against various objectives. Creating correct analytic data flows in such an environment is a challenging task and is both labor-intensive and time-consuming. Optimizing these flows is currently an ad-hoc process where the result is largely dependent on the abilities and experience of the flow designer. Our previous work addressed analytic flow optimization for multiple objectives over a single execution engine. This paper focuses on optimizing flows for a single objective, namely performance, over multiple execution engines. We consider flows that span a DBMS, a Map-Reduce engine, and an orchestration engine (e.g., an ETL tool or scripting language). This configuration is emerging as a common paradigm used to combine analysis of unstructured data with analysis of structured data (e.g., NoSQL plus SQL). We present flow transformations that model data shipping, function shipping, and operation decomposition, and we describe how flow graphs are generated for multiple engines. Performance results for various configurations demonstrate the benefit of optimization.
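As a toy illustration of the data-shipping versus function-shipping choice that the flow transformations model: the cost functions and numbers below are invented, and a real optimizer would derive them from statistics about the flow and the engines.

    public class ShippingChoice {
        // Hypothetical costs: shipping data is bound by volume and bandwidth;
        // shipping the function pays a deploy overhead plus remote execution.
        static double dataShipCost(double gb, double gbPerSec) { return gb / gbPerSec; }
        static double functionShipCost(double deploySec, double remoteExecSec) {
            return deploySec + remoteExecSec;
        }

        public static void main(String[] args) {
            double ship = dataShipCost(500, 0.1);      // move 500 GB at 0.1 GB/s
            double fn = functionShipCost(30, 1200);    // run the operator where the data lives
            System.out.println(ship < fn ? "ship data" : "ship function");
        }
    }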


International Conference on Data Engineering | 2010

Optimizing ETL workflows for fault-tolerance

Alkis Simitsis; Kevin Wilkinson; Umeshwar Dayal; Malu Castellanos

Extract-Transform-Load (ETL) processes play an important role in data warehousing. Typically, design work on ETL has focused on performance as the sole metric to make sure that the ETL process finishes within an allocated time window. However, other quality metrics are also important and need to be considered during ETL design. In this paper, we address ETL design for performance plus fault-tolerance and freshness. There are many reasons why an ETL process can fail and a good design needs to guarantee that it can be recovered within the ETL time window. How to make ETL robust to failures is not trivial. There are different strategies that can be used and they each have different costs and benefits. In addition, other metrics can affect the choice of a strategy; e.g., higher freshness reduces the time window for recovery. The design space is too large for informal, ad-hoc approaches. In this paper, we describe our QoX optimizer that considers multiple design strategies and finds an ETL design that satisfies multiple objectives. In particular, we define the optimizer search space, cost functions, and search algorithms. Also, we illustrate its use through several experiments and we show that it produces designs that are very near optimal.
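The flavor of trade-off such an optimizer weighs can be sketched with a toy cost model (the formulas and numbers here are invented, not the paper's cost functions): restarting a failed flow from scratch versus paying checkpoint overhead for a mid-flow recovery point.

    public class RecoveryStrategy {
        // Expected time when every failure forces a full restart, for a
        // per-run failure probability p (simplified geometric model).
        static double restartCost(double runTime, double p) {
            return runTime / (1 - p);
        }

        // With a mid-flow recovery point: pay checkpoint overhead up front,
        // but a failure repeats only about half the flow on average.
        static double checkpointCost(double runTime, double p, double ckptOverhead) {
            return runTime + ckptOverhead + p * (runTime / 2);
        }

        public static void main(String[] args) {
            double run = 120, p = 0.2, overhead = 5;   // minutes; invented values
            System.out.printf("restart: %.1f  checkpoint: %.1f%n",
                    restartCost(run, p), checkpointCost(run, p, overhead));
        }
    }

With these numbers the recovery point wins (137 vs. 150 minutes); as the failure probability grows or the time window shrinks, the gap widens, which is why the choice belongs in an optimizer rather than being fixed by convention.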


International Journal of Parallel Programming | 2013

Analytical Performance Models for MapReduce Workloads

Emanuel Vianna; Giovanni Comarela; Tatiana Pontes; Jussara M. Almeida; Virgílio A. F. Almeida; Kevin Wilkinson; Harumi A. Kuno; Umeshwar Dayal

MapReduce is a currently popular programming model to support parallel computations on large datasets. Among the several existing MapReduce implementations, Hadoop has attracted a lot of attention from both industry and research. In a Hadoop job, map and reduce tasks coordinate to produce a solution to the input problem, exhibiting precedence constraints and synchronization delays that are characteristic of a pipeline communication between maps (producers) and reduces (consumers). Here we address the challenge of designing analytical models to estimate the performance of MapReduce workloads, notably Hadoop workloads, focusing particularly on the intra-job pipeline parallelism between map and reduce tasks belonging to the same job. We propose a hierarchical model that combines a precedence graph model and a queuing network model to capture the intra-job synchronization constraints. We first show how to build a precedence graph that represents the dependencies among multiple tasks of the same job. We then apply it jointly with an approximate Mean Value Analysis (aMVA) solution to predict mean job response time, throughput and resource utilization. We validate our solution against a queuing network simulator and a real setup in various scenarios, finding very close agreement in both cases. In particular, our model produces estimates of average job response time that deviate from measurements of a real setup by less than 15%.
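The abstract does not spell out which approximate MVA variant is used; as a representative sketch (an assumption, not necessarily the paper's formulation), the Schweitzer approximation for a closed queuing network with N jobs, K centers, and service demand D_k at center k iterates

    R_k(N) = D_k \left( 1 + \frac{N-1}{N} \, Q_k(N) \right), \qquad
    X(N) = \frac{N}{\sum_{k=1}^{K} R_k(N)}, \qquad
    Q_k(N) = X(N) \, R_k(N)

to a fixed point, starting from Q_k(N) = N/K; the mean job response time is then \sum_k R_k(N).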


International Conference on Conceptual Modeling | 2010

Leveraging business process models for ETL design

Kevin Wilkinson; Alkis Simitsis; Malu Castellanos; Umeshwar Dayal

As Business Intelligence evolves from off-line strategic decision making to on-line operational decision making, the design of the back-end Extract-Transform-Load (ETL) processes is becoming even more complex. Many challenges arise in this new context, such as optimization and modeling. In this paper, we focus on the disconnection between the IT-level view of the enterprise presented by ETL processes and the business view of the enterprise required by managers and analysts. We propose the use of business process models for a conceptual view of ETL. We show how to link this conceptual view to existing business processes and how to translate from this conceptual view to a logical ETL view that can be optimized. Thus, we link the ETL processes back to their underlying business processes and so enable not only a business view of the ETL, but also a near real-time view of the entire enterprise.
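A hypothetical illustration of the conceptual-to-logical translation: each business-process task type maps to a logical ETL operator. The task names and operators below are invented, and the paper's actual mapping rules are richer than a lookup table.

    import java.util.*;

    public class ConceptualToLogical {
        // Invented mapping from business-process task types to logical ETL operators.
        static final Map<String, String> OPERATOR = Map.of(
            "ReceiveOrder",   "Extract(orders)",
            "ValidateOrder",  "Filter(isValid)",
            "EnrichCustomer", "Lookup(customer_dim)",
            "PostToLedger",   "Load(fact_sales)");

        public static void main(String[] args) {
            // A conceptual flow in business terms, translated operator by operator.
            for (String task : List.of("ReceiveOrder", "ValidateOrder",
                                       "EnrichCustomer", "PostToLedger")) {
                System.out.println(task + " -> " + OPERATOR.get(task));
            }
        }
    }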


Extending Database Technology | 2009

Managing long-running queries

Stefan Krompass; Harumi A. Kuno; Janet L. Wiener; Kevin Wilkinson; Umeshwar Dayal; Alfons Kemper

Business Intelligence query workloads that run against very large data warehouses contain queries whose execution times range, sometimes unpredictably, from seconds to hours. The presence of even a handful of long-running queries can significantly slow down a workload consisting of thousands of queries, creating havoc for queries that require a quick response. Long-running queries are a known problem in all commercial database products. However, we have not seen a thorough classification of long-running queries nor a systematic study of the most effective corrective actions. We present here a systematic study of workload management policies, including many implemented by commercial database vendors. Our goal is to enable a system to: (1) recognize long-running queries and categorize them in terms of their impact on performance and (2) determine and take (automatically!) the most effective control actions to remedy the situation. To this end, we identify common workload management scenarios involving long-running queries, and create a taxonomy of long-running queries. We carry out an extensive set of experiments to evaluate different management policies and the relative and absolute thresholds that they may use. We find that in some scenarios, the right combination of policies can reduce the runtime of a workload by a factor of two, but that in other scenarios, any action taken increases runtime. One surprising result was that relative thresholds for execution control can compensate for inaccurate cost estimates, so that Kill&Requeue actions perform as well as Suspend&Resume.
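A minimal sketch of the relative-threshold execution control evaluated above: take a corrective action once a query's runtime exceeds a multiple of the optimizer's estimate. The threshold value and action names are illustrative.

    public class ExecutionControl {
        // Relative threshold: act when the runtime exceeds k times the cost estimate.
        static final double RELATIVE_THRESHOLD = 3.0;   // illustrative value

        static String decide(double estimatedSec, double elapsedSec) {
            if (elapsedSec > RELATIVE_THRESHOLD * estimatedSec) {
                return "KILL_AND_REQUEUE";   // or SUSPEND_AND_RESUME, per policy
            }
            return "CONTINUE";
        }

        public static void main(String[] args) {
            System.out.println(decide(10, 45));      // estimate badly off -> act
            System.out.println(decide(3600, 1800));  // long but within bounds -> continue
        }
    }

Because the trigger is relative to the estimate, a badly underestimated query is caught early; this is how relative thresholds compensate for inaccurate cost estimates.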


International Conference on Data Engineering | 2013

HFMS: Managing the lifecycle and complexity of hybrid analytic data flows

Alkis Simitsis; Kevin Wilkinson; Umeshwar Dayal; Meichun Hsu

To remain competitive, enterprises are evolving their business intelligence systems to provide dynamic, near real-time views of business activities. To enable this, they deploy complex workflows of analytic data flows that access multiple storage repositories and execution engines and that span the enterprise and even extend outside it. We call these multi-engine flows hybrid flows. Designing and optimizing hybrid flows is a challenging task. Managing a workload of hybrid flows is even more challenging since their execution engines are likely under different administrative domains and there is no single point of control. To address these needs, we present a Hybrid Flow Management System (HFMS). It is an independent software layer over a number of independent execution engines and storage repositories. It simplifies the design of analytic data flows and includes optimization and executor modules to produce optimized executable flows that can run across multiple execution engines. HFMS dispatches flows for execution and monitors their progress. To meet service level objectives for a workload, it may dynamically change a flow's execution plan to avoid processing bottlenecks in the computing infrastructure. We present the architecture of HFMS and describe its components. To demonstrate its potential benefit, we describe performance results for running sample batch workloads with and without HFMS. The ability to monitor multiple execution engines and to dynamically adjust plans enables HFMS to provide better service guarantees and better system utilization.
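A toy sketch of the dispatch decision such a layer makes; the engine names, load metric, and least-loaded rule are invented, and HFMS's actual scheduling is more involved.

    import java.util.*;

    public class HfmsSketch {
        record Engine(String name, double load) {}   // load in [0,1]; illustrative

        // Route the next flow fragment to the engine with the most headroom.
        static Engine choose(List<Engine> engines) {
            return engines.stream()
                          .min(Comparator.comparingDouble(Engine::load))
                          .orElseThrow();
        }

        public static void main(String[] args) {
            List<Engine> engines = List.of(new Engine("DBMS", 0.9),
                                           new Engine("MapReduce", 0.4),
                                           new Engine("ETL", 0.7));
            System.out.println("dispatch fragment to: " + choose(engines).name());
        }
    }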


Operating Systems Review | 2009

Managing operational business intelligence workloads

Umeshwar Dayal; Harumi A. Kuno; Janet L. Wiener; Kevin Wilkinson; Archana Ganapathi; Stefan Krompass

We explore how to manage database workloads that contain a mixture of OLTP-like queries that run for milliseconds as well as business intelligence queries and maintenance tasks that last for hours. As data warehouses grow in size to petabytes and complex analytic queries play a greater role in day-to-day business operations, factors such as inaccurate cardinality estimates, data skew, and resource contention all make it notoriously difficult to predict how such queries will behave before they start executing. However, traditional workload management assumes that accurate expectations for the resource requirements and performance characteristics of a workload are available at compile-time, and relies on such information in order to make critical workload management decisions. In this paper, we describe our approach to dealing with inaccurate predictions. First, we evaluate the ability of workload management algorithms to handle workloads that include unexpectedly long-running queries. Second, we describe a new and more accurate method for predicting the resource usage of queries before runtime. We have carried out an extensive set of experiments, and report on a few of our results.


Extending Database Technology | 2009

Automating the loading of business process data warehouses

Malu Castellanos; Alkis Simitsis; Kevin Wilkinson; Umeshwar Dayal

Business processes drive the operations of an enterprise. In the past, the focus was primarily on business process design, modeling, and automation. Recently, enterprises have realized that they can benefit tremendously from analyzing the behavior of their business processes with the objective of optimizing or improving them. In our research, we address the problem of warehousing business process execution data so that we can analyze their behavior using the analytic and reporting tools that are available in data warehouse environments. We build upon our previous work that described the design and implementation of a generic process data warehouse for use with any business processes. In this paper, we show how to automate the population of the generic process warehouse by tracking business events from an application environment. Typically, the source data consists of event streams that indicate changes in the business process state (i.e., progression of the process). The target schema is designed to allow querying of task and process execution data. The core of our approach for processing progression data relies on the construction of generic templates that specify the semantics of the event stream extraction and the subsequent transformations that translate the underlying IT events into business data changes. Using this extensible template mechanism, we show how to automate the construction of mappings to populate the generic process warehouse using two levels of mappings that are applied in two phases. Interestingly, our approach of using ETL technology for warehousing process data can be seen the other way around: an arbitrary ETL process can be modeled as a business process. Hence, we describe the benefit of modeling ETL as a business process and illustrate how to use our approach to warehouse ETL execution data, and to monitor and analyze the progress of ETL processes. Finally, we discuss implementation issues.
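The template idea can be sketched as a rule that recognizes an IT-level event and emits a business-state change for the warehouse; the event types, fields, and target columns below are invented for illustration.

    import java.util.*;

    public class EventTemplate {
        record Event(String type, String processId, long timestamp) {}

        // A generic template: match an event type, emit a warehouse row (or nothing).
        static Optional<String> apply(Event e) {
            return switch (e.type) {
                case "TASK_STARTED"   -> Optional.of(e.processId + ",RUNNING," + e.timestamp);
                case "TASK_COMPLETED" -> Optional.of(e.processId + ",DONE," + e.timestamp);
                default               -> Optional.empty();  // no business meaning
            };
        }

        public static void main(String[] args) {
            apply(new Event("TASK_STARTED", "po-42", 1700000000L))
                .ifPresent(System.out::println);
        }
    }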

Collaboration

Top Co-Authors: Petar Jovanovic (Polytechnic University of Catalonia)