
Publication


Featured research published by Pekka Kostamaa.


international conference on management of data | 2010

Integrating Hadoop and parallel DBMSs

Yu Xu; Pekka Kostamaa; Like Gao

Teradata's parallel DBMS has been successfully deployed in large data warehouses over the last two decades for large-scale business analysis in various industries, over data sets ranging from a few terabytes to multiple petabytes. However, due to the explosive increase in data volume in recent years at some customer sites, some data, such as web logs and sensor data, are not managed by Teradata EDW (Enterprise Data Warehouse), partially because it is very expensive to load such extremely large volumes of data into an RDBMS, especially when the data are not frequently used to support important business decisions. Recently the MapReduce programming paradigm, started by Google and made popular by the open-source Hadoop implementation with major support from Yahoo!, has been gaining rapid momentum in both academia and industry as another way of performing large-scale data analysis. By now most data warehouse researchers and practitioners agree that the parallel DBMS and MapReduce paradigms each have advantages and disadvantages for various business applications, and thus both paradigms are going to coexist for a long time [16]. In fact, a large number of Teradata customers, especially those in the e-business and telecom industries, have seen increasing needs to perform BI over both data stored in Hadoop and data in Teradata EDW. One thing Hadoop and Teradata EDW have in common is that data in both systems are partitioned across multiple nodes for parallel computing, which creates integration optimization opportunities not possible for DBMSs running on a single node. In this paper we describe our three efforts towards tight and efficient integration of Hadoop and Teradata EDW.
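The abstract's key observation is that both systems hash-partition data across nodes. A minimal sketch of why that matters, using invented data and a toy `hash_partition` function (none of this is a Teradata or Hadoop API): when two data sets are partitioned by the same key over the same nodes, every row's join partner is already local, so no network shuffle is needed.

```python
# Illustrative sketch: matching hash-partitioning schemes across two
# parallel systems make node-local (network-free) joins possible.
# All names and data here are invented for illustration.

def hash_partition(rows, key, num_nodes):
    """Assign each row to a node by hashing its partitioning column."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

web_logs = [{"user_id": u, "url": f"/page{u}"} for u in range(8)]
orders = [{"user_id": u, "amount": 10 * u} for u in range(8)]

NUM_NODES = 4
log_nodes = hash_partition(web_logs, "user_id", NUM_NODES)
order_nodes = hash_partition(orders, "user_id", NUM_NODES)

# With identical partitioning, each node already holds both sides of
# the join for its share of keys, so each node can join locally.
for n in range(NUM_NODES):
    log_users = {r["user_id"] for r in log_nodes[n]}
    order_users = {r["user_id"] for r in order_nodes[n]}
    assert log_users == order_users
```

A DBMS on a single node has no such opportunity: all co-location decisions collapse to one machine, which is the point the abstract makes about multi-node integration.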


international conference on parallel architectures and languages europe | 1992

Exegesis of DBC/1012 and P-90 - Industrial Supercomputer Database Machines

Felipe Cariño; Pekka Kostamaa

This paper describes the DBC/1012's architecture, its broadcast sort/merge Ynet interconnection network, its data distribution and access strategies, its parallel execution schemes and intra-query parallelism, its method for handling linear system growth, and an optical technology solution that replaces microfiche. Also described are the hierarchical data storage needs of most large industrial and governmental concerns, and why certain applications can now be implemented for the first time. Various industry case studies are examined on the important issue of how they use multimedia parallel database systems to satisfy complex decision-support analysis. Finally, we present a brief analysis of P-90, our next-generation parallel database computer system. In particular, we discuss the performance gains provided by introducing multiprocessing at the board level, the expected benefits of RAID-5 disk arrays, and a new bidirectional, multistage, high-bandwidth, point-to-point interconnection network.


international conference on data engineering | 2010

A new algorithm for small-large table outer joins in parallel DBMS

Yu Xu; Pekka Kostamaa

Large enterprises have been relying on parallel database management systems (PDBMS) to process their ever-increasing data volumes and complex queries. Business intelligence tools used by enterprises frequently generate a large number of outer joins and require high performance from the underlying database systems. A common type of outer join in business applications is the small-large table outer join studied in this paper, where one table is relatively small and the other is large. We present an efficient and easy-to-implement algorithm called DER (Duplication and Efficient Redistribution) for small-large table outer joins. Our experimental results show that the DER algorithm significantly reduces query elapsed time and scales linearly.
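The abstract names the two ingredients, duplication and efficient redistribution, but does not spell out the algorithm. The following is a hedged single-process sketch of that idea as the name suggests it: duplicate the small table (tagged with row IDs) to every node, inner-join locally, then exchange only the tiny set of matched row IDs to find which small-table rows need NULL padding. Function names and data are invented, and this should not be read as the paper's exact method.

```python
# Hedged sketch of a duplication-based small-large LEFT OUTER join.
# The large table is partitioned across "nodes"; the small table is
# duplicated to all of them; only row IDs travel afterwards.

def der_style_left_outer_join(small, large_partitions, key):
    tagged = list(enumerate(small))     # (row_id, row): IDs make the
    matched_ids = set()                 # duplicated copies de-dupable
    results = []
    for partition in large_partitions:  # one local inner join per node
        for rid, s_row in tagged:
            for l_row in partition:
                if s_row[key] == l_row[key]:
                    matched_ids.add(rid)        # only IDs are exchanged
                    results.append({**s_row, **l_row})
    # Small-table rows whose ID matched nowhere get NULL padding.
    for rid, s_row in tagged:
        if rid not in matched_ids:
            results.append({**s_row, "value": None})
    return results

small = [{"k": 1}, {"k": 2}, {"k": 9}]
large_partitions = [[{"k": 1, "value": "a"}], [{"k": 2, "value": "b"}]]
rows = der_style_left_outer_join(small, large_partitions, "k")
```

The appeal of this shape is that the expensive alternative, redistributing the large table by the join key, is avoided entirely; only the small table and the row-ID sets cross the network.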


international conference on management of data | 2011

A Hadoop based distributed loading approach to parallel data warehouses

Yu Xu; Pekka Kostamaa; Yan Qi; Jian Wen; Kevin Keliang Zhao

One critical part of building and running a data warehouse is the ETL (Extraction, Transformation, Loading) process. In fact, the growing ETL tool market is already a multi-billion-dollar market. Getting data into data warehouses has been a hindering factor for wider potential database applications such as scientific computing, as discussed in recent panels at various database conferences. One particular problem with current load approaches to data warehouses is that while data are partitioned and replicated across all nodes in data warehouses powered by parallel DBMSs (PDBMS), load utilities typically reside on a single node, which faces the issues of i) data loss/data availability if the node or hard drives crash; ii) the file size limit on a single node; and iii) load performance. All of these issues are mostly handled manually or are only helped to some degree by tools. We notice that one thing Hadoop and Teradata Enterprise Data Warehouse (EDW) have in common is that data in both systems are partitioned across multiple nodes for parallel computing, which creates parallel loading opportunities not possible for DBMSs running on a single node. In this paper we describe our approach of using Hadoop as a distributed load strategy for Teradata EDW. We use Hadoop as the intermediate load server to store data to be loaded into Teradata EDW. We gain all the benefits of HDFS (Hadoop Distributed File System): i) significantly increased disk space for the file to be loaded; ii) once the data is written to HDFS, it is not necessary for the data sources to keep the data, even before the file is loaded to Teradata EDW; iii) MapReduce programs can be used to transform and add structure to unstructured or semi-structured data; and iv) more importantly, since a file is distributed in HDFS, the file can be loaded to Teradata EDW more quickly in parallel, which is the main focus of this paper.
When both Hadoop and Teradata EDW coexist on the same hardware platform, as increasingly required by customers because of reduced hardware and system administration costs, we have another optimization opportunity: directly loading HDFS data blocks to Teradata parallel units on the same nodes. However, due to the inherent non-uniform data distribution in HDFS, we can rarely avoid transferring HDFS blocks to remote Teradata nodes. We designed a polynomial-time optimal algorithm and a polynomial-time approximate algorithm to assign HDFS blocks to Teradata parallel units evenly and minimize network traffic. We performed experiments on synthetic and real data sets to compare the performance of the algorithms.
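The block-assignment problem the paper formalizes can be illustrated with a simple greedy heuristic: prefer a node that already holds a replica of the block (no network transfer), break ties toward the least-loaded node, and fall back to a remote node only when all local candidates are full. This is a toy stand-in, not the paper's polynomial-time optimal algorithm, and the data structures are invented.

```python
# Illustrative greedy sketch (not the paper's algorithm) of assigning
# replicated HDFS blocks to co-located parallel units so that load is
# even and remote transfers are minimized.

def assign_blocks(block_locations, num_nodes, capacity):
    """block_locations: {block_id: set of nodes holding a replica}."""
    load = [0] * num_nodes
    assignment = {}
    remote_transfers = 0
    # Handle the most constrained blocks (fewest replicas) first.
    for block, replicas in sorted(block_locations.items(),
                                  key=lambda kv: len(kv[1])):
        local = [n for n in replicas if load[n] < capacity]
        if local:                    # a replica-holding node has room
            node = min(local, key=lambda n: load[n])
        else:                        # otherwise pay a network transfer
            node = min(range(num_nodes), key=lambda n: load[n])
            remote_transfers += 1
        assignment[block] = node
        load[node] += 1
    return assignment, remote_transfers

# Four blocks, each replicated on two of four nodes, capacity one each:
block_locations = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 0}}
assignment, remote = assign_blocks(block_locations, num_nodes=4, capacity=1)
```

With skewed replica placement a greedy pass like this can be forced into remote transfers even when a perfectly local assignment exists, which is precisely why the paper invests in an optimal algorithm.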


international conference on management of data | 2001

StorHouse metanoia - new applications for database, storage & data warehousing

Felipe Cariño; Pekka Kostamaa; Art Kaufmann; John G. Burgess

This paper describes the StorHouse/Relational Manager (RM) database system, which uses and exploits an active storage hierarchy. By active storage hierarchy, we mean that StorHouse/RM executes SQL queries directly against data stored on all hierarchical storage (i.e. disk, optical, and tape) without post-processing a file or a DBA having to manage a data set. We describe and analyze StorHouse/RM features and internals. We also describe how StorHouse/RM differs from traditional HSM (Hierarchical Storage Management) systems. For commercial applications we describe an evolution of the Data Warehouse concept, called the Atomic Data Store, whereby atomic data is stored in the database system. Atomic data means all historical data values, stored so that queries can be executed against them. We also describe a Hub-and-Spoke Data Warehouse architecture, which is used to feed data into Data Marts. Furthermore, we provide an analysis of how StorHouse/RM can be federated with DB2, Oracle, and Microsoft SQL Server 7 (SS7), and thus provide these databases with an active storage hierarchy (i.e. tape). We then show two federated data modeling techniques, (a) logical horizontal partitioning (LHP) of tuples and (b) logical vertical partitioning (LVP) of columns, to demonstrate our database extension capabilities. We conclude with a TPC-like performance analysis of data stored on tape and disk.
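The two modeling techniques named at the end of the abstract can be pictured with plain Python rows (the real systems express this in SQL over federated tables; the data and split criteria below are invented): LHP splits tuples between an active disk store and a tape-backed store and a query sees their union, while LVP splits columns and rejoins them by key.

```python
# Toy illustration of logical horizontal partitioning (LHP) of tuples
# and logical vertical partitioning (LVP) of columns, as named above.

rows = [{"id": i, "year": 1995 + i, "amount": 100 * i} for i in range(6)]

# LHP: recent tuples stay on fast storage, historic tuples go to tape;
# a query logically sees the union of both partitions.
disk_part = [r for r in rows if r["year"] >= 1999]
tape_part = [r for r in rows if r["year"] < 1999]
lhp_view = disk_part + tape_part

# LVP: frequently used columns on disk, cold columns on tape,
# logically rejoined by the key column.
hot_cols = {r["id"]: {"year": r["year"]} for r in rows}
cold_cols = {r["id"]: {"amount": r["amount"]} for r in rows}
lvp_view = [{"id": i, **hot_cols[i], **cold_cols[i]} for i in hot_cols]
```

Either way the application sees one logical table, while the system is free to place each partition on the storage tier that suits its access pattern.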


international conference on data engineering | 2010

Large scale data warehousing: Trends and observations

Richard Winter; Pekka Kostamaa

How large are data warehouses? How fast are they growing? How big are they going to get? What is driving their growth? Why is all this data of value to commercial enterprises? What can we say about how these large data warehouses are being used? What are some key challenges ahead? In this talk, Richard Winter will share his views and observations concerning these questions and others, based on more than three decades of involvement with commercial data warehouses.


international conference on data engineering | 2017

BigBench V2: The New and Improved BigBench

Ahmad Ghazal; Todor Ivanov; Pekka Kostamaa; Alain Crolotte; Ryan Voong; Mohammed Al-Kateb; Waleed Ghazal; Roberto V. Zicari

Benchmarking Big Data solutions has been gaining a lot of attention from research and industry. BigBench is one of the most popular benchmarks in this area, and it was adopted by the TPC as TPCx-BB. BigBench, however, has key shortcomings. The structured component of its data model is the same as the TPC-DS data model, which is a complex snowflake-like schema. This is contrary to the simple star-schema Big Data models seen in real life. BigBench also treats the semi-structured web-logs more or less as a structured table. In real life, web-logs are modeled as key-value pairs with an unknown schema. Specific keys are captured at query time, a process referred to as late binding. In addition, eleven (out of thirty) of the BigBench queries are TPC-DS queries. These queries are complex SQL applied to the structured part of the data model, which again is not typical of Big Data workloads. In this paper, we present BigBench V2 to address the aforementioned limitations of the original BigBench. BigBench V2 is completely independent of TPC-DS, with a new data model and an overhauled workload. The new data model has a simple structured part. Web-logs are modeled as key-value pairs with a substantial and variable number of keys. BigBench V2 mandates late binding by requiring query processing to be done directly on key-value web-logs rather than on a pre-parsed form of them. A new scale factor-based data generator is implemented to produce structured tables, key-value semi-structured web-logs, and unstructured data. We implemented and executed BigBench V2 on Hive. Our proof of concept shows the feasibility of BigBench V2 and outlines different ways of implementing late binding.
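"Late binding" as described above can be shown in a few lines: the logs are stored as raw key-value text with no fixed schema, and a query extracts only the keys it needs at execution time. The log format and function names below are invented for illustration, not BigBench V2's actual data generator output.

```python
# Hedged sketch of late binding over schemaless key-value web-logs:
# nothing is parsed until a query asks for specific keys.

raw_logs = [
    "user=alice item=42 action=click",
    "user=bob action=view referrer=search",
    "user=alice item=7 action=buy price=9.99",
]

def parse_kv(line):
    """Parse one key-value log line; keys vary from line to line."""
    return dict(pair.split("=", 1) for pair in line.split())

def query(lines, wanted_keys):
    """Bind only the requested keys at query time (late binding)."""
    return [{k: rec.get(k) for k in wanted_keys}
            for rec in map(parse_kv, lines)]

result = query(raw_logs, ["user", "action"])
```

Because parsing happens per query, records are free to carry different key sets, which is exactly what a pre-parsed structured table cannot accommodate.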


conference on information and knowledge management | 2012

A new tool for multi-level partitioning in teradata

Young-Kyoon Suh; Ahmad Ghazal; Alain Crolotte; Pekka Kostamaa

This paper introduces a new tool that recommends an optimized partitioning solution, called a Multi-Level Partitioned Primary Index (MLPPI), for a fact table based on the queries in the workload. The tool implements a new technique using a greedy algorithm for search-space enumeration. The space is driven by the predicates in the queries. This technique fits the Teradata MLPPI scheme very well, as it is based on a general framework using general expressions, ranges, and case expressions for partition definitions. The cost model implemented in the tool is based on the Teradata optimizer, and it is used to prune the search space to reach a final solution. The tool resides completely on the client and interfaces with the database through APIs, as opposed to previous work that requires optimizer code extensions. The APIs are used to simplify the workload queries and to capture fact table predicates and the costs necessary to make the recommendation. The predicate-driven method implemented by the tool is general, and it can be applied to any clustering or partitioning scheme based on simple field expressions or complex SQL predicates. Experimental results on a particular workload show that the recommendation from the tool outperforms a human expert. The experiments also show that the solution scales both with the workload complexity and with the size of the fact table.
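The shape of the search can be sketched in miniature: candidate partition boundaries come from the workload's predicates, and a greedy loop repeatedly adds whichever candidate minimizes total estimated cost. The real tool uses the Teradata optimizer's cost model; the toy stand-in here simply counts rows scanned by range queries, and all names and numbers are invented.

```python
# Illustrative greedy sketch of predicate-driven partitioning search.

def scan_cost(query_range, boundaries, table_size):
    """Toy cost model: total width of partitions overlapping the query."""
    edges = [0] + sorted(boundaries) + [table_size]
    cost = 0
    for lo, hi in zip(edges, edges[1:]):
        if lo < query_range[1] and hi > query_range[0]:  # overlap
            cost += hi - lo
    return cost

def greedy_partition(queries, candidates, table_size, max_levels):
    """Greedily add the boundary that most lowers total workload cost."""
    chosen = []
    for _ in range(max_levels):
        best = min((c for c in candidates if c not in chosen),
                   key=lambda c: sum(scan_cost(q, chosen + [c], table_size)
                                     for q in queries),
                   default=None)
        if best is None:
            break
        chosen.append(best)
    return chosen

# Candidate boundaries are drawn from the predicates in the workload.
queries = [(0, 100), (100, 500), (500, 1000)]
candidates = [100, 500]
parts = greedy_partition(queries, candidates, 1000, max_levels=2)
```

The greedy order matters: the boundary at 500 is chosen first here because it cuts the most scanned rows across the whole workload, mirroring how a predicate-driven enumeration prioritizes the most selective splits.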


international conference on management of data | 2008

Handling data skew in parallel joins in shared-nothing systems

Yu Xu; Pekka Kostamaa; Xin Zhou; Liang Chen


very large data bases | 2009

Efficient outer join data skew handling in parallel DBMS

Yu Xu; Pekka Kostamaa

Collaboration


Dive into Pekka Kostamaa's collaborations.

Top Co-Authors

Jian Wen (University of California)

Liang Chen (University of California)