Is this you? Create Your Porfile

Xiufeng Liu

Technical University of Denmark

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Xiufeng Liu is active.

Explore More

Publication

Featured researches published by Xiufeng Liu.

international database engineering and applications symposium | 2014

Survey of real-time processing systems for big data

Xiufeng Liu; Nadeem Iftikhar; Xike Xie

In recent years, real-time processing and analytics systems for big data--in the context of Business Intelligence (BI)--have received a growing attention. The traditional BI platforms that perform regular updates on daily, weekly or monthly basis are no longer adequate to satisfy the fast-changing business environments. However, due to the nature of big data, it has become a challenge to achieve the real-time capability using the traditional technologies. The recent distributed computing technology, MapReduce, provides off-the-shelf high scalability that can significantly shorten the processing time for big data; Its open-source implementation such as Hadoop has become the de-facto standard for processing big data, however, Hadoop has the limitation of supporting real-time updates. The improvements in Hadoop for the real-time capability, and the other alternative real-time frameworks have been emerging in recent years. This paper presents a survey of the open source technologies that support big data processing in a real-time/near real-time fashion, including their system architectures and platforms.

data warehousing and knowledge discovery | 2011

ETLMR: a highly scalable dimensional ETL framework based on mapreduce

Xiufeng Liu; Christian Thomsen; Torben Bach Pedersen

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL programmer productivity. This paper presents a scalable dimensional ETL framework, ETLMR, based on MapReduce. ETLMR has built-in native support for operations on DW-specific constructs such as star schemas, snowflake schemas and slowly changing dimensions (SCDs). This enables ETL developers to construct scalable MapReduce-based ETL flows with very few code lines. To achieve good performance and load balancing, a number of dimension and fact processing schemes are presented, including techniques for efficiently processing different types of dimensions. The paper describes the integration of ETLMR with aMapReduce framework and evaluates its performance on large realistic data sets. The experimental results show that ETLMR achieves very good scalability and compares favourably with other MapReduce data warehousing tools.

very large data bases | 2012

MapReduce-based dimensional ETL made easy

Xiufeng Liu; Christian Thomsen; Torben Bach Pedersen

This paper demonstrates ETLMR, a novel dimensional Extract--Transform--Load (ETL) programming framework that uses Map-Reduce to achieve scalability. ETLMR has built-in native support of data warehouse (DW) specific constructs such as star schemas, snowflake schemas, and slowly changing dimensions (SCDs). This makes it possible to build MapReduce-based dimensional ETL flows very easily. The ETL process can be configured with only few lines of code. We will demonstrate the concrete steps in using ETLMR to load data into a (partly snowflaked) DW schema. This includes configuration of data sources and targets, dimension processing schemes, fact processing, and deployment. In addition, we also present the scalability on large data sets.

Energy | 2016

A Hybrid ICT-Solution for Smart Meter Data Analytics

Xiufeng Liu; Per Sieverts Nielsen

Smart meters are increasingly used worldwide. Smart meters are the advanced meters capable of measuring energy consumption at a fine-grained time interval, e.g., every 15 min. Smart meter data are typically bundled with social economic data in analytics, such as meter geographic locations, weather conditions and user information, which makes the data sets very sizable and the analytics complex. Data mining and emerging cloud computing technologies make collecting, processing, and analyzing the so-called big data possible. This paper proposes an innovative ICT-solution to streamline smart meter data analytics. The proposed solution offers an information integration pipeline for ingesting data from smart meters, a scalable platform for processing and mining big data sets, and a web portal for visualizing analytics results. The implemented system has a hybrid architecture of using Spark or Hive for big data processing, and using the machine learning toolkit, MADlib, for doing in-database data analytics in PostgreSQL database. This paper evaluates the key technologies of the proposed ICT-solution, and the results show the effectiveness and efficiency of using the system for both batch and online analytics.

international conference on data engineering | 2015

SMAS: A smart meter data analytics system

Xiufeng Liu; Lukasz Golab; Ihab F. Ilyas

Smart electricity meters are replacing conventional meters worldwide and have enabled a new application domain: smart meter data analytics. In this paper, we introduce SMAS, our smart meter analytics system, which demonstrates the actionable insight that consumers and utilities can obtain from smart meter data. Notably, we implemented SMAS inside a relational database management system using open source tools: PostgreSQL and the MADLib machine learning toolkit. In the proposed demonstration, conference attendees will interact with SMAS as electricity providers, consultants and consumers, and will perform various analyses on real data sets.

extending database technology | 2015

Benchmarking Smart Meter Data Analytics

Xiufeng Liu; Lukasz Golab; Wojciech M. Golab; Ihab F. Ilyas

Smart electricity meters have been replacing conventional meters worldwide, enabling automated collection of fine-grained (every 15 minutes or hourly) consumption data. A variety of smart meter analytics algorithms and applications have been proposed, mainly in the smart grid literature, but the focus thus far has been on what can be done with the data rather than how to do it efficiently. In this paper, we examine smart meter analytics from a software performance perspective. First, we propose a performance benchmark that includes common data analysis tasks on smart meter data. Second, since obtaining large amounts of smart meter data is difficult due to privacy issues, we present an algorithm for generating large realistic data sets from a small seed of real data. Third, we implement the proposed benchmark using five representative platforms: a traditional numeric computing platform (Matlab), a relational DBMS with a built-in machine learning toolkit (PostgreSQL/MADLib), a main-memory column store (“System C”), and two distributed data processing platforms (Hive and Spark). We compare the five platforms in terms of application development effort and performance on a multi-core machine as well as a cluster of 16 commodity servers. We have made the proposed benchmark and data generator freely available online.

international database engineering and applications symposium | 2014

CloudETL: scalable dimensional ETL for hive

Xiufeng Liu; Christian Thomsen; Torben Bach Pedersen

Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported (and due to lacking support for UPDATEs, SCDs are complex to handle manually). Also the powerful Pig platform for data processing on MapReduce does not support such dimensional ETL processing. To remedy this, we present the ETL framework CloudETL which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports different dimensional concepts such as star schemas and SCDs. We present how CloudETL works and uses different performance optimizations including a purpose-specific data placement policy to co-locate data. Further, we present a performance study and compare with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity. For example, Hive uses 3.9 times as long to load an SCD in an experiment and needs 112 statements while CloudETL only needs 4.

Information Systems | 2011

3XL: Supporting efficient operations on very large OWL Lite triple-stores

Xiufeng Liu; Christian Thomsen; Torben Bach Pedersen

An increasing number of (semantic) web applications store a very large number of (subject, predicate, object) triples in specialized storage engines called triple-stores. Often, triple-stores are used mainly as plain data stores, i.e., for inserting and retrieving large amounts of triples, but not using more advanced features such as logical inference, etc. However, current triple-stores are not optimized for such bulk operations and/or do not support OWL Lite. Further, triple-stores can be inflexible when the data has to be integrated with other kinds of data in non-triple form, e.g., standard relational data. This paper presents 3XL, a triple-store that efficiently supports operations on very large amounts of OWL Lite triples. 3XL also provides the user with high flexibility as it stores data in an object-relational database in a schema that is easy to use and understand. It is, thus, easy to integrate 3XL data with data from other sources. The distinguishing features of 3XL include (a) flexibility as the data is stored in a database, allowing easy integration with other data, and can be queried by means of both triple queries and SQL, (b) using a specialized data-dependent schema (with intelligent partitioning) which is intuitive and efficient to use, (c) using object-relational DBMS features such as inheritance, (d) efficient loading through extensive use of bulk loading and caching, and (e) efficient triple query operations, especially in the important case when the subject and/or predicate is known. Extensive experiments with a PostgreSQL-based implementation show that 3XL performs very well for such operations and that the performance is comparable to state-of-the-art triple-stores.

ACM Transactions on Database Systems | 2017

Smart Meter Data Analytics: Systems, Algorithms, and Benchmarking

Xiufeng Liu; Lukasz Golab; Wojciech M. Golab; Ihab F. Ilyas; Shichao Jin

Smart electricity meters have been replacing conventional meters worldwide, enabling automated collection of fine-grained (e.g., every 15 minutes or hourly) consumption data. A variety of smart meter analytics algorithms and applications have been proposed, mainly in the smart grid literature. However, the focus has been on what can be done with the data rather than how to do it efficiently. In this article, we examine smart meter analytics from a software performance perspective. First, we design a performance benchmark that includes common smart meter analytics tasks. These include offline feature extraction and model building as well as a framework for online anomaly detection that we propose. Second, since obtaining real smart meter data is difficult due to privacy issues, we present an algorithm for generating large realistic datasets from a small seed of real data. Third, we implement the proposed benchmark using five representative platforms: a traditional numeric computing platform (Matlab), a relational DBMS with a built-in machine learning toolkit (PostgreSQL/MADlib), a main-memory column store (“System C”), and two distributed data processing platforms (Hive and Spark/Spark Streaming). We compare the five platforms in terms of application development effort and performance on a multicore machine as well as a cluster of 16 commodity servers.

Knowledge and Information Systems | 2017

CITIESData: a smart city data management framework

Xiufeng Liu; Alfred Heller; Per Sieverts Nielsen

Smart city data come from heterogeneous sources including various types of the Internet of Things such as traffic, weather, pollution, noise, and portable devices. They are characterized with diverse quality issues and with different types of sensitive information. This makes data processing and publishing challenging. In this paper, we propose a framework to streamline smart city data management, including data collection, cleansing, anonymization, and publishing. The paper classifies smart city data in sensitive, quasi-sensitive, and open/public levels and then suggests different strategies to process and publish the data within these categories. The paper evaluates the framework using a real-world smart city data set, and the results verify its effectiveness and efficiency. The framework can be a generic solution to manage smart city data.

Explore More