Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Peter Christen is active.

Publication


Featured research published by Peter Christen.


IEEE Communications Surveys and Tutorials | 2014

Context Aware Computing for The Internet of Things: A Survey

Charith Perera; Arkady B. Zaslavsky; Peter Christen; Dimitrios Georgakopoulos

As we are moving towards the Internet of Things (IoT), the number of sensors deployed around the world is growing at a rapid pace. Market research has shown a significant growth of sensor deployments over the past decade and has predicted a significant increase in the growth rate in the future. These sensors continuously generate enormous amounts of data. However, in order to add value to raw sensor data we need to understand it. Collection, modelling, reasoning, and distribution of context in relation to sensor data play a critical role in this challenge. Context-aware computing has proven to be successful in understanding sensor data. In this paper, we survey context awareness from an IoT perspective. We present the necessary background by introducing the IoT paradigm and context-aware fundamentals at the beginning. Then we provide an in-depth analysis of the context life cycle. We evaluate a subset of projects (50) which represent the majority of research and commercial solutions proposed in the field of context-aware computing conducted over the last decade (2001-2011) based on our own taxonomy. Finally, based on our evaluation, we highlight the lessons to be learnt from the past and some possible directions for future research. The survey addresses a broad range of techniques, methods, models, functionalities, systems, applications, and middleware solutions related to context awareness and IoT. Our goal is not only to analyse, compare and consolidate past research work but also to appreciate their findings and discuss their applicability towards the IoT.


IEEE Transactions on Knowledge and Data Engineering | 2012

A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

Peter Christen

Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This paper presents a survey of 12 variations of 6 indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
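
As an illustration of the indexing idea described in the abstract, here is a minimal blocking sketch in Python: records are grouped by a blocking key, and only records sharing a key are paired for comparison. The key definition (first letter of the surname plus postcode) and the field names are hypothetical choices for illustration, not one of the specific techniques surveyed in the paper.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Hypothetical blocking key: first letter of the surname plus postcode.
    return (record["surname"][:1].lower(), record["postcode"])

def candidate_pairs(records):
    """Group records by blocking key and only pair records within a block,
    instead of comparing every record with every other record."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    pairs = set()
    for rec_ids in blocks.values():
        pairs.update(combinations(sorted(rec_ids), 2))
    return pairs

records = {
    1: {"surname": "Smith", "postcode": "2601"},
    2: {"surname": "Smyth", "postcode": "2601"},
    3: {"surname": "Jones", "postcode": "2000"},
}
print(candidate_pairs(records))  # {(1, 2)} -- only one pair survives blocking
```

Instead of the full n(n-1)/2 comparison space, only pairs that share a blocking key are compared, which is the trade-off between reduced comparisons and matching quality that the survey evaluates.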


International Conference on Data Mining | 2006

A Comparison of Personal Name Matching: Techniques and Practical Issues

Peter Christen

Finding and matching personal names is at the core of an increasing number of applications: from text and Web mining and search engines to information extraction, deduplication and data linkage systems. Variations and errors in names make exact string matching problematic, and approximate matching techniques have to be applied. When compared to general text, however, personal names have different characteristics that need to be considered. In this paper, we discuss the characteristics of personal names and present potential sources of variations and errors. We then overview a comprehensive number of commonly used, as well as some recently developed, name matching techniques. Experimental comparisons using four large name data sets indicate that there is no clear best matching technique.
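
A minimal sketch of approximate name matching, assuming an edit-based similarity from Python's standard difflib module and a hypothetical decision threshold; the paper itself evaluates a much wider range of techniques, including phonetic encodings.

```python
from difflib import SequenceMatcher

def name_similarity(name_a, name_b):
    # Normalised similarity in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

pairs = [("Christen", "Christen"), ("Christen", "Kristen"), ("Gail", "Gayle")]
for a, b in pairs:
    sim = name_similarity(a, b)
    # A hypothetical decision threshold; in practice it must be tuned per data set.
    print(f"{a} vs {b}: {sim:.2f} -> {'match' if sim >= 0.85 else 'non-match'}")
```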


Knowledge Discovery and Data Mining | 2008

Automatic record linkage using seeded nearest neighbour and support vector machine classification

Peter Christen

The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While classification was traditionally based on manually set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real-world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification [6, 7]. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest-neighbour classifier, while the second improves an SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.
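
A rough sketch of the two-step idea, under the assumption that each record pair is already represented by a vector of similarity scores: the clearest matches and non-matches are selected automatically as seeds, and a scikit-learn SVM trained on those seeds classifies the remaining pairs. The seed-selection rule and the synthetic data are illustrative, not the paper's exact procedure.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Each row is a comparison vector of similarity scores for one candidate record pair.
comparison_vectors = rng.random((200, 4))

# Step 1: automatically select seed training examples from the clearest cases,
# here simply the pairs with the highest and lowest overall similarity.
mean_sim = comparison_vectors.mean(axis=1)
order = np.argsort(mean_sim)
seed_non_match = comparison_vectors[order[:20]]   # most dissimilar pairs -> likely non-matches
seed_match = comparison_vectors[order[-20:]]      # most similar pairs -> likely matches

X_train = np.vstack([seed_match, seed_non_match])
y_train = np.concatenate([np.ones(20), np.zeros(20)])

# Step 2: train an SVM on the automatically selected seeds and classify all pairs.
clf = SVC(kernel="linear").fit(X_train, y_train)
predictions = clf.predict(comparison_vectors)
print(f"{int(predictions.sum())} of {len(predictions)} pairs classified as matches")
```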


Knowledge Discovery and Data Mining | 2004

Febrl – A Parallel Open Source Data Linkage System

Peter Christen; Tim Churches; Markus Hegland

In many data mining projects information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. Most of the time the linkage process is challenged by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today's large data collections becomes increasingly difficult using traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach which is implemented transparently to the user, and a data set generator which allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.
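
Febrl's transparent parallelisation is not reproduced here, but the general idea of spreading the expensive pair-comparison step across processes can be sketched as follows; the records, comparison function and use of multiprocessing.Pool are illustrative assumptions, not Febrl's actual implementation.

```python
from difflib import SequenceMatcher
from multiprocessing import Pool

RECORDS = {
    1: "peter christen",
    2: "pete christen",
    3: "tim churches",
    4: "markus hegland",
}

def compare_pair(pair):
    # Compare one candidate record pair and return its similarity score.
    id_a, id_b = pair
    sim = SequenceMatcher(None, RECORDS[id_a], RECORDS[id_b]).ratio()
    return (id_a, id_b, sim)

if __name__ == "__main__":
    candidate_pairs = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
    # Candidate pairs are independent of each other, so they can be compared in parallel.
    with Pool(processes=2) as pool:
        for id_a, id_b, sim in pool.map(compare_pair, candidate_pairs):
            print(f"({id_a}, {id_b}) similarity = {sim:.2f}")
```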


IEEE Sensors Journal | 2014

Sensor Search Techniques for Sensing as a Service Architecture for the Internet of Things

Charith Perera; Arkady B. Zaslavsky; Chi Harold Liu; Michael Compton; Peter Christen; Dimitrios Georgakopoulos

The Internet of Things (IoT) is part of the Internet of the future and will comprise billions of intelligent communicating “things” or Internet Connected Objects (ICOs) that will have sensing, actuating, and data processing capabilities. Each ICO will have one or more embedded sensors that will capture potentially enormous amounts of data. The sensors and related data streams can be clustered physically or virtually, which raises the challenge of searching and selecting the right sensors for a query in an efficient and effective way. This paper proposes a context-aware sensor search, selection, and ranking model, called CASSARAM, to address the challenge of efficiently selecting a subset of relevant sensors out of a large set of sensors with similar functionality and capabilities. CASSARAM considers user preferences and a broad range of sensor characteristics such as reliability, accuracy, location, battery life, and many more. This paper highlights the importance of sensor search, selection and ranking for the IoT, identifies important characteristics of both sensors and data capture processes, and discusses how semantic and quantitative reasoning can be combined. This paper also addresses challenges such as efficient distributed sensor search and relational-expression based filtering. CASSARAM testing and performance evaluation results are presented and discussed.
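
The relational-expression based filtering mentioned above can be pictured as a simple predicate filter over sensor descriptions applied before any ranking is done; the sensor attributes, query and thresholds below are hypothetical.

```python
sensors = [
    {"id": "s1", "type": "temperature", "accuracy": 0.95, "battery_hours": 72},
    {"id": "s2", "type": "temperature", "accuracy": 0.80, "battery_hours": 12},
    {"id": "s3", "type": "temperature", "accuracy": 0.92, "battery_hours": 48},
]

def satisfies(sensor, constraints):
    """Check a sensor against relational constraints such as accuracy >= 0.9."""
    ops = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b, "==": lambda a, b: a == b}
    return all(ops[op](sensor[attr], value) for attr, op, value in constraints)

# Hypothetical query: temperature sensors with accuracy >= 0.9 and at least a day of battery.
constraints = [("accuracy", ">=", 0.9), ("battery_hours", ">=", 24)]
eligible = [s for s in sensors if s["type"] == "temperature" and satisfies(s, constraints)]
print([s["id"] for s in eligible])  # s1 and s3 remain; s2 is filtered out
```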


Mobile Data Management | 2013

Context-Aware Sensor Search, Selection and Ranking Model for Internet of Things Middleware

Charith Perera; Arkady B. Zaslavsky; Peter Christen; Michael Compton; Dimitrios Georgakopoulos

As we are moving towards the Internet of Things (IoT), the number of sensors deployed around the world is growing at a rapid pace. Market research has shown a significant growth of sensor deployments over the past decade and has predicted a substantial acceleration of the growth rate in the future. It is also evident that an increasing number of IoT middleware solutions are being developed in both research and commercial environments. However, sensor search and selection remain a critical requirement and a challenge. In this paper, we present CASSARAM, a context-aware sensor search, selection, and ranking model for the Internet of Things, to address the research challenges of selecting sensors when large numbers of sensors with overlapping and sometimes redundant functionality are available. CASSARAM proposes the search and selection of sensors based on user priorities. CASSARAM considers a broad range of sensor characteristics for search, such as reliability, accuracy and battery life, to name just a few. Our approach utilises both semantic querying and quantitative reasoning techniques. A user-priority-based weighted Euclidean distance comparison in multidimensional space is used to index and rank sensors. Our objectives are to highlight the importance of sensor search in the IoT paradigm, to identify important characteristics of both sensors and data acquisition processes which help to select sensors, and to understand how semantic and statistical reasoning can be combined to address this problem in an efficient manner. We developed a tool called CASSARA to evaluate the proposed model in terms of resource consumption and response time.
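
A minimal sketch of user-priority-weighted Euclidean ranking, assuming already normalised sensor characteristics: each sensor is a point in the characteristic space, and sensors closest to the user's ideal point, with each dimension weighted by the user's priority, rank highest. The attribute names, values and weights are illustrative, not CASSARAM's actual data model.

```python
import math

# Normalised sensor characteristics in [0, 1] (hypothetical values).
sensors = {
    "s1": {"reliability": 0.99, "accuracy": 0.95, "battery": 0.90},
    "s2": {"reliability": 0.90, "accuracy": 0.80, "battery": 0.20},
    "s3": {"reliability": 0.97, "accuracy": 0.92, "battery": 0.60},
}

# The user's ideal point and per-characteristic priority weights.
ideal = {"reliability": 1.0, "accuracy": 1.0, "battery": 1.0}
weights = {"reliability": 0.5, "accuracy": 0.3, "battery": 0.2}

def weighted_distance(sensor):
    # Weighted Euclidean distance between a sensor and the user's ideal point.
    return math.sqrt(sum(weights[c] * (sensor[c] - ideal[c]) ** 2 for c in ideal))

# Sensors with the smallest weighted distance to the ideal point rank first.
ranking = sorted(sensors, key=lambda sid: weighted_distance(sensors[sid]))
for sid in ranking:
    print(sid, round(weighted_distance(sensors[sid]), 3))
```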


Knowledge Discovery and Data Mining | 2009

Accurate Synthetic Generation of Realistic Personal Information

Peter Christen; Agus Pudjijono

A large portion of data collected by many organisations today is about people, and often contains personal identifying information, such as names and addresses. Privacy and confidentiality are of great concern when such data is being shared between organisations or made publicly available. Research in (privacy-preserving) data mining and data linkage is suffering from a lack of publicly available real-world data sets that contain personal information, and therefore experimental evaluations can be difficult to conduct. In order to overcome this problem, we have developed a data generator that allows flexible creation of synthetic data containing personal information with realistic characteristics, such as frequency distributions, attribute dependencies, and error probabilities. Our generator significantly improves on earlier approaches, and allows the generation of data for individuals, families and households.
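
A toy sketch of this kind of generation: attribute values are drawn from frequency tables and simple errors (a typo or a missing value) are injected with given probabilities. The tables, error types and probabilities are assumptions, far simpler than those used by the actual generator.

```python
import random

random.seed(42)

# Toy frequency tables (a real generator uses look-up files with realistic frequencies).
given_names = {"peter": 0.4, "anna": 0.35, "agus": 0.25}
surnames = {"christen": 0.5, "smith": 0.3, "hegland": 0.2}

def sample(freq_table):
    # Draw a value according to its relative frequency.
    return random.choices(list(freq_table), weights=list(freq_table.values()))[0]

def corrupt(value, typo_prob=0.2, missing_prob=0.1):
    """Inject a single-character typo or drop the value, with the given probabilities."""
    r = random.random()
    if r < missing_prob:
        return ""
    if r < missing_prob + typo_prob and len(value) > 1:
        pos = random.randrange(len(value))
        return value[:pos] + random.choice("abcdefghijklmnopqrstuvwxyz") + value[pos + 1:]
    return value

for rec_id in range(5):
    record = {"given_name": corrupt(sample(given_names)), "surname": corrupt(sample(surnames))}
    print(rec_id, record)
```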


Intelligent Data Engineering and Automated Learning | 2005

Probabilistic data generation for deduplication and data linkage

Peter Christen

In many data mining projects the data to be analysed contains personal information, like names and addresses. Cleaning and pre-processing of such data likely involves deduplication or linkage with other data, which is often challenged by a lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learning and database communities. Publicly available test data with known deduplication or linkage status is needed so that new linkage algorithms and techniques can be tested, evaluated and compared. However, publication of data containing personal information is normally impossible due to privacy and confidentiality issues. An alternative is to use artificially created data, which has the advantages that content and error rates can be controlled, and the deduplication or linkage status is known. Controlled experiments can be performed and replicated easily. In this paper we present a freely available data set generator capable of creating data sets containing names, addresses and other personal information.
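
In the same spirit, a toy sketch of how a generator can emit originals and corrupted duplicates while recording the true match status, so that linkage results can later be scored against known ground truth; the corruption rule and identifier scheme are illustrative assumptions, not the generator's actual design.

```python
import random

random.seed(7)

originals = {"rec-0-org": "peter christen", "rec-1-org": "tim churches"}

def make_duplicate(value):
    # Toy corruption: swap two neighbouring characters to simulate a typing error.
    pos = random.randrange(len(value) - 1)
    chars = list(value)
    chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)

records = dict(originals)
true_matches = set()
for org_id, value in originals.items():
    dup_id = org_id.replace("-org", "-dup")
    records[dup_id] = make_duplicate(value)
    # The generator knows the truth: original and duplicate refer to the same entity.
    true_matches.add((org_id, dup_id))

print(records)
print(true_matches)
```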


International World Wide Web Conferences | 2012

New objective functions for social collaborative filtering

Joseph Noel; Scott Sanner; Khoi-Nguyen Tran; Peter Christen; Lexing Xie; Edwin V. Bonilla; Ehsan Abbasnejad; Nicolás Della Penna

This paper examines the problem of social collaborative filtering (CF) to recommend items of interest to users in a social network setting. Unlike standard CF algorithms using relatively simple user and item features, recommendation in social networks poses the more complex problem of learning user preferences from a rich and complex set of user profile and interaction information. Many existing social CF methods have extended traditional CF matrix factorization, but have overlooked important aspects germane to the social setting. We propose a unified framework for social CF matrix factorization by introducing novel objective functions for training. Our new objective functions have three key features that address main drawbacks of existing approaches: (a) we fully exploit feature-based user similarity, (b) we permit direct learning of user-to-user information diffusion, and (c) we leverage co-preference (dis)agreement between two users to learn restricted areas of common interest. We evaluate these new social CF objectives, comparing them to each other and to a variety of (social) CF baselines, and analyze user behavior in live user trials in a custom-developed Facebook App involving data collected over five months from over 100 App users and their 37,000+ friends.
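
As a rough illustration only (not the paper's actual objective functions), a social collaborative filtering objective typically couples a matrix factorization reconstruction term with a social term that pulls the latent factors of similar or connected users together, for example:

```latex
\min_{U,V}\;
\sum_{(u,i)\in R} \bigl(r_{ui} - U_u^{\top} V_i\bigr)^2
\;+\; \lambda_{\mathrm{soc}} \sum_{(u,f)\in S} s_{uf}\,\lVert U_u - U_f \rVert^2
\;+\; \lambda \bigl(\lVert U \rVert_F^2 + \lVert V \rVert_F^2\bigr)
```

Here U and V are user and item factor matrices, R the set of observed ratings, S the social graph, s_{uf} a feature-based user similarity, and lambda, lambda_soc regularization weights; the paper's contribution lies in richer terms for similarity, diffusion and co-preference than this generic form.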

Collaboration


Dive into Peter Christen's collaborations.

Top Co-Authors

Dinusha Vatsalan (Australian National University)
Markus Hegland (Australian National University)
Charith Perera (Australian National University)
Thilina Ranbaduge (Australian National University)
Dimitrios Georgakopoulos (Swinburne University of Technology)
Arkady B. Zaslavsky (Commonwealth Scientific and Industrial Research Organisation)
Stephen Roberts (Australian National University)
Ole Nielsen (Australian National University)