Publication


Featured research published by Hong Hai Do.


Very Large Data Bases (VLDB) | 2002

COMA: a system for flexible combination of schema matching approaches

Hong Hai Do; Erhard Rahm

Schema matching is the task of finding semantic correspondences between elements of two schemas. It is needed in many database applications, such as integration of web data sources, data warehouse loading and XML message mapping. To reduce the amount of user effort as much as possible, automatic approaches combining several match techniques are required. While such match approaches have found considerable interest recently, the problem of how to best combine different match algorithms still requires further work. We have thus developed the COMA schema matching system as a platform to combine multiple matchers in a flexible way. We provide a large spectrum of individual matchers, in particular a novel approach aiming at reusing results from previous match operations, and several mechanisms to combine the results of matcher executions. We use COMA as a framework to comprehensively evaluate the effectiveness of different matchers and their combinations for real-world schemas. The results obtained so far show the superiority of combined match approaches and indicate the high value of reuse-oriented strategies.
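The core idea of combining multiple matchers and aggregating their similarity values can be sketched as follows. The two toy matchers, the Max aggregation, and the threshold are simplified illustrations, not COMA's actual matcher library:

```python
from difflib import SequenceMatcher

# Two toy element-level matchers; real systems use many more
# (synonyms, data types, structure, reuse of previous match results).
def name_equal(a: str, b: str) -> float:
    return 1.0 if a.lower() == b.lower() else 0.0

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def combined_similarity(a: str, b: str, matchers) -> float:
    # Aggregate the individual similarity values; COMA supports several
    # aggregation strategies (e.g. Max, Average); Max is shown here.
    return max(m(a, b) for m in matchers)

def match(schema1, schema2, matchers, threshold=0.6):
    # Keep element pairs whose combined similarity passes the threshold.
    return [(a, b) for a in schema1 for b in schema2
            if combined_similarity(a, b, matchers) >= threshold]

pairs = match(["CustomerName", "Addr"], ["CustName", "Address"],
              [name_equal, name_similarity])
# pairs → [("CustomerName", "CustName"), ("Addr", "Address")]
```

Changing the aggregation strategy or threshold changes which correspondences survive, which is exactly the flexibility the composite approach evaluates.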


International Conference on Management of Data (SIGMOD) | 2005

Schema and ontology matching with COMA++

David Aumueller; Hong Hai Do; Sabine Massmann; Erhard Rahm

We demonstrate the schema and ontology matching tool COMA++. It extends our previous prototype COMA utilizing a composite approach to combine different match algorithms [3]. COMA++ implements significant improvements and offers a comprehensive infrastructure to solve large real-world match problems. It comes with a graphical interface enabling a variety of user interactions. Using a generic data representation, COMA++ uniformly supports schemas and ontologies, e.g. the powerful standard languages W3C XML Schema and OWL. COMA++ includes new approaches for ontology matching, in particular the utilization of shared taxonomies. Furthermore, different match strategies can be applied, including various forms of reusing previously determined match results and a so-called fragment-based match approach which decomposes a large match problem into smaller problems. Finally, COMA++ can not only be used to solve match problems but also to comparatively evaluate the effectiveness of different match algorithms and strategies.


Revised Papers from the NODe 2002 Web and Database-Related Workshops on Web, Web-Services, and Database Systems | 2002

Comparison of Schema Matching Evaluations

Hong Hai Do; Sergey Melnik; Erhard Rahm

Recently, schema matching has found considerable interest in both research and practice. Determining matching components of database or XML schemas is needed in many applications, e.g. for E-business and data integration. Various schema matching systems have been developed to solve the problem semi-automatically. While there have been some evaluations, the overall effectiveness of currently available automatic schema matching systems is largely unclear. This is because the evaluations were conducted in diverse ways, making it difficult to assess the effectiveness of each single system, let alone to compare their effectiveness. In this paper we survey recently published schema matching evaluations. For this purpose, we introduce the major criteria that influence the effectiveness of a schema matching approach and use these criteria to compare the various systems. Based on our observations, we discuss the requirements for future match implementations and evaluations.
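The quality measures typically used in such evaluations, precision, recall and F-measure over a manually determined set of real correspondences, can be computed as follows; the example schemas and correspondences are invented for illustration:

```python
def match_quality(real_matches: set, derived_matches: set):
    """Compare automatically derived correspondences against a
    manually determined gold standard of real correspondences."""
    true_positives = real_matches & derived_matches
    precision = len(true_positives) / len(derived_matches)
    recall = len(true_positives) / len(real_matches)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# 3 of 4 derived correspondences are correct; 1 real match is missed.
real = {("Name", "CustName"), ("Addr", "Address"),
        ("Phone", "Tel"), ("Zip", "PostalCode")}
derived = {("Name", "CustName"), ("Addr", "Address"),
           ("Phone", "Tel"), ("Name", "Address")}
p, r, f = match_quality(real, derived)  # p = 0.75, r = 0.75, f = 0.75
```

Comparing systems is only meaningful when they are measured against the same gold standard on the same match tasks, which is the core difficulty the survey identifies.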


Information Systems | 2007

Matching large schemas: Approaches and evaluation

Hong Hai Do; Erhard Rahm

Current schema matching approaches still have to improve for large and complex schemas. The large search space increases the likelihood of false matches as well as execution times. Further difficulties for schema matching are posed by the high expressive power and versatility of modern schema languages, in particular user-defined types and classes, component reuse capabilities, and support for distributed schemas and namespaces. To better assist the user in matching complex schemas, we have developed a new generic schema matching tool, COMA++, providing a library of individual matchers and a flexible infrastructure to combine the matchers and refine their results. Different match strategies can be applied, including a new scalable approach to identify context-dependent correspondences between schemas with shared elements and a fragment-based match approach which decomposes a large match task into smaller tasks. We conducted a comprehensive evaluation of the match strategies using large e-Business standard schemas. Besides providing helpful insights for future match implementations, the evaluation demonstrated the practicability of our system for matching large schemas.
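A minimal sketch of the fragment-based idea, assuming a tree-shaped schema and using the root's children as fragments (COMA++ supports further fragment criteria, and the similarity filter here is a placeholder):

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    name: str
    children: list = field(default_factory=list)

def fragments(schema_root: Element):
    # One simple criterion: one fragment per subtree below the root.
    return schema_root.children

def decompose_match_task(root1, root2, fragment_sim, threshold=0.5):
    """Instead of matching the full cross product of all elements,
    first pair up similar fragments; the expensive element-level
    match then runs only within each surviving fragment pair."""
    return [(f1, f2)
            for f1 in fragments(root1)
            for f2 in fragments(root2)
            if fragment_sim(f1, f2) >= threshold]

def name_sim(f1, f2):
    return 1.0 if f1.name == f2.name else 0.0

po1 = Element("PO", [Element("Header"), Element("Items")])
po2 = Element("PurchaseOrder", [Element("Header"), Element("Lines")])
tasks = decompose_match_task(po1, po2, name_sim)
# only the two "Header" fragments remain to be matched in detail
```

The point of the decomposition is that each remaining sub-task has a far smaller search space than the original schema pair, reducing both false matches and execution time.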


BMC Bioinformatics | 2007

FUNC: A package for detecting significant associations between gene sets and ontological annotations

Kay Prüfer; Bjoern Muetzel; Hong Hai Do; Gunter Weiss; Philipp Khaitovich; Erhard Rahm; Svante Pääbo; Michael Lachmann; Wolfgang Enard

Background: Genome-wide expression, sequence and association studies typically yield large sets of gene candidates, which must then be further analysed and interpreted. Information about these genes is increasingly being captured and organized in ontologies, such as the Gene Ontology. Relationships between the gene sets identified by experimental methods and biological knowledge can be made explicit and used in the interpretation of results. However, it is often difficult to assess the statistical significance of such analyses since many inter-dependent categories are tested simultaneously. Results: We developed the program package FUNC that includes and expands on currently available methods to identify significant associations between gene sets and ontological annotations. Implemented are several tests particularly well suited for genome-wide sequence comparisons, estimates of the family-wise error rate and the false discovery rate, a sensitive estimator of the global significance of the results, and an algorithm to reduce the complexity of the results. Conclusion: FUNC is a versatile and useful tool for the analysis of genome-wide data. It is freely available under the GPL license and also accessible via a web service.
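FUNC estimates its error rates via permutations of the data; as a rough illustration of false-discovery-rate control in general (not FUNC's actual algorithm), the classic Benjamini-Hochberg step-up procedure for multiple testing looks like this:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected at FDR level alpha
    using the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value passes the BH threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])

# Six ontology categories tested simultaneously (invented p-values):
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74])
# rejected → [0, 1]
```

Note that naive procedures of this kind assume (near-)independence of the tests; the inter-dependence of ontology categories is precisely why FUNC resorts to permutation-based estimates instead.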


Conference on Information and Knowledge Management (CIKM) | 2007

QuickMig: automatic schema matching for data migration projects

Christian Drumm; Matthias Schmitt; Hong Hai Do; Erhard Rahm

A common task in many database applications is the migration of legacy data from multiple sources into a new one. This requires identifying semantically related elements of the source and target systems and the creation of mapping expressions to transform instances of those elements from the source format to the target format. Currently, data migration is typically done manually, a tedious and time-consuming process, which is difficult to scale to a high number of data sources. In this paper, we describe QuickMig, a new semi-automatic approach to determining semantic correspondences between schema elements for data migration applications. QuickMig advances the state of the art with a set of new techniques exploiting sample instances, domain ontologies, and reuse of existing mappings to detect not only element correspondences but also their mapping expressions. QuickMig further includes new mechanisms to effectively incorporate domain knowledge of users into the matching process. The results from a comprehensive evaluation using real-world schemas and data indicate the high quality and practicability of the overall approach.


Extending Database Technology (EDBT) | 2004

Flexible Integration of Molecular-Biological Annotation Data: The GenMapper Approach

Hong Hai Do; Erhard Rahm

Molecular-biological annotation data is continuously being collected, curated and made accessible in numerous public data sources. Integration of this data is a major challenge in bioinformatics. We present the GenMapper system that physically integrates heterogeneous annotation data in a flexible way and supports large-scale analysis on the integrated data. It uses a generic data model to uniformly represent different kinds of annotations originating from different data sources. Existing associations between objects, which represent valuable biological knowledge, are explicitly utilized to drive data integration and combine annotation knowledge from different sources. To serve specific analysis needs, powerful operators are provided to derive tailored annotation views from the generic data representation. GenMapper is operational and has been successfully used for large-scale functional profiling of genes. Interactive access is provided under http://www.izbi.de.
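The generic representation idea, storing objects and their associations uniformly regardless of the originating source, might be sketched like this; the class and the identifiers are illustrative assumptions, not GenMapper's actual data model:

```python
from collections import defaultdict

class AnnotationStore:
    """Uniformly stores associations between objects from different
    sources as (source, object) -> set of associated objects."""
    def __init__(self):
        self._assoc = defaultdict(set)

    def add(self, source: str, obj: str, target: str):
        self._assoc[(source, obj)].add(target)

    def annotations(self, source: str, obj: str):
        return self._assoc[(source, obj)]

store = AnnotationStore()
# Invented identifiers standing in for real cross-source associations.
store.add("LocusLink", "LL:1234", "GO:0008270")
store.add("LocusLink", "LL:1234", "Ensembl:ENSG0001")
```

Because every source's annotations land in the same generic structure, tailored annotation views can be derived by traversing associations rather than writing source-specific integration code.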


Data Integration in the Life Sciences (DILS) | 2005

Hybrid integration of molecular-biological annotation data

Toralf Kirsten; Hong Hai Do; Christine Körner; Erhard Rahm

We present a new approach to integrate annotation data from public sources for the expression analysis of genes and proteins. Expression data is materialized in a data warehouse supporting high performance for data-intensive analysis tasks. On the other hand, annotation data is integrated virtually according to analysis needs. Our virtual integration utilizes the commercial product SRS (Sequence Retrieval System) of LION bioscience. To couple the data warehouse and SRS, we implemented a query mediator exploiting correspondences between molecular-biological objects explicitly captured from public data sources. This hybrid integration approach has been implemented for a large gene expression warehouse and supports functional analysis using annotation data from GeneOntology, Locuslink and Ensembl. The paper motivates the chosen approach, details the integration concept and implementation, and provides results of preliminary performance tests.


Archive | 2000

Evaluierung von Data Warehouse-Werkzeugen

Hong Hai Do; Thomas Stöhr; Erhard Rahm; Robert Müller; Gernot Dern

The growing importance of data warehouse solutions for decision support in large enterprises has led to a bewildering variety of software products. Current data warehouse projects show that success also depends on choosing the right tools for this complex and cost-intensive environment. We present a method for evaluating data warehouse tools that combines an assessment based on a criteria catalog with detailed hands-on tests. The approach has proven itself in projects with industry partners and is demonstrated using an evaluation of leading ETL tools as an example.


IEEE Data(base) Engineering Bulletin | 2000

Data Cleaning: Problems and Current Approaches

Erhard Rahm; Hong Hai Do
