Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hongjun Lu is active.

Publication


Featured research published by Hongjun Lu.


IEEE Transactions on Knowledge and Data Engineering | 1996

Effective data mining using neural networks

Hongjun Lu; Rudy Setiono; Huan Liu

Classification is one of the data mining problems that has recently received great attention in the database community. This paper presents an approach to discovering symbolic classification rules using neural networks. Neural networks have not been considered well suited for data mining because the classifications they make are not explicitly stated as symbolic rules suitable for verification or interpretation by humans. With the proposed approach, concise symbolic rules with high accuracy can be extracted from a neural network. The network is first trained to achieve the required accuracy rate. Redundant connections of the network are then removed by a network pruning algorithm. The activation values of the hidden units in the network are analyzed, and classification rules are generated from the results of this analysis. The effectiveness of the proposed approach is demonstrated by experimental results on a set of standard data mining test problems.
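
The prune-then-extract pipeline described in the abstract can be sketched in miniature. The toy hidden unit below uses hand-set weights (not an actually trained network), and the threshold and attribute names are illustrative assumptions, not the paper's method:

```python
# A minimal sketch of the prune-then-extract idea: connections whose
# weight magnitude falls below a threshold are removed, and a symbolic
# rule is read off from the inputs that survive pruning.

def prune(weights, threshold=0.1):
    """Drop connections whose magnitude is below the threshold."""
    return {name: w for name, w in weights.items() if abs(w) >= threshold}

def to_rule(weights, target="class = positive"):
    """Turn the surviving connections into a human-readable rule."""
    conds = [f"{name} is {'high' if w > 0 else 'low'}"
             for name, w in sorted(weights.items())]
    return f"IF {' AND '.join(conds)} THEN {target}"

# Toy hidden unit: three strong inputs, two near-zero (redundant) ones.
hidden_unit = {"age": 0.8, "income": -0.6, "tenure": 0.5,
               "noise1": 0.02, "noise2": -0.04}

pruned = prune(hidden_unit)
rule = to_rule(pruned)
print(rule)
# IF age is high AND income is low AND tenure is high THEN class = positive
```

In the paper the analysis of hidden-unit activations is considerably more careful; this sketch only shows why pruning redundant connections makes the extracted rule concise.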


IEEE Transactions on Knowledge and Data Engineering | 2006

Text classification without negative examples revisit

Gabriel Pui Cheong Fung; Jeffrey Xu Yu; Hongjun Lu; Philip S. Yu

Traditionally, building a classifier requires two sets of examples: positive examples and negative examples. This paper studies the problem of building a text classifier using positive examples (P) and unlabeled examples (U). The unlabeled examples are a mixture of positive and negative examples. Since no negative example is given explicitly, the task of building a reliable text classifier becomes far more challenging. Simply treating all of the unlabeled examples as negative examples and building a classifier thereafter is a poor approach to this problem. Generally speaking, most studies solve this problem with a two-step heuristic: first, extract negative examples (N) from U; second, build a classifier based on P and N. Surprisingly, most studies did not try to extract positive examples from U. Intuitively, enlarging P by P′ (positive examples extracted from U) and building a classifier thereafter should enhance the effectiveness of the classifier. Throughout our study, we find that extracting P′ is very difficult. A document in U that possesses the features exhibited in P is not necessarily a positive example, and vice versa. The very large size of, and very high diversity in, U also contribute to the difficulty of extracting P′. In this paper, we propose a labeling heuristic called PNLH to tackle this problem. PNLH aims at extracting high-quality positive and negative examples from U and can be used on top of any existing classifier. Extensive experiments based on several benchmarks are conducted. The results indicate that PNLH is highly feasible, especially when |P| is extremely small.
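
The harvesting idea behind the two-step heuristic can be illustrated with a rough sketch: score each unlabeled document by its word overlap with the positive set. The documents, thresholds, and scoring below are illustrative assumptions, not the paper's actual PNLH procedure:

```python
# Harvest likely negatives (and, as PNLH adds, likely positives) from the
# unlabeled set U by comparing each document's word overlap with the
# features seen in the positive set P.

def feature_set(docs):
    """Union of all words appearing in the given documents."""
    return set().union(*(set(d.split()) for d in docs))

def overlap(doc, features):
    """Fraction of the document's words that appear in the feature set."""
    words = set(doc.split())
    return len(words & features) / len(words)

P = ["database query optimization", "query processing engine"]
U = ["query engine tuning",              # strong overlap with P
     "soccer world cup final",           # no overlap with P
     "database index structures"]        # ambiguous: lands in neither bucket

feats = feature_set(P)
likely_pos = [d for d in U if overlap(d, feats) >= 0.5]
likely_neg = [d for d in U if overlap(d, feats) == 0.0]
print(likely_pos)
print(likely_neg)
```

Note that the third document ends up in neither bucket, which mirrors the abstract's observation that extracting P′ reliably is hard: sharing some features with P does not make a document positive.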


Archive | 2004

Conceptual Modeling – ER 2004

Paolo Atzeni; Wesley W. Chu; Hongjun Lu; Shuigeng Zhou; Tok Wang Ling

The envisioned Semantic Web aims to provide richly annotated and explicitly structured Web pages in XML, RDF, or description logics, based upon underlying ontologies and thesauri. Ideally, this should enable a wealth of query processing and semantic reasoning capabilities using XQuery and logical inference engines. However, we believe that the diversity and uncertainty of terminologies and schema-like annotations will make precise querying on a Web scale extremely elusive if not hopeless, and the same argument holds for large-scale dynamic federations of Deep Web sources. Therefore, ontology-based reasoning and querying needs to be enhanced by statistical means, leading to relevance-ranked lists as query results. This paper presents steps towards such a “statistically semantic” Web and outlines technical challenges. We discuss how statistically quantified ontological relations can be exploited in XML retrieval, how statistics can help in making Web-scale search efficient, and how statistical information extracted from users’ query logs and click streams can be leveraged for better search result ranking. We believe these are decisive issues for improving the quality of next-generation search engines for intranets, digital libraries, and the Web, and they are crucial also for peer-to-peer collaborative Web search.

1 The Challenge of “Semantic” Information Search

The age of information explosion poses tremendous challenges regarding the intelligent organization of data and the effective search of relevant information in business and industry (e.g., market analyses, logistic chains), society (e.g., health care), and virtually all sciences that are more and more data-driven (e.g., gene expression data analyses and other areas of bioinformatics). The problems arise in intranets of large organizations, in federations of digital libraries and other information sources, and in the most humongous and amorphous of all data collections, the World Wide Web and its underlying numerous databases that reside behind portal pages. The Web bears the potential of being the world’s largest encyclopedia and knowledge base, but we are very far from being able to exploit this potential. Database-system and search-engine technologies provide support for organizing and querying information; but all too often they require excessive manual preprocessing, such as designing a schema and cleaning raw data or manually classifying documents into a taxonomy for a good Web portal, or manual postprocessing such as browsing through large result lists with too many irrelevant items or surfing in the vicinity of promising but not truly satisfactory approximate matches. The following are a few example queries where current Web and intranet search engines fall short or where data

P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 3–17, 2004.


Database and Expert Systems Applications | 1999

Cleansing Data for Mining and Warehousing

Mong Li Lee; Tok Wang Ling; Hongjun Lu; Yee Teng Ko

Given the rapid growth of data, it is important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing is crucial because of the garbage in, garbage out principle. Dirty data files are prevalent because of incorrect or missing data values, inconsistent value naming conventions, and incomplete information. Hence, we may have multiple records referring to the same real-world entity. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them so that potentially matching records will be brought into a close neighbourhood. Based on these techniques, we implement a data cleansing system which can detect and remove more duplicate records than existing methods.
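
The sort-then-compare-neighbours idea can be sketched as follows. The records, the normalization rules, and the exact-match similarity test are illustrative assumptions; the paper's pre-processing techniques are more elaborate:

```python
# A minimal sorted-neighbourhood sketch: records are normalized so that
# likely duplicates sort close together, then only neighbours within a
# small window are compared, avoiding a full pairwise scan.

import re

def normalize(name):
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", name.lower())).strip()

def find_duplicates(records, window=2):
    keyed = sorted(records, key=normalize)   # duplicates become neighbours
    dups = []
    for i, rec in enumerate(keyed):
        for other in keyed[i + 1:i + window]:
            if normalize(rec) == normalize(other):
                dups.append((rec, other))
    return dups

records = ["Lee, Mong Li", "lee mong li.", "Ling, Tok Wang", "Lu Hongjun"]
dups = find_duplicates(records)
print(dups)  # the two spellings of the same name are flagged
```

The window keeps the comparison cost linear in the number of records; the quality of the normalization (the pre-processing step the paper focuses on) determines whether true duplicates actually land in the same neighbourhood.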


Archive | 2004

Advanced Web Technologies and Applications

Jeffrey Xu Yu; Xuemin Lin; Hongjun Lu; Yanchun Zhang

Spatial data has now been used extensively in the Web environment, providing online customized maps and supporting map-based applications. The full potential of Web-based spatial applications, however, has yet to be achieved due to performance issues related to the large sizes and high complexity of spatial data. In this paper, we introduce a multiresolution approach to spatial data management and query processing such that the database server can choose spatial data at the right resolution level for different Web applications. One highly desirable property of the proposed approach is that the server-side processing cost and network traffic can be reduced when the level of resolution required by an application is low. Another advantage is that our approach pushes complex multiresolution structures and algorithms into the spatial database engine, so that the developer of spatial Web applications need not be concerned with such complexity. This paper explains the basic idea, technical feasibility and applications of multiresolution spatial databases.

This paper provides an overview of a query indexing method, called VCR indexing, for monitoring continual range queries. A VCR-based query index enables fast matching of events against a large number of range predicates. We first describe VCR indexing for general event matching against a set of 2D range predicates. We then show how VCR indexing can be used for efficient processing of continual range queries over moving objects. VCR stands for virtual construct rectangle. A set of VCRs are predefined, each with a unique ID. Each region defined by a range predicate is decomposed into, or covered by, one or more activated VCRs. The predicate ID is then stored in the ID lists associated with these activated VCRs. The use of VCRs provides an indirect and cost-effective way of pre-computing the search result for any given event or object position, making event matching very efficient.
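
The VCR decomposition can be sketched with the simplest possible choice of predefined rectangles: unit grid cells. The grid, the queries, and the cell shape are toy assumptions (the actual VCR scheme predefines rectangles of several sizes):

```python
# A toy virtual-construct-rectangle (VCR) index using unit grid cells as
# the predefined VCRs. Each range predicate is decomposed into the cells
# that cover it; matching an event point then costs one cell lookup.

from collections import defaultdict

def covering_cells(x1, y1, x2, y2):
    """Unit cells covering the rectangle [x1,x2) x [y1,y2)."""
    return [(x, y) for x in range(x1, x2) for y in range(y1, y2)]

index = defaultdict(list)               # cell -> list of predicate IDs

def insert(pred_id, rect):
    for cell in covering_cells(*rect):  # store the ID under each cell
        index[cell].append(pred_id)

def match(px, py):
    """Predicates whose range contains the event point (px, py)."""
    return index.get((int(px), int(py)), [])

insert("Q1", (0, 0, 2, 2))              # 2x2 range near the origin
insert("Q2", (1, 1, 3, 3))              # overlapping 2x2 range
matches = match(1.5, 1.5)
print(matches)                          # the point lies inside both ranges
```

The pre-computation happens at predicate insertion time: an event lookup never intersects rectangles, it only reads the ID list of the one cell containing the point.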


International Conference on Applications of Databases | 1994

Efficient Image Retrieval By Color Contents

Hongjun Lu; Beng Chin Ooi; Kian-Lee Tan

Images are becoming an important asset and managing them for efficient retrieval poses challenges to the database community. In this paper, we proposed a novel three-tier color index that supports efficient image retrieval by color contents which is important to certain applications, especially when shapes and semantic objects cannot be easily recognized. A prototype painting database system is designed and implemented to demonstrate the effectiveness of the proposed indexing technique. Besides the color index, two other indexes, B+-trees for structured attributes and a signature file for free-text descriptions, were also implemented. As a result, a wide range of queries, both text-based and content-based can be processed efficiently. We also look at existing image database systems based on their query retrieval capabilities.
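
Retrieval by color contents can be illustrated with a generic color-histogram comparison. The tiny pixel lists and the histogram-intersection measure below are illustrative assumptions; the paper's three-tier color index is considerably more involved:

```python
# A generic colour-histogram similarity sketch for content-based image
# retrieval: images are reduced to normalized colour distributions and
# ranked by histogram intersection with the query.

from collections import Counter

def histogram(pixels):
    """Normalized colour histogram of an image given as a pixel list."""
    counts = Counter(pixels)
    total = len(pixels)
    return {c: n / total for c, n in counts.items()}

def similarity(h1, h2):
    """Histogram intersection: 1.0 means identical colour distributions."""
    return sum(min(h1.get(c, 0), h2.get(c, 0)) for c in set(h1) | set(h2))

query = ["red", "red", "blue", "green"]
db = {"sunset": ["red", "red", "red", "blue"],
      "forest": ["green", "green", "green", "green"]}

q_hist = histogram(query)
ranked = sorted(db, key=lambda name: similarity(q_hist, histogram(db[name])),
                reverse=True)
print(ranked)  # images ordered by colour similarity to the query
```

Because the comparison ignores shapes and object semantics entirely, it works exactly in the situation the abstract mentions: when shapes and semantic objects cannot be easily recognized.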


IEEE Transactions on Knowledge and Data Engineering | 2005

Constructing suffix tree for gigabyte sequences with megabyte memory

Ching-Fung Cheung; Jeffrey Xu Yu; Hongjun Lu

Mammalian genomes are typically 3 Gbp (gigabase pairs) in size, and the largest public DNA database, NCBI (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov), contains more than 20 Gbp. Suffix trees are widely acknowledged as a data structure that supports exact/approximate sequence matching queries as well as repetitive structure finding efficiently when they can reside in main memory. However, it has proven difficult to handle long DNA sequences using suffix trees due to the so-called memory bottleneck problem. The most space-efficient main-memory suffix tree construction algorithm takes nine hours and 45 GB of memory to index the human genome [S. Kurtz (1999)]. We show that suffix trees for long DNA sequences can be efficiently constructed on disk using a small, bounded amount of main memory, and, therefore, all existing algorithms based on suffix trees can be used to handle long DNA sequences that cannot be held in main memory. We adopt a two-phase strategy to construct a suffix tree on disk: 1) construct a disk-based suffix tree without suffix links, and 2) rebuild the suffix links once the suffix tree has been constructed on disk, if needed. We propose a new disk-based suffix tree construction algorithm, called DynaCluster, which shows O(n log n) experimental behavior for CPU cost and linear behavior for I/O cost. DynaCluster needs only 16 MB of main memory to construct suffix trees for DNA sequences of more than 200 Mbp, and it significantly outperforms existing disk-based suffix-tree construction algorithms that use pre-partitioning techniques, in terms of both construction and query processing cost. We conducted extensive performance studies and report our findings in this paper.
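
The data structure itself can be illustrated with a naive in-memory suffix trie. This sketch only shows why the structure answers exact substring queries by a single root-to-node walk; it is O(n²) in space and does not reproduce the paper's contribution (DynaCluster's bounded-memory, two-phase disk construction):

```python
# A naive in-memory suffix trie over a short DNA string: every suffix is
# inserted character by character, so any substring of the text appears
# as a path starting at the root.

def build_suffix_trie(s):
    s += "$"                      # unique terminator
    root = {}
    for i in range(len(s)):       # insert the suffix starting at i
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Exact substring query: walk the trie along the pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("GATTACA")
print(contains(trie, "TTAC"))   # True: "TTAC" is a substring
print(contains(trie, "TAG"))    # False
```

For a 3 Gbp genome this naive structure is hopeless, which is exactly the memory bottleneck the abstract describes; real implementations use compressed suffix trees and, as in this paper, disk-based construction.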


Knowledge Discovery and Data Mining | 1998

Identifying Relevant Databases for Multidatabase Mining

Huan Liu; Hongjun Lu; Jun Yao

Various tools and systems for knowledge discovery and data mining have been developed and are available for applications. However, when we are immersed in heaps of databases, an immediate question facing practitioners is where we should start mining. In this paper, breaking away from the conventional data mining assumption that many databases be joined into one, we argue that the first step in multidatabase mining is to identify the databases that are most likely relevant to an application; without doing so, the mining process can be lengthy, aimless and ineffective. A relevance measure is thus proposed to identify relevant databases for mining tasks whose objective is to find patterns or regularities about certain attributes. An efficient implementation for identifying relevant databases is described. Experiments are conducted to validate the measure's performance and to show its promising applications.
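
One way to make the idea of a relevance measure concrete is a lift-style score: a database is more relevant to a target attribute when conditioning on another attribute shifts the target's distribution away from its prior. The data, attribute names, and this particular score are assumptions for illustration, not the paper's exact measure:

```python
# An illustrative relevance score: the maximum deviation of the target's
# conditional probability (given a selector value) from its prior. A
# database where some attribute strongly predicts the target scores high.

def lift(rows, selector, target):
    """Max deviation of P(target | selector=v) from P(target) over values v."""
    prior = sum(r[target] for r in rows) / len(rows)
    best = 0.0
    for v in {r[selector] for r in rows}:
        group = [r for r in rows if r[selector] == v]
        p = sum(r[target] for r in group) / len(group)
        best = max(best, abs(p - prior))
    return best

db_a = [{"region": "east", "churn": 1}, {"region": "east", "churn": 1},
        {"region": "west", "churn": 0}, {"region": "west", "churn": 0}]
db_b = [{"region": "east", "churn": 1}, {"region": "east", "churn": 0},
        {"region": "west", "churn": 1}, {"region": "west", "churn": 0}]

# db_a's region attribute fully determines churn; db_b's tells us nothing.
scores = {"db_a": lift(db_a, "region", "churn"),
          "db_b": lift(db_b, "region", "churn")}
best_db = max(scores, key=scores.get)
print(best_db)  # the database worth mining first
```

Ranking candidate databases by such a score before mining is what lets the miner skip databases whose attributes carry no information about the target, avoiding the "lengthy, aimless" search the abstract warns about.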


Extending Database Technology | 1992

Dynamic and Load-balanced Task-Oriented Database Query Processing in Parallel Systems

Hongjun Lu; Kian-Lee Tan

Most parallel database query processing methods proposed so far adopt the task-oriented approach: decomposing a query into tasks, allocating tasks to processors, and executing the tasks in parallel. However, this strategy may not be effective when some processors are overloaded with time-consuming tasks caused by unpredictable factors such as data skew. In this paper, we propose a dynamic and load-balanced task-oriented database query processing approach that minimizes the completion time of user queries. It consists of three phases: task generation, task acquisition and execution, and task stealing. Using this approach, a database query is decomposed into a set of tasks. At run-time, these tasks are allocated dynamically to available processors. When a processor completes its assigned tasks and no more new tasks are available, it steals subtasks from other overloaded processors to share their load. A performance study was conducted to demonstrate the feasibility and effectiveness of this approach, using the join query as an example. The techniques that can be used to select task donors from overloaded processors and to determine the amount of work to be transferred are discussed. The factors that may affect effectiveness, such as the number of tasks into which a query is decomposed, are also investigated.
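
The generate / acquire / steal loop can be sketched as a deterministic toy simulation (real systems run this concurrently). The two-processor setup, uniform task costs, and steal-half policy are illustrative assumptions:

```python
# A deterministic simulation of task stealing: each processor drains its
# own queue, and an idle processor steals half the remaining tasks from
# the most loaded peer, rebalancing a skewed initial allocation.

from collections import deque

def run(queues):
    done = {p: 0 for p in queues}
    while any(queues.values()):
        for p, q in queues.items():
            if q:
                q.popleft()            # execute one of our own tasks
                done[p] += 1
            else:                       # idle: steal from most loaded peer
                victim = max(queues, key=lambda v: len(queues[v]))
                for _ in range(len(queues[victim]) // 2):
                    queues[p].append(queues[victim].pop())
    return done

# Skewed initial allocation: p0 received the heavy partition (data skew).
queues = {"p0": deque(range(8)), "p1": deque(range(2))}
done = run(queues)
print(done)  # both processors end up doing useful work
```

Without stealing, p0 would execute 8 tasks while p1 idles after 2; with the steal-half policy the completion time is bounded by the busier processor's shortened queue, which is the load-balancing effect the paper studies.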


International Conference on Management of Data | 1992

H-trees: a dynamic associative search index for OODB

Chee Chin Low; Beng Chin Ooi; Hongjun Lu

The support of the superclass-subclass concept in object-oriented databases (OODB) makes an instance of a subclass also an instance of its superclass. As a result, the access scope of a query against a class in general includes the access scope of all its subclasses, unless specified otherwise. To support the superclass-subclass relationship efficiently, the index must achieve two objectives. First, the index must support efficient retrieval of instances from a single class. Second, it must also support efficient retrieval of instances from classes in a hierarchy of classes. In this paper, we propose a new index called the H-tree that supports efficient retrieval of instances of a single class as well as retrieval of instances of a class and its subclasses. The unique feature of H-trees is that they capture the superclass-subclass relationships. A performance analysis is conducted and both experimental and analytical results indicate that the H-tree is an efficient indexing structure for OODB.
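
The query semantics the H-tree supports can be illustrated in a few lines: a query against a class covers either the class alone or the class plus all its subclasses. The toy hierarchy and per-class dicts below are illustrative stand-ins for the per-class index components; the H-tree's actual nested structure is not reproduced:

```python
# Toy illustration of single-class vs. hierarchy-scoped queries in an
# OODB: instances of subclasses are also instances of the superclass, so
# a query's access scope may include every subclass.

subclasses = {"Vehicle": ["Car", "Truck"], "Car": [], "Truck": []}
index = {"Vehicle": {}, "Car": {"c1": 180}, "Truck": {"t1": 90}}  # oid -> speed

def query(cls, pred, with_subclasses=True):
    classes = [cls]
    if with_subclasses:
        stack = list(subclasses[cls])
        while stack:                      # walk down the class hierarchy
            c = stack.pop()
            classes.append(c)
            stack.extend(subclasses[c])
    return [oid for c in classes for oid, v in index[c].items() if pred(v)]

fast_vehicles = query("Vehicle", lambda speed: speed > 100)        # hierarchy scope
fast_cars = query("Car", lambda speed: speed > 100, False)         # single class
print(fast_vehicles)
print(fast_cars)
```

Scanning a separate index per class, as this sketch does, is exactly what the H-tree avoids: by capturing the superclass-subclass links inside one structure, it serves both query scopes without visiting each class's index independently.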

Collaboration


Dive into Hongjun Lu's collaborations.

Top Co-Authors

Kian-Lee Tan
National University of Singapore

Jeffrey Xu Yu
The Chinese University of Hong Kong

Beng Chin Ooi
National University of Singapore

Tok Wang Ling
National University of Singapore

Huan Liu
Arizona State University

Paolo Atzeni
Sapienza University of Rome

Rudy Setiono
National University of Singapore

Ge Yu
Northeastern University