
Publication


Featured research published by Yiu-Kai Ng.


Data and Knowledge Engineering | 1999

Conceptual-model-based data extraction from multiple-record Web pages

David W. Embley; Douglas M. Campbell; Y. S. Jiang; Stephen W. Liddle; Deryle Lonsdale; Yiu-Kai Ng; Randy Smith

Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document's content. For these kinds of data-rich, multiple-record documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data automatically. The approach is based on an ontology – a conceptual model instance – that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents and structure it according to the generated database scheme. Experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth. Our approach is less labor-intensive than other approaches that manually or semiautomatically generate wrappers, and it is generally insensitive to changes in Web-page format.
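The pipeline the abstract describes (derive recognizers from an ontology, apply them to unstructured text, fill a generated schema) can be sketched minimally as follows. The fields, regular expressions, and sample snippet here are invented for illustration; the paper derives its recognizers from a conceptual-model instance rather than hand-written patterns.

```python
import re

# Illustrative ontology fragment: each field of interest is paired with a
# recognizer for its lexical appearance.  All patterns are made up for
# this sketch.
ONTOLOGY = {
    "age":   re.compile(r"\b(\d{1,3}) years? old\b"),
    "date":  re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b"),
    "price": re.compile(r"\$(\d+(?:\.\d{2})?)"),
}

def extract_record(text):
    """Apply every recognizer to an unstructured snippet and collect the
    constants it matches, keyed by the generated schema field."""
    record = {}
    for field, pattern in ONTOLOGY.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group(1)
    return record

snippet = "Sold for $120.50 on 3/14/1999 to a buyer 42 years old."
print(extract_record(snippet))
```

A real ontology would also encode relationships and context keywords so that recognized constants can be grouped into the correct records; this sketch only shows the constant-recognition step.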


International Conference on Management of Data | 1999

Record-boundary discovery in Web documents

David W. Embley; Y. S. Jiang; Yiu-Kai Ng

Extraction of information from unstructured or semistructured Web documents often requires the recognition and delimitation of records. (By “record” we mean a group of information relevant to some entity.) Without first chunking documents that contain multiple records according to record boundaries, extraction of record information is not likely to succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (it runs linearly for practical cases within the context of the larger data-extraction problem) and accurate (100% in the experiments we conducted).
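As a rough illustration of the flavour of such heuristics, the sketch below implements a single invented repeating-tag heuristic (not the paper's five heuristics or its consensus step): the tag that repeats most often among the direct children of the record subtree is taken as the candidate record separator.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta"}  # no closing tag expected

class ChildTagCounter(HTMLParser):
    """Count start tags that appear as direct children of the subtree
    root; a loose stand-in for a repeating-tag separator heuristic."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if self.depth == 1:          # direct child of the subtree root
            self.counts[tag] += 1
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

def candidate_separator(subtree_html):
    """Most frequently repeated child tag = candidate record separator."""
    parser = ChildTagCounter()
    parser.feed(subtree_html)
    tag, _count = parser.counts.most_common(1)[0]
    return tag

print(candidate_separator("<ul><li>rec 1</li><li>rec 2</li><li>rec 3</li></ul>"))
```

On a list-shaped subtree this picks `li`; on text separated by horizontal rules it picks `hr`. The paper combines several such signals before committing to a separator.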


International Conference on Conceptual Modeling | 1998

A Conceptual-Modeling Approach to Extracting Data from the Web

David W. Embley; Douglas M. Campbell; Y. S. Jiang; Stephen W. Liddle; Yiu-Kai Ng; Dallan Quass; Randy Smith

Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document’s content. For these kinds of data-rich documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data. The approach is based on an ontology – a conceptual model instance – that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents and structure it according to the generated database scheme. Experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth.


ACM Transactions on Database Systems | 1996

A normal form for precisely characterizing redundancy in nested relations

Wai Yin Mok; Yiu-Kai Ng; David W. Embley

We give a straightforward definition for redundancy in individual nested relations and define a new normal form that precisely characterizes redundancy for nested relations. We base our definition of redundancy on an arbitrary set of functional and multivalued dependencies, and show that our definition of nested normal form generalizes standard relational normalization theory. In addition, we give a condition that can prevent an unwanted structural anomaly in nested relations, namely, embedded nested relations with at most one tuple. Like other normal forms, our nested normal form can serve as a guide for database design.


Conference on Information and Knowledge Management | 1999

An automated approach for retrieving hierarchical data from HTML tables

Seung Jin Lim; Yiu-Kai Ng

Among the HTML elements, HTML tables [RHJ98] encapsulate hierarchically structured data (hierarchical data for short) in a tabular structure. HTML tables do not come with a rigid schema, and almost any form of two-dimensional table is acceptable according to the HTML grammar. This relaxation complicates the process of retrieving hierarchical data from HTML tables. In this paper, we propose an automated approach for retrieving hierarchical data from HTML tables. The proposed approach constructs the content tree of an HTML table, which captures the intended hierarchy of the data content of the table, without requiring the internal structure of the table to be known beforehand. Also, the user of the content tree does not deal with HTML tags while retrieving the desired data from the content tree. Our approach can be employed by (i) a query language written for retrieving hierarchically structured data, extracted from either the contents of HTML tables or other sources, (ii) a processor for converting HTML tables to XML documents, and (iii) a data warehousing repository for collecting hierarchical data from HTML tables and storing materialized views of the tables. The time complexity of the proposed retrieval approach is proportional to the number of HTML elements in an HTML table.
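A heavily reduced sketch of the idea (tag-free access to a table's content) is shown below for a flat, well-formed table with a single header row. The paper's content tree handles arbitrary nesting, spanned cells, and irregular layouts, none of which this sketch attempts.

```python
from html.parser import HTMLParser

class TableReader(HTMLParser):
    """Collect cell text from a simple <table>; real HTML tables may
    nest or span cells, which this sketch does not handle."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

def content_tree(table_html):
    """Turn a header row plus data rows into {header: value} nodes; a
    flat stand-in for the paper's hierarchical content tree."""
    reader = TableReader()
    reader.feed(table_html)
    header, *data = reader.rows
    return [dict(zip(header, row)) for row in data]

table = ("<table><tr><th>City</th><th>High</th></tr>"
         "<tr><td>Provo</td><td>88</td></tr>"
         "<tr><td>Orem</td><td>86</td></tr></table>")
print(content_tree(table))
```

The caller works entirely with header names and values, never with HTML tags, which is the property the content tree is meant to provide.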


Fuzzy Systems and Knowledge Discovery | 2005

A sentence-based copy detection approach for web documents

Rajiv Yerra; Yiu-Kai Ng

Web documents that are either partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which takes longer to filter for unique information and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this paper, we present a sentence-based copy detection approach for Web documents, which determines the existence of overlapped portions of any two given Web documents and graphically displays the locations of the (semantically) same sentences detected in the documents. Two sentences are treated as either the same or different according to the degree of similarity of the sentences, computed using either the three least-frequent 4-gram approach or the fuzzy-set information retrieval (IR) approach. Experimental results show that the fuzzy-set IR approach outperforms the three least-frequent 4-gram approach in our copy detection approach, which handles a wide range of documents in different subject areas and does not require static word lists.
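One plausible reading of a least-frequent-4-gram fingerprint is sketched below: represent each sentence by its three rarest character 4-grams (rarity measured against 4-gram counts over the whole document) and treat two sentences as the same when their fingerprints match. This is a guess at the flavour of the technique; the paper's exact construction may differ.

```python
from collections import Counter

def ngrams(text, n=4):
    """Character n-grams of a lowercased string."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def fingerprint(sentence, doc_counts, k=3):
    # Represent a sentence by its k rarest character 4-grams, rarity
    # measured against 4-gram counts over the whole document; ties are
    # broken lexicographically so the fingerprint is deterministic.
    grams = sorted(set(ngrams(sentence)), key=lambda g: (doc_counts[g], g))
    return frozenset(grams[:k])

def same_sentence(s1, s2, doc_counts):
    """Two sentences count as 'the same' when their fingerprints match."""
    return fingerprint(s1, doc_counts) == fingerprint(s2, doc_counts)

doc = "The cat sat on the mat. The cat sat on the mat, again."
counts = Counter(ngrams(doc))
print(fingerprint("The cat sat on the mat.", counts))
```

Comparing tiny fingerprints instead of whole sentences is what makes this family of approaches cheap enough to run over every sentence pair of two documents.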


Granular Computing | 2005

Detecting similar HTML documents using a fuzzy set information retrieval approach

Rajiv Yerra; Yiu-Kai Ng

Web documents that are either partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which takes longer to filter for unique information and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this paper, we present a new approach for detecting similar Web documents, especially HTML documents. Our detection approach determines the odds ratio of any two documents, which makes use of the degrees of resemblance of the documents, and graphically displays the locations of similar (not necessarily the same) sentences detected in the documents after (i) eliminating non-representative words in the sentences using stopword-removal and stemming algorithms, (ii) computing the degree of similarity of sentences using a fuzzy-set information retrieval approach, and (iii) matching the corresponding hierarchical content of the two documents using a simple tree-matching algorithm. The proposed method for detecting similar documents handles a wide range of Web pages of varying sizes, does not require static word lists, and is thus applicable to Web (especially HTML) documents in different subject areas, such as sports, news, and science.
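The fuzzy-set IR similarity in step (ii) can be illustrated with the classic fuzzy-set IR membership formula: the degree to which a sentence "contains" a word w is 1 - prod(1 - c(w, v)) over the sentence's words v, where c is a word-correlation factor. The toy correlation table and the symmetric averaging below are made up for this sketch; the paper's actual correlation factors and combination rule may differ.

```python
def membership(word, sentence_words, corr):
    """Degree to which a sentence 'contains' a word under the classic
    fuzzy-set IR model: 1 - prod(1 - c(word, v)) over sentence words v.
    corr maps unordered word pairs to correlation factors in [0, 1];
    a word correlates with itself with factor 1.0."""
    prod = 1.0
    for v in sentence_words:
        c = 1.0 if v == word else corr.get(frozenset((word, v)), 0.0)
        prod *= 1.0 - c
    return 1.0 - prod

def fuzzy_similarity(s1, s2, corr):
    """Symmetric sentence similarity: average, over the words of each
    sentence, of their membership in the other sentence."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    m12 = sum(membership(w, w2, corr) for w in w1) / len(w1)
    m21 = sum(membership(w, w1, corr) for w in w2) / len(w2)
    return (m12 + m21) / 2

# Toy correlation table: 'car' and 'auto' are near-synonyms.
corr = {frozenset(("car", "auto")): 0.9}
print(fuzzy_similarity("the car stopped", "the auto stopped", corr))
```

Because correlated words contribute partial membership, the two sentences above score close to 1.0 even though they share no exact wording in one position; a purely exact-match measure would penalize the substitution.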


Information Systems | 1997

Vertical fragmentation and allocation in distributed deductive database systems

Seung Jin Lim; Yiu-Kai Ng

Although approaches for vertical fragmentation and data allocation have been proposed, algorithms for vertical fragmentation and allocation of data and rules in distributed deductive database systems (DDDBSs) are lacking. In this paper, we present different approaches for vertical fragmentation of relations that are referenced by rules and an allocation strategy for rules and fragments in a DDDBS. The potential advantages of the proposed fragmentation and allocation scheme include maximal locality of query evaluation and minimization of communication cost in a distributed system, in addition to the desirable properties of (vertical) fragmentation and rule allocation as discussed in the literature. We also formulate the mathematical interpretation of the proposed vertical fragmentation and allocation algorithms.


Conference on Information and Knowledge Management | 2007

Using word similarity to eradicate junk emails

Maria Soledad Pera; Yiu-Kai Ng

Email is one of the most commonly used communication media these days; however, unsolicited emails obstruct this otherwise fast and convenient technology for information exchange and jeopardize the continuity of this popular communication tool. Waste of valuable resources and time and exposure to offensive content are only a few of the problems that arise as a result of junk emails. In addition, the monetary cost of processing junk emails reaches billions of dollars per year and is absorbed by public users and Internet service providers. Even though there has been extensive work dedicated to eradicating junk emails, none of the existing junk email detection approaches has been highly successful, since spammers have been able to infiltrate existing detection techniques. In this paper, we present a new tool, JunEX, which relies on the content similarity of emails to eradicate junk emails. JunEX compares each incoming email to a core of emails marked as junk by each individual user to identify unwanted emails while reducing the number of legitimate emails treated as junk, which is critical. Experiments conducted on JunEX verify its high accuracy.
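The overall scheme (compare each incoming email to a per-user core of known junk) can be sketched as follows. Cosine similarity over bags of words and the 0.8 threshold are stand-ins chosen for this illustration; JunEX uses its own word-similarity measure.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def is_junk(incoming, junk_core, threshold=0.8):
    """Flag an email as junk when it is sufficiently similar to any
    message this user previously marked as junk."""
    bag = Counter(incoming.lower().split())
    return any(cosine(bag, Counter(junk.lower().split())) >= threshold
               for junk in junk_core)

core = ["win a free prize now", "claim your free prize today"]
print(is_junk("WIN a FREE prize NOW", core))
print(is_junk("meeting moved to 3pm tomorrow", core))
```

Keeping the junk core per user is what lets the filter adapt to each user's own notion of junk, rather than relying on a global static word list.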


Intelligent Information Systems | 2014

Exploiting the wisdom of social connections to make personalized recommendations on scholarly articles

Maria Soledad Pera; Yiu-Kai Ng

Existing scholarly publication recommenders were designed to aid researchers, as well as ordinary users, in discovering pertinent literature in diverse academic fields. These recommenders, however, often (i) depend on the availability of users’ historical data in the form of ratings or access patterns, (ii) generate recommendations pertaining to users’ (articles included in their) profiles, as opposed to their current research interests, or (iii) fail to analyze valuable user-generated data at social sites that can enhance their performance. To address these design issues, we propose PReSA, a personalized recommender on scholarly articles. PReSA recommends articles bookmarked by the connections of a user U on a social bookmarking site that are not only similar in content to a target publication P currently of interest to U but are also popular among U’s connections. PReSA (i) relies on the content-similarity measure to identify potential academic publications to be recommended and (ii) uses only information readily available on popular social bookmarking sites to make recommendations. Empirical studies conducted using data from CiteULike have verified the efficiency and effectiveness of (the recommendation and ranking strategies of) PReSA, which outperforms a number of existing (scholarly publication) recommenders.
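The two signals the abstract names (content similarity to the target paper and popularity among the user's connections) can be combined in a minimal sketch like the one below. The Jaccard word-overlap measure and the similarity-times-popularity blend are guesses at the flavour of the approach, not PReSA's published ranking formula; all titles and user names are invented.

```python
from collections import Counter

def jaccard(a, b):
    """Word-overlap similarity between two titles (or abstracts)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def recommend(target, connections_bookmarks, k=2):
    """Rank articles bookmarked by a user's connections by content
    similarity to the target paper weighted by how many connections
    bookmarked them, and return the top k."""
    popularity = Counter()
    for bookmarks in connections_bookmarks.values():
        popularity.update(bookmarks)
    scored = {article: jaccard(target, article) * count
              for article, count in popularity.items()}
    ranked = sorted(scored, key=lambda article: -scored[article])
    return ranked[:k]

target = "extracting structured data from web tables"
bookmarks = {
    "alice": ["table extraction from web pages", "fuzzy logic basics"],
    "bob":   ["table extraction from web pages", "deep learning survey"],
}
print(recommend(target, bookmarks))
```

Note that only the connections' public bookmarks and the target paper's text are consulted, mirroring the abstract's claim that no ratings history is required.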

Collaboration


Top co-authors of Yiu-Kai Ng:

Rani Qumsiyeh, Brigham Young University
Seung Jin Lim, Brigham Young University
Y. S. Jiang, Brigham Young University
Dallan Quass, Brigham Young University
Randy Smith, University of Wisconsin-Madison