Jiying Wang | Researchain

Publication

Featured research published by Jiying Wang.

Very Large Data Bases | 2004

Instance-based schema matching for web databases by domain-specific query probing

Jiying Wang; Ji-Rong Wen; Frederick H. Lochovsky; Wei-Ying Ma

In a Web database that dynamically provides information in response to user queries, two distinct schemas, interface schema (the schema users can query) and result schema (the schema users can browse), are presented to users. Each partially reflects the actual schema of the Web database. Most previous work only studied the problem of schema matching across query interfaces of Web databases. In this paper, we propose a novel schema model that distinguishes the interface and the result schema of a Web database in a specific domain. In this model, we address two significant Web database schema-matching problems: intra-site and inter-site. The first problem is crucial in automatically extracting data from Web databases, while the second problem plays a significant role in meta-retrieving and integrating data from different Web databases. We also investigate a unified solution to the two problems based on query probing and instance-based schema matching techniques. Using the model, a cross validation technique is also proposed to improve the accuracy of the schema matching. Our experiments on real Web databases demonstrate that the two problems can be solved simultaneously with high precision and recall.
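The intra-site idea of query probing can be illustrated with a minimal sketch: submit a known value through one interface attribute, then check which result column the value shows up in. This is not the paper's algorithm, only an illustration under stated assumptions; the function name `match_interface_attribute`, the column labels, and the sample rows are all hypothetical.

```python
from collections import Counter

def match_interface_attribute(probe_value, result_rows):
    """Return the result column that most often contains the probed value.

    probe_value: the instance submitted through one interface attribute.
    result_rows: extracted result records, each a dict of column -> cell text.
    Returns None when the value appears in no column.
    """
    hits = Counter()
    needle = probe_value.lower()
    for row in result_rows:
        for column, cell in row.items():
            # Case-insensitive substring test; a real system would need
            # a more robust notion of instance similarity.
            if needle in str(cell).lower():
                hits[column] += 1
    return hits.most_common(1)[0][0] if hits else None

# Hypothetical result page after probing the "Author" field with "Tolkien".
rows = [
    {"col1": "The Hobbit", "col2": "J. R. R. Tolkien"},
    {"col1": "The Silmarillion", "col2": "J. R. R. Tolkien"},
]
matched = match_interface_attribute("Tolkien", rows)
```

Here the interface attribute that was probed would be paired with `col2`, because that is the only column in which the probed instance reappears.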

ACM Transactions on Database Systems | 2009

ODE: Ontology-assisted data extraction

Weifeng Su; Jiying Wang; Frederick H. Lochovsky

Online databases respond to a user query with result records encoded in HTML files. Data extraction, which is important for many applications, extracts the records from the HTML files automatically. We present a novel data extraction method, ODE (Ontology-assisted Data Extraction), which automatically extracts the query result records from the HTML pages. ODE first constructs an ontology for a domain according to information matching between the query interfaces and query result pages from different Web sites within the same domain. Then, the constructed domain ontology is used during data extraction to identify the query result section in a query result page and to align and label the data values in the extracted records. The ontology-assisted data extraction method is fully automatic and overcomes many of the deficiencies of current automatic data extraction methods. Experimental results show that ODE is extremely accurate for identifying the query result section in an HTML page, segmenting the query result section into query result records, and aligning and labeling the data values in the query result records.

Extending Database Technology | 2006

Holistic schema matching for web query interfaces

Weifeng Su; Jiying Wang; Frederick H. Lochovsky

One significant part of today’s Web is Web databases, which can dynamically provide information in response to user queries. To help users submit queries to different Web databases, the query interface matching problem needs to be addressed. To solve this problem, we propose a new complex schema matching approach, Holistic Schema Matching (HSM). By examining the query interfaces of real Web databases, we observe that attribute matchings can be discovered from attribute-occurrence patterns. For example, First Name often appears together with Last Name while it is rarely co-present with Author in the Books domain. Thus, we design a count-based greedy algorithm to identify which attributes are more likely to be matched in the query interfaces. In particular, HSM can identify both simple matching, i.e., 1:1 matching, and complex matching, i.e., 1:n or m:n matching, between attributes. Our experiments show that HSM can discover both simple and complex matchings accurately and efficiently on real data sets.
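The attribute-occurrence observation can be made concrete with a small counting sketch. It is not the HSM algorithm or its scoring function; it only counts how often attributes co-occur across interfaces and lists pairs that are never seen together as candidate matches. The sample schemas and the `max_together` threshold are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(schemas):
    """Count how often each attribute and each attribute pair occurs."""
    single, pair = Counter(), Counter()
    for schema in schemas:
        attrs = sorted(set(schema))
        single.update(attrs)
        pair.update(combinations(attrs, 2))  # keys are sorted (a, b) tuples
    return single, pair

def synonym_candidates(schemas, max_together=0):
    """List attribute pairs that (almost) never appear in the same interface.

    Pairs are ranked by the frequency of their rarer attribute, so that
    well-attested attributes come first.
    """
    single, pair = cooccurrence_counts(schemas)
    ranked = []
    for a, b in combinations(sorted(single), 2):
        if pair[(a, b)] <= max_together:
            ranked.append((min(single[a], single[b]), a, b))
    ranked.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(a, b) for _, a, b in ranked]

# Hypothetical query interfaces from the Books domain.
schemas = [
    {"Title", "Author"},
    {"Title", "First Name", "Last Name"},
    {"Title", "Author", "ISBN"},
    {"Title", "First Name", "Last Name", "ISBN"},
]
single, pair = cooccurrence_counts(schemas)
candidates = synonym_candidates(schemas)
```

On this toy input, First Name and Last Name co-occur in two interfaces, while Author never appears with either of them, which is the pattern that suggests a 1:n matching between Author and the group {First Name, Last Name}.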

IEEE Transactions on Knowledge and Data Engineering | 2010

Record Matching over Query Results from Multiple Web Databases

Weifeng Su; Jiying Wang; Frederick H. Lochovsky

Record matching, which identifies the records that represent the same real-world entity, is an important step for data integration. Most state-of-the-art record matching methods are supervised, which requires the user to provide training data. These methods are not applicable for the Web database scenario, where the records to match are query results dynamically generated on-the-fly. Such records are query-dependent and a prelearned method using training examples from previous query results may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates from the query result records of multiple Web databases. After removal of the same-source duplicates, the "presumed" nonduplicate records from the same source can be used as training examples alleviating the burden of users having to manually label training examples. Starting from the nonduplicate set, we use two cooperating classifiers, a weighted component similarity summing classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well for the Web database scenario where existing supervised methods do not apply.
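A weighted component similarity sum can be sketched as follows. This only illustrates the general idea of scoring a record pair field by field and combining the scores with weights; it is not UDD itself. The token-level Jaccard similarity, the fixed weights, and the sample records are assumptions, and the abstract does not specify how UDD sets or adjusts its weights.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def weighted_similarity(rec_a, rec_b, weights):
    """Sum per-field similarities, each scaled by its (assumed) weight."""
    return sum(
        w * jaccard(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, w in weights.items()
    )

# Hypothetical weights that sum to 1, so the score lies in [0, 1].
weights = {"title": 0.6, "author": 0.4}

r1 = {"title": "Data Integration", "author": "A. Smith"}
r2 = {"title": "Data Integration", "author": "A. Smith"}
r3 = {"title": "Graph Theory", "author": "B. Jones"}

same = weighted_similarity(r1, r2, weights)
different = weighted_similarity(r1, r3, weights)
```

A pair would then be treated as a duplicate when its score exceeds some threshold; choosing that threshold is left open here.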

IEEE Transactions on Knowledge and Data Engineering | 2012

Combining Tag and Value Similarity for Data Extraction and Alignment

Weifeng Su; Jiying Wang; Frederick H. Lochovsky; Yi Liu

Web databases generate query result pages based on a user's query. Automatically extracting the data from these query result pages is very important for many applications, such as data integration, which need to cooperate with multiple web databases. We present a novel data extraction and alignment method called CTVS that combines both tag and value similarity. CTVS automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in the query result pages and then aligning the segmented QRRs into a table, in which the data values from the same attribute are put into the same column. Specifically, we propose new techniques to handle the case when the QRRs are not contiguous, which may be due to the presence of auxiliary information, such as a comment, recommendation or advertisement, and for handling any nested structure that may exist in the QRRs. We also design a new record alignment algorithm that aligns the attributes in a record, first pairwise and then holistically, by combining the tag and data value similarity information. Experimental results show that CTVS achieves high precision and outperforms existing state-of-the-art data extraction methods.

International Conference on Data Engineering | 2006

Holistic Query Interface Matching using Parallel Schema Matching

Weifeng Su; Jiying Wang; Frederick H. Lochovsky

Using query interfaces of different Web databases, we propose a new complex schema matching approach, Parallel Schema Matching (PSM). A parallel schema is formed by comparing two individual schemas and deleting common attributes. The attribute matching can be discovered from the attribute-occurrence patterns if many parallel schemas are available. A count-based greedy algorithm identifies which attributes are more likely to be matched. Experiments show that PSM can identify both simple matching and complex matching accurately and efficiently.
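The construction of a parallel schema, as described above, amounts to removing the attributes two schemas share and keeping what remains on each side. The sketch below shows that step plus a simple tally over all schema pairs; the tally is an assumed simplification and not the paper's count-based greedy algorithm. The sample schemas are hypothetical.

```python
from collections import Counter
from itertools import combinations

def parallel_schema(schema_a, schema_b):
    """Delete the common attributes and return the two remainders."""
    a, b = set(schema_a), set(schema_b)
    common = a & b
    return frozenset(a - common), frozenset(b - common)

def count_parallel_schemas(schemas):
    """Tally how often each pair of remainders occurs across schema pairs."""
    counts = Counter()
    for s, t in combinations(schemas, 2):
        left, right = parallel_schema(s, t)
        if left and right:
            # Stored as an unordered pair so (X, Y) and (Y, X) count together.
            counts[frozenset((left, right))] += 1
    return counts

# Hypothetical Books-domain query interfaces.
s1 = {"Title", "Author"}
s2 = {"Title", "First Name", "Last Name"}
s3 = {"Title", "Author", "ISBN"}

left, right = parallel_schema(s1, s2)
counts = count_parallel_schemas([s1, s2, s3])
```

For `s1` and `s2`, the remainders are {Author} and {First Name, Last Name}, which is the kind of evidence that would support a 1:n matching if it recurs across many parallel schemas.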

Web Information Systems Engineering | 2006

Automatic hierarchical classification of structured deep web databases

Weifeng Su; Jiying Wang; Frederick H. Lochovsky

We present a method that automatically classifies structured deep Web databases according to a pre-defined topic hierarchy. We assume that there are some manually classified databases, i.e., training databases, in every node of the topic hierarchy. Each training database is probed using queries constructed from the node titles of the topic hierarchy and the query result counts reported by the database are used to represent the content of the database. Hence, when adding a new database it can be probed by the same set of queries and classified to a node whose training databases are most similar to the new one. Specifically, a support vector machine classifier is trained on each internal node of the topic hierarchy with these training databases and the new database can be classified into the hierarchy top-down level by level. A feature extension method is proposed to create discriminant features. Experiments run on real structured Web databases collected from the Internet show that this classification method is quite accurate.
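Representing a database by its probe result counts and assigning it to the most similar node can be sketched as below. Note the substitution: the abstract describes a support vector machine trained at each internal node, whereas this sketch uses a plain nearest-neighbor rule with cosine similarity on a single level, purely to show the representation. The probe terms, counts, and node names are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(new_counts, training):
    """Pick the node whose closest training database is most similar.

    training: dict mapping node title -> list of probe-count vectors,
    one vector per manually classified database.
    """
    return max(
        training,
        key=lambda node: max(cosine(new_counts, t) for t in training[node]),
    )

# Hypothetical result counts for the probe queries ["books", "music"].
training = {
    "Books": [[900, 20], [500, 5]],
    "Music": [[10, 700]],
}
label = classify([300, 12], training)
```

A hierarchical version would repeat this decision at each internal node, descending level by level as the abstract describes.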

International World Wide Web Conferences | 2003