Weifeng Su | Researchain

Publication

Featured researches published by Weifeng Su.

ACM Transactions on Database Systems | 2009

ODE: Ontology-assisted data extraction

Weifeng Su; Jiying Wang; Frederick H. Lochovsky

Online databases respond to a user query with result records encoded in HTML files. Data extraction, which is important for many applications, extracts the records from the HTML files automatically. We present a novel data extraction method, ODE (Ontology-assisted Data Extraction), which automatically extracts the query result records from the HTML pages. ODE first constructs an ontology for a domain according to information matching between the query interfaces and query result pages from different Web sites within the same domain. Then, the constructed domain ontology is used during data extraction to identify the query result section in a query result page and to align and label the data values in the extracted records. The ontology-assisted data extraction method is fully automatic and overcomes many of the deficiencies of current automatic data extraction methods. Experimental results show that ODE is extremely accurate for identifying the query result section in an HTML page, segmenting the query result section into query result records, and aligning and labeling the data values in the query result records.

extending database technology | 2006

Holistic schema matching for web query interfaces

Weifeng Su; Jiying Wang; Frederick H. Lochovsky

One significant part of today’s Web is Web databases, which can dynamically provide information in response to user queries. To help users submit queries to different Web databases, the query interface matching problem needs to be addressed. To solve this problem, we propose a new complex schema matching approach, Holistic Schema Matching (HSM). By examining the query interfaces of real Web databases, we observe that attribute matchings can be discovered from attribute-occurrence patterns. For example, First Name often appears together with Last Name, while it is rarely co-present with Author in the Books domain. Thus, we design a count-based greedy algorithm to identify which attributes are more likely to be matched in the query interfaces. In particular, HSM can identify both simple matching, i.e., 1:1 matching, and complex matching, i.e., 1:n or m:n matching, between attributes. Our experiments show that HSM can discover both simple and complex matchings accurately and efficiently on real data sets.
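The attribute-occurrence observation in the abstract above can be illustrated with a minimal sketch. It only counts how often pairs of attributes appear together across query interfaces; the actual HSM matching score and greedy selection step are not described in the abstract and are not reproduced here. The function name `cooccurrence_counts` and the sample interfaces are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(interfaces):
    """Count how often each unordered attribute pair appears in the same interface."""
    counts = Counter()
    for attrs in interfaces:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(attrs)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical query interfaces from a Books domain (illustrative data only).
interfaces = [
    {"First Name", "Last Name", "Title"},
    {"First Name", "Last Name", "ISBN"},
    {"Author", "Title", "ISBN"},
]

counts = cooccurrence_counts(interfaces)
print(counts[("First Name", "Last Name")])  # pair that appears together
print(counts[("Author", "First Name")])     # pair that never appears together
```

In this toy data, First Name and Author never share an interface, which mirrors the kind of pattern the abstract points to as evidence for a possible match.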

IEEE Transactions on Knowledge and Data Engineering | 2010

Record Matching over Query Results from Multiple Web Databases

Weifeng Su; Jiying Wang; Frederick H. Lochovsky

Record matching, which identifies the records that represent the same real-world entity, is an important step for data integration. Most state-of-the-art record matching methods are supervised, which requires the user to provide training data. These methods are not applicable to the Web database scenario, where the records to match are query results dynamically generated on the fly. Such records are query-dependent, and a prelearned method using training examples from previous query results may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates from the query result records of multiple Web databases. After removal of the same-source duplicates, the "presumed" nonduplicate records from the same source can be used as training examples, alleviating the burden of users having to manually label training examples. Starting from the nonduplicate set, we use two cooperating classifiers, a weighted component similarity summing classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well for the Web database scenario where existing supervised methods do not apply.

IEEE Transactions on Knowledge and Data Engineering | 2012

Combining Tag and Value Similarity for Data Extraction and Alignment

Weifeng Su; Jiying Wang; Frederick H. Lochovsky; Yi Liu

Web databases generate query result pages based on a user's query. Automatically extracting the data from these query result pages is very important for many applications, such as data integration, which need to cooperate with multiple web databases. We present a novel data extraction and alignment method called CTVS that combines both tag and value similarity. CTVS automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in the query result pages and then aligning the segmented QRRs into a table, in which the data values from the same attribute are put into the same column. Specifically, we propose new techniques to handle the case when the QRRs are not contiguous, which may be due to the presence of auxiliary information, such as a comment, recommendation or advertisement, and to handle any nested structure that may exist in the QRRs. We also design a new record alignment algorithm that aligns the attributes in a record, first pairwise and then holistically, by combining the tag and data value similarity information. Experimental results show that CTVS achieves high precision and outperforms existing state-of-the-art data extraction methods.

conference on information and knowledge management | 2006

Query result ranking over e-commerce web databases

Weifeng Su; Jiying Wang; Qiong Huang; Frederick H. Lochovsky

To deal with the problem of too many results returned from an E-commerce Web database in response to a user query, this paper proposes a novel approach to rank the query results. Based on the user query, we speculate how much the user cares about each attribute and assign a corresponding weight to it. Then, for each tuple in the query result, each attribute value is assigned a score according to its desirability to the user. These attribute value scores are combined according to the attribute weights to get a final ranking score for each tuple. Tuples with the top ranking scores are presented to the user first. Our ranking method is domain independent and requires no user feedback. Experimental results demonstrate that this ranking method can effectively capture a user's preferences.
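The combination step described above (per-attribute value scores combined by attribute weights) can be sketched as a weighted sum. This is only an illustration under stated assumptions: the abstract does not say how weights or value scores are derived, so the weights, the scoring functions, and the name `rank_tuples` are all hypothetical.

```python
def rank_tuples(tuples, weights, value_scores):
    """Order tuples by the weighted sum of their per-attribute value scores."""
    def score(t):
        return sum(weights[attr] * value_scores[attr](t[attr]) for attr in weights)
    return sorted(tuples, key=score, reverse=True)


# Hypothetical used-car query results (illustrative data only).
cars = [
    {"id": "a", "price": 10000, "mileage": 80000},
    {"id": "b", "price": 15000, "mileage": 20000},
    {"id": "c", "price": 8000, "mileage": 50000},
]

# Assumed attribute weights: this user is taken to care more about price.
weights = {"price": 0.7, "mileage": 0.3}

# Assumed value scores in [0, 1]: lower price and lower mileage score higher.
value_scores = {
    "price": lambda p: 1 - p / 20000,
    "mileage": lambda m: 1 - m / 100000,
}

ranked = rank_tuples(cars, weights, value_scores)
print([car["id"] for car in ranked])
```

A linear combination is the simplest reading of "combined according to the attribute weights"; the paper may use a different combination rule.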

meeting of the association for computational linguistics | 2004

A Kernel PCA Method for Superior Word Sense Disambiguation

Dekai Wu; Weifeng Su; Marine Carpuat

We introduce a new method for disambiguating word senses that exploits a nonlinear Kernel Principal Component Analysis (KPCA) technique to achieve accuracy superior to the best published individual models. We present empirical results demonstrating significantly better accuracy compared to the state-of-the-art achieved by either naive Bayes or maximum entropy models, on Senseval-2 data. We also contrast against another type of kernel method, the support vector machine (SVM) model, and show that our KPCA-based model outperforms the SVM-based model. It is hoped that these highly encouraging first results on KPCA for natural language processing tasks will inspire further development of these directions.

Enterprise Information Systems | 2013

Discovering latent commercial networks from online financial news articles

Yunqing Xia; Weifeng Su; Raymond Y. K. Lau; Yi Liu

Unlike most online social networks where explicit links among individual users are defined, the relations among commercial entities (e.g. firms) may not be explicitly declared in commercial Web sites. One main contribution of this article is the development of a novel computational model for the discovery of the latent relations among commercial entities from online financial news. More specifically, a CRF model which can exploit both structural and contextual features is applied to commercial entity recognition. In addition, a point-wise mutual information (PMI)-based unsupervised learning method is developed for commercial relation identification. To evaluate the effectiveness of the proposed computational methods, a prototype system called CoNet has been developed. Based on the financial news articles crawled from Google finance, the CoNet system achieves average F-scores of 0.681 and 0.754 in commercial entity recognition and commercial relation identification, respectively. Our experimental results confirm that the proposed shallow natural language processing methods are effective for the discovery of latent commercial networks from online financial news.
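For reference, the standard definition of point-wise mutual information between two events x and y is log2(p(x, y) / (p(x) p(y))). The sketch below estimates it from co-occurrence counts. How CoNet builds its counts and turns PMI values into relation decisions is not stated in the abstract, so the function name `pmi`, its parameters, and the example numbers are assumptions.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Point-wise mutual information (base 2) estimated from raw counts.

    count_xy: contexts (e.g. sentences) mentioning both entities
    count_x, count_y: contexts mentioning each entity
    total: total number of contexts
    """
    if count_xy == 0:
        return float("-inf")  # the two entities never co-occur
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: two firms co-mentioned in 10 of 1000 sentences.
print(pmi(10, 20, 50, 1000))   # positive: co-occur more often than chance
print(pmi(1, 10, 100, 1000))   # close to zero: consistent with independence
```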

international conference on data engineering | 2006

Holistic Query Interface Matching using Parallel Schema Matching

Weifeng Su; Jiying Wang; Frederick H. Lochovsky

Using query interfaces of different Web databases, we propose a new complex schema matching approach, Parallel Schema Matching (PSM). A parallel schema is formed by comparing two individual schemas and deleting common attributes. The attribute matching can be discovered from the attribute-occurrence patterns if many parallel schemas are available. A count-based greedy algorithm identifies which attributes are more likely to be matched. Experiments show that PSM can identify both simple matching and complex matching accurately and efficiently.
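The construction of a parallel schema described above (compare two schemas and delete their common attributes) can be sketched with set operations. Representing a schema as a set of attribute names and the parallel schema as a pair of leftover sets is an assumption made for illustration; `parallel_schema` is a hypothetical name.

```python
def parallel_schema(schema_a, schema_b):
    """Remove the attributes two schemas share and return what is left of each."""
    common = schema_a & schema_b
    return schema_a - common, schema_b - common


# Hypothetical query interface schemas from a Books domain.
s1 = {"Title", "Author", "ISBN"}
s2 = {"Title", "First Name", "Last Name", "ISBN"}

left, right = parallel_schema(s1, s2)
print(left)   # attributes only in s1
print(right)  # attributes only in s2
```

In this toy case the leftover attributes are Author on one side and First Name, Last Name on the other, which is the kind of pair from which a complex (1:n) matching candidate could be counted across many parallel schemas.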

data and knowledge engineering | 2014

Editorial: Spatial-aware interest group queries in location-based social networks

Yafei Li; Dingming Wu; Jianliang Xu; Byron Choi; Weifeng Su

With the recent advances in positioning and smartphone technologies, a number of social networks such as Twitter, Foursquare and Facebook are acquiring the dimension of location, thus bridging the gap between the physical world and online social networking services. Most of the location-based social networks have released check-in services that allow users to share their visiting locations with their friends. In this paper, users' interests are modeled by check-in actions. We propose a new type of Spatial-aware Interest Group (SIG) query that retrieves a user group of size k where each user is interested in the query keywords and they are close to each other in the Euclidean space. We prove that the SIG query problem is NP-complete. A family of efficient algorithms based on the IR-tree is thus proposed for the processing of SIG queries. Experiments on two real datasets show that our proposed algorithms achieve orders of magnitude improvement over the baseline algorithm.
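To make the SIG query concrete, here is a naive brute-force sketch that enumerates every group of size k. It is not the paper's IR-tree method or its baseline. Two readings are assumed because the abstract does not define them: a user is "interested" if their interest set contains all query keywords, and "close to each other" means the smallest maximum pairwise Euclidean distance. The name `brute_force_sig` and the data are hypothetical.

```python
from itertools import combinations
from math import dist


def brute_force_sig(users, keywords, k):
    """Return (group, diameter) for the size-k group with the smallest diameter."""
    # Assumption: a user qualifies only if they cover every query keyword.
    candidates = [u for u in users if keywords <= u["interests"]]
    best, best_diameter = None, float("inf")
    for group in combinations(candidates, k):
        # Diameter = largest pairwise distance inside the group.
        diameter = max(
            (dist(a["loc"], b["loc"]) for a, b in combinations(group, 2)),
            default=0.0,
        )
        if diameter < best_diameter:
            best, best_diameter = group, diameter
    return best, best_diameter


# Hypothetical users with check-in derived interests and 2D locations.
users = [
    {"name": "u1", "loc": (0.0, 0.0), "interests": {"coffee"}},
    {"name": "u2", "loc": (1.0, 0.0), "interests": {"coffee", "tea"}},
    {"name": "u3", "loc": (10.0, 10.0), "interests": {"coffee"}},
    {"name": "u4", "loc": (0.5, 0.0), "interests": {"tea"}},
]

group, diameter = brute_force_sig(users, {"coffee"}, 2)
print([u["name"] for u in group], diameter)
```

The enumeration grows combinatorially with the number of candidates, which is consistent with the abstract's motivation for index-based algorithms.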

web information systems engineering | 2006

Automatic hierarchical classification of structured deep web databases

Weifeng Su; Jiying Wang; Frederick H. Lochovsky

We present a method that automatically classifies structured deep Web databases according to a pre-defined topic hierarchy. We assume that there are some manually classified databases, i.e., training databases, in every node of the topic hierarchy. Each training database is probed using queries constructed from the node titles of the topic hierarchy and the query result counts reported by the database are used to represent the content of the database. Hence, when adding a new database it can be probed by the same set of queries and classified to a node whose training databases are most similar to the new one. Specifically, a support vector machine classifier is trained on each internal node of the topic hierarchy with these training databases and the new database can be classified into the hierarchy top-down level by level. A feature extension method is proposed to create discriminant features. Experiments run on real structured Web databases collected from the Internet show that this classification method is quite accurate.
