Yunbo Cao | Researchain

Publication

Featured research published by Yunbo Cao.

International Conference on Computational Linguistics | 2008

Understanding and Summarizing Answers in Community-Based Question Answering Services

Yuanjie Liu; Shasha Li; Yunbo Cao; Chin-Yew Lin; Dingyi Han; Yong Yu

Community-based question answering (cQA) services have accumulated millions of questions and their answers over time. In the process of accumulation, cQA services assume that questions always have unique best answers. However, with an in-depth analysis of questions and answers on cQA services, we find that the assumption cannot be true. According to the analysis, at least 78% of the cQA best answers are reusable when similar questions are asked again, but no more than 48% of them are indeed the unique best answers. We conduct the analysis by proposing taxonomies for cQA questions and answers. To better reuse the cQA content, we also propose applying automatic summarization techniques to summarize answers. Our results show that question-type oriented summarization techniques can improve cQA answer quality significantly.

International World Wide Web Conferences | 2005

Ranking definitions with supervised learning methods

Jun Xu; Yunbo Cao; Hang Li; Min Zhao

This paper is concerned with the problem of definition search. Specifically, given a term, we are to retrieve definitional excerpts of the term and rank the extracted excerpts according to their likelihood of being good definitions. This is in contrast to the traditional approaches of either generating a single combined definition or simply outputting all retrieved definitions. Definition ranking is essential for the task. Methods for performing definition ranking are proposed in this paper, which formalize the problem as either classification or ordinal regression. A specification for judging the goodness of a definition is given. We employ SVM as the classification model and Ranking SVM as the ordinal regression model respectively, such that they rank definition candidates according to their likelihood of being good definitions. Features for constructing the SVM and Ranking SVM models are defined. An enterprise search system based on this method has been developed and has been put into practical use. Experimental results indicate that the use of SVM and Ranking SVM can significantly outperform the baseline methods of using heuristic rules or employing the conventional information retrieval method of Okapi. This is true both when the answers are paragraphs and when they are sentences. Experimental results also show that SVM or Ranking SVM models trained in one domain can be adapted to another domain, indicating that generic models for definition ranking can be constructed.

ACM/IEEE Joint Conference on Digital Libraries | 2005

Automatic extraction of titles from general documents using machine learning

Yunhua Hu; Hang Li; Yunbo Cao; Dmitriy Meyerzon; Qinghua Zheng

We propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers. It has not been clear whether it could be possible to conduct automatic title extraction from general documents. As a case study, we consider extraction from Office documents, including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint respectively) and take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize formatting information such as font size as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from general documents. Precision and recall for title extraction from Word are 0.810 and 0.837 respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895 respectively, in an experiment on intranet data. Other important new findings in this work include that we can train models in one domain and apply them to another domain, and more surprisingly we can even train models in one language and apply them to another language. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.
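The abstract reports precision and recall separately. As a rough illustration only, the two can be combined into a single F1 score using the standard harmonic-mean definition. The F1 values are not stated in the abstract; they are derived here from the reported figures, and the helper name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (intranet data).
reported = {
    "Word": (0.810, 0.837),
    "PowerPoint": (0.875, 0.895),
}

# Derived values, not figures taken from the paper.
derived_f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, value in derived_f1.items():
    print(f"{name}: F1 = {value:.3f}")
```

Under these assumptions the derived scores come out at roughly 0.823 for Word and 0.885 for PowerPoint.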

Conference on Information and Knowledge Management | 2010

Automatic extraction of web data records containing user-generated content

Xinying Song; Jing Liu; Yunbo Cao; Chin-Yew Lin; Hsiao-Wuen Hon

In this paper, we are concerned with the problem of automatically extracting web data records that contain user-generated content (UGC). In previous work, web data records are usually assumed to be well-formed with a limited amount of UGC, and thus can be extracted by testing repetitive structure similarity. However, when a web data record includes a large portion of free-format UGC, the similarity test between records may fail, which in turn results in lower performance. In our work, we find that certain domain constraints (e.g., post-date) can be used to design better similarity measures capable of circumventing the influence of UGC. In addition, we also use anchor points provided by the domain constraints to improve the extraction process, which ends in an algorithm called MiBAT (Mining data records Based on Anchor Trees). We conduct extensive experiments on a dataset consisting of forum thread pages which are collected from 307 sites that cover 219 different forum software packages. Our approach achieves a precision of 98.9% and a recall of 97.3% with respect to post record extraction. On page level, it perfectly handles 91.7% of pages without extracting any wrong posts or missing any golden posts. We also apply our approach to comment extraction and achieve good results as well.

Meeting of the Association for Computational Linguistics | 2014

Collective Tweet Wikification based on Semi-supervised Graph Regularization

Hongzhao Huang; Yunbo Cao; Xiaojiang Huang; Heng Ji; Chin-Yew Lin

Wikification for tweets aims to automatically identify each concept mention in a tweet and link it to a concept referent in a knowledge base (e.g., Wikipedia). Due to the shortness of a tweet, a collective inference model incorporating global evidence from multiple mentions and concepts is more appropriate than a non-collective approach that links one mention at a time. In addition, it is challenging to generate sufficient high-quality labeled data for supervised models at low cost. To tackle these challenges, we propose a novel semi-supervised graph regularization model to incorporate both local and global evidence from multiple tweets through three fine-grained relations. In order to identify semantically related mentions for collective inference, we detect meta path-based semantic relations through social networks. Compared to the state-of-the-art supervised model trained from 100% labeled data, our proposed approach achieves comparable performance with 31% labeled data and obtains 5% absolute F1 gain with 50% labeled data.

Empirical Methods in Natural Language Processing | 2009

A Structural Support Vector Method for Extracting Contexts and Answers of Questions from Online Forums

Wen-Yun Yang; Yunbo Cao; Chin-Yew Lin

This paper addresses the issue of extracting contexts and answers of questions from post discussions of online forums. We propose a novel and unified model by customizing the structural Support Vector Machine method. Our customization has several attractive properties: (1) it gives a comprehensive graphical representation of thread discussion; (2) it designs special inference algorithms instead of general-purpose ones; and (3) it can be readily extended to different task preferences by varying loss functions. Experimental results on a real data set show that our methods are both promising and flexible.

Information Processing and Management | 2011

A structural support vector method for extracting contexts and answers of questions from online forums

Yunbo Cao; Wen-Yun Yang; Chin-Yew Lin; Yong Yu

This article addresses the issue of extracting contexts and answers of questions from posts of online discussion forums. In previous work, general-purpose graphical models have been employed without any customization to this specific extraction problem. Instead, in this article, we propose a unified approach to context and answer extraction by customizing the structural support vector machine method. The customization enables our proposal to explore various relations among sentences of posts and complex structures of threads. We design new inference algorithms to find or approximate the most violated constraint by utilizing the specific structure of forum threads, which enables us to efficiently find the global optimum of the customized optimizing problem. We also optimize practical performance measures by varying loss functions. Experimental results show that our methods are both promising and flexible.

Journal of the Association for Information Science and Technology | 2011

Re-ranking question search results by clustering questions

Yunbo Cao; Huizhong Duan; Chin-Yew Lin; Yong Yu

In this article, we address the problem of question clustering and study its use for re-ranking question search results. In question clustering, we have to organize question search results into certain meaningful and condensed groups. Specifically, we propose to use a data structure consisting of question topic and question focus for modeling questions, and then cluster questions on the basis of the data structure. Experimental results show that our approach to question clustering improves the performance of question search significantly more than the approach not utilizing the topic–focus structure.

Journal of Computer Science and Technology | 2006

A supervised learning approach to search of definitions

Jun Xu; Yunbo Cao; Hang Li; Min Zhao; Yalou Huang

This paper addresses the issue of search of definitions. Specifically, for a given term, we are to find its definition candidates and rank the candidates according to their likelihood of being good definitions. This is in contrast to the traditional methods of either generating a single combined definition or outputting all retrieved definitions. Definition ranking is essential for the task. A specification for judging the goodness of a definition is given. In the specification, a definition is categorized into one of three levels: good definition, indifferent definition, or bad definition. Methods of performing definition ranking are also proposed in this paper, which formalize the problem as either classification or ordinal regression. We employ SVM (Support Vector Machines) as the classification model and Ranking SVM as the ordinal regression model respectively, and thus they rank definition candidates according to their likelihood of being good definitions. Features for constructing the SVM and Ranking SVM models are defined, which represent the characteristics of terms, definition candidates, and their relationship. Experimental results indicate that the use of SVM and Ranking SVM can significantly outperform the baseline methods such as heuristic rules, the conventional information retrieval method Okapi, or SVM regression. This is true both when the answers are paragraphs and when they are sentences. Experimental results also show that SVM or Ranking SVM models trained in one domain can be adapted to another domain, indicating that generic models for definition ranking can be constructed.

International World Wide Web Conferences | 2013

An error driven approach to query segmentation

Wei Zhang; Yunbo Cao; Chin-Yew Lin; Jian Su; Chew Lim Tan

Query segmentation is the task of splitting a query into a sequence of non-overlapping segments that completely cover all tokens in the query. The majority of query segmentation methods are unsupervised. In this paper, we propose an error-driven approach to query segmentation (EDQS) with the help of search logs, which enables unsupervised training with guidance from the system-specific errors. In EDQS, we first detect the system's errors by examining the consistency among the segmentations of similar queries. Then, a model is trained on the detected errors to select the correct segmentation of a new query from the top-n outputs of the system. Our evaluation results show that EDQS can significantly boost the performance of state-of-the-art query segmentation methods on a publicly available data set.
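To make the task definition concrete, the following minimal sketch enumerates every possible segmentation of a tokenized query, that is, every way to split it into contiguous, non-overlapping segments that together cover all tokens. It illustrates only the search space, not the EDQS method itself; the function name `segmentations` and the example query are assumptions made for illustration.

```python
from itertools import product


def segmentations(tokens):
    """Yield every split of tokens into contiguous, non-overlapping segments."""
    n = len(tokens)
    if n == 0:
        return
    # Each of the n - 1 gaps between adjacent tokens is either a boundary or not.
    for cuts in product([False, True], repeat=n - 1):
        segments, current = [], [tokens[0]]
        for token, cut in zip(tokens[1:], cuts):
            if cut:
                segments.append(tuple(current))
                current = [token]
            else:
                current.append(token)
        segments.append(tuple(current))
        yield segments


# Hypothetical example query, chosen only for illustration.
query = "new york times square".split()
all_segs = list(segmentations(query))

for seg in all_segs:
    print(" | ".join(" ".join(part) for part in seg))
```

A query of n tokens has 2^(n-1) candidate segmentations, so a segmentation system has to choose one among exponentially many options as queries grow longer.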
