Shuo Bai

Chinese Academy of Sciences

Publication

Featured research published by Shuo Bai.

Journal of Computer Science and Technology | 2002

Semantic computation in a Chinese question-answering system

Sujian Li; Jian Zhang; Xiong Huang; Shuo Bai; Qun Liu

This paper introduces a kind of semantic computation and shows how to integrate it into our Chinese Question-Answering (QA) system. Based on two language resources, HowNet and Cilin, we present an approach to computing the similarity and relevancy between words. Using these results, we can calculate the relevancy between two sentences and then obtain the optimal answer for the query in the system. The calculation adopts quantitative methods and can be incorporated into QA systems easily, avoiding some difficulties in conventional NLP (Natural Language Processing) problems. The experiments show that the results are satisfactory.

Knowledge Discovery and Data Mining | 2008

Detecting near-duplicates in large-scale short text databases

Caichun Gong; Yulan Huang; Xueqi Cheng; Shuo Bai

Near-duplicates are abundant in short text databases. Detecting and eliminating them is of great importance. SimFinder, proposed in this paper, is a fast algorithm for identifying all near-duplicates in large-scale short text databases. An ad hoc term weighting scheme is employed to measure each term's discriminative ability. A certain number of terms with higher weights are selected as features for each short text. SimFinder generates several fingerprints for each text, and only texts with at least one fingerprint in common are compared with each other. An optimization procedure is employed in SimFinder to make it more efficient. Experiments indicate that SimFinder is an effective solution for short text duplicate detection with almost linear time and storage complexity. Both precision and recall of SimFinder are promising.
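The candidate-generation idea in this abstract (weight terms, keep the top-weighted terms as features, compare only texts that share a fingerprint) can be sketched roughly as follows. This is an illustrative sketch, not SimFinder itself: the abstract does not specify its weighting scheme, fingerprint construction, or similarity measure, so the IDF weights, the use of single terms as fingerprints, the Jaccard comparison, and all function names and thresholds here are assumptions.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def idf_weights(texts):
    """Weight terms by inverse document frequency (assumed stand-in weighting)."""
    n = len(texts)
    df = Counter(term for text in texts for term in set(text.split()))
    return {term: math.log(n / count) for term, count in df.items()}


def fingerprints(text, weights, num_features=3):
    """Return the top-weighted terms of a text; each term acts as one fingerprint."""
    terms = set(text.split())
    # Sort by descending weight, breaking ties alphabetically for determinism.
    ranked = sorted(terms, key=lambda t: (-weights[t], t))
    return ranked[:num_features]


def jaccard(a, b):
    """Jaccard similarity of two texts' term sets (assumed comparison measure)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def near_duplicates(texts, threshold=0.6, num_features=3):
    """Compare only texts sharing at least one fingerprint; return similar pairs."""
    weights = idf_weights(texts)
    buckets = defaultdict(set)
    for idx, text in enumerate(texts):
        for fp in fingerprints(text, weights, num_features):
            buckets[fp].add(idx)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return sorted(
        (i, j) for i, j in candidates if jaccard(texts[i], texts[j]) >= threshold
    )
```

Because pairs are drawn only from shared fingerprint buckets, most unrelated texts are never compared, which is the property that keeps this style of detection close to linear in practice when buckets stay small.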

Journal of Computer Science and Technology | 2005

Computation on sentence semantic distance for novelty detection

Hua-Ping Zhang; Jian Sun; Bing Wang; Shuo Bai

Novelty detection aims to retrieve new information and filter out redundancy from given sentences that are relevant to a specific topic. In TREC 2003, the authors tried an approach to novelty detection based on semantic distance computation. The motivation is to expand a sentence by introducing semantic information. The computation of semantic distance between sentences combines WordNet with statistical information. Novelty detection is treated as a binary classification problem: whether a sentence is new or not. The feature vector, used in the vector space model for classification, consists of various factors, including the semantic distance from the sentence to the topic and the distance from the sentence to the relevant context preceding it. New sentences are then detected with Winnow and support vector machine classifiers, respectively. Several experiments are conducted to examine the relationship between the different factors and performance. The results show that semantic computation is promising for novelty detection. The ratio of new sentence size to relevant size is further studied for different relevant document sizes. The ratio is found to decrease at a certain rate (about 0.86). Another group of experiments is then performed under the supervision of this ratio, demonstrating that the ratio helps improve novelty detection performance.

Asia Information Retrieval Symposium | 2008

Blog post and comment extraction using information quantity of web format

Donglin Cao; Xiangwen Liao; Hongbo Xu; Shuo Bai

With the growth of research on the blogosphere, extracting the post and comments from a blog page has become more important for improving search performance. In this paper, we present a two-stage method. First, we combine the advantages of visual information and effective text information to locate the main text, which represents the theme of the blog page. Second, we use the information quantity of separators to detect the boundary between the post and the comments. According to our experiments, this method achieves good extraction performance and improves the performance of blog search.

International Joint Conference on Natural Language Processing | 2004

A re-examination of IR techniques in QA system

Yi Chang; Hongbo Xu; Shuo Bai

In our experience with the TREC QA Track, the performance of Information Retrieval in Question Answering systems is not satisfactory. In this article, we conduct a comparative study to re-examine IR techniques for document retrieval and sentence-level retrieval, respectively. Our study shows that: 1) query reformulation should be a necessary step to achieve better retrieval performance; 2) the techniques for document retrieval are also effective in sentence-level retrieval, and a single sentence is the appropriate retrieval granularity.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2002

Example-based phrase translation in Chinese-English CLIR

Bin Wang; Xueqi Cheng; Shuo Bai

This paper proposes an example-based phrase translation method for a Chinese-to-English cross-language information retrieval (CLIR) system. The method can generate much more accurate query translations than dictionary-based and common MT-based methods, and thereby improves the retrieval performance of our CLIR system.

Asia Information Retrieval Symposium | 2009

A Clustering Framework Based on Adaptive Space Mapping and Rescaling

Yiling Zeng; Hongbo Xu; Jiafeng Guo; Yu Wang; Shuo Bai

Traditional clustering algorithms often suffer from the model misfit problem when the distribution of real data does not fit the model assumptions. To address this problem, we propose a novel clustering framework based on adaptive space mapping and rescaling, referred to as the M-R framework. The basic idea of our approach is to adjust the data representation so that the data distribution fits the model assumptions better. Specifically, documents are first mapped into a low-dimensional space with respect to the cluster centers so that the distribution statistics of each cluster can be analyzed on the corresponding dimension. With these statistics in hand, a rescaling operation is then applied to regularize the data distribution based on the model assumptions. These two steps are conducted iteratively along with the clustering algorithm to continually improve the clustering performance. In our work, we apply the M-R framework to the most widely used clustering algorithm, k-means, as an example. Experiments on well-known datasets show that our M-R framework achieves performance comparable to state-of-the-art methods.

Systems, Man, and Cybernetics | 2001

Automatic extraction of lexical relations from Chinese machine readable dictionary

Sujian Li; Qun Liu; Shuo Bai; Xueqi Cheng

Lexical relations are very important for NLP. Most previous work acquires them by hand. In this paper, we describe an automated strategy that exploits a machine readable dictionary (MRD) to construct a richly structured network of lexical relations. In our system, lexical relations include five basic semantic relations, two phonetic relations and one orthographic relation. These relations constitute the basic framework of our lexical network. We then present an approach that uses heuristic functions to extract semantic relations during syntactic parsing. Experimental results demonstrate that our method is effective.

ACM Symposium on Applied Computing | 2010

Introducing global scaling parameters into Ncut

Yiling Zeng; Hongbo Xu; Xueqi Cheng; Shuo Bai

Gaussian similarity is commonly used in spectral clustering. It generates the affinity matrix by mainly considering point-to-point distances in a local region with respect to the scaling parameters Δ. As a result, global information is not considered. To address this problem, we design a mapping and rescaling framework (referred to as the M-R framework) to introduce global scaling parameters into spectral clustering. The M-R framework is applied to Normalized Cut to form the M-R Ncut algorithm, which obtains remarkable performance improvements in our experimental evaluations.
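For readers unfamiliar with the Gaussian similarity mentioned here, the sketch below builds an affinity matrix in the form commonly used in spectral clustering, exp(-||x_i - x_j||^2 / (2 * scale^2)) with a zero diagonal. The abstract does not give the formula, so this standard form, the single shared `scale` argument (standing in for the scaling parameter written Δ in the abstract), and the function name are assumptions. It does not implement the M-R framework or M-R Ncut.

```python
import math


def gaussian_affinity(points, scale):
    """Build a Gaussian affinity matrix with one shared scaling parameter.

    Entry (i, j) is exp(-||x_i - x_j||^2 / (2 * scale^2)); the diagonal is
    left at zero, a common convention in spectral clustering.
    """
    n = len(points)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                affinity[i][j] = math.exp(-sq_dist / (2 * scale ** 2))
    return affinity


# Three points on a line: two close together, one far away.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
matrix = gaussian_affinity(points, scale=1.0)
```

With a small `scale`, the affinity between the two nearby points is substantial while the affinity to the distant point is vanishingly small, which illustrates why the matrix mainly reflects local structure.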

Web Intelligence | 2008

A Novel Language Model Based on Cognition Attention Attenuation in Web Retrieval

Donglin Cao; Hongbo Xu; Shuo Bai; Xueqi Cheng; Shaozi Li

Language models are widely used in many retrieval systems. Their document representation is based on the bag-of-words assumption. Hence, each term in a document is treated equally, and only term frequency is considered as evidence of a term's importance. In this paper, we study the problem of cognition attention attenuation in processing documents and present a language model based on cognition attention attenuation. This model estimates the document model through an attenuation process over the terms in a document. Compared with the classical language model, its advantage is that it takes into account the document structure, which is often used in text summarization. In our experiments, the cognition attention attenuation based language model outperformed the classical language model with Dirichlet smoothing on blog pages and Web pages.
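The baseline named in this abstract, the classical language model with Dirichlet smoothing, can be sketched as a query-likelihood scorer. This shows only the standard baseline, p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu), not the attenuation-based model, whose details the abstract does not give. The function name, the default `mu`, and the choice to skip query terms unseen in the collection are assumptions made for illustration.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection, mu=2000.0):
    """Log query likelihood of `doc` under a Dirichlet-smoothed unigram model.

    `query` and `doc` are lists of terms; `collection` is a list of such docs.
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(term for d in collection for term in d)
    coll_len = sum(coll_counts.values())
    score = 0.0
    for term in query:
        p_coll = coll_counts[term] / coll_len
        if p_coll == 0.0:
            continue  # assumed handling: ignore terms absent from the collection
        # Bag-of-words estimate: only the term frequency in the document matters.
        p_term = (doc_counts[term] + mu * p_coll) / (len(doc) + mu)
        score += math.log(p_term)
    return score


docs = [["blog", "post", "search"], ["web", "page", "search"]]
scores = [dirichlet_score(["blog"], d, docs) for d in docs]
```

Here every occurrence of a term contributes equally regardless of where it appears in the document, which is the bag-of-words limitation the abstract's attenuation-based model is meant to address.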
