IEEE/WIC/ACM International Conference on Web Intelligence - Companion Volume | 2019

Automatic Construction of Chinese Typo-Pairs Based on Web Corpus

Abstract


With the growth of big data, the volume of text data keeps increasing, and so does the number of errors it contains. Traditional manual correction cannot meet actual demand, so automatic text proofreading by computer is becoming the trend. Chinese text errors can be divided into two categories: non-word errors and real-word errors. When one or more characters in a Chinese word are replaced by other characters and the resulting string is not in the Chinese dictionary, we call this a non-word error. In Chinese NLP, word segmentation is performed on the corpus first, and a non-word error is split into several scattered strings. This causes problems for Chinese text proofreading, because the Chinese dictionary contains both single-character and multi-character words. In this paper, we propose an approach to constructing Chinese typo-pairs from a Web corpus, which can be used for efficient automatic proofreading of Chinese text. The method first adds similar words to a candidate set using a fuzzy matching algorithm, then validates the similar words in the candidate set using statistical models, and finally constructs the typo-pairs.
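The three steps named in the abstract (fuzzy matching into a candidate set, statistical validation, typo-pair construction) can be illustrated with a minimal sketch. The abstract does not specify the fuzzy matching algorithm or the statistical models, so the sketch makes two loudly labeled assumptions: single-character substitution between equal-length strings stands in for fuzzy matching, and a simple corpus frequency ratio stands in for the statistical models. The function names, the `min_ratio` threshold, and the toy counts are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def differs_by_one_char(a: str, b: str) -> bool:
    """True when a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def candidate_pairs(words):
    """Assumed fuzzy-matching step: pair strings that are one substitution apart."""
    return [
        (a, b)
        for a, b in combinations(sorted(words), 2)
        if differs_by_one_char(a, b)
    ]


def build_typo_pairs(counts: Counter, min_ratio: float = 10.0):
    """Assumed validation step: keep a candidate pair only when one form is
    at least min_ratio times more frequent than the other.

    Returns (typo, correction) tuples, with the rarer form treated as the typo.
    """
    pairs = []
    for a, b in candidate_pairs(counts):
        rare, common = sorted((a, b), key=lambda w: counts[w])
        if counts[common] >= min_ratio * counts[rare]:
            pairs.append((rare, common))
    return pairs


# Toy frequency counts, invented for illustration only (not corpus data).
counts = Counter({
    "按部就班": 120, "按步就班": 4,
    "再接再厉": 200, "再接再励": 7,
    "计算机": 300, "计算器": 280,
})
print(build_typo_pairs(counts))
```

With these toy counts, the pair 计算机 / 计算器 is a candidate because the two strings differ by one character, but it is rejected by the frequency test because both forms are common, which is the behavior one would want for two valid words. A real system would need a validation model that also uses context, since raw frequency alone cannot separate a rare valid word from a typo.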

DOI 10.1145/3358695.3360941
Language English
Journal IEEE/WIC/ACM International Conference on Web Intelligence - Companion Volume
