Publication


Featured research published by Chenggang Mi.


NLPCC | 2014

Detection of Loan Words in Uyghur Texts

Chenggang Mi; Yating Yang; Lei Wang; Xiao Li; Kamali Dalielihan

For low-resource languages like Uyghur, data sparseness is a serious problem in information processing, especially in tasks based on parallel texts. To enrich bilingual resources, we detect Chinese and Russian loan words in Uyghur texts according to the phonetic similarities between a loan word and its corresponding donor-language word. In this paper, we propose a novel approach based on a perceptron model that treats the detection of loan words in Uyghur as a classification problem. The experimental results show that our method detects Chinese and Russian loan words in Uyghur texts effectively.


Journal of Computers | 2014

A Phrase Table Filtering Model Based on Binary Classification for Uyghur-Chinese Machine Translation

Chenggang Mi; Yating Yang; Xi Zhou; Lei Wang; Xiao Li; Eziz Tursun

In statistical machine translation, a large number of unreasonable phrase pairs in a phrase table can hurt decoding efficiency and overall translation performance, especially in Uyghur-Chinese machine translation. In this paper, we present a novel phrase table filtering model based on binary classification, which takes the differences between Uyghur and Chinese into account and borrows from binary classification in machine learning. Our model considers four features: 1) the difference in length between the source and target phrases; 2) the proportion of translated words in a phrase pair; 3) the proportion of symbol words; 4) the average number of co-occurring words in the training corpus. We use this model to generate a filtered phrase table. Experimental results show that the new filtering model improves the performance and efficiency of our current Uyghur-Chinese machine translation system.
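The four features listed in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `phrase_pair_features`, the `lexicon` and `cooc` data structures, and the exact definition of each feature (for example, treating a "symbol word" as a token made up only of ASCII punctuation) are assumptions made for illustration.

```python
import string


def phrase_pair_features(src, tgt, lexicon, cooc):
    """Compute four illustrative features for one phrase pair.

    src, tgt: lists of tokens for the source and target phrase.
    lexicon:  hypothetical dict mapping a source word to a set of target words.
    cooc:     hypothetical dict mapping (src_word, tgt_word) to a corpus count.
    """
    # 1) Difference in length between source and target phrase.
    length_diff = abs(len(src) - len(tgt))

    # 2) Proportion of source words that have a known translation in the target.
    tgt_set = set(tgt)
    translated = sum(1 for s in src if lexicon.get(s, set()) & tgt_set)
    translated_ratio = translated / len(src) if src else 0.0

    # 3) Proportion of tokens that consist only of punctuation (assumed definition).
    tokens = src + tgt
    symbols = sum(
        1 for t in tokens if t and all(ch in string.punctuation for ch in t)
    )
    symbol_ratio = symbols / len(tokens) if tokens else 0.0

    # 4) Average co-occurrence count over all source-target word pairs.
    pairs = [(s, t) for s in src for t in tgt]
    avg_cooc = sum(cooc.get(p, 0) for p in pairs) / len(pairs) if pairs else 0.0

    return [length_diff, translated_ratio, symbol_ratio, avg_cooc]


# Toy usage with placeholder tokens (not real corpus data).
features = phrase_pair_features(
    ["men", "kitab", "."],
    ["wo", "shu"],
    lexicon={"kitab": {"shu"}},
    cooc={("kitab", "shu"): 4},
)
```

A feature vector of this kind could then be passed to any binary classifier that decides whether to keep or discard the phrase pair; the abstract does not say which classifier the authors used.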


China Workshop on Machine Translation | 2014

Character Tagging-Based Word Segmentation for Uyghur

Yating Yang; Chenggang Mi; Bo Ma; Rui Dong; Lei Wang; Xiao Li

To obtain the information in Uyghur words effectively, we present a novel character-tagging method for Uyghur word segmentation. We define five labels for the characters in a Uyghur word (Su, Bu, Iu, Eu and Au) and treat word segmentation as a sequence labeling problem, using Conditional Random Fields (CRFs) as the basic labeling model. Experiments show that our method captures more features of Uyghur words and therefore significantly outperforms several traditionally used word segmentation models.


Recent Advances in Natural Language Processing | 2017

Log-linear Models for Uyghur Segmentation in Spoken Language Translation

Chenggang Mi; Yating Yang; Rui Dong; Xi Zhou; Lei Wang; Xiao Li; Tonghai Jiang

To alleviate data sparsity in spoken Uyghur machine translation, we propose a log-linear morphological segmentation approach. Instead of learning a model only from a monolingual annotated corpus, this approach optimizes Uyghur segmentation for spoken translation using both bilingual and monolingual corpora. Our approach relies on several features, such as a traditional conditional random field (CRF) feature, a bilingual word alignment feature and a monolingual suffix-word co-occurrence feature. Experimental results show that our proposed segmentation model for Uyghur spoken translation achieves an improvement of 1.6 BLEU points over the state-of-the-art baseline.
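For orientation, the general form of a log-linear model is given below. The notation is an assumption, not taken from the paper: $w$ is a Uyghur word, $s$ a candidate segmentation, $h_k$ the feature functions (which would include the CRF, word alignment and suffix-word co-occurrence features named in the abstract), and $\lambda_k$ their weights.

```latex
\[
P(s \mid w) =
  \frac{\exp\left(\sum_{k=1}^{K} \lambda_k h_k(s, w)\right)}
       {\sum_{s'} \exp\left(\sum_{k=1}^{K} \lambda_k h_k(s', w)\right)},
\qquad
\hat{s} = \arg\max_{s} \sum_{k=1}^{K} \lambda_k h_k(s, w)
\]
```

Because the denominator does not depend on $s$, choosing the best segmentation only requires the weighted feature sum on the right.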


National CCF Conference on Natural Language Processing and Chinese Computing | 2017

Learning Bilingual Lexicon for Low-Resource Language Pairs

ShaoLin Zhu; Xiao Li; Yating Yang; Lei Wang; Chenggang Mi

Learning a bilingual lexicon from monolingual data is a novel idea in natural language processing that can benefit many low-resource language pairs. In this paper, we present an approach for obtaining a bilingual lexicon from monolingual data. Our method requires only a small seed bilingual lexicon, and we use Canonical Correlation Analysis to construct a shared latent space that explains how the two monolingual embeddings are linked. Experimental results show that a bilingual lexicon of considerable precision and size can be learned from Chinese-Uyghur and Chinese-Kazakh monolingual data.
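As background, the standard Canonical Correlation Analysis objective is shown below. The notation is an assumption: $X$ and $Y$ stand for the embeddings of the seed-lexicon words in the two languages, $\Sigma_{XX}$ and $\Sigma_{YY}$ for their covariance matrices, and $\Sigma_{XY}$ for their cross-covariance. The abstract does not describe how the authors apply CCA in detail.

```latex
\[
(a^{*}, b^{*}) = \arg\max_{a,\, b}
  \frac{a^{\top} \Sigma_{XY}\, b}
       {\sqrt{a^{\top} \Sigma_{XX}\, a}\; \sqrt{b^{\top} \Sigma_{YY}\, b}}
\]
```

The projection directions $a^{*}$ and $b^{*}$ map both embedding spaces into a shared space in which the seed translation pairs are maximally correlated.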


China Workshop on Machine Translation | 2017

A Content-Based Neural Reordering Model for Statistical Machine Translation

Yirong Pan; Xiao Li; Yating Yang; Chenggang Mi; Rui Dong; Wenxiao Zeng

Phrase-based lexicalized reordering models have attracted extensive interest in statistical machine translation (SMT) because of their capacity for handling swaps between consecutive phrases. However, translation between two languages with significant differences in syntactic structure makes it challenging to generate a semantically and syntactically correct word sequence. To alleviate this problem, we propose a novel content-based neural reordering model that estimates reordering probabilities from the words in the surrounding context. We first use a simple convolutional neural network (CNN) to capture semantic content conditioned on context windows of various sizes. We then employ a softmax layer to predict the reordering orientations and their probability distributions. Experimental results show that our model provides statistically significant improvements for both Chinese-Uyghur (+0.48 on CWMT2015) and Chinese-English (+0.27 on CWMT2013) translation tasks over conventional lexicalized reordering models.


CCL | 2017

Harvest Uyghur-Chinese Aligned-Sentences Bitexts from Multilingual Sites Based on Word Embedding

ShaoLin Zhu; Xiao Li; Yating Yang; Lei Wang; Chenggang Mi

Obtaining bilingual parallel data from multilingual websites is a long-standing research problem, and solving it is of great benefit to resource-scarce languages. In this paper, we present an approach for obtaining parallel data based on word embeddings; our model relies only on a small bilingual lexicon. Our approach benefits from recent advances in continuous word representations, which can reveal more context information than traditional methods. Our experiments show that high-precision and sizable parallel Uyghur-Chinese data can be obtained even with only a limited bilingual lexicon.


Applications of Natural Language to Data Bases | 2015

Optimized Uyghur Segmentation for Statistical Machine Translation

Chenggang Mi; Yating Yang; Rui Dong; Xi Zhou; Lei Wang; Xiao Li; Tonghai Jiang; Turghun Osman

In this paper, we propose an optimized method for segmenting Uyghur words. We treat the optimization as a classification problem, with features extracted from a Uyghur-Chinese bilingual corpus. Experimental results show that our method significantly improves the performance of Uyghur-Chinese machine translation.


The Open Automation and Control Systems Journal | 2014

Co-occurrence Degree Based Word Alignment in Statistical Machine Translation

Chenggang Mi; Yating Yang; Lei Wang; Xiao Li

To alleviate the data sparseness problem during word alignment, we propose a word alignment method based on word co-occurrence degree, together with a new way of obtaining statistical information from word co-occurrence. We combine the co-occurrence counts and the fuzzy co-occurrence weights into a word co-occurrence degree. The fuzzy co-occurrence weights are obtained by searching for fuzzy co-occurrence word pairs and computing the differences in length between the current word and the other words in those pairs. Experiments show that both the quality of word alignment and the translation performance improve.


CCL | 2014

Co-occurrence Degree Based Word Alignment: A Case Study on Uyghur-Chinese

Chenggang Mi; Yating Yang; Xi Zhou; Xiao Li; Turghun Osman

Most widely used word alignment models are based on word co-occurrence counts in a parallel corpus. However, because of data sparseness during training, the co-occurrence counts in a Uyghur-Chinese parallel corpus cannot effectively indicate associations between source and target words. In this paper, we propose a Uyghur-Chinese word alignment method based on word co-occurrence degree to alleviate the data sparseness problem. Our approach combines the co-occurrence counts and the fuzzy co-occurrence weights into a word co-occurrence degree. The fuzzy co-occurrence weights are obtained by searching for fuzzy co-occurrence word pairs and computing the differences in length between the current Uyghur word and the other Uyghur words in those pairs. Experiments show that the co-occurrence degree based model outperforms the baseline word alignment model and also improves the quality of Uyghur-Chinese machine translation.

Collaboration


Dive into Chenggang Mi's collaboration.

Top Co-Authors

Yating Yang
Chinese Academy of Sciences

Lei Wang
Chinese Academy of Sciences

Xiao Li
Chinese Academy of Sciences

Xi Zhou
Chinese Academy of Sciences

Tonghai Jiang
Chinese Academy of Sciences

Rui Dong
Chinese Academy of Sciences

ShaoLin Zhu
Chinese Academy of Sciences

Turghun Osman
Chinese Academy of Sciences

Bo Ma
Chinese Academy of Sciences

Eziz Tursun
Chinese Academy of Sciences