Nguyen Le Minh
Japan Advanced Institute of Science and Technology
Publication
Featured researches published by Nguyen Le Minh.
ACM Transactions on Asian Language Information Processing | 2013
Ngo Xuan Bach; Nguyen Le Minh; Tran Thi Oanh; Akira Shimazu
Analyzing logical structures of texts is important to understanding natural language, especially in the legal domain, where legal texts have their own specific characteristics. Recognizing logical structures in legal texts not only helps people understand legal documents, but also supports other tasks in legal text processing. In this article, we present a new task, learning logical structures of paragraphs in legal articles, which is studied in research on Legal Engineering. The goals of this task are to recognize the logical parts of law sentences in a paragraph, and then to group related logical parts into logical structures, formulas that describe the logical relations between parts. We present a two-phase framework to learn logical structures of paragraphs in legal articles. In the first phase, we model the recognition of logical parts in law sentences as a multi-layer sequence learning problem and present a CRF-based model to solve it. In the second phase, we propose a graph-based method to group logical parts into logical structures. We cast this as the problem of finding a subset of complete subgraphs in an edge-weighted complete graph, where each node corresponds to a logical part and each complete subgraph corresponds to a logical structure, and we present an integer linear programming formulation for this optimization problem. Our models achieve 74.37% in recognizing logical parts, 80.08% in recognizing logical structures, and 58.36% on the whole task on the Japanese National Pension Law corpus. These are promising results for further research on this task.
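The grouping step described above can be illustrated by a brute-force search over partitions of the logical parts, where each group is scored by the weights of the edges inside it. This is only a minimal stand-in for the paper's integer linear programming formulation, with made-up node names and edge weights; positive weights mean two parts likely belong to the same logical structure, negative weights that they do not.

```python
from itertools import combinations

def partitions(items):
    """Yield every way of partitioning `items` into disjoint groups
    (Bell-number many, so only feasible for short paragraphs)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            # put `first` into an existing group
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part  # or start a new group with `first`

def best_grouping(nodes, weight):
    """Choose the partition whose groups (complete subgraphs) maximize
    the total weight of edges inside groups."""
    def score(part):
        return sum(weight[frozenset(pair)]
                   for group in part
                   for pair in combinations(group, 2))
    return max(partitions(nodes), key=score)

# Toy weights for four hypothetical logical parts A-D:
# A/B and C/D are strongly related, all other pairs are not.
w = {frozenset(p): -0.2 for p in combinations("ABCD", 2)}
w[frozenset("AB")] = 0.9
w[frozenset("CD")] = 0.8
print(best_grouping(list("ABCD"), w))
```

The ILP in the paper solves the same objective exactly at scale; the exhaustive search here only makes the objective concrete.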
Expert Systems With Applications | 2014
Ngo Xuan Bach; Nguyen Le Minh; Akira Shimazu
Previous work on paraphrase identification using sentence similarities has not exploited discourse structures, which have been shown to be important information for paraphrase computation. In this paper, we propose a new method, named EDU-based similarity, to compute the similarity between two sentences based on elementary discourse units. Unlike conventional methods, which directly compute similarities between sentences, our method divides sentences into discourse units and employs them to compute similarities. We also show the relation between paraphrases and discourse units, which plays an important role in paraphrasing. We apply our method to the paraphrase identification task. Experimental results on the PAN corpus, a large corpus for detecting paraphrases, show the effectiveness of using discourse information for identifying paraphrases. We achieve 93.1% and 93.4% accuracy, respectively, using a single SVM classifier and a maximal voting model.
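The idea of comparing discourse units rather than whole sentences can be sketched as follows. The splitter below is a crude punctuation-and-connective heuristic standing in for a real discourse segmenter, and Jaccard overlap stands in for the paper's similarity features; both are assumptions for illustration only.

```python
import re

def split_edus(sentence):
    """Very rough stand-in for a discourse segmenter: split on punctuation
    and a few connectives. A real system would use a trained EDU segmenter."""
    parts = re.split(r",|;| because | although | while | which ", sentence.lower())
    return [set(p.split()) for p in parts if p.strip()]

def jaccard(a, b):
    """Word-overlap similarity between two EDUs represented as word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def edu_similarity(s1, s2):
    """For each EDU, take its best-matching EDU in the other sentence;
    average the two directions so the measure is symmetric."""
    e1, e2 = split_edus(s1), split_edus(s2)
    best1 = sum(max(jaccard(u, v) for v in e2) for u in e1) / len(e1)
    best2 = sum(max(jaccard(v, u) for u in e1) for v in e2) / len(e2)
    return (best1 + best2) / 2

a = "The law was amended, because the old rule was unclear"
b = "Because the old rule was unclear, the law was amended"
print(round(edu_similarity(a, b), 2))
```

Because the clauses are matched independently of their order, the reordered paraphrase above still scores highly, which whole-sentence n-gram overlap would penalize.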
Knowledge and Systems Engineering | 2012
Bui Thanh Hung; Nguyen Le Minh; Akira Shimazu
Translation quality is often disappointing when a phrase-based machine translation system deals with long sentences. Because of the syntactic structure discrepancy between the two languages, the translation output will not preserve the same word order as the source. When a sentence is long, it should be partitioned into several clauses, and word reordering in the translation should be done within clauses, not across them. In this paper, a rule-based technique is proposed to split long Vietnamese sentences based on linguistic information. We use the splitting boundaries to translate sentences with two types of constraints: wall and zone. This method is useful for preserving word order and improving translation quality. We describe experiments on translation from Vietnamese to English, showing improvements in BLEU and NIST scores.
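The wall and zone constraints correspond to the XML markup that phrase-based decoders such as Moses accept: reordering is confined inside a `<zone>`, and no phrase may reorder across a `<wall/>`. Assuming the clauses have already been split by the rule-based step (the Vietnamese clauses below are invented placeholders), the markup itself can be generated as:

```python
def mark_clauses(clauses):
    """Wrap each pre-split clause in a zone (reordering stays inside it)
    and separate clauses with walls (no reordering across the boundary)."""
    return " <wall/> ".join("<zone> " + c + " </zone>" for c in clauses)

# Hypothetical output of the rule-based clause splitter for one long sentence.
clauses = ["toi da doc cuon sach", "ma ban tang toi"]
print(mark_clauses(clauses))
```

The decoder then reorders words only within each clause, which is exactly the constraint the abstract motivates.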
Procedia Computer Science | 2013
Ngo Xuan Bach; Kunihiko Hiraishi; Nguyen Le Minh; Akira Shimazu
Part-of-speech (POS) tagging is a fundamental task in natural language processing (NLP). It provides useful information for many other NLP tasks, including word sense disambiguation, text chunking, named entity recognition, syntactic parsing, semantic role labeling, and semantic parsing. In this paper, we present a new method for Vietnamese POS tagging using dual decomposition. We show how dual decomposition can be used to integrate a word-based model and a syllable-based model to yield a more powerful model for tagging Vietnamese sentences. We also describe experiments on the Viet Treebank corpus, a large annotated corpus for Vietnamese POS tagging. Experimental results show that our model using dual decomposition outperforms both word-based and syllable-based models.
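The mechanics of combining the two models can be sketched with the standard subgradient form of dual decomposition. The toy sub-models below score each position independently (real taggers would run Viterbi over transition scores), and all scores and tags are invented for illustration; only the multiplier-update loop reflects the general technique.

```python
def dd_tag(word_scores, syl_scores, tags, iters=100):
    """Dual decomposition: each sub-model decodes on its own scores adjusted
    by Lagrange multipliers u; u is nudged until the two taggings agree."""
    n = len(word_scores)
    u = [{t: 0.0 for t in tags} for _ in range(n)]
    for k in range(iters):
        step = 0.5 / (k + 1)  # decreasing step size, a common schedule
        y = [max(tags, key=lambda t: word_scores[i][t] + u[i][t]) for i in range(n)]
        z = [max(tags, key=lambda t: syl_scores[i][t] - u[i][t]) for i in range(n)]
        if y == z:
            return y  # agreement certifies an exact joint solution
        for i in range(n):
            if y[i] != z[i]:
                u[i][y[i]] -= step
                u[i][z[i]] += step
    return y  # no agreement within the budget: fall back to word-based decode

# Toy two-word sentence: the models disagree on position 1 until the
# multipliers pull them to the joint optimum.
tags = ["N", "V"]
word = [{"N": 1.0, "V": 0.2}, {"N": 0.4, "V": 0.6}]   # word-based model
syl  = [{"N": 0.9, "V": 0.3}, {"N": 1.0, "V": 0.3}]   # syllable-based model
print(dd_tag(word, syl, tags))
```

In the toy run, the word-based model initially prefers V at position 1 while the syllable-based model prefers N; the multiplier updates shift both toward the jointly best tagging.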
International Conference on NLP | 2012
Ngo Xuan Bach; Nguyen Le Minh; Akira Shimazu
This paper presents UDRST, an unlabeled discourse parsing system in the RST framework. UDRST consists of a segmentation model and a parsing model. The segmentation model exploits subtree features to rerank the N-best outputs of a base segmenter, which uses syntactic and lexical features in a CRF framework. In the parsing model, we present two algorithms for building a discourse tree from a segmented text: an incremental algorithm and a dual decomposition algorithm. Our system achieves 77.3% unlabeled score on the standard test set of the RST Discourse Treebank corpus, a 5.0% improvement over HILDA [6], a state-of-the-art discourse parsing system.
Applications of Natural Language to Data Bases | 2013
Ngo Xuan Bach; Nguyen Le Minh; Akira Shimazu
We propose a new method to compute the similarity between two sentences based on elementary discourse units, EDU-based similarity. Unlike conventional methods, which directly compute similarities based on sentences, our method divides sentences into discourse units and uses them to compute similarities. We also show the relation between paraphrases and discourse units, which plays an important role in paraphrasing. We apply our method to the paraphrase identification task. By using only a single SVM classifier, we achieve 93.1% accuracy on the PAN corpus, a large corpus for detecting paraphrases.
Knowledge, Information, and Creativity Support Systems | 2012
Bui Thanh Hung; Nguyen Le Minh; Akira Shimazu
Translating legal text is generally considered difficult, because legal text has characteristics that distinguish it from everyday documents and its sentences are usually long and complicated. To boost legal text translation quality, splitting the input sentence becomes necessary. In this paper, we propose a novel method for dividing and translating legal text based on the logical structure of legal sentences. We use a statistical learning method, Conditional Random Fields (CRFs), with rich linguistic information to recognize the logical structure of a legal sentence, and then use the recognized structure to divide the sentence. By doing so, translation quality improves. Our experiments show that our approach achieves better results for both Japanese-English and English-Japanese legal text translation, as measured by BLEU, NIST, and TER scores.
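Once the CRF has labeled each token with a logical-part tag, dividing the sentence amounts to cutting wherever the tag changes. The sketch below assumes hypothetical tags "R" (requisite) and "E" (effectuation) and an invented English example; the actual tag set and tokenization belong to the authors' pipeline.

```python
def split_by_structure(tokens, labels):
    """Group consecutive tokens that share a logical-part label into
    segments, each of which can then be translated separately."""
    segments, cur, cur_label = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab != cur_label and cur:
            segments.append((cur_label, " ".join(cur)))
            cur = []
        cur.append(tok)
        cur_label = lab
    if cur:
        segments.append((cur_label, " ".join(cur)))
    return segments

# Hypothetical CRF output for a legal-style sentence.
toks = "if the claim is valid the agency shall pay".split()
labs = ["R"] * 5 + ["E"] * 4
print(split_by_structure(toks, labs))
```

Each segment is then short and internally coherent, which is what lets the downstream translator keep word order within a logical part.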
International Conference on Asian Language Processing | 2011
Bui Thanh Hung; Nguyen Le Minh; Akira Shimazu
This paper presents an approach to selecting appropriate translation rules to improve phrase reordering in tree-based statistical machine translation. We propose new features with rich linguistic and contextual information. We give a new algorithm to extract these features, use a maximum entropy model to combine the rich linguistic and contextual information, and integrate the features into the tree-based SMT model (Moses-chart). We obtain substantial improvements in performance for tree-based translation from Vietnamese to English.
International Conference of the Pacific Association for Computational Linguistics | 2017
Dac-Viet Lai; Nguyen Truong Son; Nguyen Le Minh
We propose a combined model of an enhanced Bidirectional Long Short-Term Memory (Bi-LSTM) network and well-known classifiers such as Conditional Random Fields (CRF) and Support Vector Machines (SVM) for sentence compression, in which the LSTM network works as a feature extractor. The task is to classify each word into one of two categories: to be retained or to be removed. Facing the lack of reliable feature-engineering techniques in many languages, we employ readily obtainable word embeddings as the only feature. Our models are trained and evaluated on public English and Vietnamese data sets, showing state-of-the-art performance.
Knowledge and Systems Engineering | 2015
Vu Xuan Tung; Nguyen Le Minh; Duc Tam Hoang
Based on a framework for English, we developed a Vietnamese question answering system. The learning paradigm in the framework reduces the burden of providing supervision during semantic parsing. While taking advantage of this mechanism, we further design our own feature computation suitable for Vietnamese. A method of dynamic learning for feature computation is also presented in this work.