Binary Tree based Chinese Word Segmentation

Abstract

Chinese word segmentation is a fundamental task in Chinese language processing, and the granularity mismatch problem is the main cause of its errors. This paper shows that a binary tree representation can store segmentation outputs of different granularity, and designs a binary tree based framework to overcome the granularity mismatch problem. The framework has two steps, tree building and tree pruning; the tree pruning step is designed specifically to address the granularity problem. Previous approaches to Chinese word segmentation, such as sequence tagging, can easily be employed in this framework, which also provides quantitative error analysis methods. Experiments show that using a more sophisticated tree pruning function on top of a state-of-the-art conditional random field based baseline reduces the errors by up to 20%.


ZHANG Kaixu, WANG Can, SUN Maosong
State Key Lab of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, P. R. China


Key words: natural language processing; Chinese word segmentation; binary tree; support vector machine

Introduction

Each Chinese word consists of one or more characters, but sentences contain no delimiters between characters to indicate word boundaries. Since words are the basic units for many natural language processing tasks, Chinese word segmentation (CWS) is considered a fundamental task for Chinese language processing. Languages such as Japanese, Thai and Vietnamese have similar problems.

The state-of-the-art methods treat CWS as a character sequence tagging task, similar to POS tagging: a tag indicates the position of the corresponding character within its word. We point out that these methods suffer from a problem called granularity mismatch. In CWS, this means that the granularity of the output is hard to match perfectly with the granularity of the gold standard (the correct result). Li and Sun [5] claim that performance increases when this problem is absent.

For example, the string 老树 (old tree) in the MSR corpus of the SIGHAN bake-off 2005 [1] is considered two words, 老 (old) and 树 (tree), whereas the string 老人 (old people) in the same corpus is a single multi-character word consisting of two morphemes, 老 (old) and 人 (people). Similar examples are common in any Chinese corpus and cause most of the errors of CWS models.

Received: 2011-04-

The explanation of this language phenomenon is that the boundary between Chinese morphology and syntax is not clear. Chinese morphology and syntax share rules and even units (many morphemes such as “老” can also be used as free words). It is sometimes hard to determine whether a structure is morphological or syntactic, yet a CWS model has to do so. Historically, a large number of multi-character words in modern Chinese were phrases in ancient Chinese.

In order to represent the structures of the unclear part between morphology and syntax, and to overcome the granularity mismatch problem, we propose the binary tree representation and a binary tree based CWS framework. In this framework, CWS is divided into two steps, tree building and tree pruning. Fig. 2 shows the data flow in this framework. In Step 1, the raw Chinese sentence is parsed into a binary tree (see Fig. 4 for an example) by a simple tree building function. After this step, “老树” and “老人” have similar binary trees:

Fig. 1

In the tree building step, there is no need to determine whether the structures are morphological or syntactic. Step 2 is designed to focus on the granularity problem: the tree is pruned by a tree pruning function, and the leaves of the pruned tree (see Fig. 6 for an example) form the output.

This binary tree based framework has several benefits. First, it is a simple framework that can employ many previous CWS methods, such as dictionary based methods, association measure based methods, and sequence tagging based methods. Although we build trees for sentences, the training data does not need any extra manual annotation. We describe the two steps of the framework in Section 2.

Second, it provides quantitative error analysis methods, which are described in Section 3. This analysis shows that the granularity mismatch problem is the primary cause of errors for both “mono corpus” CWS and cross-corpus CWS. The definitions of a Chinese word are not consistent between corpora, so performance drops considerably in cross-corpus CWS (i.e., training the CWS model on one corpus but testing it on another); this is also a research issue for CWS.

The tree pruning step in our framework provides a way to focus on the granularity problem, and more sophisticated methods can be employed in this step. We illustrate this idea in Section 4 by using an SVM-based model for tree pruning. The experiments in Section 5 show that the errors are reduced by up to 20% compared with the state-of-the-art CRF-based baseline.

Fig. 2

The data flow of the binary tree based CWS framework

Dictionary matching algorithms such as the forward maximum matching algorithm and the backward maximum matching algorithm are greedy algorithms. The forward maximum matching algorithm finds the longest word in the dictionary that the input sentence starts with, and repeats this matching recursively on the rest of the sentence. The backward maximum matching algorithm follows a similar process starting from the end of the sentence. These algorithms fail if the sentence contains any out-of-vocabulary (OOV) words. (Words that do not appear in the training data or in the dictionary used are called OOV words; all other words are called IV words.)

Association measures such as pointwise mutual information (PMI) and the $t$-test are used for CWS [8]. These methods treat Chinese words as “character collocations” and use collocation extraction methods to find them.

Xue \shortcite{xue:2003_j} proposed a character sequence tagging framework similar to POS tagging. In this framework, the input is a raw Chinese sentence $\boldsymbol{s}$, which can be seen as a sequence of characters $c_i$:

$$\boldsymbol{s} = c_1 \cdots c_n$$

The output of the character sequence tagging is a sequence $\boldsymbol{o}$ of labels $t_i$ corresponding to the input characters:

$$\boldsymbol{o} = t_1 \cdots t_n$$

where $t_i \in \{\textsf{B}, \textsf{M}, \textsf{E}, \textsf{S}\}$. The tag B / M / E indicates that the corresponding character is at the beginning / middle / end of a multi-character word. The tag S indicates that the corresponding character is a single-character word. For example, if the gold standard result for the input “材料利用率高” is “材料 利用率 高”, the correct tag sequence is $\textsf{B~E~B~M~E~S}$. In the MSR corpus, “老树” is tagged as $\textsf{S~S}$ and “老人” is tagged as $\textsf{B~E}$.

This character sequence tagging framework can be implemented by a CRF model [7], a perceptron [2] or other models. Some reranking methods [3] have been proposed to adjust the output.
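
The B/M/E/S encoding described above can be illustrated with a minimal Python sketch. The function names `words_to_tags` and `tags_to_words` are hypothetical and are not taken from any of the cited systems; the sketch only shows how a segmentation and a tag sequence determine each other.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(sentence: str, tags: List[str]) -> List[str]:
    """Rebuild the words from a raw sentence and its tag sequence."""
    words, current = [], ""
    for char, tag in zip(sentence, tags):
        current += char
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends on B or M
        words.append(current)
    return words


gold = ["材料", "利用率", "高"]
tags = words_to_tags(gold)
print(" ".join(tags))                      # B E B M E S
print(tags_to_words("".join(gold), tags))  # ['材料', '利用率', '高']
```
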
The sequence tagging methods are considered able to identify OOV words and to make use of existing dictionaries [4]. We will show that they still suffer from the granularity mismatch problem.

Some tree based methods for Chinese word segmentation have been proposed \cite{zhao_character-level_2009} [6] to represent the morphological or syntactic structure for better CWS. However, these methods are hard to apply because they need training data with extra manual annotation.

The first step of our framework is to build a binary tree for an input sentence $c_1 \cdots c_n$. The process can be based on a single function $b(c_i)$, which gives the confidence that there is a word boundary between $c_i$ and $c_{i+1}$. The function $b(c)$ can be derived from various previous works. For the PMI-based method [8], which is an association measure based method, we can define the function $b_{\mathrm{PMI}}(c)$ as:

$$b_{\mathrm{PMI}}(c_i) = \log \frac{P(c_i c_{i+1})}{P(c_i) P(c_{i+1})}$$

For the CRF-based method, we define the function $b_{\mathrm{CRF}}(c)$ as the marginal probability:

$$b_{\mathrm{CRF}}(c_i) = P(t_i = \textsf{S} \lor t_i = \textsf{E} \mid \boldsymbol{s})$$

The algorithm for building the binary tree from this function is described as pseudo code.

Fig. 3

The tree building function recursively splits the sequence $c_i \cdots c_j$ at the character $c_m$ ($i \le m < j$) that has the maximum $b(c)$, until all the leaves are single characters. An example of the binary tree is shown in Fig. 4. Note that this is a tree for the entire sentence, but it is not a tree used to represent the morphological or syntactic structure. It represents the structures of the unclear part between Chinese morphology and syntax.

Fig. 4

A binary tree for the phrase 材料利用率高 (the material utilization rate is high) in the SIGHAN bake-off 2005 MSR corpus
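
The recursive splitting can be sketched in Python as follows. This is a minimal illustration and not the paper's own pseudo code: the `Node` class, the 0-based half-open spans, and the boundary confidences in `b` are assumptions made for the example, with `b[k]` standing for the confidence of a boundary between characters `k` and `k + 1`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    start: int                      # index of the first character (inclusive)
    end: int                        # index after the last character (exclusive)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(b: List[float], start: int, end: int) -> Node:
    """Recursively split characters start..end-1 at the most likely boundary.

    b[k] is the confidence of a word boundary between characters k and k+1.
    """
    node = Node(start, end)
    if end - start > 1:
        # candidate split points lie strictly inside the span
        m = max(range(start, end - 1), key=lambda k: b[k])
        node.left = build_tree(b, start, m + 1)
        node.right = build_tree(b, m + 1, end)
    return node


def show(node: Node, sentence: str) -> str:
    """Render the tree as a bracketed string."""
    if node.left is None:
        return sentence[node.start:node.end]
    return "(" + show(node.left, sentence) + " " + show(node.right, sentence) + ")"


sentence = "材料利用率高"
# Hypothetical boundary confidences for the five gaps between the six characters.
b = [0.05, 0.60, 0.02, 0.30, 0.90]
tree = build_tree(b, 0, len(sentence))
print(show(tree, sentence))  # (((材 料) ((利 用) 率)) 高)
```

When two gaps have the same confidence, `max` picks the leftmost one; the text does not say how ties should be broken, so this choice is arbitrary.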

The second step is to prune the binary tree, which focuses on the granularity. The leaves of the pruned tree form the segmentation output. An example of a pruned binary tree is shown in Fig. 6.

The pruning can also be driven by a single function $p(c_i \ldots c_m, c_{m+1} \ldots c_j)$, where $c_i \ldots c_m$ and $c_{m+1} \ldots c_j$ are the roots of the two subtrees of the node $c_i \ldots c_j$. The function returns 1 if the subtrees of the node $c_i \ldots c_j$ should be pruned, and 0 otherwise. This binary function can be based on a threshold $t$ and the same $b(c)$ function used for the tree building:

$$p_{\mathrm{threshold}}^{t}(c_i \ldots c_m, c_{m+1} \ldots c_j) = \begin{cases} 1, & b(c_m) < t; \\ 0, & \text{otherwise.} \end{cases}$$

This is a trivial pruning function. When we set $t$ to 0.5 and $b(c)$ to $b_{\mathrm{CRF}}(c)$, the output is the same as the output of the original CRF-based method.

A word dictionary can easily be employed in the pruning step as another pruning function:

$$p_{\mathrm{dictionary}}(c_i \ldots c_m, c_{m+1} \ldots c_j) = \begin{cases} 1, & c_i \ldots c_j \text{ is in the dictionary}; \\ 0, & \text{otherwise.} \end{cases}$$

Note that the values of a particular pruning function may conflict. For example, for the binary tree in Fig. 4, we may have $p(\text{材料}, \text{利用率}) = 1$ while $p$ returns 0 for the two subtrees of 利用率. So, for the tree $T$ to be pruned, we could have a top-down tree pruning function TDTP, described as pseudo code, and a bottom-up tree pruning function BUTP.

Fig. 5
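
The two traversal orders can be sketched in Python as below. The paper's own pseudo code is not reproduced in this text, so this is only one plausible reading of TDTP and BUTP: top-down pruning stops at the first node whose subtrees $p$ prunes, and bottom-up pruning merges two subtrees only when both have already collapsed into single words. The function names, the `(start, m, end)` signature of the pruning function, the confidences in `b` and the `lexicon` are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]                   # [start, end) character offsets
Prune = Callable[[int, int, int], bool]  # p(start, m, end): prune below this node?


def split_point(b: List[float], start: int, end: int) -> int:
    """Tree building: index m of the most likely boundary inside the span."""
    return max(range(start, end - 1), key=lambda k: b[k])


def tdtp(b: List[float], p: Prune, start: int, end: int) -> List[Span]:
    """Top-down: descend from the root and stop at the first node p prunes."""
    if end - start == 1:
        return [(start, end)]
    m = split_point(b, start, end)
    if p(start, m, end):
        return [(start, end)]
    return tdtp(b, p, start, m + 1) + tdtp(b, p, m + 1, end)


def butp(b: List[float], p: Prune, start: int, end: int) -> List[Span]:
    """Bottom-up: merge two subtrees only if both already collapsed to one word."""
    if end - start == 1:
        return [(start, end)]
    m = split_point(b, start, end)
    left = butp(b, p, start, m + 1)
    right = butp(b, p, m + 1, end)
    if len(left) == 1 and len(right) == 1 and p(start, m, end):
        return [(start, end)]
    return left + right


sentence = "材料利用率高"
b = [0.05, 0.60, 0.02, 0.30, 0.90]       # hypothetical boundary confidences
lexicon = {"材料", "利用", "材料利用率"}    # hypothetical dictionary


def p_threshold(start: int, m: int, end: int) -> bool:
    return b[m] < 0.5


def p_dictionary(start: int, m: int, end: int) -> bool:
    return sentence[start:end] in lexicon


def words(spans: List[Span]) -> List[str]:
    return [sentence[s:e] for s, e in spans]


n = len(sentence)
print(words(tdtp(b, p_threshold, 0, n)))   # ['材料', '利用率', '高']
print(words(butp(b, p_threshold, 0, n)))   # ['材料', '利用率', '高']
print(words(tdtp(b, p_dictionary, 0, n)))  # ['材料利用率', '高']
print(words(butp(b, p_dictionary, 0, n)))  # ['材料', '利用', '率', '高']
```

With the threshold function both orders give the same output, because the split point of a node always has the largest confidence in its span. With the dictionary function the values conflict, and the two orders resolve the conflict differently: top-down favours the coarser word, bottom-up the finer ones.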

Fig. 6

A pruned tree for the binary tree in Fig. 4. The output for this pruned tree is “材料利用率高”

For previous work such as the CRF-based methods without the binary tree based framework, the output words can be directly determined by the function $b(c_i)$ and a threshold $t$:

Definition: The $t$-words of an input sentence $c_1 \ldots c_n$ are the substrings $c_i \ldots c_j$ such that

$$\min(b(c_{i-1}), b(c_j)) > t > \max(b(c_i), \cdots, b(c_{j-1}))$$

The inequalities in this definition mean that there is a word boundary between $c_i$ and $c_{i+1}$ if and only if $b(c_i) > t$. Different thresholds give different results. The greater the threshold, the more coarse-grained the segmentation result, which means there are fewer words in the output. The threshold $t$ can be seen as a parameter that controls the output granularity.

It is better to store all the results with different granularity (from different thresholds). We can use an altered definition:

Definition: The word candidates of an input sentence $c_1 \ldots c_n$ are the substrings $c_i \ldots c_j$ such that

$$\min(b(c_{i-1}), b(c_j)) > \max(b(c_i), \cdots, b(c_{j-1}))$$

The only difference is that there is no threshold $t$ in the inequalities. This definition is also natural: if the left and right boundaries of a string are more likely to be word boundaries than any character boundary inside the string, the string may be a word.

The relation between the binary tree and the word candidates of an input sentence can be described as a proposition:

Proposition: A string is a word candidate if and only if it is in the binary tree of the corresponding sentence.

This proposition shows that the binary tree is a suitable representation for storing all the word candidates with different granularity, which provides rich information for the tree pruning process to focus on the granularity problem.
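
The proposition can be checked on a small example with the Python sketch below. It enumerates the word candidates directly from the definition and compares them with the nodes of the binary tree. Two assumptions are made that the text does not state explicitly: the start and the end of the sentence are treated as certain boundaries (confidence $+\infty$), and all confidences are distinct. The values in `b` are hypothetical.

```python
import math
from typing import List, Set, Tuple


def tree_spans(b: List[float], start: int, end: int) -> Set[Tuple[int, int]]:
    """All nodes (as [start, end) spans) of the binary tree built from b."""
    spans = {(start, end)}
    if end - start > 1:
        m = max(range(start, end - 1), key=lambda k: b[k])
        spans |= tree_spans(b, start, m + 1) | tree_spans(b, m + 1, end)
    return spans


def word_candidates(b: List[float], n: int) -> Set[Tuple[int, int]]:
    """All spans whose outer boundaries beat every inner boundary."""
    def outer(k: int) -> float:
        # Assumption: the sentence start and end are certain boundaries.
        return math.inf if k < 0 or k >= n - 1 else b[k]

    found = set()
    for start in range(n):
        for end in range(start + 1, n + 1):
            inner = max(b[start:end - 1], default=-math.inf)
            if min(outer(start - 1), outer(end - 1)) > inner:
                found.add((start, end))
    return found


sentence = "材料利用率高"
b = [0.05, 0.60, 0.02, 0.30, 0.90]   # hypothetical, all values distinct
candidates = word_candidates(b, len(sentence))
assert candidates == tree_spans(b, 0, len(sentence))
print(sorted(sentence[s:e] for s, e in candidates if e - s > 1))
```

For a sentence of $n$ characters the tree has $2n - 1$ nodes, so the set of word candidates is small even though there are $n(n+1)/2$ substrings.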

Statistics-based CWS algorithms lack error analysis methods. In the SIGHAN bake-off, errors are only divided into IV word errors and OOV word errors. Here we propose a new way to classify the errors of methods such as the CRF-based ones, and a way to investigate the performance that could be reached without the granularity mismatch problem.

Granularity mismatch is an important cause of errors. Since binary trees maintain the results for all granularities, they are helpful for error analysis. If an error is caused only by granularity mismatch, the corresponding word in the gold standard can be found in the binary tree (it is a word candidate, as discussed in Section 2), although it is not in the output of the pruned tree. Otherwise, the corresponding word cannot be found in the binary tree. Following this idea, we divide the errors into tree errors and pruning errors, and the pruning errors can be caused by either over-pruning or less-pruning:

• Tree error. If a word in the gold standard cannot be found in the binary tree, it is called a tree error. This also means that the word is not in the output of the threshold-based pruning method for any threshold. This kind of error cannot be corrected in the tree pruning step of our framework.
• Pruning error. If a word in the gold standard can be found in the binary tree but is not in the output, it is called a pruning error. This kind of error is caused by granularity mismatch and could be corrected in the tree pruning step. These errors can be further divided into two subcategories:
  • Over-pruning error. If a word in the gold standard is pruned by the tree pruning function, it is called an over-pruning error. The segmentation granularity for this word is too coarse.
  • Less-pruning error. If a word in the gold standard and its children are not pruned by the tree pruning function, it is called a less-pruning error. The segmentation granularity for this word is too fine.

Both IV and OOV words may have tree errors and pruning errors. This is a new dimension for describing errors, in addition to the IV/OOV classification.

In order to estimate the upper bound of the performance of the tree pruning step, we define the oracle pruning function based on the gold standard:

$$p_{\mathrm{oracle}}(c_i \ldots c_m, c_{m+1} \ldots c_j) = \begin{cases} 1, & \text{there is no word boundary between } c_m \text{ and } c_{m+1} \text{ in the gold standard}; \\ 0, & \text{otherwise.} \end{cases}$$

This function can be seen as perfect pruning. With this pruning function, we can investigate the performance without the granularity mismatch problem for both “mono corpus” CWS and cross-corpus CWS.

Here we introduce a more sophisticated SVM-based function $p_{\mathrm{SVM}}$ for the tree pruning in our framework. A state-of-the-art CRF-based model is used as the tree building function.

We need two training sets. The training set $\mathbf{S}_{\mathrm{b}}$ is used to train a CRF-based CWS model. The training set $\mathbf{S}_{\mathrm{p}}$ is used to train the SVM-based tree pruning model.

We need to train an SVM model as the binary pruning function $p_{\mathrm{SVM}}()$. Sentences in $\mathbf{S}_{\mathrm{p}}$ are first parsed into binary trees by the trained CRF-based tree building function. Any input pair for $p(c_i \ldots c_m, c_{m+1} \ldots c_j)$ such that … 𝑚 ) > 0.05 is one on which the CRF model has less confidence, and is used as a sample to train the SVM model. The values of the oracle pruning function $p_{\mathrm{oracle}}()$ are used as the answers.

Some notation is needed to describe the features. For the pruning function $p_{\mathrm{SVM}}(c_i \ldots c_m, c_{m+1} \ldots c_j)$, we have $\boldsymbol{l} = c_i \ldots c_m$, $\boldsymbol{r} = c_{m+1} \ldots c_j$ and $\boldsymbol{m} = c_i \ldots c_j$. We also have $\boldsymbol{l}_{-1} = c_u \ldots c_{i-1}$ such that $b(c_{u-1}) > 0.5$ and $b(c_k) < 0.5$ for $k = u, \ldots, i-2$, which is the word on the left side of $\boldsymbol{l}$ according to the CRF-based tree building function. Similarly we have the string $\boldsymbol{r}_{+1}$.
The operator $\|\boldsymbol{s}\|$ returns the frequency of the string $\boldsymbol{s}$ in a corpus. The operator $\|\boldsymbol{s}\|_{\mathrm{tree}}$ returns the frequency of the string $\boldsymbol{s}$ in the binary trees. The features for the SVM are as follows:

• CRF-based features: The CRF-based features include the probability $P(t_i = \textsf{S} \lor t_i = \textsf{E} \mid \boldsymbol{s})$ and the marginal probabilities $P(t_k = t \mid \boldsymbol{s})$ for $k \in \{m, m+1\}$ and $t \in \{\textsf{B}, \textsf{M}, \textsf{E}, \textsf{S}\}$.
• Length-based features: Each character is one syllable in Chinese. Since syllable length is a factor in Chinese word formation, we use binary features to represent the length-based information for the strings $\boldsymbol{l}$ and $\boldsymbol{r}$.
• Dictionary-based features: We use the word unigrams and word bigrams in $\mathbf{S}_{\mathrm{b}}$ to construct a unigram dictionary and a bigram dictionary. We use binary features to represent whether the strings $\boldsymbol{l}$, $\boldsymbol{r}$ and $\boldsymbol{m}$ are IV or OOV, respectively. Similar features are used for the string pairs $(\boldsymbol{l}_{-1}, \boldsymbol{l})$, $(\boldsymbol{l}, \boldsymbol{r})$, $(\boldsymbol{r}, \boldsymbol{r}_{+1})$, $(\boldsymbol{l}_{-1}, \boldsymbol{m})$ and $(\boldsymbol{m}, \boldsymbol{r}_{+1})$.
• Association-measure-based features: These features are designed to capture the global information of the strings. We define a chi-like function to measure the association between two strings. For the string pair $(\boldsymbol{l}, \boldsymbol{r})$, we set $a = \|\boldsymbol{m}\|$, $a + b = \|\boldsymbol{l}\|$, $a + c = \|\boldsymbol{r}\|$ and $a + b + c + d$ to be the total number of characters. The frequencies are counted in the training sets and the test set. The feature value is $(ad - bc) / \big((a + b)(a + c)(b + d)(c + d)\big)$. Similar feature values can be calculated for the string pairs $(\boldsymbol{l}_{-1}, \boldsymbol{l})$, $(\boldsymbol{r}, \boldsymbol{r}_{+1})$, $(\boldsymbol{l}_{-1}, \boldsymbol{m})$ and $(\boldsymbol{m}, \boldsymbol{r}_{+1})$.
• Tree-based features: These features are designed to capture the local information of the strings in the context. The idea is that if a string appears frequently in a document, it is likely to be a word. The frequencies are counted in either $\mathbf{S}_{\mathrm{p}}$ or the test set. The feature values used are $\log \|\boldsymbol{m}\|_{\mathrm{tree}}$ and $\log \|\boldsymbol{m}\|_{\mathrm{tree}} - \max_{\boldsymbol{m}' \in \mathrm{pa}(\boldsymbol{m})} \log \|\boldsymbol{m}'\|_{\mathrm{tree}}$, where $\mathrm{pa}(\boldsymbol{m})$ is the set of all the parent nodes of $\boldsymbol{m}$ in the trees.

We use the CRF-based tree building function $b_{\mathrm{CRF}}()$ and the SVM-based tree pruning function $p_{\mathrm{SVM}}()$ for the test.

We use the four corpora of the SIGHAN bake-off 2005 [1] (AS, CityU, MSR and PKU) for our evaluation. They are free and have been widely used for evaluation in most of the previous work.
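
The association-measure-based feature listed above can be computed from three string frequencies and the corpus size. The Python sketch below follows the feature value as it is stated in the text; the function name, the argument names, the example counts and the handling of a zero denominator are assumptions made for the illustration.

```python
def association(freq_m: int, freq_l: int, freq_r: int, total: int) -> float:
    """Chi-like association between strings l and r, where m = l + r.

    a = ||m||, a + b = ||l||, a + c = ||r||, a + b + c + d = total characters.
    """
    a = freq_m
    b = freq_l - a
    c = freq_r - a
    d = total - a - b - c
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0  # assumption: treat degenerate counts as no association
    return (a * d - b * c) / denominator


# Hypothetical counts: l occurs 10 times, r 12 times, and l followed by r 8 times.
print(association(freq_m=8, freq_l=10, freq_r=12, total=1000) > 0)  # True
# When l and r co-occur exactly as often as chance predicts, the value is 0.
print(association(freq_m=2, freq_l=10, freq_r=20, total=100))       # 0.0
```
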
The measurements used for the evaluation are the precision $p$ (the number of correctly segmented words divided by the number of words in the output), the recall $r$ (the number of correctly segmented words divided by the number of words in the gold standard) and the F-measure $= 2pr/(p + r)$. In addition, the OOV rate is calculated for the analysis.

We use a state-of-the-art CRF-based method as the baseline, and to define the tree building function $b(c_i)$. The error analysis is also based on it. The CRF-based model is trained using the toolkit Pocket CRF. The feature templates used for character $c_i$ are:

Table 1

| Types | Templates |
|---|---|
| unigram | $c_{i-1} t_i$, $c_i t_i$, $c_{i+1} t_i$ |
| bigram | $c_{i-2} c_{i-1} t_i$, $c_{i-1} c_i t_i$, $c_i c_{i+1} t_i$, $c_{i+1} c_{i+2} t_i$ |
| transfer | $t_{i-1} t_i$ |

First, we investigate the numbers of different errors for the baseline method. The errors are not only divided into IV errors and OOV errors, but are also divided based on the binary trees as discussed in the previous section.
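
The tree-based error categories can be assigned mechanically once the binary tree, the gold standard and the output are available. The Python sketch below is an illustration under assumptions: words are represented as 0-based half-open spans, the function names are hypothetical, and the confidences and segmentations are invented for the example. A missed gold word that is not a tree node is a tree error; one that lies inside a larger output word is an over-pruning error; the remaining case, where the output splits it into smaller pieces, is a less-pruning error.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # [start, end) character offsets


def tree_spans(b: List[float], start: int, end: int) -> Set[Span]:
    """All nodes of the binary tree built by splitting at the largest b."""
    spans = {(start, end)}
    if end - start > 1:
        m = max(range(start, end - 1), key=lambda k: b[k])
        spans |= tree_spans(b, start, m + 1) | tree_spans(b, m + 1, end)
    return spans


def classify_errors(b: List[float], gold: List[Span],
                    output: List[Span]) -> Dict[Span, str]:
    """Label every gold word that is missing from the output."""
    nodes = tree_spans(b, 0, len(b) + 1)
    errors = {}
    for word in gold:
        if word in output:
            continue
        if word not in nodes:
            errors[word] = "tree error"
        elif any(s <= word[0] and word[1] <= e for s, e in output):
            errors[word] = "over-pruning"   # swallowed by a coarser output word
        else:
            errors[word] = "less-pruning"   # split into finer output words
    return errors


# Gold standard for 材料利用率高: 材料 / 利用率 / 高
gold = [(0, 2), (2, 5), (5, 6)]
b = [0.05, 0.60, 0.02, 0.30, 0.90]          # hypothetical confidences

print(classify_errors(b, gold, [(0, 5), (5, 6)]))
# {(0, 2): 'over-pruning', (2, 5): 'over-pruning'}
print(classify_errors(b, gold, [(0, 2), (2, 4), (4, 5), (5, 6)]))
# {(2, 5): 'less-pruning'}

b_bad = [0.05, 0.02, 0.60, 0.30, 0.90]      # the tree splits 材料利 / 用率
print(classify_errors(b_bad, gold, [(0, 3), (3, 5), (5, 6)]))
# {(0, 2): 'tree error', (2, 5): 'tree error'}
```
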

Table 2

Numbers of errors of different kinds for the CRF-based baseline on the SIGHAN bake-off 2005 corpora

| | AS IV | AS OOV | CityU IV | CityU OOV | MSR IV | MSR OOV | PKU IV | PKU OOV |
|---|---|---|---|---|---|---|---|---|
| tree error | 310 | 412 | 234 | 271 | 183 | 172 | 313 | 443 |
| pruning error: over-pruning | 2661 | 190 | 1166 | 420 | 2839 | 139 | 4287 | 227 |
| pruning error: less-pruning | 841 | 1039 | 92 | 263 | 268 | 593 | 309 | 590 |

The results are in Table 2. Among the four different corpora, the phenomena are similar. Tree errors are far fewer than pruning errors. This indicates that granularity mismatch is the primary cause of the errors. We can also see that the over-pruning errors mostly involve IV words, while the less-pruning errors mostly involve OOV words. This is due to the phenomenon that most OOV words consist of IV words, but not vice versa.

Pocket CRF: http://pocket-crf-1.sourceforge.net/

Then, with the help of the oracle pruning function, we can see what performance can be obtained without the granularity mismatch problem.

Table 3

The OOV rate and the F measures of the threshold based pruning and the oracle pruning for four corpora

| | AS | CityU | MSR | PKU |
|---|---|---|---|---|
| OOV rate (%) | 4.3 | 7.4 | 2.6 | 5.8 |
| F measure (%), CRF $p_{\mathrm{threshold}}$ | | | | |
| F measure (%), CRF $p_{\mathrm{oracle}}$+TDTP | 99.1 | 98.0 | 99.5 | 98.9 |
| F measure (%), CRF $p_{\mathrm{oracle}}$+BUTP | 99.1 | 98.1 | 99.5 | 98.9 |

The F measures of the threshold based pruning function (whose output is the original output of the CRF-based baseline) and of the oracle pruning function can be found in Table 3. These results show that the upper bound of the F-measure for the best pruning function is quite high, so a better pruning function is useful for improving the performance.

Then we investigate the performance of cross-corpus CWS. The MSR and PKU corpora are both in simplified Chinese. We trained a CRF-based model on the training set of the MSR corpus and tested it on the test set of the PKU corpus. This experiment is called ‘MSR to PKU’. A similar experiment, ‘PKU to MSR’, was also performed.

Table 4

The analysis of the performances for cross-corpus CWS

| | MSR to PKU precision | MSR to PKU recall | MSR to PKU F1 | PKU to MSR precision | PKU to MSR recall | PKU to MSR F1 |
|---|---|---|---|---|---|---|
| CRF | 87.7 | 82.5 | 85.0 | 84.1 | 87.5 | 85.8 |
| + $p_{\mathrm{oracle}}$+TDTP | 98.0 | 95.0 | 96.5 | 97.2 | 91.0 | 94.0 |
| + $p_{\mathrm{oracle}}$+BUTP | 95.7 | 97.9 | 96.8 | 94.1 | 97.1 | 95.5 |

The results for these two cross-corpus CWS experiments are in Table 4. Although the F measures of the threshold based pruning function (i.e. the original CRF-based model) are much poorer, the F measures of the oracle pruning function are still high. In particular, for ‘MSR to PKU’, more than 98% of the words in the gold standard can be found in the binary trees. The morphological and syntactic structures are the same across corpora. The drop in performance in the cross-corpus experiments is caused by a worse granularity mismatch problem.

In order to compare with previous work based on the same training sets of the SIGHAN bake-off 2005 and to avoid using any other resources, we divided the original training set into two parts. Nine tenths are used as the training set $\mathbf{S}_{\mathrm{b}}$ to train the CRF model for the tree building, while the remaining one tenth is used as the training set $\mathbf{S}_{\mathrm{p}}$ to train the SVM model for the tree pruning. We use LibSVM for training and testing the SVM-based model, with all the default parameters. The features are described in Section 4.

The results are in Table 5. The first and second rows show the F measures of the original CRF-based baseline with 100% and 90% of the training set, respectively. The third row shows the F measures of the method that uses the SVM-based pruning function. We see that the performance drops only slightly when we remove 10% of the training data for the CRF model. After using that 10% to train the more sophisticated SVM-based pruning model, the performance increases. If we define the error rate as $1-\textup{F-measure}$, the error reduction is about 10% and up to 20%, which is significant for CWS evaluation. Other experiments also show that all kinds of features help the performance.

Table 5

A comparison between the baseline CRF-based method and the method using SVM-based pruning on the F measures

| | AS | CityU | MSR | PKU |
|---|---|---|---|---|
| 100% CRF | 95.0% | 94.3% | 96.2% | 94.6% |
| 90% CRF | 95.0% | 94.1% | 96.1% | 94.4% |
| + $p_{\mathrm{SVM}}$ | | | | |

We proposed a binary tree representation for the structures of the unclear part between Chinese morphology and syntax. We also proposed a simple two-step binary tree based framework for CWS, consisting of tree building and tree pruning. Previous models for CWS can be employed in this framework. The binary tree representation provides a quantitative error analysis method for CWS, by which we see that the granularity mismatch problem is the primary cause of errors for both CWS and cross-corpus CWS. We also illustrated Step 2 with an SVM-based tree pruning model, which reduces the error rate by up to 20% relative to a state-of-the-art CRF-based baseline.

The definition of a Chinese word is not clear even to linguists [9]. The disagreements on the segmentation standard between different corpora, such as the disagreement between the MSR and PKU corpora, are mainly about granularity. Moreover, applications such as machine translation and information retrieval need CWS models with different granularity. Our binary tree representation not only provides a way to improve the performance of CWS, but also provides a way to solve these problems: we can have a unified tree building function and different tree pruning functions for different corpora and applications with different granularity.

References

[1] Emerson T. The second international Chinese word segmentation bakeoff. In: Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing. Jeju Island, Korea, 2005: 123–133.

[2] Gao Jianfeng, Li Mu, Huang Chang-Ning, Wu Andi. Chinese word segmentation and named entity recognition: A pragmatic approach. Computational Linguistics, 2005, 31(4): 531–574.

[3] Jiang Wenbin, Mi Haitao, Liu Qun. Word lattice reranking for Chinese word segmentation and Part-of-Speech tagging. In: Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, 2008: 385–392.

[4] Kruengkrai C, Uchimoto K, Kazama J, Wang Yiou, Torisawa K, Isahara Hitoshi. An Error-Driven Word-Character hybrid model for joint Chinese word segmentation and POS tagging. In: Proc. of ACL-IJCNLP 2009. Suntec, Singapore: Association for Computational Linguistics, 2009: 513–521.

[5] Li Zhongguo, Sun Maosong. Punctuation as implicit annotations for Chinese word segmentation. Computational Linguistics, 2009, 35(4): 505–512.

[6] Liu Y, Wang B, Ding F, Xu S. Information retrieval oriented word segmentation based on character associative strength ranking. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008: 1061–1069.

[7] Peng Fuchun, Feng Fangfang, McCallum A. Chinese segmentation and new word detection using conditional random fields. In: Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, 2004: 562.

[8] Sun Maosong, Shen Dayang, Tsou Benjamin K. Chinese word segmentation without using lexicon and hand-crafted training data. In: Proceedings of the 17th International Conference on Computational Linguistics, Volume 2. Morristown, NJ, USA: Association for Computational Linguistics, 1998: 1265–1271.

[9] Xue Nianwen. Defining and automatically identifying words in Chinese. Ph.D. Dissertation, University of Delaware, 2001.

[10]