Generalized Categorization Axioms
Jian YU
Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China. Email: [email protected]
Abstract
Categorization axioms have been proposed to axiomatize clustering results, offering a hint for bridging the gap between the human recognition system and machine learning through an intuitive observation: an object should be assigned to its most similar category. However, categorization axioms cannot be generalized to a general machine learning system, since they become trivial when the number of categories is one. In order to generalize categorization axioms to the general case, the categorization input and the categorization output are reinterpreted through inner and outer category representations. Based on this reinterpretation, two category representation axioms are presented. The category representation axioms and the categorization axioms can be combined into a generalized categorization axiomatic framework, which accurately delimits the theoretical constraints on categorization and overcomes the shortcoming of the original categorization axioms. The proposed axiomatic framework not only addresses the categorization test issue but also reinterprets many results in machine learning in a unified way, including dimensionality reduction, density estimation, regression, clustering and classification.
Keywords:
Similarity, Categorization, Category Representation, Dimensionality Reduction, Density Estimation, Regression, Clustering, Classification
Up to now, many elegant but complex machine learning theories have been developed for categorization, such as PAC theory (Valiant, 1984), statistical learning theory (Vapnik, 2000) and so on. However, a six or seven year old child can easily and correctly categorize many objects without knowing anything about the above mentioned machine learning theories. Therefore, there exists a clear gap between the human recognition system and machine learning theories. In Yu and Xu (2014), categorization axioms were proposed to axiomatize clustering results, which theoretically offers a hint for bridging the difference between the human recognition system and machine learning through an intuitive observation: an object should be assigned to its most similar category. These axioms assume that the number of categories c > 1.
In cognitive science, a basic principle of the human recognition system is that an object should be assigned to its most similar category. For human beings, membership explicitly represents that an object is assigned to some category and can be observed by others, while the similarity between an object and a category may be implicit and may not be observed by others. In other words, human beings have two category representations for categorization: membership is explicit and is called the outer category representation, while similarity may be implicit and belongs to the inner category representation. According to cognitive science, the inner category representation of a category is in the mind of the human being, and it may differ from the outer category representation. Human beings establish the relation between objects in the world and the corresponding concepts in the mind through these two category representations. For categories, a categorization algorithm should also have inner and outer category representations in order to reflect the relation between objects in the world and the corresponding categories, as Yu and Xu (2014) have done for clustering results. Considering the limits of the representation proposed in (Yu and Xu, 2014), we reinterpret how to define the inner and outer category representations in a categorization algorithm in the following.

Any algorithm has an input and an output. For a categorization algorithm, the input is called the categorization input and the output is called the categorization result. The categorization input should have inner and outer representations: the inner categorization input is expected to be learned with respect to the outer categorization input. Similarly, the categorization output should have inner and outer representations: the inner categorization output is actually learned with respect to the outer categorization output.

The outer categorization input describes the predefined categorization information of the sampled objects O = {o_1, o_2, ..., o_n}, including the input object representation and the corresponding outer category representation. The input object representation is X = {x_1, x_2, ..., x_n} with c subsets X_1, X_2, ..., X_c, where x_k represents the k-th object o_k and X_i is the set consisting of all objects of the i-th category in the data set X. The outer category representation for the categorization input can be represented by U = [u_{ik}]_{c×n}, where u_{ik} ≥ 0 for all i, k and u_{ik} denotes the membership of x_k belonging to the i-th category. Hence, the outer categorization input can be represented by (X, U); more details can be found in (Yu and Xu, 2014). When U is known, an object should be assigned to the category with the biggest membership. Therefore, the assignment (outer referring) operator → can be defined as \vec{X} = {\vec{x}_1, \vec{x}_2, ..., \vec{x}_n}, where \vec{x}_k = arg max_i u_{ik}.

Similarly, the outer categorization result can be expressed by (Y, V), where Y = {y_1, y_2, ..., y_n} is the object representation for the output, y_k also represents the k-th object o_k, and Y_1, Y_2, ..., Y_c correspond to the input subsets X_1, X_2, ..., X_c. V is the outer category representation for the output: V = [v_{ik}]_{c×n} = [v_1, v_2, ..., v_n] is a partition matrix with v_{ik} ≥ 0 for all i, k, where v_{ik} denotes the membership of y_k belonging to the i-th category and v_k = [v_{1k}, v_{2k}, ..., v_{ck}]^T. Similarly, the assignment operator → is defined as \vec{Y} = {\vec{y}_1, \vec{y}_2, ..., \vec{y}_n}, where \vec{y}_k = arg max_i v_{ik}. If \vec{x}_k and \vec{y}_k are single valued, x_k belongs to the \vec{x}_k-th category and y_k belongs to the \vec{y}_k-th category.
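As an illustration of these definitions, the following is a minimal Python/numpy sketch of an outer categorization input (X, U) and the assignment (outer referring) operator \vec{x}_k = arg max_i u_{ik}. The data set and the membership matrix are invented purely for illustration.

```python
import numpy as np

# Hypothetical outer categorization input (X, U) with n = 5 objects and c = 2 categories.
X = np.array([[0.1, 0.2],
              [0.0, 0.3],
              [0.2, 0.1],
              [0.9, 1.0],
              [1.1, 0.8]])

# Membership matrix U = [u_ik] (c x n): u_ik >= 0, column k holds the memberships of x_k.
U = np.array([[1.0, 0.9, 1.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 0.9, 1.0]])

# Assignment (outer referring) operator: vec{x}_k = argmax_i u_ik.
assign = U.argmax(axis=0)
print(assign)                     # e.g. [0 0 0 1 1]

# The subsets X_1, ..., X_c induced by the assignment.
subsets = [X[assign == i] for i in range(U.shape[0])]
print([len(s) for s in subsets])  # number of objects per category
```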
In common sense, the assignment operator → represents outer referring and reflects the external relation between the object and the category. As pointed out by Yu and Xu (2014), the cognitive representation of a category is always supposed to exist, even in an implicit state, when designing a categorization algorithm. For simplicity, when the input X = {x_1, x_2, ..., x_n} is categorized into c subsets X_1, X_2, ..., X_c, X_i is supposed to be the cognitive representation of the i-th category for every i; likewise, when the output Y = {y_1, y_2, ..., y_n} is categorized into c subsets Y_1, Y_2, ..., Y_c, Y_i is supposed to be the cognitive representation of the i-th category for every i.

As pointed out by Yu and Xu (2014), once the cognitive representation of a category is defined, objects can be categorized based on the similarity between objects and categories. As the input is usually different from the output, the input category similarity mapping and the output category similarity mapping can be defined by computing the similarity between objects and categories as follows.

Input Category Similarity Mapping:
Sim_X : X × {X_1, X_2, ..., X_c} → R^+ is called a category similarity mapping if an increase in Sim_X(x_k, X_i) indicates greater similarity between x_k and X_i, and a decrease in Sim_X(x_k, X_i) indicates less similarity between x_k and X_i.

Output Category Similarity Mapping:
Sim_Y : Y × {Y_1, Y_2, ..., Y_c} → R^+ is called a category similarity mapping if an increase in Sim_Y(y_k, Y_i) indicates greater similarity between y_k and Y_i, and a decrease in Sim_Y(y_k, Y_i) indicates less similarity between y_k and Y_i.

For the input category similarity mapping, the similarity (inner referring) operator ∼ can be defined as \tilde{X} = {\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_n}, where \tilde{x}_k = arg max_i Sim_X(x_k, X_i). Similarly, for the output category similarity mapping, the similarity operator ∼ can be defined as \tilde{Y} = {\tilde{y}_1, \tilde{y}_2, ..., \tilde{y}_n}, where \tilde{y}_k = arg max_i Sim_Y(y_k, Y_i). It is easy to see that if \tilde{y}_k is single valued, the larger Sim_Y(y_k, Y_{\tilde{y}_k}), the better Sim_Y; similarly, if \tilde{x}_k is single valued, the larger Sim_X(x_k, X_{\tilde{x}_k}), the better Sim_X.

Likewise, the input category dissimilarity mapping and the output category dissimilarity mapping can be defined as follows:

Input Category Dissimilarity Mapping: Ds_X : X × {X_1, X_2, ..., X_c} → R^+ is called a category dissimilarity mapping if an increase in Ds_X(x_k, X_i) indicates less similarity between x_k and X_i, and a decrease in Ds_X(x_k, X_i) indicates greater similarity between x_k and X_i.

Output Category Dissimilarity Mapping: Ds_Y : Y × {Y_1, Y_2, ..., Y_c} → R^+ is called a category dissimilarity mapping if an increase in Ds_Y(y_k, Y_i) indicates less similarity between y_k and Y_i, and a decrease in Ds_Y(y_k, Y_i) indicates greater similarity between y_k and Y_i.

For the input category dissimilarity mapping, the similarity operator ∼ can be defined as \tilde{X} = {\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_n}, where \tilde{x}_k = arg min_i Ds_X(x_k, X_i). Similarly, for the output category dissimilarity mapping, the similarity operator ∼ can be defined as \tilde{Y} = {\tilde{y}_1, \tilde{y}_2, ..., \tilde{y}_n}, where \tilde{y}_k = arg min_i Ds_Y(y_k, Y_i). If \tilde{y}_k is single valued, the smaller Ds_Y(y_k, Y_{\tilde{y}_k}), the better Ds_Y; similarly, the smaller Ds_X(x_k, X_{\tilde{x}_k}), the better Ds_X when \tilde{x}_k is single valued. (In order to be consistent with intuition, category similarity mappings and category dissimilarity mappings are supposed to be non-negative in this section; in applications they can be negative.)

If \tilde{x}_k and \tilde{y}_k are single valued, x_k is said to be similar to the \tilde{x}_k-th category and y_k is said to be similar to the \tilde{y}_k-th category. In daily life, the similarity operator ∼ represents inner referring and establishes the latent relation between the object in the world and the cognitive category representation. According to the above analysis, when the outer categorization input is (X, U), its corresponding inner categorization input can be represented by (
\mathcal{X}, Sim_X) or by (\mathcal{X}, Ds_X), where \mathcal{X} = {X_1, X_2, ..., X_c}. For brevity, (X, U, \mathcal{X}, Sim_X) or (X, U, \mathcal{X}, Ds_X) is called the categorization input. (\mathcal{X}, Sim_X) or (\mathcal{X}, Ds_X) is the inner category representation for the input, simply called the inner input. Likewise, when the outer categorization result is (Y, V), its corresponding inner categorization result can be represented by (\mathcal{Y}, Sim_Y) or by (\mathcal{Y}, Ds_Y), where \mathcal{Y} = {Y_1, Y_2, ..., Y_c}. For brevity, (Y, V, \mathcal{Y}, Sim_Y) or (Y, V, \mathcal{Y}, Ds_Y) is called the categorization result. (\mathcal{Y}, Sim_Y) or (\mathcal{Y}, Ds_Y) is the inner category representation for the output, simply called the inner output.

If a categorization algorithm can explicitly output \mathcal{Y}, it is called a white box. If a categorization algorithm cannot explicitly output \mathcal{Y} but only (Y, V), it is called a black box. If a categorization algorithm can explicitly output part, but not all, of \mathcal{Y}, it is called a grey box.

For a categorization algorithm, the outer input and the outer output should have corresponding inner category representations. We call this the Existence Axiom of Category Representation (ECR). More precisely, it can be expressed as follows:
1) ECR: For a categorization algorithm, if its outer input is (X, U) and its outer output is (Y, V), then there exist the corresponding inner input (\mathcal{X}, Sim_X) and inner output (\mathcal{Y}, Sim_Y).

For a categorization algorithm, the input is expected to have the same category representation as the output with respect to categorization. (\mathcal{X}, Sim_X) and the corresponding output (\mathcal{Y}, Sim_Y) are considered to have the same category representation with respect to categorization if (\mathcal{X}, \tilde{X}) = (\mathcal{Y}, \tilde{Y}); (X, U) and (Y, V) are considered to have the same category representation with respect to categorization if \vec{X} = \vec{Y}. Such an assumption is called the Uniqueness Axiom of Category Representation (UCR), which can be expressed as follows:

2) UCR: For a categorization algorithm, its categorization input (X, U, \mathcal{X}, Sim_X) and its corresponding categorization output (Y, V, \mathcal{Y}, Sim_Y) should satisfy (\vec{X}, \mathcal{X}, \tilde{X}) = (\vec{Y}, \mathcal{Y}, \tilde{Y}).

ECR and UCR are called category representation axioms. (X, U, \mathcal{X}, Sim_X) represents the category information given by the outer information provider, and (Y, V, \mathcal{Y}, Sim_Y) represents the category information given by the categorization algorithm; (\mathcal{X}, Sim_X) is expected to be learned and represents the inner category representation of the outer information provider, while (\mathcal{Y}, Sim_Y) is actually learned and represents the inner category representation of the categorization algorithm. UCR offers the conditions under which learning can be perfectly accomplished: the categorization input and the categorization output have the same categorization semantics. Sometimes, \vec{X} = \vec{Y} can be further strengthened to U = V.

According to Yu and Xu (2014), the categorization axioms include the Sample Separation Axiom (SS), the Category Separation Axiom (CS) and the Categorization Equivalency Axiom (CE). For a categorization result (Y, V, \mathcal{Y}, Sim_Y), SS, CS and CE can be reinterpreted by the similarity operator and the assignment operator as follows.
1) SS: ∀k ∃i (\tilde{y}_k = i)

2) CS: ∀i ∃k (\tilde{y}_k = i)

3) CE: \tilde{Y} = \vec{Y}

Moreover, we can prove Theorem 1.
Theorem 1. If ∀k ∀i ∀j ((j ≠ i) → (Sim_Y(y_k, Y_i) ≠ Sim_Y(y_k, Y_j))), then SS must hold.

When a categorization result is not proper, some objects theoretically belong to two or more categories; in other words, some objects lie on the borderline of some category. Based on this fact, the boundary set can be defined as follows.
Boundary set:
For a categorization result (Y, V,
\mathcal{Y}, Sim_Y), the boundary set for (Y, \mathcal{Y}, Sim_Y) is defined as BS_{(Y, \mathcal{Y}, Sim_Y)} = {y_k | card(\tilde{y}_k) > 1}, where card(\tilde{y}_k) represents the cardinality of the set \tilde{y}_k. Transparently, the above analysis also holds for the categorization input (X, U, \mathcal{X}, Sim_X); therefore, (X, U, \mathcal{X}, Sim_X) should also satisfy SS, CS and CE. For brevity, we will not repeat the similar result. More interestingly, a relation can be established between UCR and CE by Theorem 2.

Theorem 2.
If the categorization input (X, U, \mathcal{X}, Sim_X) and the categorization result (Y, V, \mathcal{Y}, Sim_Y) satisfy CE, then \tilde{X} = \tilde{Y} is equivalent to \vec{X} = \vec{Y}.

As noted above, the input x_k and the corresponding output y_k represent the same object o_k. Generally speaking, the input x and the corresponding output y represent the same object o; therefore, it is natural to assume that there exists a mapping θ from x to y, i.e. y = θ(x). When \mathcal{X} = \mathcal{Y}, it is easy to see that Sim_Y(y_k, Y_i) = Sim_Y(θ(x_k), Y_i) = Sim_Y(θ(x_k), X_i). Hence, Sim_X(x_k, X_i) can be defined by Sim_Y(θ(x_k), X_i), and it follows that \mathcal{X} = \mathcal{Y} implies \tilde{X} = \tilde{Y} when Sim_X(x_k, X_i) is defined by Sim_Y(θ(x_k), X_i). By Theorem 2 and the above analysis, \mathcal{X} = \mathcal{Y} plays an essential role in UCR. In particular, when c = 1, \tilde{X} = \tilde{Y} and \vec{X} = \vec{Y} hold trivially, so \mathcal{X} = \mathcal{Y} is the only meaningful requirement in UCR. Moreover, the categorization axioms and UCR offer the conditions that a category similarity mapping should satisfy, and state that the input category similarity mapping should be equivalent to the output category similarity mapping with respect to categorization, which is called the similarity assumption. For categorization, it is very challenging to design a proper output category similarity mapping satisfying UCR and the categorization axioms. Usually, the input category similarity mapping is not equivalent to the output category similarity mapping with respect to categorization in practice, which is called the similarity paradox. If the similarity paradox occurs, the categorization error will not be zero. According to the above analysis, the key to resolving the similarity paradox is to keep \mathcal{X} = \mathcal{Y} true. As a matter of fact, it is often the case that \mathcal{X} ≠ \mathcal{Y}; therefore, how to resolve the similarity paradox is an eternal problem in categorization.

In summary, the category representation axioms and the categorization axioms establish the relationships among all the parts related to the categorization input and the categorization output, as shown in Figure 1. UCR establishes the categorization equivalence between the input and the corresponding output. The categorization axioms only establish the relationships between the outer representation and the corresponding inner representation and do not reflect the relation between the input and the output. If the object representation can be theoretically generated by the corresponding cognitive representation, the cognitive representation is called generative. If the object representation cannot be theoretically generated by the corresponding cognitive representation but can determine it, the cognitive representation is called discriminative. If the cognitive representation is generative, the corresponding learning model is called a generative model; if the cognitive representation is discriminative, the corresponding learning model is called a discriminative model.

In particular, let \mathcal{X} = \mathcal{Y} and UCR be true; then (\mathcal{X}, Sim_X) and (\mathcal{Y}, Sim_Y) are exchangeable with respect to categorization. Under such assumptions, (X, U, \mathcal{X}, Sim_X) can be used to represent the categorization result, where (\mathcal{X}, Sim_X) actually denotes (\mathcal{Y}, Sim_Y). In Yu and Xu (2014), ECR and UCR are implicitly assumed to be true; such an assumption makes CE fail to hold, as it is very difficult for (\mathcal{Y}, Sim_Y) to have the same categorization capacity as (X, U) in practice, especially when U is given a priori.
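To make the axioms concrete, the following minimal Python sketch checks SS, CS and CE for a categorization result and reads off the boundary set. The membership matrix V, the similarity values and the tie tolerance are invented for illustration; ties in the similarity operator are kept as sets so that card(\tilde{y}_k) > 1 can be detected.

```python
import numpy as np

def argmax_set(col, tol=1e-12):
    """Return the set of indices attaining the maximum (similarity operator with ties)."""
    return set(np.flatnonzero(col >= col.max() - tol))

def check_axioms(V, S):
    """V: c x n outer membership matrix, S: c x n matrix with S[i, k] = Sim_Y(y_k, Y_i)."""
    c, n = V.shape
    outer = [argmax_set(V[:, k]) for k in range(n)]   # assignment operator  vec{y}_k
    inner = [argmax_set(S[:, k]) for k in range(n)]   # similarity operator  tilde{y}_k
    ss = all(len(s) > 0 for s in inner)                        # SS: every object refers to a category
    cs = all(any(i in s for s in inner) for i in range(c))     # CS: every category owns some object
    ce = all(o == s for o, s in zip(outer, inner))             # CE: outer and inner referring agree
    boundary = [k for k, s in enumerate(inner) if len(s) > 1]  # boundary set: card(tilde{y}_k) > 1
    return ss, cs, ce, boundary

V = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
S = np.array([[0.9, 0.8, 0.5, 0.1],
              [0.1, 0.2, 0.5, 0.9]])
print(check_axioms(V, S))   # (True, True, False, [2]): object 2 lies on the boundary
```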
All the above analysis does not discuss how to evalu-ate the categorization result (
\mathcal{Y}, Sim_Y). Frankly speaking, it is very challenging to test the performance of a categorization algorithm. When estimating the categorization performance, a test set (X_T, U_T) is usually provided and (X, U) is called the training set. According to the analysis in section 2, (\mathcal{X}_T, Sim_{X_T}) exists. Similarly, if (X_T, U_T, \mathcal{X}_T, Sim_{X_T}) is used as the categorization input, the corresponding categorization output can be represented by (Y_T, V_T, \mathcal{Y}_T, Sim_{Y_T}). The test set and the training set are supposed to represent the same categorization for the same categorization algorithm. Therefore, the Categorization Test Axiom can be expressed as follows:

Categorization Test Axiom: For a categorization algorithm, if its training set is (X, U) and its test set is (X_T, U_T), then (\mathcal{X}, Sim_X) = (\mathcal{X}_T, Sim_{X_T}).

Certainly, the categorization test axiom offers the prerequisite condition under which a categorization algorithm has generalization ability, which is a demanding requirement for categorization. It is easy to prove that the categorization test axiom implies that the objects in the training set and the test set should be independent and identically distributed if objects are random variables.

Usually, \mathcal{X} only approximates \mathcal{X}_T. Sometimes, the difference between \mathcal{X} and \mathcal{X}_T is so big that they cannot be considered to represent the same categorization. In this case, the test result will not be credible, and it cannot be checked whether the corresponding categorization algorithm has generalization ability or not. In fact, \mathcal{X} and \mathcal{X}_T are unobservable and unknown, so it is very difficult to measure the difference between them. Instead of measuring the difference between \mathcal{X} and \mathcal{X}_T, one estimation method is to compute the difference between (X, U) and (X_T, U_T); the other is to compute the difference between \mathcal{Y} and \mathcal{Y}_T, assuming that UCR holds or at least approximately holds. Theoretically, the difference between \mathcal{X} and \mathcal{X}_T should be proportional to the difference between \mathcal{Y} and \mathcal{Y}_T in the ideal case. Therefore, the categorization robustness assumption can be described as follows:

Categorization Robustness Assumption:
A categorization algorithm is called robust if there exist two constants k_1 and k_2 such that k_1 |\mathcal{Y} − \mathcal{Y}_T| ≤ |\mathcal{X} − \mathcal{X}_T| ≤ k_2 |\mathcal{Y} − \mathcal{Y}_T|, where 0 < k_1 ≤ k_2.

The categorization robustness assumption gives the global condition under which the corresponding categorization algorithm has generalization ability when the categorization test axiom does not hold. If the categorization test axiom holds, a good categorization algorithm should make |\mathcal{Y} − \mathcal{Y}_T| as small as possible. When the categorization test axiom does not hold, it is very challenging to check whether the categorization robustness assumption holds, as \mathcal{X} and \mathcal{X}_T are usually not known. Therefore, a substitute is to compute the distance between the outer representations. Such an idea leads to the local categorization robustness assumption:

Local Categorization Robustness Assumption:
A categorization algorithm is called locally robust if there exist two constants k_1 and k_2 such that k_1 |(Y, V) − (Y_T, V_T)| ≤ |(X, U) − (X_T, U_T)| ≤ k_2 |(Y, V) − (Y_T, V_T)|, where 0 < k_1 ≤ k_2, (X, U) is a training set and (X_T, U_T) is a test set.

Transparently, if the local categorization robustness assumption is satisfied with respect to |(X, U) − (X_T, U_T)| < ε, where ε is a very small positive number, the corresponding algorithm can be stably evaluated in theory.

Figure 1: Relationship between a categorization input (X, U, \mathcal{X}, Sim_X) and its corresponding categorization result (Y, V, \mathcal{Y}, Sim_Y).

When the categorization axioms were proposed by Yu and Xu (2014), three design principles of clustering methods were also proposed. However, these three design principles need to be reinterpreted when categorization is investigated. It is natural to expect that the five axioms are also useful for developing categorization methods, since the five axioms are proposed to deal with categorization algorithms. Clearly, the five axioms do not have equal importance when designing a categorization method. ECR only tells us how to represent the categorization input and the categorization output. CE is always supposed to be true for a categorization algorithm, since the outer referring and the corresponding inner referring should represent the same referring; in a word, the explicit function of a categorization algorithm should be the same as its internally implemented function. As pointed out by Yu and Xu (2014), SS and CS offer a very low bar for clustering results; similarly, SS and CS are also loose requirements for categorization. UCR is far more demanding, as it requires three equivalence conditions to hold simultaneously. Therefore, three design principles of categorization methods can be inferred from SS, CS and UCR. In the following, we investigate these three principles respectively under the proposed axiomatic framework.

Theorem 1 shows that the conditions of SS are almost no requirement, as the conditions of Theorem 1 are often true in the general case for a well designed category similarity. Following the same analysis as in Yu and Xu (2014), SS should be enhanced into the category compactness principle as follows:
Category Compactness Principle:
A categorization method should make its categorization result as compact as possible. The category compactness principle says that every category should be as compact as possible. Under the proposed representation of the categorization result, the category compactness criterion can be defined as follows.
Category Compactness Criterion: J_C : {Y, V} × {\mathcal{Y}, Ds_Y} → R^+ is called a category compactness criterion if the optimum of J_C(Y, V, \mathcal{Y}, Ds_Y) corresponds to the categorization result with the largest category compactness.

According to the categorization axioms, the category compactness criterion can equivalently be defined by J_C(X, U, \mathcal{X}, Ds_X). In the literature, one often sees J_C(X, U, \mathcal{X}, Ds_X) = Σ_i Σ_k u_{ik} Ds_X(x_k, X_i). Because of the interdependence among (X, U, \mathcal{X}, Ds_X), J_C(X, U, \mathcal{X}, Ds_X) can be further simplified into J_C(X, \mathcal{X}, Ds_X) or J_C(U). Noticing the definition of the category similarity mapping, the category compactness principle is still available for categorization when c = 1.
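As a concrete instance of the criterion J_C(X, U, \mathcal{X}, Ds_X) = Σ_i Σ_k u_{ik} Ds_X(x_k, X_i), the following sketch uses the common (but here assumed) choice of representing each category by the membership-weighted mean of its members and taking Ds_X as the squared Euclidean distance, i.e. the k-means-style within-category scatter.

```python
import numpy as np

def compactness(X, U):
    """J_C = sum_i sum_k u_ik * Ds_X(x_k, X_i), with X_i represented by the weighted
    mean of its members and Ds_X taken as squared Euclidean distance (k-means style)."""
    c, n = U.shape
    J = 0.0
    for i in range(c):
        w = U[i]                                           # memberships of all objects in category i
        if w.sum() == 0:
            continue
        proto = (w[:, None] * X).sum(axis=0) / w.sum()     # inner representation of category i
        J += np.sum(w * np.sum((X - proto) ** 2, axis=1))  # weighted within-category scatter
    return J

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
U = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.]])
print(compactness(X, U))   # 1.0: each category contributes a scatter of 0.5
```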
If a categorization result (Y, V, \mathcal{Y}, Sim_Y) satisfies CS, then ∀ 1 ≤ i ≠ j ≤ c, Y_i ≠ Y_j. For the same reason as in Yu and Xu (2014), CS can be enhanced into the category separation principle as follows:

Category Separation Principle: A good categorization result should have the maximum distance between categories.

Under the proposed representation of the categorization result, the category separation criterion can be defined as follows.

Category Separation Criterion: J_S : {Y, V} × {Y_1, Y_2, ..., Y_c} → R^+ is called a category separation criterion if the optimum of J_S(Y, V, {Y_1, Y_2, ..., Y_c}) corresponds to the categorization result with maximal category separation.

The category separation principle requires that c > 1; in other words, when c = 1, the category separation principle is unavailable.

Categorization Consistency Principle

If the categorization input (
X, U, \mathcal{X}, Sim_X) and its corresponding categorization result (Y, V, \mathcal{Y}, Sim_Y) satisfy UCR, the categorization error is zero. However, even for human recognition systems, UCR cannot always be guaranteed to be true. Generally, human recognition systems always try to make the categorization error as small as possible. Therefore, UCR is the most demanding requirement for categorization. If UCR does not hold, a reasonable categorization criterion should make UCR hold as approximately as possible, which results in the categorization consistency principle:

Categorization Consistency Principle: When UCR does not hold, a good categorization result should make UCR as approximately correct as possible.

When UCR does not hold, the categorization consistency principle can be used to design a categorization criterion as follows:

Categorization Consistency Criterion: J_E : {X, \vec{X}, \mathcal{X}, \tilde{X}} × {Y, \vec{Y}, \mathcal{Y}, \tilde{Y}} → R^+ is called a categorization consistency criterion if the optimum of J_E(X, \vec{X}, \mathcal{X}, \tilde{X}, Y, \vec{Y}, \mathcal{Y}, \tilde{Y}) corresponds to the categorization result with the minimum difference between (\vec{X}, \mathcal{X}, \tilde{X}) and (\vec{Y}, \mathcal{Y}, \tilde{Y}).

Clearly, if UCR cannot be true, the categorization consistency principle should be the first principle when designing a categorization algorithm, no matter what the number of categories is. Frankly speaking, it is not usually expected that (\mathcal{X}, Sim_X) and (\mathcal{Y}, Sim_Y) are obtained simultaneously. Usually, (\mathcal{X}, Sim_X) is interchanged with or approximated by (\mathcal{Y}, Sim_Y) when designing a categorization algorithm. In many categorization algorithms, UCR is supposed to be true but is not actually true. Under such an assumption, the category compactness principle and the category separation principle should be used to design categorization methods.

For a specific categorization problem, there exist many categorization models. The category compactness principle, the category separation principle and the categorization consistency principle only select the optimal parameters among candidate models with the same inner category representation; they cannot choose the optimal model among different inner category representations. How can an appropriate categorization model be selected among different inner category representations? Occam's razor is a popular tool for human beings to choose models among different representations; it states that "plurality should not be posited without necessity". Therefore, a simpler categorization model should be selected among candidate models with the same performance.

What is a simple categorization model? As the categorization problem can be represented by the categorization input (X, U, \mathcal{X}, Sim_X) and the corresponding categorization output (Y, V, \mathcal{Y}, Sim_Y), a model with a simple categorization input and output is considered simple. When c = 1, ∀k, \vec{x}_k = 1 and ∀k, \tilde{x}_k = 1; therefore, it is enough to study \mathcal{X} and \mathcal{Y} in order to obey UCR or its approximate version, the categorization consistency principle, and (U, Sim_X, V, Sim_Y) can be omitted when designing a categorization model. If such an assumption holds, the problem can be considered a simple categorization problem. Otherwise, if c ≥ 2 and Y = X, V can be replaced by Sim_Y because CE always holds; hence, (Y, V, \mathcal{Y}, Sim_Y) can be represented by (\mathcal{Y}, Sim_Y). Similarly, (X, U, \mathcal{X}, Sim_X) can be represented by (X, U). In this case, it is enough to deal with (
X, U, \mathcal{Y}, Sim_Y) for such a categorization problem. Clearly, this is also a simple categorization case. Of course, such simplified categorization models can be further simplified by selecting a simpler \mathcal{Y}. In summary, Occam's razor can be used to discuss categorization model complexity. In the following, we study categorization models according to model complexity from the Occam's razor point of view.

In this section, we study categorization models according to the analysis in section 5.4. When c = 1, categorization becomes a one-category problem, including density estimation, regression and some dimensionality reduction methods. When c > 1, categorization is a multiple-category problem, including clustering and classification: when U is not known for c > 1, categorization becomes clustering; when U is known for c > 1, categorization becomes classification.

In the following, we give several examples to show how to interpret dimensionality reduction methods based on the proposed axioms and principles. For simplicity, assume that X = [x_{kr}]_{n×p} is sampled from some underlying structure in a space with dimensionality p, and that the same sample can also be represented by Y = [y_{kr}]_{n×d} in a low dimensional space with dimensionality d, where p >> d. Such a categorization problem is called dimensionality reduction.

If U is not known, the problem is called unsupervised dimensionality reduction. Unsupervised dimensionality reduction has the categorization input (X, U, \mathcal{X}, Ds_X) and the categorization output (Y, V, \mathcal{Y}, Ds_Y); therefore, unsupervised dimensionality reduction can be considered a categorization problem. In this section, we further assume that c = 1. Under this assumption, \tilde{X} = \tilde{Y} and \vec{X} = \vec{Y} hold trivially, and UCR only requires that \mathcal{X} = \mathcal{Y}. If UCR does not hold, the categorization consistency principle naturally requires that \mathcal{X} approximate \mathcal{Y} as closely as possible. If UCR does hold, the category compactness principle implies that the best \mathcal{X} should make the underlying category the most compact.

PCA (Pearson, 1901; Hotelling, 1933; Abdi and Williams, 2010):
Let \mathcal{X} = \mathcal{Y} = [x_0; w_1; w_2; ...; w_d] represent the ordered orthonormal basis {w_1, w_2, ..., w_d} with origin x_0, and let Y = [y_{kr}]_{n×d} be the coordinates of the objects O = {o_1, o_2, ..., o_n} in this basis. Then w_i w_j^T = δ_{ij}, where δ_{ij} = 1 if i = j and δ_{ij} = 0 if i ≠ j, y_{kr} = (x_k − x_0) w_r^T, and x_0 and the w_i are 1 × p vectors.

Let Ds_X(x, \mathcal{X}) = (x − x_0 − Σ_i (x − x_0) w_i^T w_i)(x − x_0 − Σ_i (x − x_0) w_i^T w_i)^T represent the dissimilarity between x and the category representation \mathcal{X}. It is easy to prove that Ds_X(x, \mathcal{X}) = (x − x_0)(x − x_0)^T − Σ_i w_i (x − x_0)^T (x − x_0) w_i^T. Obviously, if x − x_0 is a linear combination of the ordered orthonormal basis {w_1, w_2, ..., w_d}, then Ds_X(x, \mathcal{X}) = 0, meaning that x can be perfectly represented by Y. If Ds_X(x_k, \mathcal{X}) = 0 for all x_k, then every object has coordinates in the basis {w_1, w_2, ..., w_d} with origin x_0 with zero residual. In general, it is not true that Ds_X(x_k, \mathcal{X}) = 0 for all x_k.

As UCR holds, the category compactness principle is used to seek the best \mathcal{X}, which means that a good \mathcal{X} should minimize the objective function (1) subject to w_i w_j^T = δ_{ij} for all i, j:

min_{\mathcal{X}} Σ_k Ds_X(x_k, \mathcal{X}) = Σ_k (x_k − x_0)(x_k − x_0)^T − Σ_i w_i (Σ_k (x_k − x_0)^T (x_k − x_0)) w_i^T    (1)

By the Lagrange multiplier method, the objective function can be rewritten as (2):

L = Σ_k (x_k − x_0)(x_k − x_0)^T − Σ_i w_i (Σ_k (x_k − x_0)^T (x_k − x_0)) w_i^T − Σ_i λ_i (w_i w_i^T − 1)    (2)

The equations (3) are obtained by differentiating (2):

∂L/∂x_0 = −Σ_k (x_k − x_0)(I_p − Σ_i w_i^T w_i) = 0
∂L/∂w_i = 2 w_i Σ_k (x_k − x_0)^T (x_k − x_0) − 2 λ_i w_i = 0    (3)

Hence, the solution of minimizing (1) subject to w_i w_j^T = δ_{ij} is given by (4):

x_0 = (1/n) Σ_k x_k,    w_i Σ_k (x_k − x_0)^T (x_k − x_0) = λ_i w_i    (4)

Equation (4) and the minimization of (1) recover traditional principal component analysis. The proposed axiomatic framework of categorization has thus offered a new interpretation of principal component analysis.
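The derivation can be checked numerically: under the squared-residual dissimilarity above, the minimizer of (1) is given by the origin x_0 of (4) and the leading eigenvectors of the scatter matrix, which is exactly classical PCA. A minimal numpy sketch on synthetic data:

```python
import numpy as np

def pca(X, d):
    """Return the origin x_0 and the d ordered orthonormal directions w_1..w_d that
    minimize the residual dissimilarity sum_k Ds_X(x_k, X) of equation (1)."""
    x0 = X.mean(axis=0)                      # x_0 = (1/n) * sum_k x_k, as in (4)
    S = (X - x0).T @ (X - x0)                # scatter matrix sum_k (x_k - x_0)^T (x_k - x_0)
    vals, vecs = np.linalg.eigh(S)           # eigenvectors satisfy w_i S = lambda_i w_i
    order = np.argsort(vals)[::-1][:d]       # keep the d largest eigenvalues
    W = vecs[:, order].T                     # rows are w_1, ..., w_d
    Y = (X - x0) @ W.T                       # coordinates y_kr = (x_k - x_0) w_r^T
    return x0, W, Y

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
x0, W, Y = pca(X, d=2)
print(W @ W.T)    # approximately the 2 x 2 identity: the basis is orthonormal
```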
NMF (Lee and Seung, 1999): Let Y = H = [h_{kr}]_{n×d} and \mathcal{X} = \mathcal{Y} = W = [w_1; w_2; ...; w_d] represent the ordered basis {w_1, w_2, ..., w_d}, where Y = [h_{kr}]_{n×d} are the coordinates of the objects O = {o_1, o_2, ..., o_n} in this basis, all the elements of each w_i are nonnegative, and h_{kr} is nonnegative for all k, r.

Let Ds_X(x_k, \mathcal{X}) = (x_k − Σ_i h_{ki} w_i)(x_k − Σ_i h_{ki} w_i)^T. As UCR holds, the category compactness principle is used to seek the best \mathcal{X}, which means that a good \mathcal{X} should minimize the objective function (5):

min_{\mathcal{X}} Σ_k Ds_X(x_k, \mathcal{X}) = Σ_k (x_k − Σ_i h_{ki} w_i)(x_k − Σ_i h_{ki} w_i)^T = ||X − HW||^2    (5)

Minimizing (5) yields nonnegative matrix factorization (Lee and Seung, 1999).
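Minimizing (5) under the nonnegativity constraints has no closed form; one common choice is the multiplicative update rule of Lee and Seung (1999). A minimal sketch, assuming nonnegative input data:

```python
import numpy as np

def nmf(X, d, iters=200, eps=1e-9):
    """Approximately minimize ||X - H W||^2 with H (n x d) and W (d x p) nonnegative,
    using the multiplicative update rules of Lee and Seung (1999)."""
    n, p = X.shape
    rng = np.random.default_rng(0)
    H = rng.random((n, d))
    W = rng.random((d, p))
    for _ in range(iters):
        H *= (X @ W.T) / (H @ W @ W.T + eps)   # update coordinates H
        W *= (H.T @ X) / (H.T @ H @ W + eps)   # update basis W
    return H, W

X = np.abs(np.random.default_rng(1).normal(size=(50, 10)))   # nonnegative data
H, W = nmf(X, d=3)
print(np.linalg.norm(X - H @ W))   # reconstruction error decreases with more iterations
```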
CCA (Hotelling, 1936): Let \mathcal{X} = Xa^T/|Xa^T| and \mathcal{Y} = Yb^T/|Yb^T|, where a is a 1 × p vector and b is a 1 × d vector. However, \mathcal{X} = \mathcal{Y} does not hold in general, so UCR is not true. Therefore, we use the categorization consistency principle, which means minimizing the objective function (6):

min_{a,b} L(\mathcal{X}, \mathcal{Y}) = |\mathcal{X} − \mathcal{Y}|^2 = | Xa^T/|Xa^T| − Yb^T/|Yb^T| |^2 = 2 − 2 (Xa^T, Yb^T) / (|Xa^T| |Yb^T|)    (6)

Obviously, minimizing (6) is equivalent to maximizing (7):

(Xa^T, Yb^T) / (|Xa^T| |Yb^T|) = a X^T Y b^T / (√(a X^T X a^T) √(b Y^T Y b^T))    (7)

Hence, canonical correlation analysis is obtained by maximizing (7).

LLE (Roweis and Saul, 2000):
Let \mathcal{X} = W_X = [w_{kl}]_{n×n} and Ds_X(x_k, \mathcal{X}) = Ds_X(x_k, W) = |x_k − Σ_{j∈N(k)} w_{kj} x_j|^2, where Σ_l w_{kl} = 1, w_{kl} ≥ 0, w_{kl} = 0 if l ∉ N(k), and N(k) = {j | x_j is a neighbor of x_k}. As UCR holds, the category compactness principle is used to seek the best \mathcal{X}: a good category representation \mathcal{X} = W should minimize the objective function (8):

min_W Σ_k Ds_X(x_k, W) = Σ_k |x_k − Σ_{j∈N(k)} w_{kj} x_j|^2    (8)

According to UCR, \mathcal{X} = \mathcal{Y} implies that \mathcal{Y} = W. Setting Ds_Y(y_k, \mathcal{Y}) = Ds_Y(y_k, W) = |y_k − Σ_{j∈N(k)} w_{kj} y_j|^2, the category compactness principle tells us that a good Y should minimize the objective function (9):

min_Y Σ_k Ds_Y(y_k, W) = Σ_k |y_k − Σ_{j∈N(k)} w_{kj} y_j|^2    (9)

In this way, the locally linear embedding algorithm results from minimizing (8) and (9).

MDS (Kruskal and Wish, 1978):
Let \mathcal{X} = D_X = [d^X_{kl}]_{n×n} and \mathcal{Y} = D_Y = [d^Y_{kl}]_{n×n}, where d^X_{kl} = |x_k − x_l| and d^Y_{kl} = |y_k − y_l|. Clearly \mathcal{X} = \mathcal{Y} cannot hold in general. Therefore, the categorization consistency principle is used, which requires that a good Y should minimize the objective function (10):

min_Y L(\mathcal{X}, \mathcal{Y}) = L(D_X, D_Y)    (10)

Naturally, the multidimensional scaling (MDS) algorithm is obtained by minimizing the objective function (10).

ISOMAP (Tenenbaum et al., 2000):
Let \mathcal{X} = D_X = [d^X_{kl}]_{n×n} and \mathcal{Y} = D_Y = [d^Y_{kl}]_{n×n}, where d^X_{kl} represents the geodesic distance between x_k and x_l and d^Y_{kl} = |y_k − y_l|. It is impossible for \mathcal{X} = \mathcal{Y} to hold, so the categorization consistency principle requires minimizing (10). According to the above analysis, the multidimensional scaling (MDS) algorithm can be used to compute Y; in this way, the ISOMAP algorithm is obtained.

If n points x_1, x_2, ..., x_n are sampled from a random variable with unknown probability density function f, then f is expected to be constructed from the observed data X = {x_1, x_2, ..., x_n}, which is called density estimation; f is called the expected density function. Set X = Y, \mathcal{X} = f, \mathcal{Y} = \hat{f}, U = [1, 1, ..., 1]_{1×n} and V = [1, 1, ..., 1]_{1×n}; density estimation can then be considered a categorization problem with categorization input (X, U, \mathcal{X}, Ds_X) and categorization output (Y, V, \mathcal{Y}, Ds_Y), i.e. density estimation is a categorization problem with only one category. In the following, \hat{f} is called the density estimator. Because all points belong to one category, \vec{U} = \vec{V} and \tilde{X} = \tilde{Y} hold. However, \mathcal{X} ≠ \mathcal{Y} in general; therefore, UCR does not hold.

One method of density estimation is parametric estimation. If p(x) is supposed to belong to the distribution family p(x|θ), density estimation is transformed into estimating θ; in other words, density estimation becomes parametric estimation. In this case, \mathcal{X} = θ and Ds_X(x, θ) = −log p(x|θ). Let \hat{θ} be the estimate of θ; then \mathcal{Y} = \hat{θ} and Ds_Y(x, \hat{θ}) = −log p(x|\hat{θ}). The category compactness principle requires minimizing the intra-category variance, which results in the objective function (11):

min_{\hat{θ}} Σ_{k=1}^{n} Ds_Y(x_k, \hat{θ}) = min_{\hat{θ}} Σ_{k=1}^{n} −log p(x_k|\hat{θ})    (11)

It is easy to see that the maximum likelihood method is equivalent to minimizing (11). For example, let x_k ∈ R^p for all k, x ∈ R^p, and p(x|\hat{θ}) = (1/((√(2π))^p \hat{σ}^p)) exp[−(x − \hat{µ})(x − \hat{µ})^T / (2\hat{σ}^2)], where \hat{θ} = {\hat{µ}, \hat{σ}}. According to equation (11), the objective function (12) can be inferred:

L = Σ_{k=1}^{n} −log p(x_k|\hat{θ}) = Σ_{k=1}^{n} ( |x_k − \hat{µ}|^2 / (2\hat{σ}^2) + log((√(2π))^p \hat{σ}^p) )    (12)

Minimizing (12) leads to the estimates \hat{µ} = (1/n) Σ_{k=1}^{n} x_k and \hat{σ}^2 = (1/(np)) Σ_{k=1}^{n} |x_k − \hat{µ}|^2.

Another method of density estimation is nonparametric estimation, in which less rigid assumptions are made about f. In the literature (Silverman, 1986), nonparametric density estimators include histograms, kernel density estimation, the k-nearest neighbor method, etc. Clearly, the key problem for density estimation is to estimate the difference between \hat{f} and f. In theory, the minimum difference between \hat{f} and f should be expected according to the categorization consistency principle. In the literature, theoretical conditions for \hat{f} = f have been well studied from the limit point of view (Silverman, 1986).

Generally, if n points (\hat{x}_1, f(\hat{x}_1)), (\hat{x}_2, f(\hat{x}_2)), ..., (\hat{x}_n, f(\hat{x}_n)) are sampled from (\hat{x}, f(\hat{x})) and f is not known but is expected to be learned, such a problem is called regression. Usually, f is called the expected regression function. Set X = [(\hat{x}_1, f(\hat{x}_1)); (\hat{x}_2, f(\hat{x}_2)); ...; (\hat{x}_n, f(\hat{x}_n))], Y = [(\hat{x}_1, F(\hat{x}_1)); (\hat{x}_2, F(\hat{x}_2)); ...; (\hat{x}_n, F(\hat{x}_n))], \mathcal{X} = (\hat{x}, f(\hat{x})) and \mathcal{Y} = (\hat{x}, F(\hat{x})), where F is called the predicted regression function, and U = [1, 1, ..., 1]_{1×n}, V = [1, 1, ..., 1]_{1×n}. Then regression has the categorization input (X, U, \mathcal{X}, Ds_X) and the categorization output (Y, V, \mathcal{Y}, Ds_Y); in other words, regression can be considered a categorization problem with only one category. Because all points belong to one category, \vec{U} = \vec{V} and \tilde{X} = \tilde{Y}. However, \mathcal{X} ≠ \mathcal{Y} in general cases; therefore, UCR does not hold. According to the categorization consistency principle, a good category representation \mathcal{Y} should minimize the following objective function:

|\mathcal{X} − \mathcal{Y}| = D(f(\hat{x}), F(\hat{x}))    (13)

It is impossible to directly compute D(f(\hat{x}), F(\hat{x})), as f is unknown; therefore, different definitions of D(f(\hat{x}), F(\hat{x})) lead to different regression algorithms. For example, set f(\hat{x}) ∈ R and F(\hat{x}) = \hat{w}\hat{x}^T + b, and assume that the dimensionality of \hat{x} is τ. If D(f(\hat{x}), F(\hat{x})) = Σ_{k=1}^{n} ||f(\hat{x}_k) − F(\hat{x}_k)||^2, linear regression is obtained by minimizing (13) when n >> τ. When n << τ, many feasible solutions reach the same minimum of (13), since n << τ implies that minimizing (13) faces a singular problem. How should the optimal solution be selected among the many feasible solutions of minimizing (13)? A natural idea is to select the feasible solution with minimum norm. If the Euclidean norm is used, D(f(\hat{x}), F(\hat{x})) can be defined by Σ_{k=1}^{n} ||f(\hat{x}_k) − F(\hat{x}_k)||^2 + λ||w||^2; hence, ridge regression is obtained by minimizing (13). When the L_1 norm is used, D(f(\hat{x}), F(\hat{x})) can be defined by Σ_{k=1}^{n} ||f(\hat{x}_k) − F(\hat{x}_k)||^2 + λ||w||_{L_1}; in this way, Lasso regression is obtained by minimizing (13) (Tibshirani, 1994).
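As a concrete check of (11)-(12), the following sketch fits an isotropic Gaussian by minimizing the negative log-likelihood; under that model the closed-form minimizer is the sample mean and the per-dimension sample variance. The data are synthetic and the comparison at the end simply confirms that a perturbed parameter gives a larger objective value.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=(1000, 3))   # synthetic sample, p = 3

# Maximum likelihood = minimizing sum_k -log p(x_k | theta_hat), equation (11).
mu_hat = X.mean(axis=0)                                                 # sample mean
sigma2_hat = np.mean(np.sum((X - mu_hat) ** 2, axis=1)) / X.shape[1]    # per-dimension variance

def neg_log_likelihood(mu, sigma2):
    """Objective (12) for an isotropic Gaussian with mean mu and variance sigma2."""
    p = X.shape[1]
    sq = np.sum((X - mu) ** 2, axis=1)
    return np.sum(0.5 * sq / sigma2 + 0.5 * p * np.log(2 * np.pi * sigma2))

# The closed-form estimate attains a lower value than a perturbed parameter.
print(neg_log_likelihood(mu_hat, sigma2_hat) < neg_log_likelihood(mu_hat + 0.5, sigma2_hat))
```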
For clustering, (X, U, \mathcal{X}, Sim_X) is called the clustering input and (Y, V, \mathcal{Y}, Sim_Y) is called the clustering result. Since U and V are unknown a priori for clustering, it is always supposed that the inner input and the corresponding inner output are the same, i.e. (\mathcal{X}, Sim_X) = (\mathcal{Y}, Sim_Y). Under that assumption, it is further assumed that U = V for clustering. When Y = X, the outer input and the outer output are the same, which implies that (X, U, \mathcal{X}, Sim_X) and (Y, V, \mathcal{Y}, Sim_Y) are exchangeable with respect to clustering; in a word, (X, U, \mathcal{X}, Sim_X) also represents the clustering result. As Sim_X and Sim_Y are the same, Sim can denote both Sim_X and Sim_Y for clustering. Hence, the theoretical analysis of clustering in Yu and Xu (2014) remains true under the new categorization interpretation of this paper.

Even if Y ≠ X, (U, \mathcal{X}, \tilde{X}) = (V, \mathcal{Y}, \tilde{Y}) still holds for clustering, which means that ECR and UCR are still true. In other words, ECR and UCR can always be omitted for clustering, so that SS, CS and CE play the more important role; frankly speaking, SS, CS and CE are enough for clustering. Of course, when Y ≠ X, such clustering algorithms usually have a feature extraction step, as in spectral clustering.

For classification, a category is called a class. In order to be consistent with the literature, (
X, U, X, Sim X ) iscalled classification training input and categorizationresult ( Y, V, Y , Sim Y ) is called classification trainingoutput in this section. More specifically, ( X, U ) iscalled the training set, (
X, Sim X ) is called the ex-pected classifier, ( Y, V ) is called the training result,(
\mathcal{Y}, Sim_Y) is called the learned classifier. ECR and the categorization axioms are usually true for classification; however, UCR is usually not true. If UCR is true, the classification error is zero. In practice, a classification method can only make its classification result reach the minimum classification error, and usually its classification error is not zero. Therefore, UCR should serve as a constraint for a classification problem; in other words, when dealing with a classification problem, UCR should be true as much as possible in probability.

When U is a proper partition, the corresponding classification problem is the standard classification problem. When U is an overlapping partition, the corresponding classification problem is the multi-label classification problem. For multi-label classification, SS should be generalized as ∀k ∃i (i ∈ \tilde{x}_k); under such a generalization, multi-label classification also follows SS.

When a classification result (Y, V, \mathcal{Y}, Sim_Y) is output, we can predict which category a new object should be assigned to. In theory, the decision region for a classification result (Y, V, \mathcal{Y}, Sim_Y) can be defined as follows:

Decision Region: Ω = {x | ∃i ((\tilde{y} = i) ∧ (y = θ(x)))}.

In particular, the decision region for a class Y_i can be defined as follows:

Decision Region for a Class Y_i: Ω_i = {x | (\tilde{y} = i) ∧ (y = θ(x))}.

Therefore, ∪_i Ω_i = Ω. The boundary for a classification result (Y, V, \mathcal{Y}, Sim_Y) can be defined as follows:

Boundary: ∂Ω = \bar{Ω} − Ω^◇, where \bar{Ω} represents the closure of Ω and Ω^◇ represents the interior of Ω.

The training decision region can be defined as follows:

Training Decision Region: Ω_{(\mathcal{Y}, Sim_Y)} = {x | ∃i ∃k ((x ∈ Ω_i) ∧ (x_k ∈ Ω_i) ∧ (Sim_Y(θ(x), Y_i) ≥ Sim_Y(θ(x_k), Y_i)))}.

Training Decision Region for a Class Y_i: Ω_{Y_i} = {x | ∃k ((x ∈ Ω_i) ∧ (x_k ∈ Ω_i) ∧ (Sim_Y(θ(x), Y_i) ≥ Sim_Y(θ(x_k), Y_i)))}.

The support vector for a classification result (Y, V, \mathcal{Y}, Sim_Y) can be defined as follows:

Support Vector: If x_k ∈ ∂Ω_{(\mathcal{Y}, Sim_Y)}, then x_k is called a support vector for the classification result (Y, V, \mathcal{Y}, Sim_Y).

The margin for a classification result (Y, V, \mathcal{Y}, Sim_Y) can be defined as Margin_{(\mathcal{Y}, Sim_Y)} = min_{i≠j} d(Ω_{Y_i}, Ω_{Y_j}), where d(Ω_{Y_i}, Ω_{Y_j}) represents the distance between Ω_{Y_i} and Ω_{Y_j}.

Transparently, the decision region is used to judge which category an object should be assigned to, while the training decision region focuses on judging the quality of the classification result.

In the literature, one common idea for designing a classification algorithm is to transform classification into regression. In order to do this, a regression function needs to be defined; in the following, we do so according to the proposed axiomatic framework. The expected regression function can be defined as ρ(k) = \vec{x}_k, where U is a proper partition. Under this circumstance, CE states that ρ(k) = \tilde{x}_k holds for a classification result. Similarly, when V is a proper partition, we set H(k) = \vec{y}_k; then CE guarantees that H(k) = \tilde{y}_k holds. Generally speaking, x denotes the input object representation and y denotes the corresponding output object representation. As y = θ(x) and ρ(x) denotes \vec{x}, the predicted regression function can be defined as h(x) = H(θ(x)) = H(y) = \tilde{y}, i.e. h(x) represents the predicted label.

Set X = [(x_1, ρ(x_1)); (x_2, ρ(x_2)); ...; (x_n, ρ(x_n))], Y = [(x_1, h(x_1)); (x_2, h(x_2)); ...; (x_n, h(x_n))], \mathcal{X} = (x, ρ(x)) and \mathcal{Y} = (x, h(x)). Therefore, classification can be considered regression. Using this notation, UCR requires that \mathcal{X} = \mathcal{Y}, which means ∀x (ρ(x) = h(x)). In practice, this is impossible, as ρ(x) is not known a priori; only ρ(x_k) is known for k ∈ {1, 2, ..., n}. Therefore, it is natural to relax ∀x (ρ(x) = h(x)) to P(ρ(x) ≠ h(x)) ≤ ε. PAC theory has provided a theoretical investigation of sufficient conditions for making P(ρ(x) ≠ h(x)) ≤ ε hold with probability not less than 1 − δ (Valiant, 1984).

Therefore, UCR is very important for classification. For developing a classification method, the categorization consistency principle requires that Σ_{k=1}^{n} L(ρ(x_k), h(x_k)) reach its minimum, which is usually called minimizing the empirical risk. Transparently, neural networks can be introduced by minimizing the empirical risk. Usually, the more complex h(x) is, the smaller the empirical risk; therefore, the tradeoff between the empirical risk and the function complexity leads to the structural risk (Vapnik, 2000). In particular, when c = 2, ρ(x) ∈ {1, 2}.
Set h(x) = 1 + π(x) and

L(ρ(x), h(x)) = −(ρ(x) − 1) log(h(x) − 1) − (2 − ρ(x)) log(2 − h(x)) = −(ρ(x) − 1) log(π(x)) − (2 − ρ(x)) log(1 − π(x)),

where π(x) = exp(wx^T + b) / (1 + exp(wx^T + b)). Then equation (13) tells us that the objective function of the binomial logistic regression model (Hosmer Jr and Lemeshow, 2004) can be expressed as follows:

min_{\mathcal{Y}} Σ_{k=1}^{n} L(ρ(x_k), h(x_k)) = −Σ_{k=1}^{n} (ρ(x_k) − 1)(w x_k^T + b) + Σ_{k=1}^{n} log(1 + exp(w x_k^T + b))    (14)

X = Y

However, many classification methods are not developed by transforming classification into regression. In order to show this clearly, we simply assume Y = X; then the classification result omits Y, as X is known a priori. By the analysis in Section 5.4, it is enough to study (X, U, \mathcal{Y}, Sim_Y) under such a simplification. Since (X, U) is known for classification, the simplest \mathcal{Y} should be preferred according to Occam's razor. In the following, U = [u_{ik}]_{c×n} is a hard partition.
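A minimal sketch of minimizing (14) by gradient descent for the two-class case; the data are synthetic and the labels ρ(x_k) ∈ {1, 2} are recoded as t_k = ρ(x_k) − 1 ∈ {0, 1}.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=-1.0, size=(50, 2))      # class 1  (rho = 1, target t = 0)
X2 = rng.normal(loc=+1.0, size=(50, 2))      # class 2  (rho = 2, target t = 1)
X = np.vstack([X1, X2])
t = np.concatenate([np.zeros(50), np.ones(50)])   # t_k = rho(x_k) - 1

w = np.zeros(2); b = 0.0
lr = 0.1
for _ in range(500):
    z = X @ w + b
    pi = 1.0 / (1.0 + np.exp(-z))            # pi(x) = exp(z) / (1 + exp(z))
    # (averaged) gradient of the objective (14): sum_k (pi(x_k) - t_k) x_k, and likewise for b
    grad_w = X.T @ (pi - t) / len(t)
    grad_b = np.mean(pi - t)
    w -= lr * grad_w
    b -= lr * grad_b

print(np.mean((pi > 0.5) == (t == 1)))       # training accuracy of the fitted model
```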
Example 1: It is simplest to set \mathcal{Y} = \mathcal{X}, which means that Y_i = X_i for all i. Under such an assumption, we do not know any essential information about \mathcal{Y} except for \mathcal{X}. When Y_i = X_i for all i, it is natural to set Sim_Y(y, Y_i) = Sim_Y(x, Y_i) = |N_i(x)| / K, where N_i(x) = {x_l | x_l ∈ X_i ∧ x_l ∈ K-nearest neighborhood of x}. Under the above assumption, the K-nearest neighbor classification method (Cover and Hart, 1967) is obtained. The categorization result of K-nearest neighbor classification follows the categorization axioms in general cases; clearly, however, UCR does not hold for K-nearest neighbor classification in general.
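A minimal sketch of Example 1 with Sim_Y(x, Y_i) = |N_i(x)|/K; the value of K and the toy data are illustrative.

```python
import numpy as np

def knn_predict(X_train, labels, x, K=3):
    """Sim_Y(x, Y_i) = |N_i(x)| / K: count how many of the K nearest neighbours of x
    belong to each class, then assign x to the most similar class."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = labels[np.argsort(dist)[:K]]
    classes = np.unique(labels)
    sims = np.array([(nearest == c).mean() for c in classes])
    return classes[np.argmax(sims)], sims

X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, labels, np.array([0.5, 0.5])))   # (0, [1.0, 0.0])
```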
Example 2: Let X = [x_{kr}]_{n×p}, Sim_Y(y, Y_i) = Sim_Y(x, Y_i), and let g_i(x) = log Sim_Y(x, Y_i) be the discriminant function. SS requires that an object x be assigned to class Y_i if g_i(x) = max_j g_j(x). Occam's razor states that a simpler \mathcal{Y} is preferred. In theory, if every Y_i is represented by (w_i, w_{i0}), where w_i is a 1 × p vector and w_{i0} ∈ R, then g_i(x) = log Sim_Y(x, Y_i) = w_i x^T + w_{i0}. Such a categorization model is simpler; it is linear discriminant analysis (Fisher, 1936). Transparently, linear discriminant analysis also satisfies the categorization axioms.

Example 3:
In particular, when c = 2, it is natural to set Y_i = (w_i, w_{i0}) for each i. Occam's razor states that fewer parameters should be preferred. If we set Y_1 = (w, b − 1) and Y_2 = (−w, −b − 1), this is the simplest linear classification representation according to Occam's razor. In this case, g_1(x) = log Sim_Y(x, Y_1) = wx^T + b − 1 and g_2(x) = log Sim_Y(x, Y_2) = −wx^T − b − 1. Setting wx_k^T + b − 1 ≥ 0 for all x_k ∈ X_1 and −wx_k^T − b − 1 ≥ 0 for all x_k ∈ X_2, the categorization axioms hold. Therefore, the category separation principle states that the optimal linear discrimination should keep the distance between the two parallel hyperplanes wx^T + b = 1 and wx^T + b = −1 as large as possible when UCR holds, which leads to the famous support vector machine.

The training decision region for the support vector machine is Ω_{(\mathcal{Y}, Sim_Y)} = {x | wx^T + b − 1 ≥ 0} ∪ {x | −wx^T − b − 1 ≥ 0}, and it is easy to prove that Margin_{(\mathcal{Y}, Sim_Y)} = 2/√(ww^T). A larger Margin_{(\mathcal{Y}, Sim_Y)} means better generalization for the support vector machine, which has been proved by statistical learning theory (Vapnik, 2000).
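A minimal sketch of the margin idea: for linearly separable synthetic data, minimizing a lightly regularized hinge loss approximates the maximum-margin separator, and the distance between the two supporting hyperplanes is 2/√(ww^T). This is an illustrative approximation, not the exact quadratic-programming formulation of the support vector machine.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=+2.0, size=(50, 2))      # class X_1: want  w x^T + b >= +1
X2 = rng.normal(loc=-2.0, size=(50, 2))      # class X_2: want  w x^T + b <= -1
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(50), -np.ones(50)])

w = np.zeros(2); b = 0.0
lam, lr = 1e-3, 0.1
for _ in range(2000):
    margins = t * (X @ w + b)
    viol = margins < 1                                        # points violating the margin constraints
    grad_w = lam * w - (t[viol, None] * X[viol]).sum(axis=0) / len(t)
    grad_b = -t[viol].sum() / len(t)
    w -= lr * grad_w
    b -= lr * grad_b

print(2.0 / np.linalg.norm(w))    # the margin between the two supporting hyperplanes
```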
Example 4: Let Y_i = (w_i, w_{i0}) for 1 ≤ i ≤ c − 1, with Y_c unknown, and let Sim_Y(x, Y_i) = exp(w_i x^T + w_{i0}) / (1 + Σ_{j=1}^{c−1} exp(w_j x^T + w_{j0})) for 1 ≤ i ≤ c − 1 and Sim_Y(x, Y_c) = 1 / (1 + Σ_{j=1}^{c−1} exp(w_j x^T + w_{j0})). According to the category compactness principle, we should maximize the following objective function:

max_{Y_1, Y_2, ..., Y_{c−1}} Σ_{k=1}^{n} Σ_{i=1}^{c} u_{ik} log Sim_Y(x_k, Y_i) = Σ_{k=1}^{n} Σ_{i=1}^{c−1} u_{ik}(w_i x_k^T + w_{i0}) − Σ_{k=1}^{n} log(1 + Σ_{i=1}^{c−1} exp(w_i x_k^T + w_{i0}))    (15)

Such a categorization model is called logistic regression (Cox, 1958). According to Occam's razor, logistic regression is more complex than linear discriminant analysis. When c >
2, logistic regression should not be considered a regression model, as no regression function can be defined. Moreover, the c-th class can be considered noise in logistic regression.

Example 5:
For a categorization model, we do not need the concrete form of every Y_i explicitly: no matter how complicated \mathcal{Y} is, it is enough to compute Sim_Y. If Sim_Y(y, Y_i) = Sim_Y(x, Y_i) = P(x, Y_i) and v_{ik} = P(Y_i | x_k), the Bayes classifier almost follows the categorization axioms, since the output y = x ∈ Y_i precisely when Sim_Y(x, Y_i) = max_j Sim_Y(x, Y_j) = max_j P(x, Y_j) = P(x, Y_i), and Bayes' theorem guarantees that arg max_i P(x, Y_i) = arg max_i P(Y_i | x). Therefore, it is very important for the Bayes classifier to estimate Sim_Y or V from (X, U).

In particular, assume that X = [x_{kr}]_{n×p} represents n objects and x = [x^*_1, x^*_2, ..., x^*_p] represents an object, where x^*_r is the r-th feature. According to the categorization axioms, it is enough to calculate max_i P(x, Y_i) in order to classify x. According to Occam's razor, we should select the simplest way to calculate P(x, Y_i). The simplest way to estimate P(x | Y_i) is to assume that each feature is conditionally independent of every other feature given Y_i; then P(x | Y_i) = Π_{r=1}^{p} P(x^*_r | Y_i). Let P(Y_i) = card(X_i) / n; then Sim_Y(x, Y_i) can be computed as P(Y_i) Π_{r=1}^{p} P(x^*_r | Y_i). Based on the above analysis, the naive Bayes classifier (Duda et al., 1973) classifies x according to the categorization axioms; therefore, the naive Bayes classifier is the simplest Bayes classifier with respect to Occam's razor. As v_{ik} = P(Y_i | x_k) can be computed and V is a probability partition, the Bayes classifier can be considered soft categorization.
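A minimal sketch of the naive Bayes similarity Sim_Y(x, Y_i) = P(Y_i) Π_r P(x^*_r | Y_i) for binary features, with Laplace smoothing; the smoothing constant and the toy counts are illustrative.

```python
import numpy as np

# Toy data: two binary features, two classes (hypothetical counts for illustration).
X = np.array([[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

def naive_bayes_similarity(x, X, y, alpha=1.0):
    """Sim_Y(x, Y_i) = P(Y_i) * prod_r P(x_r | Y_i), with Laplace smoothing alpha."""
    classes = np.unique(y)
    sims = []
    for c in classes:
        Xi = X[y == c]
        prior = len(Xi) / len(X)                              # P(Y_i) = card(X_i) / n
        lik = 1.0
        for r in range(X.shape[1]):
            count = np.sum(Xi[:, r] == x[r])
            lik *= (count + alpha) / (len(Xi) + 2 * alpha)    # P(x_r | Y_i), binary feature
        sims.append(prior * lik)
    return classes, np.array(sims)

classes, sims = naive_bayes_similarity(np.array([1, 1]), X, y)
print(classes[np.argmax(sims)], sims)    # assigns [1, 1] to class 0
```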
Example 6: Let Ds_Y(y, Y_i) = Ds_Y(x, Y_i) = R(α_i | x) = Σ_{j=1}^{c} λ_{ij} P(Y_j | x), where the action α_i denotes the decision to assign the output y to class Y_i and λ_{ij} denotes the cost incurred for taking the action α_i when the input x belongs to Y_j. Transparently, the categorization result of minimum risk classification almost abides by the categorization axioms.

Example 7:
Let
Sim_Y(y, Y_i) = Sim_Y(x, Y_i) = U(α_i | x) = Σ_{j=1}^{c} U_{ij} P(Y_j | x), where the action α_i denotes the decision to assign the output y to class Y_i and U_{ij} measures how good it is to take the action α_i when the input x belongs to Y_j. The maximum expected utility classifier also almost follows the categorization axioms.

Example 8:
In the above examples, every Y_i is represented by one unique prototype, whether implicit or explicit. If every Y_i is assumed to be representable by several prototypes, such a classifier is more complex. In a decision tree classifier, every Y_i is usually represented by several mutually exclusive rules; it can be proved that the decision tree classifier also follows the categorization axioms.

X ≠ Y

When X ≠ Y with p > d, supervised dimensionality reduction is proposed to deal with the corresponding categorization. When X ≠ Y with p < d, kernel methods are proposed for categorization. In the following, we discuss them respectively.

Supervised Dimensionality Reduction
For X ≠ Y with p > d, it is easy to see that y = θ(x) such that y_k = θ(x_k) for all k. The simplest θ is a projection mapping; if θ(·) is a projection mapping, supervised dimensionality reduction becomes feature selection. Feature selection methods can easily be interpreted by the categorization consistency principle.

If θ is not a projection mapping, the simplest θ is a linear mapping from R^p to R^1. If there exists a direction w such that all categories in (X, U) are linearly separable when all points in (X, U) are orthogonally projected onto the direction w, we set \mathcal{Y}_i = \mathcal{X}_i = v_i w^T w, where w is a 1 × p vector with ww^T = 1, Y = [z_k]_{n×1} with z_k = x_k w^T, and v_i = Σ_{x_k ∈ X_i} x_k / |X_i|. With Ds_X(x, \mathcal{X}_i) = (x w^T w − \mathcal{X}_i)(x w^T w − \mathcal{X}_i)^T = w(x − v_i)^T(x − v_i)w^T and Ds_Y(z, \mathcal{Y}_i) = (z w − \mathcal{Y}_i)(z w − \mathcal{Y}_i)^T, it is easy to see that Ds_X(x, \mathcal{X}_i) = Ds_Y(z, \mathcal{Y}_i).

According to the category compactness principle, we need to minimize Σ_i Σ_{x_k ∈ X_i} Ds_X(x_k, \mathcal{X}_i) = n w S_W w^T. According to the category separation principle, we need to maximize Σ_{i=1}^{c} |X_i| w(v_i − \bar{x})^T(v_i − \bar{x})w^T = n w S_B w^T, where \bar{x} = (1/n) Σ_{k=1}^{n} x_k (S_W and S_B being the within-class and between-class scatter matrices). Combining the two, the ratio (w S_W w^T)/(w S_B w^T) should be minimized, which leads to the generalized Fisher linear discriminant analysis.

In particular, when c = 2, it is easy to prove that (\mathcal{X}_1 − \mathcal{X}_2)(\mathcal{X}_1 − \mathcal{X}_2)^T = w(v_1 − v_2)^T(v_1 − v_2)w^T = w S_B w^T. Since |X_1| w(v_1 − \bar{x})^T(v_1 − \bar{x})w^T + |X_2| w(v_2 − \bar{x})^T(v_2 − \bar{x})w^T = (|X_1||X_2|^2/|X|^2) w(v_1 − v_2)^T(v_1 − v_2)w^T + (|X_2||X_1|^2/|X|^2) w(v_1 − v_2)^T(v_1 − v_2)w^T = (|X_1||X_2|/|X|) w(v_1 − v_2)^T(v_1 − v_2)w^T, maximizing w(v_1 − v_2)^T(v_1 − v_2)w^T is equivalent to maximizing Σ_{i=1}^{2} |X_i| w(v_i − \bar{x})^T(v_i − \bar{x})w^T. Therefore, when c = 2, the generalized Fisher linear discriminant analysis becomes Fisher linear discriminant analysis. Certainly, Fisher linear discriminant analysis follows UCR if (X, U) is linearly separable in a direction w.

Kernel Methods
For X ≠ Y with p < d, assume that Y is linearly separable but X is not; then θ(·) is a nonlinear mapping such that y_k = θ(x_k) for all k. Sometimes, the dimensionality of Y is infinite. In this case, it is impossible to determine θ(·) from (X, U) and (
Y, V ). Fortunately, when (
\mathcal{Y}, Sim_Y) is obtained, (\mathcal{X}, Sim_X) can be obtained through the kernel function K(x, x_k) = (θ(x), θ(x_k)), where (θ(x), θ(x_k)) represents the inner product. By defining K(x, x_k), most categorization algorithms can be reinvented as kernel methods; interested readers are referred to (Scholkopf and Smola, 2011).

In summary, classification models almost follow the categorization axioms, but different classification models have different model complexity. It should be pointed out that a complex model may be easily interpreted while a simple one may be difficult to interpret; sometimes, a simple categorization model is very difficult to discover, especially when it is not easy to interpret.

Yu and Xu (2014) have presented categorization axioms based on the assumption that any category should have two kinds of representation. The main drawback of (Yu and Xu, 2014) is that it ignores the clustering input by implicitly assuming that the clustering result and the clustering input have the same category representation. However, the input and the output may not have the same category representation, even for some clustering algorithms. Therefore, the categorization axioms cannot be directly applied to a general learning algorithm. In particular, the categorization axioms assume that the number of categories is greater than one, which is invalid for regression and manifold learning.

In order to generalize the categorization axioms to general categorization methods, we represent categorization problems by redefining the categorization input as (
In summary, classification models largely follow the categorization axioms, but different classification models have different model complexity. It should be pointed out that a complex model may be easy to interpret while a simple one may be difficult to interpret; sometimes a simple categorization model is very difficult to discover, especially when it is not easy to interpret.

Yu and Xu (2014) presented categorization axioms based on the assumption that any category should have two kinds of representation. The main drawback of (Yu and Xu, 2014) is that it ignores the clustering input by implicitly assuming that the clustering result and the clustering input share the same category representation. However, the input and the output may not have the same category representation, even for some clustering algorithms. Therefore, the categorization axioms cannot be applied directly to a general learning algorithm. In particular, the categorization axioms assume that the number of categories is greater than one, which is invalid for regression and manifold learning.

In order to generalize the categorization axioms to general categorization methods, we represent categorization problems by redefining the categorization input as $(X, U, \mathcal{X}, \mathrm{Sim}_X)$ and the categorization result as $(Y, V, \mathcal{Y}, \mathrm{Sim}_Y)$. Based on these proposed representations of the categorization input and the categorization result, the similarity (inner referring) operator and the assignment (outer referring) operator are defined. These two operators are helpful not only for presenting UCR but also for reinterpreting the categorization axioms. ECR, UCR, SS, CS and CE indeed delimit the theoretical constraints for categorization. In particular, UCR offers the theoretical constraints for a perfect categorization algorithm, which guarantees that what is expected to be learned is equivalent to what is actually learned, i.e. there is no gap between teaching and learning. More interestingly, if $(X, U, \mathcal{X}, \mathrm{Sim}_X)$ and $(Y, V, \mathcal{Y}, \mathrm{Sim}_Y)$ are taken as a conversation between two persons, CE states that the outer category representation is equivalent to the inner category representation with respect to categorization, which is consistent with the maxim of quality in conversation: do not say what you believe to be false (Grice, 1975). UCR states that the input and the output should refer to the same categorization, which is consistent with the maxim of relation in conversation: make your contribution relevant (Grice, 1975). When a dialogue can be carried out efficiently, UCR and CE should hold in daily life.

As in Yu and Xu (2014), a clustering result satisfying SS and CS cannot be guaranteed to be a good clustering result, since SS and CS are too weak. Similarly, when developing a categorization algorithm, SS and CS also need to be strengthened, which results in the category compactness principle and the category separation principle, respectively, under the newly proposed representation. In this paper, it is proposed that a categorization method should follow UCR in theory. However, UCR is too demanding for a categorization method; in many cases UCR cannot hold and needs to be relaxed, which leads to one more design principle of categorization methods: the categorization consistency principle. The relation between the proposed axioms and the design principles for categorization is shown in Figure 2.

Figure 2: Relationship between axioms and design principles for categorization.

After the learning process, how to evaluate a categorization algorithm is very important. The categorization test axiom provides the prerequisite condition under which the performance of a categorization algorithm can be evaluated, and the local categorization robustness assumption offers the condition under which the performance of a categorization algorithm can be guaranteed to be stable.

When $c = 1$, ECR, SS, CS and CE hold trivially, but UCR still offers the theoretical condition for categorization. When $c = 1$, categorization becomes some dimensionality reduction methods, density estimation, and regression. Such dimensionality reduction methods, density estimation and regression can be introduced by UCR or its approximate version (the categorization consistency principle), for example principal component analysis, nonnegative matrix factorization, canonical correlation analysis, locally linear embedding, multidimensional scaling, Isomap, parametric density estimation, nonparametric density estimation, linear regression, ridge regression and the lasso. Theoretically, when $c = 1$, categorization mainly discusses how to represent a category, which lays a foundation for categorization with $c > 1$.
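To make the $c = 1$ case concrete, the following minimal sketch (my illustration, under the assumption that reconstruction error is used as the consistency measure) represents a single category by its mean and the $d$ leading principal directions, i.e. principal component analysis.

```python
import numpy as np

def single_category_pca(X, d):
    """c = 1 sketch: represent the single category by its mean v and the d
    leading principal directions W; the categorization consistency principle
    is approximated here by minimizing reconstruction error (an assumption)."""
    v = X.mean(axis=0)                         # inner category representation: the mean
    _, _, Vt = np.linalg.svd(X - v, full_matrices=False)
    W = Vt[:d]                                 # d leading principal directions
    Y = (X - v) @ W.T                          # low-dimensional output representation
    X_hat = Y @ W + v                          # reconstruction compared against X
    return v, W, Y, X_hat
```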
When $U$ is not known a priori for $c > 1$, categorization becomes clustering. ECR and UCR are always supposed to be true for any clustering algorithm in order to simplify the clustering process. Consequently, the clustering result and the clustering input are exchangeable when $X = Y$. Therefore, SS, CS and CE are sufficient for clustering when $X = Y$, and the theoretical analysis of clustering in (Yu and Xu, 2014) still holds when $X = Y$.
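As a hedged illustration of the category compactness principle for clustering with $X = Y$, the following plain k-means style sketch alternates the assignment (outer referring) step, where each object goes to its most similar category, with an update of each inner category representation $v_i$ as the mean of its members; names and initialization are illustrative assumptions.

```python
import numpy as np

def kmeans(X, c, n_iter=50, seed=0):
    """Plain k-means sketch: alternate the assignment (outer referring) step
    -- each object goes to its most similar category -- with updating each
    inner category representation v_i as the mean of its members."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)].copy()     # initial representations
    for _ in range(n_iter):
        d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # squared dissimilarities
        assign = d.argmin(axis=1)                                # assignment operator
        V = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else V[i]
                      for i in range(c)])
    return assign, V
```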
When $U$ is known a priori for $c > 1$, categorization becomes classification. For classification, ECR and CE are always true for a classification result, but SS and CS are true only for a proper classification result, and UCR holds only for a classification result with zero error. Therefore, SS, CS and UCR are the more important constraints for classification. With respect to a classification result $(Y, V, \mathcal{Y}, \mathrm{Sim}_Y)$, the decision region, the training decision region and the margin are defined by SS. For categorization methods, the category compactness principle can result in K-nearest-neighbor classification, linear discriminant analysis, support vector machines, logistic regression, Bayesian classification, minimum risk classification, maximum expected utility classification, decision trees, and so on. The category separation principle can lead to support vector machines and Fisher linear discriminant analysis. The categorization consistency principle can lead to empirical risk and structural risk, which in turn can result in neural networks and the binomial logistic regression model.
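As one hedged sketch of the intuitive rule that an object should be assigned to its most similar category, a nearest-prototype classifier can be built from the labelled input $(X, U)$ by taking each $v_i$ as the category representation; the per-category means and the Euclidean dissimilarity used below are assumptions made for illustration.

```python
import numpy as np

def category_prototypes(X, labels):
    """Inner category representations v_i learned from the labelled input (X, U):
    here simply the per-category means (an illustrative choice)."""
    return {i: X[labels == i].mean(axis=0) for i in np.unique(labels)}

def nearest_category(x, prototypes):
    """Assign x to its most similar category, with similarity taken as
    negative Euclidean distance to the category representation (assumption)."""
    return min(prototypes, key=lambda i: np.linalg.norm(x - prototypes[i]))
```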
UCR, SS, CS and CE play different roles in different categorization algorithms, but all have something to do with similarity. It is well known that similarity plays a key role in the human recognition system (Murphy, 2004; Hahn, 2014). Furthermore, Kloos and Sloutsky (2008) revealed that children represent categories based on similarity and that similarity-based category representation is a developmental default. The proposed axiomatic framework indeed establishes a bridge between cognitive science and machine learning through the similarity (inner referring) operator.

More interestingly, the proposed categorization framework clearly shows the range over which a categorization algorithm can reasonably be applied. If the inner category representation is reasonable for the outer input, the corresponding categorization algorithm is feasible. Otherwise, a more suitable inner category representation should be used, which certainly introduces another categorization algorithm. The analysis of categorization algorithms in this paper shows that the design of a cognitive category representation really needs powerful imagination, as the cognitive category representations in existing categorization algorithms are so diverse. In theory, a powerful categorization algorithm seems to have a powerful cognitive category representation.

It should be pointed out that many open questions remain to be addressed within the proposed axiomatic framework. For example, how to design an appropriate cognitive category representation for a specific categorization algorithm? When $c \geq 1$ and $(X, U)$ is only partially known or noisy, what is the relation between categorization axioms and categorization algorithms?
Acknowledgements
Zongben Xu, Xinbo Gao, Wensheng Zhang, Baogang Hu, Jiangshe Zhang, Jufu Feng, Shaoping Ma, Qing He, Xuegang Hu, Liping Jing, Bianfang Chai, Jia Li and all my colleagues in the CAAI Machine Learning Technical Committee are greatly appreciated; their valuable discussions and suggestions have greatly improved the presentation of this paper. This work was supported by the NSFC grant (61370129), the Ph.D. Programs Foundation of the Ministry of Education of China (20120009110006), PCSIRT (IRT201206), and the Beijing Committee of Science and Technology, China (Grant No. Z131110002813118).
References
Abdi, H. and Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459.

Cover, T. and Hart, P. (1967). Nearest neighbor pattern recognition. IEEE Transactions on Information Theory, 13(1):21–27.

Cox, D. (1958). The regression analysis of binary sequences (with discussion). J. Roy. Stat. Soc. B, 20:215–242.

Duda, R. O., Hart, P. E., et al. (1973). Pattern Classification and Scene Analysis, volume 3. Wiley, New York.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188.

Grice, P. (1975). Logic and conversation. In P. Cole and J. Morgan, eds., Syntax and Semantics, vol. 3, New York. Academic Press.

Hahn, U. (2014). Similarity. Wiley Interdisciplinary Reviews: Cognitive Science, 5(3):271–280.

Hosmer Jr, D. W. and Lemeshow, S. (2004). Applied Logistic Regression. John Wiley & Sons.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, pages 321–377.

Kloos, H. and Sloutsky, V. M. (2008). What's behind different kinds of kinds: Effects of statistical density on learning and representation of categories. Journal of Experimental Psychology: General, 137(1):52.

Kruskal, J. B. and Wish, M. (1978). Multidimensional Scaling, volume 11. Sage.

Lee, D. and Seung, H. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Murphy, G. L. (2004). The Big Book of Concepts. MIT Press.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559–572.

Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.

Scholkopf, B. and Smola, A. J. (2011). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Journal of the American Statistical Association, 98(3):781.

Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, New York.

Tenenbaum, J. B., De Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323.

Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.

Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.

Vapnik, V. (2000). The Nature of Statistical Learning Theory. Springer.

Yu, J. and Xu, Z. (2014). Categorization axioms for clustering results. eprint arXiv:1403.2065.