Generalized Categorization Axioms
Jian YU
Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China. Email: [email protected]
Abstract
Categorization axioms have been proposed to axiomatize clustering results, offering a hint for bridging the gap between the human recognition system and machine learning through an intuitive observation: an object should be assigned to its most similar category. However, categorization axioms cannot be generalized to a general machine learning system, since they become trivial when the number of categories is one. In order to generalize categorization axioms to the general case, the categorization input and the categorization output are reinterpreted through inner and outer category representations. Based on this reinterpretation, two category representation axioms are presented. The category representation axioms and the categorization axioms can be combined into a generalized categorization axiomatic framework, which accurately delimits the theoretical constraints on categorization and overcomes the shortcoming of the original categorization axioms. The proposed axiomatic framework not only addresses the categorization test issue but also reinterprets many results in machine learning in a unified way, including dimensionality reduction, density estimation, regression, clustering and classification.
Keywords:
Similarity, Categorization, Category Representation, Dimensionality Reduction, Density Estimation, Regression, Clustering, Classification
Up to now, many elegant but complex machine learning theories have been developed for categorization, such as PAC theory (Valiant, 1984), statistical learning theory (Vapnik, 2000) and so on. However, a six or seven year old child can easily and correctly categorize many objects without knowing anything about the above mentioned machine learning theories. Therefore, there exists a clear gap between the human recognition system and machine learning theories. In Yu and Xu (2014), categorization axioms were proposed to axiomatize clustering results, which theoretically offers a hint for bridging the difference between the human recognition system and machine learning through an intuitive observation: an object should be assigned to its most similar category. These axioms assume that the number of categories c > 1.
In cognitive science, a basic principle of the human recognition system is that an object should be assigned to its most similar category. For human beings, membership explicitly represents that an object is assigned to some category and can be observed by others, while the similarity between an object and a category may be implicit and may not be observed by others. In other words, human beings have two category representations for categorization: membership is explicit and is called the outer category representation, while similarity may be implicit and belongs to the inner category representation. According to cognitive science, the inner category representation of a category is in the mind of the human being, and it may differ from the outer category representation. Human beings establish the relation between objects in the world and the corresponding concepts in the mind through these two category representations. For categories, a categorization algorithm should also have inner and outer category representations in order to reflect the relation between objects in the world and the corresponding categories, as Yu and Xu (2014) have done for clustering results. Considering the limits of the representation proposed in (Yu and Xu, 2014), we reinterpret how to define the inner and outer category representations in a categorization algorithm in the following.

Any algorithm has an input and an output. For a categorization algorithm, the input is called the categorization input and the output is called the categorization result. The categorization input should have inner and outer representations: the inner categorization input is expected to be learned with respect to the outer categorization input. Similarly, the categorization output should have inner and outer representations: the inner categorization output is actually learned with respect to the outer categorization output.

The outer categorization input describes the predefined categorization information of the sampled objects O = {o_1, o_2, ..., o_n}, including the input object representation and the corresponding outer category representation. The input object representation is X = {x_1, x_2, ..., x_n} with c subsets X_1, X_2, ..., X_c, where x_k represents the k-th object o_k and X_i is the set consisting of all objects of the i-th category in the data set X. The outer category representation for the categorization input can be represented by U = [u_{ik}]_{c×n}, where u_{ik} ≥ 0 for all i, k and u_{ik} denotes the membership of x_k belonging to the i-th category. Hence, the outer categorization input can be represented by (X, U); more details can be found in (Yu and Xu, 2014). When U is known, an object should be assigned to the category with the biggest membership. Therefore, the assignment (outer referring) operator → can be defined as \vec{X} = {\vec{x}_1, \vec{x}_2, ..., \vec{x}_n}, where \vec{x}_k = arg max_i u_{ik}.

Similarly, the outer categorization result can be expressed by (Y, V), where Y = {y_1, y_2, ..., y_n} is the object representation for the output, y_k also represents the k-th object o_k, and Y_1, Y_2, ..., Y_c correspond to the input subsets X_1, X_2, ..., X_c. V is the outer category representation for the output: V = [v_{ik}]_{c×n} = [v_1, v_2, ..., v_n] is a partition matrix with v_{ik} ≥ 0 for all i, k, where v_{ik} denotes the membership of y_k belonging to the i-th category and v_k = [v_{1k}, v_{2k}, ..., v_{ck}]^T. Similarly, the assignment operator → is defined as \vec{Y} = {\vec{y}_1, \vec{y}_2, ..., \vec{y}_n}, where \vec{y}_k = arg max_i v_{ik}. If \vec{x}_k and \vec{y}_k are single valued, x_k belongs to the \vec{x}_k-th category and y_k belongs to the \vec{y}_k-th category.
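As an illustration of these definitions, the following is a minimal Python/numpy sketch of an outer categorization input (X, U) and the assignment (outer referring) operator \vec{x}_k = arg max_i u_{ik}. The data set and the membership matrix are invented purely for illustration.

```python
import numpy as np

# Hypothetical outer categorization input (X, U) with n = 5 objects and c = 2 categories.
X = np.array([[0.1, 0.2],
              [0.0, 0.3],
              [0.2, 0.1],
              [0.9, 1.0],
              [1.1, 0.8]])

# Membership matrix U = [u_ik] (c x n): u_ik >= 0, column k holds the memberships of x_k.
U = np.array([[1.0, 0.9, 1.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 0.9, 1.0]])

# Assignment (outer referring) operator: vec{x}_k = argmax_i u_ik.
assign = U.argmax(axis=0)
print(assign)                     # e.g. [0 0 0 1 1]

# The subsets X_1, ..., X_c induced by the assignment.
subsets = [X[assign == i] for i in range(U.shape[0])]
print([len(s) for s in subsets])  # number of objects per category
```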
In common sense, the assignment operator → represents outer referring and reflects the external relation between the object and the category. As pointed out by Yu and Xu (2014), the cognitive representation of a category is always supposed to exist, even in an implicit state, when designing a categorization algorithm. For simplicity, when the input X = {x_1, x_2, ..., x_n} is categorized into c subsets X_1, X_2, ..., X_c, X_i is supposed to be the cognitive representation of the i-th category for every i; likewise, when the output Y = {y_1, y_2, ..., y_n} is categorized into c subsets Y_1, Y_2, ..., Y_c, Y_i is supposed to be the cognitive representation of the i-th category for every i.

As pointed out by Yu and Xu (2014), once the cognitive representation of a category is defined, objects can be categorized based on the similarity between objects and categories. As the input is usually different from the output, the input category similarity mapping and the output category similarity mapping can be defined by computing the similarity between objects and categories as follows.

Input Category Similarity Mapping:
Sim_X : X × {X_1, X_2, ..., X_c} → R^+ is called a category similarity mapping if an increase in Sim_X(x_k, X_i) indicates greater similarity between x_k and X_i, and a decrease in Sim_X(x_k, X_i) indicates less similarity between x_k and X_i.

Output Category Similarity Mapping:
Sim_Y : Y × {Y_1, Y_2, ..., Y_c} → R^+ is called a category similarity mapping if an increase in Sim_Y(y_k, Y_i) indicates greater similarity between y_k and Y_i, and a decrease in Sim_Y(y_k, Y_i) indicates less similarity between y_k and Y_i.

For the input category similarity mapping, the similarity (inner referring) operator ∼ can be defined as \tilde{X} = {\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_n}, where \tilde{x}_k = arg max_i Sim_X(x_k, X_i). Similarly, for the output category similarity mapping, the similarity operator ∼ can be defined as \tilde{Y} = {\tilde{y}_1, \tilde{y}_2, ..., \tilde{y}_n}, where \tilde{y}_k = arg max_i Sim_Y(y_k, Y_i). It is easy to see that if \tilde{y}_k is single valued, the larger Sim_Y(y_k, Y_{\tilde{y}_k}), the better Sim_Y; similarly, if \tilde{x}_k is single valued, the larger Sim_X(x_k, X_{\tilde{x}_k}), the better Sim_X.

Likewise, the input category dissimilarity mapping and the output category dissimilarity mapping can be defined as follows:

Input Category Dissimilarity Mapping: Ds_X : X × {X_1, X_2, ..., X_c} → R^+ is called a category dissimilarity mapping if an increase in Ds_X(x_k, X_i) indicates less similarity between x_k and X_i, and a decrease in Ds_X(x_k, X_i) indicates greater similarity between x_k and X_i.

Output Category Dissimilarity Mapping: Ds_Y : Y × {Y_1, Y_2, ..., Y_c} → R^+ is called a category dissimilarity mapping if an increase in Ds_Y(y_k, Y_i) indicates less similarity between y_k and Y_i, and a decrease in Ds_Y(y_k, Y_i) indicates greater similarity between y_k and Y_i.

For the input category dissimilarity mapping, the similarity operator ∼ can be defined as \tilde{X} = {\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_n}, where \tilde{x}_k = arg min_i Ds_X(x_k, X_i). Similarly, for the output category dissimilarity mapping, the similarity operator ∼ can be defined as \tilde{Y} = {\tilde{y}_1, \tilde{y}_2, ..., \tilde{y}_n}, where \tilde{y}_k = arg min_i Ds_Y(y_k, Y_i). If \tilde{y}_k is single valued, the smaller Ds_Y(y_k, Y_{\tilde{y}_k}), the better Ds_Y; similarly, the smaller Ds_X(x_k, X_{\tilde{x}_k}), the better Ds_X when \tilde{x}_k is single valued. (In order to be consistent with intuition, category similarity mappings and category dissimilarity mappings are supposed to be non-negative in this section; in applications they can be negative.)

If \tilde{x}_k and \tilde{y}_k are single valued, x_k is said to be similar to the \tilde{x}_k-th category and y_k is said to be similar to the \tilde{y}_k-th category. In daily life, the similarity operator ∼ represents inner referring and establishes the latent relation between the object in the world and the cognitive category representation. According to the above analysis, when the outer categorization input is (X, U), its corresponding inner categorization input can be represented by (
\mathcal{X}, Sim_X) or by (\mathcal{X}, Ds_X), where \mathcal{X} = {X_1, X_2, ..., X_c}. For brevity, (X, U, \mathcal{X}, Sim_X) or (X, U, \mathcal{X}, Ds_X) is called the categorization input. (\mathcal{X}, Sim_X) or (\mathcal{X}, Ds_X) is the inner category representation for the input, simply called the inner input. Likewise, when the outer categorization result is (Y, V), its corresponding inner categorization result can be represented by (\mathcal{Y}, Sim_Y) or by (\mathcal{Y}, Ds_Y), where \mathcal{Y} = {Y_1, Y_2, ..., Y_c}. For brevity, (Y, V, \mathcal{Y}, Sim_Y) or (Y, V, \mathcal{Y}, Ds_Y) is called the categorization result. (\mathcal{Y}, Sim_Y) or (\mathcal{Y}, Ds_Y) is the inner category representation for the output, simply called the inner output.

If a categorization algorithm can explicitly output \mathcal{Y}, it is called a white box. If a categorization algorithm cannot explicitly output \mathcal{Y} but only (Y, V), it is called a black box. If a categorization algorithm can explicitly output part, but not all, of \mathcal{Y}, it is called a grey box.

For a categorization algorithm, the outer input and the outer output should have corresponding inner category representations. We call this the Existence Axiom of Category Representation (ECR). More precisely, it can be expressed as follows:
1) ECR: For a categorization algorithm, if its outer input is (X, U) and its outer output is (Y, V), then there exist the corresponding inner input (\mathcal{X}, Sim_X) and inner output (\mathcal{Y}, Sim_Y).

For a categorization algorithm, the input is expected to have the same category representation as the output with respect to categorization. (\mathcal{X}, Sim_X) and the corresponding output (\mathcal{Y}, Sim_Y) are considered to have the same category representation with respect to categorization if (\mathcal{X}, \tilde{X}) = (\mathcal{Y}, \tilde{Y}); (X, U) and (Y, V) are considered to have the same category representation with respect to categorization if \vec{X} = \vec{Y}. Such an assumption is called the Uniqueness Axiom of Category Representation (UCR), which can be expressed as follows:

2) UCR: For a categorization algorithm, its categorization input (X, U, \mathcal{X}, Sim_X) and its corresponding categorization output (Y, V, \mathcal{Y}, Sim_Y) should satisfy (\vec{X}, \mathcal{X}, \tilde{X}) = (\vec{Y}, \mathcal{Y}, \tilde{Y}).

ECR and UCR are called category representation axioms. (X, U, \mathcal{X}, Sim_X) represents the category information given by the outer information provider, and (Y, V, \mathcal{Y}, Sim_Y) represents the category information given by the categorization algorithm; (\mathcal{X}, Sim_X) is expected to be learned and represents the inner category representation of the outer information provider, while (\mathcal{Y}, Sim_Y) is actually learned and represents the inner category representation of the categorization algorithm. UCR offers the conditions under which learning can be perfectly accomplished: the categorization input and the categorization output have the same categorization semantics. Sometimes, \vec{X} = \vec{Y} can be further strengthened to U = V.

According to Yu and Xu (2014), the categorization axioms include the Sample Separation Axiom (SS), the Category Separation Axiom (CS) and the Categorization Equivalency Axiom (CE). For a categorization result (Y, V, \mathcal{Y}, Sim_Y), SS, CS and CE can be reinterpreted by the similarity operator and the assignment operator as follows.
1) SS: ∀k ∃i (\tilde{y}_k = i)

2) CS: ∀i ∃k (\tilde{y}_k = i)

3) CE: \tilde{Y} = \vec{Y}

Moreover, we can prove Theorem 1.
Theorem 1. If ∀k ∀i ∀j ((j ≠ i) → (Sim_Y(y_k, Y_i) ≠ Sim_Y(y_k, Y_j))), then SS must hold.

When a categorization result is not proper, some objects theoretically belong to two or more categories; in other words, some objects lie on the borderline of some category. Based on this fact, the boundary set can be defined as follows.
Boundary set:
For a categorization result (Y, V,
\mathcal{Y}, Sim_Y), the boundary set for (Y, \mathcal{Y}, Sim_Y) is defined as BS_{(Y, \mathcal{Y}, Sim_Y)} = {y_k | card(\tilde{y}_k) > 1}, where card(\tilde{y}_k) represents the cardinality of the set \tilde{y}_k. Transparently, the above analysis also holds for the categorization input (X, U, \mathcal{X}, Sim_X); therefore, (X, U, \mathcal{X}, Sim_X) should also satisfy SS, CS and CE. For brevity, we will not repeat the similar result. More interestingly, a relation can be established between UCR and CE by Theorem 2.

Theorem 2.
If the categorization input (X, U, \mathcal{X}, Sim_X) and the categorization result (Y, V, \mathcal{Y}, Sim_Y) satisfy CE, then \tilde{X} = \tilde{Y} is equivalent to \vec{X} = \vec{Y}.

As noted above, the input x_k and the corresponding output y_k represent the same object o_k. Generally speaking, the input x and the corresponding output y represent the same object o; therefore, it is natural to assume that there exists a mapping θ from x to y, i.e. y = θ(x). When \mathcal{X} = \mathcal{Y}, it is easy to see that Sim_Y(y_k, Y_i) = Sim_Y(θ(x_k), Y_i) = Sim_Y(θ(x_k), X_i). Hence, Sim_X(x_k, X_i) can be defined by Sim_Y(θ(x_k), X_i), and it follows that \mathcal{X} = \mathcal{Y} implies \tilde{X} = \tilde{Y} when Sim_X(x_k, X_i) is defined by Sim_Y(θ(x_k), X_i). By Theorem 2 and the above analysis, \mathcal{X} = \mathcal{Y} plays an essential role in UCR. In particular, when c = 1, \tilde{X} = \tilde{Y} and \vec{X} = \vec{Y} hold trivially, so \mathcal{X} = \mathcal{Y} is the only meaningful requirement in UCR. Moreover, the categorization axioms and UCR offer the conditions that a category similarity mapping should satisfy, and state that the input category similarity mapping should be equivalent to the output category similarity mapping with respect to categorization, which is called the similarity assumption. For categorization, it is very challenging to design a proper output category similarity mapping satisfying UCR and the categorization axioms. Usually, the input category similarity mapping is not equivalent to the output category similarity mapping with respect to categorization in practice, which is called the similarity paradox. If the similarity paradox occurs, the categorization error will not be zero. According to the above analysis, the key to resolving the similarity paradox is to keep \mathcal{X} = \mathcal{Y} true. As a matter of fact, it is often the case that \mathcal{X} ≠ \mathcal{Y}; therefore, how to resolve the similarity paradox is an eternal problem in categorization.

In summary, the category representation axioms and the categorization axioms establish the relationships among all the parts related to the categorization input and the categorization output, as shown in Figure 1. UCR establishes the categorization equivalence between the input and the corresponding output. The categorization axioms only establish the relationships between the outer representation and the corresponding inner representation and do not reflect the relation between the input and the output. If the object representation can be theoretically generated by the corresponding cognitive representation, the cognitive representation is called generative. If the object representation cannot be theoretically generated by the corresponding cognitive representation but can determine it, the cognitive representation is called discriminative. If the cognitive representation is generative, the corresponding learning model is called a generative model; if the cognitive representation is discriminative, the corresponding learning model is called a discriminative model.

In particular, let \mathcal{X} = \mathcal{Y} and UCR be true; then (\mathcal{X}, Sim_X) and (\mathcal{Y}, Sim_Y) are exchangeable with respect to categorization. Under such assumptions, (X, U, \mathcal{X}, Sim_X) can be used to represent the categorization result, where (\mathcal{X}, Sim_X) actually denotes (\mathcal{Y}, Sim_Y). In Yu and Xu (2014), ECR and UCR are implicitly assumed to be true; such an assumption makes CE fail to hold, as it is very difficult for (\mathcal{Y}, Sim_Y) to have the same categorization capacity as (X, U) in practice, especially when U is given a priori.
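To make the axioms concrete, the following minimal Python sketch checks SS, CS and CE for a categorization result and reads off the boundary set. The membership matrix V, the similarity values and the tie tolerance are invented for illustration; ties in the similarity operator are kept as sets so that card(\tilde{y}_k) > 1 can be detected.

```python
import numpy as np

def argmax_set(col, tol=1e-12):
    """Return the set of indices attaining the maximum (similarity operator with ties)."""
    return set(np.flatnonzero(col >= col.max() - tol))

def check_axioms(V, S):
    """V: c x n outer membership matrix, S: c x n matrix with S[i, k] = Sim_Y(y_k, Y_i)."""
    c, n = V.shape
    outer = [argmax_set(V[:, k]) for k in range(n)]   # assignment operator  vec{y}_k
    inner = [argmax_set(S[:, k]) for k in range(n)]   # similarity operator  tilde{y}_k
    ss = all(len(s) > 0 for s in inner)                        # SS: every object refers to a category
    cs = all(any(i in s for s in inner) for i in range(c))     # CS: every category owns some object
    ce = all(o == s for o, s in zip(outer, inner))             # CE: outer and inner referring agree
    boundary = [k for k, s in enumerate(inner) if len(s) > 1]  # boundary set: card(tilde{y}_k) > 1
    return ss, cs, ce, boundary

V = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
S = np.array([[0.9, 0.8, 0.5, 0.1],
              [0.1, 0.2, 0.5, 0.9]])
print(check_axioms(V, S))   # (True, True, False, [2]): object 2 lies on the boundary
```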
All the above analysis does not discuss how to evalu-ate the categorization result (
\mathcal{Y}, Sim_Y). Frankly speaking, it is very challenging to test the performance of a categorization algorithm. When estimating the categorization performance, a test set (X_T, U_T) is usually provided and (X, U) is called the training set. According to the analysis in section 2, (\mathcal{X}_T, Sim_{X_T}) exists. Similarly, if (X_T, U_T, \mathcal{X}_T, Sim_{X_T}) is used as the categorization input, the corresponding categorization output can be represented by (Y_T, V_T, \mathcal{Y}_T, Sim_{Y_T}). The test set and the training set are supposed to represent the same categorization for the same categorization algorithm. Therefore, the Categorization Test Axiom can be expressed as follows:

Categorization Test Axiom: For a categorization algorithm, if its training set is (X, U) and its test set is (X_T, U_T), then (\mathcal{X}, Sim_X) = (\mathcal{X}_T, Sim_{X_T}).

Certainly, the categorization test axiom offers the prerequisite condition under which a categorization algorithm has generalization ability, which is a demanding requirement for categorization. It is easy to prove that the categorization test axiom implies that the objects in the training set and the test set should be independent and identically distributed if objects are random variables.

Usually, \mathcal{X} only approximates \mathcal{X}_T. Sometimes, the difference between \mathcal{X} and \mathcal{X}_T is so big that they cannot be considered to represent the same categorization. In this case, the test result will not be credible, and it cannot be checked whether the corresponding categorization algorithm has generalization ability or not. In fact, \mathcal{X} and \mathcal{X}_T are unobservable and unknown, so it is very difficult to measure the difference between them. Instead of measuring the difference between \mathcal{X} and \mathcal{X}_T, one estimation method is to compute the difference between (X, U) and (X_T, U_T); the other is to compute the difference between \mathcal{Y} and \mathcal{Y}_T, assuming that UCR holds or at least approximately holds. Theoretically, the difference between \mathcal{X} and \mathcal{X}_T should be proportional to the difference between \mathcal{Y} and \mathcal{Y}_T in the ideal case. Therefore, the categorization robustness assumption can be described as follows:

Categorization Robustness Assumption:
A categorization algorithm is called robust if there exist two constants k_1 and k_2 such that k_1 |\mathcal{Y} − \mathcal{Y}_T| ≤ |\mathcal{X} − \mathcal{X}_T| ≤ k_2 |\mathcal{Y} − \mathcal{Y}_T|, where 0 < k_1 ≤ k_2.

The categorization robustness assumption gives the global condition under which the corresponding categorization algorithm has generalization ability when the categorization test axiom does not hold. If the categorization test axiom holds, a good categorization algorithm should make |\mathcal{Y} − \mathcal{Y}_T| as small as possible. When the categorization test axiom does not hold, it is very challenging to check whether the categorization robustness assumption holds, as \mathcal{X} and \mathcal{X}_T are usually not known. Therefore, a substitute is to compute the distance between the outer representations. Such an idea leads to the local categorization robustness assumption:

Local Categorization Robustness Assumption:
A categorization algorithm is called locally robust if there exist two constants k_1 and k_2 such that k_1 |(Y, V) − (Y_T, V_T)| ≤ |(X, U) − (X_T, U_T)| ≤ k_2 |(Y, V) − (Y_T, V_T)|, where 0 < k_1 ≤ k_2, (X, U) is a training set and (X_T, U_T) is a test set.

Transparently, if the local categorization robustness assumption is satisfied with respect to |(X, U) − (X_T, U_T)| < ε, where ε is a very small positive number, the corresponding algorithm can be stably evaluated in theory.

Figure 1: Relationship between a categorization input (X, U, \mathcal{X}, Sim_X) and its corresponding categorization result (Y, V, \mathcal{Y}, Sim_Y).

When the categorization axioms were proposed by Yu and Xu (2014), three design principles of clustering methods were also proposed. However, these three design principles need to be reinterpreted when categorization is investigated. It is natural to expect that the five axioms are also useful for developing categorization methods, since the five axioms are proposed to deal with categorization algorithms. Clearly, the five axioms do not have equal importance when designing a categorization method. ECR only tells us how to represent the categorization input and the categorization output. CE is always supposed to be true for a categorization algorithm, since the outer referring and the corresponding inner referring should represent the same referring; in a word, the explicit function of a categorization algorithm should be the same as its internally implemented function. As pointed out by Yu and Xu (2014), SS and CS offer a very low bar for clustering results; similarly, SS and CS are also loose requirements for categorization. UCR is far more demanding, as it requires three equivalence conditions to hold simultaneously. Therefore, three design principles of categorization methods can be inferred from SS, CS and UCR. In the following, we investigate these three principles respectively under the proposed axiomatic framework.

Theorem 1 shows that the conditions of SS are almost no requirement, as the conditions of Theorem 1 are often true in the general case for a well designed category similarity. Following the same analysis as in Yu and Xu (2014), SS should be enhanced into the category compactness principle as follows:
Category Compactness Principle:
A categorization method should make its categorization result as compact as possible. The category compactness principle says that every category should be as compact as possible. Under the proposed representation of the categorization result, the category compactness criterion can be defined as follows.
Category Compactness Criterion: J_C : {Y, V} × {\mathcal{Y}, Ds_Y} → R^+ is called a category compactness criterion if the optimum of J_C(Y, V, \mathcal{Y}, Ds_Y) corresponds to the categorization result with the largest category compactness.

According to the categorization axioms, the category compactness criterion can equivalently be defined by J_C(X, U, \mathcal{X}, Ds_X). In the literature, one often sees J_C(X, U, \mathcal{X}, Ds_X) = Σ_i Σ_k u_{ik} Ds_X(x_k, X_i). Because of the interdependence among (X, U, \mathcal{X}, Ds_X), J_C(X, U, \mathcal{X}, Ds_X) can be further simplified into J_C(X, \mathcal{X}, Ds_X) or J_C(U). Noticing the definition of the category similarity mapping, the category compactness principle is still available for categorization when c = 1.
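As a concrete instance of the criterion J_C(X, U, \mathcal{X}, Ds_X) = Σ_i Σ_k u_{ik} Ds_X(x_k, X_i), the following sketch uses the common (but here assumed) choice of representing each category by the membership-weighted mean of its members and taking Ds_X as the squared Euclidean distance, i.e. the k-means-style within-category scatter.

```python
import numpy as np

def compactness(X, U):
    """J_C = sum_i sum_k u_ik * Ds_X(x_k, X_i), with X_i represented by the weighted
    mean of its members and Ds_X taken as squared Euclidean distance (k-means style)."""
    c, n = U.shape
    J = 0.0
    for i in range(c):
        w = U[i]                                           # memberships of all objects in category i
        if w.sum() == 0:
            continue
        proto = (w[:, None] * X).sum(axis=0) / w.sum()     # inner representation of category i
        J += np.sum(w * np.sum((X - proto) ** 2, axis=1))  # weighted within-category scatter
    return J

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
U = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.]])
print(compactness(X, U))   # 1.0: each category contributes a scatter of 0.5
```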
If a categorization result (Y, V, \mathcal{Y}, Sim_Y) satisfies CS, then ∀ 1 ≤ i ≠ j ≤ c, Y_i ≠ Y_j. For the same reason as in Yu and Xu (2014), CS can be enhanced into the category separation principle as follows:

Category Separation Principle: A good categorization result should have the maximum distance between categories.

Under the proposed representation of the categorization result, the category separation criterion can be defined as follows.

Category Separation Criterion: J_S : {Y, V} × {Y_1, Y_2, ..., Y_c} → R^+ is called a category separation criterion if the optimum of J_S(Y, V, {Y_1, Y_2, ..., Y_c}) corresponds to the categorization result with maximal category separation.

The category separation principle requires that c > 1; in other words, when c = 1, the category separation principle is unavailable.

Categorization Consistency Principle

If the categorization input (
X, U, \mathcal{X}, Sim_X) and its corresponding categorization result (Y, V, \mathcal{Y}, Sim_Y) satisfy UCR, the categorization error is zero. However, even for human recognition systems, UCR cannot always be guaranteed to be true. Generally, human recognition systems always try to make the categorization error as small as possible. Therefore, UCR is the most demanding requirement for categorization. If UCR does not hold, a reasonable categorization criterion should make UCR hold as approximately as possible, which results in the categorization consistency principle:

Categorization Consistency Principle: When UCR does not hold, a good categorization result should make UCR as approximately correct as possible.

When UCR does not hold, the categorization consistency principle can be used to design a categorization criterion as follows:

Categorization Consistency Criterion: J_E : {X, \vec{X}, \mathcal{X}, \tilde{X}} × {Y, \vec{Y}, \mathcal{Y}, \tilde{Y}} → R^+ is called a categorization consistency criterion if the optimum of J_E(X, \vec{X}, \mathcal{X}, \tilde{X}, Y, \vec{Y}, \mathcal{Y}, \tilde{Y}) corresponds to the categorization result with the minimum difference between (\vec{X}, \mathcal{X}, \tilde{X}) and (\vec{Y}, \mathcal{Y}, \tilde{Y}).

Clearly, if UCR cannot be true, the categorization consistency principle should be the first principle when designing a categorization algorithm, no matter what the number of categories is. Frankly speaking, it is not usually expected that (\mathcal{X}, Sim_X) and (\mathcal{Y}, Sim_Y) are obtained simultaneously. Usually, (\mathcal{X}, Sim_X) is interchanged with or approximated by (\mathcal{Y}, Sim_Y) when designing a categorization algorithm. In many categorization algorithms, UCR is supposed to be true but is not actually true. Under such an assumption, the category compactness principle and the category separation principle should be used to design categorization methods.

For a specific categorization problem, there exist many categorization models. The category compactness principle, the category separation principle and the categorization consistency principle only select the optimal parameters among candidate models with the same inner category representation; they cannot choose the optimal model among different inner category representations. How can an appropriate categorization model be selected among different inner category representations? Occam's razor is a popular tool for human beings to choose models among different representations; it states that "plurality should not be posited without necessity". Therefore, a simpler categorization model should be selected among candidate models with the same performance.

What is a simple categorization model? As the categorization problem can be represented by the categorization input (X, U, \mathcal{X}, Sim_X) and the corresponding categorization output (Y, V, \mathcal{Y}, Sim_Y), a model with a simple categorization input and output is considered simple. When c = 1, ∀k, \vec{x}_k = 1 and ∀k, \tilde{x}_k = 1; therefore, it is enough to study \mathcal{X} and \mathcal{Y} in order to obey UCR or its approximate version, the categorization consistency principle, and (U, Sim_X, V, Sim_Y) can be omitted when designing a categorization model. If such an assumption holds, the problem can be considered a simple categorization problem. Otherwise, if c ≥ 2 and Y = X, V can be replaced by Sim_Y because CE always holds; hence, (Y, V, \mathcal{Y}, Sim_Y) can be represented by (\mathcal{Y}, Sim_Y). Similarly, (X, U, \mathcal{X}, Sim_X) can be represented by (X, U). In this case, it is enough to deal with (
X, U, \mathcal{Y}, Sim_Y) for such a categorization problem. Clearly, this is also a simple categorization case. Of course, such simplified categorization models can be further simplified by selecting a simpler \mathcal{Y}. In summary, Occam's razor can be used to discuss categorization model complexity. In the following, we study categorization models according to model complexity from the Occam's razor point of view.

In this section, we study categorization models according to the analysis in section 5.4. When c = 1, categorization becomes a one-category problem, including density estimation, regression and some dimensionality reduction methods. When c > 1, categorization is a multiple-category problem, including clustering and classification: when U is not known for c > 1, categorization becomes clustering; when U is known for c > 1, categorization becomes classification.

In the following, we give several examples to show how to interpret dimensionality reduction methods based on the proposed axioms and principles. For simplicity, assume that X = [x_{kr}]_{n×p} is sampled from some underlying structure in a space with dimensionality p, and that the same sample can also be represented by Y = [y_{kr}]_{n×d} in a low dimensional space with dimensionality d, where p >> d. Such a categorization problem is called dimensionality reduction.

If U is not known, the problem is called unsupervised dimensionality reduction. Unsupervised dimensionality reduction has the categorization input (X, U, \mathcal{X}, Ds_X) and the categorization output (Y, V, \mathcal{Y}, Ds_Y); therefore, unsupervised dimensionality reduction can be considered a categorization problem. In this section, we further assume that c = 1. Under this assumption, \tilde{X} = \tilde{Y} and \vec{X} = \vec{Y} hold trivially, and UCR only requires that \mathcal{X} = \mathcal{Y}. If UCR does not hold, the categorization consistency principle naturally requires that \mathcal{X} approximate \mathcal{Y} as closely as possible. If UCR does hold, the category compactness principle implies that the best \mathcal{X} should make the underlying category the most compact.

PCA (Pearson, 1901; Hotelling, 1933; Abdi and Williams, 2010):
Let \mathcal{X} = \mathcal{Y} = [x_0; w_1; w_2; ...; w_d] represent the ordered orthonormal basis {w_1, w_2, ..., w_d} with origin x_0, and let Y = [y_{kr}]_{n×d} be the coordinates of the objects O = {o_1, o_2, ..., o_n} in this basis. Then w_i w_j^T = δ_{ij}, where δ_{ij} = 1 if i = j and δ_{ij} = 0 if i ≠ j, y_{kr} = (x_k − x_0) w_r^T, and x_0 and the w_i are 1 × p vectors.

Let Ds_X(x, \mathcal{X}) = (x − x_0 − Σ_i (x − x_0) w_i^T w_i)(x − x_0 − Σ_i (x − x_0) w_i^T w_i)^T represent the dissimilarity between x and the category representation \mathcal{X}. It is easy to prove that Ds_X(x, \mathcal{X}) = (x − x_0)(x − x_0)^T − Σ_i w_i (x − x_0)^T (x − x_0) w_i^T. Obviously, if x − x_0 is a linear combination of the ordered orthonormal basis {w_1, w_2, ..., w_d}, then Ds_X(x, \mathcal{X}) = 0, meaning that x can be perfectly represented by Y. If Ds_X(x_k, \mathcal{X}) = 0 for all x_k, then every object has coordinates in the basis {w_1, w_2, ..., w_d} with origin x_0 with zero residual. In general, it is not true that Ds_X(x_k, \mathcal{X}) = 0 for all x_k.

As UCR holds, the category compactness principle is used to seek the best \mathcal{X}, which means that a good \mathcal{X} should minimize the objective function (1) subject to w_i w_j^T = δ_{ij} for all i, j:

min_{\mathcal{X}} Σ_k Ds_X(x_k, \mathcal{X}) = Σ_k (x_k − x_0)(x_k − x_0)^T − Σ_i w_i (Σ_k (x_k − x_0)^T (x_k − x_0)) w_i^T    (1)

By the Lagrange multiplier method, the objective function can be rewritten as (2):

L = Σ_k (x_k − x_0)(x_k − x_0)^T − Σ_i w_i (Σ_k (x_k − x_0)^T (x_k − x_0)) w_i^T − Σ_i λ_i (w_i w_i^T − 1)    (2)

The equations (3) are obtained by differentiating (2):

∂L/∂x_0 = −Σ_k (x_k − x_0)(I_p − Σ_i w_i^T w_i) = 0
∂L/∂w_i = 2 w_i Σ_k (x_k − x_0)^T (x_k − x_0) − 2 λ_i w_i = 0    (3)

Hence, the solution of minimizing (1) subject to w_i w_j^T = δ_{ij} is given by (4):

x_0 = (1/n) Σ_k x_k,    w_i Σ_k (x_k − x_0)^T (x_k − x_0) = λ_i w_i    (4)

Equation (4) and the minimization of (1) recover traditional principal component analysis. The proposed axiomatic framework of categorization has thus offered a new interpretation of principal component analysis.
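The derivation can be checked numerically: under the squared-residual dissimilarity above, the minimizer of (1) is given by the origin x_0 of (4) and the leading eigenvectors of the scatter matrix, which is exactly classical PCA. A minimal numpy sketch on synthetic data:

```python
import numpy as np

def pca(X, d):
    """Return the origin x_0 and the d ordered orthonormal directions w_1..w_d that
    minimize the residual dissimilarity sum_k Ds_X(x_k, X) of equation (1)."""
    x0 = X.mean(axis=0)                      # x_0 = (1/n) * sum_k x_k, as in (4)
    S = (X - x0).T @ (X - x0)                # scatter matrix sum_k (x_k - x_0)^T (x_k - x_0)
    vals, vecs = np.linalg.eigh(S)           # eigenvectors satisfy w_i S = lambda_i w_i
    order = np.argsort(vals)[::-1][:d]       # keep the d largest eigenvalues
    W = vecs[:, order].T                     # rows are w_1, ..., w_d
    Y = (X - x0) @ W.T                       # coordinates y_kr = (x_k - x_0) w_r^T
    return x0, W, Y

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
x0, W, Y = pca(X, d=2)
print(W @ W.T)    # approximately the 2 x 2 identity: the basis is orthonormal
```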
NMF (Lee and Seung, 1999): Let Y = H = [h_{kr}]_{n×d} and \mathcal{X} = \mathcal{Y} = W = [w_1; w_2; ...; w_d] represent the ordered basis {w_1, w_2, ..., w_d}, where Y = [h_{kr}]_{n×d} are the coordinates of the objects O = {o_1, o_2, ..., o_n} in this basis, all the elements of each w_i are nonnegative, and h_{kr} is nonnegative for all k, r.

Let Ds_X(x_k, \mathcal{X}) = (x_k − Σ_i h_{ki} w_i)(x_k − Σ_i h_{ki} w_i)^T. As UCR holds, the category compactness principle is used to seek the best \mathcal{X}, which means that a good \mathcal{X} should minimize the objective function (5):

min_{\mathcal{X}} Σ_k Ds_X(x_k, \mathcal{X}) = Σ_k (x_k − Σ_i h_{ki} w_i)(x_k − Σ_i h_{ki} w_i)^T = ||X − HW||^2    (5)

Minimizing (5) yields nonnegative matrix factorization (Lee and Seung, 1999).
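Minimizing (5) under the nonnegativity constraints has no closed form; one common choice is the multiplicative update rule of Lee and Seung (1999). A minimal sketch, assuming nonnegative input data:

```python
import numpy as np

def nmf(X, d, iters=200, eps=1e-9):
    """Approximately minimize ||X - H W||^2 with H (n x d) and W (d x p) nonnegative,
    using the multiplicative update rules of Lee and Seung (1999)."""
    n, p = X.shape
    rng = np.random.default_rng(0)
    H = rng.random((n, d))
    W = rng.random((d, p))
    for _ in range(iters):
        H *= (X @ W.T) / (H @ W @ W.T + eps)   # update coordinates H
        W *= (H.T @ X) / (H.T @ H @ W + eps)   # update basis W
    return H, W

X = np.abs(np.random.default_rng(1).normal(size=(50, 10)))   # nonnegative data
H, W = nmf(X, d=3)
print(np.linalg.norm(X - H @ W))   # reconstruction error decreases with more iterations
```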
CCA (Hotelling, 1936): Let \mathcal{X} = Xa^T/|Xa^T| and \mathcal{Y} = Yb^T/|Yb^T|, where a is a 1 × p vector and b is a 1 × d vector. However, \mathcal{X} = \mathcal{Y} does not hold in general, so UCR is not true. Therefore, we use the categorization consistency principle, which means minimizing the objective function (6):

min_{a,b} L(\mathcal{X}, \mathcal{Y}) = |\mathcal{X} − \mathcal{Y}|^2 = | Xa^T/|Xa^T| − Yb^T/|Yb^T| |^2 = 2 − 2 (Xa^T, Yb^T) / (|Xa^T| |Yb^T|)    (6)

Obviously, minimizing (6) is equivalent to maximizing (7):

(Xa^T, Yb^T) / (|Xa^T| |Yb^T|) = a X^T Y b^T / (√(a X^T X a^T) √(b Y^T Y b^T))    (7)

Hence, canonical correlation analysis is obtained by maximizing (7).

LLE (Roweis and Saul, 2000):
Let \mathcal{X} = W_X = [w_{kl}]_{n×n} and Ds_X(x_k, \mathcal{X}) = Ds_X(x_k, W) = |x_k − Σ_{j∈N(k)} w_{kj} x_j|^2, where Σ_l w_{kl} = 1, w_{kl} ≥ 0, w_{kl} = 0 if l ∉ N(k), and N(k) = {j | x_j is a neighbor of x_k}. As UCR holds, the category compactness principle is used to seek the best \mathcal{X}: a good category representation \mathcal{X} = W should minimize the objective function (8):

min_W Σ_k Ds_X(x_k, W) = Σ_k |x_k − Σ_{j∈N(k)} w_{kj} x_j|^2    (8)

According to UCR, \mathcal{X} = \mathcal{Y} implies that \mathcal{Y} = W. Setting Ds_Y(y_k, \mathcal{Y}) = Ds_Y(y_k, W) = |y_k − Σ_{j∈N(k)} w_{kj} y_j|^2, the category compactness principle tells us that a good Y should minimize the objective function (9):

min_Y Σ_k Ds_Y(y_k, W) = Σ_k |y_k − Σ_{j∈N(k)} w_{kj} y_j|^2    (9)

In this way, the locally linear embedding algorithm results from minimizing (8) and (9).

MDS (Kruskal and Wish, 1978):
Let \mathcal{X} = D_X = [d^X_{kl}]_{n×n} and \mathcal{Y} = D_Y = [d^Y_{kl}]_{n×n}, where d^X_{kl} = |x_k − x_l| and d^Y_{kl} = |y_k − y_l|. Clearly \mathcal{X} = \mathcal{Y} cannot hold in general. Therefore, the categorization consistency principle is used, which requires that a good Y should minimize the objective function (10):

min_Y L(\mathcal{X}, \mathcal{Y}) = L(D_X, D_Y)    (10)

Naturally, the multidimensional scaling (MDS) algorithm is obtained by minimizing the objective function (10).

ISOMAP (Tenenbaum et al., 2000):
Let \mathcal{X} = D_X = [d^X_{kl}]_{n×n} and \mathcal{Y} = D_Y = [d^Y_{kl}]_{n×n}, where d^X_{kl} represents the geodesic distance between x_k and x_l and d^Y_{kl} = |y_k − y_l|. It is impossible for \mathcal{X} = \mathcal{Y} to hold, so the categorization consistency principle requires minimizing (10). According to the above analysis, the multidimensional scaling (MDS) algorithm can be used to compute Y; in this way, the ISOMAP algorithm is obtained.

If n points x_1, x_2, ..., x_n are sampled from a random variable with unknown probability density function f, then f is expected to be constructed from the observed data X = {x_1, x_2, ..., x_n}, which is called density estimation; f is called the expected density function. Set X = Y, \mathcal{X} = f, \mathcal{Y} = \hat{f}, U = [1, 1, ..., 1]_{1×n} and V = [1, 1, ..., 1]_{1×n}; density estimation can then be considered a categorization problem with categorization input (X, U, \mathcal{X}, Ds_X) and categorization output (Y, V, \mathcal{Y}, Ds_Y), i.e. density estimation is a categorization problem with only one category. In the following, \hat{f} is called the density estimator. Because all points belong to one category, \vec{U} = \vec{V} and \tilde{X} = \tilde{Y} hold. However, \mathcal{X} ≠ \mathcal{Y} in general; therefore, UCR does not hold.

One method of density estimation is parametric estimation. If p(x) is supposed to belong to the distribution family p(x|θ), density estimation is transformed into estimating θ; in other words, density estimation becomes parametric estimation. In this case, \mathcal{X} = θ and Ds_X(x, θ) = −log p(x|θ). Let \hat{θ} be the estimate of θ; then \mathcal{Y} = \hat{θ} and Ds_Y(x, \hat{θ}) = −log p(x|\hat{θ}). The category compactness principle requires minimizing the intra-category variance, which results in the objective function (11):

min_{\hat{θ}} Σ_{k=1}^{n} Ds_Y(x_k, \hat{θ}) = min_{\hat{θ}} Σ_{k=1}^{n} −log p(x_k|\hat{θ})    (11)

It is easy to see that the maximum likelihood method is equivalent to minimizing (11). For example, let x_k ∈ R^p for all k, x ∈ R^p, and p(x|\hat{θ}) = (1/((√(2π))^p \hat{σ}^p)) exp[−(x − \hat{µ})(x − \hat{µ})^T / (2\hat{σ}^2)], where \hat{θ} = {\hat{µ}, \hat{σ}}. According to equation (11), the objective function (12) can be inferred:

L = Σ_{k=1}^{n} −log p(x_k|\hat{θ}) = Σ_{k=1}^{n} ( |x_k − \hat{µ}|^2 / (2\hat{σ}^2) + log((√(2π))^p \hat{σ}^p) )    (12)

Minimizing (12) leads to the estimates \hat{µ} = (1/n) Σ_{k=1}^{n} x_k and \hat{σ}^2 = (1/(np)) Σ_{k=1}^{n} |x_k − \hat{µ}|^2.

Another method of density estimation is nonparametric estimation, in which less rigid assumptions are made about f. In the literature (Silverman, 1986), nonparametric density estimators include histograms, kernel density estimation, the k-nearest neighbor method, etc. Clearly, the key problem for density estimation is to estimate the difference between \hat{f} and f. In theory, the minimum difference between \hat{f} and f should be expected according to the categorization consistency principle. In the literature, theoretical conditions for \hat{f} = f have been well studied from the limit point of view (Silverman, 1986).

Generally, if n points (\hat{x}_1, f(\hat{x}_1)), (\hat{x}_2, f(\hat{x}_2)), ..., (\hat{x}_n, f(\hat{x}_n)) are sampled from (\hat{x}, f(\hat{x})) and f is not known but is expected to be learned, such a problem is called regression. Usually, f is called the expected regression function. Set X = [(\hat{x}_1, f(\hat{x}_1)); (\hat{x}_2, f(\hat{x}_2)); ...; (\hat{x}_n, f(\hat{x}_n))], Y = [(\hat{x}_1, F(\hat{x}_1)); (\hat{x}_2, F(\hat{x}_2)); ...; (\hat{x}_n, F(\hat{x}_n))], \mathcal{X} = (\hat{x}, f(\hat{x})) and \mathcal{Y} = (\hat{x}, F(\hat{x})), where F is called the predicted regression function, and U = [1, 1, ..., 1]_{1×n}, V = [1, 1, ..., 1]_{1×n}. Then regression has the categorization input (X, U, \mathcal{X}, Ds_X) and the categorization output (Y, V, \mathcal{Y}, Ds_Y); in other words, regression can be considered a categorization problem with only one category. Because all points belong to one category, \vec{U} = \vec{V} and \tilde{X} = \tilde{Y}. However, \mathcal{X} ≠ \mathcal{Y} in general cases; therefore, UCR does not hold. According to the categorization consistency principle, a good category representation \mathcal{Y} should minimize the following objective function:

|\mathcal{X} − \mathcal{Y}| = D(f(\hat{x}), F(\hat{x}))    (13)

It is impossible to directly compute D(f(\hat{x}), F(\hat{x})), as f is unknown; therefore, different definitions of D(f(\hat{x}), F(\hat{x})) lead to different regression algorithms. For example, set f(\hat{x}) ∈ R and F(\hat{x}) = \hat{w}\hat{x}^T + b, and assume that the dimensionality of \hat{x} is τ. If D(f(\hat{x}), F(\hat{x})) = Σ_{k=1}^{n} ||f(\hat{x}_k) − F(\hat{x}_k)||^2, linear regression is obtained by minimizing (13) when n >> τ. When n << τ, many feasible solutions reach the same minimum of (13), since n << τ implies that minimizing (13) faces a singular problem. How should the optimal solution be selected among the many feasible solutions of minimizing (13)? A natural idea is to select the feasible solution with minimum norm. If the Euclidean norm is used, D(f(\hat{x}), F(\hat{x})) can be defined by Σ_{k=1}^{n} ||f(\hat{x}_k) − F(\hat{x}_k)||^2 + λ||w||^2; hence, ridge regression is obtained by minimizing (13). When the L_1 norm is used, D(f(\hat{x}), F(\hat{x})) can be defined by Σ_{k=1}^{n} ||f(\hat{x}_k) − F(\hat{x}_k)||^2 + λ||w||_{L_1}; in this way, Lasso regression is obtained by minimizing (13) (Tibshirani, 1994).
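As a concrete check of (11)-(12), the following sketch fits an isotropic Gaussian by minimizing the negative log-likelihood; under that model the closed-form minimizer is the sample mean and the per-dimension sample variance. The data are synthetic and the comparison at the end simply confirms that a perturbed parameter gives a larger objective value.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=(1000, 3))   # synthetic sample, p = 3

# Maximum likelihood = minimizing sum_k -log p(x_k | theta_hat), equation (11).
mu_hat = X.mean(axis=0)                                                 # sample mean
sigma2_hat = np.mean(np.sum((X - mu_hat) ** 2, axis=1)) / X.shape[1]    # per-dimension variance

def neg_log_likelihood(mu, sigma2):
    """Objective (12) for an isotropic Gaussian with mean mu and variance sigma2."""
    p = X.shape[1]
    sq = np.sum((X - mu) ** 2, axis=1)
    return np.sum(0.5 * sq / sigma2 + 0.5 * p * np.log(2 * np.pi * sigma2))

# The closed-form estimate attains a lower value than a perturbed parameter.
print(neg_log_likelihood(mu_hat, sigma2_hat) < neg_log_likelihood(mu_hat + 0.5, sigma2_hat))
```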
For clustering, (X, U, \mathcal{X}, Sim_X) is called the clustering input and (Y, V, \mathcal{Y}, Sim_Y) is called the clustering result. Since U and V are unknown a priori for clustering, it is always supposed that the inner input and the corresponding inner output are the same, i.e. (\mathcal{X}, Sim_X) = (\mathcal{Y}, Sim_Y). Under that assumption, it is further assumed that U = V for clustering. When Y = X, the outer input and the outer output are the same, which implies that (X, U, \mathcal{X}, Sim_X) and (Y, V, \mathcal{Y}, Sim_Y) are exchangeable with respect to clustering; in a word, (X, U, \mathcal{X}, Sim_X) also represents the clustering result. As Sim_X and Sim_Y are the same, Sim can denote both Sim_X and Sim_Y for clustering. Hence, the theoretical analysis of clustering in Yu and Xu (2014) remains true under the new categorization interpretation of this paper.

Even if Y ≠ X, (U, \mathcal{X}, \tilde{X}) = (V, \mathcal{Y}, \tilde{Y}) still holds for clustering, which means that ECR and UCR are still true. In other words, ECR and UCR can always be omitted for clustering, so that SS, CS and CE play the more important role; frankly speaking, SS, CS and CE are enough for clustering. Of course, when Y ≠ X, such clustering algorithms usually have a feature extraction step, as in spectral clustering.

For classification, a category is called a class. In order to be consistent with the literature, (
X, U, X, Sim X ) iscalled classification training input and categorizationresult ( Y, V, Y , Sim Y ) is called classification trainingoutput in this section. More specifically, ( X, U ) iscalled the training set, (
X, Sim X ) is called the ex-pected classifier, ( Y, V ) is called the training result,(
\mathcal{Y}, Sim_Y) is called the learned classifier. ECR and the categorization axioms are usually true for classification; however, UCR is usually not true. If UCR is true, the classification error is zero. In practice, a classification method can only make its classification result reach the minimum classification error, and usually its classification error is not zero. Therefore, UCR should serve as a constraint for a classification problem; in other words, when dealing with a classification problem, UCR should be true as much as possible in probability.

When U is a proper partition, the corresponding classification problem is the standard classification problem. When U is an overlapping partition, the corresponding classification problem is the multi-label classification problem. For multi-label classification, SS should be generalized as ∀k ∃i (i ∈ \tilde{x}_k); under such a generalization, multi-label classification also follows SS.

When a classification result (Y, V, \mathcal{Y}, Sim_Y) is output, we can predict which category a new object should be assigned to. In theory, the decision region for a classification result (Y, V, \mathcal{Y}, Sim_Y) can be defined as follows:

Decision Region: Ω = {x | ∃i ((\tilde{y} = i) ∧ (y = θ(x)))}.

In particular, the decision region for a class Y_i can be defined as follows:

Decision Region for a Class Y_i: Ω_i = {x | (\tilde{y} = i) ∧ (y = θ(x))}.

Therefore, ∪_i Ω_i = Ω. The boundary for a classification result (Y, V, \mathcal{Y}, Sim_Y) can be defined as follows:

Boundary: ∂Ω = \bar{Ω} − Ω^◇, where \bar{Ω} represents the closure of Ω and Ω^◇ represents the interior of Ω.

The training decision region can be defined as follows:

Training Decision Region: Ω_{(\mathcal{Y}, Sim_Y)} = {x | ∃i ∃k ((x ∈ Ω_i) ∧ (x_k ∈ Ω_i) ∧ (Sim_Y(θ(x), Y_i) ≥ Sim_Y(θ(x_k), Y_i)))}.

Training Decision Region for a Class Y_i: Ω_{Y_i} = {x | ∃k ((x ∈ Ω_i) ∧ (x_k ∈ Ω_i) ∧ (Sim_Y(θ(x), Y_i) ≥ Sim_Y(θ(x_k), Y_i)))}.

The support vector for a classification result (Y, V, \mathcal{Y}, Sim_Y) can be defined as follows:

Support Vector: If x_k ∈ ∂Ω_{(\mathcal{Y}, Sim_Y)}, then x_k is called a support vector for the classification result (Y, V, \mathcal{Y}, Sim_Y).

The margin for a classification result (Y, V, \mathcal{Y}, Sim_Y) can be defined as Margin_{(\mathcal{Y}, Sim_Y)} = min_{i≠j} d(Ω_{Y_i}, Ω_{Y_j}), where d(Ω_{Y_i}, Ω_{Y_j}) represents the distance between Ω_{Y_i} and Ω_{Y_j}.

Transparently, the decision region is used to judge which category an object should be assigned to, while the training decision region focuses on judging the quality of the classification result.

In the literature, one common idea for designing a classification algorithm is to transform classification into regression. In order to do this, a regression function needs to be defined; in the following, we do so according to the proposed axiomatic framework. The expected regression function can be defined as ρ(k) = \vec{x}_k, where U is a proper partition. Under this circumstance, CE states that ρ(k) = \tilde{x}_k holds for a classification result. Similarly, when V is a proper partition, we set H(k) = \vec{y}_k; then CE guarantees that H(k) = \tilde{y}_k holds. Generally speaking, x denotes the input object representation and y denotes the corresponding output object representation. As y = θ(x) and ρ(x) denotes \vec{x}, the predicted regression function can be defined as h(x) = H(θ(x)) = H(y) = \tilde{y}, i.e. h(x) represents the predicted label.

Set X = [(x_1, ρ(x_1)); (x_2, ρ(x_2)); ...; (x_n, ρ(x_n))], Y = [(x_1, h(x_1)); (x_2, h(x_2)); ...; (x_n, h(x_n))], \mathcal{X} = (x, ρ(x)) and \mathcal{Y} = (x, h(x)). Therefore, classification can be considered regression. Using this notation, UCR requires that \mathcal{X} = \mathcal{Y}, which means ∀x (ρ(x) = h(x)). In practice, this is impossible, as ρ(x) is not known a priori; only ρ(x_k) is known for k ∈ {1, 2, ..., n}. Therefore, it is natural to relax ∀x (ρ(x) = h(x)) to P(ρ(x) ≠ h(x)) ≤ ε. PAC theory has provided a theoretical investigation of sufficient conditions for making P(ρ(x) ≠ h(x)) ≤ ε hold with probability not less than 1 − δ (Valiant, 1984).

Therefore, UCR is very important for classification. For developing a classification method, the categorization consistency principle requires that Σ_{k=1}^{n} L(ρ(x_k), h(x_k)) reach its minimum, which is usually called minimizing the empirical risk. Transparently, neural networks can be introduced by minimizing the empirical risk. Usually, the more complex h(x) is, the smaller the empirical risk; therefore, the tradeoff between the empirical risk and the function complexity leads to the structural risk (Vapnik, 2000). In particular, when c = 2, ρ(x) ∈ {1, 2}.
Set h(x) = 1 + π(x) and

L(ρ(x), h(x)) = −(ρ(x) − 1) log(h(x) − 1) − (2 − ρ(x)) log(2 − h(x)) = −(ρ(x) − 1) log(π(x)) − (2 − ρ(x)) log(1 − π(x)),

where π(x) = exp(wx^T + b) / (1 + exp(wx^T + b)). Then equation (13) tells us that the objective function of the binomial logistic regression model (Hosmer Jr and Lemeshow, 2004) can be expressed as follows:

min_{\mathcal{Y}} Σ_{k=1}^{n} L(ρ(x_k), h(x_k)) = −Σ_{k=1}^{n} (ρ(x_k) − 1)(w x_k^T + b) + Σ_{k=1}^{n} log(1 + exp(w x_k^T + b))    (14)

X = Y

However, many classification methods are not developed by transforming classification into regression. In order to show this clearly, we simply assume Y = X; then the classification result omits Y, as X is known a priori. By the analysis in Section 5.4, it is enough to study (X, U, \mathcal{Y}, Sim_Y) under such a simplification. Since (X, U) is known for classification, the simplest \mathcal{Y} should be preferred according to Occam's razor. In the following, U = [u_{ik}]_{c×n} is a hard partition.
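A minimal sketch of minimizing (14) by gradient descent for the two-class case; the data are synthetic and the labels ρ(x_k) ∈ {1, 2} are recoded as t_k = ρ(x_k) − 1 ∈ {0, 1}.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=-1.0, size=(50, 2))      # class 1  (rho = 1, target t = 0)
X2 = rng.normal(loc=+1.0, size=(50, 2))      # class 2  (rho = 2, target t = 1)
X = np.vstack([X1, X2])
t = np.concatenate([np.zeros(50), np.ones(50)])   # t_k = rho(x_k) - 1

w = np.zeros(2); b = 0.0
lr = 0.1
for _ in range(500):
    z = X @ w + b
    pi = 1.0 / (1.0 + np.exp(-z))            # pi(x) = exp(z) / (1 + exp(z))
    # (averaged) gradient of the objective (14): sum_k (pi(x_k) - t_k) x_k, and likewise for b
    grad_w = X.T @ (pi - t) / len(t)
    grad_b = np.mean(pi - t)
    w -= lr * grad_w
    b -= lr * grad_b

print(np.mean((pi > 0.5) == (t == 1)))       # training accuracy of the fitted model
```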
Example 1: It is simplest to set \mathcal{Y} = \mathcal{X}, which means that Y_i = X_i for all i. Under such an assumption, we do not know any essential information about \mathcal{Y} except for \mathcal{X}. When Y_i = X_i for all i, it is natural to set Sim_Y(y, Y_i) = Sim_Y(x, Y_i) = |N_i(x)| / K, where N_i(x) = {x_l | x_l ∈ X_i ∧ x_l ∈ K-nearest neighborhood of x}. Under the above assumption, the K-nearest neighbor classification method (Cover and Hart, 1967) is obtained. The categorization result of K-nearest neighbor classification follows the categorization axioms in general cases; clearly, however, UCR does not hold for K-nearest neighbor classification in general.
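A minimal sketch of Example 1 with Sim_Y(x, Y_i) = |N_i(x)|/K; the value of K and the toy data are illustrative.

```python
import numpy as np

def knn_predict(X_train, labels, x, K=3):
    """Sim_Y(x, Y_i) = |N_i(x)| / K: count how many of the K nearest neighbours of x
    belong to each class, then assign x to the most similar class."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = labels[np.argsort(dist)[:K]]
    classes = np.unique(labels)
    sims = np.array([(nearest == c).mean() for c in classes])
    return classes[np.argmax(sims)], sims

X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, labels, np.array([0.5, 0.5])))   # (0, [1.0, 0.0])
```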
Example 2: Let X = [x_{kr}]_{n×p}, Sim_Y(y, Y_i) = Sim_Y(x, Y_i), and let g_i(x) = log Sim_Y(x, Y_i) be the discriminant function. SS requires that an object x be assigned to class Y_i if g_i(x) = max_j g_j(x). Occam's razor states that a simpler \mathcal{Y} is preferred. In theory, if every Y_i is represented by (w_i, w_{i0}), where w_i is a 1 × p vector and w_{i0} ∈ R, then g_i(x) = log Sim_Y(x, Y_i) = w_i x^T + w_{i0}. Such a categorization model is simpler; it is linear discriminant analysis (Fisher, 1936). Transparently, linear discriminant analysis also satisfies the categorization axioms.

Example 3:
In particular, when c = 2, it is natural to set Y_i = (w_i, w_{i0}) for each i. Occam's razor states that fewer parameters should be preferred. If we set Y_1 = (w, b − 1) and Y_2 = (−w, −b − 1), this is the simplest linear classification representation according to Occam's razor. In this case, g_1(x) = log Sim_Y(x, Y_1) = wx^T + b − 1 and g_2(x) = log Sim_Y(x, Y_2) = −wx^T − b − 1. Setting wx_k^T + b − 1 ≥ 0 for all x_k ∈ X_1 and −wx_k^T − b − 1 ≥ 0 for all x_k ∈ X_2, the categorization axioms hold. Therefore, the category separation principle states that the optimal linear discrimination should keep the distance between the two parallel hyperplanes wx^T + b = 1 and wx^T + b = −1 as large as possible when UCR holds, which leads to the famous support vector machine.

The training decision region for the support vector machine is Ω_{(\mathcal{Y}, Sim_Y)} = {x | wx^T + b − 1 ≥ 0} ∪ {x | −wx^T − b − 1 ≥ 0}, and it is easy to prove that Margin_{(\mathcal{Y}, Sim_Y)} = 2/√(ww^T). A larger Margin_{(\mathcal{Y}, Sim_Y)} means better generalization for the support vector machine, which has been proved by statistical learning theory (Vapnik, 2000).
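A minimal sketch of the margin idea: for linearly separable synthetic data, minimizing a lightly regularized hinge loss approximates the maximum-margin separator, and the distance between the two supporting hyperplanes is 2/√(ww^T). This is an illustrative approximation, not the exact quadratic-programming formulation of the support vector machine.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=+2.0, size=(50, 2))      # class X_1: want  w x^T + b >= +1
X2 = rng.normal(loc=-2.0, size=(50, 2))      # class X_2: want  w x^T + b <= -1
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(50), -np.ones(50)])

w = np.zeros(2); b = 0.0
lam, lr = 1e-3, 0.1
for _ in range(2000):
    margins = t * (X @ w + b)
    viol = margins < 1                                        # points violating the margin constraints
    grad_w = lam * w - (t[viol, None] * X[viol]).sum(axis=0) / len(t)
    grad_b = -t[viol].sum() / len(t)
    w -= lr * grad_w
    b -= lr * grad_b

print(2.0 / np.linalg.norm(w))    # the margin between the two supporting hyperplanes
```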
Example 4: Let Y_i = (w_i, w_{i0}) for 1 ≤ i ≤ c − 1, with Y_c unknown, and let Sim_Y(x, Y_i) = exp(w_i x^T + w_{i0}) / (1 + Σ_{j=1}^{c−1} exp(w_j x^T + w_{j0})) for 1 ≤ i ≤ c − 1 and Sim_Y(x, Y_c) = 1 / (1 + Σ_{j=1}^{c−1} exp(w_j x^T + w_{j0})). According to the category compactness principle, we should maximize the following objective function:

max_{Y_1, Y_2, ..., Y_{c−1}} Σ_{k=1}^{n} Σ_{i=1}^{c} u_{ik} log Sim_Y(x_k, Y_i) = Σ_{k=1}^{n} Σ_{i=1}^{c−1} u_{ik}(w_i x_k^T + w_{i0}) − Σ_{k=1}^{n} log(1 + Σ_{i=1}^{c−1} exp(w_i x_k^T + w_{i0}))    (15)

Such a categorization model is called logistic regression (Cox, 1958). According to Occam's razor, logistic regression is more complex than linear discriminant analysis. When c >
2, logistic regression should not be considered a regression model, as no regression function can be defined. Moreover, the c-th class can be considered noise in logistic regression.

Example 5:
For a categorization model, we do not need the concrete form of every Y_i explicitly: no matter how complicated \mathcal{Y} is, it is enough to compute Sim_Y. If Sim_Y(y, Y_i) = Sim_Y(x, Y_i) = P(x, Y_i) and v_{ik} = P(Y_i | x_k), the Bayes classifier almost follows the categorization axioms, since the output y = x ∈ Y_i precisely when Sim_Y(x, Y_i) = max_j Sim_Y(x, Y_j) = max_j P(x, Y_j) = P(x, Y_i), and Bayes' theorem guarantees that arg max_i P(x, Y_i) = arg max_i P(Y_i | x). Therefore, it is very important for the Bayes classifier to estimate Sim_Y or V from (X, U).

In particular, assume that X = [x_{kr}]_{n×p} represents n objects and x = [x^*_1, x^*_2, ..., x^*_p] represents an object, where x^*_r is the r-th feature. According to the categorization axioms, it is enough to calculate max_i P(x, Y_i) in order to classify x. According to Occam's razor, we should select the simplest way to calculate P(x, Y_i). The simplest way to estimate P(x | Y_i) is to assume that each feature is conditionally independent of every other feature given Y_i; then P(x | Y_i) = Π_{r=1}^{p} P(x^*_r | Y_i). Let P(Y_i) = card(X_i) / n; then Sim_Y(x, Y_i) can be computed as P(Y_i) Π_{r=1}^{p} P(x^*_r | Y_i). Based on the above analysis, the naive Bayes classifier (Duda et al., 1973) classifies x according to the categorization axioms; therefore, the naive Bayes classifier is the simplest Bayes classifier with respect to Occam's razor. As v_{ik} = P(Y_i | x_k) can be computed and V is a probability partition, the Bayes classifier can be considered soft categorization.
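A minimal sketch of the naive Bayes similarity Sim_Y(x, Y_i) = P(Y_i) Π_r P(x^*_r | Y_i) for binary features, with Laplace smoothing; the smoothing constant and the toy counts are illustrative.

```python
import numpy as np

# Toy data: two binary features, two classes (hypothetical counts for illustration).
X = np.array([[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

def naive_bayes_similarity(x, X, y, alpha=1.0):
    """Sim_Y(x, Y_i) = P(Y_i) * prod_r P(x_r | Y_i), with Laplace smoothing alpha."""
    classes = np.unique(y)
    sims = []
    for c in classes:
        Xi = X[y == c]
        prior = len(Xi) / len(X)                              # P(Y_i) = card(X_i) / n
        lik = 1.0
        for r in range(X.shape[1]):
            count = np.sum(Xi[:, r] == x[r])
            lik *= (count + alpha) / (len(Xi) + 2 * alpha)    # P(x_r | Y_i), binary feature
        sims.append(prior * lik)
    return classes, np.array(sims)

classes, sims = naive_bayes_similarity(np.array([1, 1]), X, y)
print(classes[np.argmax(sims)], sims)    # assigns [1, 1] to class 0
```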
Example 6: Let Ds_Y(y, Y_i) = Ds_Y(x, Y_i) = R(α_i | x) = Σ_{j=1}^{c} λ_{ij} P(Y_j | x), where the action α_i denotes the decision to assign the output y to class Y_i and λ_{ij} denotes the cost incurred for taking the action α_i when the input x belongs to Y_j. Transparently, the categorization result of minimum risk classification almost abides by the categorization axioms.

Example 7:
Let
Sim_Y(y, Y_i) = Sim_Y(x, Y_i) = U(α_i | x) = Σ_{j=1}^{c} U_{ij} P(Y_j | x), where the action α_i denotes the decision to assign the output y to class Y_i and U_{ij} measures how good it is to take the action α_i when the input x belongs to Y_j. The maximum expected utility classifier also almost follows the categorization axioms.

Example 8:
In the above examples, every Y_i is represented by one unique prototype, whether implicit or explicit. If every Y_i is assumed to be representable by several prototypes, such a classifier is more complex. In a decision tree classifier, every Y_i is usually represented by several mutually exclusive rules; it can be proved that the decision tree classifier also follows the categorization axioms.

X ≠ Y

When X ≠ Y with p > d, supervised dimensionality reduction is proposed to deal with the corresponding categorization. When X ≠ Y with p < d, kernel methods are proposed for categorization. In the following, we discuss them respectively.

Supervised Dimensionality Reduction
For X ≠ Y with p > d, it is easy to see that y = θ(x) such that y_k = θ(x_k) for all k. The simplest θ is a projection mapping; if θ(·) is a projection mapping, supervised dimensionality reduction becomes feature selection. Feature selection methods can easily be interpreted by the categorization consistency principle.

If θ is not a projection mapping, the simplest θ is a linear mapping from R^p to R^1. If there exists a direction w such that all categories in (X, U) are linearly separable when all points in (X, U) are orthogonally projected onto the direction w, we set \mathcal{Y}_i = \mathcal{X}_i = v_i w^T w, where w is a 1 × p vector with ww^T = 1, Y = [z_k]_{n×1} with z_k = x_k w^T, and v_i = Σ_{x_k ∈ X_i} x_k / |X_i|. With Ds_X(x, \mathcal{X}_i) = (x w^T w − \mathcal{X}_i)(x w^T w − \mathcal{X}_i)^T = w(x − v_i)^T(x − v_i)w^T and Ds_Y(z, \mathcal{Y}_i) = (z w − \mathcal{Y}_i)(z w − \mathcal{Y}_i)^T, it is easy to see that Ds_X(x, \mathcal{X}_i) = Ds_Y(z, \mathcal{Y}_i).

According to the category compactness principle, we need to minimize Σ_i Σ_{x_k ∈ X_i} Ds_X(x_k, \mathcal{X}_i) = n w S_W w^T. According to the category separation principle, we need to maximize Σ_{i=1}^{c} |X_i| w(v_i − \bar{x})^T(v_i − \bar{x})w^T = n w S_B w^T, where \bar{x} = (1/n) Σ_{k=1}^{n} x_k (S_W and S_B being the within-class and between-class scatter matrices). Combining the two, the ratio (w S_W w^T)/(w S_B w^T) should be minimized, which leads to the generalized Fisher linear discriminant analysis.

In particular, when c = 2, it is easy to prove that (\mathcal{X}_1 − \mathcal{X}_2)(\mathcal{X}_1 − \mathcal{X}_2)^T = w(v_1 − v_2)^T(v_1 − v_2)w^T = w S_B w^T. Since |X_1| w(v_1 − \bar{x})^T(v_1 − \bar{x})w^T + |X_2| w(v_2 − \bar{x})^T(v_2 − \bar{x})w^T = (|X_1||X_2|^2/|X|^2) w(v_1 − v_2)^T(v_1 − v_2)w^T + (|X_2||X_1|^2/|X|^2) w(v_1 − v_2)^T(v_1 − v_2)w^T = (|X_1||X_2|/|X|) w(v_1 − v_2)^T(v_1 − v_2)w^T, maximizing w(v_1 − v_2)^T(v_1 − v_2)w^T is equivalent to maximizing Σ_{i=1}^{2} |X_i| w(v_i − \bar{x})^T(v_i − \bar{x})w^T. Therefore, when c = 2, the generalized Fisher linear discriminant analysis becomes Fisher linear discriminant analysis. Certainly, Fisher linear discriminant analysis follows UCR if (X, U) is linearly separable in a direction w.

Kernel Methods
For X ≠ Y with p < d, assume that Y is linearly separable but X is not; then θ(·) is a nonlinear mapping such that y_k = θ(x_k) for all k. Sometimes, the dimensionality of Y is infinite. In this case, it is impossible to determine θ(·) from (X, U) and (
Y, V ). Fortunately, when (
\mathcal{Y}, Sim_Y) is obtained, (\mathcal{X}, Sim_X) can be obtained through the kernel function K(x, x_k) = (θ(x), θ(x_k)), where (θ(x), θ(x_k)) represents the inner product. By defining K(x, x_k), most categorization algorithms can be reinvented as kernel methods; interested readers are referred to (Scholkopf and Smola, 2011).

In summary, classification models almost follow the categorization axioms, but different classification models have different model complexity. It should be pointed out that a complex model may be easily interpreted while a simple one may be difficult to interpret; sometimes, a simple categorization model is very difficult to discover, especially when it is not easy to interpret.

Yu and Xu (2014) have presented categorization axioms based on the assumption that any category should have two kinds of representation. The main drawback of (Yu and Xu, 2014) is that it ignores the clustering input by implicitly assuming that the clustering result and the clustering input have the same category representation. However, the input and the output may not have the same category representation, even for some clustering algorithms. Therefore, the categorization axioms cannot be directly applied to a general learning algorithm. In particular, the categorization axioms assume that the number of categories is greater than one, which is invalid for regression and manifold learning.

In order to generalize the categorization axioms to general categorization methods, we represent categorization problems by redefining the categorization input as (
In summary, classification models largely follow the categorization axioms, but different classification models have different model complexity. It should be pointed out that a complex model may be easy to interpret while a simple one may be difficult to interpret; sometimes a simple categorization model is very difficult to discover, especially when it is not easy to interpret.

Yu and Xu (2014) presented categorization axioms based on the assumption that any category should have two kinds of representation. The main drawback of (Yu and Xu, 2014) is that it ignores the clustering input by implicitly assuming that the clustering result and the clustering input share the same category representation. However, the input and the output may not have the same category representation, even for some clustering algorithms. Therefore, the categorization axioms cannot be applied directly to a general learning algorithm. In particular, the categorization axioms assume that the number of categories is greater than one, which is invalid for regression and manifold learning.

In order to generalize the categorization axioms to general categorization methods, we represent categorization problems by redefining the categorization input as $(X, U, \mathcal{X}, \mathrm{Sim}_X)$ and the categorization result as $(Y, V, \mathcal{Y}, \mathrm{Sim}_Y)$. Based on these proposed representations of the categorization input and the categorization result, the similarity (inner referring) operator and the assignment (outer referring) operator are defined. These two operators are helpful not only for presenting UCR but also for reinterpreting the categorization axioms. ECR, UCR, SS, CS and CE indeed delimit the theoretical constraints for categorization. In particular, UCR offers the theoretical constraints for a perfect categorization algorithm, which guarantees that what is expected to be learned is equivalent to what is actually learned, i.e. there is no gap between teaching and learning. More interestingly, if $(X, U, \mathcal{X}, \mathrm{Sim}_X)$ and $(Y, V, \mathcal{Y}, \mathrm{Sim}_Y)$ are taken as a conversation between two persons, CE states that the outer category representation is equivalent to the inner category representation with respect to categorization, which is consistent with the maxim of quality in conversation: do not say what you believe to be false (Grice, 1975). UCR states that the input and the output should refer to the same categorization, which is consistent with the maxim of relation in conversation: make your contribution relevant (Grice, 1975). When a dialogue can be carried out efficiently, UCR and CE should hold in daily life.

As in Yu and Xu (2014), a clustering result satisfying SS and CS cannot be guaranteed to be a good clustering result, since SS and CS are too weak. Similarly, when developing a categorization algorithm, SS and CS also need to be strengthened, which results in the category compactness principle and the category separation principle, respectively, under the newly proposed representation. In this paper, it is proposed that a categorization method should follow UCR in theory. However, UCR is too demanding for a categorization method; in many cases UCR cannot hold and needs to be relaxed, which leads to one more design principle of categorization methods: the categorization consistency principle. The relation between the proposed axioms and the design principles for categorization is shown in Figure 2.

Figure 2: Relationship between axioms and design principles for categorization.

After the learning process, how to evaluate a categorization algorithm is very important. The categorization test axiom provides the prerequisite condition under which the performance of a categorization algorithm can be evaluated, and the local categorization robustness assumption offers the condition under which the performance of a categorization algorithm can be guaranteed to be stable.

When $c = 1$, ECR, SS, CS and CE hold trivially, but UCR still offers the theoretical condition for categorization. When $c = 1$, categorization becomes some dimensionality reduction methods, density estimation, and regression. Such dimensionality reduction methods, density estimation and regression can be introduced by UCR or its approximate version (the categorization consistency principle), for example principal component analysis, nonnegative matrix factorization, canonical correlation analysis, locally linear embedding, multidimensional scaling, Isomap, parametric density estimation, nonparametric density estimation, linear regression, ridge regression and the lasso. Theoretically, when $c = 1$, categorization mainly discusses how to represent a category, which lays a foundation for categorization with $c > 1$.
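To make the $c = 1$ case concrete, the following minimal sketch (my illustration, under the assumption that reconstruction error is used as the consistency measure) represents a single category by its mean and the $d$ leading principal directions, i.e. principal component analysis.

```python
import numpy as np

def single_category_pca(X, d):
    """c = 1 sketch: represent the single category by its mean v and the d
    leading principal directions W; the categorization consistency principle
    is approximated here by minimizing reconstruction error (an assumption)."""
    v = X.mean(axis=0)                         # inner category representation: the mean
    _, _, Vt = np.linalg.svd(X - v, full_matrices=False)
    W = Vt[:d]                                 # d leading principal directions
    Y = (X - v) @ W.T                          # low-dimensional output representation
    X_hat = Y @ W + v                          # reconstruction compared against X
    return v, W, Y, X_hat
```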
When $U$ is not known a priori for $c > 1$, categorization becomes clustering. ECR and UCR are always supposed to be true for any clustering algorithm in order to simplify the clustering process. Consequently, the clustering result and the clustering input are exchangeable when $X = Y$. Therefore, SS, CS and CE are sufficient for clustering when $X = Y$, and the theoretical analysis of clustering in (Yu and Xu, 2014) still holds when $X = Y$.
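As a hedged illustration of the category compactness principle for clustering with $X = Y$, the following plain k-means style sketch alternates the assignment (outer referring) step, where each object goes to its most similar category, with an update of each inner category representation $v_i$ as the mean of its members; names and initialization are illustrative assumptions.

```python
import numpy as np

def kmeans(X, c, n_iter=50, seed=0):
    """Plain k-means sketch: alternate the assignment (outer referring) step
    -- each object goes to its most similar category -- with updating each
    inner category representation v_i as the mean of its members."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)].copy()     # initial representations
    for _ in range(n_iter):
        d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # squared dissimilarities
        assign = d.argmin(axis=1)                                # assignment operator
        V = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else V[i]
                      for i in range(c)])
    return assign, V
```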
When $U$ is known a priori for $c > 1$, categorization becomes classification. For classification, ECR and CE are always true for a classification result, but SS and CS are true only for a proper classification result, and UCR holds only for a classification result with zero error. Therefore, SS, CS and UCR are the more important constraints for classification. With respect to a classification result $(Y, V, \mathcal{Y}, \mathrm{Sim}_Y)$, the decision region, the training decision region and the margin are defined by SS. For categorization methods, the category compactness principle can result in K-nearest-neighbor classification, linear discriminant analysis, support vector machines, logistic regression, Bayesian classification, minimum risk classification, maximum expected utility classification, decision trees, and so on. The category separation principle can lead to support vector machines and Fisher linear discriminant analysis. The categorization consistency principle can lead to empirical risk and structural risk, which in turn can result in neural networks and the binomial logistic regression model.
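As one hedged sketch of the intuitive rule that an object should be assigned to its most similar category, a nearest-prototype classifier can be built from the labelled input $(X, U)$ by taking each $v_i$ as the category representation; the per-category means and the Euclidean dissimilarity used below are assumptions made for illustration.

```python
import numpy as np

def category_prototypes(X, labels):
    """Inner category representations v_i learned from the labelled input (X, U):
    here simply the per-category means (an illustrative choice)."""
    return {i: X[labels == i].mean(axis=0) for i in np.unique(labels)}

def nearest_category(x, prototypes):
    """Assign x to its most similar category, with similarity taken as
    negative Euclidean distance to the category representation (assumption)."""
    return min(prototypes, key=lambda i: np.linalg.norm(x - prototypes[i]))
```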
UCR, SS, CS and CE play different roles in different categorization algorithms, but all have something to do with similarity. It is well known that similarity plays a key role in the human recognition system (Murphy, 2004; Hahn, 2014). Furthermore, Kloos and Sloutsky (2008) revealed that children represent categories based on similarity and that similarity-based category representation is a developmental default. The proposed axiomatic framework indeed establishes a bridge between cognitive science and machine learning through the similarity (inner referring) operator.

More interestingly, the proposed categorization framework clearly shows the range over which a categorization algorithm can reasonably be applied. If the inner category representation is reasonable for the outer input, the corresponding categorization algorithm is feasible. Otherwise, a more suitable inner category representation should be used, which certainly introduces another categorization algorithm. The analysis of categorization algorithms in this paper shows that the design of a cognitive category representation really needs powerful imagination, as the cognitive category representations in existing categorization algorithms are so diverse. In theory, a powerful categorization algorithm seems to have a powerful cognitive category representation.

It should be pointed out that many open questions remain to be addressed within the proposed axiomatic framework. For example, how to design an appropriate cognitive category representation for a specific categorization algorithm? When $c \geq 1$ and $(X, U)$ is only partially known or noisy, what is the relation between categorization axioms and categorization algorithms?
Acknowledgements
Zongben Xu, Xinbo Gao, Wensheng Zhang, Baogang Hu, Jiangshe Zhang, Jufu Feng, Shaoping Ma, Qing He, Xuegang Hu, Liping Jing, Bianfang Chai, Jia Li and all my colleagues in the CAAI Machine Learning Technical Committee are greatly appreciated; their valuable discussions and suggestions have greatly improved the presentation of this paper. This work was supported by the NSFC grant (61370129), the Ph.D. Programs Foundation of the Ministry of Education of China (20120009110006), PCSIRT (IRT201206), and the Beijing Committee of Science and Technology, China (Grant No. Z131110002813118).
References
Abdi, H. and Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459.

Cover, T. and Hart, P. (1967). Nearest neighbor pattern recognition. IEEE Transactions on Information Theory, 13(1):21–27.

Cox, D. (1958). The regression analysis of binary sequences (with discussion). J. Roy. Stat. Soc. B, 20:215–242.

Duda, R. O., Hart, P. E., et al. (1973). Pattern Classification and Scene Analysis, volume 3. Wiley, New York.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188.

Grice, P. (1975). Logic and conversation. In P. Cole and J. Morgan, eds., Syntax and Semantics, vol. 3, New York. Academic Press.

Hahn, U. (2014). Similarity. Wiley Interdisciplinary Reviews: Cognitive Science, 5(3):271–280.

Hosmer Jr, D. W. and Lemeshow, S. (2004). Applied Logistic Regression. John Wiley & Sons.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, pages 321–377.

Kloos, H. and Sloutsky, V. M. (2008). What's behind different kinds of kinds: Effects of statistical density on learning and representation of categories. Journal of Experimental Psychology: General, 137(1):52.

Kruskal, J. B. and Wish, M. (1978). Multidimensional Scaling, volume 11. Sage.

Lee, D. and Seung, H. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Murphy, G. L. (2004). The Big Book of Concepts. MIT Press.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559–572.

Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.

Scholkopf, B. and Smola, A. J. (2011). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Journal of the American Statistical Association, 98(3):781.

Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, New York.

Tenenbaum, J. B., De Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323.

Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.

Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.

Vapnik, V. (2000). The Nature of Statistical Learning Theory. Springer.

Yu, J. and Xu, Z. (2014). Categorization axioms for clustering results. eprint arXiv:1403.2065.