A generalization of the symmetrical and optimal probability-to-possibility transformations
Esteve del Acebo, Yousef Alizadeh-Q, and Sayyed Ali Hossayni
VICOROB Institute, Universitat de Girona
Data Mining Lab, School of Computer Engineering, Iran University of Science and Technology
January 3, 2020
Abstract
Possibility and probability theories are alternative and complementary ways to deal with uncertainty, which has motivated over the last years an interest in the study of ways to transform probability distributions into possibility distributions and conversely. This paper studies the advantages and shortcomings of two well-known discrete probability to possibility transformations, the optimal transformation and the symmetrical transformation, and presents a novel parametric family of probability to possibility transformations which generalizes them and alleviates their shortcomings, showing great potential for practical application. The paper also introduces a novel fuzzy measure of specificity for possibility distributions based on the concept of fuzzy subsethood, and presents an empirical validation of the generalized transformation's usefulness by applying it to the text authorship attribution problem.
1 Introduction

Possibility and probability theories are alternative ways to deal with uncertainty [Zad78], [DP07]. They are in no way unrelated (actually, a degree of possibility can be viewed as an upper probability bound [DPS93]) and they are complementary in the sense that both can be useful under different circumstances. This has motivated over the last years the study of ways to obtain possibility distributions from probability distributions and conversely. These probability to possibility transformations are, citing Sudkamp [Sud92], purely mechanical transformations of probabilistic support to possibilistic support and vice versa; that is, conversions of the measure of support of one theory into that of the other that are independent of the problem domain.

Several such transformations have been defined (see [Ous00] for a detailed account), two of the most well-known being the optimal transformation and the symmetrical transformation proposed by D. Dubois et al. [DP82, DPS93]. Both transformations have drawbacks, which we will discuss: the discontinuity in the case of the optimal transformation and the low specificity in the case of the symmetrical transformation. This paper presents a new parametric family of probability to possibility transformations which generalizes the optimal transformation and the symmetrical transformation and can help alleviate their shortcomings. The paper is organized as follows: we first present the two transformations, examine their properties and expose their possible deficiencies. In the next section we show how the two transformations can be seen as particular cases of a parametric family of transformations and study their properties, introducing a novel fuzzy measure of specificity for possibility distributions.
Section 4 deals with converse possibility to probability transformations, giving a formulation of the converse of the optimal transformation and also giving the system of equations to be solved in order to obtain the converse of the generalized transformation. Section 5 presents an empirical validation of the generalized transformation's usefulness by applying it to the text authorship attribution problem and, finally, the last section contains several concluding remarks.

2 Probability to possibility transformations
As stated in the introduction, a number of probability to possibility and possibility to probability transformations, both in the continuous and the discrete cases, have been proposed. In this section, we will study two well-known discrete probability to possibility transformations proposed by D. Dubois et al.: the so-called symmetrical [DP82] and optimal [DPS93] transformations. We will examine their properties and discuss their advantages and drawbacks.
Let W = {w_1, w_2, ..., w_n} be the set of possible values taken by a discrete random variable X, let p : W → [0, 1] be the probability distribution of X and let P : 2^W → [0, 1] be the probability measure induced by p. A possibility distribution for X is a function π : W → [0, 1] and can be seen as a fuzzy set over W. The possibility measure induced by π is defined, for all A ⊆ W, as:

Π(A) = max_{w_i ∈ A} π(w_i)

Following Zadeh [Zad78], Dubois and Prade propose several desirable properties for a probability to possibility transformation [DP82]:

• Consistency.
An event must be possible prior to being probable: degrees of possibility cannot be less than degrees of probability.

∀ A ⊆ W   Π(A) ≥ P(A)

• Order preservation.
Probabilities and possibilities must be equally ordered:

∀ w_i, w_j ∈ W   π(w_i) > π(w_j) ⟺ p(w_i) > p(w_j)

In [DP83], Dubois et al. define a transformation with these properties. It is known as the symmetrical transformation π_S, and is defined as:

π_S(w_i) = Σ_{w_j ∈ W} min(p(w_i), p(w_j))    (1)

Another desirable property for a probability to possibility transformation is maximal specificity, in the sense that the possibility distribution should preserve as much information from the probability distribution as possible. We will say that a possibility distribution π is more specific than another possibility distribution π′ iff π(w_i) ≤ π′(w_i) for all w_i ∈ W (that is, iff π ⊂ π′, denoting by ⊂ fuzzy inclusion). The maximally specific probability to possibility transformation satisfying the properties of consistency and order preservation was presented by D. Dubois et al. in [DPS93]; it is known as the optimal transformation (a proof can be found in [DPS93], [DM87]) and is defined as:

π_O(w_i) = Σ_{w_j ∈ W : p(w_j) ≤ p(w_i)} p(w_j)    (2)

or, equivalently:

π_O(w_i) = Σ_{w_j ∈ W} p(w_j) · 1_{≤w_i}(w_j)    (3)

where 1_{≤w_i} is the indicator function equal to one if p(w_j) ≤ p(w_i) and zero otherwise.

Both transformations have their advantages and shortcomings. The symmetrical transformation is intuitive and continuous (more on this later), but has as a shortcoming its low specificity: it preserves less information from the underlying probability distribution than the optimal transformation. The optimal transformation, on the other hand, is maximally specific, but is discontinuous in a way that makes it counter-intuitive in some cases. For example, let p = [0.5, 0.5] represent the probability distribution function of a binary random variable X. The associated possibility distributions using the symmetrical and the optimal transformations would be π_S = π_O = [1, 1]. If we perturb the probabilities to p = [0.5+ε, 0.5−ε] for an arbitrarily small ε > 0, we will have π_S = [1, 1−2ε], but π_O = [1, 0.5−ε]. So, in the case of the optimal transformation, arbitrarily small changes in the probability distribution function can produce large changes in the possibility distribution function. This is not the case with the symmetrical transformation.

A further example which makes this discontinuity graphically evident can be seen in Figure 1. It plots π(w_1) against p(w_1) for a binary random variable X taking values in W = {w_1, w_2} with probabilities p(w_1) = p and p(w_2) = 1 − p. The plot of π_S, to the left, is continuous; the plot of π_O, to the right, shows a discontinuity at p = 0.5.

Figure 1: Plot of π(w_1) against p(w_1) for a binary random variable X taking values in W = {w_1, w_2} with probabilities p(w_1) = p and p(w_2) = 1 − p. To the left, the symmetrical transformation π_S; to the right, the optimal transformation π_O.

3 A generalization of the symmetrical and optimal transformations

As we have seen in the previous section, both transformations have shortcomings: the symmetrical transformation is not specific enough, and the discontinuity of the optimal transformation is apt to produce counter-intuitive results. It would be desirable to be able to trade part of the specificity of the latter for the continuity of the former. Our proposal is to generalize both transformations by means of the following family of parametric transformations:

π_G(w_i) = Σ_{w_j ∈ W} p(w_j) · min(1, (p(w_i)/p(w_j))^n)    (4)

It is easy to see that π_S and π_O are particular cases of π_G. For n = 1, clearly π_G = π_S. On the other hand, when n tends to infinity, min(1, (p(w_i)/p(w_j))^n) tends to the indicator function 1_{≤w_i} and, consequently, π_G tends to π_O.

In Figure 2 we can see plots of π_G(w_1) against p(w_1) for a binary random variable taking values in W = {w_1, w_2} with probabilities p(w_1) = p and p(w_2) = 1 − p, for several increasing values of the parameter n. The increase in specificity without loss of continuity can be appreciated.

Figure 2: Plots of π_G(w_1) against p(w_1) for a binary random variable taking values in W = {w_1, w_2} with probabilities p(w_1) = p and p(w_2) = 1 − p, for several increasing values of the parameter n.

Similarly, in Figure 3 we can see color maps showing the value of π_G against p_1 = p(w_1) and p_2 = p(w_2) for a ternary random variable taking values in W = {w_1, w_2, w_3}, for different values of n. Only the regions where p_1 + p_2 ≤ 1 (that is, under the main diagonal) are meaningful. In the top row, to the left, for n = 100, π_G ≈ π_O, and the discontinuity lines of π_O are easy to see. To the right, for n = 1, π_G = π_S and no discontinuities exist, but there is low specificity. The bottom row shows π_G for two intermediate values of n: to the left, n = 2; to the right, n = 5.

Figure 3: Color maps showing the value of π_G against p_1 = p(w_1) and p_2 = p(w_2) for a ternary random variable taking values in W = {w_1, w_2, w_3}, for different values of n. Top row: to the left, n = 100 (π_G ≈ π_O); to the right, n = 1 (π_G = π_S). Bottom row: to the left, n = 2; to the right, n = 5.

It is easy to prove that the generalized transformation has the properties of consistency and order preservation for any value n > 0. Consistency can be proved from the properties of the optimal transformation: it holds that min(1, (p(w_i)/p(w_j))^n) ≥ 1_{≤w_i}(w_j) for all n, so, for all A ⊆ W, we will have Π_G(A) = max_{w_i ∈ A} π_G(w_i) ≥ Π_O(A) ≥ P(A). In order to prove order preservation, we can observe in Eq. 4 that, for any pair of probabilities p(w_1) and p(w_2):

• p(w_1) = p(w_2) ⟹ π_G(w_1) = π_G(w_2)

• If p(w_1) > p(w_2), every term in the sum for the computation of π_G(w_1) is greater than or equal to the corresponding term for π_G(w_2) and, in particular, the term corresponding to w_j = w_1 is strictly greater in the former. So p(w_1) > p(w_2) ⟹ π_G(w_1) > π_G(w_2).

3.1 Specificity of the generalized transformation

In this section, we will try to measure and compare the specificity of the generalized transformation for different values of the parameter n and different probability distributions. We have said that a possibility distribution π is more specific than another possibility distribution π′ if π(w_i) ≤ π′(w_i) for all w_i ∈ W. This is the same as saying that π is more specific than π′ if π ⊂ π′, considering π and π′ fuzzy sets and denoting by ⊂ fuzzy inclusion. Following this idea, it makes sense to make use of a fuzzy subsethood relationship to extend this definition of specificity to the fuzzy domain and talk about the degree to which one possibility distribution is more specific than another.

Table 1: Mean and standard deviation of the specificity of a set of possibility distributions for different values of the parameter n in equation 4. To the left, results when the probability distributions are obtained by sampling a uniformly distributed random variable. To the right, results when the probability distributions are obtained by sampling a random variable distributed following Zipf's law.

We will use the fuzzy subsethood relationship proposed by Kosko [Kos90]: given two fuzzy sets A and B over the same universal set U, we define the degree to which A is contained in B as:

S(A, B) = M(A ∩ B) / M(A)    (5)

where ∩ denotes fuzzy intersection and M(X) denotes the cardinality or measure of the fuzzy set X, defined as M(X) = Σ_{u_i ∈ U} m_X(u_i).
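The transformations discussed so far translate almost directly into code. The following is an illustrative Python sketch of eqs. 1, 2 and 4 (our own implementation, not code from the paper):

```python
import numpy as np

def pi_symmetrical(p):
    # Symmetrical transformation (eq. 1): pi_S(w_i) = sum_j min(p_i, p_j).
    p = np.asarray(p, dtype=float)
    return np.array([np.minimum(pi, p).sum() for pi in p])

def pi_optimal(p):
    # Optimal transformation (eq. 2): sum of all p_j with p_j <= p_i.
    p = np.asarray(p, dtype=float)
    return np.array([p[p <= pi].sum() for pi in p])

def pi_generalized(p, n):
    # Generalized transformation (eq. 4): sum_j p_j * min(1, p_i/p_j)^n.
    # min(1, r)^n == min(1, r^n) for r >= 0, and avoids overflow for large n.
    p = np.asarray(p, dtype=float)
    return np.array([sum(pj * min(1.0, pi / pj) ** n for pj in p if pj > 0)
                     for pi in p])
```

For n = 1 the generalized transformation coincides with the symmetrical one, and for large n it approaches the optimal one, mirroring the limiting argument above.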
Making use of this definition, and choosing the minimum operator to implement fuzzy intersection, we define, given a probability distribution p, the specificity of the possibility distribution π_T obtained by applying a probability to possibility transformation T to p as the degree to which it is included in the maximally specific possibility distribution π_O (the one obtained by applying the optimal transformation to p):

specificity(π_T) = S(π_T, π_O) = Σ_{w_i ∈ W} min(π_T(w_i), π_O(w_i)) / Σ_{w_i ∈ W} π_T(w_i)    (6)

The results of a series of experiments can be seen in Table 1. Each row shows, for a given value of the exponent n in equation 4, the mean and the standard deviation of the specificity of the 100 possibility distributions resulting from the generalized transformation of a sample of 100 probability distributions. Each probability distribution p_i is obtained by sampling 250000 times a discrete random variable V taking values over W = {w_1, w_2, ..., w_m} and then assigning to p_i(w_i) the proportion of occurrences of w_i in the sample. To the left, the results when V is distributed uniformly over W, that is, p(w_i) = 1/m. To the right, the results when the probability distribution of V follows a power law, with p(w_i) proportional to i^{-α}. This distribution is known as Zipf's law or the discrete Pareto distribution, and is known for its usefulness in modeling many types of data studied in the physical and social sciences, from the frequency of words in natural language corpora to population ranks of cities in various countries, corporation sizes, income rankings, ranks of the number of people watching the same TV channel and so on [Wik18].

The results show, as expected, how the specificity of the transformed possibility distribution increases with the value of the parameter n. It is interesting to observe the difference in the growth speed depending upon the underlying probability distribution.
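The specificity measure of eq. 6, and the shape of one experiment run, can be sketched as follows. This is illustrative only: the cardinality m = 10, the sample size and the Zipf exponent α = 1 are our own choices, not necessarily the exact settings behind Table 1.

```python
import numpy as np

def pi_generalized(p, n):
    # Generalized transformation (eq. 4).
    p = np.asarray(p, dtype=float)
    return np.array([sum(pj * min(1.0, pi / pj) ** n for pj in p if pj > 0)
                     for pi in p])

def specificity(pi_t, pi_o):
    # Eq. 6: degree to which pi_T is a fuzzy subset of the optimal pi_O,
    # using min as fuzzy intersection and the sigma-count as cardinality.
    pi_t = np.asarray(pi_t, dtype=float)
    pi_o = np.asarray(pi_o, dtype=float)
    return np.minimum(pi_t, pi_o).sum() / pi_t.sum()

# One illustrative run: build an empirical Zipf-like distribution and
# watch specificity grow with n.
rng = np.random.default_rng(0)
weights = 1.0 / np.arange(1, 11)          # Zipf-like weights, alpha = 1
weights /= weights.sum()
sample = rng.multinomial(250_000, weights)
p = sample / sample.sum()
pi_opt = pi_generalized(p, 10_000)        # numerically indistinguishable from pi_O
for n in (1, 2, 5, 10):
    print(n, round(specificity(pi_generalized(p, n), pi_opt), 3))
```

Since π_G decreases pointwise as n grows while always dominating π_O, the printed specificities increase monotonically towards 1.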
The specificity of the possibility distributions obtained from the power law probability distribution increases much more rapidly than the specificity of the possibility distributions obtained from the uniform probability distribution. The experiment has been run several times, with different values of the number of samples and different cardinalities of W, the results being similar to those reported in the table.

4 Converse possibility to probability transformations

The symmetrical probability to possibility transformation defined in eq. 1 has a well-known [DP83] corresponding converse possibility to probability transformation given by:

p_S(w_i) = Σ_{i ≤ j ≤ n} (π_S(w_j) − π_S(w_{j+1})) / j    (7)

considering, without loss of generality, that possibility (and, consequently, probability) values are ordered decreasingly, and taking π_S(w_{n+1}) = 0.

In the case of the optimal probability to possibility transformation defined in eqs. 2 and 3, if all the π(w_i) are different, the converse possibility to probability transformation is as simple as:

p(w_i) = π_O(w_i) − π_O(w_{i+1})    (8)

also considering π_O(w_{n+1}) = 0, but it is very easy to see that it does not work when possibility values repeat. As far as we know, no published formulation of the converse of the optimal transformation taking into account the possibility of duplicated values exists. We give it as:

p_O(w_i) = (π_O(w_i) − max{π_O(w_j) : π_O(w_j) < π_O(w_i)}) / reps(π_O(w_i))    (9)

where reps(π_O(w_i)) is the number of repetitions of the value π_O(w_i) in the possibility distribution, also considering π_O(w_{n+1}) = 0 (so that the maximum is taken as 0 when no strictly smaller value exists). In other words, given a decreasingly ordered list of possibility values, the probability of w_i, p_O(w_i), equals its possibility π_O(w_i), minus the next value in the possibilities list different from π_O(w_i), if such a value exists, divided by the number of times the value π_O(w_i) appears in the list.

The general case is, however, much more involved.
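Eqs. 7 and 9 translate almost directly into code. A minimal pure-Python sketch (our own illustration), assuming the possibility values are given sorted in decreasing order:

```python
def converse_symmetrical(pi):
    # Eq. 7: converse of the symmetrical transformation.
    # pi must be sorted decreasingly; pi_{n+1} is taken as 0.
    pi = list(pi) + [0.0]
    n = len(pi) - 1
    return [sum((pi[j] - pi[j + 1]) / (j + 1) for j in range(i, n))
            for i in range(n)]

def converse_optimal(pi):
    # Eq. 9: converse of the optimal transformation, allowing repeated
    # possibility values; pi must be sorted decreasingly.
    out = []
    for v in pi:
        smaller = [u for u in pi if u < v]
        nxt = max(smaller) if smaller else 0.0   # next distinct value, or 0
        reps = sum(1 for u in pi if u == v)      # repetitions of v
        out.append((v - nxt) / reps)
    return out
```

For example, converse_optimal([1.0, 0.6, 0.6]) splits the repeated value 0.6 evenly, recovering [0.4, 0.3, 0.3], exactly the probability vector whose optimal transformation is [1.0, 0.6, 0.6].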
Suppose the probabilities p(w_1), ..., p(w_M) ordered decreasingly and let p_i and π_i denote p(w_i) and π(w_i) respectively. The generalized transformation can be written as:

π_i = Σ_{j=1}^{i−1} p_j (p_i/p_j)^n + Σ_{j=i}^{M} p_j = p_i^n Σ_{j=1}^{i−1} p_j^{1−n} + Σ_{j=i}^{M} p_j    (10)

This expression defines, supposing the π_i's known, a system of M equations in M variables (the p_i's) which must be solved under the restriction p_1 ≥ p_2 ≥ ... ≥ p_M > 0. This is, to the best of our knowledge, a hard problem without a general closed-form algebraic solution, which has to be solved with specialized mathematical software. Moreover, it is not clear that the existence of a solution is guaranteed for every possibility distribution.
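In practice, the system of eq. 10 can be handed to a generic nonlinear least-squares routine. Below is a sketch using SciPy as the "specialized mathematical software" (our own illustrative approach, not the authors' method; in line with the caveat above, nothing here guarantees convergence for every possibility distribution):

```python
import numpy as np
from scipy.optimize import least_squares

def forward(p, n):
    # Eq. 10 (equivalently eq. 4) for a probability vector p sorted decreasingly.
    p = np.asarray(p, dtype=float)
    return np.array([sum(pj * min(1.0, pi / pj) ** n for pj in p)
                     for pi in p])

def converse_generalized(pi, n):
    # Numerically invert eq. 10: find p such that forward(p, n) == pi.
    # We simply hand the residuals to a bounded least-squares solver,
    # starting from the uniform distribution.
    pi = np.asarray(pi, dtype=float)
    M = len(pi)
    x0 = np.full(M, 1.0 / M)
    sol = least_squares(lambda p: forward(p, n) - pi, x0,
                        bounds=(1e-9, 1.0), xtol=1e-14, ftol=1e-14)
    return sol.x
```

The lower bound keeps the solver away from p_i = 0, where the ratios in eq. 10 are undefined; the ordering restriction is not enforced explicitly and should be checked on the returned solution.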
19, Hos18], the authors showed how thefusion of two author characteristics (the stylome or specific style of writing and the author’s hand-written signature) can be used in the text authorship attribution problem in order to improveattribution accuracy for several linear classificators. To this end, two biometric algorithms werecombined, a fuzzy signature recognition algorithm due to Kudlacik and Porwik [KP14] and a prob-abilistic authorship attribution algorithm from Sidorov et al. [SCS + +
19, Hos18] using the generalized transformation with different valuesof the parameter n .The mail database used in the experiments contains a total of 800 mails from 40 authors (20mails each). For each mail, the text body and the scanned author’s handwritten signature areavailable. 5 mails from each author are used for training and the remaining 15 are used for testing.For each test mail we compute both the fuzzy membership of its signature to the set of trainingsignatures of each author using Kudlacik and Porwik algorithm [KP14] and the probabilities ofauthorship of its text body by each author using Sidorov et al. algorithm [SCS +
14] and fivedifferent linear-time classifiers (the algorithm uses one classifier itself). Thereafter, the obtained In fact, this is the converse of the transformation π ( w i ) = (cid:80) j ≥ i p ( w j ) , a variation of the optimal transformationwhich replaces the order preservation condition with the weak order preservation condition given by: ∀ w i , w j ∈ W p ( w i ) > p ( w j ) ⇒ π ( w i ) > π ( w j ) . It is the most specific consistent transformation [DM87], but has the drawbacksthat it is not unique and that no possibility values can repeat, even if the corresponding probability values are equal.That is, for all i (cid:54) = j π ( w i ) (cid:54) = π ( w j ) +
19] for different values ofthe parameter n . There are five results for each value of n , corresponding to the use of five differentlinear-time classifiers inside the syntactic n-gram-based authorship attribution algorithm. Fromleft to right: MNB stands for Multinomial Naïve Bayes, SVM stands for Support Vector Machine,Ridge stands for Ridge-Regression classification PA stands for Passive-Aggressive classificationand BNB stands for Bernoulli Naïve Bayes classification. Results for n = 1 correspond to thesymmetrical transformation π S , results for n = inf correspond to the optimal transformation π O .probabilities are transformed to possibilities using the generalized transformation and, after fusion,the set (i.e. the author) with the maximum membership is chosen as the genuine author.We can see the outcome of the experiments in 4. The x axis represent the value of the generalizedtransformation parameter n and the y axis represent the accuracy of the corresponding classifier.There are five results for each value of n , corresponding to the use of five different linear-timeclassifiers inside the syntactic n-gram-based authorship attribution algorithm. From left to right:MNB stands for Multinomial Naïve Bayes, SVM stands for Support Vector Machine, Ridge standsfor Ridge-Regression classification [DW18], PA stands for Passive-Aggressive classification [MSY + ],and BNB stands for Bernoulli Naïve Bayes classification. Results for n = 1 correspond to thesymmetrical transformation π S and results for n = inf correspond to the optimal transformation π O )As can be seen in the figure, when the attribution algorithm uses Multinomial Naïve-Bayesor Ridge-Regression classification, the best results correspond to n = 6 ; when it uses Passive-Aggressive classification the best results correspond to n = 2 and when it uses Bernoulli Bayesthey correspond to n = 1 , that is, the symmetrical transformation π S . 
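The fusion step can be illustrated with a toy example. All the numbers below are hypothetical, and the minimum t-norm used as the fusion operator is our own assumption, since this section does not spell out how the two possibilistic scores are combined:

```python
import numpy as np

def pi_generalized(p, n):
    # Generalized probability-to-possibility transformation (eq. 4).
    p = np.asarray(p, dtype=float)
    return np.array([sum(pj * min(1.0, pi / pj) ** n for pj in p if pj > 0)
                     for pi in p])

# Hypothetical scores for four candidate authors (illustrative numbers only).
text_probs = np.array([0.15, 0.45, 0.30, 0.10])      # authorship probabilities
sig_membership = np.array([0.20, 0.60, 0.90, 0.10])  # fuzzy signature memberships

# Transform the probabilities into possibilities and fuse the two
# possibilistic scores with the minimum t-norm (assumed); the author
# with the maximum fused membership is chosen.
poss = pi_generalized(text_probs, n=6)
fused = np.minimum(poss, sig_membership)
predicted_author = int(np.argmax(fused))
```

Note how the fused decision can differ from what either modality suggests alone: author 2 has the strongest signature evidence, but the possibilistic text score caps its fused membership below that of author 1.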
More importantly, the best accuracy overall is obtained with the attribution algorithm using Ridge-Regression classification, with values of n ranging from 4 to 10. We believe that these results provide strong empirical evidence of the possible usefulness of the generalized transformation.

6 Conclusions

This paper has presented a novel parametric family of discrete probability to possibility transformations which generalizes two well-known transformations proposed by D. Dubois et al. [DP82, DPS93], the symmetrical transformation and the optimal transformation, making it possible to combine their advantages to different degrees by increasing the specificity of the symmetrical transformation without losing continuity and avoiding, in this way, possible artifacts caused by the lack of continuity of the optimal transformation. This gives the presented generalized transformation great potential for practical application. We have also proved that the generalized transformation has the properties of consistency and order preservation for positive values of the exponent n, and devised a fuzzy measure of specificity for possibility distributions based on a fuzzy subsethood relationship. Finally, we have given empirical evidence of the usefulness of the generalized transformation by comparing it with the symmetrical and optimal transformations in the context of an authorship attribution problem.

It remains as further work to establish to what extent this generalized transformation represents a real improvement over the existing ones, analyzing the importance of the value of the parameter n and studying how to determine which values are best suited for a given application. It would also be interesting to determine which numerical or algebraic methods could be suitable for the calculation of the dual possibility to probability transformation for different values of the exponent n.
Acknowledgements

This work has been partially funded by AfterDigital Consultants: Digitalización del consultor digital (RTC-2017-6370-7), CIEN Service Chain (new blockchain-based technologies for identity management, trustworthiness and traceability of transactions of goods and services) and the Consolidated Research Group ref. 2017 SGR 1648.
References

[DM87] M. Delgado and S. Moral. On the concept of possibility-probability consistency. Fuzzy Sets and Systems, 21(3):311–318, 1987.

[DP82] D. Dubois and H. Prade. On several representations of an uncertain body of evidence. In Gupta and Sanchez, editors, Fuzzy Information and Decision Processes, pages 167–181. North-Holland Publishing Company, 1982.

[DP83] D. Dubois and H. Prade. Unfair coins and necessity measures: Towards a possibilistic interpretation of histograms. Fuzzy Sets and Systems, 10(1-3):15–20, January 1983.

[DP07] D. Dubois and H. Prade. Possibility theory. Scholarpedia, 2(10):2074, 2007.

[DPS93] D. Dubois, H. Prade, and S. Sandri. On possibility/probability transformations. In Fuzzy Logic: State of the Art, pages 103–112. Springer Netherlands, Dordrecht, 1993.

[DW18] Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. Annals of Statistics, 46(1):247–279, 2018.

[HAT+19] Sayyed-Ali Hossayni, Yousef Alizadeh-Q, Vahid Tavana, Seyed M. Hosseini Nejad, Mohammad-R. Akbarzadeh-T, Esteve del Acebo, Josep Lluís de la Rosa i Esteva, Enrico Grosso, Massimo Tistarelli, and Przemyslaw Kudlacik. A linear-complexity multi-biometric forensic document analysis system, by fusing the stylome and signature modalities. CoRR, abs/1902.02176, 2019.

[Hos18] Sayyed Ali Hossayni. Foundations of uncertainty management for text-based sentiment prediction. PhD thesis, Universitat de Girona, 2018.

[Kos90] B. Kosko. Fuzziness vs. probability. International Journal of General Systems, 17(2-3):211–240, 1990.

[KP14] P. Kudlacik and P. Porwik. A new approach to signature recognition using the fuzzy method. Pattern Analysis and Applications, 17(3):451–463, August 2014.

[MSY+] Shin Matsushima, Nobuyuki Shimizu, Kazuhiro Yoshida, Takashi Ninomiya, and Hiroshi Nakagawa. Exact Passive-Aggressive Algorithm for Multiclass Classification Using Support Class, pages 303–314.

[Ous00] Mourad Oussalah. On the probability/possibility transformations: A comparative analysis. International Journal of General Systems, 29(5):671–718, 2000.

[SCS+14] Grigori Sidorov, Francisco Castillo, Efstathios Stamatatos, Alexander Gelbukh, and Liliana Chanona-Hernández. Syntactic n-grams as machine learning features for natural language processing. Expert Systems with Applications, 41:853–860, 2014.

[Sud92] T. Sudkamp. On probability-possibility transformations. Fuzzy Sets and Systems, 51(1):73–81, 1992.

[Wik18] Wikipedia contributors. Zipf's law — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Zipf%27s_law&oldid=859042231, 2018. [Online; accessed 13-September-2018].

[Zad78] L. A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1):3–28, 1978.