FField Formulation of Parzen Data Analysis
D. Horn Sackler School of Physics and Astronomy Tel Aviv University, Tel Aviv, Israel
Abstract
The Parzen window density is a well-known technique, associating Gaussian kernels with data points. It is a very useful tool in data exploration, with particular importance for clustering schemes and image analysis. This method is presented here within a formalism containing scalar fields, such as the density function and its potential, and their corresponding gradients. The potential is derived from the density through the dependence of the latter on the common scale parameter of all Gaussian kernels. The loci of extrema of the density and potential scalar fields are points of interest which obey a variation condition on a novel indicator function. They serve as focal points of clustering methods depending on maximization of the density, or minimization of the potential, accordingly. The mixed inter-dependencies of the different fields in d-dim data-space and 1-d scale-space, are discussed. They lead to a Schrődinger equation in d-dim, and to a diffusion equation in (d+1)-dim. Introduction
Data analysis has become a dominant occupation of modern science; to the extent it is often referred to as belonging to Data Science. Physics has always relied on data collection and analysis, and some of the modern tools of machine learning were motivated by physical intuition. We try to build on such intuition in this analysis of the Parzen-window distribution [1], an important tool which has been introduced in 1965, and still serves the goal of pattern recognition [2]. It has recently been shown [3] to allow for a natural decomposition into Weight and Shape components, where the former entails semi-global features and the latter depends on local features of the Parzen distribution. In particular, Shape is the exponential of the potential V that has been introduced in Quantum Clustering [4]. Although the formalism is developed for statistical data analysis, the Weight-Shape decomposition [3] makes use of an analogy with statistical information theory [5]. Here we present a novel formulation based on vector and scalar fields and establish their inter-relations both in data-space and in 1-d scale-space. The mixed behavior of the different variables leads to a Schrődinger equation in d-dim data space, and to a diffusion equation in the (d+1)-dim scale and data space. A special role is played by the potential function which is derived from the Parzen distribution function through its dependence on the scale parameter. The Parzen probability distribution and its potential function.
Consider the case of data points within some Euclidean data - or feature - space of dimension d, with possible positive attributes (e.g. intensities) I i , described by an experimental distribution 𝑄(𝐲) = ∑ 𝐼 𝑖 𝛿(𝐲 − 𝛍 𝑖 ) 𝑖 (1) where 𝑁 = ∑ 𝐼 𝑖𝑖 . In doing so, we assume that the measurement errors are small by comparison to the distances between the points and to the resolution with which we wish to study the data (otherwise replace the delta-functions by Gaussians with appropriate widths). In order to investigate the behavior of the data we make use of the Parzen-window [1] probability distribution 𝜓 𝑞 (𝐱) = ∫ 𝑑𝐲 𝑄(𝐲) 𝐾 𝑞 (𝐱 − 𝐲) (2) where we employ the Gaussian kernel 𝐾 𝑞 (𝐱 − 𝐲) = exp(−𝑞 (𝐱 − 𝐲) ) . (3) 𝑞 = is the resolution, with 𝜎 being the Gaussian width. 𝜓 𝑞 (𝐱) is the Weierstrass transform of 𝑄(𝐲) . As such, it may be defined for any probability distribution and it obeys ∫ 𝑑𝐳 𝐾 𝑞 (𝐱 − 𝐳) 𝜓 𝑞 (𝐳) = ( 𝜋𝑞 +𝑞 ) 𝑑/2 𝜓 𝑞 (𝐱) with 𝑞 = 𝑞 𝑞 𝑞 +𝑞 . (4) Following [3] we define a relative probability weight 𝑝 𝑞 (𝐱|𝐲) = 𝐾 𝑞 (𝐱−𝒚)𝜓 𝑞 (𝐱) , (5) which represents the influence of point y in Q ( y ) on point x in 𝜓(𝐱) , and is properly normalized at any point x through ∫ 𝑑𝐲 𝑄(𝐲) 𝑝 𝑞 (𝐱|𝐲) = 1 . (6) Let us introduce the potential function 𝑉 𝑞 (𝐱) through 𝑉 𝑞 (𝐱) = −𝑞 𝜕𝜕𝑞 log𝜓 𝑞 (𝐱) = 𝑞 ∫ 𝑑𝐲 𝑄(𝐲)(𝐱 − 𝐲) 𝑝 𝑞 (𝐱|𝐲) ≡ 𝑞 < (𝐱 − 𝐲) > (7) where the last equality serves as the definition of an expectation value under the probability defined by Eqs. (5) and (6). It follows then that 𝜓 𝑞 (𝐱) = exp (− ∫ 𝑉 𝑝 (𝐱) 𝑝 𝑑𝑝) 𝑞0 , (8) demonstrating that the potential 𝑉 𝑞 (𝐱) contains the information of the Parzen distribution function through its dependence on the scale parameter q . This implies that 𝜓 𝑞 (𝐱) = 𝜓 𝑞 (𝐱) exp (− ∫ 𝑉 𝑝 (𝐱) 𝑝 𝑑𝑝) 𝑞 𝑞 for 𝑞 < 𝑞 , relating low-resolution to high-resolution, in contradistinction to Eq. (4) which describes straight-forward relations of high-resolution to low resolution. Based on Eqs. (2) and (7) one can derive [3] the following differential equation − ∇ 𝜓 𝑞 + 𝑉 𝑞 𝜓 𝑞 = d2 𝜓 𝑞 (9) which is a Schrődinger equation obeyed by 𝜓 with the potential 𝑉 𝑞 (𝐱) in a d-dimensional Euclidean space. This coincides with the formalism of Quantum Clustering (QC) [4] without invoking any quantum mechanical interpretation. Entropy and the Weight-Shape decomposition . Following [3], we define an entropy function at point 𝐱 𝐻 𝑞 (𝐱) = − ∫ 𝑑𝐲 𝑄(𝒚) 𝑝 𝑞 (𝐱|𝐲) log 𝑝 𝑞 (𝐱|𝐲) = −< log 𝑝 𝑞 (𝐱|𝐲) > (10) and note that it can be rewritten as 𝐻 𝑞 (𝐱) = log 𝜓 𝑞 (𝐱) + 𝑉 𝑞 (𝐱) (11) Eq.(11) has an analog in statistical mechanics [5], where H is the entropy, V is the average energy of a canonical ensemble, and 𝜓 is its partition function. Also Eq. (7) has an analog in statistical mechanics. The partition function of the canonical ensemble is 𝑍 = ∑ 𝑒 −𝛽𝐸 𝑖 𝑖 and its average energy is 𝐸 𝑖 𝑒 −𝛽𝐸 𝑖 𝑖 = − 𝜕𝜕𝛽 log 𝑍 . Replacing 𝑍 by 𝜓 and 𝛽 by q, we find again that the potential V is analogous to average energy . Note, however, that the scalar fields 𝜓, 𝑉 and 𝐻 are functions of space, rather than constants of a physical system [3]. Employing the functions
𝑊(𝐱) = 𝑒
𝐻(𝐱) and
𝑆(𝐱) = 𝑒 −𝑉(𝐱) we rewrite Eq. (11) as 𝜓 𝑞 (𝐱) = 𝑊 𝑞 (𝐱)𝑆 𝑞 (𝐱) , (12) which has been named [3] the Weight-Shape decomposition of 𝜓(𝐱) . Since 𝑉(𝐱) ≥ 0 , it follows that
𝑆(𝐱) ≤ 1. Vector Fields
Let us introduce vector fields in order to deal with the locations of extrema of the scalar fields 𝜓 𝑞 (𝐱), 𝑉 𝑞 (𝐱) and 𝐻 𝑞 (𝐱) . We define the vector field D as 𝐃 = −∇log𝜓 𝑞 (𝐱) = 2𝑞 < 𝐱 − 𝐲 > (13) Applying the grad operator to the potential we find (see Appendix) ∇𝑉 𝑞 (𝐱) = 𝐃 + V𝐃 − 𝐄 where 𝐄 = < ( 𝐱 − 𝐲 ) > . (14) Finally, we find that the remaining interesting vector, the gradient of entropy, obeys ∇𝐻 𝑞 (𝐱) = ∇𝑉 𝑞 (𝐱) + ∇log𝜓 𝑞 (𝐱) = V𝐃 − 𝐄 . (15) The two new vector fields allow us to exhibit the different locations of the three extrema of the scalar fields. Local maxima (and minima) of the probability function occur at D =0, maxima (and minima) of entropy occur where 𝐄 = V𝐃 and minima (and maxima) of the potential (or maxima and minima of Shape) occur where
𝐄 = V𝐃 + 𝐃 . All extrema coincide if both D =0 and E =0 at the same value of x . Otherwise, one finds at maxima of the Parzen probability that ∇𝑉 𝑞 (𝐱) = −𝐄 . From Eq. (13) we conclude that 𝑞 𝜕𝜕𝑞 𝐃 = 𝐃 + 𝑉𝐃 − 𝐄 = ∇𝑉 (16) which implies that maxima of 𝜓 𝑞 , where 𝐃 = 0 , change with q in a direction determined by −∇𝑉 , i.e. toward close-by minima of V . Eq. (16) leads to a new interpretation of QC: clustering based on minimization of V may be interpreted as clustering based on stationarity of D with respect to changes in the scale-parameter q. Extrema of Scalar Fields
Examples of the behavior of log 𝜓 and of V are demonstrated below for a data set of 9000 observed galaxies (with red-shift in the domain 0.47 ± 0.005 ) regarded as points in spherical angles θ and φ within some limited range. Whereas for 𝜎 = = 2 (in angle degree units) the two fields exhibit many extrema, there exist clear differences for larger sigma, e.g. σ=10, where log 𝜓 has one maximum whereas V maintains several minima. A. Loci of 9000 Galaxies within some range of spherical angles. B. log 𝜓 (top) and V (bottom) displayed over the data plane A, using σ=2. C. Surfaces of V and log 𝜓 for increased values of the Gaussian widths. Fig.1. Demonstration of different structures of log 𝜓 and of V for a set of galaxies downloaded from the Sloan Digital Sky Server DR12. The different extrema can be used for clustering formation. Clustering methods based on probability maximization are known as Mean Shift algorithms [6] [7]. QC [4] is a clustering method based on otential minimization. They were generalized in [3] into three families of algorithms, together with another method based on entropy maximization. A hierarchical structure of V at various scales for any complex data problem, can be obtained by employing Hierarchical Maximum Shape Clustering (HMSC), as defined in [3]. Information regarding the different extrema of 𝜓 and 𝑉 can be united into one equation by considering another scalar function 𝑈 𝑞 (𝐱) ≡ 𝐃 = (2𝑞 < 𝐱 − 𝐲 >) . (17) Using the relations spelled out above, it follows that the condition 𝑞 𝜕𝜕𝑞 𝑈 𝑞 (𝐱) = −2∇log𝜓 𝑞 (𝐱) ∙ ∇𝑉 𝑞 (𝐱) = (18) implies that either 𝜓 or 𝑉 (or both ) reach their extrema at the x values for which Eq. (18) holds. U is non-negative, therefore U =0 is a minimum of q . It corresponds to extrema of 𝜓 which are associated with 𝐃 = 0.
Other values of U which obey eq. (18) are associated with extrema of 𝑉 which occur whenever 𝜕𝜕𝑞 𝐃 𝑞 = 0. Eq. (18) is a concise statement regarding the points of interest in our data-analysis method. In analogy with statistics, one may view this equation as an inference method finding the correct parameter q which leads to points of interest at given values of x . An interesting comparison can be made with the score function [8] in statistics, defined by 𝑢(𝜃)= 𝜕𝜕𝜃 log 𝑓(𝐱|θ) for a probability function f( x ) which depends on the parameter 𝜃 . The average of u vanishes and its variance is known as Fisher's information [9]. The analogy of 𝑓 with 𝜓 and u with 𝑉 is incomplete, because the integral of 𝜓 is not normalized to 1. As a result, whereas the expectation value of u vanishes, the analogous statement in our analysis becomes ( 𝑞𝜋 ) d2 ∫ 𝑑𝐱 𝜓 𝑞 (𝐱)𝑉 𝑞 (𝐱) = d2 . (19) Although all extrema may be regarded as points of interest, some are of more interest than others: extrema that remain fixed in x for a range of q values which is large compared with that of other points of interest. This criterion, introduced by Roberts [10], allows for searching for scales which correspond to important properties of the data. Thus it sub-serves the search for good clustering of the data [3,4,10]. Relation to the Diffusion Equation
The diffusion equation has served as the basis for various data analysis schemes such as the scale-space approach [11] and harmonic analysis expansion [12]. For the sake of completeness, we demonstrate in this section how it fits into our analysis. The Gaussian function 𝐺 𝑡 (𝐱) = ( 𝑞𝜋 ) d/2 exp(−𝑞 𝐱 ) = ( 𝑞𝜋 ) d/2 𝐾 𝑞 (𝐱) with 𝑞 = (20) serves as the Green's function of the diffusion equation 𝜕𝜕𝑡 𝐺 𝑡 (𝐱) + ∇ 𝐺 𝑡 (𝐱) = 0 . (21) It follows that the solution to the diffusion equation with the boundary condition 𝜑 (𝐱) = 𝑄(𝐱) at t= (4𝑡𝜋) −d/2 𝜑 𝑡 (𝐱) , where 𝜑 𝑡 (𝐱) = 𝜓 𝑞 (𝐱) for 𝑞 = . (22) The factor (4𝑡𝜋) −d/2 is needed to guarantee the correct limit at t=
0. Interesting conclusions are the connections with the Schrődinger equation (9), the potential 𝑉 𝑞 (𝐱) of Eq. (7) and the Weight-Shape decomposition (12). Similarly, we obtain 𝑣 𝑡 (𝐱) = 𝑉 𝑞 (𝐱) and ℎ 𝑡 (𝐱) = 𝐻 𝑞 (𝐱) for 𝑞 = . (23) Let us discuss a few examples. The first is ( 𝑞𝜋 ) d/2 𝜑 𝑡 (𝐱) = 𝐺 𝑡 (𝐱) where 𝑣 𝑡 (𝐱) = (𝐱) /4𝑡 , ℎ 𝑡 (𝐱) = 𝐻 𝑞 (𝐱) = 0, and W =1 which is to be expected from Q= 𝛿(𝐱). The second is derived from of a one-dimensional step-function Q= 𝜃(𝑥)𝜃(𝐿 − 𝑥)/𝐿. This leads to 𝜓(𝑥) = ∫ exp (− (𝑥−𝑦) ) 𝑑𝑦 𝐿0 = √𝜋𝑡 [erf ( 𝑥√4𝑡 ) − erf ( 𝑥−𝐿√4𝑡 )] . (24) This function has been displayed as Fig. 2 in [3], together with its W and S components, demonstrating that Shape accentuates the edge effects of the distribution. A third example is the plane wave solution of the diffusion equation: 𝜑 𝑡 (𝐱) = exp (𝑖𝒌 ∙ 𝐱 − 𝒌 𝑡) . It can be generated from 𝑄 = exp (𝑖𝒌 ∙ 𝐲) , and it leads to 𝑣 𝑡 (𝐱) = (𝐱) , ℎ 𝑡 (𝐱) = 𝑖𝒌 ∙ 𝐱 and, correspondingly, W = exp(𝑖𝒌 ∙ 𝐱) , 𝑆 = exp(−𝒌 𝑡) . Here the WS decomposition exhibits a separation of variables, with W and S being separate x and t dependent functions. Replacing the plane wave by a wave packet, 𝑄 = exp (𝑖𝒌 ∙ 𝐲 − (𝐲∆) /2) corresponding to average momentum 𝒌 and Gaussian widt ∆ , will result in 𝜑 𝑡 (𝐱) = exp[(𝑖𝒌 ∙ 𝐱 − 𝒌 𝑡 − 2𝑡∆ 𝐱 ) /(1 + 2𝑡∆ )] which no longer displays any separation of variables. Conclusions
The three equations (7) (13) and (16) define the behavior and inter-relations of the potential and the Parzen probability in both x and 𝑞 = : - Maxima (and minima) of the probability 𝜓 𝑞 (𝐱) occur where D vanishes, i.e. when < 𝐱 − 𝐲 > = 0 . - Minima (and maxima) of the potential 𝑉 𝑞 (𝐱) occur when ∇< (𝐱 − 𝐲) > = 0 , which is also when the variation of D with respect to q vanishes, 𝜕𝜕𝑞 𝐃 𝑞 (𝐱) = 0 . - Maxima of the Parzen probability 𝜓 𝑞 (𝐱) and minima of the potential 𝑉 𝑞 (𝐱) will only seldom co-occur. This happens if both vectors D and E vanish, i.e. if both < (𝐱 − 𝐲) > = 0 and < (𝐱 − 𝐲) > = 0 . The relation between these two types of extrema is reminiscent of the relation between phase-velocity 𝜔/𝑘 and group-velocity 𝜕𝜔/𝜕𝑘 of waves with frequency 𝜔 and wave-number k : whereas the maxima of 𝜓 𝑞 (𝐱) obey 𝐃 = 0 , the minima of 𝑉 𝑞 (𝐱) obey 𝜕𝐃/𝜕𝑞 = 0 , which is a very relevant condition for interesting loci. Different locations of point-minima of 𝑉(𝐱) , for fixed q , are, by reasons of continuity, separated by maxima of 𝑉(𝐱) , which form manifolds of dimension d-1. If a minimum ∇𝑉(𝐱) = 0 occurs along a 1-dim string, its separating manifold has dimension d-2. These separating manifolds serve to define clusters of data points. The latter can be calculated by manipulations of replica points, as in [4] and in [3]. Analogous statements hold for extrema of 𝜓 and of H , leading to different clustering methods [3]. We defined the points of interest to be the extrema of 𝜓 𝑞 (𝐱) and 𝑉 𝑞 (𝐱). All these extrema occur when the indicator function U is stationary with respect to q, 𝜕𝜕𝑞 𝑈 𝑞 (𝐱) ≡ 𝜕𝜕𝑞 (2𝑞 < 𝐱 − 𝐲 >) = 0 . In the investigation of data in the format of Eq. (1), particular attention should be given to extrema which remain stable for quite a large range of q . The latter should reveal important internal structure of these data. [10] Points of interest may also accumulate along 1- or 2-dim structures in data-space. Such phenomena may exist in big data, where a panorama of hills (well-separated Gaussian structures) may be replaced by an alpine crest in data-space [13]. Eq. (24) is an example of a linear string that can serve as an attractor of QC in data-space [3].
Appendix: Mathematical Relations
From the various definitions we can derive the following relations 𝑞 𝜕𝜕𝑞 𝜓 = − 𝜓 𝑞 𝜕𝜕𝑞 log 𝜓 = 𝑉 𝜓 ∇ 𝜓 = 𝐃 𝜓 ∇< 𝑓 > = 𝐃 < 𝑓 > + < ∇𝑓 > −2𝑞 < (𝐱 − 𝐲)𝑓 > 𝑞 𝜕𝜕𝑞 < 𝑓 > = 𝑉 < 𝑓 > +< 𝑞 𝜕𝜕𝑞 𝑓 > −𝑞 < (𝐱 − 𝐲) 𝑓 > Thus ∇𝑉 𝑞 (𝐱) = 2𝑞 < (𝐱 − 𝐲) >< 𝐱 − 𝐲 > +2𝑞 < 𝐱 − 𝐲 > − < (𝐱 − 𝐲) > which leads to Eq. (14). Further derivatives of the potential are 𝑞 𝜕𝜕𝑞 𝑉 = 𝑉 + 𝑉 − 𝑀 where 𝑀 = 𝑞 < (𝐱 − 𝐲 ) > i.e. 𝑞 𝜕𝜕𝑞 𝑉 = 𝑞 < (𝐱 − 𝐲) > +(𝑞 < (𝐱 − 𝐲) > ) − 𝑞 < (𝐱 − 𝐲 ) > and 𝑞 𝜕𝜕𝑞 𝛁𝑉 = 𝑞 𝜕𝜕𝑞 (𝐃 + V𝐃 − 𝐄) = (𝐃 + 𝑉𝐃 − 𝐄)(1 + 𝑉) + 𝐃(𝑉 + 𝑉 − 𝑀) − 𝑞 𝜕𝜕𝑞 𝐄 = (1 + 𝑉)∇𝑉 + +𝐃(𝑉 + 𝑉 − 𝑀) − (
2𝐄 +
𝑉𝐄 − 2𝑞 < (𝐱 − 𝐲 ) > ) . Note that 𝑞 𝜕𝑞 𝐃 𝑞 (𝐱) = 𝑞 𝜕𝜕𝑞 𝛁𝑉 − 𝛁𝑉 . Near the potential minimum this becomes 𝑞 𝜕𝑞 𝐃 𝑞 (𝐱) ≈ 𝐃(𝑉 + 𝑉 − 𝑀) − (
2𝐄 +
𝑉𝐄 − 2𝑞 < (𝐱 − 𝐲 ) > ) Low resolution limits.
Expanding the Gaussian kernel in q , one may express 𝜓 𝑞 (𝐱) in terms of a Taylor series of weighted moments of the data, 𝜓 𝑞 (𝐱) = ∑ (−𝑞) 𝑛 𝑛!𝑛 ∫ 𝑑𝐲 𝑄(𝐲) (𝐱 − 𝐲) , which, for the explicit data representation of Eq. (1) becomes 𝜓 𝑞 (𝐱) = ∑ (−𝑞) 𝑛 𝑛! ∑ 𝐼 𝑖 𝑖𝑛 (𝐱 − 𝛍 𝑖 ) . In particular, in the asymptotic limit of low q one finds 𝜓 𝑞 (𝐱) ≈ 1 − 𝑞𝑁 ∑ 𝐼 𝑖 (𝐱 − 𝛍 𝑖 )
2 𝑖 . The first term corresponds to the low limit of W ≈ 1 + 𝑂(𝑞 ) and the second to the low limit of S ≈ 1 − 𝑉 ≈ 1 − 𝑞 ∑ 𝐼 𝑖 (𝐱 − 𝛍 𝑖 ) /𝑁 + 𝑂(𝑞 ) 𝑖 . Acknowledgments
I thank Isaac Meilijson and Zeev Schuss for helpful discussions, and Itay Fisher for his help with numerical data analysis. This research was partially funded by the Blavatnik Cyber Center of Tel Aviv University.
References [1] E. Parzen, On estimation of a probability density function and mode, The annals of mathematical statistics 33 (3) (1962) 1065-1076. [2] D. Xu, Y. Tian, A comprehensive survey of clustering algorithms, Annals of Data Science 2 (2) (2015) 165-193. [3] L. Deutsch and D. Horn: The Weight-Shape Decomposition of Density Estimates: A Framework for Clustering and Image Analysis Algorithms.
Pattern Recognit.
81 (2018) 190-199. [4] D. Horn and A. Gottlieb. Algorithm for data clustering in pattern recognition problems based on quantum mechanics.
Phys. Rev. Lett.