On the apparently fixed dispersion of size distributions
Sascha Vongehr a, * , **, Shaochun Tang**, and Xiangkang Meng** * Department of Physics, Nanjing University, Nanjing 210093, P.R. China ** National Laboratory of Solid State Microstructures, Department of Materials Science and Engineering, Nanjing University, Nanjing 210093, P.R. China
Probability density functions (PDF) of statistical distributions of cluster sizes N , where N is the number of particles in the cluster, often seem to have less freedom than expected from considering the number of degrees of freedom at the clusters’ source. The full width at half maximum appears to be comparable to the average < N >. Such a hidden symmetry is intriguing theoretically and practically impairs size selection towards narrow distributions. However, reviewing the example of Helium cluster beams demonstrates that the origin of the apparent fixing is the assumption that the distributions should be log-normal or exponential and the subsequent use of these functions to fit the data in n = ln( N ) log-space. This demands more care when using parametric statistics. Alternatives to the traditionally employed fitting functions are discussed. Keywords: size distributions, helium droplets, clusters, parametric statistics
PACC:
Introduction
Processes that further randomize the results of previous random processes give rise to typical statistical distributions. While decay or fractionation often leads to power laws and exponential (EXP) distributions, random growth processes like phase change aggregations mostly give rise to normal or lognormal (LN) results, be it in biology, economics or cluster physics. Clusters’ sizes can often be described by their mass or simply by the number $N$ of atoms or molecules inside the cluster. Sub- and supercritical expansions produce gas-condensation and fluid-fractionation clusters whose sizes $N$ seem LN and EXP distributed, respectively. For helium clusters He$_N$ (see reviews), the clusters’ sizes have a full width at half maximum (FWHM) proportional to the average size $\langle N\rangle$. Average and deviation are independent degrees of freedom (DOF) of a LN distribution. Nevertheless, for beams of He$_N$ from sub-critical expansions (i.e. gas condensations), LN size distributions and the restriction $\langle N\rangle/\mathrm{FWHM} = 1.12 \pm 0.05$ (1) have been established. Similar holds for the supercritical regime and the EXP function. Such results are, however, less than perfectly empirical because of the assumptions that went into the data collection. Besides assumptions regarding the physics of particular detection and collision cross sections, the main assumption of concern in this work is the employment of rather restricted fitting functions, i.e. the use of parametric statistics. We show that the latter is responsible for the apparent (i.e. not actually empirical) fixing of the dispersion relation. The DOF, which are here the source stagnation pressure $P$ (20-100 bar), temperature $T$ (10-30 K) and to some extent the nozzle size (diameter ~5 µm), seem reduced to the single parameter $\langle N\rangle$. The fixing of dispersion relations is already interesting, but what makes this subject intriguing is the apparent numerical coincidence between different expansion regimes.
a Corresponding author
It is very difficult to derive the size distribution within the complex physics of supersonic expansions, because the ideal gas law, for example, does not apply to cold helium. However, it is intriguing to notice that by cooling the cluster source down into the supercritical regime, the clusters become large fractionation clusters whose sizes seem EXP distributed. The EXP distribution is equi-dispersed, i.e. the standard deviation $\Delta N$ exactly obeys $\Delta N = \langle N\rangle$ (2). It is therefore often stated that “$\Delta N \approx \langle N\rangle$” holds for both expansion regimes. This would be very surprising and thus triggered our investigations. The size distribution may only be determined by where the expansion trajectory through pressure $P$ versus temperature $T$ phase space intersects the bi-nodal and from which side it does so: When climbing the bi-nodal along the vapor side (i.e. cooling the nozzle), $\langle N\rangle$ increases and continues to increase when turning around at the critical point and descending along the liquid side of the coexistence curve. Although fragmentation is quite the opposite of growth, “$\Delta N \approx \langle N\rangle$” suggests that cluster beam physics connects these regimes smoothly just by cooling the source, neglecting minor complications such as the supercritical expansion being somewhat bimodal due to re-condensing fragments. One may suspect that the proportionality between dispersion measures and averages has the same fundamental origin, or even that the LN distribution obtained when intersecting the bi-nodal from one side of the coexistence line can be derived from the EXP distribution present when intersecting from the other side. This makes the subject matter interesting for general cluster physics. A general coupling of the DOF in statistical growth processes would constitute a nuisance in need of exploration, because in nanotechnology, scaling up $\langle N\rangle$ and decreasing $\Delta N$ simultaneously (size selection) are often both promised to be mere technicalities.
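The equi-dispersion stated in Eq. (2) can be checked numerically. The following is a minimal sketch and not part of the original analysis: it draws exponentially distributed sizes with Python's standard library and compares the sample standard deviation with the sample mean. The function name `dispersion_ratio`, the mean size of $10^4$, the sample count and the seed are arbitrary choices made only for illustration.

```python
import math
import random

def dispersion_ratio(n_samples=100_000, mean_size=1.0e4, seed=0):
    """Return Delta N / <N> for sizes sampled from an exponential distribution.

    mean_size is a hypothetical <N>; Eq. (2) predicts a ratio close to 1
    regardless of its value.
    """
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    sizes = [rng.expovariate(1.0 / mean_size) for _ in range(n_samples)]
    mean = sum(sizes) / n_samples
    variance = sum((x - mean) ** 2 for x in sizes) / n_samples
    return math.sqrt(variance) / mean

if __name__ == "__main__":
    print(f"Delta N / <N> = {dispersion_ratio():.3f}")
```

Changing `mean_size` rescales all samples but should leave the ratio unchanged up to sampling noise, which is the sense in which the EXP function has no free dispersion parameter.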
This importance of so-called size selection makes a fixed deviation very significant, which motivated us to investigate the dispersion relation $d_N = (\langle N\rangle/\Delta N)^2$ (3) in detail theoretically and to review the usually employed methods that determine the parameters involved. In the following, we demystify the origin of the apparent fixing of the dispersion and again provide an overdue cautionary notice about the dangers of parametric statistics, as done before in the difficult context of distribution mixing. Because of the relevance to ongoing research and the ever important goal of size selection in cluster physics, we also provide alternative fitting functions and discuss their merits.
Introduction to Probability Density Functions
Our main subject is the origin of an apparent symmetry between different expansion regimes, and the transformation from $N$ to log-space $n = \ln N$ will be part of the explanation. It is therefore worthwhile to introduce the most convenient way to understand and manipulate probability density functions (PDF) of different distributions (like LN and EXP), together with their different expressions in $N$ and $n$ spaces, in one consistent notation.
First we introduce the variables: Consider a statistical variable $m$ in the real numbers with $m_{\min} \le m \le m_{\max}$ and average denoted as $\langle m\rangle$. The standard deviation $\Delta m$ is defined via $\Delta m^2 = \langle m^2\rangle - \langle m\rangle^2$. Any linear transformation $m = an + b$ leaves its normalized variable $\tilde{n} = (m - \langle m\rangle)/\Delta m$ untouched, meaning that $n = \Delta n\,\tilde{n} + \langle n\rangle$ for all $n$ and the parameter $a = \Delta m/\Delta n$ (4). The normalized variable has $\langle\tilde{n}\rangle = 0$ and $\Delta\tilde{n} = 1$. It helps to know that $n$ is the logarithm $n = \ln N$ of the number of atoms $N$ in the cluster and that $m = \ln M$ is often the logarithm of the cluster’s cross section. In that case, Eq. (4) is a fraction of spatial dimensions, so it will be insightful to conserve the difference between $m$ and $n$. The radii and cross sections of large clusters follow as $R = r_s N^{1/3}$ and $\sigma = \pi R^2$ with the liquid Wigner-Seitz radius of $r_s$ = 2.221 Å and 2.44 Å for $^4$He and $^3$He respectively. Nevertheless, these definitions are independent of the interpretations, and with the further definition $b = \ln B$, the equation $m = \ln(B N^{\Delta m/\Delta n})$ follows generally for any such $m$.
Next we introduce the statistical distributions: A convenient origin for a probability distribution is the cumulative probability $C$. Cumulative means that $C(m) = \int \mathrm{d}C = \int_{m_{\min}}^{m} (\mathrm{d}C/\mathrm{d}m)\,\mathrm{d}m$, i.e. the infinitesimal probability of any $m$ is $\mathrm{d}C$. The condition $C(m_{\max}) = 1$ automatically yields normalized distributions for all $m = an + b$ if they are expressed as $C(\tilde{n})$. The expectation value of any observable $\Psi$ is $\langle\Psi\rangle = \int \Psi\,\mathrm{d}C$, which, if boundaries are at infinity for example, equals $\langle\Psi\rangle = \int_{-\infty}^{+\infty} \Psi\,(\mathrm{d}C/\mathrm{d}\tilde{n})\,\mathrm{d}\tilde{n}$. With $(\mathrm{d}\tilde{n}/\mathrm{d}m) = \Delta m^{-1}$, the most likely value of $m$ (the “modal value” $m_{\mathrm{MODAL}}$) is the position of the maximum where $(\mathrm{d}^2C/\mathrm{d}m^2) = 0$, as long as a maximum exists away from the boundaries at $m_{\min}$ and $m_{\max}$. $M = e^m$ implies $\mathrm{d}M = M\,\mathrm{d}m$, and the probability density functions for $M$ are therefore simply equal to the ones for $m$ yet divided by $M$: PDF: $(\mathrm{d}C/\mathrm{d}M) = M^{-1}(\mathrm{d}C/\mathrm{d}m)$ (5)
The Size Distribution of Clusters from Sub-critical Expansions
When establishing the distribution of sizes $N$ of clusters from sub-critical expansions, the data are usually fitted with a LN distribution, which for general $M$ is easily obtained from the normal distribution (see appendix). The result is $\mathrm{PDF}_{\mathrm{LN}} = \frac{\mathrm{d}C}{\mathrm{d}M} = \frac{1}{M\,\Delta m\,\sqrt{2\pi}}\exp\left[-\frac{(m - \langle m\rangle)^2}{2\Delta m^2}\right]$ (6) and depends on two DOF, namely the fitting parameters $\langle m\rangle$ and $\Delta m$ in log-space $m = \ln M$. In the case of our cluster sizes $N$, the average $\langle N\rangle$ and the FWHM are then calculated (not measured) employing Eq. (11) etc. For Helium droplets He$_N$, the well-known proportionality of Eq. (1) follows. However, one should note something curious and important for the present work, namely that $d_M = (\langle M\rangle/\Delta M)^2 = (e^{\Delta m^2} - 1)^{-1}$ (7), i.e. the dispersion of the LN is independent of $\langle m\rangle$, almost as if one DOF disappeared! This is, however, only one of several aspects that make the FWHM an especially misleading measure. The FWHM can be expressed via (see derivation in appendix) $\mathrm{FW}_X\mathrm{M}/\langle N\rangle = \exp[-3\Delta n^2/2]\left(\exp\left[\Delta n\sqrt{-2\ln X}\right] - \exp\left[-\Delta n\sqrt{-2\ln X}\right]\right)$ (8)
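Eqs. (7) and (8) are easy to evaluate numerically. The sketch below is an illustration rather than the procedure used for the original data. It assumes a pure LN distribution, and the function names `ln_dispersion` and `fwxm_over_mean` are hypothetical. Neither function takes $\langle n\rangle$ as an argument, which makes the independence of both measures from $\langle n\rangle$ explicit. For $\Delta n = 0.61$ and $X = 1/2$ the ratio $\langle N\rangle/\mathrm{FWHM}$ comes out close to the 1.12 of Eq. (1).

```python
import math

def ln_dispersion(delta_n):
    """Eq. (7): d_N = (<N>/Delta N)^2 of a log-normal; depends on Delta n only."""
    return 1.0 / math.expm1(delta_n ** 2)

def fwxm_over_mean(delta_n, x=0.5):
    """Eq. (8): full width at fraction x of the maximum, divided by <N>."""
    s = delta_n * math.sqrt(-2.0 * math.log(x))
    return math.exp(-1.5 * delta_n ** 2) * (math.exp(s) - math.exp(-s))

if __name__ == "__main__":
    # Scan a few widths around the value 0.61 mentioned in the text.
    for delta_n in (0.50, 0.55, 0.61, 0.65, 0.70):
        print(f"Delta n = {delta_n:.2f}:"
              f"  <N>/FWHM = {1.0 / fwxm_over_mean(delta_n):.3f},"
              f"  d_N = {ln_dispersion(delta_n):.3f}")
```

Scanning $\Delta n$ in this way shows how weakly the FWHM-based ratio reacts compared with $d_N$ over the same range.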
Firstly, the choice of employing the FWHM, i.e. $X = 1/2$, determines the proportionality factor through $-2\ln X = \ln 4$. There is nothing fundamental about the choice $X = 1/2$, and $X = 0.44$ would have led to $\mathrm{FW}_X\mathrm{M} = \langle N\rangle$ exactly. Secondly, the data-derived fitting parameter $\langle n\rangle$ is actually not inside Eq. (8) at all. Thirdly, the proportionality varies weakly with $\Delta n$. In fact, the surprising relation in Eq. (1) would be equally true after setting the original data for $\Delta n$ to a constant 0.61 instead. All this is easily overlooked, and the reason is a combination of the following:
1) The $\langle n\rangle$-independence of the dispersion relation $d_N$ of LN distributions [from Eq. (7)] is hidden by the usage of $\langle n\rangle$ and $\Delta n$ as fitting parameters and the subsequent transformation (the move from $n$- to $N$-space) into two variables, namely $\langle N\rangle$ and FWHM, that depend strongly on $\langle n\rangle$.
2) The FWHM naturally scales with $\langle N\rangle$, is insensitive to the large “foot” of the distribution, and is an especially bad measure of dispersion for parameters like particle number $N$ that have an absolute minimum.
It can be convenient and insightful to manipulate large quantities with absolute zero points, like entropy, in log-space. However, one should avoid transforming back and forth, especially when introducing assumptions due to models at different stages, because a slight variation in $n$ corresponds to a large change in $N$. In the sub-critical range $\langle n\rangle = \ldots \pm \ldots$ of the helium experiments with continuous beams done to date, $\Delta n = \ldots \pm \ldots$ (9) holds. However, $\Delta n$ actually does depend on $\langle n\rangle$, for example: $\langle n\rangle \approx \ldots\,\Delta n\,\ldots$ (10). This in turn makes the dispersion $d_N$ dependent on $\langle n\rangle$; i.e. it is not as independent as Eqs. (7) and (1) suggest. Using the original data results in Figure 1, a plot of the square of the often presented $\langle N\rangle/\mathrm{FWHM}$ and of the dispersion relation $d_N$ [Eq. (3)], which one might erroneously assume to be quite similar or proportional to each other. Both are presented versus $\langle n\rangle$ along the horizontal axis. While the former measure stays surprisingly constant, the dispersion relation $d_N$ decreases strongly and illustrates the large experimental uncertainty better.
Fig 1:
Two measures of the deviation of cluster sizes $N$ are plotted versus the mean $\langle n\rangle$ of $n = \ln(N)$: The square of an often presented measure, namely the average $\langle N\rangle$ divided by the FWHM, remains fairly constant (white dots) while the dispersion relation $d_N$ decreases strongly with $\langle n\rangle$ (black dots).
Since parametric statistics is a convenient way of analyzing data, we need to offer alternative fitting functions. It has been said that the strong influence of atomic evaporation on Helium cluster sizes makes it immediately doubtful that one should assume simple size distributions. However, one should not think in terms of simplicity. It is insufficient to merely select PDF with more DOF. A PDF of a different kind with as many DOF may work better. For smaller droplets, the low density of the droplets' surface increases the geometrical cross section much over that of the simple liquid drop model, $\sigma_{\mathrm{l.d.}}$. This suggests that cross sections may not be LN distributed, yet, given the accuracy of experiments, $N$, $\sigma$, both or even neither (in case of a certain correction to the cross section) may deviate from a LN distribution. As long as $\Delta n \le 0.6$ (compare Eq. (9)), the Inverse Gaussian (IG) distribution traces the LN extremely well. If $M$ is the cross section, then $\Delta m = (2/3)\Delta n$ [Eq. (4)] is even smaller. Therefore, sub-critical cluster size distributions or their statistical ensemble of cross sections etc. could have been modeled with the IG’s $\mathrm{PDF}_{\mathrm{IG}} = \sqrt{\frac{d_M\langle M\rangle}{2\pi M^3}}\exp\left[-\frac{d_M\,(M - \langle M\rangle)^2}{2\langle M\rangle M}\right]$ all along. This facilitates the Poisson mixture necessary to improve the prediction of cluster scattering and impurity pick-up statistics, which is also involved in the experiments that established the size distributions, because they infer the size $N$ from the deflection of the cluster when hit by a probe beam of fast particles. The IG is useful for cluster physics because its Poisson mixture is a closed expression; the LN’s mixture is not.
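How closely the IG follows the LN can be quantified with a short numerical comparison. The sketch below is illustrative only. It assumes the standard IG density written in terms of the mean $\langle M\rangle$ and the dispersion $d_M = (\langle M\rangle/\Delta M)^2$ (i.e. shape parameter $d_M\langle M\rangle$), and it matches the LN to the same mean and the same $d_M$ via Eq. (7). The function names, the grid and the choice $\langle M\rangle = 1$ are hypothetical.

```python
import math

def ln_pdf(M, mean, d):
    """Log-normal density with mean <M> and dispersion d = (<M>/Delta M)^2."""
    delta2 = math.log(1.0 + 1.0 / d)        # Delta m^2, the inverse of Eq. (7)
    mu = math.log(mean) - 0.5 * delta2      # <m> from <M> = exp(<m> + Delta m^2 / 2)
    z2 = (math.log(M) - mu) ** 2 / (2.0 * delta2)
    return math.exp(-z2) / (M * math.sqrt(2.0 * math.pi * delta2))

def ig_pdf(M, mean, d):
    """Inverse Gaussian density with the same mean and dispersion (assumed form)."""
    prefactor = math.sqrt(d * mean / (2.0 * math.pi * M ** 3))
    return prefactor * math.exp(-d * (M - mean) ** 2 / (2.0 * mean * M))

def max_gap(delta_n, mean=1.0, steps=20_000, upper=20.0):
    """Largest |LN - IG| on a uniform grid, relative to the LN peak height."""
    d = 1.0 / math.expm1(delta_n ** 2)      # Eq. (7)
    grid = [upper * mean * (i + 1) / steps for i in range(steps)]
    peak = max(ln_pdf(M, mean, d) for M in grid)
    return max(abs(ln_pdf(M, mean, d) - ig_pdf(M, mean, d)) for M in grid) / peak

if __name__ == "__main__":
    for delta_n in (0.2, 0.4, 0.6):
        print(f"Delta n = {delta_n:.1f}: max relative gap = {max_gap(delta_n):.4f}")
```

The gap shrinks as $\Delta n$ decreases, so the agreement should be better still for cross sections, whose log-space width is smaller by the factor of Eq. (4).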
The LN and the IG are special cases of the Power IG distribution, which has one more DOF, namely the power $p$, with $p = 0$ and $p = 1$ yielding the LN and IG respectively. The Power IG can fit the distribution of $M$ also in the supercritical regime. Employing a distribution with one more DOF seems complicated, but one may fix the DOF via a relation like Eq. (10). The Generalized IG distribution (GIG) includes many others as special cases ($\Gamma$, Hyperbolic, IG, Reciprocal IG (RIG), …) and also allows a tractable Poisson mixture. In any case, the assumption of an LN in order to fit the data is not without alternative, and the alternatives are certainly not excluded by the data.
The Size Distribution of Clusters from Supercritical Expansions
In the supercritical regime, liquid helium fragments into droplets. At large $N$, the size distribution falls off exponentially. The average $\langle N\rangle$ determined by fitting an EXP to large clusters $N > \langle N\rangle$ results in a higher outcome than averaging all clusters (the fitted average is $(\ldots \pm \ldots)$ times the overall average $\langle N\rangle$; experiments were done with $\langle N\rangle \approx \ldots$, i.e. $\langle n\rangle \approx \ldots$ if one assumes an EXP distribution). This disagreement is due to detection cross sections. Only the decay slope of the signal’s logarithm for large $N$ above the average equals that of the original droplet distribution. Moreover, similarly large variations of 20% in calculated $\langle N\rangle$ at the same nozzle conditions argue for yet more unknown experimental artifacts. Considering that deflection experiments moreover incorporate the assumption of simple Poisson collision statistics, all one can deduce is that the original droplet size distribution is similar to a member of the exponential family. Using a linear exponential for $N$ gives it a special status: If $N$ is EXP distributed, $M$ is not. This is different from the LN, where $M$ is automatically LN distributed if $N$ is. Therefore, one may use the gamma ($\Gamma$) distribution Eq. (12) instead. For $d_N = 1$ it yields the EXP distribution $\mathrm{PDF}_{\mathrm{EXP}} = \langle N\rangle^{-1} e^{-N/\langle N\rangle}$, but it should be noted that it always has the exponential fall-off that is observed at high arguments $N$. Similar to what is encountered in the case of the LN distribution above, the assumption of a simple, linear EXP distribution introduces the fixed dispersion $d_N = 1$. This fixing can be expressed in log-space (see appendix), where $\Delta N = \langle N\rangle$ translates into $\Delta n = \pi/\sqrt{6} \approx 1.28$. This is maybe the clearest expression of the fact that the use of the EXP in order to fit data is not an innocent assumption, since $\Delta n$ is completely fixed by it. One should recall that $\Delta n$ is a data-derived fitting parameter in all the research concerning the sub-critical regime and that it varies along with $\langle n\rangle$. The value $\Delta n = 1.28$ implied by investigations of the supercritical regime is purely due to the fitting procedure, and it would be very surprising indeed if it were actually true, even when generously allowing large errors due to measurement inaccuracy.
Conclusion
We analyzed the apparent fixing of the dispersion of cluster size distributions and the numerical coincidence of that fixing across different expansion regimes. The origin of this curiosity lies partly in the use of parametric statistics, i.e. the strong influence of the assumed probability distributions used to fit the data. Focusing on the example of Helium droplets, it turned out that the assumption of an LN or EXP distribution in order to fit the data in log-space $n = \ln N$ is the actual origin of the apparent symmetry between the expansion regimes. The assumptions of the fitting functions are not innocent, and we have shown that alternative distributions are not excluded by the data sets and their experimental uncertainties. The LN, EXP and also the simple Poisson distribution (via the modeling of cluster scattering) are always implicit in the data. These distributions dominate practically without alternatives in the Helium droplet community, but this is partially a historical fact, stabilized by the circumstance that one cannot find any data that do not already depend on the implicit use of these assumptions when fitting curves. Mathematical convenience is not a sufficient justification for neglecting to consider other distributions, because using, for instance, an IG distribution in the sub-critical or Gamma functions in the supercritical regime can be similarly or even more convenient, for example when the distribution has to be folded with formulas for detection efficiencies, beam depletion, impurity pick-up etc. We recommend these alternative distributions. As a further conclusion of more practical value to the experimentalist, we cannot support the suspicion of a hidden symmetry that fixes the dispersion of size distributions, and the desired sharpening of size selection via manipulating the physical degrees of freedom at the cluster source should be approached with less pessimism.
Appendix
Normal and Lognormal Distributions
The normal probability’s cumulative is $C_{\mathrm{norm}} = \frac{1}{2}\left[1 + \mathrm{erf}\left(\tilde{n}/\sqrt{2}\right)\right]$. The normal distribution follows as $(\mathrm{d}C/\mathrm{d}\tilde{n}) = (2\pi)^{-1/2}\exp[-\tilde{n}^2/2]$. The modal equals the mean, $m_{\mathrm{MODAL}} = \langle m\rangle$. The distribution of $M$ is according to Eq. (5) just the one for $m$ divided by $M$ and thus leads to the LN distribution in Eq. (6). Most variables of interest, like the mean or modal, are most easily calculated by going back to the normal expressions. The expansion $M = e^m = \sum_{i=0}^{\infty} m^i/i!$ is necessary to express the $p$-th moment $\langle M^p\rangle = e^{p\langle m\rangle + p^2\Delta m^2/2}$ (11). Hence, the mean $\langle M\rangle$ is larger than the modal $M_{\mathrm{MODAL}} = e^{\langle m\rangle - \Delta m^2}$, and the moments always transform with a shift, $\langle M^p\rangle = \langle M\rangle^p e^{p(p-1)\Delta m^2/2}$. For example, given a LN cluster size distribution with number expectation $\langle N\rangle$, the expectation for the liquid drop model radius $R = r_s N^{1/3}$ becomes $\langle R\rangle = r_s\langle N\rangle^{1/3} e^{-\Delta n^2/9}$ and is smaller than $r_s\langle N\rangle^{1/3}$. At this point one can straightforwardly derive Eq. (7) or equivalently $\Delta m^2 = \ln[1 + 1/d_M]$.
A standard deviation is generally the preferred measure of deviation; the famous “two sigma” is the standard width. However, this can become problematic for strongly skewed functions and those peaked close to the limit of their support. Here, the length $2\Delta M$, when centered at the modal, likely reaches back below $M = 0$, i.e. $M_{\mathrm{MODAL}} - \Delta M$ may be negative. If $M$ cannot be negative, the FWHM is often employed instead. This popular measure of deviation is centered at Modal($M$) and extends laterally to where the PDF is only $X = 1/2$ of that maximum. This makes it difficult to relate it to standard deviations in the case of the LN: The “full width at $X$ maximum” is $\mathrm{FW}_X\mathrm{M} = (E_+ - E_-)$, with $E_\pm = \exp[e_\pm]$ and $e_\pm = \langle m\rangle - \Delta m^2 \pm \Delta m\sqrt{-2\ln X}$. From the normal distribution’s point of view, all this is unnecessary. $M < 0$ is excluded because no choice of measure stretches below $m = -\infty$. The general deviation $\mathrm{FW}_X\mathrm{M}$ is in $m$-space simply $(e_+ - e_-) = 2\Delta m\sqrt{-2\ln X}$ (but not centered at $\langle m\rangle$).
Exponential and Gamma Distributions
The regularized gamma ($\Gamma$) function is the cumulative: $C_\Gamma = 1 - \Gamma(d_N, d_N Z)/\Gamma(d_N)$. $Z$ is the normalized variable $Z = N/\langle N\rangle$. From this, the $\Gamma$ distribution $\mathrm{PDF}_\Gamma = \frac{\mathrm{d}C_\Gamma}{\mathrm{d}N} = \frac{(d_N Z)^{d_N}\,e^{-d_N Z}}{N\,\Gamma(d_N)}$ (12) is derived as shown in Section 2. For $d_N > 1$ there is a modal value, and the $\Gamma$ distribution looks similar to a LN in that case. $d_N = 1$ implies the cumulative probability $C_{\mathrm{EXP}} = C_\Gamma(d_N = 1) = 1 - e^{-Z}$. The distribution of $M$ follows as $(\mathrm{d}C_{\mathrm{EXP}}/\mathrm{d}M) = e^{-Z}(\mathrm{d}Z/\mathrm{d}M)$ or $(\mathrm{d}C_{\mathrm{EXP}}/\mathrm{d}M) = \frac{Z}{aM}e^{-Z}$ with $a = \Delta m/\Delta n$ from Eq. (4). We obtain $\mathrm{PDF}_{\mathrm{EXP}} = (\mathrm{d}C_{\mathrm{EXP}}/\mathrm{d}N) = e^{-Z}/\langle N\rangle$, and this is again just $n$’s distribution divided by $N$ [Eq. (5)], therefore $(\mathrm{d}C/\mathrm{d}n) = \frac{e^n}{\langle N\rangle}\exp\left[-e^n/\langle N\rangle\right]$. This could be called the “exp-exponential” distribution to be consistent with the usual “log” that is added to “normal”. While a monotonic exponential decline has no modal value (maximum), in $n$-space the modal equals $\ln\langle N\rangle$. Furthermore, $\langle N\rangle = e^{\langle n\rangle + \gamma}$ holds (similar to the LN’s $\langle N\rangle = e^{\langle n\rangle + \Delta n^2/2}$), where $\gamma$ is the Euler-Mascheroni constant: $\gamma = \lim_{G\to\infty}\left[-\ln G + \sum_{g=1}^{G} 1/g\right] \approx 0.5772$. Thus, $(\mathrm{d}C_{\mathrm{EXP}}/\mathrm{d}n) = \exp\left[n - \langle n\rangle - \gamma - e^{\,n - \langle n\rangle - \gamma}\right]$, and the standard deviation is derived to be $\Delta n = \pi/\sqrt{6} \approx 1.28$.
T. Ishii and M. Matsushita, J. Phys. Soc. Jpn.
L. Oddershede, P. Dimon, and J. Bohr, Phys. Rev. Lett.
D. E. Grady and M. E. Kipp, J. Appl. Phys.
B. L. Holian and D. E. Grady, Phys. Rev. Lett.
M. Villarica, M. J. Casey, J. Goodisman, and J. Chaiken, J. Chem. Phys.
C. R. Wang, R. B. Huang, Z. Y. Liu, and L. S. Zheng, Chem. Phys. Lett. 103 (1994)
E. Limpert, W. A. Stahel, and M. Abbt, BioScience 341 (2001)
J. P. Toennies and A. F. Vilesov, Angew. Chem. Int. Ed.
M. Barranco, R. Guardiola, S. Hernandez, R. Mayol, J. Navarro, and M. Pi, J. Low Temp. Phys.
F. Stienkemeier and K. K. Lehmann, J. Phys. B R127 (2006)
J. Harms, J. P. Toennies, and F. Dalfovo, Phys. Rev. B
T. Jiang and J. A. Northby, Phys. Rev. Lett.
E. L. Knuth and U. Henne, J. Chem. Phys. (5), 2664 (1999)
S. Vongehr, S. C. Tang, and X. K. Meng, Chin. Phys. B (2), 023602-1 (2010)
D. M. Brink and S. Stringari, Z. Phys. D: Atomic, Mol. Clusters 257 (1990)
M. Lewerenz, B. Schilling, and J. P. Toennies, Chem. Phys. Lett. 381 (1993)
S. Vongehr, USC Dissertation, ProQuest AAT 3219885 (Los Angeles, 2005)
V. Seshadri, Can. J. Stat. 131 (1983)
V. Seshadri, The Inverse Gaussian Distribution (Springer, New York, 1998)
G. E. Willmot, ASTIN Bulletin 59 (1986)
K. Iwase and K. Hirano, Jpn. J. Appl. Stat. 163 (1990)
T. Kawamura and K. Iwase, J. Jpn. Stat. Soc. 95 (2003)
H. S. Sichel, Information Processing & Management,28