Method of fractal diversity in data science problems
© 2018 V.V. Vladimirov, E.V. Vladimirova [email protected]
Lomonosov Moscow State University, Moscow, Russia [email protected]
A signal-to-noise ratio (SNR) parameter is obtained that distinguishes the Gaussian function, the distribution of random variables in the absence of cross-correlation, from other functions; this makes it possible to describe collective states with strong cross-correlation of data. The SNR is defined in one-dimensional space, and a calculation algorithm based on the fractal manifold of Cantor dust on a closed loop is given. The algorithm is invariant under linear transformations of the initial data set, has renormalization-group invariance, and determines the intensity of the cross-correlation (collective effect) of the data. The description of the collective state is universal and does not depend on the nature of the data correlation, just as the distribution of random variables is universal in the absence of data correlation. The method is applicable to large sets of non-Gaussian, or strange, data obtained in information technology. In support of Koshland's hypothesis, applying the method to the intensity data of digital X-ray diffraction spectra and calculating the collective effect makes it possible to identify the conformer that exhibits biological activity.
Key words:
Cantor's fractal dust, collective effect, strange kinetics, biological activity.
The key steps in the derivation of the formula for the signal-to-noise ratio, which allows a quantitative comparison, are given below. Fractal Cantor dust, or a geometric progression with an arbitrary ratio $q$ (in the classical Cantor set $q = 2/3$), has the form:
$$F = 1 - (1-q) - (1-q)q - (1-q)q^2 - (1-q)q^3 - \cdots$$ [1]

A method for constructing a fractal manifold is proposed. The fractal manifold for $n = 5$ of an arbitrary set of five ordered numbers $X_i$ has the form:

$$\hat{X}(q,5) = X_1 - (1-q)X_2 - (1-q)qX_3 - (1-q)q^2X_4 - (1-q)q^3X_5 - (1-q)q^4X_1 - (1-q)q^5X_2 - (1-q)q^6X_3 - \cdots$$ [2]

With each fractal cycle $k$, where $k \in \mathbb{N}$, a new value $X_k$ from the sample of non-Gaussian data $X$ appears, and the sequence then continues along the closed contour. The left and right directions around the contour are distinguished. In general form:

$$\hat{X}_{iR}(q,n) = X_i - \frac{1-q}{1-q^{n+1}}\left[\sum_{k=0}^{n} q^{k} X_{\mathrm{mod}(i+1+k,\,n+1)}\right]$$ [3]

Similarly, for $\hat{X}_{iL}(q,n)$:

$$\hat{X}_{iL}(q,n) = X_i - \frac{1-q}{1-q^{n+1}}\left[\sum_{k=0}^{n} q^{n-k} X_{\mathrm{mod}(i+k,\,n+1)}\right]$$ [4]

The sets $\{\hat{X}_{iR}(q,n) - \hat{X}_{iL}(q,n)\}$ and $\{\hat{X}_{iR}(q,n) + \hat{X}_{iL}(q,n)\}$ form fractal manifolds. The signal-to-noise ratio is defined as:

$$SNR(q,n) = \frac{S(q,n)}{N(q,n)} = \frac{\sum_i \left(\hat{X}_{iR}(q,n) - \hat{X}_{iL}(q,n)\right)^2}{\sum_i \left(\hat{X}_{iR}(q,n) + \hat{X}_{iL}(q,n)\right)^2}$$ [5]

The Gaussian and Bessel functions are unique in that the signal-to-noise ratio [5] does not depend on the value of $n$. When the data are approximated by Bessel functions, the collective effect is not manifested. When non-Gaussian data are modeled by the half-wave $X_i = \sin(\pi i/n)$, which is used in calculations with a preliminary approximation of the data by a finite Fourier series, the signal-to-noise ratio for sufficiently large $n$ is approximated by an expression in the factors $(1-q)$, $(1+q)$, $(n-3)$ and the series $(1 + 4q + \cdots)$ [6]. This and the other formulas are given in a form suitable for Mathcad.
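The closed-loop sums [3], [4] and the ratio [5] can be sketched in Python as follows. This is a minimal illustration, not the authors' Mathcad code: it assumes the reading in which each sum runs over one full loop of length `len(x)` (written n + 1 in the text) with weights q^k and 0 <= q < 1, and the names `fractal_sums` and `snr` are hypothetical.

```python
def fractal_sums(x, q):
    """Right and left closed-loop fractal sums for every start index i.

    Assumed reading of [3] and [4]: the infinite walk around the loop is
    summed as a geometric series, which gives the factor
    (1 - q) / (1 - q**m) with m = len(x).
    """
    m = len(x)
    factor = (1.0 - q) / (1.0 - q ** m)
    right, left = [], []
    for i in range(m):
        # Walk to the right: x[i+1], x[i+2], ... with weights 1, q, q**2, ...
        r = sum(q ** k * x[(i + 1 + k) % m] for k in range(m))
        # Walk to the left: x[i-1], x[i-2], ... with the same weights.
        l = sum(q ** k * x[(i - 1 - k) % m] for k in range(m))
        right.append(x[i] - factor * r)
        left.append(x[i] - factor * l)
    return right, left


def snr(x, q):
    """Ratio [5]: squared differences over squared sums of the two walks."""
    right, left = fractal_sums(x, q)
    signal = sum((r - l) ** 2 for r, l in zip(right, left))
    noise = sum((r + l) ** 2 for r, l in zip(right, left))
    return signal / noise  # undefined (division by zero) for constant data


if __name__ == "__main__":
    data = [0.3, 1.2, 0.7, 2.5, 1.9, 0.4]  # illustrative values only
    print(snr(data, 0.5))
    print(snr([3.0 * v + 7.0 for v in data], 0.5))  # same ratio after a linear map
```

Because the weights of each walk add up to one, adding a constant to the data or rescaling them leaves the ratio unchanged, which is the invariance under linear transformations stated in the abstract.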
An analogous approximate expression in the factors $(1-q)$, $(1+q)$, $(n-3)$ and the series $(1 + 4q + \cdots)$ holds [7], and

$$SNR(q,n) = (1 - q(n))(n - 3)$$ [8]

We require that $SNR(q,n)$ satisfy the invariance condition, which brings strange data close to Gaussian data:

$$\frac{d}{dn} SNR(n, q(n)) = 0$$ [9]

The solution of this differential equation has the form:

$$q(n) = 1 - \frac{C}{n-3}$$ [10]
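The role of the constant in [10] can be illustrated with a short Python sketch. It assumes the readings SNR = (1 - q)(n - 3) for [8] and q(n) = 1 - C/(n - 3) for [10]; the function names and the descriptor sizes 400 and 650 are hypothetical and serve only as an example.

```python
def q_of_n(n, c):
    """Ratio q(n) under the assumed reading of [10]: q = 1 - C / (n - 3)."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    return 1.0 - c / (n - 3)


def model_snr(n, q):
    """Model ratio under the assumed reading of [8]: (1 - q) * (n - 3)."""
    return (1.0 - q) * (n - 3)


def scale_constant(n_cr1, n_cr2):
    """Common scale C = min(n_cr1, n_cr2) - 3 for two data sets."""
    return min(n_cr1, n_cr2) - 3


if __name__ == "__main__":
    c = scale_constant(400, 650)       # hypothetical critical sizes
    for n in (400, 500, 650):
        q = q_of_n(n, c)
        print(n, q, model_snr(n, q))   # the model ratio stays equal to C
```

With this choice of the constant, q vanishes at the smaller critical size, so the preliminary q = 0 calculation and the rescaled calculation agree at that point.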
The choice of the constant $C$ determines the scale of the signal-to-noise ratio. Preliminary calculations are performed for $q = 0$ using formulas [15]-[17]. At this preliminary stage, when different sets of ordered data are compared, the critical descriptor sizes $n_{cr1}$, $n_{cr2}$ are obtained, which give the maximum collective states in the data sets. The value $C = [\min(n_{cr1}, n_{cr2}) - 3]$ is then substituted into formula [10], and $SNR(\max(n_{cr1}, n_{cr2}))$ is calculated more accurately, taking into account the $C$-invariance ([11]-[14]). A comparison of the SNR values of different data sets is correct when the calculation is performed on a single scale $C$. The peaks of $SNR(x_i, C)$ indicate the presence of structure in the data variable $x$ and mark a neighborhood of the collective state. The concept of a critical, or collective, state is characteristic of the strange kinetics approach and denotes a cluster of degrees of freedom with strong correlation. The approximation parameters of the finite Fourier series and the descriptor size $n$ are determined from the condition that the function is at its maximum, that is, that the collective state of the system is captured, when the ordered data are analyzed with a single step. In matrix form, the renormalization-invariant formulas for the signal-to-noise ratio are:

$$SNR(q,n) = \frac{X \cdot SX}{X \cdot NX}$$ [11]
$$S = -\left(\mathrm{matrix}(n+1, n+1, a) - \mathrm{matrix}(n+1, n+1, a)^{T}\right)^{2}$$ [12]

$$N = \left[2\,\mathrm{identity}(n+1) - \left(\mathrm{matrix}(n+1, n+1, a) + \mathrm{matrix}(n+1, n+1, a)^{T}\right)\right]^{2}$$ [13]

where

$$a(i,k) = \frac{1-q}{1-q^{n+1}}\, q^{\mathrm{mod}(k-i+n,\,n+1)}$$ [14]

Formulas [11]-[14] are equivalent to formulas [3]-[5] and are convenient for programming.
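A minimal Python sketch of the matrix form is given here for illustration. It assumes that [12] and [13] are the squares of the antisymmetric and symmetric combinations of the coefficient matrix, so that the quadratic forms in [11] reduce to squared vector norms; `coefficient` and `snr_matrix` are hypothetical names, and plain lists are used instead of a matrix library.

```python
def coefficient(i, k, q, m):
    """Entry a(i, k) of [14] for a loop of m = n + 1 points (0-based indices)."""
    return (1.0 - q) / (1.0 - q ** m) * q ** ((k - i + m - 1) % m)


def snr_matrix(x, q):
    """Ratio [11] computed as |(A - A^T) x|^2 / |(2I - A - A^T) x|^2."""
    m = len(x)
    a = [[coefficient(i, k, q, m) for k in range(m)] for i in range(m)]
    # Antisymmetric part applied to the data: the "signal" vector.
    diff = [sum((a[i][k] - a[k][i]) * x[k] for k in range(m)) for i in range(m)]
    # Symmetric part applied to the data: the "noise" vector.
    summ = [2.0 * x[i] - sum((a[i][k] + a[k][i]) * x[k] for k in range(m))
            for i in range(m)]
    signal = sum(v * v for v in diff)
    noise = sum(v * v for v in summ)
    return signal / noise


if __name__ == "__main__":
    data = [1.0, 4.0, 2.0, 7.0, 3.0]  # illustrative values only
    print(snr_matrix(data, 0.0))      # q = 0: nearest-neighbour differences
    print(snr_matrix(data, 0.5))
```

For q = 0 the coefficient matrix becomes a cyclic shift, so the signal vector holds the differences between the two neighbours of each point and the noise vector holds the second differences.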
In the calculations, a symmetric closed-loop vector is constructed from $K = n/2 + 1$ unique ordered spectrum data:

$$X = (X_1, X_2, X_3, \cdots, X_{K-1}, X_K, X_{K-1}, \cdots, X_3, X_2)$$ [15]

For $q = 0$, taking into account the symmetry of the matrices $S$ and $N$, formulas [12]-[13] for the signal-to-noise ratio take a form suitable for processing big data:

$$\frac{S}{2} = X_1(X_1 - X_3) + X_2(X_2 - X_4) + \sum_{i=3}^{K-2} X_i(-X_{i-2} + 2X_i - X_{i+2}) + X_{K-1}(-X_{K-3} + X_{K-1}) + X_K(-X_{K-2} + X_K)$$ [16]

$$\frac{N}{2} = X_1(3X_1 - 4X_2 + X_3) + X_2(-4X_1 + 7X_2 - 4X_3 + X_4) + \sum_{i=3}^{K-2} X_i(X_{i-2} - 4X_{i-1} + 6X_i - 4X_{i+1} + X_{i+2}) + X_{K-1}(X_{K-3} - 4X_{K-2} + 7X_{K-1} - 4X_K) + X_K(X_{K-2} - 4X_{K-1} + 3X_K)$$ [17]

When the SNR values are compared with the ordering scale, the scale is shifted to the left by the descriptor size $K$. An ordered data set, after a preliminary approximation by a finite Fourier series with maximum harmonic $m$, is analyzed by a descriptor of size $K$ with a single step. The sum $\sum SNR(K, m)$ is calculated by passing through all points of the data set. The objective function in the search for the parameters $K$ and $m$ is $\max\left(\sum SNR(K, m)\right)$. As already noted, a correct comparison of the structural characteristics $SNR$ of different data sets should be carried out on a single scale $C$ with allowance for the $C$-invariance ([10]-[14]), much as measurements made in centimeters and in inches must be brought to one unit before they are compared. The method is used for large data sets obtained with good resolution, which makes it possible to increase the comparison scale $C$ while preserving the invariance. In order of magnitude, in the problem with conformers the X-ray analysis spectrum contains 2250 values in total, the optimal descriptor size for the given resolution is $K = 500$, and the maximum harmonic of the finite Fourier series is $m = 3$.

In chemistry, the collective state is called the flexibility or mobility of molecular fragments. Koshland's induced-fit hypothesis of the origin of biological activity, based on the assumption that the active center of the enzyme is flexible, satisfactorily explains the action of enzymes. As the substrate approaches the active center of the enzyme, a conformational restructuring occurs synchronously in the enzyme molecule and affects a large number of degrees of freedom. The application of the computational method to the spectra of three conformers shows a significant increase in the collective effect for the conformer that is distinguished by its biological activity.
A similar example of the collective effect is manifested in the thermomechanical curve method for polymers with different molecular weights in the region of high elasticity. The application of the method in solving the problems of data science consists in the preliminary transformation of the original non-Gaussian data and comparing the degree of cross-correlation between the data.
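The q = 0 formulas [16] and [17] and the single-step descriptor scan can be sketched in Python as follows. The sketch assumes 1-based indices X_1 ... X_K in the formulas and the reflected loop (X_1, ..., X_K, X_{K-1}, ..., X_2); it omits the preliminary Fourier approximation, and all function names, as well as the choice to skip windows with zero noise, are hypothetical.

```python
def closed_loop(values):
    """Reflect K unique values into the loop (X1..XK, X(K-1)..X2) of [15]."""
    return list(values) + list(values[-2:0:-1])


def signal_noise_loop(values):
    """S and N at q = 0 as sums of squared differences around the loop."""
    y = closed_loop(values)
    m = len(y)
    s = sum((y[(j + 1) % m] - y[j - 1]) ** 2 for j in range(m))
    n = sum((y[j - 1] - 2 * y[j] + y[(j + 1) % m]) ** 2 for j in range(m))
    return s, n


def signal_noise_closed_form(values):
    """S and N at q = 0 from the K unique values only, as in [16] and [17]."""
    k = len(values)
    if k < 4:
        raise ValueError("at least four unique values are needed")
    v = [0.0] + list(values)  # v[1] .. v[k], matching the 1-based formulas
    half_s = (v[1] * (v[1] - v[3]) + v[2] * (v[2] - v[4])
              + sum(v[i] * (-v[i - 2] + 2 * v[i] - v[i + 2])
                    for i in range(3, k - 1))
              + v[k - 1] * (-v[k - 3] + v[k - 1])
              + v[k] * (-v[k - 2] + v[k]))
    half_n = (v[1] * (3 * v[1] - 4 * v[2] + v[3])
              + v[2] * (-4 * v[1] + 7 * v[2] - 4 * v[3] + v[4])
              + sum(v[i] * (v[i - 2] - 4 * v[i - 1] + 6 * v[i]
                            - 4 * v[i + 1] + v[i + 2])
                    for i in range(3, k - 1))
              + v[k - 1] * (v[k - 3] - 4 * v[k - 2] + 7 * v[k - 1] - 4 * v[k])
              + v[k] * (v[k - 2] - 4 * v[k - 1] + 3 * v[k]))
    return 2 * half_s, 2 * half_n


def total_snr(data, k):
    """Sum of SNR over all windows of k points, moved with a single step."""
    total = 0.0
    for start in range(len(data) - k + 1):
        s, n = signal_noise_closed_form(data[start:start + k])
        if n > 0:  # assumption: windows with zero noise are skipped
            total += s / n
    return total


if __name__ == "__main__":
    spectrum = [0.2, 1.5, 0.9, 2.4, 1.1, 0.3, 1.8]  # illustrative values only
    print(signal_noise_loop(spectrum))
    print(signal_noise_closed_form(spectrum))
    print(total_snr(spectrum, 5))
```

The closed form touches each unique value once and needs no matrix, which is why it suits long spectra; the loop version is kept only as a cross-check of the boundary terms.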