Bayesian statistical analysis of hydrogeochemical data using point processes: a new tool for source detection in multicomponent fluid mixtures
Christophe Reype, Antonin Richard, Madalina Deaconu, Radu Stoica
BBayesian statistical analysis of hydrogeochemical data using point processes: anew tool for source detection in multicomponent fluid mixtures
C. Reype , A. Richard , M. Deaconu , and R. S. Stoica Universit´e de Lorraine, CNRS, Inria, IECL, F-54000 Nancy, France Universit´e de Lorraine, CNRS, GeoRessources, F-54000 Nancy, France Universit´e de Lorraine, CNRS, IECL, F-54000 Nancy, France
September 2020
Abstract
Hydrogeochemical data may be seen as a point cloud in a multi-dimensional space. Eachdimension of this space represents a hydrogeochemical parameter ( i.e. salinity, solute concentration,concentration ratio, isotopic composition...). While the composition of many geological fluids iscontrolled by mixing between multiple sources, a key question related to hydrogeochemical data setis the detection of the sources. By looking at the hydrogeochemical data as spatial data, this paperpresents a new solution to the source detection problem that is based on point processes. Resultsare shown on simulated and real data from geothermal fluids.
Introduction
The composition of many geological fluids is controlled by variable contributions of multiple sources( e.g. seawater, meteoric water, hydrothermal water). The knowledge of these sources helps to builtconceptual and quantitative models of fluid and mass transfer in the Earth’s crust (Yardley & Bodnar,2014). If the sources are known, the contribution of each sources in every mixture can be inferred fromhydrogeochemical data ( e.g.
Carrera, V´azquez-Su˜n´e, Castillo, & S´anchez-Vila, 2004; Skuce, Longstaffe,Carter, & Potter, 2015). In the case where the sources are not known, they can be inferred from thedata ( e.g.
Pinti et al., 2020).The paper presents a Bayesian method of source detection based on point processes. The method isinspired by pattern detection methodologies used in image analysis, animal epidemiology and astron-omy (R. Stoica, Descombes, & Zerubia, 2004; R. S. Stoica, Gay, & Kretzschmar, 2007; R. S. Stoica,Martinez, Mateu, & Saar, 2005; R. S. Stoica, Mart´ınez, & Saar, 2007).
Let x = { x i , i = 1 , . . . , n } be a set of sources, giving the source position within a multi-dimensional(in practice two-dimensional) space formed by the hydrogeochemical parameters. A data point d is amixture of these sources ( i.e. it is explained by these sources) if it is a barycenter of these sources asstated in (Faure, 1997) i.e. d = n (cid:88) i =1 γ i x i (1)with 0 ≤ γ i ≤ i and (cid:80) ni =1 γ i = 1.In the Euclidean plane, the source pattern ( i.e. set of sources) is unknown and also somehowoutlined by the set of hydrogeological data points. The key hypothesis at the basis of our work is thatthis pattern is made of interacting points. A preliminary condition for our model is that the hiddensources pattern exhibits the following properties : • the number of sources is not known but it should be controlled or minimal in a certain sense a r X i v : . [ s t a t . M E ] S e p two sources cannot be too close • the data points originating from a mixture of sources should be rather close to them • the convex hull enclosing the set of data points is enclosed within the convex hull given by thesource positions.These hypotheses allow to consider the sources as a realisation of a point process described by aGibbs probability density: p ( x | θ ) = exp [ − U ( x | θ )] Z ( θ ) = exp [ − U d ( x | θ ) − U i ( x | θ )] Z ( θ )with x the configuration of sources (or set of sources), Z ( θ ) the normalising constant and U the energyfunction.The energy function is built as the sum of two components. The first term, U d ( x | θ ) is the dataterm and it controls the positioning of the sources with respect to the observed data points d = { d , d , . . . , d m } . Its expression is given by U d ( x , θ ) = θ g ( x , d ) + θ m (cid:88) j =1 α ( d j , x ) + θ n e ( x , d ) . Here g ( x , d ) is the absolute value difference between the area of the sources and the data point convexhull, respectively. The function α ( d j , x ) represents the minimum distance between the data point d j and the sources cloud and it is given by min {(cid:107) d j − x i (cid:107) : i = 1 , . . . , n ( x ) } , with n ( x ) the numberof sources in the configurations. The measure n e ( x , d ) counts the number of data points in d , thatbelong to the convex hull given by x . The parameters θ , θ ≥ θ ≤ U i ( x | θ ) is the interaction term and it writes as U i ( x , θ ) = θ n ( x ) + θ n r ( x ) . with n r ( x ) the number of pairs of sources at distance shorter than r , which is a pre-fixed known value.The parameters θ , θ ≥ W ( i.e. µ ( W ) = (cid:82) W ξ dξ < ∞ ), that is defined by the previousenergy function, is well defined and locally stable R. Stoica (2014). Based on these terms, the model isable to generate point configurations that exhibit the properties required by the assumed hypotheses.The source pattern is estimated by the point configuration that maximises the probability density p ( x | θ ) (cid:98) x = arg max x ∈ Ω p ( x | θ ) = arg min x ∈ Ω U ( x | θ ) . (2)The solution of the problem (2) is obtained by implementing a simulated annealing algorithm. Thisalgorithm is a global optimisation method that iteratively samples from p ( x | θ ) /T while making T → The implemented simulated annealing algorithm has the following structure:1) set θ , T , k max , x (0) , c and k = 12) while k ≤ k max Bayesian analysis of geological data using point processesC. Reype et al.) generate x ( k ) with probability p ( x ( k − | θ ) /T b) set T = c ∗ T and k = k + 13) set x = x ( k ) This structure implements a sub-optimal cooling schedule, for practical reasons. An optimal loga-rithmic cooling schedule as specified by R. S. Stoica, Gregori, and Mateu (2005) may be considered.The sampling of p ( x | θ ) is done via the Metropolis-Hasting algorithm described below :1) set r c , p b , p d , p c with p b + p d + p c ≤
12) with probability p b choose birth, with probability p d choose death and with probability p c choosechange.birth: a) generate a random point η on W and set x (cid:48) = x ∪ { η } b) calculate β b = min { , p d p b p ( x ∪{ η }| θ ) p ( x | θ ) µ ( W ) n ( x )+1 } death: a) choose a point η of x and set x (cid:48) = x \ { η } b) calculate β d = min { , p b p d p ( x \{ η }| θ ) p ( x | θ ) n ( x ) µ ( W ) } change: a) choose a point η of x and generate a random point ξ in the ball B ( η, r c ) and set x (cid:48) = x \ { η } ∪ { ξ } b) calculate β c = min { , p ( x \{ η }∪{ ξ }| θ ) p ( x | θ ) }
3) the new configuration x = x (cid:48) is accepted with the appropriate probability β ; otherwise thealgorithm remains in the same state x .The previous dynamic is φ − irreducible, Harris recurrent and geometric ergodic, guaranteeingthe convergence of the algorithm towards the distribution of interest given by p ( x | θ ) Moller andWaagepetersen (2003); R. Stoica (2014); van Lieshout (2000). The proposed model was tested on two different data sets. The first one is a simulated data set, thesecond one is a set of hydrogeochemical data from geothermal fluids described in (Pinti et al., 2020).The model is coded in C++, and the results are displayed in R with the library ”ggplot”.We set T = 1000, k max = 10000, c = 0 . r c = 0 . p b = 0 . p d = 0 .
35 and p c = 0 .
3. Theparameters θ were chosen separately, for each data set, after several trials and errors. The data are created by generating three sources. The vector of contributions of each sources toa data point is generated by a Dirichlet law with parameters (1 , , − for eachcoordinate, was added to each datapoint to represent the noise during the measurement.We set the parameters to θ = (100 , , − , ,
50) and r = 2. The results are shown in Figure1. The black dots are the data points, the blue symbols are the real sources and the gradient of bluecolor shows the density of simulated sources.There are three areas that exhibit a high density of simulated sources, which corresponds to theactual number of sources. Moreover their positions are really close to the real sources.Bayesian analysis of geological data using point processesC. Reype et al.igure 1: Point process source detection for the simulated data in the case of a three-component fluid mixing system where x and y are the concentrations of two solutes (arbitraryunits) We are now comparing the results of our model with the results of the model presented in (Pinti etal., 2020). This model gives the smallest triangle (in term of area) that enclose the data. In Figure 2we set θ = (200 , , − , , θ = (196 , , − , , Point process source detection for a three-component geothermal fluid mixing systemwhere δ Cl, o / oo vs SMOC and Rc/Ra are respectively the stable isotopic composition ofchlorine and the He/ He ratio of the samples (data from (Pinti et al., 2020, Figure 4)) The model is able to detect the number and the position of the sources inferred in (Pinti et al.,2020), while the model sufficient statistics provides a more complete morpho-statistical description ofthe sources.
Clearly, the use of this method requires at least partial knowledge regarding the model parameters.Such a knowledge is built by embedding the available geological information into prior distributions.This new tool should be improved in order to become helpful not only in the analysis of geologicalBayesian analysis of geological data using point processesC. Reype et al.igure 3:
Point process source detection for a three-component geothermal fluid mixing systemwhere δ Br, o / oo vs SMOB and Rc/Ra are respectively the stable isotopic composition ofbromine and the He/ He ratio of the samples (data from (Pinti et al., 2020, Figure 5)) fluids but also in other fields that deals with mixtures ( e.g. Longman et al., 2018; Phillips & Gregg,2003). This is possible due to the use of the embedded spatial and Bayesian paradigms.
Acknowledgments
This work was performed in the frame of the DEEPSURF project ( http://lue.univ-lorraine.fr/fr/impact-deepsurf ) at Universit´e de Lorraine. This work was supported partly by the french PIA projectLorraine Universit´e d’Excellence, reference ANR-15-IDEX-04-LUE.
References
Carrera, J., V´azquez-Su˜n´e, E., Castillo, O., & S´anchez-Vila, X. (2004). A methodology to computemixing ratios with uncertain end-members.
Water resources research , (12).Faure, G. (1997). Principles and applications of geochemistry (Vol. 625). Prentice Hall New Jersey,United States,.Longman, J., Veres, D., Ersek, V., Phillips, D. L., Chauvel, C., & Tamas, C. G. (2018). Quantitativeassessment of pb sources in isotopic mixtures using a bayesian mixing model.
Scientific reports , (1), 6154.Moller, J., & Waagepetersen, R. P. (2003). Statistical inference and simulation for spatial pointprocesses . Chapman and Hall/CRC.Phillips, D. L., & Gregg, J. W. (2003). Source partitioning using stable isotopes: coping with toomany sources.
Oecologia , (2), 261–269.Pinti, D. L., Shouakar-Stash, O., Castro, M. C., Lopez-Hern´andez, A., Hall, C. M., Rocher, O., . . .Ram´ırez-Montes, M. (2020). The bromine and chlorine isotopic composition of the mantle asrevealed by deep geothermal fluids. Geochimica et Cosmochimica Acta .Skuce, M., Longstaffe, F., Carter, T., & Potter, J. (2015). Isotopic fingerprinting of groundwaters insouthwestern ontario: Applications to abandoned well remediation.
Applied Geochemistry , ,1–13.Stoica, R. (2014). Mod´elisation probabiliste et inf´erence statistique pour lanalyse des donn´ees spa-tialis´ees. Research Habilitation Thesis, Universit´e Lille , .Stoica, R., Descombes, X., & Zerubia, J. (2004). A gibbs point process for road extraction fromremotely sensed images. International Journal of Computer Vision , (2), 121–136.Bayesian analysis of geological data using point processesC. Reype et al.toica, R. S., Gay, E., & Kretzschmar, A. (2007). Cluster pattern detection in spatial data based onmonte carlo inference. Biometrical Journal: Journal of Mathematical Methods in Biosciences , (4), 505–519.Stoica, R. S., Gregori, P., & Mateu, J. (2005). Simulated annealing and object point processes : toolsfor analysis of spatial patterns. Stochastic Processes and their Applications , , 1860-1882.Stoica, R. S., Martinez, V. J., Mateu, J., & Saar, E. (2005). Detection of cosmic filaments using thecandy model. Astronomy & Astrophysics , (2), 423–432.Stoica, R. S., Mart´ınez, V. J., & Saar, E. (2007). A three-dimensional object point process for detectionof cosmic filaments. Journal of the Royal Statistical Society: Series C (Applied Statistics) , (4),459–477.van Lieshout, M. N. M. (2000). Markov point processes and their applications . Imperial College Press,London.Yardley, B. W., & Bodnar, R. J. (2014). Fluids in the continental crust.
Geochemical Perspectives ,3