Retrieval of Coloured Dissolved Organic Matter with Machine Learning Methods
Ana B. Ruescas, Martin Hieronymi, Sampsa Koponen, Kari Kallio, Gustau Camps-Valls
RRETRIEVAL OF COLOURED DISSOLVED ORGANIC MATTER WITH MACHINELEARNING METHODS
Ana B. Ruescas , Martin Hieronymi , Sampsa Koponen , Kari Kallio and Gustau Camps-Valls Image Processing Laboratory (IPL), Universitat de Val`encia, Spain Institute of Coastal Rsearch, Helmholtz-Zentrum Geesthacht, Germany Finnish Environment Institute SYKE, Finland
ABSTRACT
The coloured dissolved organic matter (CDOM) concentration is the standard measure of humic substance in natural wa-ters. CDOM measurements by remote sensing is calculated using the absorption coefficient ( a ) at a certain wavelength (e.g. ≈ nm ). This paper presents a comparison of four machine learning methods for the retrieval of CDOM from remote sensingsignals: regularized linear regression (RLR), random forest (RF), kernel ridge regression (KRR) and Gaussian process regres-sion (GPR). Results are compared with the established polynomial regression algorithms. RLR is revealed as the simplest andmost efficient method, followed closely by its nonlinear counterpart KRR. Index Terms — Remote Sensing, CDOM concentrations, Absorbing Waters, Linear Regression, Machine Learning Methods
1. INTRODUCTION
In the temperate and cold regions of the boreal zone humic waters are abundant, especially in lakes. These waters typically havefairly low total suspended matter (TSM) and chlorophyll a (Chl-a) concentrations, even though some cases of “black lakes”with high Chl-a and TSM values have been reported too [1]. For instance, in Finland the humic matter concentration of lakescorrelates with the share of peat land in the drainage area [2]. Humic lakes can also originate from peat dredging, e.g. in theNetherlands. CDOM absorption has an exponential shape and consequently decreases the water leaving reflectance in shortwavelengths.Information on humic substances is utilized in the application of official directives, lake management and climate changestudies. Within the Water Framework Directive (WFD), in Sweden and Finland, water colour values > mg Pt l − (corre-sponds approximately a CDOM (400) . m − ) are defined as humic. Waters with water colour > mg Pt l − ( a CDOM (400) m − ) are classified as humus rich lakes [3]. An accurate measurement of the a CDOM parameter from remote sensing iscrucial in these types of water. However, it is known that CDOM is the most critical and uncertain ocean colour (OC) product.The impact of high CDOM on reflectance is demonstrated in Figure 1. In the present case, Lake Garda represents lakewater with very low CDOM concentrations, while in Lake P¨a¨aj¨arvi the CDOM concentration is high. The level of Chl-a isapproximately the same in both lakes, and total suspended matter (TSM) is slightly lower in Lake Garda. R rs in the blue andgreen region of the spectrum is clearly higher in Lake Garda than in Lake P¨a¨aj¨arvi. R rs at < nm is significantly smaller inthe CDOM rich lake. At > nm , where the influence of a CDOM is small, R rs in Lake Garda is slightly lower due to lowerTSM concentration, and higher in Lake P¨a¨aj¨arvi due to higher scattering of organic and inorganic particles.Several band ratios have been proposed as predictive models for estimating CDOM from spectral data. These parametricapproaches only take into account a few spectral bands and thus they disregard the information contained in other bands.Non-parametric regression algorithms can alternatively exploit the information contained in all spectral bands. In this paper,we assess the performance of several machine learning (ML) algorithms in CDOM estimation over typical boreal waters:multivariate regression, random forests, kernel ridge regression and Gaussian processes.We will use a Hydrolight simulationdataset presented in the framework of the Case2eXtreme (C2X) project [5]. Basis for the tests are simulated hyperspectral R rs data that include extreme absorbing waters. We work with samples of medium to high CDOM concentrations at 440nm( − m − ) and medium to low Chl-a concentrations ( < mg/m ). The research was funded by the European Research Council (ERC) under the ERC-CoG-2014 SEDAL project (grant agreement 647423), and the SpanishMinistry of Economy and Competitiveness (MINECO) through the project TIN2015-64210-R. Especial thanks to Carsten Brockmann and the Case2eXtremeproject team (funded by ESA). http://seom.esa.int/page project014.php a r X i v : . [ phy s i c s . g e o - ph ] J a n ig. 1 . Simulated R rs in a typical humic lake in Finland (P¨a¨aj¨arvi) and in Lake Garda, Italy. R rs was simulated with the Hydrolight model[4] The remainder of the paper is organized as follows. §2 describes the methods used in this work. §3 gives an empiricalevidence of performance of the proposed methods in comparison to standard bioptical models for the particular dataset used.We conclude in §4 with some remarks and an outline future work.
2. METHODS AND APPLICATION2.1. Established approaches with band ratio algorithms
Reports on band ratio algorithms are mostly based on airborne and field measurements with negligible or only small atmosphericinfluence. Important wavelength regions for the band ratio algorithms for CDOM retrieval are the ones between 400-600 nmtaking 660-720 nm as reference. Following [6] [7], CDOM can be estimated by a ratio of reflectance at wavelength > nm to reflectance in the 400-550 nm range. This ratio is valid in a wide range of water constituent combinations. [4] [6] used insitu measured data in Finland, and derived the algorithms that work well for a CDOM and TSM. The latter is estimated using asingle band at 709 nm and not a ratio.The calibration for the CDOM algorithm with two band ratios is the main objective of Kallio’s work [6]. He calculatedHydrolight simulations made with concentration data from monitoring stations in Finland (5553 in total) and used them forthe calculation of the coefficients. Wavebands at 490, 665 and 709 nm (based on MERIS channels), were compared to the insitu CDOM measured at 440 nm. The optimal regression had a polynomial form with coefficients varying depending on theparticular ratio used x : y = p ∗ x + p ∗ x + p . (1)The dataset used in this exercise comes from the simulated database of the C2X project and it is based on simulated remotesensing reflectance ( R rs ), which is the ratio of water-leaving radiance to downwelling irradiance above the sea surface. Here, R rs refers to clear atmosphere with Sun at zenith and viewing angle exactly perpendicular. Highly absorbing waters arecharacterized by very low water-leaving radiance (black waters). The maximum of R rs is typically < . sr − and locatedbetween 550 and 605 nm for Case-2 absorbing (C2A) cases. For extremely Case-2 absorbing waters (C2AX), the maximumshifts towards the red spectral range > nm . The complete C2AX dataset used in the present work is illustrated in Figure 2.This complete dataset was filtered before using it here for Chl-a values > mg/m ; and only nadir view angles have beenleft. Hihgh Chl-a values cuased the ratio to increase with low CDOM values and to decrease with high CDOM values. Thosesimulations were removed from the dataset since they are not frequent in nature. ig. 2 . The dataset includes the whole range of Chl-a ( − mg/m ), with low TSM ( − g/m ) and high CDOM ( − m − ). Whenremoving the Chl-a > mg/m , the fluorescence peak of the dataset at ≈ nm decreases considerably Four machine learning algorithms for linear and non-linear regression are tested and compare to the polynomial regressionexplained above: (multivariate) linear regression (RLR), random forest (RF) [8], kernel ridge regression (KRR) [9], Gaussianprocess regression (GPR) [10].In multivariate (or multiple) linear regression (LR) the output y (DCOM) is assumed to be a weighted sum of B inputvariables, x := [ x , . . . , x B ] (cid:62) , that is ˆ y = x (cid:62) w . Maximizing the likelihood is equivalent to minimizing the sum of squarederrors, and hence one can estimate the weights w = [ w , . . . , w B ] (cid:62) by least squares minimization. Very often one imposessome smoothness constraints to the model and also minimizes the weights energy, (cid:107) w (cid:107) , thus leading to the regularized linearregression (RLR) method.An alternative method is that of random forests (RFs) [8]. A RF model is an ensemble learning method for regression thatoperates by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees.The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners.Kernel methods constitute a family of successful methods for regression [11]. We aim to incorporate two instantiations:(1) the KRR is considered as the (non-linear) version of the RLR [9]; and (2) GPR is a probabilistic approximation to non-parametric kernel-based regression, where both a predictive mean and predictive variance can be derived [12]. Kernel methodsoffer the same explicit form of the predictive model, which establishes a relation between the input (e.g., spectral data) x ∈ R B and the output variable (CDOM) is denoted as y ∈ R . The prediction for a new radiance vector x ∗ can be obtained as: ˆ y = f ( x ) = N (cid:88) i =1 α i K θ ( x i , x ∗ ) + α o , (2)where { x i } Ni =1 are the spectra used in the training phase, α i is the weight assigned to each one of them, α o is the bias in theregression function, and K θ is a kernel or covariance function (parametrized by a set of hyperparameters θ ) that evaluates thesimilarity between the test spectrum x ∗ and all N training spectra. We used the automatic relevance determination (ARD)kernel function, K ( x , x (cid:48) ) = ν exp (cid:18) − B (cid:88) b =1 ( x b − x (cid:48) b ) / (2 σ b ) (cid:19) + σ n δ ij , and learned the hyperparameters θ = [ ν, σ , . . . , σ B , σ n ] by marginal likelihood maximization. An operational MATLABoolbox is available at http://isp.uv.es/soft regression.html. In this exercise comparisons are made using the ratios as inputs, as well as the full spectral range found in the simulation dataset.This spectral range consisted in 6 bands corresponding to some of the MERIS sensor wavelengths: 442.5, 490, 510, 560, 665,708.75 nm, from blue to near-infrared (NIR). This is the common set found in many of the ocean colour sensors used for waterquality retrievals. Several trials using different inputs are tested: ratio 1 and ratio 2 are considered separately in the first twotests, both ratios are used together in the third test and all wavelengths are the multi-input variables in the fourth test. Statisticsused to check the validity of the methods are: mean error (ME), root mean squared error (RMSE), normalized mean squarederror (nMSE), mean absolute error (MAE) and the coefficient of correlation (R).
3. EXPERIMENTAL RESULTS3.1. Numerical comparison
The Table 1 offers an overview of the metrics for all four methods tested with the different input variables. When using onlyone input ( x = R rs (665) /R rs (490 nm ) , x = R rs (709) /R rs (490 nm ) ) the ML methods tested do not improve the resultsstrikingly. The polynomial regression metrics are very similar to the others and especially close to the KRR method with ratio1 ( x ) or the RF with ratio 2 ( x ). Even if in this second case the reduction in the ME and RSME is evident, computation timewill then be the determining factor, and the polynomial regression requires less time and it still quite efficient. Table 1 . Results obtained obtained with empirical fitting and several machine learning methods. Several scores are shown: mean error (ME),root-mean-square error (RMSE), mean absolute error (MAE) and Pearson’s correlation coefficient (R).
ME RMSE nMSE MAE RRatio 1: x = 665 / Polyfit 0.758 3.832 0.319 2.940 0.570RLR 0.624 4073 -0.042 3.174 0.438RF 0.687 3.753 -0.078 2.747 0.592KRR 0.714 3.707 -0.083 2.767 0.586GPR 0.692 3.755 -0.078 2.903 0.585
Ratio 2: x = 709 / Polyfit 0.732 3.710 0.263 2.846 0.603RLR 0.634 3.969 -0.054 3.071 0.484RF 0.387 3.320 -0.131 2.336 0.687KRR 0.648 3.411 -0.120 2.378 0.665GPR 0.808 3.604 -0.096 2.589 0.638
Both ratios: x = [ x , x ] RLR 0.703 3.900 -0.061 2.956 0.520RF 0.363 2.296 -0.291 1.497 0.867KRR 0.588 2.842 -0.199 1.843 0.802GPR 0.487 2.676 -0.225 1.664 0.814
All bands, x ∈ R B RLR 0.259 1.843 -0.387 1.189 0.913RF 0.137 0.735 -0.786 0.444 0.987KRR 0.045 0.812 -0.743 0.453 0.984GPR 0.014 0.646 -0.842 0.358 0.990
However, when using more than one variable, non-linear ML methods get more relevance, as it can be seen in the “BothRatios” and “All bands” sections on the Table 1. The RF seems to be the best model when using the two ratios together, whilethe GPR work well when using all the six reflectance bands. In this last case, the “winner” method seems not to be so clear,since a simple RLR model improves already the results, although errors with RLR are double than using any other of the otherthree ML methods. https://earth.esa.int/web/guest/missions/esa-operational-eo-missions/envisat/instruments/meris .2. Analysis of the models Figure 3 shows the partial dependence between the function (CDOM) and the target features (wavelengths) for the RLR model(red line) and the GPR model (black line). What can be inferred is that bands in the blue part of the spectrum (442.5, 490 nm)have a stronger influence in the CDOM derivation, while the green and red bands seem to be more neutral. The NIR band (709nm) has a slight but determinant influence. y R=442.5 y R=490 y R=510 y R=560 y R=665 y R=708.75
Fig. 3 . Partial plots for the RLR (red) and the GPR (black) models.
A second analysis of the models considered a permutation analysis by which the impact of the input features on the predic-tion error is evaluated in the context or absence of the other predictors. Figure 4 shows the influence of the wavelengths withinthe different models, which confirms the previous claim for all methods, with some variation in the KRR, where the weight ofthe green and red bands is higher.
One example of the application of the models to the test data and their validation is shown in Figure 5. Using all reflectancebands available, the plots show the summary of statistics of the residuals for the four methods (top) and the scatterplot of thelinear regression between the observed and the predicted values for the method with the best RMSE result, in this case the GPR,which is also the best model in terms of ME and R (cf. Table 1).
RLR RF1 KRR GPR-10-505 R e s i du a l s Observed signal P r e d i c t e d s i gn a l Fig. 5 . Application of model on test data
MSE Permutation analysis
RLR RF1 KRR GPR
Methods R e l e v a n ce Fig. 4 . RMSE permutation analysis of the models
A similar exercise has been done with the retrieval of Chl-a (results not showed here). The dataset used was the C2A, withcounted with a larger range of Chl-a values ( − mg/m ) and still low TSM values ( . − g/m ). The summary statisticsshowed that the best performing method is the KRR but followed closely by the RLR, which improves considerably the simplerratio method (709/665nm) e.g. from R=0.789 to R=0.902 or RMSE between 21.918 and 31.224 mg/m .
4. CONCLUSIONS
Four ML methods were tested using as training set simulated C2AX data, that is, extremely absorbing water with high CDOMconcentration, Chl-a contents < mg/m and low suspended matter. The traditional empirical algorithms are derived usingband ratios and their correlation with measured or simulated a CDOM . The polynomial algorithms derived using this empiricalrelationship are compared with more sophisticated ML methods using several variables as input: similar two-band ratios orthe six MERIS like wavebands available. Results show that multivariate linear regression methods are already very efficientwhen using more than 2 bands or ratios, that is, more information coming from different parts of the spectrum. GPR methodsgive a very good result and are considered the best in terms of error and correlation, but the computation time needed increasesconsiderably.
5. REFERENCES [1] Tiit Kutser, Birgot Paavel, Charles Verpoorter, Martin Ligi, Tuuli Soomets, Kaire Toming, and Gema Casal, “Remote sensing of blacklakes and using 810 nm reflectance peak for retrieving water quality parameters of optically complex waters,”
Remote Sensing , vol. 8,no. 6, pp. 497, 2016.[2] Pirkko Kortelainen, “Content of total organic carbon in finnish lakes and its relationship to catchment characteristics,”
Canadian Journalof Fisheries and Aquatic Sciences , vol. 50, no. 7, pp. 1477–1483, 1993.[3] S. Koponen, K. Kallio, and Attila J., “D5.5 boreal lakes case study results,” Tech. Rep., GLaSS project, H2020, EU, 2015.[4] K. Alikas, S. Lautt, and A. Reinart, “D3.4 adapated water quality algorithms,” Tech. Rep., GLaSS project, H2020, EU, 2014.[5] M. Hieromyni, H. Krasemann, D. Mueller, C. Brockmann, A. Ruescas, K. Stelzer, B. Nechad, K. Ruddick, S. Simis, G. Tisltone,F. Steinmetz, and P. Regner, “Ocean colour remote sensing of extreme case-2 waters,” Living Planet Symposium, 2016.6] Kari Kallio, “Optical properties of finnish lakes estimated with simple bio-optical models and water quality monitoring data,”
HydrologyResearch , vol. 37, no. 2, pp. 183–204, 2006.[7] Patrick L. Brezonik, Leif G. Olmanson, Jacques C. Finlay, and Marvin E. Bauer, “Factors affecting the measurement of CDOM byremote sensing of optically complex inland waters,”
Remote Sensing of Environment , vol. 157, pp. 199 – 215, 2015, Special Issue:Remote Sensing of Inland Waters.[8] L. Breiman and J. H. Friedman, “Estimating optimal transformations for multiple regression and correlation,”
Journal of the AmericanStatistical Association , vol. 80, no. 391, pp. 1580–598, 1985.[9] John Shawe-Taylor and Nello Cristianini,
Kernel Methods for Pattern Analysis , Cambridge University Press, June 2004.[10] C. E. Rasmussen and C. K. I. Williams,
Gaussian Processes for Machine Learning , The MIT Press, New York, 2006.[11] G. Camps-Valls and L. Bruzzone, Eds.,
Kernel methods for Remote Sensing Data Analysis , Wiley & Sons, UK, Dec. 2009.[12] G. Camps-Valls, J. Verrelst, J. Mu˜noz Mar´ı, V. Laparra, F. Mateo-Jim´enez, and J. Gomez-Dans, “A survey on gaussian processes forearth observation data analysis,”