Application of Different Simulated Spectral Data and Machine Learning to Estimate the Chlorophyll a Concentration of Several Inland Waters
Philipp M. Maier, Sina Keller
Karlsruhe Institute of Technology (KIT), Institute of Photogrammetry and Remote Sensing (IPF), Englerstraße 7, 76131 Karlsruhe, Germany
ABSTRACT
Water quality is of great importance for humans and for the environment and hence has to be monitored continuously. One option is to rely on proxies such as the chlorophyll a concentration, which can be monitored by remote sensing techniques. This study focuses on the trade-off between the spatial and the spectral resolution of six simulated satellite-based datasets when estimating the chlorophyll a concentration with supervised machine learning models. The initial dataset for the spectral simulation of the satellite missions contains spectrometer data and measured chlorophyll a concentrations of different inland waters. The analysis of the regression performance indicates that the machine learning models achieve almost as good results with the simulated Sentinel data as with the simulated hyperspectral data. Regarding the applicability, the Sentinel 2 mission is the best choice for small inland waters due to its high spatial and temporal resolution in combination with a suitable spectral resolution.
Index Terms: Machine learning, supervised regression, chlorophyll a, hyperspectral data, spectral resolution
1. INTRODUCTION
According to the sixth Sustainable Development Goal of the United Nations, clean water is a key resource for humans and the environment [1]. However, water quality is threatened extensively by human influences such as the emission of wastewater or overfertilization caused by agriculture. Hence, there is a great demand for a continuous and efficient system to monitor water quality (cf. [2, 3, 4]).

In addition to commonly applied in-situ probes, remote sensing is often considered when monitoring large water surfaces. Remote sensing offers some advantages over point sample measurements. In particular, satellite image data is frequently available and cost-efficient in the long run. Furthermore, information about water quality parameters derived from satellite images is more representative than in-situ measured point values in terms of area-wide coverage.

One important water quality parameter is chlorophyll a (chl a). It serves as a proxy for the nutrient supply of a water body. Chl a is a pigment which appears in phytoplankton and provides the basis for photosynthesis. The occurrence of phytoplankton depends on the natural nutrient supply of a water body as well as on human sources.

Chl a is detectable by passive remote sensing sensors in the visible spectrum. An absorption feature in the spectral region around 665 nm indicates chl a [5]. Several studies have already demonstrated the applicability of remote sensing data to the estimation of chl a concentrations in inland waters [2, 6]. To estimate the chl a concentration with spectral data, two complementary approaches are applied. First, engineering approaches consider spectral features or band ratios [7, 8]. Second, machine learning (ML) approaches have emerged in the last decade [9, 10, 11, 12, 6]. These approaches estimate the chl a concentration primarily in a supervised way without prior knowledge of the underlying physical processes.

In general, the estimation of chl a concentrations in water bodies from remote sensing data is a challenging task. Inland waters are optically complex since they contain suspended and particulate materials. These materials are characteristic of each inland water [3].

Another limiting factor when monitoring inland waters is the spatial resolution of the satellite images. Unfortunately, a high spectral resolution is often accompanied by a lower spatial resolution. In the case of the oceans, this is not an issue. With respect to inland waters, however, the spatial resolution is crucial and hence an exclusion criterion for some satellite sensors. For example, the SeaWiFS (Sea-viewing Wide Field-of-view Sensor), an ocean observation satellite mission, has a spatial resolution of more than 1 km [13]. Therefore, most of the smaller inland water bodies are represented by only one mixed pixel, which hinders the use of satellite data for the estimation of the chl a concentration of small water bodies.

Some studies investigate the trade-off between the spectral and the spatial resolution of satellite data recorded by the common missions [14, 15]. A thorough analysis of the estimation performance of feature engineering approaches on chl a concentrations for several simulated satellite sensors is presented in [15]. Previous work [16] addresses the effect of different hyperspectral resolutions of the input data and machine learning models when estimating chl a concentrations.

In this study, we simulate satellite data with respect to several multi- and hyperspectral satellite missions, namely Landsat 5, Landsat 8, Sentinel 2, Sentinel 3, EnMAP and Hyperion. The basis of the simulated data is a spectrometer dataset of different inland waters which was recorded in the surrounding region of Karlsruhe (Germany) during the summer of 2018. In total, the dataset contains datapoints. Each datapoint consists of the spectral information and the associated chl a concentration. The simulated spectral data serves as input data for selected ML models to estimate the chl a concentration of the different inland waters.

The objectives of this contribution are:
• the simulation of satellite data based on the measured spectrometer data by applying the spectral response function or a Gaussian function (Section 2);
• the estimation of the chl a concentration by applying different supervised ML models such as random forest (RF), support vector machine (SVM), multivariate adaptive regression spline (MARS) and an artificial neural network (ANN) to the respective simulated data (Section 4);
• the comparison of the regression performance in terms of simulated data and applied ML model (Section 4);
• the discussion of the regression performance with the focus on the spectral and the spatial resolution of the input data (Section 4).

We thank the German Federal Ministry of Education and Research for funding the WAQUAVID project.
2. DATASET AND DATA SIMULATION
The data used in this contribution is from a measurement campaign [16] in the surroundings of Karlsruhe, a city located in the southwest of Germany. During the summer of 2018, different inland water bodies were measured with a spectrometer, and water samples were evaluated with a photometer. A detailed description of the measurement campaign including the measurement setup is given in [16].

The spectrometer records hyperspectral data in a spectral range of 341 nm to 1015 nm with a sampling interval of 0.66 nm. Its measurement principle is based on the ratio between the incoming and the up-welling radiance in the perpendicular direction. The spectrometer was mounted on a tripod, which was placed as far as possible into the water in the case of a natural water body. When measuring an artificial water body, the spectrometer was set up outside the water.

The water samples for the chl a concentration analysis, which we use as reference data, were collected close to the spectrometer. The measured chl a concentrations and the respective spectra of the continuous spectrometer measurements were matched by their timestamps. In total, we obtain a dataset with datapoints. Each datapoint consists of the spectral data and a chl a concentration value.

For our satellite-based simulation of the spectral data, we used spectrometer data in the wavelength range of 400 nm to 900 nm. The simulation of the spectra in accordance with the satellite missions was conducted with the hsdar package in R [17]. Three different weighting functions are available to calculate the satellite bands from spectral data: a Gaussian function, an equally weighted function and the actual spectral response function. To calculate the spectra according to the Sentinel 2, Landsat 5 and Landsat 8 missions, we relied on the real spectral response function. When simulating the Sentinel 3, EnMAP and Hyperion satellite missions, we applied the Gaussian function. In the case of Sentinel 3, which is not implemented in the hsdar package, we used the central wavelength and the full width at half maximum of each band according to [18] as parameters of a Gaussian function to simulate the bands. Table 1 gives an overview of the spectral and spatial characteristics of the satellite missions which have been used for the data simulation. Furthermore, Figure 1 illustrates the bandwidth of each satellite mission in the spectral range of 400 nm to 900 nm.
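The Gaussian band simulation described above can be illustrated with a minimal Python sketch. It is not the hsdar implementation used in this study; it only assumes that a band value is the weighted mean of the measured spectrum, with Gaussian weights defined by the central wavelength and the full width at half maximum (FWHM). The function name and the band parameters in the example (665 nm, 10 nm) are hypothetical and chosen for illustration only.

```python
import math


def gaussian_band_reflectance(wavelengths, reflectances, center, fwhm):
    """Return the Gaussian-weighted mean reflectance of one simulated band.

    Assumption: the band response is a Gaussian centred at `center` (nm)
    whose full width at half maximum is `fwhm` (nm).
    """
    # FWHM = 2 * sqrt(2 * ln 2) * sigma for a Gaussian.
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    weights = [math.exp(-0.5 * ((w - center) / sigma) ** 2) for w in wavelengths]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("band lies outside the sampled wavelength range")
    return sum(wt * r for wt, r in zip(weights, reflectances)) / total


# Synthetic example spectrum on a 1 nm grid from 400 nm to 900 nm
# (hypothetical values, not measured data).
wavelengths = [400 + i for i in range(501)]
reflectances = [0.01 * w for w in wavelengths]

# Illustrative band: centre 665 nm, FWHM 10 nm.
band_value = gaussian_band_reflectance(wavelengths, reflectances, 665.0, 10.0)
print(band_value)
```

Normalizing by the sum of the weights keeps the simulated band value in the same unit as the input spectrum, regardless of the sampling interval.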
3. METHODOLOGY
For the estimation of the chl a concentration based on the different simulated satellite data, we selected four ML models: support vector machine [19], random forest [20], multivariate adaptive regression spline [21] and an artificial neural network [22]. The selection of the applied ML models follows [12], where they showed a satisfactory performance.

To apply these models, the dataset consisting of the chl a values and the simulated satellite data was prepared. It was split into five equally sized parts with respect to the distribution of the target variable, the chl a concentration. Then, each of those parts was split randomly into two subsets: a training subset and a test subset. All five training subsets were aggregated to the final training subset. The test subset was generated similarly. As a result, the distribution of the chl a concentration in the training as well as in the test subset was representative of the reference measurements.

The training subset was used for the training of the ML models, while the test subset remained unused until the test phase. Before starting with the training, we applied a grid search to adjust the hyperparameters of the models. For example, hyperparameters of the SVM model are the penalty parameter cost and the kernel parameter gamma.

During the test phase, the models were validated on the previously unseen test dataset. The regression performance was expressed by the coefficient of determination (R²) and the mean absolute error (MAE). Following the regression performance on the same database in [16], we also calculated the first derivative of the spectra for the simulated hyperspectral data of the Hyperion and EnMAP missions and applied those derivatives as input data for the RF and MARS models. In addition, we pre-processed the simulated satellite data with a scaling to ensure good regression results for the MARS, SVM and ANN models.

Table 1. Summary of some characteristics of the different satellite systems used for the data simulation, covering the spectral range of 400 nm to 900 nm. The hyperspectral satellite missions are marked with ∗.
Satellite mission | Number of bands | Bandwidth in nm | Spectral range in nm | Spatial resolution in m | Approach for the simulation | Data source
Sentinel 2 | | 18 to 145 | 443 to 865 | 10 to 60 | Response function | [17]
Sentinel 3 | | 16 to 60 | 443 to 865 | | Response function | [17]
Landsat 5 | | 60 to 140 | 485 to 840 | | Response function | [17]
Hyperion ∗ | 54 | 10 | 406 to 895 | | Gaussian function | [17]
EnMAP ∗ | 77 | 6.5 | 423 to 895 | | Gaussian function | [17]
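The split into training and test subsets described in Section 3 can be sketched in Python as follows. This is one possible reading, not the code used in the study: it assumes that the five equally sized parts are obtained by sorting the datapoints by their chl a concentration and cutting the sorted list into five consecutive groups. The test fraction of 0.3, the random seed and the function name are hypothetical, since the text does not state them.

```python
import random


def stratified_split(targets, n_parts=5, test_fraction=0.3, seed=0):
    """Split datapoint indices into training and test subsets.

    Assumption: the parts are consecutive groups of the datapoints sorted
    by the target value, so each part covers one range of concentrations.
    """
    rng = random.Random(seed)
    # Indices of the datapoints, ordered by increasing target value.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    train_idx, test_idx = [], []
    for part in range(n_parts):
        start = part * len(order) // n_parts
        stop = (part + 1) * len(order) // n_parts
        members = order[start:stop]
        # Random split inside the part.
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    # Aggregate the per-part subsets into the final subsets.
    return sorted(train_idx), sorted(test_idx)


# Hypothetical chl a concentrations (not measured data).
example_targets = [float(i) for i in range(100)]
train, test = stratified_split(example_targets)
print(len(train), len(test))
```

Because every part contributes the same share to the test subset, low, medium and high concentrations are all represented in both subsets, which is the property the procedure aims for.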
[Figure 1: x-axis: wavelength in nm; y-axis: reflectance in %; legend: Landsat 5, Landsat 8, Sentinel 2, Sentinel 3]
Fig. 1. Median spectra of the spectrometer dataset and symbolization of the width of the satellite bands (colored lines). The dots in the middle of each band represent the simulated reflectance value of the band.
4. RESULTS AND DISCUSSION
Figure 2 and Table 2 present the regression performance of estimating the chl a concentration with respect to the applied ML models as well as the different simulated satellite data. According to Figure 2, the regression performance of the four ML models is in the same range.

When considering the simulated satellite input data for estimating the chl a concentration, the regression results expressed as R² are distinguishable. For the simulated hyperspectral satellite data (EnMAP and Hyperion), the coefficient of determination (R²) is quite similar. In the case of the simulated Landsat data, the regression results are closely related. In detail, the ANN model performs worse than the other three models on these two simulated datasets. However, for the simulated Sentinel data, the ANN model provides the best regression results.

Considering the different simulated satellite data, the regression with the simulated hyperspectral data based on the EnMAP and Hyperion missions achieves the best results. The corresponding MAE values range from 10.1 µg L⁻¹ to 12.6 µg L⁻¹. The MAE values of the models with simulated multispectral data according to the Sentinel missions range from 10.9 µg L⁻¹ to 14.8 µg L⁻¹. The estimation of the chl a concentration of all regression models with simulated Landsat data performs the worst compared to the other simulated satellite data. Here, the MAE ranges from 17.8 µg L⁻¹ to 20.5 µg L⁻¹.

Analyzing bandwidth, number of bands, spectral range and resolution of the simulated satellite data, Figure 1 shows that Landsat 5 (green) and Landsat 8 (blue) have similar bands with a similar band positioning. The three bands between 450 nm and 700 nm are nearly the same. In the spectral range of 800 nm to 900 nm, Landsat 8 provides a narrower band than Landsat 5, and it has an additional fifth narrow band near 430 nm.
With respect to the estimation of the chl a concentration, this additional band has no further impact on the regression task.

Similar to the simulated Landsat data, the simulated multispectral Sentinel 3 data provides a better spectral resolution and accounts for more bands with narrower bandwidths than Sentinel 2. However, the regression performance of the ML models on simulated Sentinel 3 data is not clearly better than the regression performance of the models with simulated Sentinel 2 data. When comparing the estimation performance with either simulated Sentinel data or simulated Landsat data, the better performance of the models using the simulated Sentinel data can be well explained. First, the simulated Sentinel data is characterized by more bands. Second, these bands are well positioned within the spectral range of 400 nm to 900 nm. For example, the simulated Sentinel data includes the extrema in the range of 660 nm to 710 nm which are related to chl a. This spectral range is not included in the two Landsat missions, which explains the poor chl a estimation of all models [14].

[Figure 2: y-axis: R² in %; legend: ANN, MARS, RF, SVM]
Fig. 2. Regression results (R² in %) of the four ML models with different simulated satellite data.

The simulated hyperspectral data (EnMAP and Hyperion) with a nearly constant spectral resolution of 6.5 nm and 10 nm are not shown in Figure 1 for reasons of clarity. Comparing the regression results with the simulated hyperspectral and the simulated Sentinel data, the models relying on the hyperspectral datasets perform only slightly better. This finding indicates that the band positioning of the Sentinel missions is well suited for the estimation of chl a concentrations.

Regarding the applicability of the simulated satellite data for a general monitoring approach in the context of inland waters, the Sentinel 2 data serves its purpose. It provides data with an appealing spectral resolution and a sufficient spatial resolution, and it is characterized by a high temporal frequency. Hyperspectral data with a better spectral resolution leads to a satisfying chl a estimation when applying the same ML models. However, its temporal resolution lags behind that of the Sentinel missions, each of which relies on two satellite systems. Differentiating between the two Sentinel missions, the application of the Sentinel 3 satellites is limited to large inland water surfaces due to their poor spatial resolution of 300 m to 1000 m. In addition, the Landsat satellite missions provide an attractive spatial and temporal resolution as well.
However, the regression results of the models are the worst with this data since the Landsat missions are characterized by the lowest spectral resolution of all simulated satellite missions.

Table 2. Performance of the regression models expressed by MAE in µg L⁻¹.
Simulated satellite data | RF | SVM | ANN | MARS
EnMAP | 10.9 | 12.6 | 11.7 | 10.1
Hyperion | 11.3 | 12.2 | 11.3 | 10.5
Landsat 5 | 17.8 | 18.5 | 19.6 | 19.0
Landsat 8 | 18.8 | 18.8 | 20.0 | 20.5
Sentinel 2 | 14.8 | 13.2 | 11.5 | 14.2
Sentinel 3 | 14.3 | 14.1 | 10.9 | 13.0
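For reference, the two metrics reported in Figure 2 and Table 2 can be computed as in the following minimal Python sketch. It assumes the common definition of the coefficient of determination, R² = 1 − SS_res / SS_tot; the text does not state which definition was used. The example values are hypothetical and are not results of this study.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean absolute error, in the unit of the target (here µg/L)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed as 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares between reference and estimate.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares of the reference values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical reference and estimated chl a concentrations.
reference = [10.0, 20.0, 30.0, 40.0]
estimate = [12.0, 18.0, 33.0, 37.0]

mae = mean_absolute_error(reference, estimate)
r2 = r_squared(reference, estimate)
print(mae, 100.0 * r2)  # MAE and R² in %
```

The MAE keeps the unit of the chl a concentration, which makes it directly comparable across the simulated datasets, whereas R² is unitless and is reported in percent in Figure 2.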
5. CONCLUSION
In this paper, we address the estimation of the chl a concentration with different simulated spectral data and supervised ML models. We rely on a spectrometer dataset measured at several inland water bodies. For the simulation of the satellite-based data, we chose six different satellite missions as examples. In addition, we apply four different supervised ML models for the estimation of the chl a concentration.

When comparing the simulated satellite data, the regression performance of all models with the simulated hyperspectral data achieves the best results due to their spectral and spatial resolution. Referring to the estimation results, the ML models combined with the simulated Sentinel data perform slightly worse than the models based on the simulated hyperspectral data. Regarding the applicability for a generic monitoring approach of inland waters, the Sentinel 2 mission provides the best option for smaller water bodies. The Sentinel 3 mission is an alternative for large water bodies.

When focusing on the different ML models, the choice of a specific ML model has a minor impact on the regression performance. Only the ANN model outperforms the other models when using the simulated Sentinel data.

In this study, we have focused on the estimation of the chl a concentration as a selected water quality parameter. For the estimation of further quality parameters such as different algae types, the (simulated) hyperspectral data could provide an excellent basis due to its high spectral resolution. The choice of ML models and the (simulated) satellite data has to be adapted according to the respective water quality parameter to be estimated. This investigation could be addressed in future work.
6. REFERENCES
[1] United Nations Department of Economic and Social Affairs, The Sustainable Development Goals Report 2018, 2018.
[2] Koponen, S., Pulliainen, J., Kallio, K. and Hallikainen, M., “Lake water quality classification with airborne hyperspectral spectrometer and simulated MERIS data,” Remote Sensing of Environment, vol. 79, pp. 51–59, 2002.
[3] Palmer, S., Kutser, T. and Hunter, P. D., “Remote sensing of inland waters: Challenges, progress and future directions,” Remote Sensing of Environment, vol. 157, pp. 1–8, 2015.
[4] Maier, P. M., Hinz, S. and Keller, S., “Estimation of Chlorophyll a, Diatoms and Green Algae Based on Hyperspectral Data with Machine Learning Approaches,” 38. Wissenschaftlich-Technische Jahrestagung der DGPF und PFGK18 Tagung in München, vol. 27, pp. 49–57, 2018.
[5] Morel, A. and Prieur, L., “Analysis of variations in ocean color,” Limnology and Oceanography, vol. 22, no. 4, pp. 709–722, 1977.
[6] Keller, S., Maier, P., Riese, F., Norra, S., Holbach, A., Börsig, N., Wilhelms, A., Moldaenke, C., Zaake, A. and Hinz, S., “Hyperspectral Data and Machine Learning for Estimating CDOM, Chlorophyll a, Diatoms, Green Algae and Turbidity,” International Journal of Environmental Research and Public Health, vol. 15, no. 9, 2018.
[7] Gitelson, A., “The peak near 700 nm on radiance spectra of algae and water: Relationships of its magnitude and position with chlorophyll concentration,” International Journal of Remote Sensing, vol. 13, no. 17, pp. 3367–3373, 1992.
[8] Gons, H. J., “Optical Teledetection of Chlorophyll a in Turbid Inland Waters,” Environmental Science & Technology, vol. 33, no. 7, pp. 1127–1132, 1999.
[9] Keiner, L. E. and Yan, X. H., “A Neural Network Model for Estimating Sea Surface Chlorophyll and Sediments from Thematic Mapper Imagery,” Remote Sensing of Environment, vol. 66, no. 2, pp. 153–165, 1998.
[10] Matarrese, R., Morea, A., Tijani, K., de Pasquale, V., Chiaradia, M. T. and Pasquariello, G., “A Specialized Support Vector Machine for Coastal Water Chlorophyll Retrieval from Water Leaving Reflectances,” in IGARSS 2008 - 2008 IEEE International Geoscience and Remote Sensing Symposium. 2008, pp. IV-910–IV-913, IEEE.
[11] González Vilas, L., Spyrakos, E. and Torres Palenzuela, J. M., “Neural network estimation of chlorophyll a from MERIS full resolution data for the coastal waters of Galician rias (NW Spain),” Remote Sensing of Environment, vol. 115, no. 2, pp. 524–535, 2011.
[12] Maier, P. M. and Keller, S., “Machine learning regression on hyperspectral data to estimate multiple water parameters,” arXiv preprint, 2018.
[13] O’Reilly, J. E., Maritorena, S., Mitchell, B. G., Siegel, D. A., Carder, K. L., Garver, S. A., Kahru, M. and McClain, C., “Ocean color chlorophyll algorithms for SeaWiFS,” Journal of Geophysical Research: Oceans, vol. 103, no. C11, pp. 24937–24953, 1998.
[14] Dekker, A. G., Malthus, T. J., Wijnen, M. M. and Seyhan, E., “The effect of spectral bandwidth and positioning on the spectral signature analysis of inland waters,” Remote Sensing of Environment, vol. 41, no. 2-3, pp. 211–225, 1992.
[15] Beck, R., Zhan, S., Liu, H., Tong, S., Yang, B., Xu, M., Ye, Z., Huang, Y., Shu, S., Wu, Q., Wang, S., Berling, K., Murray, A., Emery, E., Reif, M., Harwood, J., Young, J., Nietch, C., Macke, D., Martin, M., Stillings, G., Stump, R. and Su, H., “Comparison of satellite reflectance algorithms for estimating chlorophyll-a in a temperate reservoir using coincident hyperspectral aircraft imagery and dense coincident surface observations,” Remote Sensing of Environment, vol. 178, pp. 15–30, 2016.
[16] Maier, P. M. and Keller, S., “Estimating Chlorophyll a Concentrations of Several Inland Waters with Hyperspectral Data and Machine Learning Models,” arXiv preprint, 2019.
[17] Lehnert, L. W., Meyer, H. and Bendix, J., “Hsdar: manage, analyse and simulate hyperspectral data in R,” R Package Version 0.4, 2016.
[18] Fletcher, K. (Ed.), “Sentinel-3: ESA’s Global Land and Ocean Mission for GMES Operational Services,” ESA Communications, 2012.
[19] Vapnik, V., The nature of statistical learning theory, Springer Science & Business Media, 2013.
[20] Breiman, L., “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[21] Milborrow, S., “Multivariate Adaptive Regression Splines,” 2018.
[22] Ripley, B. D.,