[PDF] A new method for faster and more accurate inference of species associations from big community data

Abstract

Joint Species Distribution models (jSDMs) explain spatial variation in community composition by contributions of the environment, biotic associations, and possibly spatially structured residual variance. They show great promise as a general analytical framework for community ecology and macroecology, but current jSDMs scale poorly on large datasets, limiting their usefulness for novel community data, such as datasets generated using metabarcoding and metagenomics. Here, we present sjSDM, a novel method for estimating jSDMs that is based on Monte-Carlo integration of the joint likelihood. Implemented in PyTorch, a modern machine learning framework that can make use of CPU and GPU calculations, this approach is orders of magnitude faster than existing jSDM algorithms and can be scaled to very large datasets. Despite the dramatically improved speed, sjSDM produces the same predictive error and more accurate estimates of species association structures than alternative jSDM implementations. We provide our method in an R package to facilitate its applicability for practical data analysis.

Full PDF

11 A new method for faster and more accurate inference of species associations from big community data

Maximilian Pichler , Florian Hartig Theoretical Ecology, University of Regensburg, Universitätsstraße 31, 93053 Regensburg, Germany *corresponding author, [email protected]

Keywords: machine learning, regularization, metacommunity, co-occurrence, big data, statistics

Abstract

Main

Understanding the structure and assembly of ecological communities is a central concern for ecology, biogeography and macroecology (Vellend 2010). The question is tightly connected to many important research programs of the field, including coexistence theory (see Chesson 2000; e.g. Levine et al. et al. et al. et al. et al. et al. et al. et al. ‐ Neto & Legendre 2010; Dormann et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al. et al.

Results

Method validation and benchmark against state-of-the-art jSDMs

Computational speed

On a GPU, our approach (G-sjSDM) required under 3 seconds runtime for any of our simulations with 50 to 500 sites and 5 to 250 species. When run on CPUs only (C-sjSDM), runtimes increased to a maximum of around 2 minutes (Fig. 1a, Fig. S1). In comparison, Hmsc had a runtime of around 7 minutes for our smallest scenario and increase in runtime exponentially when the number of species exceeded 40 (Fig. 1a). BayesComm was slightly faster than Hmsc but scaled worse than Hmsc to large data sizes (Fig. S1). Gllvm achieved fast runtimes, on par and sometimes better than our method for small data (< 50 species), but beyond that runtime started to increase exponentially as well, leading to runtimes >10 min for our larger test cases (Fig. 1a).

Figure 1:

Runtime benchmarks for G-sjSDM, C-sjSDM, gllvm, BayesComm, and Hmsc fit to different simulated scenarios: 50 to 500 sites with 10% (a), 30% (b) and 50% (c) number of species (e.g. for 100 sites and 10% we get 10 species). For each scenario, ten datasets were simulated an analyzed, and results were averaged. BayesComm was aborted for some scenarios due to too long runtimes.

Because of the runtime limitations of the other approaches, we calculated big data benchmarks only for G-sjSDM. The overall runtimes for G-sjSDM increased from under one minute for 5,000 sites to a maximum of around 4.5 minutes for 30,000 sites (Fig. 2). G-sJSDM showed greater runtime increases when increasing numbers of sites, while the numbers of species (300, 500, and 1,000 species in each scenario) had only small effects on runtimes (Fig. 2).

Figure 2:

Benchmark results for sjSDM on big community data. We simulated communities with 5,000, 15,000, and 30,000 sites and for each 300, 500, and 1,000 species.

For the empirical benchmarking datasets from Wilkinson et al. (2019), the CPU based C-sJSDM model achieved a 3.8 times lower runtime for the bird dataset and 23 times lower runtime for the butterfly dataset than MLP (which corresponds to BayesComm), the fastest jSDM in Wilkinson et al.

Table 1:

Model runtimes in hours. Results for MPR, MLR, HPR, LPR, DPR, HLR-S, HLR-NS taken from (Wilkinson et al.

Accuracy of the inference about species-environment and species-species associations

For simulated data with dense species-species association structures, BayesComm and sjSDM consistently achieved higher accuracy in the inferred species-species associations than the LVM models Hmsc and gllvm (Fig. 3 a). The performance of all methods dropped with an increasing proportion of species, to around 70% for the non-latent and 60% for the LVM models (Fig. 3 a). Even for communities with 300 to 1,000 species, sjSDM achieved accuracies of 69% percent and higher (Table S4). For environmental preferences (measured

Results from Wilkinson et al. 2019 Our approach Dataset Size (site * species) MPR MLR HPR LPR DPR HLR-S HLR-NS C-sjSDM G-sjSDM

Birds (Harris 2015) 2,752 * 370 3.8 >168 NA 120.4 27.3 >168 15.2

Butterflies (Ovaskainen et al.

Eucalpyts (Pollock et al. <0.001 <0.001

Frogs (Pollock et al. <0.001 <0.001

Fungi (Ovaskainen et al. <0.001 <0.001

Mosquitos (Golding 2015) 167 * 16 <0.02 >168 6.4 0.14 0.73 2.0 0.2 <0.001 <0.001

1 by RMSE), Hmsc showed slightly higher inferential performance when the number of sites was low (Fig. S2b) while all models performed approximately equal for a high number of sites (Fig. S2b).

Figure 3:

Accuracy of the inferred environmental responses and species-species associations.

Models were fit to simulated data with 50 to 500 sites and the number of species set to 0.1, 0.3 and 0.5 times the number of sites. All values are averages from 5 simulated datasets. The upper row shows the accuracies of matching signs (positive or negative covariance) for the estimated and true species-species association matrix. The lower row shows the accuracy of inferring non-zero species associations for sparse association matrices (95% sparsity), measured by the true skill statistic (absolute associations smaller than 0.01 were assigned the class ‘0’ and absolute associations greater than 0.01 were assigned the class ‘1’).

For simulated data with sparse species-species association structures (95% sparsity), sjSDM achieved the highest TSS (up to 0.35-0.38 with 30% and 50% species, see Fig. 3b). Hmsc showed for 10% species the second highest TSS (Fig. 3b), but for 30% and 50% species together with gllvm the lowest TSS (a maximum of 0.1 TSS for 30% and 50% species). BayesComm showed in average the lowest TSS for 10% species, but for 30% and 50% species the second highest TSS (Fig. 3b). The inferential performance regarding the environmental predictors showed the same pattern as for dense species-species associations. 2 All models improved their environmental accuracy (Fig. S2c) and reduced RMSE as the number of sites increased (Fig. S2d).

Predicting species occurrences

All models performed similarly in predicting species occurences in the simulation scenarios, with predictive accuracies of around 0.75 AUC (Fig. 4).

Figure 4:

Predictive performance in simulated species distributions for G-sJSDM and C-sJSDM with gllvm, BayesComm, and Hmsc as references. Species distribution scenarios with 50 to 500 sites and 10% (a), 30% (b) and 50% (c) species were simulated, on which the models were fitted (training). Models predicted species distributions for additional 50 to 500 sites (testing). Area under the curve (AUC) was used to evaluate predictive performance on holdout.

Case Study - Inference of species-species associations from eDNA

In our eDNA case study with 3,649 OTUs over 125 sites, we found that without regularization, sjSDM inferred the strongest negative species-species covariances among the most abundant species and the strongest positive species-species associations among the rarest species (Fig. 5a, b). When optimizing the regularization strength for the species-species associations via a leave-one-out cross-validation, positive and negative species-species associations changed somewhat, but the overall pattern stayed qualitatively constant (Fig.5 a, b). For the 3 equally regularized environmental covariates, we found that most species showed the highest dependency on ellenberg F (moisture), ellenberg L (light availability), and ellenberg N (nitrogen).

Figure 5:

Inferred species associations and environmental preferences for the eDNA community data. The left column (panel A-C) shows species-species associations for different regularization penalties, with the 5564 species sorted according to their mean abundance in the 126 sites. The large panel D) shows the covariance structure of c), but with species sorted after their most important environmental coefficients (the outer ring shows the environmental effect distribution for the species within the group).

Discussion jSDMs extend standard species distribution models by accounting also for species-species associations. Current jSDM software, however, exhibit computational limitations for the large community matrices, which limits their use for big community data that are created by novel methods such as eDNA studies and metabarcoding. Here, we presented sjSDM, a new numerical approach for fitting jSDMs that uses Monte-Carlo integration of the model likelihood, 4 which allows moving calculations to GPUs. We show that this approach is orders of magnitude faster than existing methods (even when run on the CPU) and predicts as well as any of the other jSDM packages that we used as a benchmark. To avoid overfitting, especially when fitting sjSDM to hitherto computationally unrealistic eDNA datasets with thousands of species, we introduced a flexible elastic net regularization on species associations and environmental preferences. sjSDM inferred the signs of full association matrices and identified zero/non-zero entries in sparse species-species associations across a wide range of scenarios better than all tested alternatives. Advantages over non-latent jSDMs occurred in particular for sparse species-species associations, while improvement over latent-variable were visible for all tested species-species association structures.

Computational performance

Wheras runtimes for Hmsc, BayesComm, and gllvm started to increase exponentially when the number of species exceeded around 100, sjSDM scaled close to linearly with the number of species regardless of whether we used GPU or CPU computations (Fig. 1a). A further advantage of sjSDM is that, unlike in particular the MCMC algorithms used in BayesComm and Hmsc, it is highly parallelizable, which allows using the advantages of modern computer hardware such as GPUs. These two properties, scalability and parallelizability, make sjSDM the first and currently only jSDM software package that seems capable of analyzing big eDNA datasets (Humphreys et al. et al. et al.

Inferential performance

All jSDM implementations showed similar performance in correctly inferring environmental responses, but the non-latent approaches sjSDM and BayesComm achieved significant higher accuracy in inferring the correct signs of species-species associations (Fig. 3). The original idea of LVM algorithms was to the reduce the overall model complexity and the accompanying runtime, but apparently the constraints imposed by this structure create some bias that showed in particular for dense species association structures (compare Fig. 2a, Fig. S5). This is not particularly surprising, as similar phenomena have been found also for other approaches to covariance regularizations, for example in spatial models (Stein 2014). It is difficult to estimate how important these biases are in practical applications, because we still know too little about the typical structure of species associations in real ecological data (Ovaskainen et al. et al.

Implications and outlook for ecological data analysis

The jSDM structure has the potential to become the new default statistical approach for species and community observations that originate from eDNA and similar novel community data. To achieve this promise, however, we require statistical algorithms scale computationally to big datasets, and deliver accurate inference, in particular for a large number of species or operational taxonomic units. Our results show that a combination of a scalable and parallelizable Monte-Carlo approximation of the likelihood, together with a shrinkage regularization of the species-species covariance, might be a configuration that can achieve both goals. We see further need for research in particular on the of how to impose regularization, or other structures on the covariance. In principle, all software packages that we compared could include additional regularization methods, such as the elastic net employed in our approach. Better understanding the use of such statistical approaches is one promising route for further research. Another option would be to impose ecologically motivated structures on the species-species covariance matrix. Another interesting question is how ecologist will use jSDMs, once they scale to big data. Many recent studies have stressed that jSDMs may improve predictions (e.g. Norberg et al. et al. et al. et al. et al. et al. et al. et al.

Conclusion

We presented sjSDM, a new method to fit jSDMs, and benchmarked it against state-of-the-art jSDM software. sjSDM is orders of magnitudes faster than current alternatives, and it can be flexibly regularized, which leads to overall superior performance in inferring the correct species association structure. We emphasize that the superior scaling holds also when using CPU computations, and that the possibility to move calculations on a GPU is only a further advantage of the algorithm. We provide our tool in a R package (available for Linux, MacOS, and Windows), with a simple and intuitive interface and the ability to switch easily between linear and non-linear modeling, as well as between CPU and GPU computing. The R package also includes extensions for considering abundance data as well as spatial predictors, and to partition the importance of space, environment, and species associations for predicting the observed community composition.

Methods

The structure of the jSDM problem

Species-environment associations are classically addressed by species distribution models (SDM), which estimate the probability of abundance or presence of a species as a function of the environmental predictors. The functional form of the niche can be expressed by GLMs, or by more flexible approaches such as generalized additive models, boosted regression trees or Random Forest (see Elith & Leathwick 2009). A jSDM generalizes this approach by including the possibility of species-species associations. By an association, we mean that two species tend to appear together more or less often than expected from their environmental responses alone. The most common structure that has emerged to fit species associations is the MVP, which describes the site by species matrix 9 𝑦𝑦 𝑖𝑖𝑖𝑖 as a function of the environment, plus a multivariate normal distribution that describes the species-species associations: 𝑍𝑍 𝑖𝑖𝑖𝑖 = 𝛽𝛽 + 𝑋𝑋 𝑖𝑖𝑖𝑖 ∗ 𝛽𝛽 𝑖𝑖𝑖𝑖 + 𝑀𝑀𝑀𝑀𝑀𝑀� Σ 𝑖𝑖𝑖𝑖 � 𝑌𝑌 𝑖𝑖𝑖𝑖 = 1( 𝑍𝑍 𝑖𝑖𝑖𝑖 > 0) (1) After the model is fit, the fitted species-species covariance matrix sigma is normally transformed into a correlation matrix for further interpretation. Current approaches to fit the jSDM model structure

The model structure described in eq. 2 can be fit directly, using logit (Ovaskainen et al. et al. et al. j species increases quadratically as (j*(j-1)/2), i.e. for 50 species there are 2250 parameters to fit. Because of these problems, a series of papers (Warton et al. et al. et al. et al. et al. 𝛴𝛴 𝑖𝑖 , 𝑖𝑖 = 𝜆𝜆 𝑖𝑖𝑗𝑗 ∗ 𝜆𝜆 𝑖𝑖𝑗𝑗𝑇𝑇 ( 𝜆𝜆 = factor loadings), allowing the model to effectively fit species associations. The latent variables are sometimes interpreted as unobserved environmental predictors, but in general, they are probably better viewed as a purely technical construct that amounts to a regularized reparameterization of the covariance 0 matrix. The complexity of association structure can be set via the number of latent variables (usually to a low number, see Warton et al. An alternative approach to fit the jSDM structure

Because the latent-variable models still have computational limitations, and also because of the need for flexible regularization discussed in the introduction, we propose a different approach to fit the model structure in eq. 1. The main computational issue of the MVP is that the cumulative multivariate normal distribution, used as the inverse link function, has no closed analytical expression, which makes the evaluation of the likelihood costly. The algorithm that we present as sjSDM is based on a Monte-Carlo approximation of the MVP likelihood proposed by Chen et al. et al. et al. k is sampled from a multivariate Gaussian based on the species covariance matrix (Fig. S1). The conditioned random vectors are added to linear predictions of the remaining model, and a univariate probit link is applied to the results. Based on this, the binomial likelihood is calculated for the observed data (Fig. S1). Here, we use the same idea, but apply it to the standard generalized linear MVP, which means that we conform to the model structure typically used in this field and can profit from all benefits associated with parametric models. Note that despite simulating all test data with a probit link, following standard assumptions of existing jSDM packages, we used a logit link in sjSDM instead of the probit link, because interim results showed a higher accuracy of the estimates (in particular the environmental coefficients) for this structure. We implemented the method in an R package (see section Data Accessibility), using PyTorch (Paszke et al. et al. Benchmarking our method against state-of-the-art jSDM implementation

To benchmark our approach, we used six datasets from Wilkinson et al. et al. et al. (2019), and two state-of-the-art latent-variable jSDM implementations: Hmsc (version 3.0-4, Tikhonov et al. (version 1.2.1, Niku et al.

Regularization to infer sparse species-species associations

For the previous benchmark, we simulated data under the assumption that all species interact. While this assumption may or may not be realistic, it is generally desirable for a method to work well also when there is only a small number of associations, i.e. when the species-species covariance matrix is sparse. We were particularly interested in this question because we conjectured that the LVM approach imposes correlations on the species-species associations that makes it difficult for LVM models to fit arbitrarily sparse covariance structures. We therefore simulated the same scenarios, but with 95% sparsity in the species-species associations. To adjust our model to such a sparse structure, we applied an elastic net shrinkage (Zou & Hastie 2005) on all off-diagonals of the covariance matrix. Following common practice in the field, lambda and alpha were tuned via leave-one-out cross-validation in 40 random steps. Here, we used 1000 samples for the MVP approximation in sjSDM, because interim results showed that the approximation error can be crucial for the tuning. For BayesComm, and gllvm we used the default settings (see details and additional comments in Supporting Information S1). For Hmsc, also the default settings were used and following Tikhonov et al. et al.

Case study – Inference of species-species associations from eDNA

To demonstrate the practicability of our approach, we fit our model to an eDNA community dataset from a published study that sampled 130 sites across Denmark (for details on the study design see Brunbjerg et al. et al. et al. et al.

Acknowledgments

We would like to thank Douglas Yu and Yuanheng Li for valuable comments and suggestions. 5

Authors’ Contributions

FH and MP jointly conceived and designed the study. MP implemented the sjSDM software, ran the experiments, and analyzed the data. Both authors contributed equally to discussing and interpreting the results, and to the preparation of the manuscript.

Data Accessibility

The compiled datasets for runtime benchmarking (case study 1) are available as supporting information for Wilkinson et al.

References

Allouche, Omri., Tsoar, A. & Kadmon, R. (2006). Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS).

Journal of Applied Ecology , 43, 1223–1232. Arnqvist, G. (2020). Mixed Models Offer No Freedom from Degrees of Freedom.

Trends in Ecology & Evolution , S0169534719303465. Bálint, M., Pfenninger, M., Grossart, H.-P., Taberlet, P., Vellend, M., Leibold, M.A., et al. (2018). Environmental DNA Time Series in Ecology.

Trends in Ecology & Evolution , 33, 945–957. Barraquand, F., Picoche, C., Detto, M. & Hartig, F. (2019). Inferring species interactions using Granger causality and convergent cross mapping. arXiv:1909.00731 [q-bio, stat] . Bartholomew, D.J., Knott, M. & Moustaki, I. (2011).

Latent variable models and factor analysis: A unified approach . John Wiley & Sons. Blanchet, F.G., Cazelles, K. & Gravel, D. (2020). Co ‐ occurrence is not evidence of ecological interactions. Ecol Lett , ele.13525. Brunbjerg, A.K., Bruun, H.H., Moeslund, J.E., Sadler, J.P., Svenning, J.-C. & Ejrnæs, R. (2017). Ecospace: A unified framework for understanding variation in terrestrial biodiversity.

Basic and Applied Ecology , 18, 86–94. Calatayud, J., Andivia, E., Escudero, A., Melián, C.J., Bernardo-Madrid, R., Stoffel, M., et al. (2019). Positive associations among rare species and their persistence in ecological assemblages.

Nature Ecology & Evolution , 1–6. Chen, D., Xue, Y. & Gomes, C.P. (2018). End-to-End Learning for the Deep Multivariate Probit Model. arXiv:1803.08591 [cs, stat] . Chesson, P. (2000). Mechanisms of Maintenance of Species Diversity.

Annual Review of Ecology and Systematics , 31, 343–366. 7 Chib, S. & Greenberg, E. (1998). Analysis of multivariate probit models.

Biometrika , 85, 347–361. Clark, J.S., Gelfand, A.E., Woodall, C.W. & Zhu, K. (2014). More than the sum of the parts: forest climate response from joint species distribution models.

Ecological Applications , 24, 990–999. Cottenie, K. (2005). Integrating environmental and spatial processes in ecological community dynamics: Meta-analysis of metacommunities.

Ecology Letters , 8, 1175–1182. Cristescu, M.E. (2014). From barcoding single individuals to metabarcoding biological communities: towards an integrative approach to the study of global biodiversity.

Trends in Ecology & Evolution , 29, 566–571. Deiner, K., Bik, H.M., Mächler, E., Seymour, M., Lacoursière ‐ Roussel, A., Altermatt, F., et al. (2017). Environmental DNA metabarcoding: Transforming how we survey animal and plant communities.

Molecular Ecology , 26, 5872–5895. Desjonquères, C., Gifford, T. & Linke, S. (n.d.). Passive acoustic monitoring as a potential tool to survey animal and ecosystem processes in freshwater environments.

Freshwater Biology , 0. Dormann, C.F., Bobrowski, M., Dehling, D.M., Harris, D.J., Hartig, F., Lischke, H., et al. (2018). Biotic interactions in species distribution modelling: 10 questions to guide interpretation and avoid false conclusions.

Global Ecol Biogeogr , 27, 1004–1016. Dormann, C.F., Schymanski, S.J., Cabral, J., Chuine, I., Graham, C., Hartig, F., et al. (2012). Correlation and process in species distribution models: bridging a dichotomy.

Journal of Biogeography , 39, 2119–2131. Elith, J. & Leathwick, J.R. (2009). Species Distribution Models: Ecological Explanation and Prediction Across Space and Time.

Annual Review of Ecology, Evolution, and Systematics , 40, 677–697. Fritzler, A., Koitka, S. & Friedrich, C.M. (2017). Recognizing Bird Species in Audio Files Using Transfer Learning, 14. 8 Frøslev, T.G., Kjøller, R., Bruun, H.H., Ejrnæs, R., Hansen, A.J., Læssøe, T., et al. (2019). Man against machine: Do fungal fruitbodies and eDNA give similar biodiversity assessments across broad environmental gradients?

Biological Conservation , 233, 201–212. Gallien, L., Douzet, R., Pratte, S., Zimmermann, N.E. & Thuiller, W. (2012). Invasive species distribution models – how violating the equilibrium assumption can create new insights.

Global Ecology and Biogeography , 21, 1126–1136. Gilbert, B. & Bennett, J.R. (2010). Partitioning variation in ecological communities: do the numbers add up?

Journal of Applied Ecology , 47, 1071–1082. Golding, N. (2015). Mosquito community data for Golding et al. 2015 (Parasites & Vectors). Golding, N. & Harris, D.J. (2015).

BayesComm: Bayesian community ecology analysis . Golding, N., Nunn, M.A. & Purse, B.V. (2015). Identifying biotic interactions which drive the spatial distribution of a mosquito community.

Parasites & Vectors , 8, 367. Guirado, E., Tabik, S., L. Rivas, M., Alcaraz-Segura, D. & Herrera, F. (2018). Automatic whale counting in satellite images with deep learning. bioRxiv . Harris, D.J. (2015). Generating realistic assemblages with a joint species distribution model.

Methods in Ecology and Evolution , 6, 465–473. Hui, F.K.C. (2016). boral – Bayesian Ordination and Regression Analysis of Multivariate Abundance Data in r.

Methods in Ecology and Evolution , 7, 744–750. Humphreys, J.M., Murrow, J.L., Sullivan, J.D. & Prosser, D.J. (2019). Seasonal occurrence and abundance of dabbling ducks across the continental United States: Joint spatio-temporal modelling for the Genus Anas.

Diversity and Distributions , 25, 1497–1508. Ingram, M., Vukcevic, D. & Golding, N. (n.d.). Multi-Output Gaussian Processes for Species Distribution Modelling.

Methods in Ecology and Evolution , n/a. Krapu, C. & Borsuk, M. (2020). A spatial community regression approach to exploratory analysis of ecological data.

Methods in Ecology and Evolution , n/a. Lasseck, M. (2018). Audio-based bird species identification with deep convolutional neural networks.

Working Notes of CLEF , 2018. 9 LeCun, Y., Bengio, Y. & Hinton, G. (2015). Deep learning.

Nature , 521, 436. Leibold, M.A. & Chase, J.M. (2017).

Metacommunity ecology . Princeton University Press. Leibold, M.A., Holyoak, M., Mouquet, N., Amarasekare, P., Chase, J.M., Hoopes, M.F., et al. (2004). The metacommunity concept: a framework for multi-scale community ecology.

Ecology Letters , 7, 601–613. Levine, J.M., Bascompte, J., Adler, P.B. & Allesina, S. (2017). Beyond pairwise mechanisms of species coexistence in complex communities.

Nature , 546, 56–64. Mac Aodha, O., Gibb, R., Barlow, K.E., Browning, E., Firman, M., Freeman, R., et al. (2018). Bat detective–Deep learning tools for bat acoustic signal detection.

PLOS Computational Biology , 14, e1005995. Mainali, K.P., Warren, D.L., Dhileepan, K., McConnachie, A., Strathie, L., Hassan, G., et al. (2015). Projecting future expansion of invasive species: comparing and improving methodologies for species distribution modeling.

Global Change Biology , 21, 4464–4480. Mittelbach, G.G. & Schemske, D.W. (2015). Ecological and evolutionary perspectives on community assembly.

Trends in Ecology & Evolution , 30, 241–247. Nakagawa, S. & Schielzeth, H. (2013). A general and simple method for obtaining R2 from generalized linear mixed-effects models.

Methods in Ecology and Evolution , 4, 133–142. Niku, J., Brooks, W., Herliansyah, R., Hui, F.K.C., Taskinen, S. & Warton, D.I. (2020). gllvm: Generalized linear latent variable models . Niku, J., Hui, F.K.C., Taskinen, S. & Warton, D.I. (2019). gllvm: Fast analysis of multivariate abundance data with generalized linear latent variable models in r.

Methods in Ecology and Evolution , n/a. Norberg, A., Abrego, N., Blanchet, F.G., Adler, F.R., Anderson, B.J., Anttila, J., et al. (2019). A comprehensive evaluation of predictive performance of 33 species distribution models at species and community levels.

Ecological Monographs , 0, e01370. 0 Ovaskainen, O., Abrego, N., Halme, P. & Dunson, D. (2016). Using latent variable models to identify large networks of species-to-species associations at different spatial scales.

Methods in Ecology and Evolution , 7, 549–555. Ovaskainen, O., Hottola, J. & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions.

Ecology , 91, 2514–2521. Ovaskainen, O., Tikhonov, G., Dunson, D., Grøtan, V., Engen, S., Sæther, B.-E., et al. (2017a). How are species interactions structured in species-rich communities? A new method for analysing time-series data.

Proceedings of the Royal Society B: Biological Sciences , 284, 20170768. Ovaskainen, O., Tikhonov, G., Norberg, A., Blanchet, F.G., Duan, L., Dunson, D., et al. (2017b). How to make more out of community data? A conceptual framework and its implementation as models and software.

Ecology Letters , 20, 561–576. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. (2017). Automatic differentiation in PyTorch. In:

NIPS autodiff workshop . Peres ‐ Neto, P.R. & Legendre, P. (2010). Estimating and controlling for spatial structure in the study of ecological communities.

Global Ecology and Biogeography , 19, 174–184. Picciulin, M., Kéver, L., Parmentier, E. & Bolgan, M. (2019). Listening to the unseen: Passive acoustic monitoring reveals the presence of a cryptic fish species.

Aquatic Conservation: Marine and Freshwater Ecosystems , 29, 202–210. Pimm, S.L., Alibhai, S., Bergl, R., Dehgan, A., Giri, C., Jewell, Z., et al. (2015). Emerging Technologies to Conserve Biodiversity.

Trends in Ecology & Evolution , 30, 685–696. Poisot, T., Stouffer, D.B. & Gravel, D. (2015). Beyond species: why ecological interaction networks vary through space and time.

Oikos , 124, 243–251. Pollock, L.J., Tingley, R., Morris, W.K., Golding, N., O’Hara, R.B., Parris, K.M., et al. (2014). Understanding co-occurrence by modelling species simultaneously with a Joint Species Distribution Model (JSDM).

Methods in Ecology and Evolution , 5, 397–406. 1 Pontarp, M., Bunnefeld, L., Cabral, J.S., Etienne, R.S., Fritz, S.A., Gillespie, R., et al. (2019). The Latitudinal Diversity Gradient: Novel Understanding through Mechanistic Eco-evolutionary Models.

Trends in Ecology & Evolution , 34, 211–223. Sang, H., Jun, M. & Huang, J.Z. (2011). Covariance approximation for large multivariate spatial data sets with an application to multiple climate model errors.

Ann. Appl. Stat. , 5, 2519–2548. Skrondal, A. & Rabe-Hesketh, S. (2004).

Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models . Chapman and Hall/CRC. Stein, M.L. (2007). Spatial variation of total column ozone on a global scale.

Ann. Appl. Stat. , 1, 191–210. Stein, M.L. (2014). Limitations on low rank approximations for covariance matrices of spatial data.

Spatial Statistics , 8, 1–19. Tabak, M.A., Norouzzadeh, M.S., Wolfson, D.W., Sweeney, S.J., Vercauteren, K.C., Snow, N.P., et al. (2019). Machine learning to classify animal species in camera trap images: Applications in ecology.

Methods in Ecology and Evolution , 10, 585–590. Thuiller, W., Lavorel, S., Sykes, M.T. & Araújo, M.B. (2006). Using niche-based modelling to assess the impact of climate change on tree functional diversity in Europe.

Diversity and Distributions , 49–60. Tikhonov, G., Abrego, N., Dunson, D. & Ovaskainen, O. (2017). Using joint species distribution models for evaluating how species-to-species associations depend on the environmental context.

Methods in Ecology and Evolution , 8, 443–452. Tikhonov, G., Duan, L., Abrego, N., Newell, G., White, M., Dunson, D., et al. (2019a). Computationally efficient joint species distribution modeling of big spatial data.

Ecology , n/a, e02929. Tikhonov, G., Opedal, Ø.H., Abrego, N., Lehikoinen, A., Jonge, M.M.J. de, Oksanen, J., et al. (2020). Joint species distribution modelling with the r-package Hmsc.

Methods in Ecology and Evolution , 11, 442–447. 2 Tikhonov, G., Ovaskainen, O., Oksanen, J., de Jonge, M., Opedal, O. & Dallas, T. (2019b).

Hmsc: Hierarchical model of species communities . Tobler, M.W., Kéry, M., Hui, F.K.C., Guillera ‐ Arroita, G., Knaus, P. & Sattler, T. (2019). Joint species distribution models with species correlations and imperfect detection.

Ecology , 100, e02754. Tokuda, T., Goodrich, B., Van Mechelen, I., Gelman, A. & Tuerlinckx, F. (2011). Visualizing distributions of covariance matrices.

Columbia Univ., New York, USA, Tech. Rep , 18–18. Urban, M.C., Bocedi, G., Hendry, A.P., Mihoub, J.-B., Pe’er, G., Singer, A., et al. (2016). Improving the forecast for biodiversity under climate change.

Science , 353. Ushey, K., Allaire, J. & Tang, Y. (2019).

Reticulate: interface to “Python.”

Van der Putten, W.H., Macel, M. & Visser, M.E. (2010). Predicting species distribution and abundance responses to climate change: why it is essential to include biotic interactions across trophic levels.

Philosophical Transactions of the Royal Society B: Biological Sciences , 365, 2025–2034. Vellend, M. (2010). Conceptual Synthesis in Community Ecology.

The Quarterly Review of Biology , 85, 183–206. Warton, D.I., Blanchet, F.G., O’Hara, R.B., Ovaskainen, O., Taskinen, S., Walker, S.C., et al. (2015). So Many Variables: Joint Modeling in Community Ecology.

Trends in Ecology & Evolution , 30, 766–779. Wilkinson, D.P., Golding, N., Guillera ‐ Arroita, G., Tingley, R. & McCarthy, M.A. (2019). A comparison of joint species distribution models for presence–absence data.

Methods in Ecology and Evolution , 10, 198–211. Wisz, M.S., Pottier, J., Kissling, W.D., Pellissier, L., Lenoir, J., Damgaard, C.F., et al. (2013). The role of biotic interactions in shaping distributions and realised assemblages of species: implications for species distribution modelling.

Biological Reviews , 88, 15–30. 3 Wood, C.M., Gutiérrez, R.J. & Peery, M.Z. (2019). Acoustic monitoring reveals a diverse forest owl community, illustrating its potential for basic and applied ecology.

Ecology , 100, e02764. Wrege, P.H., Rowland, E.D., Keen, S. & Shiu, Y. (2017). Acoustic monitoring for conservation in tropical forests: examples from forest elephants.

Methods in Ecology and Evolution , 8, 1292–1301. Zou, H. & Hastie, T. (2005). Regularization and variable selection via the elastic net.

Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 67, 301–320. Zurell, D., Pollock, L.J. & Thuiller, W. (2018). Do joint species distribution models reliably detect interspecific interactions from co-occurrence data in homogenous environments?

Ecography , 41, 1812–1819. Appendix

Simulation scenarios

The MVP can be interpreted as individual GLMs connected by correlated residuals, which are sampled from a multivariate Gaussian, and with a probit link. Sites are notated with i = 1,...M; species with j = 1, …, K; and environmental covariates with n = 1,...N . Environmental covariates and species responses (beta) were uniformly sampled (1,2). The lower triangular covariance matrix was uniformly sampled (3), the diagonal was set to one (4) and multiplicated by the transposed lower triangular to get a symmetric positive definite matrix (4). 𝛽𝛽 , 𝑋𝑋 ∼ 𝑈𝑈 ( −

1, +1) (1) 𝛴𝛴 𝑖𝑖 , 𝑖𝑖𝑗𝑗𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙 ∼ 𝑈𝑈 ( −

1, +1) (2) 𝛴𝛴 𝑖𝑖 = 𝑖𝑖𝑗𝑗𝑙𝑙𝑙𝑙𝑙𝑙𝑙𝑙 = 1 (3) 𝛴𝛴 𝑖𝑖 , 𝑖𝑖 = 𝛴𝛴 ∗ 𝛴𝛴 𝑡𝑡 (4) 𝑦𝑦 𝑖𝑖𝑖𝑖 = 𝑋𝑋 𝑖𝑖𝑖𝑖 ∗ 𝛽𝛽 𝑖𝑖𝑖𝑖 + 𝑀𝑀𝑀𝑀𝑀𝑀 (0; 𝜮𝜮 𝒋𝒋 , 𝒋𝒋 ) (5) 𝑧𝑧 𝑖𝑖𝑖𝑖 = 1 ( 𝑦𝑦 𝑖𝑖𝑖𝑖 > 0) (6) Species responses consist of a linear species - environmental response and correlated residuals (5). Following a probit link, responses higher than zero were set to one and the remaining to 0. Runtime on case study

We used compiled datasets from Wilkinson et al.

Table S1:

Compiled datasets that were taken from Wilkinson et al.

Dataset Original paper Species Sites Covariates

Birds Harris 2015 370 2,752 8 Butterflies Ovaskainen et al. et al. et al. et al.

Approximation of multivariate probit model

To approximate the likelihood in sjSDM, we use a Monte-Carlo approach that was suggested in a slightly different context by Chen et al.

Figure S1:

Overview of the algorithm employed in sjSDM to approximate of the multivariate probit model. Univariate random vectors of length k are conditioned on the lower triangular covariance matrix (red). After the conditioned random vectors are added to the linear response of the species to their environment (yellow), the univariate probit link is applied. Based on the true occurrences (blue), the binomial likelihood is calculated. Model settings and computational environment for the benchmarks

This section provides a more detailed explanation about model settings and the computer setup under which our benchmarks were performed (See Table S1 for an overview). Unless stated otherwise, we used default settings for parameters. 7

Table S2:

Overview of the used approaches

Model Optimization type Package

Multivariate probit model MLE sjSDM MCMC BayesComm Latent-variable model MCMC Hmsc Variational bayes / Laplace approximation gllvm

BayesComm

BayesComm models were fit in 50,000 MCMC sampling iterations, with two chains, thinning = 50, and burn-in of 5000. Prior were not changed from default: normal prior on regression coefficients b ~ N(0;10) and an inverse Wishart prior on the covariance matrix.

Hmsc

Hmsc models were fit in 50,000 MCMC sampling iterations, with two chains, thinning = 50, and burn-in of 5000. Since the two chains were not run in parallel (which is supported by Hmsc), the measured runtime was halved. The number of latent variables in Hmsc are automatically inferred by gamma shrinkage prior. The shrinkage priors of Hmsc were not changed from default. We note that there is the option to tune regularization via shrinkage priors in Hmsc: a regularizes the lower triangular of the species association and a regularizes the number of latent variables (see Bhattacharya & Dunson 2011). We acknowledge that this might improve accuracy of Hmsc inference. On the other hand, it should be noted that a) these settings were not tuned in recent benchmarks, and are likely not tuned by users either. b) the runtime of 8 tuning several combinations would be not practicable (see our results) and c) it is to be expected that a low a results in a higher accuracy but then the LVM approach would approximate the MVP model and that would contradict the LVM’s unique characteristic. gllvm gllvm models were fit as binomial models with probit link. The number of latent variables were increased from 2 to 6 with the number of species. If default starting values = “res” caused an error, model was re-run with starting values = “zero” and if another error occurred, the model was re-run with starting values = “random”. Run time was measured individually, not as a sum over possible model fitting tries. sjSDM sjSDM models were fit in 50 iterations (epochs) and a batch size of 10% number of sites. Learning rate was set to 0.01. 50% number of species were set as degree of freedom (df) for covariance parametrization. For sparse species association matrices, 50 iterations with a learning rate of 0.03 were used and the regularization of the species-species covariances were tuned in 40 random steps and 5-folded cross-validation. 2,000 Monte-Carlo samples were used for the MVP approximation because interim results showed that with too less samples the variance of the models’ log-Likelihoods might interfere with the comparison. Computer setup

Additional results

Scaling of the algorithms in log plots

In the main paper, we provide the benchmarks on a linear scale. Below, we also provide then in log format, which demonstrates that many other software packages, including the CPU version of sjSDM scale exponentially, while G-sjSDM scales sub-exponentially for the scenarios that we tested (Fig S1).

Figure S2:

Results for computational log runtime benchmarking of G-sjSDM, C-sjSDM, gllvm, BayesComm, and Hmsc jSDM implementations. Models were fit to different simulated SDM scenarios: 50 to 500 sites with 10% (a), 30% (b) and 50% (c) number of species (e.g. for 100 sites and 10% we get 10 species). For each scenario, ten simulations were sampled, and results were averaged. Due to high runtimes, runs for BayesComm, gllvm and Hmsc were aborted at specific points.

Additional results of sjSDM on large scale datasets

Beside the runtimes for the large-scale datasets (see main paper), we also calculated the accuracy of the matching signs of predicted and true parameters for the association matrix and environmental coefficients (Table S4). Moreover, the root mean squared error (RMSE) for the environmental coefficients were calculated (Table S4). 0 Overall, we found that the association accuracy decreased from 300 to 1000 species with the different number of sites, but overall, the association accuracy increased with the number of sites (Table S4). The accuracy of environmental coefficients was close to 1.0 in all scenarios and the RMSE for the environmental coefficients was close to zero in all scenarios (Table S4).

Table S4:

Accuracy of matching signs for estimated associations, environmental coefficients, and root mean squared error (RMSE) for environmental coefficients Sites 5,000 15,000 30,000 Species 300 500 1000 300 500 1000 300 500 1000 Covariance accuracy 0.744 0.721 0.688 0.750 0.727 0.693 0.750 0.728 0.692 Env accuracy 0.988 0.985 0.987 0.990 0.989 0.991 0.990 0.990 0.990 Env RMSE 0.040 0.037 0.035 0.034 0.029 0.026 0.033 0.029 0.025

Accuracy of inferring environmental parameters

For non-sparse species-species associations, all models inferred with high accuracy true signs of the environmental predictors and also achieved similar RMSE errors estimated environmental parameters (Fig. S3 a-b). 1

Figure S3:

Results for inferential benchmarking of G-sjSDM with gllvm, BayesComm, and Hmsc as references. Models were fit to different simulated SDM scenarios: 50 to 500 sites with 10%, 30% and 50% number of species (e.g. for 100 sites and 10% we get 10 species). For each scenario, ten simulations were sampled, and results were averaged. Due to high runtimes, runs for BayesComm, gllvm and Hmsc were aborted at specific points. A) and B) show the environmental coefficient accuracy (matching signs) and the corresponding RMSE with full species-species association matrices. C) and D) show the environmental coefficient accuracy (matching signs) and the corresponding RMSE with sparse (50% sparsity) species-species association matrices.

For sparse species associations, the models achieved similar performances in inferring the environmental parameters (Fig. S2 c-d). 2

Convergence check

To check convergence of sigma and the betas, the potential scale reduction factors (psrf) for Hmsc and BayesComm in the simulation scenarios were calculated (two chains, burn-in = 5000 and 50,000 sampling iterations). We found no psrf > 1.2 for BayesComm, but for Hmsc in most simulation scenarios at least for one parameter (beta or lambda (factor loadings)) a psrf > 1.2 (Table S2). In case of Hmsc, we removed the psrf factors for the last latent factor loading because interim results showed that the psrfs for the last factor loading were on average always higher than 1.2, although they were estimated to be very small (perhaps a numerical issue).

Figure S4:

Rate of weights in percent with potential scale reduction factor > 1.2 with non-sparse and sparse association matrices in simulation scenarios for HMSC (for factor loadings and beta estimates) and BayesComm (covariance and beta estimates.

Covariance accuracy behavior

To further assess the jSDM’s behavior in inferring the species-species association matrix, we set the number of species to 50 and increased the number of sites from 50 to 330. For each, 3 step we computed the averaged (we sampled 5 scenarios for each setting) covariance accuracy and environmental RMSE. BayesComm achieved at 330 sites around 0.82 accuracy and sjSDM around 0.80. sjSDM and BayesComm increased the covariance accuracy steadily with the number of sites, while Hmsc and gllvm stopped increasing their accuracy at around 0.68 accuracy (Fig. S4 a). sjSDM and BayesComm achieved in average 0.1 more accuracy than Hmsc and gllvm (Fig. S4 a). All models achieved a similar RMSE over all scenarios (Fig. S4 b). sjSDM showed overall the highest RMSE (Fig. S4 b). All models decreased their RMSE with increasing number of sites (Fig. S4 b).

Figure S5:

Results for examining the ability to recover the covariance structure in dependence of number of sites for G-sjSDM, BC, gllvm, and Hmsc. In the simulated species distribution scenarios, the number of species were constantly set to 50, but the number of sites were varied from 50 to 330 sites. Performance was measured by the accuracy of matching sings between estimated covariances matrices and true covariance matrices (A). Moreover, the root mean squared error for the environmental effects with the true coefficients were calculated (B).

Simulation from a Latent Variable Model

We also simulated new data from a latent variable model varying the number of species from 5 to 100 (5, 10, 25, 50, 100) and the number of latent variables (1 - 5) with a constant number of 500 sites. In all simulations, the species’ environmental preference was described for three 4 environmental covariates (beta), which was randomly selected. For each scenario, we simulated 5 communities. The number of latent variables in gllvm was set to the real number of latent variables. For the simulated LVM communities, all models achieved with increasing number of species high accuracy in inferring the signs of the species-species association matrices (Fig. S5 a). gllvm showed for 5 species lower performance than Hmsc and sjDM independent of the number of latent variables. sjSDM showed lower accuracy for communities with 50, and 100, species in the case of 1 -2 latent variables (Fig. S5 a). Hmsc and gllvm showed in call communities higher RMSE error between the true and estimated environmental coefficients but they were able to reduce the RMSE error with increasing number of species (Fig. S5b)

Figure S6:

Results for communities simulated by a latent variable model. Models were fit to different simulation scenarios: 5, 10, 25, 50 and 100 species with 1 - 5 latent variables. Number of environmental coefficients were set to five. For each scenario, five datasets were analyzed, and results were averaged. The upper row shows the accuracy in inferring the signs of species-species associations (negative or positive covariance). The lower row shows the RMSE error between the true and the estimated environmental predictors. References

Bhattacharya, A. & Dunson, D.B. (2011). Sparse Bayesian infinite factor models.

Biometrika , 98, 291–306. Chen, D., Xue, Y. & Gomes, C.P. (2018). End-to-End Learning for the Deep Multivariate Probit Model. arXiv:1803.08591 [cs, stat] . Golding, N. (2015). Mosquito community data for Golding et al. 2015 (Parasites & Vectors). Harris, D.J. (2015). Generating realistic assemblages with a joint species distribution model.

Methods in Ecology and Evolution , 6, 465–473. Ovaskainen, O., Hottola, J. & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions.

Ecology , 91, 2514–2521. Ovaskainen, O., Roy, D.B., Fox, R. & Anderson, B.J. (2016). Uncovering hidden spatial structure in species communities with spatially explicit joint species distribution models.

Methods in Ecology and Evolution , 7, 428–436. Pollock, L.J., Tingley, R., Morris, W.K., Golding, N., O’Hara, R.B., Parris, K.M., et al. (2014). Understanding co-occurrence by modelling species simultaneously with a Joint Species Distribution Model (JSDM).

Methods in Ecology and Evolution , 5, 397–406. Wilkinson, D.P., Golding, N., Guillera ‐ Arroita, G., Tingley, R. & McCarthy, M.A. (2019). A comparison of joint species distribution models for presence–absence data.

Methods in Ecology and Evolution , 10, 198–211., 10, 198–211.