A community-powered search of machine learning strategy space to find NMR property prediction models
Lars A. Bratholm, Will Gerrard, Brandon Anderson, Shaojie Bai, Sunghwan Choi, Lam Dang, Pavel Hanchar, Addison Howard, Guillaume Huard, Sanghoon Kim, Zico Kolter, Risi Kondor, Mordechai Kornbluth, Youhan Lee, Youngsoo Lee, Jonathan P. Mailoa, Thanh Tu Nguyen, Milos Popovic, Goran Rakocevic, Walter Reade, Wonho Song, Luka Stojanovic, Erik H. Thiede, Nebojsa Tijanic, Andres Torrubia, Devin Willmott, Craig P. Butts, David R. Glowacki, Kaggle participants
Affiliations: School of Chemistry, University of Bristol, Cantock’s Close, Bristol BS8 1TS, UK; School of Mathematics, University of Bristol, Fry Building, Woodland Road, Bristol BS1 1UG, UK; Dept. of Computer Science, University of Bristol, Merchant Venturer’s Building, Bristol BS8 1UB, UK; Intangible Realities Laboratory, University of Bristol, Cantock’s Close, Bristol BS8 1TS, UK; Bosch Center for Artificial Intelligence, Pittsburgh, PA 15222, USA; Carnegie Mellon University, Pittsburgh, PA 15213, USA; National Institute of Supercomputing and Network, Korea Institute of Science and Technology Information, 245 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea; BNP Paribas Cardif; Fyusion, Inc.; Kaggle, Google Inc., Mountain View, US; Ebay Korea, 34F, Gangnam Finance Center, 152 Teheran Ro, Gangnam Gu, Seoul, Republic of Korea; Department of Computer Science, The University of Chicago, 5730 S Ellis Ave, Chicago, IL 60637; Department of Statistics, The University of Chicago, 5747 S Ellis Ave, Chicago, IL 60637; Center for Computational Mathematics, Flatiron Institute, 162 5th Ave., New York, NY 10010; Bosch Research and Technology Center, Cambridge, MA 02139, USA; Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea (current address: Korea Atomic Energy Research Institute, (34057) 111, Daedeok-daero 989beon-gil, Yuseong-gu, Daejeon, Republic of Korea); MINDS AND COMPANY, 2621, Nambusunhwan-ro, Gangnam-gu, Seoul 06267, Republic of Korea; Totient Inc, Sindjeliceva 9, 11000 Belgrade, Serbia; KAIST Web Security & Privacy Lab, 291, Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea; Medbravo.org, Alicante, Spain

* [email protected], [email protected]

Abstract

The rise of machine learning (ML) has created an explosion in the potential strategies for using data to make scientific predictions. For physical scientists wishing to apply ML strategies to a particular domain, it can be difficult to assess in advance what strategy to adopt within a vast space of possibilities. Here we outline the results of an online community-powered effort to swarm-search the space of ML strategies and develop algorithms for predicting atomic-pairwise nuclear magnetic resonance (NMR) properties in molecules. Using an open-source dataset, we worked with Kaggle to design and host a 3-month competition which received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published ‘in-house’ efforts. A meta-ensemble model constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art. The results highlight the potential of transformer architectures for predicting quantum mechanical (QM) molecular properties.
1. Introduction
The rise of machine learning (ML) in the physical sciences has created a number of notable successes, ( ) and the number of published outputs is increasing substantially. ( ) This explosion is perhaps not entirely surprising, given that ML ‘search space’ is effectively infinite. For example, the performance of a particular ML algorithm strategy depends sensitively on at least four components: (a) the dataset used for training (and the corresponding methodology used for dataset curation); (b) the feature selection used to construct ML inputs; (c) the choice of ML algorithm; and (d) the values of the optimal constituent hyperparameters. For components (b) and (c), the space of possibilities is continually expanding; for components (a) and (d), the space of possibilities is potentially infinite. Given the sensitivity of ML approaches to each of the items outlined above, ML’s explosion within the scientific literature has led to warnings of an emerging computational reproducibility crisis, a risk exacerbated by the fact that many peer-reviewed ML publications do not include the data and algorithms required to reproduce their results. ( ) The difficulty of searching an enormous ML space is compounded by the fact that the training of even simple neural networks has been shown to be an NP-complete problem. ( ) Deciphering whether any global optima lurk within an effectively infinite ML search space has been the topic of a great deal of research; however, there seems to be a consensus emerging that it is practically impossible to demonstrate that any particular ML strategy is in fact optimal or bias-free, even for very simple systems. ( ) Broadly speaking, the parameter spaces in which a particular ML strategy can be constructed are non-convex, and characterized by multiple local minima and saddle points in which optimization algorithms can get trapped. ( ) Nevertheless, ML algorithms can produce useful results. In a nod to the 1950 Japanese period drama “Rashomon” (where various characters provide subjective, alternative, self-serving, yet compelling versions of the same incident), ML’s tendency to produce many accurate-but-different models has been referred to as the “Rashomon effect” in machine learning. ( ) In such a vast space, any individual agent has a chance of stumbling upon a reasonable ML model. Given the difficulty of rationalizing the uniqueness of any particular ML model or approach, individual models are increasingly being used as constituents within ensemble models, whose combined accuracy outperforms that of any individual model. ( ) Over the last several years, a number of studies have demonstrated the utility of ‘crowd-sourced’ approaches for solving scientific problems which involve searching hyperdimensional spaces. ( ) Inspired by recent attempts within both particle physics (
20, 21 ) and materials science ( ) using community power to develop ML algorithms, we worked with Kaggle (an online platform for ML competitions) to design a competition encouraging participants to develop ML models able to accurately predict QM nuclear magnetic resonance (NMR) properties from 3D molecular structure information. ( ) The fact that some of our authorship team had worked in this area over several years ( ) meant that we had quantitative and qualitative benchmarks to analyse competition progress in relation to what conventional academic research approaches had achieved. The so-called ‘Champs Kaggle Competition’ (CKC) ran from 29-May-19 through 28-Aug-19. The 5 models which achieved the highest accuracy were awarded respective prize money of $12.5k, $7.5k, $5k, $3k and $2k. Over ~13 weeks, the CKC received 47,800 model predictions from 2,700 teams in 84 countries (Fig 1a), representing the most exhaustive search to date of ML strategies aimed at predicting QM NMR properties from 3D molecular structure information. The number of participants who engaged with the CKC was amongst the highest for any physical science challenge which Kaggle has hosted to date. Figs 1b and 1c show a steady increase versus time in the number of participants who joined the CKC and in the number of daily submissions. CKC participants reported being drawn to the competition because it: (a) facilitated progress on an important research problem; (b) involved a rich, noise-less dataset whose structure was easy to understand; and (c) had a dataset which was manageable using standard data processing tools, workflows, and hardware.

Figure 1: (a) map showing the number of teams participating from different countries over the duration of the CKC (countries with fewer than 5 participants are shown in light gray); (b) the number of CKC participants vs. time; (c) number of submissions per day.
2. Competition Design
NMR is the dominant spectroscopic technique for determining 2D and 3D molecular structure in solution. Amongst the most important data obtained in an NMR spectrum are the chemical shifts (which describe the position/frequency of a signal in the spectrum) and the scalar couplings (which determine the splitting/shape of the signal in the spectrum). ML methods to predict NMR properties are established in academic and commercial workflows for determining 2D molecular structure from experimental NMR datasets. ( ) Despite this success, these 2D approaches often fail when the NMR properties are affected by 3D structure, for example when atoms are separated by several bonds yet remain close in 3D space (ring current effects, hydrogen bonding etc.). This is an inherently difficult problem, as the 3D molecular structure is simply not well described by 2D representations, and there are not enough high-quality experimental data available to accurately infer most 3D relationships from a 2D structural representation alone. The most accurate computed predictions of NMR properties use QM methods like density functional theory (DFT) to obtain a one-to-one mapping between a 3D structure and its contribution to the experimentally observed NMR property. Accurate QM methods for NMR property predictions are powerful but expensive. Recent work has thus focussed on developing ML algorithms which can efficiently reproduce the results of costly QM methods, achieving results in seconds rather than hours or days. (
24, 29 ) ML approaches have the added appeal that they can be trained using large datasets of DFT-computed NMR parameters, which are not limited to experimental structural observations. With a large enough training database, we have shown in previous work that an ML strategy can approach the accuracy of DFT calculations of atom-centered NMR parameters such as chemical shifts for 3D structure analysis, but with a reduction in time of several orders of magnitude. ( ) Beyond NMR, the last decade has seen considerable effort focused on machine learning QM molecular properties. ( ) Broadly speaking, this work has tended to focus on predicting atomic properties such as partial charges, or molecular properties such as energies and dipoles. Relatively little work has been carried out designing ML models which are able to predict pairwise atomic properties such as scalar coupling constants. Our earlier work to develop pairwise property prediction algorithms used effectively independent-atom treatments, in which atomic feature vectors describing the local environment of each atom were concatenated. ( ) However, this approach loses information about the relative position/orientation of each atom’s respective environment, which is important for multiple-bond couplings. The CKC represents an attempt to kickstart research into ML methods able to make accurate predictions of pairwise properties. Scalar couplings are critically dependent on the 3D structure of the molecule for which they are being measured; however, at the time we carried out this work, we were unaware of accurate experimental databases linking pairwise multiple-bond NMR scalar couplings to well-defined 3D molecular structures. Therefore, we decided to run the CKC utilizing molecular structures included in the QM9 dataset, a publicly available benchmark for developing ML models of 3D structure-property relationships. ( ) QM9 includes ~134k molecules comprised of carbon, fluorine, nitrogen, oxygen and hydrogen. The molecules included within QM9 have no more than 9 heavy atoms (non-hydrogen), with a maximum of 29 total atoms. To obtain a corresponding set of scalar couplings, we extended the QM9 computational methodology, using the B3LYP functional ( ) and the 6-31g(2df,p) basis set ( ) to compute NMR parameters on the optimized QM9 structures. The computed QM9 scalar coupling constants are available under a Creative Commons CC-NC-BY 4.0 license, enabling others to build on this work. To remove the possibility of CKC participants overfitting their models to the entire set of computed QM9 scalar couplings, 65% of the molecules in the dataset were randomly partitioned into a training set and the remaining 35% into a testing set. The test set was further split, with 29% of the data in a ‘public’ test set and 71% of the data in a ‘private’ test set (competitors were unaware of the specifics of the private/public split). Both the training and test sets included the molecular geometries and the indices of the coupling atoms. Unlike the test set, the training set included a range of other data, including the calculated scalar coupling values, their breakdown into Fermi contact (FC), spin-dipole (SD), paramagnetic spin-orbit (PSO) and diamagnetic spin-orbit (DSO) components, and a range of auxiliary information obtained from the QM computations (e.g., potential energy, dipole moment vectors, magnetic shielding tensors and Mulliken charges). As the CKC progressed, participating teams continually iterated and improved their models.
A regularly updated and publicly visible leaderboard enabled each team to see where its model ranked, compared to the model predictions made by all of the other teams, in predicting the public test set data. The leaderboard scores were determined using a function which accounted for the 8 different types of coupling constants included in the training and testing datasets: $^1J_{\mathrm{HC}}$, $^1J_{\mathrm{HN}}$, $^2J_{\mathrm{HH}}$, $^2J_{\mathrm{HC}}$, $^2J_{\mathrm{HN}}$, $^3J_{\mathrm{HH}}$, $^3J_{\mathrm{HC}}$ and $^3J_{\mathrm{HN}}$ (where the superscript indicates the number of covalent bonds separating the atom pairs indicated by the subscript). Since the number of couplings of each type differed (e.g., the molecular composition of the QM9 test set included 811,999 $^3J_{\mathrm{HC}}$ couplings compared to 24,195 $^1J_{\mathrm{HN}}$ couplings) and spanned different value ranges, the scoring function used the average of the logarithm of the mean absolute error for each type of coupling constant:

\[ \mathrm{score} = \frac{1}{T} \sum_{t=1}^{T} \log\!\left( \frac{1}{n_t} \sum_{i=1}^{n_t} \left| y_i - \hat{y}_i \right| \right) \qquad \text{Eq (1)} \]

where t is an index that runs over the T = 8 different scalar coupling types, i is an index that spans 1..n_t, the number of observations of type t, y_i is the scalar coupling constant for observation i, and ŷ_i is the predicted scalar coupling constant for observation i. This scoring function ensures, for example, that a 10% improvement in one type of coupling will improve the score by the same amount as a 10% improvement in another type of coupling, so that no coupling class dominates.
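For concreteness, a minimal Python sketch of this scoring function is given below; the column names ('type', 'y_true', 'y_pred') are illustrative assumptions for the example, not the competition's actual file format.

```python
import numpy as np
import pandas as pd

def competition_score(df: pd.DataFrame) -> float:
    """Mean over coupling types of log(MAE), as in Eq (1). Lower is better."""
    per_type_log_mae = (
        df.assign(abs_err=lambda d: (d["y_true"] - d["y_pred"]).abs())
          .groupby("type")["abs_err"]
          .mean()
          .pipe(np.log)
    )
    return float(per_type_log_mae.mean())

# Toy example with two coupling types.
example = pd.DataFrame({
    "type":   ["1JHC", "1JHC", "3JHH", "3JHH"],
    "y_true": [84.0,   91.5,   7.2,    3.1],
    "y_pred": [83.0,   92.0,   7.0,    3.3],
})
print(competition_score(example))
```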
3. Results
Fig 2a shows, over the course of the CKC, the evolution of the best score whose source code was publicly available (public notebooks), and its relationship to the top score versus time. Fig 2b shows that the time trace of the top score is well fit by a bi-exponential curve with two distinct phases.
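A minimal sketch of such a bi-exponential fit is shown below, using scipy.optimize.curve_fit; the arrays `days` and `best_score` are hypothetical stand-ins for the leaderboard time trace, not the actual competition data.

```python
import numpy as np
from scipy.optimize import curve_fit

def biexponential(t, A, B, tau1, tau2, C):
    """Two-phase decay of the leading score: A*exp(-t/tau1) + B*exp(-t/tau2) + C."""
    return A * np.exp(-t / tau1) + B * np.exp(-t / tau2) + C

# Hypothetical leaderboard trace: day of competition vs. best score so far.
days = np.array([0, 1, 2, 4, 7, 14, 28, 56, 91], dtype=float)
best_score = np.array([0.5, -0.8, -1.4, -1.8, -2.1, -2.4, -2.7, -3.0, -3.2])

# Initial guesses: a fast phase of ~1 day and a slow phase of ~50 days.
p0 = (2.0, 3.0, 1.0, 50.0, -3.5)
bounds = ([0, 0, 0.1, 1, -10], [10, 10, 10, 200, 0])
params, _ = curve_fit(biexponential, days, best_score, p0=p0, bounds=bounds)
A, B, tau1, tau2, C = params
print(f"fast time constant ~{tau1:.1f} d, slow time constant ~{tau2:.1f} d")
```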
Phase 1 lasted for the first week, during which time the accuracy increased by ~12x (~2.5 improvement in score), with a time constant of ~1.29 ± 0.18 days.
Phase 2 lasted for the next 12 weeks, during which time the accuracy improved more gradually by a factor of ~4x (~1.5 improvement in score), with a time constant of ~50.0 ± 16.6 days. To determine which models were awarded prize money, the final set of model rankings was assessed using Eq (1) to evaluate how well each of the models predicted the scalar coupling values in the private test set (preventing competitors from inferring the target property from the leaderboard scores rather than from the training set). Due to the large amount of noise-less data, the positions of the top 37 submissions were the same on the public and private leaderboards at the end of the CKC. Several teams commented that the stability between the public and private leaderboards made for an enjoyable competition. The top-scoring method achieved a geometric mean error (the exponential of the score) of 0.039 Hz, which was 6-16x more accurate than what could be achieved using our own recently developed methodology (see SI for details). ( )

In addition to the final score, Kaggle also rewards participants who make the best contributions to: (1) publicly available code, and (2) the discussion forums. As a result of these incentives, a number of participants opted to voluntarily publish their source code (public notebooks). In many cases, the public notebooks were then utilized and adapted by other CKC participants. As shown in Fig 2a, the best score achieved using these public notebooks follows a time trace which is similar to the leading score, but less accurate by ~1.5. A number of participants made instructional web posts, scripts, and videos outlining specific approaches which they had taken during the CKC. For example, video presentations by Andrey Lukyanenko ( ) and the NVIDIA team ( ) discuss the approaches which they utilized to develop the 8th and 33rd place solutions, respectively. The CKC summary features insightful write-ups by several top teams in which they describe their various model approaches. ( )

Figure 2: (a) score evolution vs. time. Black line shows the best performing method vs. time. Blue line shows the best performing public notebook. Red lines show the best submission by each team; (b) best fit of the time-dependent leading (black) score to a biexponential curve of the form $A e^{-t/\tau_A} + B e^{-t/\tau_B} + C$ (A = 2.11; B = 2.97; $\tau_A$ = 50.0 days; $\tau_B$ = 1.29 days; C = -3.59). Blue indicates the best ME model score.

To assess the extent to which the prize-winning submissions differed from one another (and from other highly ranked submissions), we used the top 400 submissions to construct a meta-ensemble (ME) model as a linear combination of the top-scoring models:

\[ y_{i,\mathrm{ME}} = \sum_{j=k}^{400} w_j \, y_{i,j} \qquad \text{Eq (2)} \]
Given that many of the top models (and all of the prize winners) were ensemble models, we have adopted the term “meta-ensemble” (ME) to emphasize the fact that Eq (2) is an ensemble of ensemble models. In Eq (2), the ME prediction y_{i,ME} of the i’th scalar coupling constant is a linear combination of the predictions y_{i,j} of the j’th ranked model. The index k specifies the highest-ranked model (lowest rank index j) to be included within the optimized ME model. When k = 1, Eq (2) runs over the entire list of the top 400 models. When k > 1, Eq (2) neglects top-scoring models; setting k = 6, for example, excludes all of the prize-winning models (ranks j = 1..5). The weights w_j were determined by minimizing the Eq (1) score of y_{i,ME} using half of the test set, under the constraint that the weights were positive and summed to unity. While a range of different ME models can be constructed (e.g., different ensembles for each type of coupling, median averaging etc.), this simple mean is easy to interpret.

Figure 3: (a) comparison of the top individual score (orange) to the ME model score (blue) as a function of the k value in Eq (2); (b) the number of contributors to the ME model at a particular k value that had an optimized weight greater than 0.01.
Different classes of machine learning algorithms (or even the same algorithm with different hyperparameters) may be able to learn different regions of the data better than others. Thus, by combining the highest-scoring model predictions that are least correlated with one another into a meta-ensemble, the strengths of the various models may be accumulated; this is confirmed by the ME analysis shown in Fig 3a as a function of k. As expected, for k = 1..300 the optimized ME model achieves an accuracy which always surpasses that of the best individual model. In the regime where the top scorers are incrementally being eliminated from Eq (2) (k = 1..50), Fig 3a shows that the ME model has a score that is ~0.2 lower than the “best” model. For example, the k = 7 ME model (which neglects the top 6 models) still outperforms the winning solution, and the k = 11 ME model outperforms the winning solution when the per-type ensemble mentioned above is used. Fig 3b shows how many contributors to the ME model at a particular k value had an optimized weight greater than 0.01. Broadly speaking, the Fig 3a results can be lumped into three regimes. In the first regime (k ~ 1..40), the best performing methods dominate the ME and there is little to be gained by including, within the ME, methods that are very different but perform worse. In the second regime (k ~ 41..200), Fig 3a shows that the gap between the top score and the ME model widens to ~0.4. Here there are many similarly performing yet different methods, so there is much to be gained by combining their different approaches into a ME. In the third regime (k ~ 201..300), the gain from a ME decreases, presumably because many of the models are similar variants of the public notebooks. The relative benefit of constructing a ME model (versus using a top-scoring model) thus appears to be more significant outside of the band of top-scoring and low-scoring models. For the k = 1 ME model, which was 7-19x more accurate than our previously published model, ( ) we analysed its constituents in further detail.
The results in Table 1 show the k = 1 ME constituents with weights w_j > 0.02, along with the relative rankings j of the constituent models. Table 1 shows that there is no particular model which is dominant: there are five models with a weighting greater than 0.11, and three with a weighting greater than 0.20. Of the six models in Table 1, one ( ).

Table 1: Summary descriptions for the six models in the final ME. “Use of scalar coupling components” refers to whether a team decomposed the scalar couplings into four separate components in their model.

Rank: 1st, 2nd, 3rd, th, th, th
Weight w_j:
Number of submissions: 73
Country: USA, Spain, Belarus, S. Korea, Serbia, France, USA
Team size: 5
Any chemistry expertise? Y, N, Y, Y, N, Y
Use of scalar coupling components? N, N, Y, N, N, N
Translational invariance? Y, Y, N, Y, Y, Y
Rotational invariance? Y, N, N, Y, Y, Y
Previous Kaggle experience? N, Y, Y, N, Y, N
Included additional input features? Y, N, N, Y, Y, N
Number of model parameters: ~105M, ~60M, ~70M, ~60M, ~66M, ~250K
To further understand the relationship between the winning submissions within the k = 1 ME model, we carried out a correlation analysis on the top 50 team submissions. The submissions were then ordered using a hierarchical clustering analysis (see SI). The results in Fig 4b show that the ( ) submissions are all relatively similar to each other. Fig 4c specifically highlights the low correlation between models ( ). Originally developed for learning molecular potential energy surfaces (PESs), Cormorant takes advantage of rotational symmetries in order to enforce physical relationships in the resultant neural network, using spherical tensors, which transform in a predictable way under rotation, to encode local geometric information around each atom’s environment. The use of spherical tensors allows for a network architecture that is covariant to rotations, so that if a rotation is applied to a layer, all activations at the next layer will automatically inherit that rotation. As such, a rotation applied to a Cormorant input will propagate through the network, ensuring that the output transforms accordingly. This captures local geometric information while still maintaining the desired transformation properties under rotations.

Figure 4: (a) score evolution vs. time for the Table 1 teams. Black line shows the best performing method at any given time. Colored lines show the best submission by each team; (b) correlation amongst the top 50 submissions. Red indicates high correlation, and blue low. Bottom and right side show the ranking of the submission, while top and left feature a dendrogram depicting the hierarchical clustering; (c) correlation between the Table 1 teams.
4. Discussion & Conclusions
Community-powered approaches offer a powerful tool for searching ML strategy space and providing accurate predictions for physical science problems like the prediction of 2-body QM NMR properties. Within 3 weeks, the best score on the Kaggle public leaderboard achieved an accuracy which surpassed our own previously published approaches, ( ) suggesting that an open-source, community-powered ‘swarm search’ of ML strategy space may in some cases be significantly faster and more cost-efficient than conventional academic research strategies where a single agent (e.g., a PhD student or post-doctoral researcher) spends several years hunting for solutions in an infinite search space. ME model construction combined with correlation analysis highlights the strength of the CKC ‘swarm search’ approach, in line with the “Rashomon effect”. Whereas our earlier approaches to predicting NMR scalar coupling constants ( ) had relied on kernel-ridge regression approaches ( ) where the internal distances and angles in the molecules were systematically encoded into a feature vector for the coupling atom pairs using predetermined basis functions, the community which emerged around the CKC pioneered a new application of transformer neural nets ( ) to QM molecular property prediction. While such networks have found extensive use for sequence modelling and transduction problems such as language modelling and machine translation, they represent a relatively new approach to predicting QM properties like NMR shifts or scalar couplings, and it will be interesting to explore their further application to other QM properties and more general 2-body property prediction problems, which are relevant in several domains across the physical sciences. The rich portfolio of open-source blog posts, data, insight, source code, and discussions arising from the CKC offers an excellent foundation for subsequent research and follow-up studies, through community initiatives or more conventional academic research approaches. Teams ( ). This contrasts with previously published Kaggle competitions in particle physics (
20, 21 ) and materials science, ( ) where the winners tended to be domain experts. Table 1 shows that teams with prior domain expertise ( ). The relatively simple input descriptions used by many of the top teams transferred to the neural network the challenge of learning an effective input representation. Such approaches contrast with those favored by physical scientists, which utilize more complex descriptors constructed so as to include domain-specific insight (e.g., rotational symmetries for team ( )). Taking advantage of the variance in approaches, the various model predictions can be combined into a ME model whose combined accuracy surpasses that of any individual model, 7-19x more accurate than what our previous methods were able to achieve. The benefit of a ME model seems to be most significant in the regime where there are many independent individual models with similar performance. Fig 2a shows that the average benefit which new models contributed to the overall improvement in prediction accuracy decreased versus time, with a rapid improvement over the first week followed by a much more gradual improvement over the subsequent 12 weeks. Fig 1c shows that the number of model predictions was approximately constant versus time, with an increase over the final 20 days. These observations indicate an overall decrease in the relative cost/benefit ratio as a function of time. This cost/benefit decrease is qualitatively compatible with conclusions drawn from previous meta-analyses of scientific progress, ( ) which suggest that search strategies for scientific discovery tend to become less efficient with time. In our case, these results suggest that a shorter competition may have furnished similar insights. The results also highlight potential shortcomings in the elaborate scheme of awards and prizes which scientific disciplines utilize to incentivize progress and recognize ‘top-performers’ – e.g., the fact that solution ( ).

Contributions
LAB and DRG devised the project concept, conceived various analysis strategies, and organized overall execution of the project work strands (LAB early phases, DRG late phases). LAB, WG, DRG, CPB, AH and WR worked together to design the form of the CKC. LAB computed and curated the CKC data, managed the CKC, responded on the forums, engaged with the winners, and organized a discussion workshop. LAB carried out data analyses with assistance of WR and WG, and guidance from DRG. AH and WR mounted the project on the Kaggle platform, and advised on how to run the CKC. The following CKC participants contributed to discussions which fed into the paper, as well as contributed to solution code and write-ups which are included in the SI: ZK, DW, JPM, MK & SB (team
Competing Interests:
This competition was made possible through financial support from Kaggle, where AH and WR are employees.
Data and materials availability:
All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors.
Acknowledgments.
Part of this work was carried out using the computational facilities of the Advanced Computing Research Centre, University of Bristol. WG is partially supported by the EPSRC National Productivity Investment Fund (NPIF) for Doctoral Studentship funding. LAB thanks the Alan Turing Institute under the EPSRC grant EP/N510129/1. DRG is supported by the Leverhulme Trust (Philip Leverhulme Prize) and Royal Society (URF/R/180033). LAB and DRG acknowledge support of this work through the “CHAMPS” EPSRC programme grant EP/P021123/1. SC was supported by National Research Foundation of Korea (2018R1D1A1B07049981, 2019M3E5D4065968) funded by the Ministry of Science and ICT. We furthermore thank Prof. Jan H. Jensen for access to additional computing resources and Dr. Christopher Sutton for helpful advice on the design of the CKC.
References
1. F. Noé, S. Olsson, J. Köhler, H. Wu, Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning.
Science , eaaw1147 (2019). 2. M. Raissi, A. Yazdani, G. E. Karniadakis, Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations.
Science , 1026 (2020). 3. M. Jaderberg et al. , Human-level performance in 3D multiplayer games with population-based reinforcement learning.
Science , 859-865 (2019). 4. C. W. Coley et al. , A robotic platform for flow synthesis of organic compounds informed by AI planning.
Science , eaax1566 (2019). 5. N. Brown, T. Sandholm, Superhuman AI for multiplayer poker.
Science , 885-890 (2019). 6. K. Kaufmann et al. , Crystal symmetry determination in electron diffraction using machine learning.
Science , 564-568 (2020). 7. N. Dolensek, D. A. Gehrlach, A. S. Klein, N. Gogolla, Facial expressions of emotion states and their neuronal correlates in mice.
Science , 89-94 (2020). 8. K. Hao, We analyzed 16,625 papers to figure out where AI is headed next. MIT Technology Review, (2019). 9. M. Hutson, Artificial intelligence faces reproducibility crisis.
Science , 725-726 (2018). 10. A. L. Blum, R. L. Rivest, Training a 3-node neural network is NP-complete.
Neural Networks , 117-127 (1992). 11. A. S. Rich, T. M. Gureckis, Lessons for artificial intelligence from the study of natural stupidity. Nature Machine Intelligence , 174-180 (2019). 12. Y. N. Dauphin et al. , in Advances in neural information processing systems . (2014), pp. 2933-2941. 13. L. Semenova, C. Rudin, A study in Rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning. arXiv preprint arXiv:1908.01755 , (2019). 14. G. Valentini, F. Masulli, in
Italian workshop on neural nets . (Springer, 2002), pp. 3-20. 15. S. Cooper et al. , Predicting protein structures with a multiplayer online game.
Nature , 756-760 (2010). 16. F. Khatib et al. , Algorithm discovery by protein folding game players.
Proceedings of the National Academy of Sciences , (2011). 17. H. Sauermann, C. Franzoni, Crowd science user contribution patterns and their implications.
Proceedings of the National Academy of Sciences , 679 (2015). 18. F. Heigl, B. Kieslinger, K. T. Paul, J. Uhlik, D. Dörler, Opinion: Toward an international definition of citizen science.
Proceedings of the National Academy of Sciences , 8089 (2019). 19. R. Heck et al. , Remote optimization of an ultracold atoms experiment by experts and citizen scientists.
Proceedings of the National Academy of Sciences , E11231 (2018). 20. C. Adam-Bourdarios et al. , The Higgs Machine Learning Challenge.
Journal of Physics: Conference Series , 072015 (2015). 21. M. Kiehn et al. , The TrackML high-energy physics tracking challenge on Kaggle.
EPJ Web Conf. , 06037 (2019).
22. C. Sutton et al. , Nomad 2018 kaggle competition: solving materials science challenges through crowd sourcing. arXiv preprint arXiv:1812.00085 , (2018). 23. Predicting Molecular Properties. , (2019). 24. W. Gerrard et al. , IMPRESSION – prediction of NMR parameters for 3-dimensional chemical structures using machine learning with near quantum chemical accuracy.
Chemical Science , 508-515 (2020). 25. ACD/Labs, NMR Predictior Software. . 26. M. Research, NMR Predict. URL: https://mestrelab.com/software/mnova/nmr-predict . 27. A. M. Castillo, A. Bernal, R. Dieden, L. Patiny, J. Wist, “Ask Ernö”: a self-learning tool for assignment and prediction of nuclear magnetic resonance spectra.
Journal of Cheminformatics. 28. Journal of the American Chemical Society , 13313-13313 (2006). 29. F. M. Paruzzo et al. , Chemical shifts in molecular solids by machine learning.
Nature Communications , 4501 (2018). 30. K. Hansen et al. , Machine Learning Predictions of Molecular Properties: Accurate Many-Body Potentials and Nonlocality in Chemical Space. The Journal of Physical Chemistry Letters , 2326-2331 (2015). 31. R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Big Data Meets Quantum Chemistry Approximations: The Δ-Machine Learning Approach. Journal of Chemical Theory and Computation , 2087-2096 (2015). 32. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl, paper presented at the Proceedings of the 34th International Conference on Machine Learning - Volume 70, Sydney, NSW, Australia, 2017. 33. K. Schütt et al. , in Advances in neural information processing systems . (2017), pp. 991-1001. 34. A. S. Christensen, L. A. Bratholm, F. A. Faber, O. A. v. Lilienfeld, FCHL revisited: Faster and more accurate quantum machine learning.
The Journal of Chemical Physics , 044107 (2020). 35. J. Behler, M. Parrinello, Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces.
Physical Review Letters , 146401 (2007). 36. J. Behler, Atom-centered symmetry functions for constructing high-dimensional neural network potentials. The Journal of Chemical Physics , 074106 (2011). 37. R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules.
Scientific Data , 140022 (2014). 38. P. J. Stephens, F. J. Devlin, C. F. Chabalowski, M. J. Frisch, Ab Initio Calculation of Vibrational Absorption and Circular Dichroism Spectra Using Density Functional Force Fields. The Journal of Physical Chemistry , 11623-11627 (1994). 39. R. Ditchfield, W. J. Hehre, J. A. Pople, Self‐consistent molecular‐orbital methods. IX. An extended Gaussian‐type basis for molecular‐orbital studies of organic molecules. The Journal of Chemical Physics , 724-728 (1971). 40. W. J. Hehre, R. Ditchfield, J. A. Pople, Self—consistent molecular orbital methods. XII. Further extensions of Gaussian—type basis sets for use in molecular orbital studies of organic molecules. The Journal of Chemical Physics , 2257-2261 (1972). 41. R. Krishnan, J. S. Binkley, R. Seeger, J. A. Pople, Self‐consistent molecular orbital methods. XX. A basis set for correlated wave functions. The Journal of chemical physics , 650-654 (1980). 42. M. J. Frisch, J. A. Pople, J. S. Binkley, Self‐consistent molecular orbital methods 25. Supplementary functions for Gaussian basis sets. The Journal of chemical physics , 3265-3269 (1984). 43. Andrey Lukyanenko. https://youtu.be/sdIR8i0f_5A?t=1344 , (2019). 44. Accelerating Molecular Property Prediction. https://info.nvidia.com/accelerating-molecular-property-prediction-reg-page.html , (2019). 45. Predicting Molecular Properties - Competition Finalized, Congratulations & Takeaways. , (2019). 46. B. Anderson, T. S. Hy, R. Kondor, Cormorant: Covariant molecular neural networks. Advances in Neural Information Processing Systems , 14510-14519 (2019). 47. F. A. Faber, A. S. Christensen, B. Huang, O. A. v. Lilienfeld, Alchemical and structural distribution based representation for universal quantum machine learning.
The Journal of Chemical Physics , 241717 (2018). 48. A. Vaswani et al. , Attention is all you need.
Advances in neural information processing systems , 5998-6008 (2017). 49. A. Rzhetsky, J. G. Foster, I. T. Foster, J. A. Evans, Choosing experiments to accelerate collective discovery.
Proceedings of the National Academy of Sciences , 14569 (2015).

Supporting Information for: A community-powered search of machine learning strategy space to find NMR property prediction models
S1 Overview
Sections S2 and S3 contain details on how the dataset was generated and what considerations went into the choices made. Section S4 describes the dataset available to the participants, as well as details on how submissions were scored. Section S5 contains various plots and analysis strategies that didn’t make it into the main paper, as well as details on how the ensembles were fitted and the correlations were calculated. Sections S7, S8, S9, S10, S11 and S12 contain detailed write-ups explaining the model architecture of the five winning teams as well as the 12th placed team. Finally, section S13 outlines the code and data available for download, including all the winning solutions.
S2 Dataset considerations
We opted to use structures from the QM9 dataset[1] to construct our own dataset, as well as the same computational methodology. QM9 consists of around 130,000 molecules comprised of carbon, fluorine, nitrogen, oxygen and hydrogen. Each molecule has up to 9 heavy atoms (non-hydrogen) and up to 29 atoms in total. Each structure was optimized by the authors using density functional theory (DFT), with the B3LYP functional[2] and the 6-31g(2df,p) basis set[3, 4, 5, 6]. There were several advantages of basing our dataset on QM9, as well as several possible disadvantages that needed to be considered. QM9 has historically had an important role as a benchmark dataset in the development of machine learning models for 3D structure-property relationships, and we deemed it beneficial to augment this with magnetic properties. This is particularly true for pairwise properties like scalar coupling constants, as, to our knowledge, there currently does not exist any dataset that includes such pairwise properties. Using an existing set of structures also eased the workload of generating a dataset considerably. We initially had two primary concerns about using QM9 structures specifically and one primary concern about using computational methods in general.

• The QM9 molecules are quite small compared to molecules commonly used as e.g. medicines, and the solutions to the competition might not work (due to memory use etc.) on larger molecules. However, since the strength of atomic interactions decays with distance (with the exception of large aromatic systems), we believe that we could ultimately modify the solutions to scale better with system size by introducing an interaction distance cutoff. At this time, we have not researched the current size limit of the winning solutions.

• All of the QM9 molecules have been optimized to a local minimum on the potential energy surface (equilibrium). In the future we might be interested in studying non-equilibrium structures, and some solutions might perform less well on these (such as methods ignoring 3D relationships and relying entirely on connectivity). All of the top solutions incorporated 3D relationships, and at this time we have no reason to believe that any of them will perform poorly on a non-equilibrium dataset.

• By relying on quantum chemistry calculations that directly relate a molecular 3D structure to the target property, participants could guess the specific methodology and compute the target properties of the test set, thus knowing the correct answer. Due to the computational program not being widely available, but more importantly due to the high computational cost (which directly translates to monetary cost), we expected that it was unlikely that anyone would ‘cheat’ in this way. This is particularly true as it was a requirement of the competition that the source code of the solution was made accessible to the organizers for a team to be eligible for any prize money.
S3 Dataset generation
Of the 130,831 molecules in QM9 that passed the geometry consistency check, we removed an additional 42 molecules as these had no hydrogen atoms. The remaining structures were modified to a valid xyz format, and input files for Gaussian NMR computations were constructed. The scalar coupling constants (including contributions from separate terms), dipole moment vectors, nuclear shielding tensors, Mulliken charges and potential energy were parsed from the Gaussian output files, and the dataset was written in an easily accessible csv format. The molecular structures were made available in xyz formatted files as well as a single csv file. While the target observable was the scalar coupling constants, the auxiliary data were included in case additional learning from these could be achieved. A further 14 molecules were removed due to one or more of the scalar couplings being a large outlier compared to the range seen in the remaining molecules. For a small subset of the remaining molecules (108), Gaussian automatically enabled symmetry in the DFT computations. This had little effect on most of the observables; however, the direction of the dipole vector and the nuclear shielding tensor became arbitrary as a result (although the norm of the vector and the trace of the tensor were still correct). This was discovered while the competition was ongoing, but since it only affected a small subset of the auxiliary data, the competition data were not updated with data computed with symmetry disabled.

S4 Competition details
The goal of the competition was to predict the scalar coupling constants arising from the interaction between a hydrogen and a hydrogen, carbon or nitrogen atom 1, 2 or 3 bonds apart. This was to be done without the competitors knowing any information other than the coordinates and element type of the atoms in the molecules. The dataset was randomly split into a 65%/35% training/test split. The scalar coupling distributions seemed to be very similar with random splits, so stratified splits did not seem necessary. The competitors were provided the following files:

• structures.csv, which contains the coordinates and element type of each of the 2,358,657 atoms of the 130,775 molecules that constitute the combined test and training set.

• train.csv, which contains the values in Hz of the 4,658,147 scalar coupling constants in the training set, as well as which atom pairs in which molecule each coupling is between.

• test.csv, which contains 2,505,542 pairs of atoms and their respective molecules, for which the competitors had to predict the scalar coupling constants.

• potential energy.csv, which contains the potential energy in Hartree of the 85,003 molecules in the training set.

• magnetic shielding tensors.csv, which contains the XX, XY, XZ, YX, YY, YZ, ZX, ZY and ZZ components of the magnetic shielding tensors in ppm of the 1,533,537 atoms in the training set.

• mulliken charges.csv, which contains the Mulliken charges in atomic units of the 1,533,537 atoms in the training set.

• dipole moment.csv, which contains the dipole moments in Debye of the 85,003 molecules in the training set.

• scalar coupling contributions.csv, which contains the FC, SD, PSO and DSO components in Hz of the 4,658,147 scalar coupling constants in the training set. The sum of these equates to the values in train.csv.

The test set was further split into a 29%/71% public/private test set. All submissions were scored immediately on both splits. The public score was made available on the public leaderboard, while the private score (from which the winners were determined) was hidden until the end of the competition. There were no changes in the top 37 placements between the public and private leaderboards, indicating that there was little noise in the data. We opted to use a custom score function to evaluate the submissions. The dataset included 8 different types of coupling: 1JHC (coupling between a hydrogen and a carbon separated by 1 covalent bond), 1JHN, 2JHH, 2JHC, 2JHN, 3JHH, 3JHC and 3JHN. Since the number of couplings of each type differed dramatically (811,999 3JHC couplings in the test set, but only 24,195 1JHN) and spanned different ranges, a bad choice of scoring metric could easily be dominated by the performance on the 3JHC coupling constants. Our loss function is the logarithm of the geometric mean of the mean absolute error for each type:

\[ \mathrm{score} = \frac{1}{T} \sum_{t=1}^{T} \log\!\left( \frac{1}{n_t} \sum_{i=1}^{n_t} \left| y_i - \hat{y}_i \right| \right) \tag{S1} \]

where:

• T is the number of scalar coupling types.
• n_t is the number of observations of type t.
• y_i is the actual scalar coupling constant for observation i.
• ŷ_i is the predicted scalar coupling constant for observation i.

This way, a 10% improvement in one type of coupling will improve the score by the same amount as a 10% improvement in another type of coupling.

S5 Analysis
S5.1 Ensembling
We created an ensemble of the top 400 submissions to see which submissions contributed the most. Of the top 400 submissions, 18 were removed due to being duplicates. The data points (the test set referenced above) were split into a training and test set of equal size, stratified by coupling type using the Scikit-learn library[7]. We did a linear fit to the submissions using TensorFlow[8], minimizing the loss function in equation S1 under the constraint that the weights were positive and summed to 1. 6 submissions had weights larger than 0.02.
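The sketch below illustrates such a constrained fit. It parameterizes the weights with a softmax so that they are automatically positive and sum to 1, and minimizes the Eq (S1) score with scipy rather than TensorFlow; the softmax parameterization, the optimizer choice and the arrays `preds`, `y_true` and `types` are all assumptions made for the example, not the exact implementation used here.

```python
import numpy as np
from scipy.optimize import minimize

def ensemble_score(logits, preds, y_true, types):
    """Eq (S1) score of the weighted ensemble; softmax(logits) gives the weights."""
    w = np.exp(logits - logits.max())
    w /= w.sum()                      # positive weights that sum to 1
    y_ens = preds @ w                 # linear combination of submissions
    abs_err = np.abs(y_true - y_ens)
    return np.mean([np.log(abs_err[types == t].mean()) for t in np.unique(types)])

# Hypothetical data: 1000 couplings, 5 candidate submissions, 2 coupling types.
rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)
preds = y_true[:, None] + 0.1 * rng.normal(size=(1000, 5))
types = rng.choice(["1JHC", "3JHH"], size=1000)

res = minimize(ensemble_score, x0=np.zeros(5), args=(preds, y_true, types),
               method="Nelder-Mead")
weights = np.exp(res.x - res.x.max())
weights /= weights.sum()
print("optimized ensemble weights:", np.round(weights, 3))
```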
S5.2 Correlation

We investigated how similar the submissions from the top teams were by computing the correlation between a scaled subset of their submissions on the test data. 20,000 data points for each coupling type were drawn randomly from the test set. For each coupling type, the submissions were then scaled by their individual root mean square error (RMSE). These two pre-processing steps were done to make each coupling type have an equal impact on the correlation. The same analysis was done on the top 50 team submissions, where the submissions were clustered with hierarchical clustering using the ’complete’ linkage in the Scipy module[9].
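A minimal sketch of this correlation and clustering analysis is shown below. The matrix `sub_matrix` (rows = sampled couplings, columns = submissions, already scaled per coupling type) is a hypothetical placeholder, and using 1 - correlation as the dissimilarity for the 'complete'-linkage clustering is an assumption made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical pre-processed predictions: 2000 sampled couplings x 10 submissions.
rng = np.random.default_rng(1)
base = rng.normal(size=(2000, 1))
sub_matrix = base + 0.2 * rng.normal(size=(2000, 10))

# Pairwise correlation between submissions, then hierarchical clustering
# ('complete' linkage) on 1 - correlation as the dissimilarity.
corr = np.corrcoef(sub_matrix, rowvar=False)                  # 10 x 10 correlation matrix
dissimilarity = 1.0 - corr[np.triu_indices_from(corr, k=1)]   # condensed distance form
Z = linkage(dissimilarity, method="complete")

# Order submissions by the clustering (e.g. for a heatmap) and cut into groups.
order = dendrogram(Z, no_plot=True)["leaves"]
groups = fcluster(Z, t=0.5, criterion="distance")
print("leaf order:", order)
print("cluster labels:", groups)
```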
S5.3 Manifold
In a similar analysis, we projected the top 100 submissions on a scaled subset of the test set down onto a two-dimensional manifold. Again, 20,000 data points for each coupling type were drawn randomly from the test set. As the mean absolute error (MAE) is a more natural metric in relation to the scoring metric used in the competition, the submissions were scaled for each coupling type by their individual MAE. This pre-processing is important because, if the submissions weren’t scaled appropriately, the dimensionality reduction might mainly capture the average performance of a given submission rather than how the methods differ in general. Figure S1 shows the submissions projected onto a two-dimensional Multidimensional Scaling (MDS) manifold (which tries to preserve the Manhattan distance between submissions)[7, 10, 11, 12].

Figure S1: Projections of the top 100 submissions onto a two-dimensional MDS manifold. Coloring indicates submission score, as shown by the colorbar. A subset of 7 submissions are marked with their competition rank.
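A minimal sketch of this projection is shown below, using scikit-learn's MDS on precomputed Manhattan distances; the array `scaled_preds` is a hypothetical placeholder for the MAE-scaled predictions rather than the actual competition data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Hypothetical MAE-scaled predictions: 2000 sampled couplings x 100 submissions.
rng = np.random.default_rng(2)
scaled_preds = rng.normal(size=(2000, 100))

# Manhattan (cityblock) distance between submissions (columns).
dist = squareform(pdist(scaled_preds.T, metric="cityblock"))

# 2D MDS embedding computed from the precomputed distance matrix.
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(dist)
print(embedding.shape)  # (100, 2)
```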
S5.4 Comparison with previous competitions

To get an idea of how engaging the competition was to the community (and, earlier, how much participation we could expect), we looked at how the number of participating teams has historically depended on the prize pool. We retrieved the number of participating teams for all previous competitions with a monetary prize and converted the prize into US dollars. Additionally, we restricted ourselves to competitions where everyone could enter, where the number of teams was listed, that were not of recruitment or code type, and that took place within the last five years. Figure S2 shows how this competition compares to previous ones.
S5.5 Participant engagement
We looked at how many days separated each participating team’s first and last submission, to get an overview of the average engagement. Figure S3 shows that many teams concentrated their efforts (or made only a single submission) over 1-2 days. However, Figure S4 truncates any team that made continuous submissions over a period longer than 3 weeks into a single bin, showing that the majority of the teams were engaged throughout the competition.

Figure S2: How the prize pool is related to the number of participating teams on Kaggle in the last five years (log-log plot). This competition is highlighted in orange.

Figure S3: Histogram showing the number of days between a team’s first and last submission.

Figure S4: Histogram showing the number of days between a team’s first and last submission, where the rightmost bin indicates teams with more than 3 weeks between first and last submissions.

In the manuscript we provide comparisons to our previously published methodology (IMPRESSION[13]). We report ratios in a range, where the upper bound comes from the best models that we have trained on the training set. These could likely be improved upon using higher-memory machines. As the error in kernel methods tends to scale as a power law with the number of data points, we extrapolated what the error could theoretically be in the limit of using the entire training data set. This estimate is used as the lower bound. Table S3 shows the score contribution of each coupling type for the best ensemble and the winning submission, as well as scores for the best trained IMPRESSION models and the extrapolated errors. Similarly, Table S4 shows the mean absolute errors.

Table S3: Score contributions for each coupling type for the best ensemble, the winner, and the IMPRESSION lower and upper bounds (lower, upper).

Type    Ensemble*   Winner    IMPRESSION
1JHC    -0.296      -0.284    (-0.150, -0.003)
1JHN    -0.332      -0.288    (-0.256, -0.178)
2JHH    -0.515      -0.477    (-0.209, -0.119)
2JHC    -0.433      -0.410    (-0.238, -0.018)
2JHN    -0.463      -0.440    (-0.283, -0.143)
3JHH    -0.507      -0.477    (-0.048, 0.019)
3JHC    -0.421      -0.400    (-0.123, 0.049)
3JHN    -0.485      -0.464    (-0.216, -0.107)
Sum     -3.453      -3.241    (-1.522, -0.499)

Table S4: Mean absolute error for each coupling type for the best ensemble, the winner, and the IMPRESSION lower and upper bounds.

Type    Ensemble*   Winner    IMPRESSION
1JHC    0.0937      0.1031    (0.3002, 0.9728)
1JHN    0.0702      0.0999    (0.1292, 0.2404)
2JHH    0.0162      0.0220    (0.1886, 0.3872)
2JHC    0.0313      0.0376    (0.1488, 0.8682)
2JHN    0.0246      0.0296    (0.1040, 0.3180)
3JHH    0.0173      0.0220    (0.6815, 1.1675)
3JHC    0.0345      0.0408    (0.3751, 1.4811)
3JHN    0.0207      0.0244    (0.1780, 0.4257)
Geometric mean    0.0317    0.0391    (0.2183, 0.6069)
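The IMPRESSION lower bounds above come from extrapolating the kernel model's learning curve; a minimal sketch of such a power-law extrapolation, using hypothetical MAE-vs-training-size points, is shown below.

```python
import numpy as np

# Hypothetical learning curve: MAE (Hz) at increasing training-set sizes.
n_train = np.array([1_000, 4_000, 16_000, 64_000], dtype=float)
mae     = np.array([1.20,  0.85,  0.61,   0.44])

# Fit MAE ~ a * N^(-b) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(n_train), np.log(mae), deg=1)
a, b = np.exp(intercept), -slope

# Extrapolate to the full training set (~4.7M couplings in this competition).
n_full = 4_658_147
mae_extrapolated = a * n_full ** (-b)
print(f"MAE ~ {a:.2f} * N^(-{b:.2f}); extrapolated MAE at N={n_full}: {mae_extrapolated:.3f} Hz")
```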
S7 Solution 1

S7.1 Features
Of the provided data, we use only the element and position of each atom and the scalar coupling type of each bond. From these, we use the open source package RDKit[14] to generate additional features for each atom, specifically bond order and partial charge. Below we list the total set of features of each molecular object that we use in the network.

• Atoms: element, number of neighbors, order of neighboring bonds, partial charge, angle with nearest neighbors. All atoms are included.

• Bonds: each atom’s element, coupling type (if applicable), bond order (if applicable), distance. These are included regardless of whether there is a chemical bond or not.

• Triplets: each atom’s element, bond angle. These are included only if there is a chemical bond (one central atom bonded to a pair of other atoms).

The methods described can be generalized to include quadruplets (and dihedral angles) as well, although the addition of quadruplets was found not to be helpful in this problem.
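As an illustration of this kind of featurization, the sketch below reads a molecule with RDKit and collects per-atom partial charges and per-bond orders; the use of SMILES input and Gasteiger charges here is an assumption made for the example, not necessarily the team's exact pipeline.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical input molecule (ethanol); the competition supplied xyz geometries instead.
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
AllChem.ComputeGasteigerCharges(mol)

atom_features = [
    {
        "element": atom.GetSymbol(),
        "n_neighbors": atom.GetDegree(),
        "partial_charge": float(atom.GetProp("_GasteigerCharge")),
    }
    for atom in mol.GetAtoms()
]

bond_features = [
    {
        "atoms": (bond.GetBeginAtom().GetSymbol(), bond.GetEndAtom().GetSymbol()),
        "bond_order": bond.GetBondTypeAsDouble(),
    }
    for bond in mol.GetBonds()
]

print(atom_features[0], bond_features[0])
```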
S7.1.1 Molecular Representation
Deep learning approaches to problems in molecular modeling generally represent the molecule as a graph, with nodes representing atoms and edges representing bonds. Many of these are then passed through networks that can be described using the general framework of message passing neural networks (MPNN)[15], where information is usually passed through alternating convolutions, first from edges to nodes and then from nodes to edges. This representation is restrictive for several reasons. Most importantly, information may only be passed locally, as direct connections do not exist between, for example, pairs of atoms, or an atom and a bond far away in the molecule. Further, this molecular representation only allows us to use features that directly correspond to a particular atom or bond; for example, it is not clear how to incorporate features of atom triplets as input in a typical MPNN. In contrast to this framework, we represent each molecular object of interest (in this case, each atom, bond, and triplet, but we may incorporate other objects or properties of the molecule as we like) as a node in a complete graph.
S7.2 Model Architecture
Our architecture is a deep learning approach with three stages: an embedding layer, several graph transformer layers, and element-wise group convolutions. Each of these is detailed in the subsections below.
S7.2.1 Embedding Layer

We use three different embedding layers, each corresponding to a type of molecular object (atom, bond, and triplet). All of these objects have a mixture of discrete and continuous features, and some of these features are hierarchical in nature (for example, bonds’ coupling type and sub-type, which includes some local bond-order information). As such, we use hierarchical embeddings[16], which embed each feature separately and then linearly combine these embeddings to yield a single representation of the object. Discrete features are embedded in the usual fashion, while for continuous features we use a sine filter embedding similar to the position embeddings employed by Transformer models[17]. We embed all features described in Section S7.1, with the notable exception of atom positions. Instead, we use these positions to construct a matrix of relative distances to be used later in the network; this ensures that network output is invariant to translation and rotation of the molecule. Subsequent layers in the network will convolve over all nodes in the graph; as such, we require that the output size of each of these embeddings be the same. It is worthwhile to note that after the embedding layer, all nodes are treated identically by the network – that is, the graph transformer layer makes no distinction between nodes represented by atoms, bonds, and triplets. Rather, all representations of molecular objects exist in the same space, and it is the job of the embedding layer parameters to learn representations that can be meaningfully distinguished by the remainder of the network.
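A minimal sketch of a sine-filter embedding for a continuous scalar feature, in the spirit of Transformer position embeddings, is given below; the frequency schedule and dimensionality are illustrative assumptions rather than the team's exact choices.

```python
import torch

def sine_embedding(x: torch.Tensor, dim: int = 16, max_period: float = 1.0e4) -> torch.Tensor:
    """Embed a continuous feature x (shape [...]) into `dim` sinusoidal channels."""
    half = dim // 2
    # Geometrically spaced frequencies, as in Transformer position embeddings.
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * torch.log(torch.tensor(max_period)) / half)
    angles = x.unsqueeze(-1) * freqs                                   # [..., half]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # [..., dim]

# Example: embed a batch of interatomic distances (in Angstrom).
distances = torch.tensor([0.96, 1.52, 2.47])
print(sine_embedding(distances).shape)  # torch.Size([3, 16])
```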
S7.2.2 Graph Transformer Layers

Throughout this section, we let Z be a matrix that represents the hidden layer of a molecule, where each column of Z represents a particular node. The layer described here, which we call the graph transformer layer, forms the foundational building block of our architecture. This layer is an extension of a graph convolution layer, which, in its most general form, may be represented as
\[ \mathrm{GraphConvolution}(Z) = \sigma(W Z A) \tag{S2} \]

where σ is an activation function, W is a matrix of learnable weights, and A is a static mixing matrix that encodes the structure of the graph; common choices for A include a normalized adjacency matrix or a graph Laplacian. Our first modification is the replacement of the mixing matrix A with a self-attention mechanism, which was popularized by the Transformer model[17] for sequence modeling. In self-attention, the strength of message passing is computed via a parametrized inner product between nodes:

\[ \mathrm{GraphSelfAttention}(Z) = \sigma\!\left( W_3 Z \, \mathrm{softmax}\!\left( Z^T W_1^T W_2 Z \right) \right) \tag{S3} \]

Under this architecture, W_3 learns the message to be passed, while the weight matrices W_1 and W_2 learn the strength of each message. Replacing the mixing matrix A in this manner removes the only structural information the network receives about the graph. To reintroduce this information, we incorporate a scaling based on the distance between each pair of nodes in the graph. Let D be a matrix such that D_{i,j} represents the distance between nodes i and j, and let γ be a learnable scalar parameter. Distance-scaled self-attention is given by

\[ \mathrm{ScaledGraphSelfAttention}(Z) = \sigma\!\left( W_3 Z \, \mathrm{softmax}\!\left( Z^T W_1^T W_2 Z - \gamma D \right) \right) \tag{S4} \]

This scaling has the effect of reducing the strength of interaction between pairs of faraway nodes. The learnable parameter γ allows us to learn the intensity of this scaling: as γ approaches 0, relative distances are ignored, and as γ approaches infinity, we operate on a sparse graph, with information only flowing between directly adjacent molecular objects. The scaled graph self-attention transformation is shown in Figure S5. As in the original Transformer model, we follow self-attention with a linear transformation, a residual connection, and layer normalization[18]. The full graph transformer layer is given by

\[ \mathrm{GraphTransformerLayer}(Z) \tag{S5} \]
\[ = \mathrm{LayerNorm}\!\left( W_4 \, \mathrm{ScaledGraphSelfAttention}(Z) + Z \right) \tag{S6} \]
\[ = \mathrm{LayerNorm}\!\left( W_4 \, \sigma\!\left( W_3 Z \, \mathrm{softmax}\!\left( Z^T W_1^T W_2 Z - \gamma D \right) \right) + Z \right) \tag{S7} \]

Figure S5: Graphical representation of the model architecture.
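A minimal PyTorch sketch of one such distance-scaled attention layer, following the reconstructed Eqs (S4)-(S7), is given below; the single-head form, the row-major node layout, the ReLU activation and the dimensions are simplifications assumed for the example rather than the actual solution.

```python
import torch
import torch.nn as nn

class DistanceScaledGraphAttention(nn.Module):
    """Single-head distance-scaled graph self-attention with residual + LayerNorm."""

    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)      # attention "query" weights
        self.w2 = nn.Linear(dim, dim, bias=False)      # attention "key" weights
        self.w3 = nn.Linear(dim, dim, bias=False)      # message weights
        self.w4 = nn.Linear(dim, dim, bias=False)      # output transformation
        self.gamma = nn.Parameter(torch.tensor(1.0))   # distance-scaling strength
        self.norm = nn.LayerNorm(dim)

    def forward(self, z: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        # z: [n_nodes, dim] node representations; d: [n_nodes, n_nodes] distances.
        logits = self.w1(z) @ self.w2(z).T - self.gamma * d   # pairwise attention logits
        attn = torch.softmax(logits, dim=-1)
        messages = torch.relu(attn @ self.w3(z))              # aggregate messages
        return self.norm(self.w4(messages) + z)               # residual + LayerNorm

# Example with 5 nodes and 8 hidden channels.
z = torch.randn(5, 8)
pts = torch.randn(5, 3)                                       # hypothetical node positions
d = torch.cdist(pts, pts)
print(DistanceScaledGraphAttention(8)(z, d).shape)            # torch.Size([5, 8])
```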
S7.2.3 Defining Distances

The self-attention scaling described above requires a matrix D of relative distances between objects represented in the graph. However, since bonds and triplets do not have a well-defined position in space, it is not necessarily trivial to define distances among these objects. To do so, we begin with distances between two atoms a and a', denoted d_{atom}(a, a'), which we define as the simple Euclidean distance between their positions. Distances involving larger sets of atoms are defined recursively, with the guiding principle that the distance between two sets of atoms should be 0 if and only if one set is contained in the other. In the following, let a be an atom, B be a bond connecting atoms b_1 and b_2, and C be a triplet containing center atom c_1, other atoms c_2 and c_3, and bonds c_{12} and c_{13}. The distances are defined as

d_{atom,bond}(a, B) = \min\big(d_{atom}(a, b_1),\, d_{atom}(a, b_2)\big) \qquad (S8)

d_{bond,bond}(B, B') = \tfrac{1}{4} \sum_{b \in B,\, b' \in B'} d_{atom}(b, b') \qquad (S9)

d_{atom,trip}(a, C) = \min_{c \in \{c_1, c_2, c_3\}} \big(d_{atom}(a, c)\big) + d_{atom}(a, c_1) \qquad (S10)

d_{bond,trip}(B, C) = \min\big(d_{bond,bond}(B, c_{12}),\, d_{bond,bond}(B, c_{13})\big) \qquad (S11)

d_{trip,trip}(C, C') = \tfrac{1}{4} \sum_{c \in \{c_{12}, c_{13}\},\, c' \in \{c'_{12}, c'_{13}\}} d_{bond,bond}(c, c') \qquad (S12)
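A plain-Python sketch of these recursive distance definitions follows; it implements the reconstructed equations (S8)–(S12) above under the assumption that atoms are given as coordinate 3-vectors and bonds/triplets as tuples of atom coordinates, and the function names are illustrative.

```python
import numpy as np
from itertools import product

def d_atom(a, b):
    """Euclidean distance between two atoms given as 3-vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def d_atom_bond(a, bond):
    """Eq. (S8): bond is a pair of atom coordinates (b1, b2)."""
    return min(d_atom(a, b) for b in bond)

def d_bond_bond(bond1, bond2):
    """Eq. (S9): average atom-atom distance over the 4 atom pairs."""
    return sum(d_atom(b, bp) for b, bp in product(bond1, bond2)) / 4.0

def d_atom_trip(a, trip):
    """Eq. (S10): trip = (c1, c2, c3) with c1 the center atom."""
    c1, c2, c3 = trip
    return min(d_atom(a, c) for c in (c1, c2, c3)) + d_atom(a, c1)

def d_bond_trip(bond, trip):
    """Eq. (S11): compare the bond against the triplet's two bonds."""
    c1, c2, c3 = trip
    return min(d_bond_bond(bond, (c1, c2)), d_bond_bond(bond, (c1, c3)))

def d_trip_trip(trip1, trip2):
    """Eq. (S12): average bond-bond distance over the 4 bond pairs."""
    bonds1 = [(trip1[0], trip1[1]), (trip1[0], trip1[2])]
    bonds2 = [(trip2[0], trip2[1]), (trip2[0], trip2[2])]
    return sum(d_bond_bond(b, bp) for b, bp in product(bonds1, bonds2)) / 4.0
```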
S7.2.4 Group Convolutions

After being passed through some number of graph transformer layers, the nodes representing coupling bonds are passed through a series of 1×1 convolutions whose weights are different for each group (coupling type). A final grouped convolution outputs the predicted scalar coupling constant.

S7.3 Training and Ensembling
In total, 13 of the models described in the previous section were trained on the task of predicting scalar coupling constants. Model size varied, but models generally had between 12 and 18 graph transformer layers and a hidden dimension between 600 and 800. These models were initially trained using an absolute loss instead of the loss defined in equation S1. The model was implemented in PyTorch [19] and network training was done using the ADAM optimizer [20] with a cosine annealing learning rate scheduler [21] for 200 epochs. For some of our best models, we additionally fine-tuned by training for approximately 40 epochs using the loss defined in equation S1, which improved their single-model scores by around 0.020 to 0.030. For each coupling subtype (i.e. coupling type, plus further breakdowns by element and chemical environment information), the targets were scaled to zero mean and unit standard deviation to simplify the learning process.

These 13 models were ensembled with a mix of mean and median ensembling. First, for each coupling type, we observed how frequently each model's prediction was the median among all 13 predictions. The 9 best models according to this metric were selected, with the remaining 4 discarded. Finally, for each coupling constant, we selected the central 5 median values from among the 9 model predictions; the mean of these 5 values was used as the final prediction. We found that this ensembling strategy was more effective than strict mean or median ensembling, and did not require a validation set to estimate individual models' test accuracy.
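A NumPy sketch of this ensembling strategy is given below, assuming the model predictions are stacked in an array; the tie-breaking of "median frequency" via the closest-to-median prediction is an assumption of the sketch, not necessarily the exact rule used.

```python
import numpy as np

def center_median_ensemble(preds, n_keep=9, n_center=5):
    """Sketch of the mixed mean/median ensembling described above.

    preds: array of shape (n_models, n_samples) with each model's predictions.
    Models are ranked by how often they supply the per-sample median; the best
    `n_keep` are retained, and the final prediction is the mean of the
    `n_center` central values of the retained predictions.
    """
    n_models, _ = preds.shape
    medians = np.median(preds, axis=0)
    # Count, per model, how often its prediction is (closest to) the median.
    closest = np.argmin(np.abs(preds - medians), axis=0)
    counts = np.bincount(closest, minlength=n_models)
    keep = np.argsort(counts)[-n_keep:]
    kept = np.sort(preds[keep], axis=0)              # (n_keep, n_samples)
    lo = (n_keep - n_center) // 2
    return kept[lo:lo + n_center].mean(axis=0)       # mean of the central values
```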
S8 Solution 2 - Quantum Uncertainty

S8.1 Data Augmentation
The key to the success of this model lies in the data augmentation, and comes from the insight that pair-wise properties can be modelled as atom-wise properties, given that the relative position to the paired atom is encoded in the input. This is achieved by making duplicate 'sibling' molecules that are translated to have a different atom as origin. For a given molecule of N atoms, where M of these are of atom type H, C or N, M siblings are constructed, each centered on a different H, C or N atom. Furthermore, the sibling molecules are padded with dummy atoms so that all molecules have the same number of atoms (29 for this competition); these dummy atoms are later masked in the model. Each atom in the sibling molecules is then matched with the corresponding coupling constant label, where dummy values are used for coupling types that are not present in the dataset, and these are similarly masked later in the model.
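A minimal NumPy sketch of this sibling construction follows, assuming coordinates and element symbols as inputs; the function name, the returned mask, and the padding value are illustrative assumptions.

```python
import numpy as np

MAX_ATOMS = 29  # largest molecule in the competition dataset

def make_siblings(coords, symbols, centers=("H", "C", "N")):
    """Sketch of the sibling-molecule augmentation.

    coords:  (N, 3) array of atomic coordinates.
    symbols: length-N list of element symbols.
    Returns one padded coordinate array per H/C/N atom, each translated so that
    the chosen atom sits at the origin; a boolean mask marks the real atoms.
    """
    siblings = []
    for i, sym in enumerate(symbols):
        if sym not in centers:
            continue
        shifted = coords - coords[i]                  # new origin at atom i
        padded = np.zeros((MAX_ATOMS, 3))             # dummy-atom padding
        padded[: len(coords)] = shifted
        mask = np.zeros(MAX_ATOMS, dtype=bool)
        mask[: len(coords)] = True
        siblings.append((padded, mask, i))
    return siblings
```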
S8.2 Input Features

A feature vector is constructed for each atom in the sibling molecules, containing the Cartesian coordinates of the atom, the atom type index (C=0, H=1, ...) as well as an index for the type of coupling formed with the origin atom (1JCH=0, 2JHH=1, ...). In the feature vector, 3 elements are the X, Y and Z components of the Cartesian coordinates, while the remaining elements are used for embedding the atom type and coupling type. For a feature vector of e.g. length 1024, 510 elements will be repetitions of the atom type index, 511 will be repetitions of the coupling type index, while the remaining 3 elements will contain the Cartesian coordinates. Note that there is no graph information nor any other manually engineered features.
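The following is a short sketch of how such a per-atom vector could be assembled from the description above; the function name and the exact split of the remaining dimensions for other feature-vector lengths are illustrative assumptions.

```python
import numpy as np

def atom_feature_vector(xyz, atom_type, coupling_type, dim=1024):
    """Build the per-atom input vector described above: 3 coordinate entries,
    510 repetitions of the atom type index and 511 repetitions of the coupling
    type index (for dim=1024). Purely illustrative.
    """
    n_atom = (dim - 3) // 2          # 510 for dim=1024
    n_coup = dim - 3 - n_atom        # 511 for dim=1024
    return np.concatenate([
        np.asarray(xyz, dtype=float),            # Cartesian coordinates
        np.full(n_atom, float(atom_type)),       # repeated atom type index
        np.full(n_coup, float(coupling_type)),   # repeated coupling type index
    ])
```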
S8.3 Model Architecture
The model used a standard transformer architecture [22] utilizing the fast.ai library [23], where each feature vector is updated with several transformer layers, followed by a feed-forward neural network that decodes the scalar coupling constants from the transformed feature vectors (see Fig. S6).

Figure S6: Overview of the overall architecture.

Adding rotations during training did not improve the model performance, indicating that the molecules were either systematically aligned in a way that enabled efficient training, or that rotation invariance was easily learned by the model.

Figure S7: Overview of the ensembling strategy.

S8.4 Training and Ensembling
14 models were trained, with dimensions varying from 512 to 2048 and layers from 6 to 24, and with scores of -2.9 to -3.1. Model parameter counts ranged from 12M to 100M (biggest model). A small validation set was kept separate to fit the ensemble model. Four different regressors were trained on the validation set predictions of the 14 models: k-nearest neighbors, linear, RANSAC [24] and Theil-Sen [25] regression models. A voting regressor was used to combine the predictions of the four models, which resulted in a score of -3.21. Finally, the single model with the best score (6 layers, 1024-dimensional feature vectors) was trained on all the training data (including the validation set), yielding a score of -3.16. Averaging this best single model with the voting regressor model yielded a final score of -3.23 on the public test set. An overview of the ensembling strategy is shown in Fig. S7. Inference with the single best model takes roughly 15 ms per molecule.
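A scikit-learn sketch of this second-level ensembling is shown below; the array names and the final 50/50 averaging step are assumptions based on the description above.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, RANSACRegressor, TheilSenRegressor
from sklearn.neighbors import KNeighborsRegressor

def fit_voting_ensemble(val_preds, val_targets):
    """Sketch: combine the base models' predictions with a voting regressor
    over four second-level regressors, as described above.

    val_preds:   (n_samples, n_models) held-out predictions of the base models.
    val_targets: (n_samples,) true coupling constants for the held-out set.
    """
    ensemble = VotingRegressor([
        ("knn", KNeighborsRegressor()),
        ("linear", LinearRegression()),
        ("ransac", RANSACRegressor()),
        ("theilsen", TheilSenRegressor()),
    ])
    ensemble.fit(val_preds, val_targets)
    return ensemble

# Usage sketch: the final prediction is the average of the voting ensemble and
# the best single model retrained on all data.
# final = 0.5 * (ensemble.predict(test_preds) + best_single_model_preds)
```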
S9 Solution 3
S9.1 Features
Due to the graphical nature of molecular systems, graph-based architectures are often applied to chemical problems, including this competition [26]. However, the nature of the problem in this competition is slightly different from typical chemical problems, since atom-pair properties of three-dimensional chemical structures have not been modelled before and only a subset of atomic pairs in the molecules have target values (coupling constants). For example, the largest molecule in the training set (nonane) has 29 atoms and there are 406 atomic pairs. Among those atomic pairs, only 127 have target values. For this reason, an alternative representation for molecular systems to the graph representation can be utilized. Here, we use a set of only part of the pair features as a descriptor for a molecule. Even though we replace the graphical representation, size-extensivity and permutation invariance should be preserved. A similarity-based attention (or simply attention) is one of the mechanisms that satisfies both conditions:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \qquad (S13)

where Q, K, V and d_k are the query, key, value, and feature dimension of the key, respectively. The output of attention is a weighted average of the input values, where the weights are determined by the similarity of queries and keys. Attention layers are permutation invariant and size extensive because if the size or order of the input vectors is changed, those of the corresponding key and query vectors change accordingly. In many different contexts of machine learning, a large number of variants of attention have been proposed. Multi-head attention is one of the most frequently employed ones. It generalizes conventional attention by concatenating the results of attention layers whose inputs are linearly transformed with different weights:

\mathrm{Multihead}(Q, K, V) = \mathrm{Concat}(O_1, O_2, \ldots, O_h)\, W^O \qquad (S14)

O_i = \mathrm{Attention}(Q W^Q_i, K W^K_i, V W^V_i) \qquad (S15)

where W^Q_i, W^K_i, W^V_i and W^O are the weights for the query, key, value, and outputs of the attention layers, respectively. One of the benefits of using (multi-head) attention is to minimize sequential computation, which is an obstacle for parallelization.

S9.2 Model Architecture
Transformers, one of the well-known natural language processing architectures, recorded overwhelming performance on Seq2Seq [27] problems by replacing sequential computation with multi-head attention. The original transformer model [22] includes positional embeddings to impose sequential information on input and output feature vectors before performing the encoding and decoding process, since the encoder and decoder are composed of permutation-invariant layers. Also, like ordinary Attention and Linear blocks, they are size extensive, which means different lengths of sequence can be treated. Hence, we employ the encoder of the Transformer to embed the pair sequence information and construct a latent space that contains the interactions among atomic pairs. In terms of NLP, atomic pair information and pair sequences become words and sentences, respectively. We refer to the original transformer paper for more details on the attention and transformer architecture [22]. Once again, the important features of the model are size-extensivity and permutation-invariance.

Our model (shown in Figure S9) adopts the encoder of the Transformer architecture because of its size extensivity and permutation invariance. We make an atomic pair sequence to represent a molecule and calculate a target value by considering interactions among atomic pairs. The Transformer encoder (grey part in Figure S9) is employed to transfer the input features to a latent space, which is read by the Readout (purple part in Figure S9; more detailed information in Figure S10). For each type, the weights of the Readout are different, but the structures and dimensions of the weights are the same (see Table S8).

Table S8: Detailed solution information.
Encoding: Number of encoder layers: 8; Number of heads for attention: 8; Dropout ratio: 0.1; Intermediate size: 2048
Readout: Dimension: 832; Dropout ratio: 0.1
Embedding: Embedding dimension for atomic charge: 32; for position: 256; for atomic number: 64; for distance: 64; for type: 64
Data augmentation: Mean of translational noise (Angstrom): 0; Standard deviation of translational noise (Angstrom): 2; Mean of angle in rotational noise (rad): 0; Standard deviation of angle in rotational noise (rad): 1.57; Dummy types: 1JHO, 1JCO, 1JCN, 1JNO, 1JCC, 1JNN, 1JFC
Training and ensemble: Learning rate: 0.0003; Linear warmup steps: 30; Weight decay: 0.01; Size of seed ensemble: 10; Mean of initial weights distribution: 0; Standard deviation of initial weights distribution: 0.02

S9.3 Readout stage
The Readout stage evaluates the scalar coupling constant (SC in Figure S9) from the output sequence of the encoding layer. Here, we employ a physical condition: the scalar coupling constant is the sum of 4 components, the Fermi contact contribution (FC), the spin-dipolar contribution (SD), the paramagnetic spin-orbit contribution (PSO) and the diamagnetic spin-orbit contribution (DSO). The Readout stage is designed to predict the target value as the mean of two differently evaluated values. One comes directly from the latent space and the other is the sum of the 4 components, which also come from the latent space. The loss function for the whole model is constructed as the log of the sum of the MAEs of the outputs of the two Readout layers.

Figure S9: Pictorial representation of the overall architecture.
Figure S10: Pictorial representation of the overall model.
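A hedged PyTorch sketch of this two-branch readout and loss is given below; the hidden-layer structure, ReLU activations and class name are assumptions (the actual Readout structure is shown in Figure S10).

```python
import torch
import torch.nn as nn

class Readout(nn.Module):
    """Sketch of the two-branch readout: one head predicts the coupling
    directly, another predicts the 4 physical contributions (FC, SD, PSO, DSO)
    whose sum gives a second estimate."""
    def __init__(self, d_model, d_readout=832):
        super().__init__()
        self.direct = nn.Sequential(nn.Linear(d_model, d_readout), nn.ReLU(),
                                    nn.Linear(d_readout, 1))
        self.terms = nn.Sequential(nn.Linear(d_model, d_readout), nn.ReLU(),
                                   nn.Linear(d_readout, 4))

    def forward(self, latent):
        sc_direct = self.direct(latent).squeeze(-1)   # direct estimate
        sc_sum = self.terms(latent).sum(dim=-1)       # FC + SD + PSO + DSO
        return sc_direct, sc_sum

def readout_loss(sc_direct, sc_sum, target):
    """Log of the sum of the MAEs of the two readout outputs."""
    mae = torch.abs(sc_direct - target).mean() + torch.abs(sc_sum - target).mean()
    return torch.log(mae)

# The final prediction is the mean of the two estimates:
# prediction = 0.5 * (sc_direct + sc_sum)
```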
S9.4 Features

An atomic pair feature is composed of the two atoms' properties and pair properties (see Figure S11). For the atomic properties, the position of the atom, the atomic number and the atomic charge available in the QM9 dataset are used; the distance between the atoms and the type of coupling are used as the pair properties. These values are embedded and concatenated to build the feature vector. Unlike the sum operation, concatenation is not commutative, so our feature vectors depend on the order of atoms in a pair. To make the feature vector independent of the order of atoms, test-time augmentation or data augmentation could be applied, but since the order of atoms in a pair is sorted in the dataset, we use neither. For clarity, it should be noted that this order-dependency does not necessarily lead to permutation dependency of the atomic pairs belonging to molecules.
S9.4.1 Data Augmentation
As described in the previous section, the positions of atoms are used in the input sequences; therefore, rotational and translational invariance are not conserved in our model. Although both invariances are not mathematically preserved, by training on augmented data we guide the parameters of our model to have pseudo-invariance within a certain range of translational and rotational changes. Fortunately, the positions of atoms belonging to the QM9 set do not have extreme values, which means this kind of less rigorous strategy can cover all the test cases in the competition. If we needed to cover general molecular geometry, this kind of strategy might not work. Generation of noise can be done at every epoch with low computational cost; here, the original training database is enlarged 10 times with randomly selected noise, and the selected noise is not changed during training. The translational perturbations on each axis and the angles of the rotational perturbations are sampled from a Normal distribution. With this augmentation of the training data, we observed that the trained model showed small enough disagreement under both changes and achieved a better score on the leaderboard. Mean and standard deviation values for both translational and rotational perturbations can be found in Table S8.

Figure S11: Pictorial representation of the input features.
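A NumPy sketch of one such perturbation is shown below, using the Table S8 values (translation standard deviation 2 Angstrom, rotation angle standard deviation 1.57 rad); the choice of a uniformly random rotation axis is an assumption of the sketch.

```python
import numpy as np

def perturb_molecule(coords, trans_std=2.0, angle_std=1.57, rng=np.random):
    """Sketch of the translational/rotational data augmentation.

    coords: (N, 3) atomic coordinates. Translation offsets per axis and the
    rotation angle are drawn from zero-mean Normal distributions (Table S8);
    the rotation axis here is a random unit vector (an assumption).
    """
    # Random rotation about a random axis by a Normal-distributed angle.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.normal(0.0, angle_std)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)  # Rodrigues
    # Random translation per axis.
    t = rng.normal(0.0, trans_std, size=3)
    return coords @ R.T + t
```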
S9.4.2 Dummy types
Some of the geometric information of a molecule is missing from the input sequences because not all atomic pairs are included. This may cause an incomplete description of the chemical circumstances. In order to overcome this limitation, dummy pairs are introduced. The dummy pairs are part of the input sequences but do not have coupling constant values. The predicted values of dummy types are not used to evaluate the loss, but the feature vectors of dummy types participate in the process of building the latent space, even for meaningful pairs. In this way, the geometric information of non-activated pairs can be included in the input sequence. Thousands of pair types exist in the given training set. Among them, 7 additional pair types are included in the input sequences: 1JHO, 1JCO, 1JCN, 1JNO, 1JCC, 1JNN, and 1JFC. The selection of these dummy pairs is a tunable hyperparameter, but due to a lack of computing power we did not test various combinations of dummy types. We chose to add neighboring types first because nearby pairs may have a strong impact on the chemical circumstances. If more dummy types are added, more geometric information can be included in the input sequences, but the computational cost increases as the length of the sequence increases.
S9.5 Training and Ensembling

In order to achieve a high score, we employed an ensemble technique and pseudo-labeling. An ensemble technique merges the results of various models, and pseudo-labeling uses a new dataset containing unlabeled data whose labels are taken from the results of a previously trained model. We originally planned to use a seed ensemble, which uses a number of identical models with different random seed numbers, but this was not possible because of inequalities in teammates' computing resources. Therefore, models with different N values (in Figure S9) and numbers of epochs are used. The overall training process is illustrated in Figure S12. In the first stage, 15 models are trained using only the training dataset, with dropout. In the second stage, the parameters of the models are fine-tuned by turning off the dropout. Using the models trained up to the second stage, labels for the test set (so-called pseudo-labels) are predicted. To minimize the error of the pseudo-labels, the results of 4-8 models trained up to the second stage are employed. In the third stage, newly constructed training sets, which include both the original training dataset and the pseudo-labeled test set, are used. Because of the competition deadline, we only took 8 models further and trained them with the expanded datasets. At stages 3 and 4, 20 epochs are run. To mitigate unpredicted bias from pseudo-labeling, at the last stage the models are trained with only the original training dataset. The Adam optimizer, a typical choice, is employed with gradient clipping. Linear warmup and linear decay are used for stable convergence of training. All weights are initialized from a uniform distribution (mean = 0.0, std. = 0.02) and 0 is used for the initial bias values.

S9.5.1 Specification of models
Training in each ensemble is composed of 4 different steps. In the first step, training is performed with the training set data and dropout. In the second step, further training proceeds without dropout. Using the model obtained from the second step, the coupling constants for the test set are predicted; these predicted values are called pseudo-labels. The pseudo-labeled data, in addition to the training data, are used in the third training step. To mitigate overfitting, further training is then performed with only the training set. Output values of each type are normalized because each type shows a different distribution of target values.

S9.6 Performance
Our final model has around 75M parameters. With two V100 graphics cards, training the model takes around 2 days.

Figure S12: Pictorial representation of the training procedure.
S10 Solution 4
S10.1 Model Architecture
The model we used is a development of the Graph Attention Network [28] architecture with Gated Residual Connections [29]. The model operates on molecular graphs (nodes being the atoms, and edges being the bonds), augmented by artificial links between the nodes representing atoms 2 and 3 bonds apart in the molecular structure. Both graph nodes and graph edges hold feature vectors (in contrast to the more standard approach where only the nodes hold feature vectors). The sizes of the node and edge feature vectors are equal in each layer. The network is implemented in the PyTorch [30] framework, using the Deep Graph Library [31], which requires all graph edges to be directed. We therefore represent each bond (as well as each 1-3 and 1-4 link) as a pair of graph edges (one in each direction).

The network consists of multiple layers, where each layer updates both the node and the edge feature vectors through the following steps (a code sketch of one such layer is given at the end of this subsection):

1. Fully connected layers applied to nodes and edges:

\eta^l = W^l_{node}\, n^l, \qquad \epsilon^l = W^l_{edge}\, e^l

where n^l and e^l represent the values of the node and edge embeddings from the previous layer, while \eta and \epsilon represent intermediate values used within the rest of the layer.

2. Attention-based convolutional update applied to each node:

\alpha^l_{ij} = \mathrm{softmax}_j\big(\sigma(A^l [\eta^l_i \,\Vert\, \epsilon^l_{ij} \,\Vert\, \eta^l_j])\big)

n^l_i = \sigma\Big(\sum_j \alpha^l_{ij}\, \eta^l_i \odot \epsilon^l_{ij}\Big)

where \sigma is a non-linearity (in the final model we used the Leaky Rectified Linear Unit with the slope set to 0.2) and \Vert denotes concatenation. Furthermore, we employ the multi-head extension of the attention mechanism (as described in [28]), where multiple independent attention mechanisms are applied and their results concatenated:

n^l_i = \big\Vert_{k=1}^{K} \sigma\Big(\sum_j \alpha^l_{ijk}\, \eta^l_{ik} \odot \epsilon^l_{ijk}\Big)

3. Fully connected update to each edge, based on the concatenation of the source node feature vector, edge feature vector, and destination node feature vector:

e^l_{ij} = \sigma\big(W^l_{triplet} [n^l_i \,\Vert\, \epsilon^l_{ij} \,\Vert\, n^l_j]\big)

4. At the end of each layer we apply layer normalization [32] to the node and edge feature vectors.

Each layer is followed by a Gated (Parametric) Residual connection [29] on both the node and the edge feature vectors. The outputs of the residual values are passed through Parametric Rectified Linear Units [33].

The final model is made up of an embedding layer, eight residual layers, and two fully connected output layers applied to the edge feature vectors of the final layer. The embedding size for both the nodes and the edges in each residual layer is 1152, and the number of heads is 24 (each head accounting for 48 dimensions). The output sizes of the two final fully connected output layers were 512 and 8 (one output for each of the coupling types).

The coupling constant between two atoms is predicted by taking the corresponding output from the edges linking the graph nodes representing the two atoms. For each pair of atoms the graph will incorporate two such edges (one in each direction, due to the directed nature of the graph). During training, we treat these as two independent predictions (calculating and back-propagating the loss as if these were two data points). In prediction mode we output the average of these two predictions.
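The sketch below is a single-head, dense-tensor rendering of one such layer, following our reading of the reconstructed equations above; it is not the team's implementation (which uses the Deep Graph Library, multi-head attention, gated residuals and PReLU), and the elementwise combination of node and edge intermediates is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeEdgeAttentionLayer(nn.Module):
    """Single-head sketch of one node/edge update layer.

    n: (N, d) node features; e: (N, N, d) edge features; adj: (N, N) 0/1 mask
    of existing directed edges, including the artificial 1-3/1-4 links.
    """
    def __init__(self, d):
        super().__init__()
        self.w_node = nn.Linear(d, d)
        self.w_edge = nn.Linear(d, d)
        self.attn = nn.Linear(3 * d, 1)
        self.w_triplet = nn.Linear(3 * d, d)
        self.act = nn.LeakyReLU(0.2)
        self.norm_n = nn.LayerNorm(d)
        self.norm_e = nn.LayerNorm(d)

    def forward(self, n, e, adj):
        N, d = n.shape
        eta = self.w_node(n)                       # step 1: node intermediates
        eps = self.w_edge(e)                       # step 1: edge intermediates
        # step 2: attention logits from [eta_i || eps_ij || eta_j]
        pair = torch.cat([eta.unsqueeze(1).expand(N, N, d),
                          eps,
                          eta.unsqueeze(0).expand(N, N, d)], dim=-1)
        logits = self.act(self.attn(pair)).squeeze(-1)
        logits = logits.masked_fill(adj == 0, float("-inf"))
        alpha = F.softmax(logits, dim=1)           # softmax over neighbours j
        msg = eta.unsqueeze(1) * eps               # eta_i (x) eps_ij, per the update above
        n_new = self.act((alpha.unsqueeze(-1) * msg).sum(dim=1))
        # step 3: edge update from [n_i || eps_ij || n_j]
        trip = torch.cat([n_new.unsqueeze(1).expand(N, N, d),
                          eps,
                          n_new.unsqueeze(0).expand(N, N, d)], dim=-1)
        e_new = self.act(self.w_triplet(trip))
        # step 4: layer normalisation of node and edge features
        return self.norm_n(n_new), self.norm_e(e_new)
```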
S10.2 Features

In addition to the molecular graph structure (atoms and bonds as inferred by OpenBabel [34]), the model uses a number of atom and bond features. For atoms we use: electronegativity, first ionization energy, and electron affinity for each atom type, as well as the atom's Mulliken charge taken from the QM9 dataset [35, 36]. For edges we use: bond length, bond angle (for artificial 1-3 edges), and the dihedral angle (for artificial 1-4 edges). All features were standardized based on the training set statistics. Labels were standardised for all models, except those used for predicting 1JHC, where we only subtract the mean (which we empirically determined to produce better results).

S10.3 Training and ensembling
We trained the model using a modified version of the LAMB optimizer [37], where we decouple the weight decay term from the trust region calculations, similarly to the AdamW modification of the Adam optimizer [38]. The training was done using a mini-batch size of 80 molecules. Models were first pre-trained to jointly predict the scalar coupling constants for all coupling types. In the fine-tuning step we train separate models for the JHC and JHH types, and continue training the joint model for the JHN types. The training was done using the Mean Absolute Error loss.

During training we used the following dynamic learning rate schedule (sketched in code at the end of this subsection):

• 30-epoch cycle, linearly varying the learning rate from 0.001 to 0.01 and back (pre-train phase)
• 70 epochs with a constant learning rate of 0.001 (pre-train phase)
• Constant learning rate of 0.001 until convergence; 90-100 cycles, depending on coupling type (fine-tuning phase)

The model was regularized using weight decay set to 0.05 for the first 30 epochs and 0.01 afterwards. Final model parameters are derived by running Stochastic Weight Averaging [39] over the models from the final 25 epochs of training. Final predictions are produced as the mean of the outputs from two folds (two sets of models, both using the same architecture, but with a different train/validation split, with 90% of the data used in training and 10% used for validation). Each of the two model sets consists of 6 models:

• a model per coupling type, for each of the 1JHC, 2JHC, 3JHC, 2JHH, 3JHH coupling types
• a joint model for the JHN coupling types
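The pre-training part of the learning rate schedule can be written as a simple function of the epoch, as in the sketch below; the function name and the exact shape of the triangular ramp are assumptions based on the bullet list above.

```python
def pretrain_learning_rate(epoch, lr_min=0.001, lr_max=0.01, cycle=30):
    """Sketch of the pre-training schedule: a 30-epoch triangular cycle from
    0.001 up to 0.01 and back, followed by a constant 0.001."""
    if epoch < cycle:
        half = cycle / 2.0
        frac = epoch / half if epoch <= half else (cycle - epoch) / half
        return lr_min + frac * (lr_max - lr_min)
    return lr_min

# Example: learning rates for the first 100 pre-training epochs.
lrs = [pretrain_learning_rate(e) for e in range(100)]
```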
S10.4 Performance

The training procedure for the final ensemble took 200 GPU hours on systems with 2080Ti cards. The ensemble achieved a score of -3.18667 on the public test set, and -3.18085 on the private test set provided by the competition. In order to illuminate the architecture's performance, in addition to the final model (denoted FULL) we also benchmarked a single model trained to predict all of the coupling types jointly (denoted SINGLE), and an ensemble where a single model is specialized for each coupling type (denoted PER-TYPE). The results are presented in Table S5.

Table S5: Scores achieved by different ensembling choices.
Model    | Private score | Public score | Train time (GPU h)
FULL     | -3.18085      | -3.18667     | 200
PER-TYPE | -3.13853      | -3.14362     | 85
SINGLE   | -2.96183      | -2.96443     | 24

S11 Solution 5
Our model is an ensemble model built from 16 graph-based deep learning models. Our base models are inspired by the MatErials Graph Network (MEGNet) [40], an architecture used to predict properties of either molecules or crystals. In the following we describe our best single model.
S11.1 Input features
We created features from the raw atom elements and coordinates using OpenBabel [34]. OpenBabel provides numerous chemical features for each atom, bond, and for the whole molecule, which are all included in our input features. We also added translation- and rotation-invariant features such as ACSFs [41], bond distances, angles between bonds and the raw coordinates. Random rotations of the molecule coordinates were applied as a form of data augmentation. The most important features, sorted by permutation feature importance, are the following:

• Atom: size of the smallest ring that contains the atom, heteroatom valence, number of ring connections to the atom, average angle of bonds connected to the atom, whether a neighboring atom has an unsaturated bond to a third atom, count of hydrogens bound to the atom.
• Bond: angle formed by the neighboring bond with the closest atom, minimum distance of a neighboring bond, bond distance, scalar coupling type.
• Molecule: number of atoms, number of non-hydrogen atoms.

We also noticed that the most important features of our preliminary models, according to a permutation feature importance, were the ring topology in the molecule and the angles between two bonds. This inspired us to modify MEGNet by adding two operations related to rings and bond-bond angles. We describe these modifications below.
S11.2 Model architecture
Our model consists of three stages:

1. a preprocessing operation that transforms molecular, bond and atomic features into a graph representation;
2. multiple steps of a graph update operation;
3. a readout operation that transforms the graph representation into a set of scalar coupling values.

The architecture is described in Figure S13.
S11.2.1 Preprocessing operation
The preprocessing operation builds a graph representation of the molecule. Each atom in the molecule represents a node in the graph. We choose to represent the molecule with a dense graph rather than limiting the edges to chemical bonds. This choice helped information flow better through our network. Each node, each edge and the whole graph are associated with a state vector. To build each state vector, we apply a multi-layer perceptron to the features of the corresponding object.
S11.2.2 Graph update operation
The graph update operation takes a graph as input and outputs the same graph structure with updated state vector values. It consists of three steps:

1. update the edge state vectors;
2. update the node state vectors;
3. update the global state vector.

Figure S13: Best single model architecture.

We denote V = \{v_i\}_{i=1:N_V} the set of node vectors, with N_V the node count, E = \{e_k\}_{k=1:N_E} the set of edge vectors, with N_E the edge count, and u the global graph vector. G = (V, E, u) is the input graph representation and G' = (V', E', u') is the output graph representation of the graph update operation.

In the following, all \phi functions are multi-layer perceptrons with two hidden layers, SoftPlus activation and LayerNormalization, which take a vector as input and output another vector. If two \phi functions share the same subscript, the multi-layer perceptrons share their parameters. If a \phi function has no subscript, its parameters are not shared with any other multi-layer perceptron. All \rho functions are vector aggregation functions that output one vector from a set of vectors. In our architecture, all aggregations are attention functions that take two parameters:

• a set of vectors to aggregate, which is used as the keys and values in the attention mechanism;
• a vector which is used as the query in the attention mechanism.

The \oplus operator is the vector concatenation operator.
Figure S14: Computation of r with two rings.

Edge vector update
The edge vector update consists of four operations. In the following we denote e the edge vector to be updated and b the bond associated with it.

Firstly, we introduce an intermediary vector r which is an aggregation of the edge vectors contained in all rings b is a part of, where a ring denotes a simple cycle of atoms and chemical bonds in a molecule. The purpose of r is to allow a better flow of the ring-related information. We proposed this intermediate vector because we observed that ring features had a high impact on preliminary scalar coupling models. We denote N_R the number of rings containing b in the molecule. For the i-th ring containing b, R_i = \{\phi_e(e_k)\}_{k=1:N_{R_i}} is the set of processed edge vectors of the ring. The vector r is computed as follows:

r_i = \rho\big(R_i,\, \phi_e(e)\big)

r = \rho\big(\{r_i\}_{i=1:N_R},\, \phi_e(e)\big)

r_i is an aggregated edge vector along the i-th ring containing b and r is an aggregated edge vector along all the rings containing b. An example of such a computation is displayed in Figure S14.

Secondly, we introduce another intermediary vector a which is an aggregation of neighboring edge vectors and local geometric features. The purpose of a is firstly to integrate angular edge-edge features into our architecture and secondly to provide a better flow of the edge-edge information. We proposed this intermediate vector because we observed that the engineered edge feature "angle between one edge and the edges formed with the closest atom" had a high impact on our preliminary scalar coupling models. We denote A = \{(e_i, e'_i)\}_{i=1:N_V} the set of all couples of edges formed between an atom in the molecule and the two atoms in e. f_i is the feature vector characterising the triangle of edges (e, e_i, e'_i), which is 5-dimensional and contains the 3 angles in the triangle as well as the lengths of edges e_i and e'_i. The vector a is defined as follows:

a = \rho\Big(\big\{\phi(f_i) \oplus \phi_e(e) \oplus \phi_e(e_i) \oplus \phi_e(e'_i)\big\}_{i=1:N_V},\, \phi_e(e)\Big)

An example of such a computation is displayed in Figure S15.

Figure S15: Computation of a for the edge e in a four node system.

Finally, we compute the updated edge vector e'. We denote v_1 and v_2 the first and second atom vectors of e.

e' = \phi\big(r \oplus a \oplus \phi_e(e) \oplus \phi_u(u)\big) + \phi_{atom}(v_1) + \phi_{atom}(v_2) + e \qquad (S16)

e' contains a skip connection to enable deeper and faster model training.

Node update
We denote v the node vector to be updated, v_i the i-th node vector different from v, and e'_i the updated edge vector between the nodes associated with v and v_i. Let C = \{v_i \oplus e'_i\}_{i=1:N_V} be the set of concatenated node and updated edge vectors. The updated vector v' is defined as follows:

v_{agg} = \rho\big(C,\, \phi_v(v)\big)

v' = \phi_s\big(v_{agg} \oplus \phi_v(v) \oplus \phi_u(u)\big) + v

As for the edge update, we integrate a skip connection.
Global state update
We denote V' = \{v'_k\}_{k=1:N_V} the set of updated node vectors and E' = \{e'_k\}_{k=1:N_E} the set of updated edge vectors. The updated global state vector is defined as follows:

u_{agg,edge} = \rho\big(E',\, \phi_u(u)\big)

u_{agg,node} = \rho\big(V',\, \phi_u(u)\big)

u' = \phi\big(u_{agg,edge} \oplus u_{agg,node} \oplus \phi_u(u)\big) + u

As for the edge and node updates, we integrate a skip connection.
S11.2.3 Readout operation

To predict the scalar coupling associated with an edge, we concatenate the edge vector, the two node vectors associated with the edge, and the global state vector, and pass the result through a multi-layer perceptron to generate the final output. Since there were 8 different types of scalar coupling, our model had 8 different multi-layer perceptrons to enforce a different readout for each type.

S11.3 Training and ensembling
We trained our model with the Adam [20] optimizer with a fixed learning rate and the original article's default parameters for about 150 epochs, then reduced the learning rate by a factor of 2 every 3 epochs for 15 epochs. The dimension of our graph vectors was set to 300. Our batch size was 20, a relatively small number due to the high memory requirements of the computation of the neighboring edge vector aggregation.
S11.3.1 Experiments and model variations
We observed that even models with a modest score could contribute to a better overall performance in an ensemble model. As such, we tried various architectures with different modifications. The following experiments were tested and integrated as base models:

• Different activation functions in our multi-layer perceptrons: Softplus provided a boost in performance in comparison to a ReLU baseline.
• Normalization: LayerNormalization worked better than BatchNormalization in our experiments and helped improve convergence.
• In the readout stage, rather than using only the edge and two nodes associated with a scalar coupling, we concatenated all the edge and node vectors along the chemical bond path between the two atoms for which we compute the scalar coupling. We observed faster training with this method but could not integrate it into our best single model in time.
• We iterated with different numbers of graph update operations and different numbers of hidden layers.

S11.3.2 Ensemble learning
Ensemble learning is a technique that aggregates multiple models into a single one to obtain better performance. It has become a standard in machine learning competitions. To build our final prediction, we fitted a linear model and a gradient boosting model, which we averaged. Rather than using only our 16 base models in this ensemble, we also integrated intermediate checkpoints of those 16 models to further improve the performance. In the end there were about 50 input models in the ensemble.
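A scikit-learn sketch of this blending step is shown below; the input array names, the default gradient boosting hyperparameters and the 50/50 averaging are assumptions based on the description above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

def fit_blend(base_preds, targets):
    """Sketch of the final blending: fit a linear model and a gradient
    boosting model on the base models' (and checkpoints') predictions,
    and average the two.

    base_preds: (n_samples, ~50) held-out predictions of the input models.
    targets:    (n_samples,) true scalar coupling constants.
    """
    linear = LinearRegression().fit(base_preds, targets)
    gbm = GradientBoostingRegressor().fit(base_preds, targets)

    def predict(preds):
        return 0.5 * (linear.predict(preds) + gbm.predict(preds))

    return predict
```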
We can think of a few improvements for our best single model architecture:

• Prune the number of edges considered in the neighboring edge vector aggregation by keeping only the edges closest to the updated edge. This would greatly reduce the memory consumption of the architecture, allow a bigger batch size and speed up the model. We expect that this would not degrade the performance, as the angular features that had high permutation importance in our preliminary models were related to the closest edges.
• Integrate the full chemical bond path in the readout stage mentioned in S11.3.1 to increase the training speed, as it proved efficient on other similar architectures.

S12 Solution 12
S12.1 Introduction
There has been extensive recent work using graph neural networks for predicting properties of molecular systems. Many problems along this direction require the use of configuration information (i.e., the positions of all atoms in a molecule). A key challenge in applying machine learning techniques to these problems is that of capturing local geometric information (such as bond or torsion angles) while preserving the symmetries of the overall system. Symmetries such as rotation and translation invariance are fundamental properties of molecular systems, and therefore must be exactly preserved in order for a machine learning architecture to effectively use training data and make meaningful predictions.

Our recent paper proposed the COvaRiant MOleculaR Artificial Neural neTwork (Cormorant) to help solve these issues [42]. Cormorant uses spherical tensors as activations to encode local geometric information around each atom's environment. A spherical tensor is an object F^\ell which transforms in a predictable way under a rotation. Specifically, for a rotation R \in SO(3), there is a matrix D^\ell(R) such that F^\ell \to D^\ell(R) F^\ell. Here, D^\ell(R) is known as a Wigner-D matrix, and \ell is known as the order of the representation. A tensor with \ell = 0 is a scalar quantity (that is, invariant under rotations), whereas \ell = 1 is a vector quantity such as a dipole, \ell = 2 is a second-order tensor like a quadrupole, and so on. This structure is well known in physics. For example, the multipole expansion is a decomposition of an electrostatic potential V(r) = \sum_{\ell=0}^{\infty} Q_\ell Y_\ell(\hat{r})/r^{\ell+1} into multipole moments Q_\ell and spherical harmonics Y_\ell(\hat{r}) oriented in the direction of the vector r. For a more in-depth discussion of spherical tensors, we point the reader to Ref. [42].

The use of spherical tensors allows for a network architecture that is "covariant to rotations." This means that if, at level s, a rotation is applied to all activations, F^{s,\ell} \to D^\ell(R) F^{s,\ell}, then at the next level s+1 all activations will automatically inherit that rotation, so F^{s+1,\ell} \to D^\ell(R) F^{s+1,\ell}. As such, a rotation of the input of Cormorant will propagate through the network to ensure the output transforms as well. In this way, we can capture local geometric information, but still maintain the desired transformation properties under rotations.

A key point is that the quantum algebra (SU(2)) used in the definition of NMR couplings is for our purposes equivalent (homomorphic) to the SO(3) algebra used in
Cormorant (more precisely, SU(2) is a double cover of SO(3), and both groups share a Lie algebra). The J-coupling Hamiltonian H = 2\pi\, I_i \cdot J_{ij} \cdot I_j describes the couplings J_{ij} between spin operators I_i. The spin operators are themselves objects that generate the Lie algebra SU(2), and thus transform according to Wigner-D matrices. Cormorant therefore naturally incorporates the structure of the J-coupling Hamiltonian, and we expect it to be a natural platform for learning NMR couplings.

S12.2 Model architecture
We now briefly summarize our Cormorant architecture; for a more in-depth description, see the original paper [42]. The structure of Cormorant is similar to a graph neural network, with the key difference that vertex activations F_i and edge activations G_{ij} for atoms i, j are promoted to lists of spherical tensors F_i = (F_i^0, \ldots, F_i^{\ell_{max}}) and G_{ij} = (G_{ij}^0, \ldots, G_{ij}^{\ell_{max}}), where F_i^\ell \in \mathbb{C}^{(2\ell+1) \times n_\ell} and G_{ij}^\ell \in \mathbb{C}^{(2\ell+1) \times n_\ell} are spherical tensors of order \ell, and n_\ell is the multiplicity (number of channels) of the tensor.

In order to maintain covariance, we must carefully choose our non-linearity. The core non-linearity in Cormorant is dictated by the structure of the D^\ell(R) matrices, and is known as the Clebsch-Gordan (CG) product. We express the CG product \otimes_{cg} of two spherical tensors as

A^{\ell_1} \otimes_{cg} B^{\ell_2} = \bigoplus_{\ell=|\ell_1-\ell_2|}^{\ell_1+\ell_2} C_{\ell_1 \ell_2 \ell} \big(A^{\ell_1} \otimes B^{\ell_2}\big)

where \otimes denotes a Kronecker product and the C_{\ell_1 \ell_2 \ell} are the famous Clebsch-Gordan coefficients.

Our vertex activations F_i^s at level s are chosen to be

F_i^s = \Big[ F_i^{s-1} \,\oplus\, \big(F_i^{s-1} \otimes_{cg} F_i^{s-1}\big) \,\oplus\, \Big(\sum_j G_{ij}^s \otimes_{cg} F_j^{s-1}\Big) \Big] \cdot W^{vertex}_{s,\ell}

where \oplus denotes concatenation and W^{vertex}_{s,\ell} is a linear mixing layer that acts on the multiplicity index. We choose the edge activations to have the form G_{ij}^{s,\ell} = g_{ij}^{s,\ell} \times Y^\ell(\hat{r}_{ij}), where r_{ij} is a vector pointing from atom i to atom j. The scalar-valued edge terms are given by

g_{ij}^{s,\ell} = \mu_s(r_{ij}) \Big[ \Big( g_{ij}^{s-1,\ell} \,\oplus\, \big(F_i^{s-1} \cdot F_j^{s-1}\big) \,\oplus\, \eta_{s,\ell}(r_{ij}) \Big) \cdot W^{edge}_{s,\ell} \Big]

with \mu_s(r_{ij}) a learnable mask function, \eta_{s,\ell}(r_{ij}) a learnable set of radial basis functions, and W^{edge}_{s,\ell} a linear layer along the multiplicity index. We iterate this architecture for s = 0, \ldots, s_{max}. Finally, at the last layer of Cormorant, we take the \ell = 0 component of the edge terms g_{ij}^{s_{max},\ell} and use it as a prediction of the J-coupling.

S12.3 Training and ensembling
To predict J-coupling constants, we first observed that each J-coupling is a rotationally invariant feature on pairs of atoms. Consequently, it was natural to read the values of the J-couplings off the values of g_{ij}^{s,\ell}. We therefore constructed a neural net with five Cormorant layers, with a maximum \ell value of 3. In each layer, we used 48 vertex activations for each value of \ell. We then took a linear combination of the values of g_{ij}^{s,\ell} for each layer and took a learned linear combination of these values as our prediction for the scalar coupling constant. Initial features were constructed using the radial distance between atoms, the atomic identity, and the atomic charge. In general, almost exactly the same network was used as for the QM9 tests in the Cormorant paper [42]. We also used the Mulliken charges as initial features for pretraining. Our training proceeded in three stages:

1. Three nets were trained on the 1J, 2J, and 3J couplings with the Mulliken charges, on an internal split of the training data.
2. Eight nets were trained on the 1JHC, 1JHN, 2JHH, 2JHC, etc. couplings with the Mulliken charges, on an internal split of the training data, with weights initialized to the values from stage 1.
3. Each of the nets from stage 2 was trained on the full dataset, without access to the Mulliken charges.

The internal split was constructed using an 80/10/10 train/validation/test split of the training dataset provided in the Kaggle competition. The model was optimized using the AMSGrad optimizer for 200 epochs. The learning rate was varied using cosine annealing. For stage one the initial and final learning rates were 5 × 10^{-…} and 5 × 10^{-…}, and for stages two and three the initial and final learning rates were 3 × 10^{-…} and 3 × 10^{-…}, respectively. We repeated stages two and three with different random seeds in an attempt to average over multiple trained networks; however, this had a negligible effect on the results.

In Table S6 we give the final results on the Kaggle test/train split. We achieve considerably stronger results for the 2J and 3J couplings than for the 1J couplings. However, we were pleasantly surprised by the minimal fine-tuning required to get results close to the state of the art. Future directions include directly connecting the tensors learned by Cormorant with the tensor-valued data in NMR experiments to provide more detailed inferences.

Table S6: Performance.
Dataset  | 1JHC   | 1JHN   | 2JHH   | 2JHC   | 2JHN    | 3JHH   | 3JHC   | 3JHN
Training | -2.736 | -4.834 | -4.366 | -3.481 | -5.4889 | -4.632 | -3.275 | -5.557
Test     | -1.768 | -2.224 | -3.584 | -2.856 | -3.312  | -3.552 | -2.688 | -3.496
S13 Code and data
Code and data are available at http://osf.io/kcaht and http://github.com/larsbratholm/champs_kaggle. These contain the following:

• List of molecules removed as they did not contain hydrogens.
• List of molecules removed due to one or more couplings being outliers.
• List of training and test molecules.
• Script to convert QM9 extended XYZ-files into both regularly formatted XYZ-files as well as input files for Gaussian NMR computations.
• Script to create the Kaggle dataset from the output files of Gaussian NMR computations.
• Archive of all extended XYZ-files of QM9 that passed the consistency check of the original paper.
• Archive of all regularly formatted XYZ files parsed from the extended XYZ file format of the QM9 dataset.
• Archive of all Gaussian input files.
• Archive of all Gaussian output files.
• Archive of the Kaggle dataset.
• Archive of the top 400 submissions.
• Script to create an ensemble from the top submissions.
• Script to create the plots used in the paper and SI.
• Code used by the 1st, 2nd, 3rd, 4th, 5th and 12th placed teams to create their models.

References

[1] Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules.
Scientific Data, 1, 2014.

[2] P. J. Stephens, F. J. Devlin, C. F. Chabalowski, and M. J. Frisch. Ab initio calculation of vibrational absorption and circular dichroism spectra using density functional force fields. The Journal of Physical Chemistry, 98(45):11623–11627, 1994.

[3] R. Ditchfield, W. J. Hehre, and J. A. Pople. Self-consistent molecular-orbital methods. IX. An extended gaussian-type basis for molecular-orbital studies of organic molecules. The Journal of Chemical Physics, 54(2):724–728, 1971.

[4] Michael J. Frisch, John A. Pople, and J. Stephen Binkley. Self-consistent molecular orbital methods 25. Supplementary functions for gaussian basis sets. The Journal of Chemical Physics, 80(7):3265–3269, 1984.

[5] W. J. Hehre, R. Ditchfield, and J. A. Pople. Self-consistent molecular orbital methods. XII. Further extensions of gaussian-type basis sets for use in molecular orbital studies of organic molecules. The Journal of Chemical Physics, 56(5):2257–2261, 1972.

[6] R. Krishnan, J. S. Binkley, R. Seeger, and J. A. Pople. Self-consistent molecular orbital methods. XX. A basis set for correlated wave functions. The Journal of Chemical Physics, 72(1):650–654, 1980.

[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[8] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[9] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 2020.

[10] I. Borg and P. Groenen. Modern multidimensional scaling: Theory and applications, 1997.

[11] Joseph B Kruskal. Nonmetric multidimensional scaling: a numerical method.
Psychometrika, 29(2):115–129, 1964.

[12] Joseph B Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

[13] Will Gerrard, Lars A Bratholm, Martin J Packer, Adrian J Mulholland, David R Glowacki, and Craig P Butts. IMPRESSION – prediction of NMR parameters for 3-dimensional chemical structures using machine learning with near quantum chemical accuracy. Chemical Science, 11(2):508–515, 2020.

[14] Greg Landrum. RDKit: Open-source cheminformatics.

[15] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 1263–1272. JMLR.org, 2017.

[16] Mohammed Alsuhaibani, Takanori Maehara, and Danushka Bollegala. Joint learning of hierarchical word embeddings from a corpus and a taxonomy. In Automated Knowledge Base Construction (AKBC), 2019.

[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In
Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[18] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.

[19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.

[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.

[21] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts, 2016.

[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[23] Jeremy Howard et al. fastai. https://github.com/fastai/fastai, 2018.

[24] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography.
Commun. ACM, 24(6):381–395, June 1981.

[25] Xin Dang, Hanxiang Peng, Xueqin Wang, and Heping Zhang. Theil-Sen estimators in a multiple linear regression model. Olemiss Edu, 2008.

[26] Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8(1):1–8, 2017.

[27] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[28] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[29] Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017.

[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.

[31] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, et al. Deep Graph Library: Towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315, 2019.

[32] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[33] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[34] Noel M O'Boyle, Michael Banck, Craig A James, Chris Morley, Tim Vandermeersch, and Geoffrey R Hutchison. Open Babel: An open chemical toolbox. Journal of Cheminformatics, 3(1):33, 2011.

[35] L. C. Blum and J.-L. Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13.
J. Am. Chem. Soc., 131:8732, 2009.

[36] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108:058301, 2012.

[37] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962, 1(5), 2019.

[38] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101, 2017.

[39] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.

[40] Chi Chen, Weike Ye, Yunxing Zuo, Chen Zheng, and Shyue Ping Ong. Graph networks as a universal machine learning framework for molecules and crystals. Chemistry of Materials, 31(9):3564–3572, 2019.

[41] Jörg Behler. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. The Journal of Chemical Physics, 134(7):074106, 2011.

[42] Brandon Anderson, Truong Son Hy, and Risi Kondor. Cormorant: Covariant molecular neural networks. In