Analyzing Koopman approaches to physics-informed machine learning for long-term sea-surface temperature forecasting
Julian Rice, Wenwei Xu, and Andrew August
Marine and Coastal Research, Pacific Northwest National Laboratory
California Polytechnic State University, San Luis Obispo
Computing and Analytics, Pacific Northwest National Laboratory
July 2020
Abstract
Accurately predicting sea-surface temperature weeks to months into the future is an important step toward long-term weather forecasting. Standard atmosphere-ocean coupled numerical models provide accurate sea-surface forecasts on the scale of a few days to a few weeks, but many important weather systems require greater foresight. In this paper we propose machine-learning approaches to sea-surface temperature forecasting that are accurate on the scale of dozens of weeks. Our approach is based on Koopman operator theory, a useful tool for dynamical systems modelling. With this approach, we predict sea-surface temperature in the Gulf of Mexico up to 180 days into the future based on a present image of thermal conditions and three years of historical training data. We evaluate the combination of a basic Koopman method with a convolutional autoencoder, and a newly proposed "consistent Koopman" method, in various permutations. We show that the Koopman approach consistently outperforms baselines, and we discuss the utility of our additional assumptions and methods in this sea-surface temperature domain.
The sea-surface temperature (SST) in the Gulf of Mexico is of great environmental, economic, and social significance. It regulates the low-oxygen "dead zone" [16], fuels intense tropical cyclones such as Katrina and Harvey [21], and modulates thunderstorms in the southern US [7]. The Gulf of Mexico also has unique flow features such as the Loop Current, which forms when warm subtropical waters intrude into the Gulf through the Yucatan Channel and eventually push out through the Florida Straits. The Loop Current periodically sheds warm eddies, which propagate westward in the Gulf for 9-12 months [21], causing distinct temperature dynamics. East of Florida, much of the tail end of the Loop Current forms the Gulf Stream, another unique and important flow feature that pushes warm water into the North Atlantic.

Current state-of-the-art SST predictions come from atmosphere-ocean coupled numerical models, which do not make predictions beyond a month [5][14]: for example, NOAA's Real Time Ocean Forecast System (RTOFS) only forecasts up to five days [22]. Additionally, numerical models are computationally expensive and rely on initial boundary conditions which are generally hard to obtain. It is also notable that eddy-shedding events, which involve the rapid growth of nonlinear instability, are particularly difficult to forecast in numerical models [14].

In this study, we evaluate machine-learning (ML) techniques for long-range SST forecasting. More specifically, we predict SST in the Gulf of Mexico up to 180 days into the future from a single thermal image of SST, leveraging techniques based on Koopman operator theory.
For discrete-time dynamics problems, we seek a transformation ϕ that maps a system's state z at time t to its state at the next time step:

    z_{t+1} = ϕ(z_t)    (1)

where S ⊂ R^n is the space of all possible states, z ∈ S, and ϕ : S → S. We focus specifically on systems where ϕ is time-invariant; that is, equation (1) holds for all t ∈ Z.

In dynamical systems problems where ϕ is nonlinear, a machine-learning approach is often appealing, since the system can be too complex to model analytically or numerically. However, a purely data-driven approach eschews physics knowledge that can be extremely helpful in developing meaningful and generalizable approximations of ϕ. Thus, we choose to utilize physics-informed machine learning, whereby our understanding of physical systems is incorporated into machine-learning models.

Additionally, we often do not have access to the state itself, only "observables" of it, which are simply functions of the state that are much more easily measured:

    y_t = f(z_t)    (2)

where f : S → R, and y_t ∈ R is the value of an observable f of the system's state at time t. Thus, we need a more complex formulation than Eqn. (1).

Koopman operator theory [12] has recently been (re-)discovered [4] to be a valuable tool in modelling dynamical systems: it suggests that such a nonlinear dynamical system can undergo a transformation into an infinite-dimensional space in which it evolves linearly in time. To begin, consider an observable function f, per Eqn. (2). The Koopman operator, utilizing some transformation T : S → S, is as follows:

    K_T f(x) = f ∘ T(x)    (3)

Since the Koopman operator is, mathematically, just a specific reformulation of the composition operator, its linearity follows simply [1]:

    K_T [g_1 + g_2](x) = [g_1 + g_2] ∘ T(x) = g_1 ∘ T(x) + g_2 ∘ T(x) = K_T g_1(x) + K_T g_2(x)    (4)

To apply the operator to our dynamical system, we use the system dynamics ϕ as the transformation T. Now, instead of the system evolving nonlinearly in the state space from z_t to z_{t+1} (as in Eqn. (1)), it evolves linearly in the space of the observables (with a slight abuse of notation in the last term, to make the evolution of the observable clear):

    y_{t+1} = f(z_{t+1}) = K_ϕ f(z_t) = K_ϕ y_t    (5)

Complicating matters is the fact that the Koopman operator evolves the system in an infinite-dimensional space, but this can be overcome by the "somewhat naive but useful" [1] ideation of the operator as a left-multiplication by an infinite-dimensional matrix. To make our problem computable, we can then make the assumption (common in data-driven approaches to the Koopman operator [2]) that there exists a mapping χ that encodes the majority of the operator's action into a finite-dimensional matrix C:

    C = χ ∘ K_ϕ ∘ χ^{-1}    (6)

and conversely,

    χ^{-1} ∘ C ∘ χ = K_ϕ    (7)

Finally, we can notice that applying the Koopman operator repeatedly simplifies to an expression for predicting any k ∈ N number of steps into the future:

    K_ϕ ∘ K_ϕ ∘ ... ∘ K_ϕ = χ^{-1} ∘ C ∘ χ ∘ χ^{-1} ∘ C ∘ χ ∘ ... = χ^{-1} ∘ C^k ∘ χ    (8)

Therefore we arrive at the formulation of the problem that we will utilize in building our network architectures:

    y_{t+k} = χ^{-1} ∘ C^k ∘ χ(y_t)    (9)
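To make the finite-dimensional formulation concrete, the following minimal NumPy sketch rolls an observable forward k steps per Eqn. (9). The encoder, decoder, dimensions, and dynamics matrix here are illustrative placeholders (in our networks, all three are learned, as described below), not the actual components of our models.

import numpy as np

# Minimal sketch of Eqn. (9): y_{t+k} = chi^{-1}(C^k chi(y_t)).
# The matrices below are random placeholders; in our networks the
# encoder, decoder, and dynamics are all learned from data.
rng = np.random.default_rng(0)
obs_dim, latent_dim = 10, 4

W_enc = rng.normal(size=(latent_dim, obs_dim))   # stand-in for chi
W_dec = np.linalg.pinv(W_enc)                    # stand-in for chi^{-1}
C = rng.normal(size=(latent_dim, latent_dim))    # finite-dimensional dynamics

def predict(y_t, k):
    """Encode, apply the latent linear dynamics k times, decode."""
    x = W_enc @ y_t                        # chi(y_t)
    x = np.linalg.matrix_power(C, k) @ x   # C^k in the latent space
    return W_dec @ x                       # chi^{-1}(...)

y_hat = predict(rng.normal(size=obs_dim), k=14)  # 14-step-ahead prediction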
If we properly construct networks in accordance with Eqn. (9), we know that those networks will function as good approximations of the Koopman operator. Thus we have applied physics-informed constraints to the network, though these constraints are informed by the physics of dynamical systems in general, not the specific physical laws governing a particular system. (This is not a bad thing: this generality is precisely what makes Koopman networks so useful, because they can be applied to a multitude of different dynamical systems problems.) The network is in turn more interpretable; we could, for example, analyze the encoder/decoder structure to better understand how the latent space relates to the space of the observables, or we could calculate the eigenvalues of C to gain insight into the system's stability.

Convolutions are a common operation in machine learning on spatial data. To provide a basic intuitive description, convolution works by learning a kernel that condenses multiple datapoints into one; in Fig. 1, the shaded 3x3 region of the blue input layer is condensed into a single point of the output. Convolution provides locality, by taking input from spatially close nodes in the previous layer, and translation invariance, by applying the same operation at all spatial locations in the data.

Figure 1: Example Convolution [11]
There have been various approaches to SST forecasting with physics-agnostic machine-learning models. Generic neural network (NN) approaches to SST forecasting have been developed for at least two decades [24]. Recently, Long Short-Term Memory [28][29][30] and autoencoder [8] architectures have been popular. Another interesting approach has been to learn to model the error of numerical models with a neural network [15].

The application of NNs to physics problems generally has a long history [13][3], as an abundance of data and difficulty in accurately modeling certain systems have made the approach appealing. An expressly physics-informed approach to machine learning, however, was pioneered by Raissi et al. [18][17], who utilized NNs to model partial differential equations describing physical systems. See Willard et al. [27] for a thorough survey of the physics-informed ML approach.

Recently, the application of physics-informed ML to SST forecasting has received some attention. de Bezenac et al. [6] utilize physical knowledge of advection-diffusion to predict sea-surface temperature. Erichson et al. [9] incorporate constraints inspired by Lyapunov stability in their forecasting of various fluid flows, including sea-surface temperature.

Although the Koopman operator was developed in 1931 [12], it has recently begun to receive attention for its use in analyzing dynamical systems [4]. This has spawned many machine-learning approaches to Koopman operator theory [23][20]. Our research here is inspired most immediately by Azencot et al.'s [2] recent work on "consistent Koopman" networks, in which they use SST prediction as one of many tests of their novel method.
All code is implemented in Keras (v2.4) with a TensorFlow (v2.2) backend. Code is available at https://github.com/JRice15/physics-informed-autoencoders.

We use the NOAA Optimal-Interpolation (OI) SST High-Resolution dataset [19], which measures daily snapshots of sea-surface temperature globally at 0.25° spatial resolution, by interpolating satellite, buoy, and ship measurements to minimize bias. For our experiments, we focus on a 70x150 region of this dataset consisting of the Gulf of Mexico and a portion of the Caribbean (see Fig. 2 for an example snapshot of this region). Though the thermal measurements are not actually photographic images, we may refer to them as "images" and individual locations as "pixels," as a useful shorthand.

We use three consecutive years (1095 days) as a training set, and the next year as a validation set during training. At test time, for longer-term predictions, we use the three years following the training set as the test set.

Figure 2: Example SST snapshot, day 140 in the training dataset

The measurements on land are set to zero, which cannot be confused with a temperature measurement, as that sea-surface temperature never occurs in the region. The data is then scaled to lie approximately between -1 and 1. At test time, all locations where a zero (land) measurement appears are masked away, and do not contribute to error measurements. This is also done during training for convolutional networks, so that land does not contribute to the loss. We do not bother masking during training for fully-connected networks, as we find that it is trivial for them to learn the mask in the output layer (for example, by learning the weight to be zero and the bias to be the target mask value); however, we still mask during testing so that averages are all computed identically.
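The sketch below illustrates one way the preprocessing just described might look in code; the raw-file convention assumed here (land stored as NaN) and the function names are illustrative assumptions, not taken verbatim from our released code.

import numpy as np

def preprocess(sst):
    """Zero out land and scale sea temperatures to roughly [-1, 1].

    `sst` is a (days, 70, 150) array in degrees Celsius; we assume here
    that land cells are stored as NaN in the raw files.
    """
    land_mask = np.isnan(sst[0])               # land locations never change
    sst = np.where(np.isnan(sst), 0.0, sst)    # land -> 0 (never a real SST here)
    sea = sst[:, ~land_mask]
    lo, hi = sea.min(), sea.max()
    scaled = (sst - (lo + hi) / 2.0) / ((hi - lo) / 2.0)
    scaled[:, land_mask] = 0.0                 # keep land exactly zero after scaling
    return scaled, land_mask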
2.2 Physics Informed Autoencoders

For more detailed descriptions of the following network architectures, see Appendix A. For training details, such as hardware used and learning rate schedules, see Appendix B.
To create neural networks for SST prediction, we utilize an autoencoder (AE) framework. Autoencoders are a class of neural network that compresses data to a much smaller form without losing much information, by learning an encoder (and its inverse, a decoder). Looking at our formulation of the problem in Eqn. (9), one will notice that it naturally suggests such an autoencoder architecture, with χ and its inverse acting as an encoder and decoder respectively, for each of which we will substitute a clearer notation: χ_e and χ_d. Furthermore, recent experiments (see Related Work, 1.2) suggest that sea-surface temperature evolution can be explained largely by low-dimensional dynamics, further pointing to an autoencoder approach.

Each network will thus consist of an encoder χ_e, a dynamics matrix C that can be applied multiple times, and a decoder χ_d that maps back to the space of the observables:

    ŷ_{t+k} = χ_d ∘ C^k ∘ χ_e(y_t)    (10)

We will use k = 14 (days) for all of our experiments. We develop and compare four implementations of Koopman autoencoders (KAEs) in sections 2.2.3-2.2.6, the relationships between which may be helpful to visualize:

                   No Consistency   Consistency
Fully-Connected    Simple KAE       Cons KAE
Convolutional      Conv KAE         ConvCons KAE

Table 1: Koopman Network Relationships

In order for a network to learn a proper approximation of our Koopman-theoretic formulation in Eqn. (9), we first must instruct the network to generate accurate predictions given input y_t. It can do this by learning to minimize the difference between the ground truth y_{t+1} and its prediction ŷ_{t+1}:

    ‖ŷ_{t+1} − y_{t+1}‖_MSE    (11)

where ‖·‖_MSE denotes the mean-squared error (MSE), which squares its input element-wise and then averages over all dimensions and examples.
However, since our goal is long-term prediction, we minimize the error over k steps into the future:

    L_pred = (1/k) Σ_{n=1}^{k} ‖ŷ_{t+n} − y_{t+n}‖_MSE    (12)

In addition to producing accurate predictions, we must also ensure that the network's elements properly correspond to their roles in our Koopman-theoretic framework. For C, this is achieved by making it any linear operation, in our case matrix multiplication. To ensure that χ_e and χ_d serve only to encode and decode, and do not learn any dynamics (and conversely, that C does not contribute to the encoding or decoding), we implement a second loss term:

    L_id = ‖χ_d ∘ χ_e(y_t) − y_t‖_MSE    (13)

This enforces the relation χ_d ∘ χ_e ≈ I, the identity. The loss of the network is then the sum of the terms derived in Equations (12) and (13), weighted by respective hyperparameters λ:

    L = λ_id L_id + λ_pred L_pred    (14)

This loss (sometimes with a few additional terms) will be used by all four networks below.

The Simple Koopman Autoencoder (SimpleKAE) is an autoencoder with three encoder layers, a dynamics operation, and three decoder layers. Each layer of the encoder or decoder is a standard fully-connected layer, consisting of a matrix multiplication of the weight and the input and the addition of a bias term, followed by a hyperbolic tangent activation: x_out = tanh(W_i x + b_i).

The Consistent Koopman Autoencoder (ConsKAE), recently introduced by Azencot et al. [2], extends the SimpleKAE's structure based on the assumption that in the latent space, dynamics evolve linearly both forward and backward in time. This assumption is not necessarily true in all dynamical systems; entropic systems may evolve deterministically forward in time, yet it may be impossible to reconstruct past states from present ones. Testing consistent Koopman networks will help reveal to what extent this assumption applies to this SST domain.

For this network, we assume there exists a backward-dynamics Koopman operator K_ψ in addition to the forward dynamics of K_ϕ, such that

    f(z_{t−1}) = K_ψ f(z_t)    (15)

for all t ∈ Z, where ψ (the inverse of ϕ) maps z_t to z_{t−1}. While we could technically use C^{-1} to predict backward in our machine-learning approach, Azencot et al. find that the network is better able to learn backward dynamics via a separate operation, D:

    ŷ_{t−k} = χ_d ∘ D^k ∘ χ_e(y_t)    (16)

The error of backward predictions is computed analogously to Eqn. (12) for another loss term, L_bwd. Additionally, in order to enforce that the backward and forward dynamics are consistent (meaning that they are approximate inverses), we use the following loss term:

    L_cons = Σ_{n=1}^{κ} [ (1/(2n)) ‖D_{n*} C_{*n} − I_n‖_F + (1/(2n)) ‖C_{n*} D_{*n} − I_n‖_F ]    (17)

where D_{n*} denotes the upper n rows, and D_{*n} the leftmost n columns, of D (and likewise for C). The n × n identity matrix is represented by I_n, and ‖·‖_F is the Frobenius norm. For the derivation of this term, see section 4 of Azencot et al. [2].
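As a sketch of how Eqn. (17) might be computed in TensorFlow (the function and variable names are ours, not a verbatim excerpt of our implementation):

import tensorflow as tf

def consistency_loss(C, D, kappa):
    """Consistency term of Eqn. (17) for square dynamics matrices C and D."""
    loss = 0.0
    for n in range(1, kappa + 1):
        I_n = tf.eye(n, dtype=C.dtype)
        d_rows, c_cols = D[:n, :], C[:, :n]    # D_{n*} and C_{*n}
        c_rows, d_cols = C[:n, :], D[:, :n]    # C_{n*} and D_{*n}
        loss += tf.norm(d_rows @ c_cols - I_n) / (2.0 * n)   # Frobenius norm
        loss += tf.norm(c_rows @ d_cols - I_n) / (2.0 * n)
    return loss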
The ConsKAE thus uses the following loss, with the two new loss terms weighted by respective λ hyperparameters:

    L = λ_id L_id + λ_pred L_pred + λ_bwd L_bwd + λ_cons L_cons    (18)

Azencot et al. [2] tested their model on this same SST domain, which was our inspiration for doing the same. We will compare their results with ours, though we utilized what we believe to be a more robust and informative testing scheme.

The Convolutional Koopman Autoencoder (ConvKAE) uses the same dynamics as the SimpleKAE, but uses a convolutional encoder/decoder instead of fully-connected layers. The encoder consists of four layers: the first three consist of 16-filter convolutions and hyperbolic tangent activations followed by max-pooling, while the last is a 1-filter convolution alone. The decoder is the reverse, with transposed convolutions and upsampling replacing convolutions and max-pooling, respectively. See Fig. 3 for a graphic representation.

Figure 3: Convolutional Network Architecture. The encoder translates a 70x150 snapshot into an encoded 1x24 representation in the latent space, which is transformed forward in time by the dynamics C (and optionally backward by D, if it is a consistent Koopman network), and then decoded back to a 70x150 prediction.
A combination of the previous two implementations,the Convolutional Consistent KAE (ConvConsKAE)uses both the convolutional encoding/decoding schemeof the Convolutional KAE, and paired forward andbackward dynamics of the Consistent KAE.
Persistence is what it sounds like: a model that passesits input to its output, unchanged. Comparison to thisbaseline reveals the extent to which a network learnsthe actual dynamics of the system, and not simply agood encoding-decoding scheme.
The Learned Average model (LrndAvg) learns and outputs a temporally constant spatial distribution that is the same shape as its input. This distribution is learned by minimizing the mean-squared error between its inputs and its output, so that over the course of training it learns a value akin to the weighted average at each spatial location. Comparison to LrndAvg reveals whether a model has devolved to guessing the average and is not taking its input into account, or, perhaps even worse, whether it is making a legitimate effort that is still no better than guessing the average.
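A hedged sketch of how such a baseline can be written as a Keras layer (the class and variable names are illustrative, not from our codebase):

import tensorflow as tf

class LearnedAverage(tf.keras.layers.Layer):
    """A trainable constant map, broadcast over the batch and fit with MSE
    against its own inputs, so it converges toward a per-pixel average."""

    def build(self, input_shape):
        self.avg = self.add_weight(name="avg", shape=input_shape[1:],
                                   initializer="zeros", trainable=True)

    def call(self, inputs):
        # The input values are ignored; only the batch size is used.
        return tf.broadcast_to(self.avg, tf.shape(inputs))

# model = tf.keras.Sequential([LearnedAverage()])
# model.compile(optimizer="adam", loss="mse")
# model.fit(train_images, train_images, ...)   # targets are the inputs themselves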
We trained each model on 12 different seed values, meaning 12 trials with unique network initializations. To test the models for long-term prediction, we record each seed's average performance on 180-day prediction over three years of test data (see section 2.1 for a description of this dataset). In testing, we evaluate the following metrics, in terms of a prediction ŷ_t and target y_t:

• Relative prediction error (RelPred) measures the error scaled by the intensity of the target:

    E_relpred = ‖ŷ_t − y_t‖_F / ‖y_t‖_F    (19)

This is used so that the errors of hot and cold months can be compared on more equal footing.

• Celsius mean-absolute error (MAE) is simply the average absolute difference between ŷ_t and y_t, in units that are universally interpretable (unlike relative prediction error). We will prefer these units because of their universal meaning.

We also measured mean-squared error and mean-absolute error, though they will not appear here, as we find they do not add any information not already contained in the two metrics above. We calculate not only multiple metrics but multiple aggregations of those metrics among the 12 trials of each model: average, median, min, max, and standard deviation.
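For concreteness, the two metrics can be computed roughly as follows; the mask and rescaling conventions are ours, and `scale` stands for whatever factor undoes the [-1, 1] normalization.

import numpy as np

def relative_prediction_error(y_pred, y_true, sea_mask):
    """Eqn. (19): Frobenius-norm error scaled by the target's intensity,
    computed over sea pixels only."""
    diff = (y_pred - y_true)[sea_mask]
    return np.linalg.norm(diff) / np.linalg.norm(y_true[sea_mask])

def celsius_mae(y_pred, y_true, sea_mask, scale):
    """Mean absolute error converted back to degrees Celsius; `scale` maps
    normalized units back to degrees."""
    return scale * np.mean(np.abs((y_pred - y_true)[sea_mask]))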
To begin, we evaluate the baselines' performances in Fig. 4. LrndAvg stays relatively flat (as expected), with the error hovering around 1.8 degrees. The Persistence model starts off well, but rapidly diverges to almost twice the error of the LrndAvg. While the LrndAvg model represents a good target to beat, Persistence is more of an upper bound on how reasonably poorly a model can perform.

Figure 4: Celsius MAE of Baselines

In our experiments, the SimpleKAE appears to have performed the best, followed by the ConvKAE, ConsKAE, and lastly the ConvConsKAE. This order is present in the average (Fig. 6) and median aggregations of model seeds. It is the same according to the minimum aggregation (Fig. 7), except that the ConvConsKAE performs better than the ConsKAE.

Figure 6: Average Celsius MAE of Koopman Models

Figure 7: Minimum Celsius MAE of Koopman Models

The median and maximum errors of each model finish in identical order to the average. It should be noted that the median is noticeably lower than the average, suggesting high outliers. Using the relative prediction error metric yields a very similar picture, enough so that it need not be included.
Complicating the above results, we must note that there is a relatively large variance within each model's 12 seeds. Take for example the spread of the best-performing model's seeds in Fig. 8, which confirms the existence of high outliers.

Figure 5: Selected day 30, 100, and 180 ground truths (top) and best SimpleKAE seed predictions (bottom). Though the predictions lack the fine-grained detail of the ground truth (understandable, given that the predictions are decoded from a vector of length 12), they follow the ground truth remarkably closely on the large scale. Note the similarity to ground truth at day 180, including a fairly accurate prediction of both the Loop Current and the Gulf Stream.

Figure 8: SimpleKAE Seeds

There is great variation around their mean, most notably on the high end, with some seeds finishing at double the mean error. Interestingly, it is nearly impossible to tell from prediction performance during the first 25 or so days whether a given seed will perform well or poorly at 180 days. Similarly, training performance is not very indicative of how well a seed will perform at long-term prediction; all models achieved very similar training losses (which take into account only 14 prediction days). This suggests that during training, the model finds minima in the optimization space for each seed that perform similarly in the short term, but that some of those minima are much more or less stable than others for long-term forecasting. Fig. 9 shows all models' variances, with 90% confidence bounds calculated using the standard deviation of the seeds from their mean.

Figure 9: Average Celsius MAE with 90% Confidence

The key conclusion from Fig. 9 is that many confidence intervals overlap with other means, suggesting that we may not be able to say with certainty that the means are significantly different. To this end, we apply Welch's t-test [26] to determine how confident we are that the day 180 means are in fact different.
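A minimal sketch of this test using SciPy (the error values below are synthetic stand-ins, not our reported results):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy stand-ins for two models' 12 per-seed day-180 errors; the real
# values come from the experiments reported above.
errors_a = rng.normal(loc=1.3, scale=0.3, size=12)
errors_b = rng.normal(loc=1.6, scale=0.4, size=12)

# Welch's t-test: a two-sample t-test that does not assume equal variances.
t_stat, p_value = stats.ttest_ind(errors_a, errors_b, equal_var=False)
print(f"confidence that the means differ: {100 * (1 - p_value):.1f}%")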
Table 2: Confidence, per Welch's t-test, that the day-180 means of any two methods differ; we require at least 90% to consider a difference significant.

Table 2 shows the confidences that the means of any two methods are in fact significantly different. We chose a confidence threshold of 90% because of our small sample sizes, though a 95% confidence would not have made drastic changes (only the SimpleKAE vs. ConsKAE significance). From this data, we cannot conclude that the SimpleKAE performs any better than the ConvKAE. In fact, for all Koopman methods, we can see that the differences between one method and the next-best method (according to their day 180 average MAE; see Fig. 6) are not significant. However, analysis in terms of the models' relationships per Table 1 reveals that there are statistically significant differences between consistent Koopman methods and their non-consistent counterparts (though there are not significant differences between non-convolutional and convolutional methods), at least at 90% confidence. Given that using a 95% confidence bound would change this relation, we cannot be confident that our results do not simply reflect a bad batch of seeds for the consistent methods.
Our results consistently validate a Koopman-theoretic approach to long-term SST forecasting, a challenging dynamical systems problem. The best initializations of all Koopman methods significantly outperformed both the Persistence and LrndAvg baselines.

Our results suggest that the consistent Koopman methods do not tend to perform as well as their non-consistent counterparts in this SST domain. This contradicts the results of Azencot et al. [2], who reported the following data:

Model       RelPred Error
            Min     Max     Avg.
SimpleKAE   0.315   1.062   0.701
ConsKAE     0.269   1.028   0.632

Table 3: Day 180 relative prediction error on the SST dataset, with 18 seeds [2]

They report an improvement in relative prediction error of 0.069, but with a variation between min and max of 0.759. Without further information, it is difficult to know what significance to assign these results. We find the following, using the same relative prediction error metric:

Model       RelPred Error
            Min     Max     Avg.
SimpleKAE   0.292   0.863   0.451
ConsKAE     0.412   1.151   0.583

Table 4: Our day 180 relative prediction error, with 12 seeds

We find approximately double the difference between means, but in the opposite direction, though we do find a similar min-max spread. We attribute the differences primarily to our more robust testing scheme: while Azencot et al. test on the 30 days following the training set, we test on the three years following it. However, they tested 18 seeds to our 12, suggesting our results may be more susceptible to a bad batch of seeds. There are also slight differences in the number of epochs trained and in the learning rate schedules of the models, though we do not expect these differences to have a great effect.

On other datasets where it is known that consistency applies, such as computer-generated data of ideal fluid flow past a cylinder, where past states can be predicted from the present just as easily as future states, Azencot et al. demonstrated significant improvements over the SimpleKAE. Their results are much less conclusive in the more challenging real-world domain of SST forecasting, however. Given the unexpectedly large variation between seeds and our differing results, it is hard to draw conclusions as to whether the consistent method is better than its simple counterpart. At the very least, it seems reasonable to say that Koopman consistency is not especially beneficial for forecasting in this domain, though possibly not especially harmful either.

Our experiments also suggest that the assumptions inherent in a convolutional encoder/decoder architecture are not especially beneficial. Although the locality of the data seems intuitively correct, it is possible that assuming translation invariance in the encoding scheme makes it more difficult to create encodings that extract the important information from all regions; for example, the same filter may not be able to extract the most useful encoding from both the coast of Louisiana and the surroundings of Cuba.

Thus our primary conclusion is that informed machine learning is only beneficial when the right prior information and assumptions are used. Though our experiments have validated the effectiveness of a Koopman-theoretic method, the additional assumptions inherent in convolutional and consistent Koopman methods do not seem to be especially beneficial in the complex and challenging domain of long-term sea-surface temperature forecasting.
Acknowledgements
This work was supported in part by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Science Undergraduate Laboratory Internships Program (SULI).

Many thanks to N. Benjamin Erichson for his guidance in understanding his recent paper, and for donating his nearly-expired computing hours to us.

Thanks also to Pacific Northwest National Laboratory, and specifically to PNNL Research Computing, for giving us access to the computing resources to complete this project.
References

[1] Hassan Arbabi. Introduction to Koopman operator theory of dynamical systems, June 2018.
[2] Omri Azencot, N. Benjamin Erichson, Vanessa Lin, and Michael W. Mahoney. Forecasting sequential data using consistent Koopman autoencoders. arXiv preprint arXiv:2003.02236, 2020.
[3] M. Baymani, Sohrab Effati, Hamid Niazmand, and A. Kerayechian. Artificial neural network method for solving the Navier-Stokes equations. Neural Computing and Applications, 26:765-773, 2014.
[4] Marko Budišić, Ryan Mohr, and Igor Mezić. Applied Koopmanism. Chaos: An Interdisciplinary Journal of Nonlinear Science, 22(4):047510, 2012.
[5] François Counillon and Laurent Bertino. High-resolution ensemble forecasting for the Gulf of Mexico eddies and fronts. Ocean Dynamics, 59(1):83-95, 2009.
[6] Emmanuel de Bezenac, Arthur Pajot, and Patrick Gallinari. Deep learning for physical processes: Incorporating prior scientific knowledge. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124009, 2019.
[7] Roger Edwards and Steven J. Weiss. Comparisons between Gulf of Mexico sea surface temperature anomalies and southern US severe thunderstorm frequency in the cool season, pages 317-320, 1996.
[8] N. Benjamin Erichson, Lionel Mathelin, Zhewei Yao, Steven L. Brunton, Michael W. Mahoney, and J. Nathan Kutz. Shallow learning for fluid flow reconstruction with limited sensors and limited data. arXiv preprint arXiv:1902.07358, 2019.
[9] N. Benjamin Erichson, Michael Muehlebach, and Michael W. Mahoney. Physics-informed autoencoders for Lyapunov-stable fluid flow prediction. arXiv preprint arXiv:1905.10866, 2019.
[10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249-256, Chia Laguna Resort, Sardinia, Italy, May 2010. PMLR.
[11] Irhum Shafkat. Intuitively understanding convolutions for deep learning. https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1, June 2018. Accessed 8/13/2020.
[12] Bernard O. Koopman. Hamiltonian systems and transformation in Hilbert space. Proceedings of the National Academy of Sciences of the United States of America, 17(5):315, 1931.
[13] Michele Milano and Petros Koumoutsakos. Neural network modeling for near wall turbulent flow. Journal of Computational Physics, 182(1):1-26, 2002.
[14] L.-Y. Oey, Tal Ezer, George Forristall, C. Cooper, Steven DiMarco, and S. Fan. An exercise in forecasting loop current and eddy frontal positions in the Gulf of Mexico. Geophysical Research Letters, 32(12), 2005.
[15] Kalpesh Patil, M. C. Deo, and M. Ravichandran. Prediction of sea surface temperature by combining numerical and neural techniques. Journal of Atmospheric and Oceanic Technology, 33(8):1715-1726, 2016.
[16] Nancy N. Rabalais, R. Eugene Turner, and William J. Wiseman Jr. Gulf of Mexico hypoxia, a.k.a. "the dead zone". Annual Review of Ecology and Systematics, 33(1):235-263, 2002.
[17] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics informed deep learning (part II): Data-driven discovery of nonlinear partial differential equations. arXiv preprint arXiv:1711.10566, 2017.
[18] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561, 2017.
[19] Richard W. Reynolds, Thomas M. Smith, Chunying Liu, Dudley B. Chelton, Kenneth S. Casey, and Michael G. Schlax. Daily high-resolution-blended analyses for sea surface temperature. Journal of Climate, 20(22):5473-5496, 2007.
[20] Samuel E. Otto and Clarence W. Rowley. Linearly recurrent autoencoder networks for learning dynamics. SIAM Journal on Applied Dynamical Systems, 18(1):558-593, 2019.
[21] Lynn K. Shay, Gustavo J. Goni, and Peter G. Black. Effects of a warm oceanic feature on Hurricane Opal. Monthly Weather Review, 128(5):1366-1383, 2000.
[22] T. Spindler, I. Rivin, and L. Burroughs. NOAA real-time ocean forecasting system RTOFS (Atlantic). National Weather Service, EMC/MMAB Technical Procedure Bulletin, 2006.
[23] Naoya Takeishi, Yoshinobu Kawahara, and Takehisa Yairi. Learning Koopman invariant subspaces for dynamic mode decomposition. In Advances in Neural Information Processing Systems, pages 1130-1140, 2017.
[24] F. T. Tangang, W. W. Hsieh, and B. Tang. Forecasting the equatorial Pacific sea surface temperatures by neural network models. Climate Dynamics, 13(2):135-147, 1997.
[25] Laura von Rueden, Sebastian Mayer, Jochen Garcke, Christian Bauckhage, and Jannis Schücker. Informed machine learning - towards a taxonomy of explicit integration of knowledge into machine learning, March 2019.
[26] B. L. Welch. The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34(1-2):28-35, 1947.
[27] Jared Willard, Xiaowei Jia, Shaoming Xu, Michael Steinbach, and Vipin Kumar. Integrating physics-based modeling with machine learning: A survey. arXiv preprint arXiv:2003.04919, 2020.
[28] Changjiang Xiao, Nengcheng Chen, Chuli Hu, Ke Wang, Zewei Xu, Yaping Cai, Lei Xu, Zeqiang Chen, and Jianya Gong. A spatiotemporal deep learning model for sea surface temperature field prediction using time-series satellite data. Environmental Modelling & Software, 120, 2019.
[29] Y. Yang, J. Dong, X. Sun, E. Lima, Q. Mu, and X. Wang. A CFCC-LSTM model for sea surface temperature prediction. IEEE Geoscience and Remote Sensing Letters, 15(2):207-211, 2018.
[30] Q. Zhang, H. Wang, J. Dong, G. Zhong, and X. Sun. Prediction of sea surface temperature using long short-term memory. IEEE Geoscience and Remote Sensing Letters, 14(10):1745-1749, 2017.
A Network Architectures
Other than dynamics layers, all weights are initialized with the Glorot Normal (a.k.a. Xavier Normal) initializer [10], while all biases are initialized to zero. Forward dynamics weights are initialized by taking the singular value decomposition of a matrix generated via a random normal distribution, and then multiplying U and V^T to create the weight. Backward dynamics weights, when applicable, are initialized with the pseudo-inverse of the forward weights' transpose. Neither dynamics layer uses a bias term.
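A sketch of this initialization in NumPy (the function and variable names are ours, for illustration):

import numpy as np

def init_dynamics_weights(latent_dim, seed=0):
    """Forward weight C = U @ V^T from the SVD of a random normal matrix
    (a random orthogonal matrix); backward weight D is the pseudo-inverse
    of the forward weight's transpose."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(latent_dim, latent_dim))
    U, _, Vt = np.linalg.svd(A)
    C = U @ Vt
    D = np.linalg.pinv(C.T)
    return C, D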
A.1 Fully-Connected Networks

Component   Layer               Activation   (Output) Shape
Input       -                   -            10500
Encoder     Fully-Connected 1   tanh         96
            Fully-Connected 2   tanh         96
            Fully-Connected 3   -            12
Dynamics    Fully-Connected     -            12
Decoder     Fully-Connected 1   tanh         96
            Fully-Connected 2   tanh         96
            Fully-Connected 3   -            10500

Table 5: Fully-Connected Architecture. This architecture was developed by Azencot et al. [2], who performed a parameter search when developing this model.
A.2 Convolutional Networks

Component   Layer                Kernel Size   Size Modification            Activation   (Output) Shape
Input       -                    -             -                            -            70x150
Encoder     Convolution 1        3x3           2x2 max-pooling              tanh         34x74x16
            Convolution 2        3x3           2x2 max-pooling              tanh         16x36x16
            Convolution 3        5x5           2x2 max-pooling              tanh         6x16x16
            Convolution 4        5x5           flattening                   -            24
Dynamics    Fully-Connected      -             -                            -            24
Decoder     Zero Padding         -             unflatten, 0x0x1x1 padding   -            7x17
            Convolution 1        5x5           -                            tanh         7x17x16
            Convolution 2        5x5           2x2 upsampling               tanh         22x42x16
            Convolution 3        3x3           2x2 upsampling               tanh         48x88x16
            Convolution 4        3x3           2x2 upsampling               -            100x180
            Cropping & Masking   -             crop to 70x150               -            70x150

Table 6: Our Convolutional Architecture

At the beginning of the decoder, one row and one column of zero padding are applied to the bottom and right of the 6x16 snapshot, and at the end of the decoder the output is cropped to the target shape, to make the input and output shapes of the AE the same. All locations where land is present are masked away in both the network's output and in the target images, so that land does not affect the loss (see section 2.1.1). All upsampling uses bilinear interpolation. Output shape is given in the form (height x width x channels).

Though we did not perform an exhaustive parameter search, we did explore many different avenues when developing this architecture. Here are some of the options we explored, with the reason we rejected each when that reason was more complex than simply worse performance, as guidance for those seeking to improve upon our results:

• Batch normalization: we suspect this failed because of the seasonality of SST. When the data is normalized, it can be more difficult to tell the difference between warmer and colder states that are equally distributed.

• ReLU activation: we suspect that ReLU performed poorly because it loses information by design, which we want to minimize in the AE framework. Additionally, because of its simplicity relative to the hyperbolic tangent, we would likely need a deeper network to get similar results. We found that ReLU often devolved toward the LrndAvg's output.

• Various numbers of convolution output channels: we use 16/16/16/1; with fewer than that, such as 6/6/6/1, the AE's output was not as smooth and did not perform as well. With more than that, such as 16/32/64/1, the model became unstable, and would often diverge suddenly during long-term prediction at test time.

• Convolutional dynamics: we explored this briefly, since convolution is also a linear operator, but it performed very poorly.

• Average pooling and nearest-neighbor upsampling.

• Kernel size ordering: we use the larger kernels for the more compressed sizes, to reduce the number of operations and thus increase the speed of the network.
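As a reference point, the encoder half of Table 6 can be written in Keras roughly as follows; 'valid' convolution padding is inferred from the listed output shapes, and the decoder (zero padding, transposed convolutions, upsampling, cropping, and masking) is omitted for brevity.

from tensorflow import keras
from tensorflow.keras import layers

encoder = keras.Sequential([
    layers.Conv2D(16, 3, activation="tanh", input_shape=(70, 150, 1)),
    layers.MaxPooling2D(2),                    # -> 34x74x16
    layers.Conv2D(16, 3, activation="tanh"),
    layers.MaxPooling2D(2),                    # -> 16x36x16
    layers.Conv2D(16, 5, activation="tanh"),
    layers.MaxPooling2D(2),                    # -> 6x16x16
    layers.Conv2D(1, 5),                       # -> 2x12x1
    layers.Flatten(),                          # -> 24 (the latent vector)
])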
B Training
Nvidia Tesla K80 and P100-PCIE GPUs were used for training. 14 forward prediction steps (and 14 backward, for consistent KAEs) were used for training losses. All models were trained for 2000 epochs (with early stopping) with the Adam optimizer, 0.5 gradient-norm clipping, and a batch size of 64. The learning rate followed a schedule, starting at 0.01 and reducing by a factor of 0.6 (lr_new = 0.6 × lr_old).
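In Keras terms, this configuration corresponds roughly to the following sketch; the plateau and early-stopping patience values are assumptions, since the exact triggers are not recorded here, and `koopman_loss` stands for the loss of Eqn. (14) or (18).

from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=0.01, clipnorm=0.5)

callbacks = [
    # lr_new = 0.6 * lr_old when validation loss plateaus (patience assumed)
    keras.callbacks.ReduceLROnPlateau(factor=0.6, patience=20),
    keras.callbacks.EarlyStopping(patience=100, restore_best_weights=True),
]

# model.compile(optimizer=optimizer, loss=koopman_loss)
# model.fit(train_x, train_y, epochs=2000, batch_size=64,
#           validation_data=(val_x, val_y), callbacks=callbacks)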