CKNet: A Convolutional Neural Network Based on Koopman Operator for Modeling Latent Dynamics from Pixels
Yongqian Xiao, Xin Xu, and Lilin Qian
Abstract—For systems whose only available observations are pixels, it is difficult to identify the underlying dynamics, especially with a linear operator. In this work, we present a convolutional neural network based on the Koopman operator (CKNet) to identify latent dynamics from raw pixels. CKNet learns an encoder and a decoder that play the roles of the Koopman eigenfunctions and modes, respectively, and the Koopman eigenvalues can be approximated by the eigenvalues of the learned system matrix. We present deterministic and variational approaches to realize the encoder. Because CKNet is trained under the constraints of Koopman theory, the identified dynamics is linear, controllable, and physically interpretable. Besides, the system matrix and control matrix are trained as trainable tensors. To improve performance, we propose an auxiliary weight term for the multi-step linearity and prediction losses. Experiments on two classic forced dynamical systems with continuous action spaces show that the identified 32-dimensional dynamics can predict validly for 120 steps and generate clear images.
Index Terms—Koopman operator, latent dynamics, raw pixels, deep learning.
I. INTRODUCTION

As the identified model is described linearly, model identification with the Koopman operator has attracted considerable attention and achieved great success in recent years. With an identified model, control performance can be improved by predicting future states and evaluating the corresponding losses. Apart from deep learning-based methods, two main Koopman modeling approaches have been proposed in recent years. One is dynamic mode decomposition (DMD) [1], which applies singular value decomposition (SVD) to extract intrinsic features and approximates the Koopman eigenvalues, eigenfunctions, and modes with the eigenvalues and their corresponding right and left eigenvectors. The other is extended dynamic mode decomposition (EDMD) [2] and its kernel variant KDMD [3], which transform modeling into a supervised learning problem solved with least-squares methods. Koopman operator-based approaches have been applied to approximate system dynamics in many fields, such as fluid dynamics [4], power systems [5], [6], molecular conformation analysis [7], and robotic systems [8].

To overcome the limitation that DMD and EDMD are only applicable to unforced systems, extensions for forced systems were designed by treating the state and control input as an augmented matrix [9], [10]. In this way, nonlinear forced systems can be described linearly, so linear control theory can be applied naturally after planning, or directly for systems with explicit reference states [11], [12].

Theoretically, the Koopman operator can accurately describe nonlinear systems globally in an infinite-dimensional invariant subspace [13], [14]. In practice, however, an infinite-dimensional operator cannot be realized. We usually construct a linear operator in a higher-dimensional space, created by lifting the original state space with basis functions, to approximate the Koopman operator. Basis functions have a decisive effect on modeling performance. They can be constructed with kernel functions, e.g., radial basis functions (RBFs), but designing basis functions demands strong experience and lacks theoretical guidance. Besides, when the state dimension and dataset scale are extraordinarily large, it is intractable to load all data into memory and execute the SVD or pseudo-inverse operations, and designing proper basis functions becomes even more complex; DMD, KDMD, EDMD, and kernel EDMD become infeasible.

Deep learning has natural advantages for complex function fitting. For the basis-function design problem, we can train a neural network to take the place of the basis functions instead of manually trying different kernel functions with different hyper-parameters. From this perspective, notable research combining deep learning and the Koopman operator has fueled applications in many fields, such as fluid dynamics [15], power grids [16], vehicle dynamics [17], molecular kinetics [18], atomic-scale dynamics [19], highway traffic dynamics [20], and chaotic systems [21].

Yongqian Xiao and Xin Xu are with the College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China. E-mail: [email protected].
These works usually adopt an auto-encoder (AE) framework in which the encoder approximates the Koopman eigenfunctions. Some works calculate the system and control matrices from the sequence of latent states output by the encoder [22], [23], while others treat the system and control matrices as trainable weights [17], [24].

Previous research mainly focuses on low-dimensional systems. However, in many circumstances, reasons such as the expensive cost of high-precision sensors prevent us from acquiring real-time value-wise intrinsic states, whereas high-dimensional pixel-wise observations can be captured with low-cost cameras. Further, pixel-wise observations, such as images or lidar point clouds, usually include much invalid information and noise, which makes it intractable to control a system from raw images alone. To deal with such situations, learning-based algorithms are usually applied, such as learning-based nonlinear MPC (LB-NMPC) [25] and the family of deep reinforcement learning (DRL) algorithms [26], [27]. This end-to-end style leads to poor efficiency; therefore, encoding-based approaches were proposed that learn a neural network model or construct an encoder to improve learning efficiency, such as MuZero [28], CURL [29], and PlaNet [30]. CURL only extracts features as the state for RL algorithms, without predictive ability, while MuZero and PlaNet learn an end-to-end neural network model that takes the extracted feature vector as the latent state. Unlike these methods, after features are extracted with the encoder, we regard the feature vector as the state of a latent system and adopt EDMD theory to approximate the Koopman operator of this latent system, resulting in interpretable linear dynamics instead of a nonlinear end-to-end neural network model.
In this way, we can predict linearly with the identified dynamics at a small computational cost.

Currently, few works focus on approximating the Koopman operator of dynamical systems that take raw pixels as the state. A DMD-based deep learning framework was constructed for background/foreground extraction and video classification [31]. However, it focused on unforced systems and adopted a hierarchical manner that first trains an AE and then performs DMD. DeepKoCo [32] is a similar work in which a deterministic encoder outputs the Koopman eigenfunctions directly; its system matrix consists of Jordan blocks, and its control matrix is subject to the prediction and reconstruction of the next observation. In this work, we adopt EDMD theory with both deterministic and variational approaches: the encoder outputs basis functions that are linearly correlated with the Koopman eigenfunctions, and the system and control matrices are treated as trainable tensors. Namely, after training we obtain fixed system and control matrices, on which we can perform controllability analysis of the identified latent dynamics.

The main contributions of this work are three-fold:

1) A convolutional neural network based on the Koopman operator is proposed and realized with deterministic and variational approaches for modeling latent dynamics from raw pixels.

2) An auxiliary weight term is proposed for the multi-step linearity and prediction losses to improve prediction performance. Comparison experiments study the influence of different auxiliary weights on different losses.

3) The deterministic and variational approaches are applied to identify two classic physical systems with continuous action spaces. The results show that the proposed method is valid for identifying latent dynamics from raw pixels and that the identified dynamics are controllable.

II. DESIGN OF CKNET

In this section, we detail how to design CKNet to approximate the Koopman operator of discrete-time unforced and forced dynamical systems that take pixel-wise matrices as states. We also give the method for sampling basis functions when the variational encoder is adopted.
A. CKNet for Unforced Systems
Consider an unforced discrete-time system

x_{k+1} = f(x_k)    (1)

where x ∈ R^{c×h×w} ∈ M denotes the state of the dynamical system f in the original high-dimensional space M; it consists of c images of height h and width w. We utilize a convolutional neural network (CNN) to approximate the Koopman eigenfunctions. The Koopman operator K is defined by

(K ϕ)(x_k) = ϕ(f(x_k))    (2)

where ϕ ∈ H are the Koopman eigenfunctions and H is usually an infinite-dimensional space. In this manner, the unforced system f can be described as linear dynamics that evolves linearly in H:

x_{k+p} = Σ_{i=1}^{∞} ζ_i (K^p ϕ_i)(x_k) = Σ_{i=1}^{∞} ζ_i μ_i^p ϕ_i(x_k)    (3)

where K^p denotes p applications of the Koopman operator, μ_i is the i-th Koopman eigenvalue corresponding to the i-th Koopman eigenfunction ϕ_i, and ζ_i is the i-th Koopman mode, which remaps states back to M.

Although the Koopman operator acts in an infinite-dimensional space, it attracts attention because of its linearity. DMD-type methods approximate the Koopman operator by approximating the Koopman eigenvalues, eigenfunctions, and modes. In this work, unlike DMD-type methods, we propose a deep learning framework based on EDMD to approximate the Koopman operator; because of space limitations, readers wanting more details on EDMD may refer to [2], [11]. As shown in Fig. 1, CKNet expands a low-dimensional subspace V via the encoder φ, which extracts intrinsic dynamical features as basis functions to play the role of the Koopman eigenfunctions. Meanwhile, a nonlinear CNN decoder is designed to play the role of the linear Koopman modes, transforming latent states from V back to pixels in the original space. Therefore, the unforced system f can be approximated via CKNet:

φ(x_{k+p}) ≐ φ_K(x_{k+p}) = K^p φ(x_k) = A^p φ(x_k),   x̂_k = φ̃(φ(x_k))    (4)

where K is an approximation of the Koopman operator in V, represented by the square matrix A; φ(x) ∈ R^L and φ_K(x) ∈ R^L denote the latent states acquired via the encoder and the K operator, respectively; and φ̃(φ(x)) ∈ R^{c′×h×w} denotes the output of the decoder, where c′ is a hyper-parameter equal to c or 1. After training, the Koopman eigenvalues are approximated by the eigenvalues of A. Note that the basis functions are linearly correlated with the Koopman eigenfunctions; that is, ϕ_i = a_i^T φ(x), where a_i is the right eigenvector corresponding to the i-th eigenvalue.

With the deterministic approach, the encoder outputs the basis functions φ(x) directly. With the variational approach, the basis functions are sampled from a learned Gaussian distribution. To enable back-propagation, the reparameterization trick is applied to sample the basis functions during training:

φ(x) = μ_φ(x) + exp(ln σ_φ(x)) ⊙ ξ    (5)

where μ_φ and σ_φ are the mean and standard deviation of the learned Gaussian distribution, μ_φ and ln σ_φ are output by the variational encoder, and ξ ~ N(0, I) is a noise vector drawn from a standard normal distribution.

Fig. 1. The framework of CKNet. (a) The encoder of CKNet expands a finite space V as an invariant subspace of H; it outputs basis functions to take the place of the Koopman eigenfunctions. We construct the encoder in two ways: the deterministic approach, shown in (a.1), outputs basis functions directly after the MLP, while the variational approach, shown in (a.2), samples from the learned Gaussian distribution. (b) We adopt a recursive way to realize multi-step training. A high-dimensional nonlinear system can be described as low-dimensional linear dynamics φ(x_{k+1}) = A φ(x_k) + B u_k in V, where the system matrix A and control matrix B are obtained as trainable tensors; CKNet is also applicable to unforced systems, where the input u_k is constantly 0. (c) The decoder has the reverse structure of the encoder and plays the role of the Koopman modes, mapping the latent state from the subspace V back to the original observation space.

B. CKNet for Forced Dynamics
In this section, we focus on approximating forced dynamics with CKNet. Consider a discrete-time forced system

x_{k+1} = f(x_k, u_k)    (6)

where x_k ∈ R^{c×h×w} and u_k ∈ R^n are the state and control input of the system f. There are several methods of extending the Koopman operator to forced systems that take value-wise vectors as states [9], [10], [11]; in this work, we adopt the method in [11]. The Koopman operator of (6) can be described as

(K ϕ)(X_k) = ϕ(f(X_k))    (7)

where X_k = [x_k; u_k] is the extended state of the dynamics. Similarly, CKNet is applicable for approximating the Koopman operator in (7):

φ(x_{k+p}) ≐ φ_K(x_{k+p}) = G^p Ψ(x_k),   x̂_k = φ̃(φ(x_k))    (8)

where G = [A B] is the approximating operator for the forced dynamics and Ψ(x_k) = [φ(x_k); u_k] is the extended state in V.

In EDMD, the system matrix A and control matrix B are solved by constructing a least-squares problem over a dataset of snapshot pairs. However, this is not feasible for large-scale dynamical systems. In this work, we treat A and B as trainable tensors and train them in a mini-batch manner. In particular, we perform controllability analysis of the identified dynamics in the subspace V during the training process. The approximated discrete-time linear dynamics is controllable if the following matrix S has full rank:

S = [B  AB  A²B  ...  A^{L−1}B],   R = Rank(S)    (9)

C. Loss functions for CKNet
CKNet is a general framework based on classic EDMD theory, and it extends the scope of application mainly in three aspects. First, CKNet adopts multi-step loss functions to improve approximation performance, while EDMD adopts a one-step loss. Second, CKNet is applicable to dynamical systems with pixel-wise inputs, while EDMD is only applicable to value-wise systems. Third, CKNet adopts mini-batch training, whereas EDMD is trained by solving a least-squares problem over the whole training set, which is infeasible for high-dimensional systems.

To strengthen the linear accuracy of the identified model in V, we use a linearity loss term to constrain the encoder. A multi-step linearity loss is applied so that the dynamics is identified better from a global perspective, allowing more prediction steps without divergence:

L_linear = (1/p_l) Σ_{i=1}^{p_l} ϱ_i ‖φ(x_{k+i}, θ_e) − φ_K(x_{k+i})‖_F = (1/p_l) Σ_{i=1}^{p_l} ϱ_i ‖φ(x_{k+i}, θ_e) − G^i Ψ(x_k)‖_F    (10)

where the encoder φ is parameterized by trainable weights θ_e, ϱ_i is the weight of the i-th step linear prediction, and G^i denotes i steps of linear recursion from a state φ(x_k, θ_e) with a control sequence u_k, ..., u_{k+i−1} in V, calculated as follows:

G^i Ψ(x_k) = φ_K(x_{k+i}) = A φ_K(x_{k+i−1}) + B u_{k+i−1} = ··· = A^i φ(x_k, θ_e) + Σ_{j=1}^{i} A^{j−1} B u_{k+i−j}    (11)

If we train the encoder, A, and B only with the constraint (10), we can still acquire linear approximated dynamics in V, provided we do not need to recover the corresponding pixels. However, this training scheme gradually drives the encoder, A, and B toward zero, which results in an invalid model. To avoid this problem, a reconstruction loss function is included.
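To make the recursion concrete, the sketch below rolls out (11) step by step in NumPy and checks it against the closed-form expansion. The matrices A, B, the initial latent state, and the controls are small random stand-ins for the trained CKNet tensors, not values from the paper.

```python
import numpy as np

def rollout(A, B, phi0, controls):
    """Linear latent rollout of (11): phi_{i} = A phi_{i-1} + B u_{k+i-1}."""
    phi = phi0
    for u in controls:
        # one step of the identified linear dynamics in the subspace V
        phi = A @ phi + B @ u
    return phi

# tiny stand-in dimensions (the paper uses a 32-dim latent state)
rng = np.random.default_rng(0)
L, n, steps = 4, 1, 3
A = 0.1 * rng.normal(size=(L, L))
B = rng.normal(size=(L, n))
phi0 = rng.normal(size=L)
us = [rng.normal(size=n) for _ in range(steps)]

# closed form of (11): A^i phi0 + sum_{j=1}^{i} A^{j-1} B u_{k+i-j}
closed = np.linalg.matrix_power(A, steps) @ phi0
for j in range(1, steps + 1):
    closed += np.linalg.matrix_power(A, j - 1) @ (B @ us[steps - j])

assert np.allclose(rollout(A, B, phi0, us), closed)
```

The step-by-step form is what is used in training (Fig. 1(b)); the closed form only serves to verify the recursion.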
The reconstruction loss constrains the intrinsic features extracted by the encoder to contain all the information needed for the decoder to retrieve the original pixels:

L_recon = (1/p) Σ_{i=1}^{p} ‖x_{k+i} − φ̃(φ(x_{k+i}, θ_e), θ_d)‖_F    (12)

where φ̃ denotes the decoder, parameterized by θ_d.

Since we need to generate the corresponding images after multi-step prediction in V, a weighted multi-step prediction loss is designed to further constrain the encoder and decoder:

L_pred = (1/p_p) Σ_{i=1}^{p_p} ι_i ‖x_{k+i} − φ̃(G^i Ψ(x_k), θ_d)‖_F    (13)

where ι_i is the weight of the i-th step linear prediction.

Pixel-wise input systems are more complex and harder to approximate than value-wise systems because raw pixel states contain much irrelevant noise. This leads to a problem: after some prediction steps, the generated images retain only background information and lose all key features. To alleviate this problem, we add an auxiliary weight term to increase the importance of the losses at longer prediction steps. ϱ_i in (10) and ι_i in (13) have similar functions in this work, and they are defined by a 'tanh' function as follows:

ϱ_i = 1 + tanh(τ_l i),   ι_i = 1 + tanh(τ_p i)    (14)

where τ_⋆ is a hyper-parameter that influences the importance degree, as shown in Fig. 2. In this way, the weights ι and ϱ are limited to the range [1, 2], so they do not cause gradient explosion.

Fig. 2. The weights of the multi-step loss functions. We can change the importance of the i-th step prediction loss by tuning τ. When the total number of prediction steps p_l or p_p in training is large, we should tune down the value of τ, and increase it vice versa.

In addition, we add an l2 regularization loss on the encoder and decoder to avoid over-fitting:

l_2 = ‖Θ‖²    (15)

where Θ denotes the weights of the encoder, decoder, A, and B. Finally, CKNet can be trained under the loss function

L = α_1 L_linear + α_2 L_recon + α_3 L_pred + α_4 l_2    (16)

where α_1, α_2, α_3, α_4 are the weights of each loss term. We can train CKNet by minimizing the weighted loss L; details are shown in Algorithm 1.

Algorithm 1
The CKNet Algorithm
Require: p, p_l, p_p, τ_p, τ_l, c, c′, ζ, lr, Epoch = 0, Epoch_max, α_i (i = 1, ..., 4), batch size b_s, a small scalar ε > 0. Initialize θ_e, θ_d, A, B.
Ensure: trained θ_e, θ_d, A, B.
while Epoch < Epoch_max do
  Sample a batch of image and control sequences with length ms = max(p, p_l, p_p): x_{1:ms+c}, U_{1:ms}.
  for i = 1 to ms do
    if the deterministic approach is adopted then
      Obtain the latent state φ(x_{i:i+c}, θ_e) directly from the encoder;
    else
      Sample the latent state φ(x_{i:i+c}, θ_e) with (5);
    end if
    Acquire the reconstructed state x̂ = φ̃(φ(x_{i:i+c}, θ_e), θ_d);
  end for
  for i = 1 to ms do
    Compute G^i Ψ(x_c) with (11) and φ̃(G^i Ψ(x_c));
  end for
  Obtain the weighted loss L in (16) with (10), (12), (13), (14), and (15);
  Update θ_e, θ_d, A, and B by minimizing L with an Adam optimizer;
  Epoch = Epoch + 1
end while
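The auxiliary weights of (14) and their use inside the weighted multi-step losses can be sketched as follows. The τ value and the per-step errors below are illustrative stand-ins, not the tuned values from the experiments.

```python
import numpy as np

def aux_weights(tau, steps):
    """Auxiliary weights of (14): w_i = 1 + tanh(tau * i) for i = 1..steps."""
    i = np.arange(1, steps + 1)
    return 1.0 + np.tanh(tau * i)

w = aux_weights(tau=0.1, steps=12)      # illustrative tau, not a tuned value
assert np.all((w >= 1.0) & (w < 2.0))   # bounded in [1, 2]: no gradient explosion
assert np.all(np.diff(w) > 0)           # later prediction steps weigh more

# weighted multi-step loss as in (10)/(13): mean of w_i times per-step errors
per_step_err = np.linspace(1.0, 0.5, 12)  # dummy per-step errors for illustration
loss = float(np.mean(w * per_step_err))
```

Because tanh saturates, very long horizons with a large τ make all late-step weights nearly equal to 2, which is why the paper recommends tuning τ down when p_l or p_p is large.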
TABLE I
INFORMATION OF THE COLLECTED DATASETS

                CartPole        MountainCar
Episodes        250             240
Steps           [200, 300]      [300, 400]
Allocation      [25, 25, 200]   [20, 20, 200]
III. EXPERIMENTS
In this work, we adopt an offline training manner to validate CKNet on two nonlinear pixel-wise systems with continuous action spaces, MountainCar and CartPole. Namely, we first collect the training, testing, and validation datasets, and preprocess the data before training.
Fig. 3. The two forced dynamical systems with continuous action spaces, from the Gym environment, selected for validating CKNet. (a) Modified 'CartPole-v0' task with a continuous action space; (b) 'MountainCarContinuous-v0' task.
A. Data collection

'MountainCarContinuous-v0' and 'CartPole-v0' are two classic tasks for validating reinforcement learning (RL) algorithms. In the Gym library, the MountainCar task provides a version with a continuous action space, while the CartPole task only supports a discrete action space of {−1, 0, 1}. Thus we made a slight modification to support a continuous action space for the CartPole task. In order to obtain comprehensive data over the state space, we use trained RL agents with an added noise term in the controller for data collection. During collection, we record episode data including the current image s_k, the executed control u_k, and the next image s_{k+1}. For CartPole, we collected 250 episodes: 25 for testing, 25 for validation, and the remaining 200 for training. The steps of each episode are in the range [200, 300]. The corresponding information for MountainCar is detailed in Table I.

In preprocessing, we first convert images to grayscale. Then we enhance the images by setting a pixel's grayscale value to 1.0 when it is larger than 0.8. Lastly, we crop and resize the images to an appropriate size so that we can decrease the computational cost while still keeping enough key information.

A single image includes position and angle features but cannot represent velocity information, such as the velocity of the car and the angular velocity of the pole in the CartPole task. Therefore, we concatenate c consecutive images into a multi-channel tensor as the state. Consequently, the state tensor has size c × H × W, where the spatial size H × W is determined by the crop-and-resize step and differs between the CartPole and MountainCar environments.

TABLE II
NEURAL NETWORK STRUCTURE OF THE TWO EXAMPLES

Both encoders stack convolutional layers with ReLU activations (the CartPole encoder has one more convolutional layer because its input images are larger), followed by fully-connected layers of 4860 and 1525 units with ReLU, and a 32-dimensional output layer with either no activation or Tanh.
As shown in Table II, the neural networks have similar structures and are designed simply, without any pooling layers. There is one more convolutional layer for CartPole, considering that its input images are bigger. For the activation function of the encoder's last layer, we tried two styles, a 'Tanh' function and no activation function, and simulation results show that both styles are valid. The decoders have completely reversed structures corresponding to the encoders; their activation functions are 'ReLU', except that the last convolutional layer uses a 'Sigmoid' function.

Hyper-parameters are given in Table III, where lr and bs are the learning rate and batch size, respectively, and c and c′ denote the number of images at the input of the encoder and the output of the decoder. When c′ = c, the decoder is constrained to output the exact images corresponding to the input of the encoder; when c′ = 1, the decoder is only constrained to output the current image, which equals the last image of the encoder's input. From Table III we can see that the hyper-parameters of the two tasks are almost the same except for the learning rate and batch size. Thus, CKNet does not need a deliberately designed network structure or deliberately tuned hyper-parameters for different tasks.

Additionally, we train the networks with PyTorch-Lightning 1.0.7, a framework based on PyTorch that is convenient for multi-GPU training and for synchronizing batch-normalization parameters. We train these networks with four NVIDIA GeForce GTX 2080Ti GPUs with batch normalization.

IV. RESULTS
During the training process, we regularly check the controllability of the identified linear dynamics by recording the rank of S in (9). The change of the rank R is shown in Fig. 4: for the CartPole task, the rank R quickly reaches the dimension of the latent state, and for the MountainCar task, S becomes full rank after around 4.5K training steps. Namely, during training, the identified models of these two tasks become controllable for both the deterministic and variational approaches.

TABLE III
HYPER-PARAMETERS OF THE TWO ENVIRONMENTS

For each environment, the table lists the loss weights α_1 to α_4, τ_p, τ_l, the horizons p_l and p_p, c, c′, the learning rate lr, and the batch size bs; the two columns are almost identical except for the learning rate and batch size.
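The rank check of (9) used during training can be sketched directly in NumPy. The A and B below are hypothetical stand-ins for the trained latent-dynamics tensors: a shift system driven through its first coordinate, which is controllable by construction, plus an uncontrollable counterexample.

```python
import numpy as np

def controllability_matrix(A, B):
    """S = [B, AB, A^2 B, ..., A^(L-1) B] for an L-dimensional latent state."""
    L = A.shape[0]
    blocks = [B]
    for _ in range(L - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

def is_controllable(A, B):
    # full rank of S <=> the identified linear dynamics is controllable
    return np.linalg.matrix_rank(controllability_matrix(A, B)) == A.shape[0]

# stand-in pair: a 5-dim shift system (A e_i = e_{i+1}) driven through e_1
A = np.diag(np.ones(4), k=-1)
B = np.zeros((5, 1)); B[0, 0] = 1.0
assert is_controllable(A, B)

# counterexample: with A = I the input never leaves span{B}
assert not is_controllable(np.eye(5), B)
```

In CKNet this check is cheap because it runs on the 32-dimensional latent matrices rather than on the pixel-space dynamics.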
Fig. 4. The rank of the matrix S in (9) during the training process. D-⋆ denotes task ⋆ realized with the deterministic approach, where ⋆ ∈ {MountainCar, CartPole}; V-⋆ denotes the variational approach for task ⋆.

For testing, we perform 120-step predictions on these two tasks to demonstrate the proposed CKNet. We first obtain the initial latent state φ(x_k) from the original state x_k, which consists of c adjacent images, and then acquire the predicted latent states φ_K(x_{k+i}) with a sequence of controls as input, according to the recurrent rule shown in Fig. 1(b).

Because the dimension of the latent state is 32 in this work, for intuitive presentation we adopt the mean absolute error (MAE) of the latent state and of the reconstructed image at each step to evaluate the accuracy of the identified linear dynamics. Generally, there are two ways of utilizing the identified dynamics: when the latent state is used as the input of a controller, we expect a smaller error on the predicted latent states; when the predicted images are used, we desire a smaller error on the generated images. As shown in Fig. 5, we study the influence of the auxiliary weights on prediction performance with the deterministic and variational approaches.

In the MountainCar task, we can see that an appropriate auxiliary weight on the linearity loss significantly decreases the latent-state prediction error with both the deterministic and variational approaches. For image prediction and generation, an appropriate auxiliary weight also yields an obvious improvement when the prediction step is within the range of [0, 80] steps for the deterministic approach and [0, 50] steps for the variational approach.

In the CartPole task, auxiliary weights yield obvious improvements on the prediction of generated images for both the deterministic and variational approaches.
For latent-state prediction, auxiliary weights on L_pred help the deterministic approach more than auxiliary weights put on L_linear. Besides, small auxiliary weights on L_pred are more suitable.

The prediction and image generation results are shown in Fig. 6 and Fig. 7: the identified dynamics not only accurately predicts the intrinsic dynamical state, such as the position, angle, and velocities, but also preserves the fixed information of the environments, i.e., the size of the pole, the shape and slide rail of the cart, and the shape of the mountain.

V. CONCLUSION
This work proposes a deep learning framework with convolutional networks for identifying latent dynamics from raw images. We construct the encoder in two different ways, the deterministic and variational approaches. Besides, auxiliary weights are introduced into the multi-step linearity and prediction losses to improve prediction performance. Since training is performed under the constraints of the Koopman operator, the identified model is linear, controllable, and physically interpretable in the subspace constructed by the encoder. Experiments adopt two classic forced physical systems with continuous action spaces, and the results show that the identified model can accurately predict the latent states and generate clear images over 120 steps of linear prediction.

ACKNOWLEDGMENT
The authors would like to thank...

REFERENCES

[1] P. J. Schmid, "Dynamic mode decomposition of numerical and experimental data," Journal of Fluid Mechanics, vol. 656, pp. 5–28, 2010.
[2] M. O. Williams, I. G. Kevrekidis, and C. W. Rowley, "A data-driven approximation of the Koopman operator: Extending dynamic mode decomposition," Journal of Nonlinear Science, vol. 25, no. 6, pp. 1307–1346, 2015.
[3] I. G. Kevrekidis, C. W. Rowley, and M. O. Williams, "A kernel-based method for data-driven Koopman spectral analysis," Journal of Computational Dynamics, vol. 2, no. 2, pp. 247–265, 2016.
[4] I. Mezić, "Analysis of fluid flows via spectral properties of the Koopman operator," Annual Review of Fluid Mechanics, vol. 45, pp. 357–378, 2013.
[5] Y. Susuki, I. Mezic, F. Raak, and T. Hikihara, "Applied Koopman operator theory for power systems technology," Nonlinear Theory and Its Applications, IEICE, vol. 7, no. 4, pp. 430–459, 2016.
[6] M. Netto and L. Mili, "A robust data-driven Koopman Kalman filter for power systems dynamic state estimation," IEEE Transactions on Power Systems, vol. 33, no. 6, pp. 7228–7237, 2018.
[7] S. Klus, A. Bittracher, I. Schuster, and C. Schütte, "A kernel-based approach to molecular conformation analysis," The Journal of Chemical Physics, vol. 149, no. 24, p. 244109, 2018.
[8] G. Mamakoukas, M. L. Castano, X. Tan, and T. Murphey, "Local Koopman operators for data-driven control of robotic systems," in Robotics: Science and Systems, 2019.
[9] J. L. Proctor, S. L. Brunton, and J. N. Kutz, "Dynamic mode decomposition with control," SIAM Journal on Applied Dynamical Systems, vol. 17, no. 1, pp. 142–161, 2018.
Fig. 5. Prediction MAE with the deterministic and variational approaches for the MountainCar and CartPole tasks: (a) deterministic approach for MountainCar; (b) variational approach for MountainCar; (c) deterministic approach for CartPole; (d) variational approach for CartPole. The left and right columns of subfigures respectively show the MAE of the predicted latent states and of the reconstructed images with different auxiliary weights in (14). 'D' denotes that CKNet adopts the deterministic approach for the encoder, while 'V' denotes the variational approach. 'R0⋆W' and 'L0⋆W' indicate that τ_p and τ_l in (14) are equal to 0.1 × ⋆, respectively; otherwise both τ_p and τ_l are equal to zero. The MAE of each curve is calculated over 30 episodes of prediction.
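The per-step MAE curves of Fig. 5 are simple averages of absolute errors over episodes; a minimal sketch with dummy trajectories, where the array shapes (episodes, steps, dimensions) are assumptions matching the setup described above:

```python
import numpy as np

def per_step_mae(pred, true):
    """MAE at each prediction step, averaged over episodes and state dimensions.

    pred, true: arrays of shape (episodes, steps, dim) -- latent states,
    or flattened images when evaluating reconstruction error.
    """
    return np.abs(pred - true).mean(axis=(0, 2))

# dummy data: 30 episodes, 120 steps, 32-dim latent state (as in the paper)
rng = np.random.default_rng(2)
true = rng.normal(size=(30, 120, 32))
pred = true + 0.1 * rng.normal(size=true.shape)
mae = per_step_mae(pred, true)
assert mae.shape == (120,)
```

The same function serves both evaluation modes mentioned above: feed latent trajectories when the controller consumes latent states, or flattened generated images when image quality matters.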
Fig. 6. The prediction of the CartPole task. The solid red line divides the picture into two layers, and each layer has two rows: the upper row is the ground truth, while the lower row shows the images generated via linear prediction in V. The numbers at the upper left denote the prediction steps (10 to 120).

Fig. 7. The corresponding prediction of the MountainCar task (ground truth versus images generated via linear prediction, steps 10 to 120).