Painting Many Pasts: Synthesizing Time Lapse Videos of Paintings
Amy Zhao (MIT), Guha Balakrishnan (MIT), Kathleen M. Lewis (MIT), Frédo Durand (MIT), John V. Guttag (MIT), Adrian V. Dalca (MIT, MGH)
Abstract
We introduce a new video synthesis task: synthesizing time lapse videos depicting how a given painting might have been created. Artists paint using unique combinations of brushes, strokes, and colors. There are often many possible ways to create a given painting. Our goal is to learn to capture this rich range of possibilities.

Creating distributions of long-term videos is a challenge for learning-based video synthesis methods. We present a probabilistic model that, given a single image of a completed painting, recurrently synthesizes steps of the painting process. We implement this model as a convolutional neural network, and introduce a novel training scheme to enable learning from a limited dataset of painting time lapses. We demonstrate that this model can be used to sample many time steps, enabling long-term stochastic video synthesis. We evaluate our method on digital and watercolor paintings collected from video websites, and show that human raters find our synthetic videos to be similar to time lapse videos produced by real artists. Our code is available at https://xamyzhao.github.io/timecraft.
1. Introduction
Skilled artists can often look at a piece of artwork and determine how to recreate it. In this work, we explore whether we can use machine learning and computer vision to mimic this ability. We define a new video synthesis problem: given a painting, can we synthesize a time lapse video depicting how an artist might have painted it?
Artistic time lapses present many challenges for video synthesis methods. There is a great deal of variation in how people create art. Suppose two artists are asked to paint the same landscape. One artist might start with the sky, while the other might start with the mountains in the distance. One might finish each object before moving onto the next, while the other might work a little at a time on each object. During the painting process, there are often few visual cues indicating where the artist will apply the next stroke. The painting process is also long, often spanning hundreds of paint strokes and dozens of minutes.

Figure 1: We present a probabilistic model for synthesizing time lapse videos of paintings. We demonstrate our model on Still Life with a Watermelon and Pomegranates by Paul Cézanne (top), and Wheat Field with Cypresses by Vincent van Gogh (bottom).

In this work, we present a solution to the painting time lapse synthesis problem. We begin by defining the problem and describing its unique challenges. We then derive a principled, learning-based model to capture a distribution of steps that a human might use to create a given painting. We introduce a training scheme that encourages the method to produce realistic changes over many time steps. We demonstrate that our model can learn to solve this task, even when trained using a small, noisy dataset of painting time lapses collected from the web. We show that human evaluators almost always prefer our method to an existing video synthesis baseline, and often find our results indistinguishable from time lapses produced by real artists.

This work presents several technical contributions:

1. We use a probabilistic model to capture stochastic decisions made by artists, thereby capturing a distribution of plausible ways to create a painting.

2. Unlike work in future frame prediction or frame interpolation, we synthesize long-term videos spanning dozens of time steps and many real-time minutes.

3. We demonstrate a model that successfully learns from painting time lapses "from the wild." This data is small and noisy, having been collected from uncontrolled environments with variable lighting, spatial resolution and video capture rates.
2. Related work
To the best of our knowledge, this is the first work that models and synthesizes distributions of videos of the past, given a single final frame. The most similar work to ours is a recent method called visual deprojection [5]. Given a single input image depicting a temporal aggregation of frames, their model captures a distribution of videos that could have produced that image. We compare our method to theirs in our experiments. Here, we review additional related research in three main areas: video prediction, video interpolation, and art synthesis.
Video prediction, or future frame prediction, is the problem of predicting the next frame or few frames of a video given a sequence of past frames. Early work in this area focused on predicting motion trajectories [8, 16, 34, 51, 55] or synthesizing motions in small frames [40, 41, 50]. Recent methods train convolutional neural networks on large video datasets to synthesize videos of natural scenes and human actions [35, 38, 46, 52, 53]. A recent work on time lapse synthesis focuses on outdoor scenes [43], simulating illumination changes over time while keeping the content of the scene constant. In contrast, creating painting time lapses requires adding content while keeping illumination constant. Another recent time lapse method outputs only a few frames depicting specific physical processes: melting, rotting, or flowers blooming [70].

Our problem differs from video prediction in several key ways. First, most video prediction methods focus on short time scales, synthesizing frames on the order of seconds into the future, and encompassing relatively small changes. In contrast, painting time lapses span minutes or even hours, and depict dramatic content changes over time. Second, most video predictors output a single most likely sequence, making them ill-suited for capturing a variety of different plausible painting trajectories. One study [63] uses a conditional variational autoencoder to model a distribution of plausible future frames of moving humans. We build upon these ideas to model painting changes across multiple time steps. Finally, video prediction methods focus on natural videos, which depict the motions of people and objects [35, 38, 46, 52, 53, 63] or physical processes [43, 70]. The input frames often contain visual cues about how the motion, action or physical process will progress, limiting the space of possibilities that must be captured. In contrast, snapshots of paintings provide few visual cues, leading to many plausible future trajectories.
Our problem can be thought of as a long-term frame interpolation task between a blank canvas and a completed work of art, with many possible painting trajectories between them. In video frame interpolation, the goal is to temporally interpolate between two frames in time. Classical approaches focus on natural videos, and estimate dense flow fields [4, 58, 65] or phase [39] to guide interpolation. More recent methods use convolutional neural networks to directly synthesize the interpolated frame [45], or combine flow fields with estimates of scene information [28, 44]. Most frame interpolation methods predict a single or a few intermediate frames, and are not easily extended to predicting long sequences, or predicting distributions of sequences.
The graphics community has long been interested in simulating physically realistic paint strokes in digital media. Many existing methods focus on physics-based models of fluids or brush bristles [6, 7, 9, 12, 57, 62]. More recent learning-based methods leverage datasets of real paint strokes [31, 36, 68], often posing the artistic stroke synthesis problem as a texture transfer or style transfer problem [3, 37]. Several works focus on simulating watercolor-specific effects such as edge darkening [42, 56]. We focus on capturing large-scale, long-term painting processes, rather than fine-scale details of individual paint strokes.

In style transfer, images are transformed to simulate a specific style, such as a painting-like style [20, 21] or a cartoon-like style [67]. More recently, neural networks have been used for generalized artistic style transfer [18, 71]. We leverage insights from these methods to synthesize realistic progressions of paintings.

Several recent papers apply reinforcement learning or similar techniques to the process of painting. These approaches involve designing parameterized brush strokes, and then training an agent to apply strokes to produce a given painting [17, 22, 26, 27, 59, 60, 69]. Some works focus on specific artistic tasks such as hatching or other repetitive strokes [29, 61]. These approaches require careful hand-engineering, and are not optimized to produce varied or realistic painting progressions. In contrast, we learn a broad set of effects from real painting time lapse data.
3. Problem overview
Given a completed painting, our goal is to synthesize different ways that an artist might have created it. We work with recordings of digital and watercolor painting time lapses collected from video websites. Compared to natural videos of scenes and human actions, videos of paintings present unique challenges.

Figure 2: Several real painting progressions of similar-looking scenes. Each artist fills in the house, sky and field in a different order.
High variability

Painting trajectories: Even for the same scene, different artists will likely paint objects in different temporal orders (Figure 2).

Painting rates: Artists work at different speeds, and apply paint in different amounts.

Scales and shapes: Over the course of a painting, artists use strokes that vary in size and shape. Artists often use broad strokes early on, and add fine details later.

Data availability: Due to the limited number of available videos in the wild, it is challenging to gather a dataset that captures the aforementioned types of variability.

Medium-specific challenges

Non-paint effects: In digital art applications (e.g., [23]), there are many tools that apply local blurring, smudging, or specialized paint brush shapes. Artists can also apply global effects simulating varied lighting or tones.

Erasing effects: In digital art applications, artists can erase or undo past actions, as shown in Figure 3.

Physical effects in watercolor paintings: Watercolor painting videos exhibit distinctive effects resulting from the physical interaction of paint, water, and paper. These effects include specular lighting on wet paint, pigments fading as they dry, and water spreading from the point of contact with the brush (Figure 4).

In this work, we design a learning-based model to handle the challenges of high variability and painting medium-specific effects.
Figure 3: Example digital painting sequences. These sequences show a variety of ways to add paint, including fine strokes and filling (row 1), and broad strokes (row 3). We use red boxes to outline challenges, including erasing (row 2) and drastic changes in color and composition (row 3).
Figure 4: Example watercolor painting sequences. The outlined areas highlight some watercolor-specific challenges, including changes in lighting (row 1), diffusion and fading effects as paint dries (row 2), and specular effects on wet paint (row 3).
4. Method
We begin by formalizing the time lapse video synthesis problem. Given a painting $x_T$, our task is to synthesize the past frames $x_1, \cdots, x_{T-1}$. Suppose we have a training set of real time lapse videos $\{x^{(i)} = x_1^{(i)}, \cdots, x_{T^{(i)}}^{(i)}\}$. We first define a principled probabilistic model, and then learn its parameters using these videos. At test time, given a completed painting, we sample from the model to create new videos of realistic-looking painting processes.

We propose a probabilistic, temporally recurrent model for changes made during the painting process. At each time instance $t$, the model predicts a pixel-wise intensity change $\delta_t$ that should be added to the previous frame to produce the current frame; that is, $x_t = x_{t-1} + \delta_t$. This change could represent one or multiple physical or digital paint strokes, or other effects such as erasing or fading.

We model $\delta_t$ as being generated from a random latent variable $z_t$, the completed piece $x_T$, and the image content at the previous time step $x_{t-1}$; the likelihood is $p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)$ (Figure 5). Using a random variable $z_t$ helps to capture the stochastic nature of painting. Using both $x_T$ and $x_{t-1}$ enables the model to capture time-varying effects such as the progression of coarse to fine brush sizes, while the Markovian assumption facilitates learning from a small number of video examples.

Figure 5: The proposed probabilistic model. Circles represent random variables; the shaded circle denotes a variable that is observed at inference time. The rounded rectangle represents model parameters.

It is common to define such image likelihoods as a per-pixel normal distribution, which results in an L2 image similarity loss term in maximum likelihood formulations [33]. In synthesis tasks, using L2 loss often produces blurry results [24]. We instead design our image similarity loss as the L1 distance in pixel space and the L2 distance in a perceptual feature space. Perceptual losses are commonly used in image synthesis and style transfer tasks to produce sharper and more visually pleasing results [14, 24, 30, 45, 66]. We use the L2 distance between normalized VGG16 features [49] as described in [66]. We let the likelihood take the form:

$p_\theta(\delta_t \mid z_t, x_{t-1}; x_T) \propto e^{-\frac{1}{\sigma_1}|\delta_t - \hat{\delta}_t|}\, \mathcal{N}\big(V(x_{t-1} + \delta_t);\, V(x_{t-1} + \hat{\delta}_t),\, \sigma_2^2 I\big),$   (1)

where $\hat{\delta}_t = g_\theta(z_t, x_{t-1}, x_T)$, $g_\theta(\cdot)$ represents a function parameterized by $\theta$, $V(\cdot)$ is a function that extracts normalized VGG16 features, and $\sigma_1, \sigma_2$ are fixed noise parameters.

We assume the latent variable $z_t$ is generated from the multivariate standard normal distribution:

$p(z_t) = \mathcal{N}(z_t; 0, I).$   (2)

We aim to find model parameters $\theta$ that best explain all videos in our dataset:

$\arg\max_\theta \prod_i \prod_t p_\theta\big(\delta_t^{(i)}, x_{t-1}^{(i)}, x_{T^{(i)}}^{(i)}\big) = \arg\max_\theta \prod_i \prod_t \int_{z_t} p_\theta\big(\delta_t^{(i)} \mid z_t^{(i)}, x_{t-1}^{(i)}; x_{T^{(i)}}^{(i)}\big)\, p(z_t)\, dz_t.$   (3)

This integral is intractable, and the posterior $p(z_t \mid \delta_t, x_{t-1}; x_T)$ is also intractable, preventing the use of the EM algorithm. We instead use variational inference and introduce an approximate posterior distribution $p(z_t \mid \delta_t, x_{t-1}; x_T) \approx q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)$ [32, 63, 64]. We let this approximate distribution take the form of a multivariate normal:

$q_\phi(z_t \mid \delta_t, x_{t-1}, x_T) = \mathcal{N}\big(z_t;\, \mu_\phi(\delta_t, x_{t-1}, x_T),\, \Sigma_\phi(\delta_t, x_{t-1}, x_T)\big),$   (4)

where $\mu_\phi(\cdot), \Sigma_\phi(\cdot)$ are functions parameterized by $\phi$, and $\Sigma_\phi(\cdot)$ is diagonal.

We implement the functions $g_\theta$, $\mu_\phi$ and $\Sigma_\phi$ as convolutional encoder-decoders parameterized by $\theta$ and $\phi$, using a conditional variational autoencoder (CVAE) framework [54, 64]. We use an architecture similar to [64]. We summarize our architecture in Figure 6 and include full details in the appendix.

Figure 6: Neural network architecture. We implement our model using a conditional variational autoencoder framework. At training time, the network is encouraged to reconstruct the current frame $x_t$, while sampling the latent $z_t$ from a distribution that is close to the standard normal. At test time, the encoding branch is removed, and $z_t$ is sampled from the standard normal. We use the shorthand $\hat{\delta}_t = g_\theta(z_t, x_{t-1}, x_T)$, $\hat{x}_t = x_{t-1} + \hat{\delta}_t$.
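To make the structure of the conditioning branch, the encoding branch, and the decoder $g_\theta$ concrete, the sketch below builds a minimal conditional VAE in tf.keras. The filter counts, depths, and latent dimensionality are illustrative placeholders, not the architecture described in the appendix.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 8  # illustrative; not the paper's value

def conv_block(x, filters):
    # 3x3 convolution followed by spatial downsampling.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.MaxPool2D()(x)

def build_encoder(frame_shape):
    """Encoding branch q_phi: maps (delta_t, x_{t-1}, x_T) to a normal over z_t."""
    delta_t = layers.Input(frame_shape)
    x_prev = layers.Input(frame_shape)
    x_T = layers.Input(frame_shape)
    h = layers.Concatenate()([delta_t, x_prev, x_T])
    for f in (16, 32, 64):
        h = conv_block(h, f)
    h = layers.Flatten()(h)
    mu = layers.Dense(LATENT_DIM)(h)
    logvar = layers.Dense(LATENT_DIM)(h)
    return tf.keras.Model([delta_t, x_prev, x_T], [mu, logvar])

def build_decoder(frame_shape):
    """Decoder g_theta with a conditioning branch: (z_t, x_{t-1}, x_T) -> delta_hat_t."""
    z_t = layers.Input((LATENT_DIM,))
    x_prev = layers.Input(frame_shape)
    x_T = layers.Input(frame_shape)
    h, w = frame_shape[0], frame_shape[1]
    z_map = layers.Dense(h * w)(z_t)          # broadcast z_t to a spatial map
    z_map = layers.Reshape((h, w, 1))(z_map)
    g = layers.Concatenate()([z_map, x_prev, x_T])
    for f in (64, 32, 16):
        g = layers.Conv2D(f, 3, padding="same", activation="relu")(g)
    delta_hat = layers.Conv2D(frame_shape[-1], 3, padding="same")(g)
    return tf.keras.Model([z_t, x_prev, x_T], delta_hat)

def sample_z(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * logvar) * eps
```

At training time, z_t would be drawn from the encoder output with sample_z; at test time, the encoding branch is dropped and z_t is drawn from the standard normal, mirroring Figure 6.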
We learn model parameters using short sequences from the training video dataset, which we discuss in further detail in Section 5.1. We use two stages of optimization to facilitate convergence: pairwise optimization, followed by sequence optimization.

From Equations (3) and (4), we obtain an expression for each pair of consecutive frames (a derivation is provided in the appendix):

$\log p_\theta(\delta_t, x_{t-1}, x_T) \geq \mathbb{E}_{z_t \sim q_\phi(z_t \mid x_{t-1}, \delta_t; x_T)}\big[\log p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)\big] - \mathrm{KL}\big[q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)\,\|\, p(z_t)\big],$   (5)

where $\mathrm{KL}[\cdot\|\cdot]$ denotes the Kullback-Leibler divergence. Combining Equations (1), (2), (4), and (5), we minimize:

$\mathcal{L}_{KL} + \frac{1}{\sigma_1}\mathcal{L}_{L1}(\delta_t, \hat{\delta}_t) + \frac{1}{2\sigma_2^2}\mathcal{L}_{L2}\big(V(x_{t-1} + \delta_t), V(x_{t-1} + \hat{\delta}_t)\big),$   (6)

where $\mathcal{L}_{KL} = \frac{1}{2}\big(-\log \Sigma_\phi + \Sigma_\phi + \mu_\phi^2\big)$, and the image similarity terms $\mathcal{L}_{L1}, \mathcal{L}_{L2}$ represent L1 and L2 distance respectively.

We optimize Equation (6) on single time steps, which we obtain by sampling all pairs of consecutive frames from the dataset. We also train the model to produce the first frame $x_1$ from videos that begin with a blank canvas, given a white input frame $x_{blank}$, and $x_T$. These starter sequences teach the model how to start a painting at inference time.
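The following is a minimal sketch of the per-pair objective in Equation (6) in TensorFlow/Keras, assuming the encoder and decoder sketched above. The weights sigma1 and sigma2 and the chosen VGG16 layer are placeholders (the paper's exact values and its channel-normalized features from [66] are not reproduced here); raw block3_conv3 activations are only a stand-in for the perceptual feature extractor V.

```python
import tensorflow as tf

# Frozen VGG16 features for the perceptual (L2) term; the specific layer is illustrative.
_vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
_feat = tf.keras.Model(_vgg.input, _vgg.get_layer("block3_conv3").output)
_feat.trainable = False

def perceptual_features(x):
    # x is expected in [0, 1]; VGG16 preprocessing expects [0, 255].
    return _feat(tf.keras.applications.vgg16.preprocess_input(x * 255.0))

def pairwise_loss(delta_t, delta_hat, x_prev, mu, logvar, sigma1=0.1, sigma2=0.1):
    """L_KL + (1/sigma1) * L1(delta) + (1/(2 sigma2^2)) * L2(VGG features).

    sigma1 and sigma2 are placeholder weights, not the values selected in the paper.
    """
    # KL divergence between N(mu, diag(exp(logvar))) and the standard normal prior.
    kl = 0.5 * tf.reduce_sum(tf.exp(logvar) + tf.square(mu) - 1.0 - logvar, axis=-1)
    l1 = tf.reduce_mean(tf.abs(delta_t - delta_hat), axis=[1, 2, 3])
    f_real = perceptual_features(x_prev + delta_t)
    f_pred = perceptual_features(x_prev + delta_hat)
    l2 = tf.reduce_mean(tf.square(f_real - f_pred), axis=[1, 2, 3])
    return tf.reduce_mean(kl + l1 / sigma1 + l2 / (2.0 * sigma2 ** 2))
```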
To synthesize an entire video, we run our model recurrently for multiple time steps, building upon its own predicted frames. It is common when making sequential predictions to observe compounding errors or artifacts over time [52]. We use a novel training scheme to encourage outputs of the model to be accurate and realistic over multiple time steps. We alternate between two sequential training modes.

Sequential CVAE training encourages sequences of frames to be well-captured by the learned distribution, by reducing the compounding of errors. We train the model sequentially for several time steps, predicting each intermediate frame $\hat{x}_t$ using the model's prediction from the previous time step: $\hat{x}_t = \hat{x}_{t-1} + g_\theta(z_t, \hat{x}_{t-1}, x_T)$ for $z_t \sim q_\phi(z_t \mid x_t - \hat{x}_{t-1}, \hat{x}_{t-1}, x_T)$. We compare each predicted frame to its corresponding real frame using the image similarity losses in Eq. (6). We illustrate this in Figure 7.

Figure 7: Sequential CVAE training. Our model is trained to reconstruct a real frame (outlined in green) while building upon its previous predictions for $S$ time steps.

Sequential sampling training encourages random samples from our learned distribution to look like realistic partially-completed paintings. During inference (described below), we rely on sampling from the prior $p(z_t)$ at each time step to synthesize new videos. A limitation of the variational strategy is the limited coverage of the latent space $z_t$ during training [15], sometimes leading to predictions during inference $\hat{x}_t = \hat{x}_{t-1} + g_\theta(z_t, \hat{x}_{t-1}, x_T)$, $z_t \sim p(z_t)$ that are unrealistic. To compensate for this, we introduce supervision on such samples by amending the image similarity term in Equation (5) with a conditional critic loss term [19]:

$\mathcal{L}_{critic} = \mathbb{E}_{z_t \sim p(z_t)}\big[D_\psi(\hat{x}_t, \hat{x}_{t-1}, x_T)\big] - \mathbb{E}_{x_t}\big[D_\psi(x_t, x_{t-1}, x_T)\big],$   (7)

where $D_\psi(\cdot)$ is a critic function with parameters $\psi$. This critic encourages the distribution of sampled changes $\hat{\delta}_t = g_\theta(z_t, \hat{x}_{t-1}, x_T)$, $z_t \sim p(z_t)$ to match the distribution of training painting changes $\delta_t$. We use a critic architecture based on [10] and optimize it using WGAN-GP [19].

In addition to the critic loss, we apply the image similarity losses (discussed above) after $\tau$ time steps, to encourage the model to eventually produce the completed painting. This training scheme is summarized in Figure 8.

Figure 8: Sequential sampling training. We use a conditional frame critic to encourage all frames sampled from our model to look realistic. The image similarity loss on the final frame encourages the model to complete the painting in $\tau$ time steps.

Given a completed painting $x_T$ and learned model parameters $\theta, \phi$, we synthesize videos by sampling from the model at each time step. Specifically, we synthesize each frame $\hat{x}_t = \hat{x}_{t-1} + g_\theta(z_t, \hat{x}_{t-1}, x_T)$ using the synthesized previous frame $\hat{x}_{t-1}$ and a randomly sampled $z_t \sim p(z_t)$. We start each video using $\hat{x}_0 = x_{blank}$, a blank frame.

We implement our model using Keras [11] and TensorFlow [1]. We experimentally selected the hyperparameters controlling the reconstruction loss weights, $\sigma_1$ and $\sigma_2$, using the validation set.
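A sketch of the inference-time sampling loop described above, assuming a trained decoder g_theta (for example, the build_decoder sketch earlier), frames normalized to [0, 1] with white = 1, and a placeholder latent dimension and step count. The clipping step is a safeguard added here for numerical tidiness, not a detail taken from the paper.

```python
import numpy as np

def synthesize_video(g_theta, x_T, num_steps=40, latent_dim=8):
    """Recurrently samples z_t ~ N(0, I) and accumulates predicted changes."""
    x_prev = np.ones_like(x_T)  # x_blank: a white starting frame
    frames = []
    for _ in range(num_steps):
        z_t = np.random.normal(size=(1, latent_dim)).astype(np.float32)
        delta_hat = g_theta.predict([z_t, x_prev, x_T], verbose=0)
        x_prev = np.clip(x_prev + delta_hat, 0.0, 1.0)  # keep pixel intensities valid
        frames.append(x_prev[0])
    return np.stack(frames)
```

Drawing several videos for the same x_T with different random seeds yields the distribution of painting trajectories evaluated in Section 5.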
5. Experiments
We collected time lapse recordings of paintings from YouTube and Vimeo. We selected digital and watercolor paintings (which are common painting methods on these websites), and focused on landscapes or still lifes (which are common subjects for both mediums). We downloaded each video at a fixed resolution and cropped it temporally and spatially to include only the painting process (excluding other content such as introductions or sketching). We split each dataset in a 70:15:15 ratio into training, validation, and held-out test video sets.

Digital paintings: We collected digital painting time lapses. The average duration is 4 minutes, with many videos having already been sped up by artists using the Procreate application [23]. We selected videos with minimal zooming and panning. We manually removed segments that contained movements such as translations, flipping and zooming. Figure 3 shows example video sequences.
Watercolor paintings:
We collected watercolor time lapses, with an average duration of 20 minutes. We only kept videos that contained minimal movement of the paper, and manually corrected any small translations of the painting. We show examples in Figure 4.

A challenge with videos of physical paintings is the presence of the hand, paintbrush and shadows in many frames. We trained a simple convolutional neural network to identify and remove frames that contained these artifacts.
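The paper does not describe this filtering network in detail; the sketch below is one plausible minimal binary classifier in tf.keras, with the layer sizes, input resolution, and training configuration chosen for illustration only.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_frame_filter(input_shape=(64, 64, 3)):
    """Binary classifier: does this frame contain a hand, brush, or shadow?"""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu", padding="same"),
        layers.MaxPool2D(),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPool2D(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(1, activation="sigmoid"),  # P(frame is occluded)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Frames scoring above a chosen probability threshold would be dropped
# before extracting training sequences.
```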
We synthesize time lapses at a lower temporal resolution than real-time for computational feasibility. We extract training sequences from raw videos at a period of γ frames (i.e., skipping γ frames in each synthesized time step), with a maximum variance of ε frames. Allowing some variance in the sampling rate is useful for (1) improving robustness to varied painting rates, and (2) extracting sequences from watercolor painting videos where many frames containing hands or paintbrushes have been removed. We select γ and ε independently for each dataset.

We avoid capturing static segments of each video (e.g., when the artist is speaking) by requiring that adjacent frames in each sequence have at least a minimum fraction of pixels changing by a fixed intensity threshold. We use a dynamic programming method to find all training and validation sequences that satisfy these criteria. We train on sequences of length 3 or 5 for sequential CVAE training, and length τ = 40 for sequential sampling training, which we determined using experiments on the validation set. For evaluations on the test set, we extract a single sequence from each test video that satisfies the filtering criteria.

To facilitate learning from small numbers of videos, we use multiple crops from each video. We first downsample each video spatially, so that most patches contain visually interesting content and spatial context, and then extract square crops with minimal overlap.
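A numpy sketch of the adjacent-frame filtering criterion described above; the fraction of changed pixels and the intensity threshold are placeholder values, since the exact thresholds are not given here.

```python
import numpy as np

def frames_differ(frame_a, frame_b, min_frac=0.01, intensity_thresh=0.05):
    """True if at least min_frac of pixels change by more than intensity_thresh."""
    changed = np.abs(frame_a - frame_b).max(axis=-1) > intensity_thresh
    return changed.mean() >= min_frac

def is_valid_sequence(frames):
    """Accept a candidate training sequence only if every adjacent pair changes enough."""
    return all(frames_differ(frames[t], frames[t + 1]) for t in range(len(frames) - 1))
```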
Deterministic video synthesis (unet): In image synthesis tasks, it is common to use an encoder-decoder architecture with skip connections, similar to U-Net [24, 47]. We adapt this technique to synthesize an entire video at once.

Stochastic video synthesis (vdp): Visual deprojection synthesizes a distribution of videos from a single temporally-projected input image [5].

We design each baseline model architecture to have a comparable number of parameters to our model. Both baselines output videos of a fixed length, which we choose to be comparable to our choice of τ = 40 in Section 5.1.1.

We conducted both quantitative and qualitative evaluations. We first present a user study quantifying human perception of the realism of our synthesized videos. Next, we qualitatively examine our synthetic videos, and discuss characteristics that contribute to their realism. Finally, we discuss quantitative metrics for comparing sets of sampled videos to real videos. We show additional results, including videos and visualizations using the tipiX tool [13], on our project page at https://xamyzhao.github.io/timecraft.

We experimented with training each method on digital or watercolor paintings only, as well as on the combined paintings dataset. For all methods, we found that training on the combined dataset produced the best qualitative and quantitative results (likely due to our limited dataset size), and we only present results for those models.
We surveyed 158 people using Amazon Mechanical Turk [2]. Participants compared the realism of pairs of videos randomly sampled from ours, vdp, or the real videos. In this study, we omit the weaker baseline unet, which performed consistently worse on all metrics (discussed below). We first trained the participants by showing them several examples of real painting time lapses. We then presented a pair of time lapse videos generated by different methods for the center crop of the same painting, and asked "Which video in each pair shows a more realistic painting process?" We repeated this process for 14 randomly sampled paintings from the test set. Full study details are in the appendix.

Table 1: User study results. Users compared the realism of pairs of videos randomly sampled from ours, vdp, and real videos. The vast majority of participants preferred our videos over vdp videos (p < .). Similarly, most participants chose real videos over vdp videos (p < .). Users preferred real videos over ours (p = 0.), but many participants confused our videos with the real videos, especially for digital paintings.

Comparison     All paintings   Watercolor paintings   Digital paintings
real > vdp     90%             90%                    90%
real > ours    55%             60%                    51%
ours > vdp     91%             90%                    88%

Table 1 indicates that almost every participant thought videos synthesized by our model looked more realistic than those synthesized by vdp (p < .). Furthermore, participants confused our synthetic videos with real videos nearly half of the time. In the next sections, we show example synthetic videos and discuss aspects that make our model's results appear more realistic, offering an explanation for these promising user study results.

Figure 9 shows sample sequences produced by our model for two input paintings. Our model chooses different orderings of semantic regions during the painting process, leading to different paths that still converge to the same completed painting.

Figure 9: Diversity of sampled videos. We show examples of our method applied to a digital (top 3 rows) and a watercolor (bottom 3 rows) painting from the test set. Our method captures diverse and plausible painting trajectories.

Figure 10 shows videos synthesized by each method. To objectively compare the stochastic methods vdp and ours, we show the most similar predicted video by L1 distance to the ground truth video. The ground truth videos show that artists tend to paint in a coarse-to-fine manner, using broad strokes near the start of a painting, and finer strokes near the end. Artists also tend to focus on one or a few semantic regions in each time step. As we highlight with arrows, our method captures these tendencies better than the baselines, having learned to make changes within separate semantic regions such as mountains, cabins and trees. Our predicted trajectories are similar to the ground truth, showing that our sequential modeling approach is effective at capturing realistic temporal progressions. In contrast, the baselines tend to make blurry changes without separating the scene into components.

Figure 10: Videos predicted from the digital (top) and watercolor (bottom) test sets. For the stochastic methods vdp and ours, we show the nearest sample to the real video out of 2000 samples. We show additional results in the appendix. (a) Similarly to the artist, our method paints in a coarse-to-fine manner. Blue arrows show where our method first applies a flat color, and then adds fine details. Red arrows indicate where the baselines add fine details even in the first time step. (b) Our method works on similar regions to the artist, although it does not use the same color layers to achieve the completed painting. Blue arrows show where our method paints similar parts of the scene to the artist (filling in the background first, and then the house, and then adding details to the background). Red arrows indicate where the baselines do not paint according to semantic boundaries, gradually fading in the background and the house in the same time step.

We examine failure cases from the proposed method in Figure 11, such as making many fine or disjoint changes in a single time step and creating an unrealistic effect.
In a stochastic task, comparing synthesized results to "ground truth" is ill-defined, and developing quantitative measures of realism is difficult [25, 48]; these challenges motivated our user study above. In this section, we explore quantitative metrics designed to measure aspects of time lapse realism. For each video in the test set, we extract a 40-frame long sequence according to the criteria described in Section 5.1.1, and evaluate each method on 5 random crops using several video similarity metrics:

Best (across k samples) overall video distance (lower is better): For each crop, we draw k sample videos from each model and report the closest sample to the true video by L1 distance [5]. If a method has captured the distribution of real time lapses well, it should produce better "best" estimates as k → ∞. This captures whether a model produces some realistic samples, and whether the model is diverse enough to capture each artist's specific choices.

Best (across k samples) painting change shape similarity (higher is better): We quantify how similar the set of painting change shapes are between the ground truth and each predicted video, disregarding the order in which they were performed. We define the painting change shape as a binary map of the changes made in each time step. For each time step in each test video, we compare the artist's change shape to the most similarly shaped change synthesized by each method, as measured by intersection-over-union (IOU). This captures whether a method paints in similar semantic regions to the artist.

We summarize these results in Table 2. We include a deterministic interp baseline, which linearly interpolates in time, as a quantitative lower bound. The deterministic interp and unet approaches perform poorly for both metrics. For k = 2000, vdp and our method produce samples that lead to comparable "best video similarity" by L1 distance, highlighting the strength of methods designed to capture distributions of videos. The painting change IOU metric shows that our method synthesizes changes that are significantly more realistic than the other methods.

We show the effect of increasing the number of samples k in Figure 12. At low k, the blurry videos produced by interp and unet attain lower L1 distances to the real video than the videos produced by vdp and ours do, likely because L1 distance penalizes samples with different painting progressions more than it penalizes blurry "average" frames. In other words, an artist's time lapse will typically have a higher L1 distance to a video of a different but plausible painting process, than it would to a blurry, gradually fading video with "average" frames. As k increases, vdp and our method produce some samples that are close to the real video. Together with the user study described above, these metrics indicate that our method can capture a realistic variety of painting time lapses.
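Minimal numpy sketches of the two metrics, assuming videos are arrays of shape (T, H, W, C) in [0, 1]; the binarization threshold used to define a change shape is a placeholder, since the exact value is not specified here.

```python
import numpy as np

def best_of_k_l1(real_video, sampled_videos):
    """Smallest mean L1 distance between the real video and any of the k samples."""
    return min(np.mean(np.abs(real_video - s)) for s in sampled_videos)

def change_shapes(video, thresh=0.05):
    # Binary map of which pixels changed at each time step.
    deltas = np.abs(np.diff(video, axis=0)).max(axis=-1)
    return deltas > thresh

def best_change_iou(real_video, synth_video, thresh=0.05):
    """For each real change shape, IOU with the most similarly shaped synthesized change."""
    real = change_shapes(real_video, thresh)
    synth = change_shapes(synth_video, thresh)
    ious = []
    for r in real:
        best = 0.0
        for s in synth:
            union = np.logical_or(r, s).sum()
            inter = np.logical_and(r, s).sum()
            best = max(best, inter / union if union > 0 else 0.0)
        ious.append(best)
    return float(np.mean(ious))
```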
6. Conclusion
In this work, we introduce a new video synthesis problem: making time lapse videos that depict the creation of paintings. We proposed a recurrent probabilistic model that captures the stochastic decisions of human artists. We introduced an alternating sequential training scheme that encourages the model to make realistic predictions over many time steps. We demonstrated our model on digital and watercolor paintings, and used it to synthesize realistic and varied painting videos. Our results, including human evaluations, indicate that the proposed model is a powerful first tool for capturing stochastic changes from small video datasets.

Figure 11: Failure cases. We show unrealistic effects that are sometimes synthesized by our method, for a watercolor painting (top) and a digital painting (bottom). (a) The proposed method does not always synthesize realistic changes for fine details. Blue arrows highlight frames where the method makes realistic painting changes, working in one or two semantic regions at a time. Red arrows show examples where our method sometimes fills in many details in the frame at once. (b) The proposed method sometimes synthesizes changes in disjoint regions. Red arrows indicate where the method produces painting changes that fill in small patches that correspond to disparate semantic regions, leaving unrealistic blank gaps throughout the frame. This example also fills in much of the frame in one time step, although most of the filled areas in the second frame are coarse.
Table 2: Quantitative results. We compare videos synthesized from the digital and watercolor painting test sets to the artists' videos. For the stochastic methods vdp and ours, we draw 2000 video samples and report the closest one to the ground truth.

Method    Digital L1     Digital Change IOU    Watercolor L1    Watercolor Change IOU
interp    0.49 (0.13)    0.17 (0.06)           0.38 (0.09)      0.17
unet      0.18 (0.08)    0.24 (0.08)           0.15 (0.06)      0.27
vdp       – (0.06)       0.31                  – (0.05)         0.32
ours      –              –                     –                –
7. Acknowledgments
We thank Zoya Bylinskii of Adobe Inc. for her insights around designing effective and accurate user studies. This work was funded by Wistron Corporation.
References

[1] Martín Abadi et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Amazon Mechanical Turk, Inc. Amazon mechanical turk: Overview, 2005.
[3] Ryoichi Ando and Reiji Tsuruno. Segmental brush synthesis with stroke images. 2010.
[4] Simon Baker, Daniel Scharstein, JP Lewis, Stefan Roth, Michael J Black, and Richard Szeliski. A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1):1–31, 2011.
[5] Guha Balakrishnan, Adrian V Dalca, Amy Zhao, John V Guttag, Fredo Durand, and William T Freeman. Visual deprojection: Probabilistic recovery of collapsed dimensions. In Proceedings of the IEEE International Conference on Computer Vision, pages 171–180, 2019.
[6] William Baxter, Yuanxin Liu, and Ming C Lin. A viscous paint model for interactive applications. Computer Animation and Virtual Worlds, 15(3-4):433–441, 2004.
[7] William V Baxter and Ming C Lin. A versatile interactive 3d brush model. In Computer Graphics and Applications, 2004. PG 2004. Proceedings. 12th Pacific Conference on, pages 319–328. IEEE, 2004.
[8] Maren Bennewitz, Wolfram Burgard, and Sebastian Thrun. Learning motion patterns of persons for mobile service robots. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), volume 4, pages 3601–3606. IEEE, 2002.
[9] Zhili Chen, Byungmoon Kim, Daichi Ito, and Huamin Wang. Wetbrush: Gpu-based 3d painting simulation at the bristle level. ACM Transactions on Graphics (TOG), 34(6):200, 2015.
[10] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
[11] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[12] Nelson S-H Chu and Chiew-Lan Tai. Moxi: real-time ink dispersion in absorbent paper. In ACM Transactions on Graphics (TOG), volume 24, pages 504–511. ACM, 2005.
[13] Adrian V Dalca, Ramesh Sridharan, Natalia Rost, and Polina Golland. tipiX: Rapid visualization of large image collections. MICCAI-IMIC Interactive Medical Image Computing Workshop, 2014.
[14] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.
[15] Jesse Engel, Matthew Hoffman, and Adam Roberts. Latent constraints: Learning to generate conditionally from unconditional generative models. In International Conference on Learning Representations, 2018.
[16] Scott Gaffney and Padhraic Smyth. Trajectory clustering with mixtures of regression models. In KDD, volume 99, pages 63–72, 1999.
[17] Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, SM Eslami, and Oriol Vinyals. Synthesizing programs for images using reinforced adversarial learning. arXiv preprint arXiv:1804.01118, 2018.
[18] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[19] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
[20] Aaron Hertzmann. Painterly rendering with curved brush strokes of multiple sizes. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pages 453–460. ACM, 1998.
[21] Fay Huang, Bo-Hui Wu, and Bo-Ru Huang. Synthesis of oil-style paintings. In Pacific-Rim Symposium on Image and Video Technology, pages 15–26. Springer, 2015.
[22] Zhewei Huang, Wen Heng, and Shuchang Zhou. Learning to paint with model-based deep reinforcement learning. In IEEE International Conference on Computer Vision (ICCV), 2019.
[23] Savage Interactive. Procreate Artists' Handbook. Savage, 2016.
[24] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[25] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[26] Biao Jia, Jonathan Brandt, Radomír Mech, Byungmoon Kim, and Dinesh Manocha. Lpaintb: Learning to paint from self-supervision. CoRR, abs/1906.06841, 2019.
[27] Biao Jia, Chen Fang, Jonathan Brandt, Byungmoon Kim, and Dinesh Manocha. Paintbot: A reinforcement learning approach for natural media painting. CoRR, abs/1904.02201, 2019.
[28] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9000–9008, 2018.
[29] Pierre-Marc Jodoin, Emric Epstein, Martin Granger-Piché, and Victor Ostromoukhov. Hatching by example: a statistical approach. In Proceedings of the 2nd International Symposium on Non-Photorealistic Animation and Rendering, pages 29–36. ACM, 2002.
[30] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[31] Mikyung Kim and Hyun Joon Shin. An example-based approach to synthesize artistic strokes using graphs. In Computer Graphics Forum, volume 29, pages 2145–2152. Wiley Online Library, 2010.
[32] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
[33] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[34] Ce Liu, Jenny Yuen, and Antonio Torralba. Sift flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):978–994, 2010.
[35] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 4463–4471, 2017.
[36] Jingwan Lu, Connelly Barnes, Stephen DiVerdi, and Adam Finkelstein. Realbrush: painting with examples of physical media. ACM Transactions on Graphics (TOG), 32(4):117, 2013.
[37] Michal Lukáč, Jakub Fišer, Paul Asente, Jingwan Lu, Eli Shechtman, and Daniel Sýkora. Brushables: Example-based edge-aware directional texture painting. In Computer Graphics Forum, volume 34, pages 257–267. Wiley Online Library, 2015.
[38] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. 2016.
[39] Simone Meyer, Oliver Wang, Henning Zimmer, Max Grosse, and Alexander Sorkine-Hornung. Phase-based frame interpolation for video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1418, 2015.
[40] Vincent Michalski, Roland Memisevic, and Kishore Konda. Modeling deep temporal dependencies with recurrent "grammar cells". In Advances in Neural Information Processing Systems, pages 1925–1933, 2014.
[41] Roni Mittelman, Benjamin Kuipers, Silvio Savarese, and Honglak Lee. Structured recurrent temporal restricted boltzmann machines. In International Conference on Machine Learning, pages 1647–1655, 2014.
[42] Santiago E Montesdeoca, Hock Soon Seah, Pierre Bénard, Romain Vergne, Joëlle Thollot, Hans-Martin Rall, and Davide Benvenuti. Edge- and substrate-based effects for watercolor stylization. In Proceedings of the Symposium on Non-Photorealistic Animation and Rendering, page 2. ACM, 2017.
[43] Seonghyeon Nam, Chongyang Ma, Menglei Chai, William Brendel, Ning Xu, and Seon Joo Kim. End-to-end time-lapse video synthesis from a single outdoor image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1409–1418, 2019.
[44] Simon Niklaus and Feng Liu. Context-aware synthesis for video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1710, 2018.
[45] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 261–270, 2017.
[46] MarcAurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
[47] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[48] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[49] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[50] Ilya Sutskever, Geoffrey E Hinton, and Graham W Taylor. The recurrent temporal restricted boltzmann machine. In Advances in Neural Information Processing Systems, pages 1601–1608, 2009.
[51] Dizan Vasquez and Thierry Fraichard. Motion prediction for moving objects: a statistical approach. In IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA'04. 2004, volume 4, pages 3931–3936. IEEE, 2004.
[52] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. In ICML, 2017.
[53] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613–621, 2016.
[54] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pages 835–851. Springer, 2016.
[55] Jacob Walker, Abhinav Gupta, and Martial Hebert. Patch to the future: Unsupervised visual prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3302–3309, 2014.
[56] Miaoyi Wang, Bin Wang, Yun Fei, Kanglai Qian, Wenping Wang, Jiating Chen, and Jun-Hai Yong. Towards photo watercolorization with artistic verisimilitude. IEEE Transactions on Visualization and Computer Graphics, 20(10):1451–1460, 2014.
[57] Der-Lor Way and Zen-Chung Shih. The synthesis of rock textures in Chinese landscape painting. Computer Graphics Forum, 2001.
[58] Manuel Werlberger, Thomas Pock, Markus Unger, and Horst Bischof. Optical flow guided tv-l1 video interpolation and restoration. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 273–286. Springer, 2011.
[59] Ning Xie, Hirotaka Hachiya, and Masashi Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. CoRR, abs/1206.4634, 2012.
[60] Ning Xie, Tingting Zhao, Feng Tian, Xiaohua Zhang, and Masashi Sugiyama. Stroke-based stylization learning and rendering with inverse reinforcement learning. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 2531–2537. AAAI Press, 2015.
[61] Jun Xing, Hsiang-Ting Chen, and Li-Yi Wei. Autocomplete painting repetitions. ACM Transactions on Graphics (TOG), 33(6):172, 2014.
[62] Songhua Xu, Min Tang, Francis Lau, and Yunhe Pan. A solid model based virtual hairy brush. In Computer Graphics Forum, volume 21, pages 299–308. Wiley Online Library, 2002.
[63] Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, pages 91–99, 2016.
[64] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
[65] Zhefei Yu, Houqiang Li, Zhangyang Wang, Zeng Hu, and Chang Wen Chen. Multi-level video frame interpolation: Exploiting the interaction among different levels. IEEE Transactions on Circuits and Systems for Video Technology, 23(7):1235–1248, 2013.
[66] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[67] Yong Zhang, Weiming Dong, Chongyang Ma, Xing Mei, Ke Li, Feiyue Huang, Bao-Gang Hu, and Oliver Deussen. Data-driven synthesis of cartoon faces using different styles. IEEE Transactions on Image Processing, 26(1):464–478, 2017.
[68] Ming Zheng, Antoine Milliez, Markus Gross, and Robert W Sumner. Example-based brushes for coherent stylized renderings. In Proceedings of the Symposium on Non-Photorealistic Animation and Rendering, page 3. ACM, 2017.
[69] Ningyuan Zheng, Yifan Jiang, and Dingjiang Huang. Strokenet: A neural painting environment. In International Conference on Learning Representations, 2019.
[70] Yipin Zhou and Tamara L. Berg. Learning temporal transformations from time-lapse videos. Volume 9912, pages 262–277, 2016.
[71] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.

Appendix A. ELBO derivation

We provide the full derivation of our model and losses from Equation (3). We start with our goal of finding model parameters $\theta$ that maximize the following probability for all videos and all $t$:

$p_\theta(\delta_t, x_{t-1}; x_T) \propto p_\theta(\delta_t \mid x_{t-1}; x_T) = \int_{z_t} p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)\, p(z_t)\, dz_t.$

We use variational inference and introduce an approximate posterior distribution $q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)$ [32, 63, 64]:

$\int_{z_t} p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)\, p(z_t)\, dz_t = \int_{z_t} p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)\, \frac{p(z_t)}{q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)}\, q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)\, dz_t.$

Maximizing this quantity is equivalent to maximizing its logarithm:

$\log \int_{z_t} p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)\, \frac{p(z_t)}{q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)}\, q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)\, dz_t = \log \mathbb{E}_{z_t \sim q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)}\left[\frac{p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)\, p(z_t)}{q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)}\right].$   (8)

We use the shorthand $z_t \sim q_\phi$ for $z_t \sim q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)$, and apply Jensen's inequality:

$\log \mathbb{E}_{z_t \sim q_\phi}\left[\frac{p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)\, p(z_t)}{q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)}\right] \geq \mathbb{E}_{z_t \sim q_\phi}\big[\log p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)\big] + \mathbb{E}_{z_t \sim q_\phi}\left[\log \frac{p(z_t)}{q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)}\right] = \mathbb{E}_{z_t \sim q_\phi}\big[\log p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)\big] - \mathrm{KL}\big[q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)\,\|\, p(z_t)\big],$   (9)

where $\mathrm{KL}[\cdot\|\cdot]$ is the Kullback-Leibler divergence, arriving at the ELBO presented in Equation (5) in the paper.

Combining the first term in Equation (5) with our image likelihood defined in Equation (1):

$\mathbb{E}_{z_t \sim q_\phi} \log p_\theta(\delta_t \mid z_t, x_{t-1}; x_T) \propto \mathbb{E}_{z_t \sim q_\phi}\left[\log e^{-\frac{1}{\sigma_1}|\delta_t - \hat{\delta}_t|} + \log \mathcal{N}\big(V(x_{t-1} + \delta_t); V(x_{t-1} + \hat{\delta}_t), \sigma_2^2 I\big)\right] = \mathbb{E}_{z_t \sim q_\phi}\left[-\frac{1}{\sigma_1}|\delta_t - \hat{\delta}_t| + \log \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{\big(V(x_{t-1} + \delta_t) - V(x_{t-1} + \hat{\delta}_t)\big)^2}{2\sigma_2^2}\right)\right] \propto \mathbb{E}_{z_t \sim q_\phi}\left[-\frac{1}{\sigma_1}|\delta_t - \hat{\delta}_t| - \frac{1}{2\sigma_2^2}\big(V(x_{t-1} + \delta_t) - V(x_{t-1} + \hat{\delta}_t)\big)^2\right],$   (10)

giving us the image similarity losses in Equation (6). We derive $\mathcal{L}_{KL}$ in Equation (6) by similarly taking the logarithm of the normal distributions defined in Equations (2) and (4).

Appendix B. Network architecture
We provide details about the architecture of our recurrent model and our critic model in Figure 13.
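As a reference point for the critic optimization, the sketch below shows a conditional WGAN-GP critic loss in TensorFlow for a critic D_psi(x_t, x_{t-1}, x_T). The gradient penalty weight of 10 follows the caption of Figure 13; the optimizer, batching, and critic architecture itself are not shown and would follow StarGAN [10].

```python
import tensorflow as tf

def critic_loss_with_gp(critic, real_frame, fake_frame, x_prev, x_T, gp_weight=10.0):
    """Conditional WGAN-GP critic loss: Wasserstein term plus gradient penalty."""
    d_real = critic([real_frame, x_prev, x_T])
    d_fake = critic([fake_frame, x_prev, x_T])
    wasserstein = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real)

    # Gradient penalty on random interpolations between real and generated frames.
    alpha = tf.random.uniform(tf.shape(real_frame)[:1])      # one alpha per example
    alpha = tf.reshape(alpha, [-1, 1, 1, 1])
    interp = alpha * real_frame + (1.0 - alpha) * fake_frame
    with tf.GradientTape() as tape:
        tape.watch(interp)
        d_interp = critic([interp, x_prev, x_T])
    grads = tape.gradient(d_interp, interp)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    gp = tf.reduce_mean(tf.square(grad_norm - 1.0))
    return wasserstein + gp_weight * gp
```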
Appendix C. Human study
We surveyed 150 human participants. Each participant took a survey containing a training section followed by 14 questions.
Calibration: We first trained the participants by showing them several examples of real digital and watercolor painting time lapses.
Evaluation: We then showed each participant 14 pairs of time lapse videos, comprised of a mix of watercolor and digital paintings selected randomly from the test sets. Although each participant only saw a subset of the test paintings, every test painting was included in the surveys. Each pair contained videos of the same center-cropped painting. The videos were randomly chosen from all pairwise comparisons between real, vdp, and ours, with the ordering within each pair randomized as well. Samples from vdp and ours were generated randomly.
Validation: Within the survey, we also showed two repeated questions comparing a real video with a linearly interpolated video (which we described as interp in Table 2 in the paper) to validate that users understood the task. We did not use results from users who chose incorrect answers for one or both validation questions.
Appendix D. Additional results
We include additional qualitative results in Figures 14 and 15. We encourage the reader to view the supplementary video, which illustrates many of the discussed effects.

Figure 13: Neural network architecture details. The diagram shows the paint stroke model (conditioning and encoding branches) and the critic model. We use an encoder-decoder style architecture for our model. For our critic, we use a similar architecture to StarGAN [10], and optimize the critic using WGAN-GP [19] with a gradient penalty weight of 10, and 5 critic training iterations for each iteration of our model. All strided convolutions and downsampling layers reduce the size of the input volume by a factor of 2.
Figure 14: Videos synthesized from the watercolor paintings test set. For the stochastic methods vdp and ours, we examine the nearest sample to the real video out of 2000 samples. We discuss the variability among samples from our method in Section 5, and in the supplementary video. (a) The proposed method paints similar regions to the artist. Red arrows in the second row show where unet adds fine details everywhere in the scene, ignoring the semantic boundary between the rock and the water, and contributing to an unrealistic fading effect. The video synthesized by vdp produces more coarse changes early on, but introduces an unrealistic-looking blurring and fading effect on the rock (red arrows in the third row). Blue arrows highlight that our method makes similar changes to the artist, filling in the base color of the water, then the base colors of the rock, and then fine details throughout the painting. (b) The proposed method identifies appropriate colors and shape for each layer of paint. Red arrows indicate where the baselines fill in details that the artist does not complete until much later in the sequence (not shown in the real sequence, but visible in the input image). Blue arrows show where our method adds a base layer for the vase with a reasonable color and shape, and then adds fine details to it later.

Figure 15: Videos synthesized from the watercolor paintings test set. For the stochastic methods vdp and ours, we examine the nearest sample to the real video out of 2000 samples. (a) The proposed method paints using coarse-to-fine layers of different colors, similarly to the real artist. Red arrows indicate where the baseline methods fill in details of the house and bush at the same time, adding fine-grained details even early in the painting. Blue arrows highlight where our method makes similar changes to the artist, adding a flat base color for the bush first before filling in details, and using layers of different colors. (b) The proposed method synthesizes watercolor-like effects such as paint fading as it dries. Red arrows indicate where the baselines fill in the house and the background at the same time. Blue arrows in the first two video frames of the last row show that our method uses coarse changes early on. Blue arrows in frames 3-5 show where our method simulates paint drying effects (with the intensity of the color fading over time), which are common in real watercolor videos.