BézierSketch: A generative model for scalable vector sketches
Ayan Das, Yongxin Yang, Timothy Hospedales, Tao Xiang, Yi-Zhe Song
SketchX, CVSSP, University of Surrey, United Kingdom
{a.das, yongxin.yang, t.xiang, y.song}@surrey.ac.uk
iFlyTek-Surrey Joint Research Centre on Artificial Intelligence
University of Edinburgh, United Kingdom
[email protected]
Abstract.
The study of neural generative models of human sketches is a fascinating contemporary modeling problem due to the links between sketch image generation and the human drawing process. The landmark SketchRNN provided a breakthrough by sequentially generating sketches as a sequence of waypoints. However, this leads to low-resolution image generation and failure to model long sketches. In this paper we present BézierSketch, a novel generative model for fully vector sketches that are automatically scalable and high-resolution. To this end, we first introduce a novel inverse graphics approach to stroke embedding that trains an encoder to embed each stroke to its best fit Bézier curve. This enables us to treat sketches as short sequences of parameterized strokes and thus train a recurrent sketch generator with greater capacity for longer sketches, while producing scalable high-resolution results. We report qualitative and quantitative results on the Quick, Draw! benchmark.
Keywords:
Sketch generation, Scalable graphics, Bézier curve
Fig. 1: Left: SketchRNN [8] generates sketches by sampling waypoints (red dots), which lead to coarse images upon zoom. Right: Our BézierSketch samples smooth curves (green control points), thus providing scalable vector graphic generation.
Generative neural modeling of images [6,12] is now an established research area in contemporary machine learning and computer vision. Rapid progress has been made in generating photos [11,24], with effort being focused on fidelity, diversity, and resolution of image generation, along with stability of training; as well as sequential models for text and video [2,31]. Generative modeling of human sketches in particular has recently gained interest, along with other applications of sketch analysis such as recognition [34,33], retrieval [28,21,4] and forensics [13], all facilitated by the growth of large-scale sketch datasets [8,28].

Sketch generation provides an excellent opportunity to study sequential generative models, and is particularly fascinating due to the potential to establish links between learned generative models and human sketching, a communication modality that comes innately to children and has existed for millennia. Recent breakthroughs in this area include SketchRNN [8], which provided the first neural generative sequential model for sketch images, and Learn2Sketch [30], which provided the first conditional image-to-sequential-sketch model. While conventional image generation models focus on producing ever-larger pixel arrays in high fidelity, these methods aim to model sketches using a more human-like representation consisting of a collection of strokes.

SketchRNN [8], the landmark neural sketch generation algorithm, treats sketches as a digitized sequence of 2D points on a drawing canvas, sampled along the trajectory of the ink flow. This model of sketches has several issues, however. It is inefficient, due to the dense representation of redundant information such as highly correlated temporal samples; and, as sketches are ultimately pixels on a grid, it is prone to sampling noise. Crucially, it provides limited graphical scalability: SketchRNN sets out to achieve vector graphic generation (and claims to achieve this).
However, it does not generate truly scalable vector graphics as required by applications such as digital art. Since generated sketches are composed of dense line segments, its samples are only somewhat smoother than raster graphics (Fig. 1). Finally, it suffers from limited capacity. Because it models sketches as a sequence of pixels, it is limited in the length of sketch it can model before the underlying recurrent neural network begins to run out of capacity.

In this paper we propose a fundamental paradigm change in the representation of sketches that enables the above issues to be addressed. Specifically, we aim to represent sketches in terms of parameterized smooth curves [27]. These provide a scalable representation of a finite-length curve using few control points. From a large family of parametric curves, we choose Bézier curves due to their simple structure. In order to train a generative model of human sketches with this representation, the key question is how to encode human sketches as parameterized curves. To this end, a key technical contribution is a vision-as-inverse-graphics [14,26,5] approach that learns to embed human sketch strokes as interpretable parameterized Bézier curves. We train BézierEncoder in an inverse-graphics manner by learning to reconstruct strokes through a white-box graphics (Bézier) decoder. Given this new low-dimensional stroke representation, we then train BézierSketch to generate sketches. Our stroke-level generative model requires many fewer iterations than the segment-level SketchRNN, and thus provides better generation of longer sketches, while providing high-resolution scalable vector-graphic sketch generation (Fig. 1).
In summary, the contributions of our work are: (1) BézierEncoder, a novel inverse-graphics approach for mapping strokes to parameterized Bézier curves, and (2) BézierSketch, a sequential generative model for sketches that produces high-resolution, low-noise vector graphic samples with improved scalability to longer sketches compared to the previous state of the art, SketchRNN.
Parameterized Curves
Bézier curves are a powerful tool in the field of computer graphics and are extensively used in interactive curve and surface design [27], as are a more general family of curves known as splines [3]. Optimization algorithms to fit Bézier curves and splines from data have been studied, and a few specially crafted algorithms exist specifically for cubic Bézier curves [29,20]. However, the challenge for most curve- and spline-fitting methods is the existence of latent variables $t$ that associate training points with the location of their projection onto the curve. This leads to two-stage alternating algorithms that separately optimize the curve parameters (control points) and the latent parameters $t$ [17,22]. Importantly, such methods [17,22], including some promising ones [35], require expensive per-sample alternating optimization, or iterative inference in expensive generative models [25,15], which makes them unsuitable for large-scale or online applications. In contrast, we uniquely take the approach of learning a neural network that maps strokes to Bézier curves in a single shot. This neural encoder is a model that needs to be trained, but unlike per-sample optimization approaches, it is inductive. So once trained, it can provide one-shot estimation of curve parameters and point association from an input stroke.

Generative Models
Generative models have been studied extensively in the machine learning literature, often in terms of density estimation with directed [23,1] or undirected [10] graphical models. Research in this field accelerated after the emergence of Generative Adversarial Networks (GANs) [6], Variational Autoencoders (VAEs) [12] and their derivatives. Handling sequences is of particular importance, and hence specialized algorithms [2,31] were developed. Although RNNs have been successfully used for generating handwriting [7] without variational training, these methods lacked flexibility in terms of generation quality. The fusion of RNNs with a variational objective, enabled by the emergence of VAEs and variational training methods, led to the first successful generative sequence model [2] in the domain of Natural Language Processing (NLP). It was quickly adapted by SketchRNN [8] in order to extend [7] to free-hand sketches.
Inverse Graphics

"Inverse graphics" is a line of work that aims to estimate 3D scene parameters from raster images without supervision; instead, it predicts the input parameters of a computer graphics pipeline that can reconstruct the image. Several attempts were made [26,14] to estimate explicit model parameters of 3D objects from raw images. A specialized case of the generic inverse-graphics idea is to estimate parameters of 2D objects such as curves. As a recent example, an RNN-based agent named SPIRAL [5] learned to draw characters in terms of pen and brush curves. SPIRAL, however, is extremely costly due to its reliance on policy-gradient [32] reinforcement learning training and a black-box renderer.
Learning for Curves
Few works have studied learning for curve generation. The recent SVG Font Generator [18] trains an excellent font embedding with a recurrent vector font image generator. However, it is trained with supervision rather than inverse graphics, and is limited to the more structured domain of font images. Other attempts [16] also use supervised learning on synthetic data, rather than the unsupervised learning on real human sketches that we consider here.
Background: Conventional Sketch Representation and Generation
A common format [8] for a digitally acquired sketch $S$ is as a sequence of 2-tuples, each containing a 2D coordinate on the canvas sampled from a continuous drawing flow, and a pen-state bit denoting whether the pen touches the canvas or not:

$$S = \big[(X_i, q_i)\big]_{i=1}^{L} \tag{1}$$

where $X_i \triangleq [x\ y]_i^T \in \mathbb{R}^2$, $q_i \in \{\text{PenUp}, \text{PenDown}\}$, and $L$ is the cardinality of $S$, representing the length of the sketch. The state-of-the-art sketch generator SketchRNN [8] learns a parametric Recurrent Neural Network (RNN) to model the joint distribution of coordinates and pen state as a product of conditionals, i.e. $p_{\text{sketchrnn}}(S;\theta) = \prod_{i=1}^{L} p(X_i, q_i \mid X_{<i}, q_{<i})$.
We are interested in moving from such a segment-level representation toward a stroke-level one. To this end we modify the structure of our input data to $\bar{S} \triangleq \big[T_j\big]_{j=1}^{N}$, with $T_j \triangleq \big[X_i^{(j)}\big]_{i=1}^{N_j}$, where $T_j$ is the $j$-th stroke of length $N_j \triangleq |T_j|$, segregated from the sketch by following the pen-state bit, and consequently $\sum_{j=1}^{N} N_j = L$.

Towards a Stroke-Level Generative Model
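As a concrete illustration of this stroke segregation, the following toy sketch (function and variable names are ours, with the pen-state bit encoded as 1 for PenDown) splits a sketch in the format of Eq. 1 into strokes $T_j$:

```python
# Hypothetical illustration: splitting a sketch S = [(X_i, q_i)] into
# per-stroke point lists T_j by following the pen-state bit.

def split_into_strokes(sketch):
    """sketch: list of ((x, y), pen_down) tuples; returns list of strokes."""
    strokes, current = [], []
    for point, pen_down in sketch:
        current.append(point)
        if not pen_down:          # pen lifted: current stroke T_j ends here
            strokes.append(current)
            current = []
    if current:                   # trailing stroke without a final PenUp
        strokes.append(current)
    return strokes

sketch = [((0, 0), 1), ((1, 1), 1), ((2, 0), 0), ((5, 5), 1), ((6, 6), 0)]
strokes = split_into_strokes(sketch)
# The stroke lengths N_j sum to the sketch length L, as in the text.
assert sum(len(t) for t in strokes) == len(sketch)
```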
Existing generative sketch models [8,30] generate a segment at each iteration. Given a stroke-segmented training set $\bar{S}$, we would like to train a generative model analogous to SketchRNN, that is, to model the distribution over possible sketches with a parametric model $p_{\text{model}}(\bar{S};\theta)$ that approximates the original data distribution $p_{\text{data}}(\bar{S})$. Different sketches having different lengths $N$ makes this problem suitable for Recurrent Neural Networks (RNNs). One could model the probability of a sketch as a product of the probabilities of individual strokes $T_j$ conditioned on all previously seen strokes $T_{<j}$. Rather than modeling raw strokes directly, we learn an embedding $e_j$ for each stroke and a corresponding non-parametric decoder $d(\cdot)$ such that $T_j \approx d(e_j)$. We then model the encoded sketch $e(\bar{S}) \triangleq \{e_j\}_{j=1}^{N}$ as

$$p_{\text{model}}\big(e(\bar{S});\theta\big) = \prod_{j=1}^{N} p(e_j \mid e_{<j}) \tag{2}$$

Fig. 2: (a) An example of a Bézier curve of degree $n = 3$ with $n+1$ control points. (b) Bézier curves with Gaussian noise ($\mu = 0$, $\Sigma = 5I$) added to the control points produce similar curves in image space.

Inverse Graphics Decoder

Bézier curves, used heavily in computer graphics, are smooth curves representable in a closed functional form, parameterized by a sequence of $n+1$ anchor coordinates $P \triangleq [P^x\ P^y]^T \in \mathbb{R}^2$ termed control points. A degree-$n$ Bézier curve with control points $[P_0, P_1, \cdots, P_n]$ is represented as

$$C(t; \{P_i\}) = \sum_{i=0}^{n} B_{i,n}(t) \cdot P_i \tag{3}$$

where $t \in [0,1]$ is the parameter of the curve, $B_{i,n}(t) \triangleq \binom{n}{i} t^i (1-t)^{n-i}$ is the Bernstein basis polynomial in $t$, and $C(t) \triangleq [C^x(t)\ C^y(t)]^T \in \mathbb{R}^2$ denotes the point on the curve at parameter value $t$.
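Eq. 3 is straightforward to transcribe; the following minimal NumPy sketch (names are ours, not the paper's) evaluates a Bézier curve from its control points via the Bernstein basis:

```python
# A minimal sketch of Eq. (3): C(t) = sum_i B_{i,n}(t) * P_i, where
# B_{i,n}(t) = C(n, i) t^i (1 - t)^(n - i) is the Bernstein basis.
import numpy as np
from math import comb

def bezier(control_points, ts):
    """control_points: (n+1, 2) array; ts: (T,) parameters in [0, 1].
    Returns the (T, 2) array of curve points C(t)."""
    P = np.asarray(control_points, dtype=float)
    n = len(P) - 1
    ts = np.asarray(ts, dtype=float)
    # Bernstein basis matrix of shape (T, n+1)
    B = np.stack([comb(n, i) * ts**i * (1 - ts)**(n - i)
                  for i in range(n + 1)], axis=1)
    return B @ P

P = [(0, 0), (10, 20), (30, 20), (40, 0)]        # cubic: n = 3
curve = bezier(P, np.linspace(0, 1, 11))
# The curve starts at P_0 and ends at P_n, as stated in the text.
assert np.allclose(curve[0], P[0]) and np.allclose(curve[-1], P[-1])
```

Because the decoder $d(\cdot)$ is just this closed form, it can be re-evaluated at any density of $t$-values, which is what makes the representation resolution-free.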
As $t$ goes from 0 to 1, the curve starts from $P_0$ and ends at $P_n$, and the intermediate control points $[P_1, \cdots, P_{n-1}]$ control the trajectory of the curve, as illustrated in Fig. 2(a). We further use $\mathcal{P}^n \triangleq [P_0^x, P_0^y, \cdots, P_n^x, P_n^y] \in \mathbb{R}^{2(n+1)}$ to denote elements (curves) in the continuous space of $n+1$ control points. The decoder function $d: \mathcal{P} \to T$ can be trivially realized by Eq. 3, with the set of $t$-values chosen as per the resolution requirement.

We now denote $(T, \mathcal{P})$ as an arbitrary stroke and its Bézier representation, where we have dropped the subscript $j$ and superscript $n$ for notational brevity. Using $\mathcal{P}$ as an embedding space for $T$ leads to an extremely useful and key property: given a choice of $n$, two similar points in $\mathcal{P}$ space correspond to similar strokes in $T$ space. As a consequence, we can sample from the conditionals in Eq. 2 to generate variations of a stroke.

Property 1. Given a $(T, \mathcal{P})$ pair where $T = d(\mathcal{P})$, if we sample $\widehat{\mathcal{P}} \sim \mathcal{N}(\mathcal{P}, \sigma)$, then the decoded $\widehat{T} = d(\widehat{\mathcal{P}})$ is distributed as $\mathcal{N}(T, \sigma')$.

Proof. Refer to Appendix A in the supplementary document for the proof. Illustrative examples are given in Fig. 2(b).

A Stroke-to-Bézier Encoder

We wish to learn an embedding function $e(\cdot)$ that will map a given stroke $T$ to its best-fit Bézier representation $\mathcal{P}$. Due to the variable length of strokes $T$, we model BézierEncoder with a bi-directional RNN, with forward and backward states $\overrightarrow{s}_i, \overleftarrow{s}_i \in \mathbb{R}^h$ at time-step $i$:

$$[\overrightarrow{s}_i, \overleftarrow{s}_i] = \text{BiRNN}(X_{i-1}, s_{i-1}; \theta) \tag{4}$$

However, unlike regular encoder RNNs, we further transform the last hidden state to get a Bézier curve representation:

$$\mathcal{P} = W_{\mathcal{P}} \big[\overrightarrow{s}_{\text{end}}; \overleftarrow{s}_{\text{end}}\big] \tag{5}$$

where the 'end' subscript denotes the state of the RNN at the last time-step, $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operator, and $W_{\mathcal{P}} \in \mathbb{R}^{2(n+1) \times 2h}$. The formulation so far enables extracting a curve $\mathcal{P}$ from data $T$.
However, while $\mathcal{P}$ is now a sufficient representation to decode the Bézier curve by means of Eq. 3, we do not have sufficient information to compute a reconstruction loss like $\|T - d(e(T))\|^2$, because we lack the association between input coordinates $X_i$ and interpolation parameters $t_i$. This is where many classic Bézier fitting techniques [17,35] resort to slow alternating optimization.

We take a different approach and ask our encoder to also predict the corresponding interpolation parameter $t_i$ for each input point $X_i$. In order to make valid predictions for $t$, we note the properties it requires due to its role in Bézier curve generation: $0 \leq \widehat{t}_i \leq 1$, and $\widehat{t}_i \leq \widehat{t}_{i+1}$ (due to the sequential nature of $X_i$). Apart from these, we impose another property without any loss of generality: $t_0 = 0$ and $t_{\text{end}} = 1$ (this will make $X_0$ and $X_{\text{end}}$ coincide with $P_0$ and $P_n$ respectively). Please refer to the experiment section for an implementation trick to do so.

Fig. 3: Inverse-graphics training of our BézierEncoder architecture for model-based single-pass stroke $[X_i]$ to Bézier $\mathcal{P}$ mapping.

To enable our encoder to meet the requirements above, we do not compute the $t_i$ directly, but instead compute increments $\Delta t_i \triangleq t_i - t_{i-1}$ (with $t_0 \triangleq 0$) from $[\overrightarrow{s}_i; \overleftarrow{s}_i]$ at every step $i$. The $t_i$-values can then be easily computed as a cumulative sum of all $\Delta t_{i'}$ up to $i$. Thus, the second path of our encoder predicts

$$\widehat{t}_i = \sum_{i'=1}^{i} \widehat{\Delta t}_{i'}, \quad \text{with} \quad \widehat{\Delta t}_i = \text{Softmax}_i\big(W_t \cdot [\overrightarrow{s}_i; \overleftarrow{s}_i]\big) \tag{6}$$
The usage of Softmax(·) enforces all three requirements stated above. To summarize: our full architecture, as shown in Fig. 3, has two pathways: a Bézier embedding pathway that predicts the curve $\mathcal{P}$ for the entire stroke input $T$, and an interpolation-parameter pathway that further predicts the estimated curve parameter $\widehat{t}_i$ for each input point $X_i$ in $T$. Given the $(X_i, \widehat{t}_i)$ pairs and $\mathcal{P}$ predicted by our encoder, we can now train our model with the following reconstruction loss:

$$\mathcal{L}(\theta, W_{\mathcal{P}}, W_t) \triangleq \sum_i \big\| C(\widehat{t}_i, \mathcal{P}) - X_i \big\|^2 \tag{7}$$

which is optimized w.r.t. the encoder parameters $\{\theta, W_{\mathcal{P}}, W_t\}$ by SGD. Once trained, we can compute the best-fit Bézier curve for any stroke using Eq. 5, which provides a feed-forward single-pass solution to a typically alternating optimization.

A Multi-Degree Representation Extension

To add more flexibility, we can extend this basic building block to learn a multi-degree representation of a given stroke $T$. To do so, we encode the stroke using the same RNN in Eq. 4 parameterized by $\theta$, but use a set of different $W_{\mathcal{P}^n}$ and $W_t^n$ for a predefined range of degrees $n \in [n_{\min}, \cdots, n_{\max}]$ to predict Bézier representations of different degrees along with their corresponding $t_i^n$-values:

$$\widehat{t}_i^n = \sum_{i'=1}^{i} \widehat{\Delta t}_{i'}^n, \quad \text{with} \quad \widehat{\Delta t}_i^n = \text{Softmax}_i\big(W_t^n \cdot [\overrightarrow{s}_i; \overleftarrow{s}_i]\big) \quad \text{and} \quad \mathcal{P}^n = W_{\mathcal{P}^n} \big[\overrightarrow{s}_{\text{end}}; \overleftarrow{s}_{\text{end}}\big] \tag{8}$$

The total loss is now the sum of the losses at every degree $n$:

$$\mathcal{L}_{\text{total}} \triangleq \sum_{n=n_{\min}}^{n_{\max}} \mathcal{L}^n, \quad \text{with} \quad \mathcal{L}^n(\theta, W_{\mathcal{P}^n}, W_t^n) \triangleq \sum_i \big\| C(\widehat{t}_i^n, \mathcal{P}^n) - X_i \big\|^2 \tag{9}$$

Inference in this model can now predict a set of Bézier representations of different degrees, where higher-degree curves fit the data better at the cost of more control points.
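As a toy numerical check of the encoder's second pathway and training objective, the NumPy sketch below (all names ours) implements the cumulative-softmax parameterization of Eq. 6 and the reconstruction loss of Eq. 7. Arbitrary logits stand in for $W_t \cdot [\overrightarrow{s}_i; \overleftarrow{s}_i]$, and the paper's implementation trick for pinning $t_0 = 0$ is omitted here:

```python
import numpy as np
from math import comb

def t_from_logits(logits):
    """Eq. 6: softmax over time-steps followed by a cumulative sum,
    which enforces 0 <= t_i <= 1 and t_i <= t_{i+1}."""
    e = np.exp(logits - logits.max())      # numerically stable softmax
    dt = e / e.sum()                       # each Δt_i >= 0; all sum to 1
    return np.cumsum(dt)                   # t_i = sum of Δt up to i

def bezier(P, ts):
    """Eq. 3: C(t) = sum_i B_{i,n}(t) P_i (re-stated for self-containment)."""
    P = np.asarray(P, float); n = len(P) - 1; ts = np.asarray(ts, float)
    B = np.stack([comb(n, i) * ts**i * (1 - ts)**(n - i)
                  for i in range(n + 1)], axis=1)
    return B @ P

def reconstruction_loss(P, t_hat, X):
    """Eq. 7: sum_i || C(t_hat_i, P) - X_i ||^2."""
    return float(np.sum((bezier(P, t_hat) - np.asarray(X, float)) ** 2))

t_hat = t_from_logits(np.array([0.3, -1.2, 0.8, 0.1, 2.0]))
assert np.all(np.diff(t_hat) >= 0)         # monotonically non-decreasing
assert np.isclose(t_hat[-1], 1.0)          # the last value is exactly 1

P = [(0, 0), (5, 10), (10, 0)]
X = bezier(P, t_hat)                       # points lying exactly on the curve
assert np.isclose(reconstruction_loss(P, t_hat, X), 0.0)
assert reconstruction_loss(P, t_hat, X + 1.0) > 0   # any offset raises the loss
```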
The preferred degree can then be chosen manually according to user requirements, or automatically by a heuristic. An effective heuristic is to evaluate the loss $\mathcal{L}^n$ for all $n$ and choose the smallest $n$ for which $\mathcal{L}^n \leq \mathcal{L}_{\text{tolerance}}$.

Smoothness Regularizer

Our training objectives (Eq. 7 or Eq. 9) may lead to overfitting in the domain of Bézier curves during encoder learning. To avoid this, we add a smoothness regularizer (with regularization strength $\beta$) that prefers sequential control points to be nearby. Specifically, we add $\beta \cdot R^n$ to $\mathcal{L}^n$ for each $n$, where

$$R^n(\mathcal{P}^n) \triangleq \sum_{i=0}^{n-1} \| P_{i+1} - P_i \|^2$$

We next leverage our choice of Bézier representation space and encoding model $\mathcal{P} = e(\cdot)$ to define two alternative vector graphic generative models for sketches.

Control Point mode

Given a sketch as a sequence of stroke embeddings $\{\mathcal{P}_j\}_{j=1}^{N}$ obtained from the raw input strokes as $\mathcal{P} = e(T)$, we can modify the original data structure in Eq. 1 and substitute the set of absolute coordinates of every stroke with the set of control points of its Bézier representation. The modified sketch $S_{cp}$ is

$$S_{cp} = \Big[\big(P_0^{(j)}, q_0^{(j)}\big), \cdots, \big(P_i^{(j)}, q_i^{(j)}\big), \cdots, \big(P_{n_j}^{(j)}, q_{n_j}^{(j)}\big)\Big]_{j=1}^{N} \tag{10}$$

When encoded this way by our Bézier encoder, each sketch is represented by a (mostly much) shorter list of parametric control points rather than the original long list of coordinates. In this format, different strokes can have different degrees, as indicated by the use of $n_j$ above. Given this sequential representation of a sketch dataset, we can now train a generative sketch model. Since $S_{cp}$ is structurally the same as the original $S$, apart from its length and the interpretation of its coordinates, we can re-use exactly the same architecture and training procedure as SketchRNN [8].
We use a variational sequence-to-sequence autoencoder [31] with a latent vector encoding the whole sketch. Thus one sketch is encoded first to a list of Bézier curves, and then to a latent vector in the SketchRNN architecture; and decoded first to a list of curve parameters, and then rendered by the Bézier renderer. Please refer to Appendix B for a brief review of the SketchRNN architecture in the context of our problem.

Stroke mode

Given a sketch $S$ as a set of strokes $\{T_j\}_{j=1}^{N}$, we transform it into $S_{st} = \{\mathcal{P}_j\}_{j=1}^{N}$ where $\mathcal{P}_j = e(T_j)$. We model the whole sketch using a sequence-to-sequence autoencoder, where each time-step processes one stroke represented as a fixed-degree Bézier curve. We use a bi-directional RNN to encode the whole sketch stroke by stroke. The hidden states (forward and backward) of the encoder $\overrightarrow{h}_j, \overleftarrow{h}_j$ at time-step $j$ are given by

$$[\overrightarrow{h}_j, \overleftarrow{h}_j] = \text{BiRNN}(\mathcal{P}_{j-1}, h_{j-1}; \Theta)$$

A latent vector $z \in \mathbb{R}^{N_z}$ encoding the whole sketch is sampled using the parameters of a Gaussian distribution computed from the last hidden states:

$$z \sim \mathcal{N}\big(\mu_z, \text{diag}(\sigma_z)\big), \quad \text{with} \quad [\mu_z, \sigma_z] = f\big([\overrightarrow{h}_N; \overleftarrow{h}_N]; \Theta\big)$$

A unidirectional decoder RNN is initialized with $z$ and models the probability of the $j$-th stroke embedding conditioned on the hidden state $g_j \in \mathbb{R}^{H_d}$:

$$p(\mathcal{P}_j \mid g_j; \Theta) = \text{GMM}\Big(\mathcal{P}_j; \big\{\mu_j^m(g_j), \Sigma_j^m(g_j), \pi_j^m(g_j)\big\}_{m=1}^{M}\Big), \quad g_j = \text{DecoderRNN}\big([\mathcal{P}_{j-1}; z], g_{j-1}; \Theta\big) \tag{11}$$

where $\{\mu_j^m, \Sigma_j^m, \pi_j^m\}$ are the parameters of the $M$-component GMM for the $j$-th stroke. For computational efficiency, we consider diagonal $\Sigma_j^m$, and by definition $\sum_m \pi_j^m = 1$. Given a trained model, we can sample from this distribution to generate similar $\mathcal{P}_j$, which will resemble its original-domain data $T_j$ as guaranteed by Property 1.
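To make the decoder's sampling step concrete, here is a toy sketch (all names and numbers ours; arbitrary stand-ins replace the network outputs $\mu$, $\Sigma$, $\pi$) of drawing one stroke embedding from an $M$-component diagonal-covariance GMM as in Eq. 11:

```python
# Hypothetical illustration of ancestral sampling from a diagonal-covariance
# Gaussian mixture: pick a component by its weight, then sample from it.
import numpy as np

def sample_gmm(mu, var, pi, rng):
    """mu, var: (M, D) means and diagonal variances; pi: (M,) mixture weights."""
    m = rng.choice(len(pi), p=pi)                  # component index ~ Cat(pi)
    return rng.normal(mu[m], np.sqrt(var[m]))      # sample ~ N(mu_m, diag(var_m))

rng = np.random.default_rng(0)
M, D = 10, 20                  # 10 components; a degree-9 curve has 2(9+1)=20 dims
mu = rng.normal(size=(M, D))
var = np.full((M, D), 0.1)
pi = np.full(M, 1.0 / M)       # in the model, Softmax guarantees sum(pi) = 1
stroke_embedding = sample_gmm(mu, var, pi, rng)
assert stroke_embedding.shape == (D,)
```

With diagonal covariances each dimension is sampled independently, which is the design choice the text justifies via Property 1.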
Along with $\mathcal{P}_j$ at every step $j$, we also predict a stop bit $\widehat{b}_j \in [0, 1]$, whose ground truth is $b_j \triangleq \mathbb{1}(j = N)$. The sketch generator is trained with the following objective function:

$$\mathcal{L}\big(\{\mathcal{P}_j\}_{j=1}^{N}; \Theta\big) = -\frac{1}{N_{\max}} \sum_{j=1}^{N} \log \text{GMM}\Big(\mathcal{P}_j \mid \big\{\mu_j^m, \Sigma_j^m, \pi_j^m\big\}_{m=1}^{M}; \Theta\Big) - \frac{1}{N_{\max}} \sum_{j=1}^{N} b_j \log \widehat{b}_j - \frac{1}{2N_z} \sum_{i=1}^{N_z} \big(1 + \sigma_z^i - (\mu_z^i)^2 - \exp(\sigma_z^i)\big) \tag{12}$$

The first two terms of $\mathcal{L}$ are the log-likelihood of the sequence $\{\mathcal{P}_j\}_{j=1}^{N}$ under the model and the loss due to the stop bit, respectively. The third term is the KL-divergence loss that imposes a Gaussian prior on the latent code $z$. The diagonal entries of $\Sigma_j^m$ are passed through $\exp(\cdot)$ to make them non-negative, and $\text{Softmax}(\cdot)$ is used to ensure $\sum_m \pi_j^m = 1$.

Dataset

Quick, Draw! is a large sketch dataset [8] collected as part of an online game, in which thousands of people around the world drew a given category within a time limit. Due to the problem definition and the structure of data used by our framework (see Eq. 1), Quick, Draw! is the most suitable dataset to validate it. Different versions of the dataset use different sampling rates at which the sketches are stored as point sequences. SketchRNN is known to work well only on data with a lower sampling rate (i.e., where $\mathbb{E}_T[|T|]$ is lower) than the raw data recorded (where $\mathbb{E}_T[|T|]$ is higher). Due to the fixed length of Bézier representations, our framework can adapt to data with both high and low sampling rates without any modification. Although our method generalizes across all categories, we experimented with a few categories to validate our claims.

Our framework has two main components: embedding each stroke into its Bézier representation, and training a generative model with the encoded sketches in either control point mode or stroke mode.
As our BézierEncoder is a key contribution, we validate it in isolation before comparing our whole BézierSketch framework to SketchRNN [8]. We created a dataset of all strokes from all sketches in a category of Quick, Draw! in order to train the stroke embedding model described in Section 3.1. We adopted some tricks that made the training and representation more efficient in practice. We normalized all strokes to start from the origin (i.e., $X_0 = [0, 0]^T$). Furthermore, we assumed that the first control point $P_0$ of a Bézier representation is always aligned to the first absolute coordinate of the stroke (i.e., $X_0 = P_0$). Given these design choices, we can ignore the first control point (fixing it to the origin) and only predict successive differences of control points (i.e., $\Delta P_1 \triangleq P_1 - P_0$, $\Delta P_2 \triangleq P_2 - P_1$, and so on), decoding $P_i$ as $P_i = \sum_{i'=1}^{i} \Delta P_{i'}$ while evaluating the loss in Eq. 7. We chose the hidden state dimension $h = 256$, and $n_{\min} = 3$, $n_{\max} = 9$ for learning the multi-degree Bézier representation. To exclude over-complicated strokes, we apply heuristics to split a stroke into two or more parts, based on two criteria: (1) every part is within a maximum length, and (2) every part has only one sharp bend (determined by computing its curvature at a given point). We set the regularizer weight $\beta = 10^{-2}$.

Results

We first qualitatively demonstrate the results of inferring Bézier representations of input strokes. Fig. 4 (top left) shows fitting results for various curve degrees (columns), showing variable amounts of detail being captured at different degrees. It also shows fitting examples at both low (above) and high (below) sampling rates, confirming that our encoder can adapt to both. We next qualitatively illustrate the training dynamics of our model via the fit estimated as training progresses. The results in Fig. 4 (middle) show the estimated fit during training in terms of the Bézier curve (red) and control points (green) for a stroke defined by the (blue) points.

Fig. 4: Evaluating our BézierEncoder. (Top left) Learned representations of multi-degree ($n = 3$ to $n = 9$) Bézier stroke embedding; top and bottom rows contain moderate and high sampling rates respectively. (Top right) Test loss for the Cat, Bird, Pig, Clock, Butterfly and Mosquito categories when trained on the same category vs. trained on "Cat", demonstrating transferability of the encoder. (Middle) Visualizing training dynamics. Blue: stroke to fit. Red and green: Bézier curve and control points. Cyan: estimated point correspondence. (Bottom) Examples of full sketches and their learned Bézier representation.

Recall that our encoder also predicts the interpolation parameters $t$ that match each input point to a location on the curve. These correspondences are indicated in cyan. Clearly, both the fit and the estimated correspondences improve with training iterations. Refer to Appendix C in the supplementary document for a similar visualization of more samples. Given that our training data is grouped into categories, we next verify that our encoder indeed learns a generic Bézier embedding and is not overfitted to a specific category. Specifically, we compare the test loss for reconstructing data of each category when the encoder is trained on the same category as testing vs. trained on a category disjoint from testing. The results in Fig. 4 (top right) show that the embedding generalizes quite well to categories it was not trained on. Finally, Fig. 4 (bottom) shows examples of full sketches encoded by our encoder and then decoded as Béziers. We can see that the encoded sketches reflect the input, but are smoother and cleaner.
In control point mode, a fully trained multi-degree embedding model is used to restructure all sketches in our dataset as $S_{cp}$. We set $\mathcal{L}_{\text{tolerance}} = 10^{-2}$ to select the best $n$. We then train a SketchRNN-like model [8] using the restructured data. As data augmentation, we add 2D standard normal noise to all control points. Sampling from the latent space and decoding with the decoder generates a sequence of control points and stroke/sketch ending bits. Treating one entire stroke as a set of control points, we can then draw it on a canvas using Eq. 3 at any required level of granularity.

In stroke mode, we encode each stroke with a fixed degree of $n = 9$. Very similarly to control point mode, we use a Bi-LSTM to encode the whole sketch stroke by stroke and extract an $N_z$-dimensional latent vector. Conditioned on the latent vector, the decoder produces the Bézier representation $\mathcal{P}$ of one stroke at each time-step. Thus, the length of a sketch coincides with the number of strokes present in the sketch. At each step of the decoder, we sample one stroke from $p(\mathcal{P}_j \mid g_j, \Theta)$, which is modeled as a GMM with $M = 10$ mixture components. However, unlike control point mode and its corresponding SketchRNN-like architecture, we do not use a correlation parameter in the constituent Gaussians. This design choice makes the individual dimensions of the Gaussians independent, sampling from which is justified by Property 1. Apart from $\mathcal{P}_j$, we predict one more quantity in practice: the start location $v_j \triangleq (v^x, v^y)_j^T$ of the stroke w.r.t. the whole sketch. The need for $v_j$ arises from the practical consideration of relocating the start of each individual stroke to the origin while encoding.

Results

Qualitative results of unconditionally generated sketch samples from both our model variants are shown in Fig. 5(a). We can see that, similarly to SketchRNN, BézierSketch generates diverse and plausible samples.
However, uniquely, our samples are high-resolution vector graphic sketches. Fig. 5(b) also shows examples of conditional samples, where the right group of three images in each set is sampled conditioned on the encoding of the left sketch.

Fig. 5: Qualitatively evaluating BézierSketch. (a) Samples drawn unconditionally in control point mode (left half) and stroke mode (right half). (b) Sketch samples generated by conditioning on the first sketch (double-bordered) in each set.

The use of Bézier curves as the stroke representation significantly reduces the average length of a given stroke's representation and, as a direct consequence, the description length of whole sketches as well. In Fig. 6, we compare the length histograms of the original data and its Bézier representation at both stroke and sketch level, confirming that Bézier encodings are systematically shorter (left). The same holds for strokes and sketches sampled by vanilla SketchRNN and BézierSketch respectively (right). This property of shorter representations for any given sketch means that our generator should have an advantage in modeling longer sketches compared to vanilla SketchRNN, since it only needs to model shorter sequences. To evaluate this, we use a modified Fréchet Inception Distance (FID) [9] score to compare the generated samples from both models. We first trained both our generator model and SketchRNN on the entire dataset (of each category). We then create a subset of sketches whose original length is $l \pm 20$ and use them to generate samples. All original and generated samples are rendered on a canvas and projected down to a concise feature vector using a pre-trained Sketch-a-Net 2.0 [33] classifier. We compute the empirical mean and covariance of both real and generated samples as $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$, and then estimate the modified FID as:

$$\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$$

The results in Fig.
7 plot the modified FID score against increasing length value $l$ for both SketchRNN and our model on each category of sketches. We can see that our model achieves an improved (lower) FID score, especially for longer sketches. This is illustrated qualitatively in Fig. 7, where we can see that for longer sketches our framework produces much more reliable reconstructions than SketchRNN, which fails to make reasonable reconstructions in these cases.

Other applications

Although crafted with sketches in mind, our framework can be adapted to other applications such as handwriting generation (in line with the work of [7]) with little to no modification. In fact, any 2D sequence data with a two-level hierarchical representation (e.g., stroke and sketch) can be modeled using the same framework. Online handwritten characters are composed of relatively short strokes, which we model with Bézier curves. We use the online handwritten sentences from the IAM handwriting database [19], embed the constituent strokes with our Bézier representation, and train our generative model for words. Fig. 8 shows qualitative samples from our resulting word generator.

Fig. 6: Stroke/sketch length histograms for the original data (left) and generated samples (right). Bézier encodings are shorter sequences than the raw data.

Fig. 7: Left: FID score ($\downarrow$) vs. sketch length for SketchRNN, Ours (control point) and Ours (stroke) on the Cat, Pig, Butterfly, Clock, Bird and Mosquito categories, showing the effectiveness of our generative model on longer sketches.
Right: Qualitative samples of long sketches. The three columns show the original sketch, the SketchRNN reconstruction, and our BézierSketch reconstruction.

In this paper we presented an inverse graphics approach to training an efficient, model-based, single-pass stroke-to-Bézier encoder via reconstruction through a Bézier decoder. This approach surpasses conventional fitting-based methods in both quality and efficiency. Furthermore, it enabled us to advance generative sketch models by generating sketches as sequences of parameterized curves rather than pixels, leading to arbitrary-resolution, scalable vector graphic samples. This new representation also enables better generation of longer sketches than the existing state of the art. In future work we will investigate extending to more complex parameterized curves such as B-splines, and developing an encoder that predicts curves directly from rasterized images.

Fig. 8: Unconditionally generated handwritten words from the IAM database.

Supplementary material for BézierSketch: A generative model for scalable vector sketches

Property 1. Given a $(\mathbf{T}, \mathbf{P})$ pair where $\mathbf{T} = d(\mathbf{P})$ for an arbitrary set of $t$, and $\widehat{\mathbf{P}} \sim \mathcal{N}(\mathbf{P}, \Sigma)$, the decoded $\widehat{\mathbf{T}} = d(\widehat{\mathbf{P}})$, with the same set of $t$, is distributed as $\mathcal{N}(\mathbf{T}, \Sigma')$, where $\Sigma$ and $\Sigma'$ are diagonal covariance matrices.

Proof.
As $\Sigma$ is diagonal, we can separate each dimension of $\mathcal{N}(\mathbf{P}, \Sigma)$ into individual Gaussians and then group the $x$-$y$ components of each control point into its own Gaussian with diagonal covariance $\Sigma_i \triangleq \begin{bmatrix} \sigma^2_{x_i} & 0 \\ 0 & \sigma^2_{y_i} \end{bmatrix}$:

$$\mathcal{N}(\mathbf{P}, \Sigma) = \prod_{i=0}^{n} \mathcal{N}(\mathbf{P}_i, \Sigma_i)$$

Drawing samples from the Gaussians of the individual control points gives $\widehat{\mathbf{P}} \triangleq \big[\widehat{\mathbf{P}}_i\big]_{i=0}^{n}$, where $\widehat{\mathbf{P}}_i \sim \mathcal{N}(\mathbf{P}_i, \Sigma_i)$. Decoding $\widehat{\mathbf{P}}$ by $d(\cdot)$ gives

$$\widehat{\mathbf{T}} = d(\widehat{\mathbf{P}}) = \sum_{i=0}^{n} B_{i,n}(t) \cdot \widehat{\mathbf{P}}_i \quad (1)$$

Given any value $t = t_0$, the random variable $\widehat{\mathbf{T}}$ is a weighted sum of $n+1$ independent Gaussian random variables with weights $[B_{i,n}(t_0)]_{i=0}^{n}$. Hence, $\widehat{\mathbf{T}}$ is distributed as

$$\widehat{\mathbf{T}} \sim \mathcal{N}\left(\sum_{i=0}^{n} B_{i,n}(t_0) \cdot \mathbf{P}_i,\ \sum_{i=0}^{n} B^2_{i,n}(t_0) \cdot \Sigma_i\right) \quad (2)$$

Now $\sum_{i=0}^{n} B_{i,n}(t_0) \cdot \mathbf{P}_i \triangleq \mathbf{T}$, and we denote $\sum_{i=0}^{n} B^2_{i,n}(t_0) \cdot \Sigma_i \triangleq \Sigma'$. So $\widehat{\mathbf{T}} \sim \mathcal{N}(\mathbf{T}, \Sigma')$. $\square$

Sketch-RNN [8] is considered the state of the art in generative modeling of free-hand vector sketches. Sketch-RNN models the consecutive differences of the 2D waypoints of a sketch, along with three bits denoting the "touching", "stroke-end" and "sketch-end" states of the pen. In the control point mode of BézierSketch, we adopted the same architecture and data representation as Sketch-RNN, but with control points instead of waypoints. Hence a sketch $S_{cp}$ is transformed into a list (of length $N$) of 5-tuples $s_i \triangleq (\Delta P_x, \Delta P_y, q_1, q_2, q_3)_i$, where $[\Delta P_x, \Delta P_y]^T \triangleq \Delta\mathbf{P}$ is the successive difference of control points and $(q_1, q_2, q_3) \triangleq \mathbf{q}$ are the three flag bits described above. As a normalization step, all sketches are assumed to start from the origin (i.e., $[0, 0]^T$).

The core model of Sketch-RNN is a Sequence-to-Sequence Variational Autoencoder (Seq2Seq-VAE) [31] with a standard sequence encoder and an autoregressive decoder.
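Property 1 can be checked numerically: sampling control points from per-point Gaussians and decoding with the Bernstein basis should yield curve points whose variance matches $\sum_i B^2_{i,n}(t_0)\,\Sigma_i$. A minimal NumPy sketch of this check (all function names are illustrative, not from the paper's code release):

```python
import numpy as np
from math import comb

def bernstein(n, t):
    """Bernstein basis values B_{i,n}(t) for i = 0..n."""
    i = np.arange(n + 1)
    binom = np.array([comb(n, k) for k in range(n + 1)])
    return binom * t**i * (1.0 - t)**(n - i)

def decode(P, t):
    """Bezier decoder d(P): the curve point at parameter t."""
    return bernstein(P.shape[0] - 1, t) @ P

# A cubic (n = 3) curve with per-control-point noise std sigma_i
P = np.array([[0., 0.], [1., 2.], [3., 2.], [4., 0.]])
sigma = np.array([0.05, 0.1, 0.1, 0.05])
t0 = 0.4
B = bernstein(3, t0)

# Monte Carlo: perturb control points, decode at t0, measure the spread
rng = np.random.default_rng(0)
samples = np.array([
    decode(P + rng.normal(0.0, 1.0, P.shape) * sigma[:, None], t0)
    for _ in range(20000)
])

pred_var = np.sum(B**2 * sigma**2)   # variance predicted by Property 1
assert np.allclose(samples.var(axis=0), pred_var, rtol=0.1)
assert np.allclose(samples.mean(axis=0), decode(P, t0), atol=0.01)
```

Note that the Bernstein weights enter the mean linearly but the variance through their squares, which is exactly why the decoded points remain Gaussian with the stated parameters.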
The whole sketch sequence is fed into a bidirectional encoder LSTM with hidden state

$$h_i \triangleq \left[\overrightarrow{h}_i;\ \overleftarrow{h}_i\right] = \text{Bi-LSTM}(s_i, h_{i-1}) \quad (3)$$

and the last state $h_N$ is used as a compact representation of the sketch. $h_N$ is then used to generate the parameters of a Gaussian distribution following the VAE framework [12]. A sample is drawn from this distribution as

$$z \sim \mathcal{N}(\mu, \sigma), \quad \text{where } [\mu, \sigma] = f(h_N),\ z \in \mathbb{R}^{Z}$$

and decoded by an autoregressive decoder. A unidirectional LSTM is initialized from $z$ and produces a reconstruction of the sketch sequence, similar to [7]. At each time step $j$ of the decoder, the hidden state is given by

$$g_j = \text{LSTM}([z;\ s_{j-1}],\ g_{j-1}), \quad \text{with } g_0 = \tanh(z)$$

At every time step, the decoder outputs the parameters of a GMM (with $M$ mixtures) over $[\Delta P_x, \Delta P_y]^T$, and a categorical distribution over the three flag bits discussed above. Samples from these distributions are fed back as the input $s_{j+1}$ at the next time step:

$$s'_j = \left(\Delta\mathbf{P}'_j,\ \mathbf{q}'_j\right), \quad \text{where } \Delta\mathbf{P}'_j \sim \text{GMM}(\Delta\mathbf{P};\ g_j) \text{ and } \mathbf{q}'_j \sim \text{Cat}(\mathbf{q};\ g_j) \quad (4)$$

The network is trained with the following loss, which comprises the log-likelihood of the GMM, the categorical cross-entropy of the flag bits, and a variational KL divergence term:

$$\mathcal{L} = -\frac{1}{N_{max}}\sum_{j=1}^{N}\log\text{GMM}(\Delta\mathbf{P}'_j) - \frac{1}{N_{max}}\sum_{j=1}^{N_{max}}\mathbf{q}_j\log\mathbf{q}'_j - \frac{1}{2Z}\sum\left(1 + \sigma - \mu^2 - \exp(\sigma)\right) \quad (5)$$

We provide visualizations (see Fig. 1) of the optimization dynamics over time. We also annotate a discrete point of the stroke and its corresponding point on the Bézier curve, joining them with a connector.

Fig. 1: Visualization of intermediate stages of the fitting for the BézierEncoder network. Each row corresponds to one sample and columns denote increasing iterations of training.

References

1. Bishop, C.M.: Mixture density networks. Tech. rep., Aston University (1994)
2.
Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. In: CoNLL (2016)
3. De Boor, C.: A practical guide to splines, vol. 27. Springer-Verlag New York (1978)
4. Dey, S., Riba, P., Dutta, A., Lladós, J., Song, Y.Z.: Doodle to search: Practical zero-shot sketch-based image retrieval. In: CVPR (2019)
5. Ganin, Y., Kulkarni, T., Babuschkin, I., Eslami, S.M.A., Vinyals, O.: Synthesizing programs for images using reinforced adversarial learning. In: ICML (2018)
6. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
7. Graves, A.: Generating sequences with recurrent neural networks. CoRR abs/1308.0850 (2013)
8. Ha, D., Eck, D.: A neural representation of sketch drawings. In: ICLR (2018)
9. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NIPS (2017)
10. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science (5786), 504–507 (2006)
11. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
12. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)
13. Klare, B., Li, Z., Jain, A.: Matching forensic sketches to mug shot photos. IEEE Transactions on Pattern Analysis and Machine Intelligence (3), 639–646 (March 2011)
14. Kulkarni, T.D., Whitney, W., Kohli, P., Tenenbaum, J.B.: Deep convolutional inverse graphics network. In: NIPS (2015)
15. Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through probabilistic program induction. Science (6266), 1332–1338 (2015)
16. Laube, P., Franz, M.O., Umlauf, G.: Deep learning parametrization for B-spline curve approximation.
In: 3DV (2018)
17. Liu, Y., Wang, W.: A revisit to least squares orthogonal distance fitting of parametric curves and surfaces. In: GMP (2008)
18. Lopes, R.G., Ha, D., Eck, D., Shlens, J.: A learned representation for scalable vector graphics. In: ICCV (2019)
19. Marti, U.V., Bunke, H.: A full English sentence database for off-line handwriting recognition. In: ICDAR (1999)
20. Masood, A., Ejaz, S.: An efficient algorithm for robust curve fitting using cubic Bézier curves. In: ICIC (2010)
21. Pang, K., Li, K., Yang, Y., Zhang, H., Hospedales, T.M., Xiang, T., Song, Y.Z.: Generalising fine-grained sketch-based image retrieval. In: CVPR (2019)
22. Plass, M., Stone, M.: Curve-fitting with piecewise parametric cubics. In: SIGGRAPH (1983)
23. Rabiner, L., Juang, B.: An introduction to hidden Markov models. IEEE ASSP Magazine (1), 4–16 (1986)
24. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR (2016)
25. Revow, M., Williams, C.K.I., Hinton, G.E.: Using generative models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), 592–606 (1996)
26. Romaszko, L., Williams, C.K.I., Moreno, P., Kohli, P.: Vision-as-inverse-graphics: Obtaining a rich 3D explanation of a scene from a single image. In: ICCVW (2017)
27. Salomon, D.: Curves and surfaces for computer graphics. Springer Science & Business Media (2007)
28. Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The sketchy database: Learning to retrieve badly drawn bunnies. In: SIGGRAPH (2016)
29. Shao, L., Zhou, H.: Curve fitting with Bézier cubics. Graphical Models and Image Processing (3), 223–232 (1996)
30. Song, J., Pang, K., Song, Y., Xiang, T., Hospedales, T.M.: Learning to sketch with shortcut cycle consistency. In: CVPR (2018)
31.
Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML (2015)
32. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: NIPS (1999)
33. Yu, Q., Yang, Y., Liu, F., Song, Y.Z., Xiang, T., Hospedales, T.: Sketch-a-Net: A deep neural network that beats humans. International Journal of Computer Vision, 411–425 (2017)
34. Yu, Q., Yang, Y., Song, Y.Z., Xiang, T., Hospedales, T.: Sketch-a-Net that beats humans. In: BMVC (2015)
35. Zheng, W., Bo, P., Liu, Y., Wang, W.: Fast B-spline curve fitting by L-BFGS. Computer Aided Geometric Design 29