BézierSketch: A generative model for scalable vector sketches
Ayan Das, Yongxin Yang, Timothy Hospedales, Tao Xiang, Yi-Zhe Song
SketchX, CVSSP, University of Surrey, United Kingdom
{a.das, yongxin.yang, t.xiang, y.song}@surrey.ac.uk
iFlyTek-Surrey Joint Research Centre on Artificial Intelligence
University of Edinburgh, United Kingdom
[email protected]
Abstract.
The study of neural generative models of human sketches is a fascinating contemporary modeling problem due to the links between sketch image generation and the human drawing process. The landmark SketchRNN provided a breakthrough by sequentially generating sketches as a sequence of waypoints. However, this leads to low-resolution image generation and failure to model long sketches. In this paper we present BézierSketch, a novel generative model for fully vector sketches that are automatically scalable and high-resolution. To this end, we first introduce a novel inverse graphics approach to stroke embedding that trains an encoder to embed each stroke to its best fit Bézier curve. This enables us to treat sketches as short sequences of parameterized strokes and thus train a recurrent sketch generator with greater capacity for longer sketches, while producing scalable high-resolution results. We report qualitative and quantitative results on the Quick, Draw! benchmark.
Keywords:
Sketch generation, Scalable graphics, Bézier curve
Fig. 1: Left: SketchRNN [8] generates sketches by sampling waypoints (red dots), which lead to coarse images upon zoom. Right: Our BézierSketch samples smooth curves (green control points), thus providing scalable vector graphic generation.
Generative neural modeling of images [6,12] is now an established research area in contemporary machine learning and computer vision. Rapid progress has been made in generating photos [11,24], with effort being focused on fidelity, diversity, and resolution of image generation, along with stability of training; as well as sequential models for text and video [2,31]. Generative modeling of human sketches in particular has recently gained interest, along with other applications of sketch analysis such as recognition [34,33], retrieval [28,21,4] and forensics [13], all facilitated by the growth of large-scale sketch datasets [8,28].

Sketch generation provides an excellent opportunity to study sequential generative models, and is particularly fascinating due to the potential to establish links between learned generative models and human sketching, a communication modality that comes innately to children and has existed for millennia. Recent breakthroughs in this area include SketchRNN [8], which provided the first neural generative sequential model for sketch images, and Learn2Sketch [30], which provided the first conditional image-to-sequential-sketch model. While conventional image generation models focus on producing ever-larger pixel arrays in high fidelity, these methods aim to model sketches using a more human-like representation consisting of a collection of strokes.

SketchRNN [8], the landmark neural sketch generation algorithm, treats sketches as a digitized sequence of 2D points on a drawing canvas, sampled along the trajectory of the ink flow. This model of sketches has several issues, however. It is inefficient, due to the dense representation of redundant information such as highly correlated temporal samples; and, as sketches are ultimately pixels on a grid, it is prone to sampling noise. Crucially, it provides limited graphical scalability: SketchRNN sets out to achieve vector graphic generation (and claims to achieve this).
However, it does not generate truly scalable vector graphics as required by applications such as digital art. Since generated sketches are composed of dense line segments, its samples are only somewhat smoother than raster graphics (Fig. 1). Finally, it suffers from limited capacity. Because it models sketches as a sequence of pixels, it is limited in the length of sketch it can model before the underlying recurrent neural network begins to run out of capacity.

In this paper we propose a fundamental paradigm change in the representation of sketches that enables the above issues to be addressed. Specifically, we aim to represent sketches in terms of parameterized smooth curves [27]. These provide a scalable representation of a finite-length curve using few control points. From a large family of parametric curves, we choose Bézier curves due to their simple structure. In order to train a generative model of human sketches with this representation, the key question is how to encode human sketches as parameterized curves. To this end, a key technical contribution is a vision-as-inverse-graphics [14,26,5] approach that learns to embed human sketch strokes as interpretable parameterized Bézier curves. We train BézierEncoder in an inverse-graphics manner by learning to reconstruct strokes through a white-box graphics (Bézier) decoder. Given this new low-dimensional stroke representation, we then train BézierSketch to generate sketches. Our stroke-level generative model requires many fewer iterations than the segment-level SketchRNN, and thus provides better generation of longer sketches, while providing high-resolution scalable vector-graphic sketch generation (Fig. 1).
In summary, the contributions of our work are: (1) BézierEncoder, a novel inverse-graphics approach for mapping strokes to parameterized Bézier curves, and (2) BézierSketch, a sequential generative model for sketches that produces high-resolution, low-noise vector graphic samples with improved scalability to longer sketches compared to the previous state of the art, SketchRNN.
Parameterized Curves
Bézier curves are a powerful tool in the field of computer graphics and are extensively used in interactive curve and surface design [27], as are a more general family of curves known as splines [3]. Optimization algorithms to fit Bézier curves and splines from data have been studied, and a few specially crafted algorithms exist specifically for cubic Bézier curves [29,20]. However, the challenge for most curve- and spline-fitting methods is the existence of latent variables $t$ that associate training points with the location of their projection onto the curve. This leads to two-stage alternating algorithms that separately optimize the curve parameters (control points) and the latent parameters $t$ [17,22]. Importantly, such methods [17,22], including some promising ones [35], require expensive per-sample alternating optimization, or iterative inference in expensive generative models [25,15], which makes them unsuitable for large-scale or online applications. In contrast, we uniquely take the approach of learning a neural network that maps strokes to Bézier curves in a single shot. This neural encoder is a model that needs to be trained, but unlike per-sample optimization approaches, it is inductive. So once trained, it can provide one-shot estimation of curve parameters and point association from an input stroke.

Generative Models
Generative models have been studied extensively in the machine learning literature, often in terms of density estimation with directed [23,1] or undirected [10] graphical models. Research in this field accelerated after the emergence of Generative Adversarial Networks (GANs) [6], Variational Autoencoders (VAEs) [12] and their derivatives. Handling sequences is of particular importance, and hence specialized algorithms [2,31] were developed. Although RNNs have been successfully used for generating handwriting [7] without variational training, these methods lacked flexibility in terms of generation quality. The fusion of RNNs with a variational objective, enabled by the emergence of VAEs and variational training methods, led to the first successful generative sequence model [2] in the domain of Natural Language Processing (NLP). It was quickly adapted by SketchRNN [8] in order to extend [7] to free-hand sketches.
Inverse Graphics

"Inverse graphics" is a line of work that aims to estimate 3D scene parameters from raster images without supervision; instead, it predicts the input parameters of a computer graphics pipeline that can reconstruct the image. Several attempts were made [26,14] to estimate explicit model parameters of 3D objects from raw images. A specialized case of the generic inverse-graphics idea is to estimate parameters of 2D objects such as curves. As a recent example, an RNN-based agent named SPIRAL [5] learned to draw characters in terms of pen and brush curves. SPIRAL, however, is extremely costly due to its reliance on policy-gradient [32] reinforcement learning training and a black-box renderer.
Learning for Curves
Few works have studied learning for curve generation. The recent SVG Font Generator [18] trains an excellent font embedding with a recurrent vector font image generator. However, it is trained with supervision rather than inverse graphics, and is limited to the more structured domain of font images. Other attempts [16] also use supervised learning on synthetic data, rather than the unsupervised learning on real human sketches that we consider here.
Background: Conventional Sketch Representation and Generation
A common format [8] for a digitally acquired sketch $S$ is as a sequence of 2-tuples, each containing a 2D coordinate on the canvas sampled from a continuous drawing flow, and a pen-state bit denoting whether the pen touches the canvas or not:

$$S = \big[(X_i, q_i)\big]_{i=1}^{L} \tag{1}$$

where $X_i \triangleq [x\ y]_i^T \in \mathbb{R}^2$, $q_i \in \{\text{PenUp}, \text{PenDown}\}$, and $L$ is the cardinality of $S$, representing the length of the sketch. The state-of-the-art sketch generator SketchRNN [8] learns a parametric Recurrent Neural Network (RNN) to model the joint distribution of coordinates and pen state as a product of conditionals, i.e. $p_{\text{sketchrnn}}(S;\theta) = \prod_{i=1}^{L} p(X_i, q_i \mid X_{<i}, q_{<i})$.
We are interested in moving from such a segment-level representation toward a stroke-level one. To this end we modify the structure of our input data to $\bar{S} \triangleq \big[T_j\big]_{j=1}^{N}$, with $T_j \triangleq \big[X_i^{(j)}\big]_{i=1}^{N_j}$, where $T_j$ is the $j$-th stroke of length $N_j \triangleq |T_j|$, segregated from the sketch by following the pen-state bit, and consequently $\sum_{j=1}^{N} N_j = L$.

Towards a Stroke-Level Generative Model
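As a concrete illustration of this stroke segregation, the following toy sketch (function and variable names are ours, with the pen-state bit encoded as 1 for PenDown) splits a sketch in the format of Eq. 1 into strokes $T_j$:

```python
# Hypothetical illustration: splitting a sketch S = [(X_i, q_i)] into
# per-stroke point lists T_j by following the pen-state bit.

def split_into_strokes(sketch):
    """sketch: list of ((x, y), pen_down) tuples; returns list of strokes."""
    strokes, current = [], []
    for point, pen_down in sketch:
        current.append(point)
        if not pen_down:          # pen lifted: current stroke T_j ends here
            strokes.append(current)
            current = []
    if current:                   # trailing stroke without a final PenUp
        strokes.append(current)
    return strokes

sketch = [((0, 0), 1), ((1, 1), 1), ((2, 0), 0), ((5, 5), 1), ((6, 6), 0)]
strokes = split_into_strokes(sketch)
# The stroke lengths N_j sum to the sketch length L, as in the text.
assert sum(len(t) for t in strokes) == len(sketch)
```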
Existing generative sketch models [8,30] generate a segment at each iteration. Given a stroke-segmented training set $\bar{S}$, we would like to train a generative model analogous to SketchRNN, that is, to model the distribution over possible sketches with a parametric model $p_{\text{model}}(\bar{S};\theta)$ that approximates the original data distribution $p_{\text{data}}(\bar{S})$. Different sketches having different lengths $N$ makes this problem suitable for Recurrent Neural Networks (RNNs). One could model the probability of a sketch as a product of the probabilities of individual strokes $T_j$ conditioned on all previously seen strokes $T_{<j}$. Rather than modeling raw strokes directly, we learn an embedding $e_j$ for each stroke and a corresponding non-parametric decoder $d(\cdot)$ such that $T_j \approx d(e_j)$. We then model the encoded sketch $e(\bar{S}) \triangleq \{e_j\}_{j=1}^{N}$ as

$$p_{\text{model}}\big(e(\bar{S});\theta\big) = \prod_{j=1}^{N} p(e_j \mid e_{<j}) \tag{2}$$

Fig. 2: (a) An example of a Bézier curve of degree $n = 3$ with $n+1$ control points. (b) Bézier curves with Gaussian noise ($\mu = 0$, $\Sigma = 5I$) added to the control points produce similar curves in image space.

Inverse Graphics Decoder

Bézier curves, used heavily in computer graphics, are smooth curves representable in a closed functional form, parameterized by a sequence of $n+1$ anchor coordinates $P \triangleq [P^x\ P^y]^T \in \mathbb{R}^2$ termed control points. A degree-$n$ Bézier curve with control points $[P_0, P_1, \cdots, P_n]$ is represented as

$$C(t; \{P_i\}) = \sum_{i=0}^{n} B_{i,n}(t) \cdot P_i \tag{3}$$

where $t \in [0,1]$ is the parameter of the curve, $B_{i,n}(t) \triangleq \binom{n}{i} t^i (1-t)^{n-i}$ is the Bernstein basis polynomial in $t$, and $C(t) \triangleq [C^x(t)\ C^y(t)]^T \in \mathbb{R}^2$ denotes the point on the curve at parameter value $t$.
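Eq. 3 is straightforward to transcribe; the following minimal NumPy sketch (names are ours, not the paper's) evaluates a Bézier curve from its control points via the Bernstein basis:

```python
# A minimal sketch of Eq. (3): C(t) = sum_i B_{i,n}(t) * P_i, where
# B_{i,n}(t) = C(n, i) t^i (1 - t)^(n - i) is the Bernstein basis.
import numpy as np
from math import comb

def bezier(control_points, ts):
    """control_points: (n+1, 2) array; ts: (T,) parameters in [0, 1].
    Returns the (T, 2) array of curve points C(t)."""
    P = np.asarray(control_points, dtype=float)
    n = len(P) - 1
    ts = np.asarray(ts, dtype=float)
    # Bernstein basis matrix of shape (T, n+1)
    B = np.stack([comb(n, i) * ts**i * (1 - ts)**(n - i)
                  for i in range(n + 1)], axis=1)
    return B @ P

P = [(0, 0), (10, 20), (30, 20), (40, 0)]        # cubic: n = 3
curve = bezier(P, np.linspace(0, 1, 11))
# The curve starts at P_0 and ends at P_n, as stated in the text.
assert np.allclose(curve[0], P[0]) and np.allclose(curve[-1], P[-1])
```

Because the decoder $d(\cdot)$ is just this closed form, it can be re-evaluated at any density of $t$-values, which is what makes the representation resolution-free.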
As $t$ goes from 0 to 1, the curve starts from $P_0$ and ends at $P_n$, and the intermediate control points $[P_1, \cdots, P_{n-1}]$ control the trajectory of the curve, as illustrated in Fig. 2(a). We further use $\mathcal{P}^n \triangleq [P_0^x, P_0^y, \cdots, P_n^x, P_n^y] \in \mathbb{R}^{2(n+1)}$ to denote elements (curves) in the continuous space of $n+1$ control points. The decoder function $d: \mathcal{P} \to T$ can be trivially realized by Eq. 3, with the set of $t$-values chosen as per the resolution requirement.

We now denote $(T, \mathcal{P})$ as an arbitrary stroke and its Bézier representation, where we have dropped the subscript $j$ and superscript $n$ for notational brevity. Using $\mathcal{P}$ as an embedding space for $T$ leads to an extremely useful and key property: given a choice of $n$, two similar points in $\mathcal{P}$ space correspond to similar strokes in $T$ space. As a consequence, we can sample from the conditionals in Eq. 2 to generate variations of a stroke.

Property 1. Given a $(T, \mathcal{P})$ pair where $T = d(\mathcal{P})$, if we sample $\widehat{\mathcal{P}} \sim \mathcal{N}(\mathcal{P}, \sigma)$, then the decoded $\widehat{T} = d(\widehat{\mathcal{P}})$ is distributed as $\mathcal{N}(T, \sigma')$.

Proof. Refer to Appendix A in the supplementary document for the proof. Illustrative examples are given in Fig. 2(b).

A Stroke-to-Bézier Encoder

We wish to learn an embedding function $e(\cdot)$ that will map a given stroke $T$ to its best-fit Bézier representation $\mathcal{P}$. Due to the variable length of strokes $T$, we model BézierEncoder with a bi-directional RNN, with forward and backward states $\overrightarrow{s}_i, \overleftarrow{s}_i \in \mathbb{R}^h$ at time-step $i$:

$$[\overrightarrow{s}_i, \overleftarrow{s}_i] = \text{BiRNN}(X_{i-1}, s_{i-1}; \theta) \tag{4}$$

However, unlike regular encoder RNNs, we further transform the last hidden state to get a Bézier curve representation:

$$\mathcal{P} = W_{\mathcal{P}} \big[\overrightarrow{s}_{\text{end}}; \overleftarrow{s}_{\text{end}}\big] \tag{5}$$

where the 'end' subscript denotes the state of the RNN at the last time-step, $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operator, and $W_{\mathcal{P}} \in \mathbb{R}^{2(n+1) \times 2h}$. The formulation so far enables extracting a curve $\mathcal{P}$ from data $T$.
However, while $\mathcal{P}$ is now a sufficient representation to decode the Bézier curve by means of Eq. 3, we do not have sufficient information to compute a reconstruction loss like $\|T - d(e(T))\|^2$, because we lack the association between input coordinates $X_i$ and interpolation parameters $t_i$. This is where many classic Bézier fitting techniques [17,35] resort to slow alternating optimization.

We take a different approach and ask our encoder to also predict the corresponding interpolation parameter $t_i$ for each input point $X_i$. In order to make valid predictions for $t$, we note the properties it requires due to its role in Bézier curve generation: $0 \leq \widehat{t}_i \leq 1$, and $\widehat{t}_i \leq \widehat{t}_{i+1}$ (due to the sequential nature of $X_i$). Apart from these, we impose another property without any loss of generality: $t_0 = 0$ and $t_{\text{end}} = 1$ (this will make $X_0$ and $X_{\text{end}}$ coincide with $P_0$ and $P_n$ respectively). Please refer to the experiment section for an implementation trick to do so.

Fig. 3: Inverse-graphics training of our BézierEncoder architecture for model-based single-pass stroke $[X_i]$ to Bézier $\mathcal{P}$ mapping.

To enable our encoder to meet the requirements above, we do not compute the $t_i$ directly, but instead compute increments $\Delta t_i \triangleq t_i - t_{i-1}$ (with $t_0 \triangleq 0$) from $[\overrightarrow{s}_i; \overleftarrow{s}_i]$ at every step $i$. The $t_i$-values can then be easily computed as a cumulative sum of all $\Delta t_{i'}$ up to $i$. Thus, the second path of our encoder predicts

$$\widehat{t}_i = \sum_{i'=1}^{i} \widehat{\Delta t}_{i'}, \quad \text{with} \quad \widehat{\Delta t}_i = \text{Softmax}_i\big(W_t \cdot [\overrightarrow{s}_i; \overleftarrow{s}_i]\big) \tag{6}$$
The usage of Softmax(·) enforces all three requirements stated above. To summarize: our full architecture, as shown in Fig. 3, has two pathways: a Bézier embedding pathway that predicts the curve $\mathcal{P}$ for the entire stroke input $T$, and an interpolation-parameter pathway that further predicts the estimated curve parameter $\widehat{t}_i$ for each input point $X_i$ in $T$. Given the $(X_i, \widehat{t}_i)$ pairs and $\mathcal{P}$ predicted by our encoder, we can now train our model with the following reconstruction loss:

$$\mathcal{L}(\theta, W_{\mathcal{P}}, W_t) \triangleq \sum_i \big\| C(\widehat{t}_i, \mathcal{P}) - X_i \big\|^2 \tag{7}$$

which is optimized w.r.t. the encoder parameters $\{\theta, W_{\mathcal{P}}, W_t\}$ by SGD. Once trained, we can compute the best-fit Bézier curve for any stroke using Eq. 5, which provides a feed-forward single-pass solution to a typically alternating optimization.

A Multi-Degree Representation Extension

To add more flexibility, we can extend this basic building block to learn a multi-degree representation of a given stroke $T$. To do so, we encode the stroke using the same RNN in Eq. 4 parameterized by $\theta$, but use a set of different $W_{\mathcal{P}^n}$ and $W_t^n$ for a predefined range of degrees $n \in [n_{\min}, \cdots, n_{\max}]$ to predict Bézier representations of different degrees along with their corresponding $t_i^n$-values:

$$\widehat{t}_i^n = \sum_{i'=1}^{i} \widehat{\Delta t}_{i'}^n, \quad \text{with} \quad \widehat{\Delta t}_i^n = \text{Softmax}_i\big(W_t^n \cdot [\overrightarrow{s}_i; \overleftarrow{s}_i]\big) \quad \text{and} \quad \mathcal{P}^n = W_{\mathcal{P}^n} \big[\overrightarrow{s}_{\text{end}}; \overleftarrow{s}_{\text{end}}\big] \tag{8}$$

The total loss is now the sum of the losses at every degree $n$:

$$\mathcal{L}_{\text{total}} \triangleq \sum_{n=n_{\min}}^{n_{\max}} \mathcal{L}^n, \quad \text{with} \quad \mathcal{L}^n(\theta, W_{\mathcal{P}^n}, W_t^n) \triangleq \sum_i \big\| C(\widehat{t}_i^n, \mathcal{P}^n) - X_i \big\|^2 \tag{9}$$

Inference in this model can now predict a set of Bézier representations of different degrees, where higher-degree curves fit the data better at the cost of more control points.
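As a toy numerical check of the encoder's second pathway and training objective, the NumPy sketch below (all names ours) implements the cumulative-softmax parameterization of Eq. 6 and the reconstruction loss of Eq. 7. Arbitrary logits stand in for $W_t \cdot [\overrightarrow{s}_i; \overleftarrow{s}_i]$, and the paper's implementation trick for pinning $t_0 = 0$ is omitted here:

```python
import numpy as np
from math import comb

def t_from_logits(logits):
    """Eq. 6: softmax over time-steps followed by a cumulative sum,
    which enforces 0 <= t_i <= 1 and t_i <= t_{i+1}."""
    e = np.exp(logits - logits.max())      # numerically stable softmax
    dt = e / e.sum()                       # each Δt_i >= 0; all sum to 1
    return np.cumsum(dt)                   # t_i = sum of Δt up to i

def bezier(P, ts):
    """Eq. 3: C(t) = sum_i B_{i,n}(t) P_i (re-stated for self-containment)."""
    P = np.asarray(P, float); n = len(P) - 1; ts = np.asarray(ts, float)
    B = np.stack([comb(n, i) * ts**i * (1 - ts)**(n - i)
                  for i in range(n + 1)], axis=1)
    return B @ P

def reconstruction_loss(P, t_hat, X):
    """Eq. 7: sum_i || C(t_hat_i, P) - X_i ||^2."""
    return float(np.sum((bezier(P, t_hat) - np.asarray(X, float)) ** 2))

t_hat = t_from_logits(np.array([0.3, -1.2, 0.8, 0.1, 2.0]))
assert np.all(np.diff(t_hat) >= 0)         # monotonically non-decreasing
assert np.isclose(t_hat[-1], 1.0)          # the last value is exactly 1

P = [(0, 0), (5, 10), (10, 0)]
X = bezier(P, t_hat)                       # points lying exactly on the curve
assert np.isclose(reconstruction_loss(P, t_hat, X), 0.0)
assert reconstruction_loss(P, t_hat, X + 1.0) > 0   # any offset raises the loss
```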
The preferred degree can then be chosen manually according to user requirements, or automatically by a heuristic. An effective heuristic is to evaluate the loss $\mathcal{L}^n$ for all $n$ and choose the smallest $n$ for which $\mathcal{L}^n \leq \mathcal{L}_{\text{tolerance}}$.

Smoothness Regularizer

Our training objectives (Eq. 7 or Eq. 9) may lead to overfitting in the domain of Bézier curves during encoder learning. To avoid this, we add a smoothness regularizer (with regularization strength $\beta$) that prefers sequential control points to be nearby. Specifically, we add $\beta \cdot R^n$ to $\mathcal{L}^n$ for each $n$, where

$$R^n(\mathcal{P}^n) \triangleq \sum_{i=0}^{n-1} \| P_{i+1} - P_i \|^2$$

We next leverage our choice of Bézier representation space and encoding model $\mathcal{P} = e(\cdot)$ to define two alternative vector graphic generative models for sketches.

Control Point mode

Given a sketch as a sequence of stroke embeddings $\{\mathcal{P}_j\}_{j=1}^{N}$ obtained from the raw input strokes as $\mathcal{P} = e(T)$, we can modify the original data structure in Eq. 1 and substitute the set of absolute coordinates of every stroke with the set of control points of its Bézier representation. The modified sketch $S_{cp}$ is

$$S_{cp} = \Big[\big(P_0^{(j)}, q_0^{(j)}\big), \cdots, \big(P_i^{(j)}, q_i^{(j)}\big), \cdots, \big(P_{n_j}^{(j)}, q_{n_j}^{(j)}\big)\Big]_{j=1}^{N} \tag{10}$$

When encoded this way by our Bézier encoder, each sketch is represented by a (mostly much) shorter list of parametric control points rather than the original long list of coordinates. In this format, different strokes can have different degrees, as indicated by the use of $n_j$ above. Given this sequential representation of a sketch dataset, we can now train a generative sketch model. Since $S_{cp}$ is structurally the same as the original $S$, apart from its length and the interpretation of its coordinates, we can re-use exactly the same architecture and training procedure as SketchRNN [8].
We use a variational sequence-to-sequence autoencoder [31] with a latent vector encoding the whole sketch. Thus one sketch is encoded first to a list of Bézier curves, and then to a latent vector in the SketchRNN architecture; and decoded first to a list of curve parameters, and then rendered by the Bézier renderer. Please refer to Appendix B for a brief review of the SketchRNN architecture in the context of our problem.

Stroke mode

Given a sketch $S$ as a set of strokes $\{T_j\}_{j=1}^{N}$, we transform it into $S_{st} = \{\mathcal{P}_j\}_{j=1}^{N}$ where $\mathcal{P}_j = e(T_j)$. We model the whole sketch using a sequence-to-sequence autoencoder, where each time-step processes one stroke represented as a fixed-degree Bézier curve. We use a bi-directional RNN to encode the whole sketch stroke by stroke. The hidden states (forward and backward) of the encoder $\overrightarrow{h}_j, \overleftarrow{h}_j$ at time-step $j$ are given by

$$[\overrightarrow{h}_j, \overleftarrow{h}_j] = \text{BiRNN}(\mathcal{P}_{j-1}, h_{j-1}; \Theta)$$

A latent vector $z \in \mathbb{R}^{N_z}$ encoding the whole sketch is sampled using the parameters of a Gaussian distribution computed from the last hidden states:

$$z \sim \mathcal{N}\big(\mu_z, \text{diag}(\sigma_z)\big), \quad \text{with} \quad [\mu_z, \sigma_z] = f\big([\overrightarrow{h}_N; \overleftarrow{h}_N]; \Theta\big)$$

A unidirectional decoder RNN is initialized with $z$ and models the probability of the $j$-th stroke embedding conditioned on the hidden state $g_j \in \mathbb{R}^{H_d}$:

$$p(\mathcal{P}_j \mid g_j; \Theta) = \text{GMM}\Big(\mathcal{P}_j; \big\{\mu_j^m(g_j), \Sigma_j^m(g_j), \pi_j^m(g_j)\big\}_{m=1}^{M}\Big), \quad g_j = \text{DecoderRNN}\big([\mathcal{P}_{j-1}; z], g_{j-1}; \Theta\big) \tag{11}$$

where $\{\mu_j^m, \Sigma_j^m, \pi_j^m\}$ are the parameters of the $M$-component GMM for the $j$-th stroke. For computational efficiency, we consider diagonal $\Sigma_j^m$, and by definition $\sum_m \pi_j^m = 1$. Given a trained model, we can sample from this distribution to generate similar $\mathcal{P}_j$, which will resemble its original-domain data $T_j$ as guaranteed by Property 1.
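To make the decoder's sampling step concrete, here is a toy sketch (all names and numbers ours; arbitrary stand-ins replace the network outputs $\mu$, $\Sigma$, $\pi$) of drawing one stroke embedding from an $M$-component diagonal-covariance GMM as in Eq. 11:

```python
# Hypothetical illustration of ancestral sampling from a diagonal-covariance
# Gaussian mixture: pick a component by its weight, then sample from it.
import numpy as np

def sample_gmm(mu, var, pi, rng):
    """mu, var: (M, D) means and diagonal variances; pi: (M,) mixture weights."""
    m = rng.choice(len(pi), p=pi)                  # component index ~ Cat(pi)
    return rng.normal(mu[m], np.sqrt(var[m]))      # sample ~ N(mu_m, diag(var_m))

rng = np.random.default_rng(0)
M, D = 10, 20                  # 10 components; a degree-9 curve has 2(9+1)=20 dims
mu = rng.normal(size=(M, D))
var = np.full((M, D), 0.1)
pi = np.full(M, 1.0 / M)       # in the model, Softmax guarantees sum(pi) = 1
stroke_embedding = sample_gmm(mu, var, pi, rng)
assert stroke_embedding.shape == (D,)
```

With diagonal covariances each dimension is sampled independently, which is the design choice the text justifies via Property 1.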
Along with $\mathcal{P}_j$ at every step $j$, we also predict a stop bit $\widehat{b}_j \in [0, 1]$, whose ground truth is $b_j \triangleq \mathbb{1}(j = N)$. The sketch generator is trained with the following objective function:

$$\mathcal{L}\big(\{\mathcal{P}_j\}_{j=1}^{N}; \Theta\big) = -\frac{1}{N_{\max}} \sum_{j=1}^{N} \log \text{GMM}\Big(\mathcal{P}_j \mid \big\{\mu_j^m, \Sigma_j^m, \pi_j^m\big\}_{m=1}^{M}; \Theta\Big) - \frac{1}{N_{\max}} \sum_{j=1}^{N} b_j \log \widehat{b}_j - \frac{1}{2N_z} \sum_{i=1}^{N_z} \big(1 + \sigma_z^i - (\mu_z^i)^2 - \exp(\sigma_z^i)\big) \tag{12}$$

The first two terms of $\mathcal{L}$ are the log-likelihood of the sequence $\{\mathcal{P}_j\}_{j=1}^{N}$ under the model and the loss due to the stop bit, respectively. The third term is the KL-divergence loss that imposes a Gaussian prior on the latent code $z$. The diagonal entries of $\Sigma_j^m$ are passed through $\exp(\cdot)$ to make them non-negative, and $\text{Softmax}(\cdot)$ is used to ensure $\sum_m \pi_j^m = 1$.

Dataset

Quick, Draw! is a large sketch dataset [8] collected as part of an online game, in which thousands of people around the world drew a given category within a time limit. Due to the problem definition and the structure of data used by our framework (see Eq. 1), Quick, Draw! is the most suitable dataset to validate it. Different versions of the dataset use different sampling rates at which the sketches are stored as point sequences. SketchRNN is known to work well only on data with a lower sampling rate (i.e., where $\mathbb{E}_T[|T|]$ is lower) than the raw data recorded (where $\mathbb{E}_T[|T|]$ is higher). Due to the fixed length of Bézier representations, our framework can adapt to data with both high and low sampling rates without any modification. Although our method generalizes across all categories, we experimented with a few categories to validate our claims.

Our framework has two main components: embedding each stroke into its Bézier representation, and training a generative model with the encoded sketches in either control point mode or stroke mode.
As our BézierEncoder is a key contribution, we validate it in isolation before comparing our whole BézierSketch framework to SketchRNN [8]. We created a dataset of all strokes from all sketches in a category of Quick, Draw! in order to train the stroke embedding model described in Section 3.1. We adopted some tricks that made the training and representation more efficient in practice. We normalized all strokes to start from the origin (i.e., $X_0 = [0, 0]^T$). Furthermore, we assumed that the first control point $P_0$ of a Bézier representation is always aligned to the first absolute coordinate of the stroke (i.e., $X_0 = P_0$). Given these design choices, we can ignore the first control point (fixing it to the origin) and only predict successive differences of control points (i.e., $\Delta P_1 \triangleq P_1 - P_0$, $\Delta P_2 \triangleq P_2 - P_1$, and so on), decoding $P_i$ as $P_i = \sum_{i'=1}^{i} \Delta P_{i'}$ while evaluating the loss in Eq. 7. We chose the hidden state dimension $h = 256$, and $n_{\min} = 3$, $n_{\max} = 9$ for learning the multi-degree Bézier representation. To exclude over-complicated strokes, we apply heuristics to split a stroke into two or more parts, based on two criteria: (1) every part is within a maximum length, and (2) every part has only one sharp bend (determined by computing its curvature at a given point). We set the regularizer weight $\beta = 10^{-2}$.

Results

We first qualitatively demonstrate the results of inferring Bézier representations of input strokes. Fig. 4 (top left) shows fitting results for various curve degrees (columns), showing variable amounts of detail being captured at different degrees. It also shows fitting examples at both low (above) and high (below) sampling rates, confirming that our encoder can adapt to both. We next qualitatively illustrate the training dynamics of our model via the fit estimated as training progresses. The results in Fig. 4 (middle) show the estimated fit during training in terms of the Bézier curve (red) and control points (green) for a stroke defined by the (blue) points.

Fig. 4: Evaluating our BézierEncoder. (Top left) Learned representations of multi-degree ($n = 3$ to $n = 9$) Bézier stroke embedding; top and bottom rows contain moderate and high sampling rates respectively. (Top right) Test loss for the Cat, Bird, Pig, Clock, Butterfly and Mosquito categories when trained on the same category vs. trained on "Cat", demonstrating transferability of the encoder. (Middle) Visualizing training dynamics. Blue: stroke to fit. Red and green: Bézier curve and control points. Cyan: estimated point correspondence. (Bottom) Examples of full sketches and their learned Bézier representation.

Recall that our encoder also predicts the interpolation parameters $t$ that match each input point to a location on the curve. These correspondences are indicated in cyan. Clearly, both the fit and the estimated correspondences improve with training iterations. Refer to Appendix C in the supplementary document for a similar visualization of more samples. Given that our training data is grouped into categories, we next verify that our encoder indeed learns a generic Bézier embedding and is not overfitted to a specific category. Specifically, we compare the test loss for reconstructing data of each category when the encoder is trained on the same category as testing vs. trained on a category disjoint from testing. The results in Fig. 4 (top right) show that the embedding generalizes quite well to categories it was not trained on. Finally, Fig. 4 (bottom) shows examples of full sketches encoded by our encoder and then decoded as Béziers. We can see that the encoded sketches reflect the input, but are smoother and cleaner.
In control point mode, a fully trained multi-degree embedding model is used to restructure all sketches in our dataset as $S_{cp}$. We set $\mathcal{L}_{\text{tolerance}} = 10^{-2}$ to select the best $n$. We then train a SketchRNN-like model [8] using the restructured data. As data augmentation, we add 2D standard normal noise to all control points. Sampling from the latent space and decoding with the decoder generates a sequence of control points and stroke/sketch ending bits. Treating one entire stroke as a set of control points, we can then draw it on a canvas using Eq. 3 at any required level of granularity.

In stroke mode, we encode each stroke with a fixed degree of $n = 9$. Very similarly to control point mode, we use a Bi-LSTM to encode the whole sketch stroke by stroke and extract an $N_z$-dimensional latent vector. Conditioned on the latent vector, the decoder produces the Bézier representation $\mathcal{P}$ of one stroke at each time-step. Thus, the length of a sketch coincides with the number of strokes present in the sketch. At each step of the decoder, we sample one stroke from $p(\mathcal{P}_j \mid g_j, \Theta)$, which is modeled as a GMM with $M = 10$ mixture components. However, unlike control point mode and its corresponding SketchRNN-like architecture, we do not use a correlation parameter in the constituent Gaussians. This design choice makes the individual dimensions of the Gaussians independent, sampling from which is justified by Property 1. Apart from $\mathcal{P}_j$, we predict one more quantity in practice: the start location $v_j \triangleq (v^x, v^y)_j^T$ of the stroke w.r.t. the whole sketch. The need for $v_j$ arises from the practical consideration of relocating the start of each individual stroke to the origin while encoding.

Results

Qualitative results of unconditionally generated sketch samples from both our model variants are shown in Fig. 5(a). We can see that, similarly to SketchRNN, BézierSketch generates diverse and plausible samples.
However, uniquely, our samples are high-resolution vector graphic sketches. Fig. 5(b) also shows examples of conditional samples, where the right group of three images in each set is sampled conditioned on the encoding of the left sketch.

Fig. 5: Qualitatively evaluating BézierSketch. (a) Samples drawn unconditionally in control point mode (left half) and stroke mode (right half). (b) Sketch samples generated by conditioning on the first sketch (double-bordered) in each set.

The use of Bézier curves as the stroke representation significantly reduces the average length of a given stroke's representation and, as a direct consequence, the description length of whole sketches as well. In Fig. 6, we compare the length histograms of the original data and its Bézier representation at both stroke and sketch level, confirming that Bézier encodings are systematically shorter (left). The same holds for strokes and sketches sampled by vanilla SketchRNN and BézierSketch respectively (right). This property of shorter representations for any given sketch means that our generator should have an advantage in modeling longer sketches compared to vanilla SketchRNN, since it only needs to model shorter sequences. To evaluate this, we use a modified Fréchet Inception Distance (FID) [9] score to compare the generated samples from both models. We first trained both our generator model and SketchRNN on the entire dataset (of each category). We then create a subset of sketches whose original length is $l \pm 20$ and use them to generate samples. All original and generated samples are rendered on a canvas and projected down to a concise feature vector using a pre-trained Sketch-a-Net 2.0 [33] classifier. We compute the empirical mean and covariance of both real and generated samples as $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$, and then estimate the modified FID as:

$$\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$$

The results in Fig.
7 plot the modified FID score against increasing length value $l$ for both SketchRNN and our model on each category of sketches. We can see that our model achieves an improved (lower) FID score, especially for longer sketches. This is illustrated qualitatively in Fig. 7, where we can see that for longer sketches our framework produces much more reliable reconstructions than SketchRNN, which fails to make reasonable reconstructions in these cases.

Other applications

Although crafted with sketches in mind, our framework can be adapted to other applications such as handwriting generation (in line with the work of [7]) with little to no modification. In fact, any 2D sequence data with a two-level hierarchical representation (e.g., stroke and sketch) can be modeled using the same framework. Online handwritten characters are composed of relatively short strokes, which we model with Bézier curves. We use the online handwritten sentences from the IAM handwriting database [19], embed the constituent strokes with our Bézier representation, and train our generative model for words. Fig. 8 shows qualitative samples from our resulting word generator.

Fig. 6: Stroke/sketch length histograms for the original data (left) and generated samples (right). Bézier encodings are shorter sequences than the raw data.

Fig. 7: Left: FID score ($\downarrow$) vs. sketch length for SketchRNN, Ours (control point) and Ours (stroke) on the Cat, Pig, Butterfly, Clock, Bird and Mosquito categories, showing the effectiveness of our generative model on longer sketches.
Right: Qualitative samples of long sketches. The three columns show the original sketch, the SketchRNN reconstruction, and our BézierSketch reconstruction.

In this paper we presented an inverse graphics approach to training an efficient, model-based, single-pass stroke-to-Bézier encoder via reconstruction through a Bézier decoder. This approach surpasses conventional fitting-based methods in both quality and efficiency. Furthermore, it enabled us to advance generative sketch models by generating sketches as sequences of parameterized curves rather than pixels, leading to arbitrary-resolution, scalable vector graphic samples. This new representation also enables better generation of longer sketches than the existing state of the art. In future work we will investigate extending to more complex parameterized curves such as B-splines, and developing an encoder that predicts curves directly from rasterized images.

Fig. 8: Unconditionally generated handwritten words from the IAM database.

Supplementary material for BézierSketch: A generative model for scalable vector sketches

Property 1. Given a $(\mathbf{T}, \mathbf{P})$ pair where $\mathbf{T} = d(\mathbf{P})$ for an arbitrary set of $t$, and $\widehat{\mathbf{P}} \sim \mathcal{N}(\mathbf{P}, \Sigma)$, the decoded $\widehat{\mathbf{T}} = d(\widehat{\mathbf{P}})$, with the same set of $t$, is distributed as $\mathcal{N}(\mathbf{T}, \Sigma')$, where $\Sigma$ and $\Sigma'$ are diagonal covariance matrices.

Proof.
As $\Sigma$ is diagonal, we can separate each dimension of $\mathcal{N}(\mathbf{P}, \Sigma)$ into individual Gaussians and then group the $x$-$y$ components of each control point into its own Gaussian with diagonal covariance $\Sigma_i \triangleq \begin{bmatrix} \sigma^2_{x_i} & 0 \\ 0 & \sigma^2_{y_i} \end{bmatrix}$:

$$\mathcal{N}(\mathbf{P}, \Sigma) = \prod_{i=0}^{n} \mathcal{N}(\mathbf{P}_i, \Sigma_i)$$

Drawing samples from the Gaussians of the individual control points gives $\widehat{\mathbf{P}} \triangleq \big[\widehat{\mathbf{P}}_i\big]_{i=0}^{n}$, where $\widehat{\mathbf{P}}_i \sim \mathcal{N}(\mathbf{P}_i, \Sigma_i)$. Decoding $\widehat{\mathbf{P}}$ by $d(\cdot)$ gives

$$\widehat{\mathbf{T}} = d(\widehat{\mathbf{P}}) = \sum_{i=0}^{n} B_{i,n}(t) \cdot \widehat{\mathbf{P}}_i \quad (1)$$

Given any value $t = t_0$, the random variable $\widehat{\mathbf{T}}$ is a weighted sum of $n+1$ independent Gaussian random variables with weights $[B_{i,n}(t_0)]_{i=0}^{n}$. Hence, $\widehat{\mathbf{T}}$ is distributed as

$$\widehat{\mathbf{T}} \sim \mathcal{N}\left(\sum_{i=0}^{n} B_{i,n}(t_0) \cdot \mathbf{P}_i,\ \sum_{i=0}^{n} B^2_{i,n}(t_0) \cdot \Sigma_i\right) \quad (2)$$

Now $\sum_{i=0}^{n} B_{i,n}(t_0) \cdot \mathbf{P}_i \triangleq \mathbf{T}$, and we denote $\sum_{i=0}^{n} B^2_{i,n}(t_0) \cdot \Sigma_i \triangleq \Sigma'$. So $\widehat{\mathbf{T}} \sim \mathcal{N}(\mathbf{T}, \Sigma')$. $\square$

Sketch-RNN [8] is considered the state of the art in generative modeling of free-hand vector sketches. Sketch-RNN models the consecutive differences of the 2D waypoints of a sketch, along with three bits denoting the "touching", "stroke-end" and "sketch-end" states of the pen. In the control point mode of BézierSketch, we adopted the same architecture and data representation as Sketch-RNN, but with control points instead of waypoints. Hence a sketch $S_{cp}$ is transformed into a list (of length $N$) of 5-tuples $s_i \triangleq (\Delta P_x, \Delta P_y, q_1, q_2, q_3)_i$, where $[\Delta P_x, \Delta P_y]^T \triangleq \Delta\mathbf{P}$ is the successive difference of control points and $(q_1, q_2, q_3) \triangleq \mathbf{q}$ are the three flag bits described above. As a normalization step, all sketches are assumed to start from the origin (i.e., $[0, 0]^T$).

The core model of Sketch-RNN is a Sequence-to-Sequence Variational Autoencoder (Seq2Seq-VAE) [31] with a standard sequence encoder and an autoregressive decoder.
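Property 1 can be checked numerically: sampling control points from per-point Gaussians and decoding with the Bernstein basis should yield curve points whose variance matches $\sum_i B^2_{i,n}(t_0)\,\Sigma_i$. A minimal NumPy sketch of this check (all function names are illustrative, not from the paper's code release):

```python
import numpy as np
from math import comb

def bernstein(n, t):
    """Bernstein basis values B_{i,n}(t) for i = 0..n."""
    i = np.arange(n + 1)
    binom = np.array([comb(n, k) for k in range(n + 1)])
    return binom * t**i * (1.0 - t)**(n - i)

def decode(P, t):
    """Bezier decoder d(P): the curve point at parameter t."""
    return bernstein(P.shape[0] - 1, t) @ P

# A cubic (n = 3) curve with per-control-point noise std sigma_i
P = np.array([[0., 0.], [1., 2.], [3., 2.], [4., 0.]])
sigma = np.array([0.05, 0.1, 0.1, 0.05])
t0 = 0.4
B = bernstein(3, t0)

# Monte Carlo: perturb control points, decode at t0, measure the spread
rng = np.random.default_rng(0)
samples = np.array([
    decode(P + rng.normal(0.0, 1.0, P.shape) * sigma[:, None], t0)
    for _ in range(20000)
])

pred_var = np.sum(B**2 * sigma**2)   # variance predicted by Property 1
assert np.allclose(samples.var(axis=0), pred_var, rtol=0.1)
assert np.allclose(samples.mean(axis=0), decode(P, t0), atol=0.01)
```

Note that the Bernstein weights enter the mean linearly but the variance through their squares, which is exactly why the decoded points remain Gaussian with the stated parameters.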
The whole sketch sequence is fed into a bidirectional encoder LSTM with hidden state

$$h_i \triangleq \left[\overrightarrow{h}_i;\ \overleftarrow{h}_i\right] = \text{Bi-LSTM}(s_i, h_{i-1}) \quad (3)$$

and the last state $h_N$ is used as a compact representation of the sketch. $h_N$ is then used to generate the parameters of a Gaussian distribution following the VAE framework [12]. A sample is drawn from this distribution as

$$z \sim \mathcal{N}(\mu, \sigma), \quad \text{where } [\mu, \sigma] = f(h_N),\ z \in \mathbb{R}^{Z}$$

and decoded by an autoregressive decoder. A unidirectional LSTM is initialized from $z$ and produces a reconstruction of the sketch sequence, similar to [7]. At each time step $j$ of the decoder, the hidden state is given by

$$g_j = \text{LSTM}([z;\ s_{j-1}],\ g_{j-1}), \quad \text{with } g_0 = \tanh(z)$$

At every time step, the decoder outputs the parameters of a GMM (with $M$ mixtures) over $[\Delta P_x, \Delta P_y]^T$, and a categorical distribution over the three flag bits discussed above. Samples from these distributions are fed back as the input $s_{j+1}$ at the next time step:

$$s'_j = \left(\Delta\mathbf{P}'_j,\ \mathbf{q}'_j\right), \quad \text{where } \Delta\mathbf{P}'_j \sim \text{GMM}(\Delta\mathbf{P};\ g_j) \text{ and } \mathbf{q}'_j \sim \text{Cat}(\mathbf{q};\ g_j) \quad (4)$$

The network is trained with the following loss, which comprises the log-likelihood of the GMM, the categorical cross-entropy of the flag bits, and a variational KL divergence term:

$$\mathcal{L} = -\frac{1}{N_{max}}\sum_{j=1}^{N}\log\text{GMM}(\Delta\mathbf{P}'_j) - \frac{1}{N_{max}}\sum_{j=1}^{N_{max}}\mathbf{q}_j\log\mathbf{q}'_j - \frac{1}{2Z}\sum\left(1 + \sigma - \mu^2 - \exp(\sigma)\right) \quad (5)$$

We provide visualizations (see Fig. 1) of the optimization dynamics over time. We also annotate a discrete point of the stroke and its corresponding point on the Bézier curve, joining them with a connector.

Fig. 1: Visualization of intermediate stages of the fitting for the BézierEncoder network. Each row corresponds to one sample and columns denote increasing iterations of training.

References

1. Bishop, C.M.: Mixture density networks. Tech. rep., Aston University (1994)
2.
Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. In: CoNLL (2016)
3. De Boor, C.: A practical guide to splines, vol. 27. Springer-Verlag New York (1978)
4. Dey, S., Riba, P., Dutta, A., Lladós, J., Song, Y.Z.: Doodle to search: Practical zero-shot sketch-based image retrieval. In: CVPR (2019)
5. Ganin, Y., Kulkarni, T., Babuschkin, I., Eslami, S.M.A., Vinyals, O.: Synthesizing programs for images using reinforced adversarial learning. In: ICML (2018)
6. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
7. Graves, A.: Generating sequences with recurrent neural networks. CoRR abs/1308.0850 (2013)
8. Ha, D., Eck, D.: A neural representation of sketch drawings. In: ICLR (2018)
9. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NIPS (2017)
10. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science (5786), 504–507 (2006)
11. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
12. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)
13. Klare, B., Li, Z., Jain, A.: Matching forensic sketches to mug shot photos. IEEE Transactions on Pattern Analysis and Machine Intelligence (3), 639–646 (March 2011)
14. Kulkarni, T.D., Whitney, W., Kohli, P., Tenenbaum, J.B.: Deep convolutional inverse graphics network. In: NIPS (2015)
15. Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through probabilistic program induction. Science (6266), 1332–1338 (2015)
16. Laube, P., Franz, M.O., Umlauf, G.: Deep learning parametrization for B-spline curve approximation.
In: 3DV (2018)
17. Liu, Y., Wang, W.: A revisit to least squares orthogonal distance fitting of parametric curves and surfaces. In: GMP (2008)
18. Lopes, R.G., Ha, D., Eck, D., Shlens, J.: A learned representation for scalable vector graphics. In: ICCV (2019)
19. Marti, U.V., Bunke, H.: A full English sentence database for off-line handwriting recognition. In: ICDAR (1999)
20. Masood, A., Ejaz, S.: An efficient algorithm for robust curve fitting using cubic Bézier curves. In: ICIC (2010)
21. Pang, K., Li, K., Yang, Y., Zhang, H., Hospedales, T.M., Xiang, T., Song, Y.Z.: Generalising fine-grained sketch-based image retrieval. In: CVPR (2019)
22. Plass, M., Stone, M.: Curve-fitting with piecewise parametric cubics. In: SIGGRAPH (1983)
23. Rabiner, L., Juang, B.: An introduction to hidden Markov models. IEEE ASSP Magazine (1), 4–16 (1986)
24. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR (2016)
25. Revow, M., Williams, C.K.I., Hinton, G.E.: Using generative models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), 592–606 (1996)
26. Romaszko, L., Williams, C.K.I., Moreno, P., Kohli, P.: Vision-as-inverse-graphics: Obtaining a rich 3D explanation of a scene from a single image. In: ICCVW (2017)
27. Salomon, D.: Curves and surfaces for computer graphics. Springer Science & Business Media (2007)
28. Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The sketchy database: Learning to retrieve badly drawn bunnies. In: SIGGRAPH (2016)
29. Shao, L., Zhou, H.: Curve fitting with Bézier cubics. Graphical Models and Image Processing (3), 223–232 (1996)
30. Song, J., Pang, K., Song, Y., Xiang, T., Hospedales, T.M.: Learning to sketch with shortcut cycle consistency. In: CVPR (2018)
31.
Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML (2015)
32. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: NIPS (1999)
33. Yu, Q., Yang, Y., Liu, F., Song, Y.Z., Xiang, T., Hospedales, T.: Sketch-a-Net: A deep neural network that beats humans. International Journal of Computer Vision, 411–425 (2017)
34. Yu, Q., Yang, Y., Song, Y.Z., Xiang, T., Hospedales, T.: Sketch-a-Net that beats humans. In: BMVC (2015)
35. Zheng, W., Bo, P., Liu, Y., Wang, W.: Fast B-spline curve fitting by L-BFGS. Computer Aided Geometric Design 29