Sketch-R2CNN: An Attentive Network for Vector Sketch Recognition
Lei Li (HKUST), Changqing Zou (University of Maryland, College Park), Youyi Zheng (Zhejiang University), Qingkun Su (Alibaba A.I. Labs), Hongbo Fu (City University of Hong Kong), Chiew-Lan Tai (HKUST)
Abstract
Freehand sketching is a dynamic process in which points are sequentially sampled and grouped as strokes for sketch acquisition on electronic devices. To recognize a sketched object, most existing methods discard such important temporal ordering and grouping information from humans and simply rasterize sketches into binary images for classification. In this paper, we propose a novel single-branch attentive network architecture, RNN-Rasterization-CNN (Sketch-R2CNN for short), to fully leverage the dynamics in sketches for recognition. Sketch-R2CNN takes as input only a vector sketch with grouped sequences of points, and uses an RNN for stroke attention estimation in the vector space and a CNN for 2D feature extraction in the pixel space, respectively. To bridge the gap between these two spaces in neural networks, we propose a neural line rasterization module to convert the vector sketch, along with the attention estimated by the RNN, into a bitmap image, which is subsequently consumed by the CNN. The neural line rasterization module is designed in a differentiable way to yield a unified pipeline for end-to-end learning. We perform experiments on existing large-scale sketch recognition benchmarks and show that, by exploiting the sketch dynamics with the attention mechanism, our method is more robust and achieves better performance than state-of-the-art methods.
1. Introduction
Freehand sketching is an easy and quick means of communication because of its simplicity and expressiveness. While a human has the innate ability to interpret drawing semantics, the vast expressive capacity of sketches poses great perception challenges to machines. For better human-computer interaction, sketch analysis has been an active research topic in the computer vision and graphics fields, spanning a wide spectrum including sketch recognition [3, 44, 47], sketch segmentation [35, 11, 17, 18], sketch-based retrieval [4, 38, 30, 42] and modeling [26], etc. In this paper, we focus on developing a novel learning-based method for freehand sketch recognition.

The goal of sketch classification or recognition is to identify the object category of an input sketch, which is more challenging than image classification due to the lack of rich texture details, inherent ambiguities, and large shape variations in the input. Traditional studies [3, 31, 19] commonly cast sketch recognition as an image classification task by converting sketches into binary images and then extracting local image features. With the quantified feature descriptors, a typical classifier such as a Support Vector Machine (SVM) is trained for object category prediction. Recent years have witnessed the success of deep learning in image classification [14]. Similar neural network designs have also been used to address the recognition problem for sketch images [44, 30]. Although these deep learning-based methods outperform the traditional ones, the unique properties of sketches, as discussed in the following, are often overlooked, leaving room for further improving the performance of sketch recognition.

In general, sketches have two widely used representations for processing: raster pixel sketches and vector sketches.
Raster pixel sketches are binary images in which pixels covered by strokes have the value one and the remaining pixels the value zero, resulting in a large portion of void pixels and thus a sparse representation. This representation does not allow state-of-the-art convolutional neural networks (CNNs) to easily distinguish which strokes are more important or which strokes can be ignored for better recognition [31]. Following the definition in [42], a vector sketch in our work refers to a sequence of strokes containing the points in the drawing order (Fig. 1). A vector sketch can be easily converted into a bitmap image through rasterization, but not vice versa. Notably, vector sketches contain rich temporal ordering and grouping (i.e., stroke) information, which has been shown to be useful for learning more descriptive features [42]. However, these information cues are all discarded during the rasterization process for pixel images and are thus inaccessible to subsequent recognition algorithms.

Motivated by the above discussions, to address the incapacity of existing CNN-based methods for stroke importance interpretation, we propose a novel single-branch attentive network architecture, RNN-Rasterization-CNN (Sketch-R2CNN for short), for vector sketch recognition. Sketch-R2CNN takes advantage of both the vector and raster representations of sketches during the learning process and is able to focus on adaptively learned important strokes, with an attention mechanism, for better recognition (Fig. 1). It takes only a vector sketch (i.e., grouped sequences of points) as input, and employs a recurrent neural network (RNN) in the first stage for analyzing the temporal ordering and grouping information in the input and producing attention estimations for the stroke points. We then develop a novel neural line rasterization (NLR) module, capable of converting the vector sketch with the computed attentions into an attention map in a differentiable way.
Subsequently, Sketch-R2CNN uses a CNN to consume the obtained attention map for guided hierarchical understanding and feature extraction on critical strokes to identify the target object category. Our proposed NLR module is the key to connecting the vector sketch space and the raster sketch space in neural networks, and it allows gradient information to back-propagate from the CNN to the RNN for end-to-end learning. Experiments on existing large-scale sketch recognition benchmarks [3, 8] show that our method, leveraging more human factors in the input, performs better than the state-of-the-art methods, and that our RNN-Rasterization-CNN design consistently improves the performance of CNN-only methods.

In summary, our contributions in this work are: (1) the first single-branch attentive network with an RNN-Rasterization-CNN design for vector sketch recognition; (2) a novel differentiable neural line rasterization module that unifies the vector sketch space and the raster sketch space in neural networks, allowing end-to-end learning. We will make our code publicly available.
2. Related Work
To recognize sketched objects, traditional methods generally take preprocessed raster sketches as input. To quantify a sketch image, existing studies have tried to adapt several types of local features originally intended for photos (e.g., bag-of-features [3], Fisher Vectors with SIFT features [31], HOG features [19]) to line drawing images. With the extracted features, classifiers (e.g., SVMs) are then trained to recognize unseen sketches [3, 31]. Different learning schemes, such as multiple kernel learning [19] or active learning [43], may be employed for performance improvement. Another line of traditional methods has also attempted to utilize additional cues for recognition, such as prior knowledge for domain-specific sketches [1, 15, 27, 23, 32, 2] or object context for sketched scenes [47, 48]. While progress has been made in sketch recognition, these methods still cannot robustly handle freehand sketches with large shape or style variations, especially those hastily drawn in dozens of seconds [8], and they struggle to achieve performance on par with humans on existing benchmarks like the TU-Berlin benchmark [3].

Recently, deep learning has revolutionized many research fields, including sketch recognition, with state-of-the-art performance. Research efforts [30, 46, 39, 44] have been made to employ deep neural networks, such as AlexNet [14] or GoogLeNet [36], to learn more discriminative image features in the sketch domain to replace hand-engineered ones. Yu et al. [44] proposed
Sketch-a-Net, an AlexNet-like architecture specifically adapted for sketch images by using large kernels in convolutions to accommodate the sparsity of stroke pixels. Their method achieved superior classification accuracy (77.95%) on the TU-Berlin benchmark [3], surpassing human performance (73.1%) for the first time. Their method still follows the existing learning process of image classification, i.e., using the raster image representation of sketches as CNN input, and thus cannot easily learn awareness of stroke importance in an end-to-end manner for further improvement. In contrast, our network directly consumes vector sketches as input and learns stroke importance effectively and adaptively by exploiting the temporal ordering and grouping information therein with RNNs.

The vector representation of sketches has been considered for certain tasks such as sketch generation [7, 8, 33] or sketch hashing [42] with deep learning. For example,
SketchRNN [8], which has received much attention recently, is built upon RNNs to process vector sketches. It is composed of an RNN encoder followed by an RNN decoder, and is able to model the underlying distribution of points in vector sketches for a specific object category. To learn to hash sketches for retrieval, Xu et al. [42] demonstrated that an RNN branch, exploiting the temporal ordering in vector sketches, can complement a CNN branch for extracting more descriptive features. They fuse the two types of features, produced by the RNN and CNN respectively, via a late-fusion layer by concatenation. Our work shares a similar spirit with [42], advocating that the temporal and grouping information in vector sketches offers additional cues for more accurate sketch recognition. In contrast to their two-branch network with simple concatenation, our RNN-Rasterization-CNN design seeks to boost the synergy between the two networks in a single branch during the learning process. To this end, inspired by [12], which proposed an approximate gradient for in-network mesh rendering and rasterization, we design a novel neural line rasterization module, allowing gradients to back-propagate from the CNN (raster sketch space) to the RNN (vector sketch space) for end-to-end learning.

Figure 1. Illustration of our single-branch attentive network architecture for vector sketch recognition. (Neural Line Raster stands for our neural line rasterization (NLR) module.)

For a sketch, its constituent strokes may contribute differently to its recognition. With a trained SVM, Schneider et al. [31] qualitatively analyzed how stroke importance affects classification scores by iteratively removing each stroke from the corresponding raster sketch image. To automatically capture stroke importance during the learning process, researchers have attempted to adapt attention mechanisms in network design [34]. Attention mechanisms have been widely used in many visual tasks, such as image classification [24, 40, 37, 10], image captioning [41, 22] and Visual Question Answering (VQA) [25]. A simple attention module generally works by computing soft masks over the spatial image grid [37, 41], or even over feature channels [10], to obtain weighted combinations of features. Song et al. [34] incorporated a spatial attention module for raster sketches in their network for fine-grained sketch-based image retrieval. Differently, Riaz Muhammad et al. [28] tackled the sketch abstraction task with reinforcement learning, which aims to develop a stroke removal policy by considering each stroke's influence on recognizability. As discussed in existing studies [44, 42, 6, 5], CNNs may suffer from the sparsity of inputs (e.g., raster sketches), though they excel at building hierarchical representations of 2D inputs. Instead of struggling to estimate attention from binary images that contain limited information [34], we argue that additional cues, such as the temporal ordering and grouping information in vector sketches, are essential to learn reliable attention for strokes. In our method, we resort to RNNs for computing attention for each point in a vector sketch, and use our NLR module for in-network vector-to-raster conversion.
To the best of our knowledge, no existing work has tried to derive an attention map from vector sketches with RNNs for CNN-based sketch recognition.
3. Method
Our network architecture, as illustrated in Fig. 1, is composed of two cascaded sub-networks: an RNN for stroke attention estimation in the vector sketch space and a CNN for 2D feature extraction in the raster sketch space (Sec. 3.2). The key enabler for linking the two sub-networks, which operate in completely different spaces, is a novel neural line rasterization (NLR) module, which converts a vector sketch with the estimated attention into a raster pixel sketch in a differentiable way (Sec. 3.3). More specifically, during the forward inference pass, given a vector sketch as input, the RNN takes in a point at each time step and computes a corresponding attention value for the point. Our proposed NLR module then rasterizes the vector sketch, together with the estimated per-point attention, into an attention map and computes the corresponding gradients for the backward optimization pass. A subsequent CNN consumes the attention map as input for hierarchical understanding and produces category predictions as the final output.
3.1. Vector Sketch Representation

The input to our network is a vector sketch, formed by a sequence of strokes, each stroke being represented by a sequence of points. This storage format is widely adopted for sketches in existing crowdsourced datasets [8, 30, 3]. Following [7], we denote a vector sketch as an ordered point sequence S = { p_i = (x_i, y_i, s_i) }_{i=1···n}, where n is the total number of points in all strokes. For each point p_i, x_i and y_i are the 2D coordinates, and s_i is a binary stroke state. Specifically, state s_i = 0 indicates that the current stroke has not ended and that the stroke connects p_i to p_{i+1}; s_i = 1 indicates that p_i is the last point of the current stroke and that p_{i+1} will be the starting point of another stroke. Our network takes only the vector sketch S as input for end-to-end learning.

3.2. Network Architecture

Our network architecture is formed by two sequentially arranged sub-networks, which are linked by a differentiable NLR module. The first sub-network is an RNN, which analyzes the temporal ordering and grouping information in the input. The RNN consumes a vector sketch S and estimates per-point attention as output at each iteration step. Specifically, we use a bidirectional Long Short-Term Memory (LSTM) unit with two layers as the first sub-network. We set the size of the hidden state to 512 and adopt dropout with probability 0.5. For the hidden state at step i, after the LSTM cell takes in p_i, we pass it through a fully-connected layer followed by a sigmoid function to produce per-point attention, denoted as a_i. That is, for each point p_i, we obtain a corresponding scalar a_i, signifying the point's importance in the subsequent 2D visual understanding by the CNN. Similar to [8], instead of using absolute coordinates, for each p_i fed into the RNN, we compute the offsets from its previous point p_{i−1} as its coordinates. Next, we pass the point sequence along with the estimated attention, i.e., (p_i, a_i)_{i=1···n}, through our NLR module, as detailed in Sec. 3.3.
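Concretely, the attention sub-network described above can be sketched in PyTorch as follows. The class and helper names are ours, and details such as feeding the concatenated bidirectional hidden states into the fully-connected layer are assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class StrokeAttentionRNN(nn.Module):
    """Sketch of the first sub-network: a two-layer bidirectional LSTM
    (hidden size 512, dropout 0.5 in the paper) maps each input point
    (dx, dy, s) to a scalar attention a_i in (0, 1) via a
    fully-connected layer followed by a sigmoid."""
    def __init__(self, hidden_size=512):
        super().__init__()
        self.lstm = nn.LSTM(input_size=3, hidden_size=hidden_size,
                            num_layers=2, dropout=0.5,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_size, 1)  # 2x: forward + backward states

    def forward(self, points):
        # points: (B, n, 3) holding relative offsets (dx, dy) and stroke state s
        h, _ = self.lstm(points)                      # (B, n, 2 * hidden_size)
        return torch.sigmoid(self.fc(h)).squeeze(-1)  # (B, n), each in (0, 1)

def to_offsets(points):
    """Replace absolute (x, y) coordinates with offsets from the previous
    point, as in [8]; the first point keeps its absolute coordinates."""
    out = points.clone()
    out[:, 1:, :2] = points[:, 1:, :2] - points[:, :-1, :2]
    return out
```

The estimated attentions a_i would then be passed, together with the point sequence, to the NLR module of Sec. 3.3.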
The output of the module is a raster sketch image I, which can also be viewed as an attention map, with the intensity of each stroke pixel being the corresponding attention. A deep CNN then takes the image I as input for hierarchical 2D feature extraction. Sketch-a-Net [44] or ResNet50 [9] can be used as the backbone network, which is then connected to a fully-connected layer to produce estimations over all the possible object categories. We use the cross-entropy loss for optimizing the whole network.

Our network architecture for sketch recognition differs from the one proposed by Xu et al. [42] for sketch retrieval in several aspects. First, their network has two branches for feature extraction, one branch with an RNN and the other with a CNN. During learning, their RNN and CNN work individually on two different sketch spaces with little interaction, except at the last concatenation layer for feature fusion. In contrast, our single-branch design allows more information flow between the RNN and CNN owing to our NLR module; that is, the RNN can complement the CNN by producing a more informative input, whereas the CNN provides guidance on attention estimation with learned hierarchical representations during back-propagation. In addition, our network only uses vector sketches as input and performs in-network vector-to-raster conversion, while the two-branch late-fusion network [42] requires both vector and raster sketches as input, and thus needs a preprocessing stage for rasterization.

3.3. Neural Line Rasterization

To convert a point sequence with attention (p_i, a_i)_{i=1···n} to a pixel image I, the basic operation is to draw each valid line segment p_i p_{i+1} (Sec. 3.1) onto the canvas image. As illustrated in Fig. 2, to determine whether or not a pixel I_k is on the target line segment, we simply compute the distance from its center to the line segment p_i p_{i+1} and check whether it is smaller than a predefined threshold ε (we set ε = 1 in our experiments).
If I_k is a stroke pixel, we compute its attention by linear interpolation [12]; otherwise its attention is set to zero. More specifically, let p_k be the projection point of I_k's center onto p_i p_{i+1}. The intensity, or attention, of I_k is then defined as

I_k = (1 − α_k) · a_i + α_k · a_{i+1},   (1)

where α_k = ‖p_k − p_i‖ / ‖p_{i+1} − p_i‖, and p_k, p_i and p_{i+1} are in absolute coordinates.

Figure 2. Rasterization of line segment p_i p_{i+1} and linear interpolation of the attention value for stroke pixel I_k.

This rasterization process for line segments can be efficiently done in parallel on the GPU with a CUDA kernel. Note that in the implementation we need to record the relevant information, such as the line segment index and α_k at each pixel I_k, for subsequent gradient computation.

Through the above process, a vector sketch can be easily converted into a raster image in the forward inference pass. In order to propagate gradients w.r.t. the loss function from the CNN to the RNN in the backward optimization pass, we need to derive gradients for the above rasterization process. Thanks to the simplicity of the linear interpolation used, the gradients can be computed as follows:

∂I_k / ∂a_i = 1 − α_k,   ∂I_k / ∂a_{i+1} = α_k.   (2)

Let L be the loss function and δI_k be the gradient back-propagated into I_k w.r.t. L through the CNN. By the chain rule, we have

∂L / ∂a_i = Σ_k δI_k · (1 − α_k),   ∂L / ∂a_{i+1} = Σ_k δI_k · α_k,   (3)

where k iterates over all the stroke pixels covered by the line segment p_i p_{i+1}. If p_i is adjacent to another line segment p_{i−1} p_i, we accumulate the gradients.

Our NLR module is simple and easy to implement, but it is crucial for bridging the gap between the vector sketch space and the raster sketch space in neural networks for end-to-end learning.
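As a reference for Eqs. (1)-(3), the following plain-Python sketch rasterizes a single attention-weighted segment and accumulates gradients w.r.t. the two endpoint attentions. It mirrors the logic of the NLR module but is not the paper's CUDA kernel; the pixel-center convention (offset by 0.5) is our assumption:

```python
import math

def rasterize_segment(p0, p1, a0, a1, size, eps=1.0):
    """NLR forward pass for one segment p0-p1 (assumed distinct points):
    pixels whose centers lie within eps of the segment receive the
    linearly interpolated attention of Eq. (1); others stay 0. Records
    alpha_k per stroke pixel for the backward pass (Eqs. (2)-(3))."""
    img = [[0.0] * size for _ in range(size)]
    alphas = {}                                  # (row, col) -> alpha_k
    vx, vy = p1[0] - p0[0], p1[1] - p0[1]
    len2 = vx * vx + vy * vy
    for r in range(size):
        for c in range(size):
            cx, cy = c + 0.5, r + 0.5            # pixel center
            t = ((cx - p0[0]) * vx + (cy - p0[1]) * vy) / len2
            t = min(max(t, 0.0), 1.0)            # clamp projection to segment
            px, py = p0[0] + t * vx, p0[1] + t * vy
            if math.hypot(cx - px, cy - py) <= eps:
                # t equals alpha_k = |p_k - p_i| / |p_{i+1} - p_i|
                img[r][c] = (1.0 - t) * a0 + t * a1   # Eq. (1)
                alphas[(r, c)] = t
    return img, alphas

def backward_segment(grad_img, alphas):
    """NLR backward pass: accumulate dL/da0 and dL/da1 from the per-pixel
    gradients delta I_k via Eqs. (2)-(3)."""
    g0 = sum(grad_img[r][c] * (1.0 - a) for (r, c), a in alphas.items())
    g1 = sum(grad_img[r][c] * a for (r, c), a in alphas.items())
    return g0, g1
```

In a PyTorch implementation these two functions would form the forward and backward of a custom autograd operator, with gradients accumulated at endpoints shared by adjacent segments.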
Unlike existing methods [37, 34] that derive attention from feature maps produced by CNNs, with our NLR module we can take advantage of additional cues (i.e., temporal ordering and grouping information) in vector sketches for better attention map estimation, as shown in experiments (Sec. 4.2). These additional cues, however, are not accessible to methods with raster inputs.

4. Experiments

We have performed various experiments on two existing large-scale sketch recognition benchmarks, i.e., the TU-Berlin benchmark [3] and the QuickDraw benchmark [8], to validate the performance of our Sketch-R2CNN. These two benchmarks differ in several aspects, such as sketching style, acquisition procedure, and sketch quantity per category. Notably, sketches in the TU-Berlin benchmark tend to be more realistic, while the ones in QuickDraw are more iconic and abstract (Fig. 4). The TU-Berlin benchmark [3] contains 250 object categories with 80 sketches per category. Each sketch was created within 30 minutes by a participant from Amazon Mechanical Turk (AMT). The QuickDraw benchmark [8] contains 345 object categories with 75K sketches per category. During acquisition, the participants were given only 20 seconds to sketch an object.

Similar to [8], to simplify sketches in the TU-Berlin benchmark, we applied the Ramer-Douglas-Peucker (RDP) algorithm, resulting in a maximum point sequence length of 448 for the RNN. Following [44], we used three-fold cross-validation on this benchmark (i.e., two folds for training, one fold for testing). Sketches in the QuickDraw benchmark have already been preprocessed with the RDP simplification algorithm, and the maximum number of points in a sketch is 321. In each QuickDraw category, the 75K sketches have already been divided into training, validation and testing sets with sizes of 70K, 2.5K and 2.5K, respectively.

We implemented our Sketch-R2CNN and NLR module with PyTorch. We adopted Adam [13] for stochastic gradient descent updates with a mini-batch size of 48.
We used a learning rate of 0.0001 for training on QuickDraw and 0.00005 for training or fine-tuning on TU-Berlin (see Sec. 4.2 for the pre-training and training procedures). Due to the limited training data in the TU-Berlin benchmark, we followed [44] in performing data augmentation, including horizontal reflection, stroke removal and sketch deformation.
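For reference, the RDP simplification applied per stroke in preprocessing can be implemented as below. This is a textbook recursive version; the distance threshold (in pixels) is a free parameter of the algorithm and is not specified here:

```python
def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification of one stroke polyline, used
    to shorten point sequences before feeding the RNN. Keeps the point
    farthest from the chord if it deviates by more than epsilon, and
    recurses on both halves; otherwise collapses to the two endpoints."""
    if len(points) < 3:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5 or 1.0  # guard degenerate chords
    # perpendicular distance of each interior point to the chord
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = abs(dy * (points[i][0] - x0) - dx * (points[i][1] - y0)) / norm
        if d > dmax:
            dmax, idx = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = rdp(points[:idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return left[:-1] + right  # drop the duplicated split point
```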
Results on TU-Berlin Benchmark.
We have compared our method with a variety of existing methods on the TU-Berlin benchmark. Table 1 includes the results of some methods reported in [44]. These methods can be generally categorized into two groups. The first group follows the conventional pipeline of hand-crafted features plus a classifier, including the HOG-SVM method [3], structured ensemble matching [20], multi-kernel SVM [19], and the Fisher Vector based method [31]. The second group uses deep learning, including the state-of-the-art network Sketch-a-Net (the earlier version Sketch-a-Net v1 [45] and the later improved version Sketch-a-Net v2 [44]) and the networks evaluated in [44]: LeNet [16], AlexNet-SVM [14] and AlexNet-Sketch [14].

Model | Accuracy
Humans [3] | 73.1%
HOG-SVM [3] | 56.0%
Ensemble [20] | 61.5%
MKL-SVM [19] | 65.8%
Fisher-Vectors [31] | 68.9%
LeNet [16] | 55.2%
AlexNet-SVM [14] | 67.1%
AlexNet-Sketch [14] | 68.6%
Sketch-a-Net v1 [45] | 74.9%
Sketch-a-Net v2 [44] | 77.95%
Sketch-a-Net v2 (ours) [44] | 77.54%
ResNet50 [9] | 82.08%
Sketch-R2CNN (Sketch-a-Net v2) | 78.49%
Sketch-R2CNN (ResNet50) | 83.25%
Table 1. Evaluations on the TU-Berlin benchmark. Our method with ResNet50 as the CNN backbone achieves the highest recognition accuracy. Sketch-a-Net v2 (ours) is our PyTorch-based implementation.
We reimplemented Sketch-a-Net v2 with PyTorch since the original model [44], implemented with Caffe, is not compatible with our framework (i.e., the NLR module). We pre-trained Sketch-a-Net v2 on QuickDraw [8] instead of on preprocessed edge maps from photos [44] for ease of preparation and reproduction. Our best reproduced recognition accuracy of Sketch-a-Net v2 on the TU-Berlin benchmark is 77.54%, close to the accuracy of 77.95% reported with the original Caffe-based implementation [44]. In addition to Sketch-a-Net v2, we also evaluated ResNet50 [9], a more advanced CNN architecture that has been widely used for various visual tasks such as image classification [9] and object detection [21]. Specifically, before training on raster sketches of the TU-Berlin benchmark, we sequentially pre-trained the ResNet50 on ImageNet [29] and QuickDraw. The ResNet50 achieves a recognition accuracy of 82.08%, significantly outperforming the state-of-the-art approach Sketch-a-Net v2.

Since both Sketch-a-Net v2 and ResNet50 are CNN variants, they can be incorporated into our network architecture (Fig. 1) as the CNN backbone. By inserting one of these CNN alternatives into the proposed architecture, we can study how helpful the attention learned by the RNN can be for vector sketch recognition. The comparison results are summarized in Table 1. Our method incorporated with Sketch-a-Net v2, named Sketch-R2CNN (Sketch-a-Net v2) in Table 1, achieves a recognition accuracy of 78.49%, improving on Sketch-a-Net v2 (ours) by about 1%. Another variant of our method with ResNet50, named Sketch-R2CNN (ResNet50) in Table 1, achieves an accuracy of 83.25%, improving on the ResNet50-only model by about 1.2%, and surpasses all the existing approaches as well as human performance.

Figure 3. Visualization of attention maps, in grayscale and color coded, produced by our Sketch-R2CNN (ResNet50) and Attentive-ResNet50 (example categories: cabinet, helmet, spider, present). Recognition failures are in red and successes are in green. Attention maps of Attentive-ResNet50 are estimated from feature maps of the last layer of a residual block and are thus of much lower resolution than the attention maps produced by our method, which match the input resolution. (Best viewed in the electronic version.)

Alternatives Study on TU-Berlin Benchmark.
To validate our proposed architecture, we have studied several network design alternatives on the TU-Berlin benchmark (Table 2). First, as mentioned in Sec. 2, attention modules have been used in existing CNN architectures for image classification [37] and sketch retrieval [34]. To compare against our RNN-based attention module, we modified ResNet50 and inserted the spatial attention module proposed by Song et al. [34] after one of its residual blocks [9, 21]. This modified version of ResNet50 still takes binary sketch images as input and tries to compute attention maps from the feature maps of the preceding convolutional layers. This model, named Attentive-ResNet50 in Table 2, achieves a recognition accuracy of 82.42%, slightly higher than the 82.08% of the ResNet50-only model but lower than the 83.25% attained by our method, showing the comparatively higher effectiveness of the additional cues in vector sketches used by our method for attention estimation. Attention maps produced by our RNN-based attention module and Attentive-ResNet50 are visualized in Fig. 3. Note that our method only predicts attention for stroke pixels and sets non-stroke pixels to an attention value of zero, while Attentive-ResNet50 computes attention for every pixel of the attention map.

To study the influence of the temporal ordering information provided by humans on the RNN's attention estimation, we trained Sketch-R2CNN (ResNet50) with randomized stroke orders. That is, instead of keeping the human drawing order in the vector sketch, the stroke sequence is randomly shuffled. This scheme, named Random-Stroke-Order, achieves a slightly lower recognition accuracy of 82.78% than Sketch-R2CNN (ResNet50) on the TU-Berlin benchmark, but is still superior to the ResNet50-only model.
This indicates that the temporal information (i.e., stroke order) provided by humans can help the RNN learn more descriptive sequential features, confirming a similar conclusion drawn from the sketch retrieval experiments in [42].

Model | Accuracy
Attentive-ResNet50 [34] | 82.42%
Random-Stroke-Order | 82.78%
Attention-using-Sketching-Order | 81.74%
Two-Branch-Late-Fusion [42] | 81.43%
Two-Branch-Early-Fusion | 81.84%
Sketch-R2CNN (ResNet50) | 83.25%
Table 2. Alternative design choice studies on the TU-Berlin benchmark.
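The Random-Stroke-Order perturbation above can be reproduced with a small helper like the following, which splits the point sequence at stroke-end markers (s_i = 1, Sec. 3.1) and shuffles whole strokes; the seeding argument is ours, added for reproducibility:

```python
import random

def shuffle_strokes(points, seed=None):
    """Sketch of the Random-Stroke-Order ablation input: split the
    (x, y, s) sequence at points with s == 1 (stroke ends), then
    re-concatenate the strokes in random order, destroying the human
    drawing order while keeping the per-stroke grouping intact."""
    rng = random.Random(seed)
    strokes, cur = [], []
    for p in points:
        cur.append(p)
        if p[2] == 1:          # last point of the current stroke
            strokes.append(cur)
            cur = []
    if cur:                    # trailing stroke without an end marker
        strokes.append(cur)
    rng.shuffle(strokes)
    return [p for s in strokes for p in s]
```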
In addition to our RNN-based encoding method for vector sketches, we also explored a straightforward approach to allow CNNs to gain access to the sketching order information for feature extraction. Specifically, in a preprocessing step, for a sketch in the point sequence representation, we encode its ordering information into an image through rasterization by assigning an intensity value of one to the first point and zero to the last point, and linearly interpolating the intensities of the points in between. Fig. 5 shows some examples of the resulting images. This encoding scheme is based on the hypothesis that users tend to draw more "important" strokes first, and the resulting raster sketches can be considered temporal-encoding attention maps. We trained a ResNet50 with such hand-crafted attention maps as input, but found that this encoding scheme, with a recognition accuracy of 81.74% (Attention-using-Sketching-Order in Table 2), is not effective and is even slightly worse than the baseline with binary image inputs (ResNet50 in Table 1). This indicates that, for CNN-based recognition networks, stroke importance may not always be properly aligned with stroke order under such a straightforward encoding scheme, due to the different drawing styles of different users, and this encoding scheme may even pose challenges for CNNs in learning effective patterns. Thus, instead of "hard-coding" temporal information into images, a more adaptive and robust encoder (e.g., an RNN) is needed to accommodate sequential variations in vector sketches.

Next, we discuss arrangements of the RNN and CNN in the network architecture design. As mentioned before, Xu et al. [42] use a two-branch late-fusion architecture, which fuses the features extracted from a CNN branch and a parallel RNN branch, for sketch retrieval. In contrast, our design combines an RNN encoder and a CNN feature extractor sequentially in a single branch for sketch classification.
We therefore set up another experiment to investigate which of the above two types of architecture is a better scheme for incorporating the additional temporal ordering and grouping information in vector sketches. Following [42], we built a similar model, named Two-Branch-Late-Fusion in Table 2, which uses the same RNN cell and CNN backbone as Sketch-R2CNN (ResNet50) for fairness and consistency. The training procedure is the same as for Sketch-R2CNN (ResNet50), with the softmax cross-entropy loss [42]. Two-Branch-Late-Fusion achieves a recognition accuracy of 81.43% on the TU-Berlin benchmark, which is about 1.8% lower than Sketch-R2CNN (ResNet50). This result reveals that our proposed single-branch architecture can make the CNN, which works as an abstract visual concept extractor, and the RNN, which models human sketching orders, complement each other better than the two-branch architecture. Surprisingly, another observation is that the recognition accuracy of Two-Branch-Late-Fusion, adapted to the sketch classification task from the original sketch retrieval task, is even slightly inferior to that of the single CNN branch (ResNet50 in Table 1). This is also observed in the results on the QuickDraw benchmark, as presented in the following section. Due to the lack of implementation details in [42], we postulate that the differences in training strategies ([42]: multi-stage training of CNN and RNN; ours: joint training of CNN and RNN), CNN backbones ([42]: AlexNet; ours: ResNet50) and datasets ([42]: pruned QuickDraw dataset; ours: original TU-Berlin and QuickDraw datasets) may affect the learning of the late-fusion layer and cause the performance degradation.

Complementary to the above experiments on attention estimation with an RNN and on arrangements of the RNN and CNN, we extended the design choice exploration to an alternative way of injecting the learned attention from the RNN into the CNN.
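One way to realize such an injection of RNN attention into intermediate CNN features is sketched below, under our own assumption of bilinearly resizing the attention map to the feature resolution (the variant in the experiments instead rasterizes the attention map at the lower resolution directly):

```python
import torch
import torch.nn.functional as F

def inject_attention(feat, attn):
    """Weigh intermediate CNN feature maps (B, C, H, W) spatially by an
    RNN-derived attention map (B, 1, H', W'), resized to match. The
    single attention channel is broadcast over all C feature channels."""
    attn = F.interpolate(attn, size=feat.shape[-2:], mode='bilinear',
                         align_corners=False)
    return feat * attn
```

Because the attention map is zero off-stroke, the product zeroes the features at all non-stroke locations.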
In our proposed architecture, the CNN directly takes the attention map produced by the RNN as input. An alternative architecture is to weigh the feature maps of a certain intermediate layer of the CNN (which still takes binary sketch images as input) with the attention map produced by the RNN from the vector sketch. In our implementation, we inject the attention map produced by the RNN, rasterized at the resolution of the targeted feature maps with a correspondingly smaller stroke width threshold ε, into the output of one of the residual blocks [9, 21] of ResNet50. Following the same training procedures as those in Table 2, this alternative architecture, named Two-Branch-Early-Fusion, achieves a recognition accuracy of 81.84% on the TU-Berlin benchmark, performing slightly better than Two-Branch-Late-Fusion. However, the recognition accuracy of Two-Branch-Early-Fusion is still slightly inferior to that of the ResNet50-only model. This may be because non-stroke pixels in the attention map from the RNN have an attention value of zero, which, during the injection, makes the convolutional features at those locations vanish, reducing the feature information learned by the previous convolutional layers from the input.

Results on QuickDraw Benchmark.
Model                              Accuracy
Sketch-a-Net v2 [44]               74.84%
ResNet50 [9]                       82.48%
Two-Branch-Late-Fusion [42]        82.11%
Sketch-R2CNN (Sketch-a-Net v2)     77.29%
Sketch-R2CNN (ResNet50)            84.41%

Table 3. Evaluations on the QuickDraw benchmark.

We further compared the proposed Sketch-R2CNN with Sketch-a-Net v2 [44], the ResNet50-only model, and Two-Branch-Late-Fusion [42] on the QuickDraw benchmark. Note that ResNet50 is pre-trained on ImageNet [29] and serves as the CNN backbone in Sketch-R2CNN and Two-Branch-Late-Fusion. Quantitative results are summarized in Table 3, and the performance of each competing method on the QuickDraw benchmark agrees well with that on the TU-Berlin benchmark. Compared to the competitors, Sketch-R2CNN (ResNet50) achieves the highest recognition accuracy on the QuickDraw benchmark, echoing its performance on the TU-Berlin benchmark. It is a similar case for the ResNet50-only model, which still achieves better recognition performance than both Sketch-a-Net v2 and Two-Branch-Late-Fusion. Sketch-R2CNN (ResNet50) and Sketch-R2CNN (Sketch-a-Net v2) improve ResNet50 and Sketch-a-Net v2, respectively, by about 2%. Although the sketch quality of QuickDraw may not be as good as that of TU-Berlin, thanks to the voluminous data of QuickDraw (24.15M sketches for training, 862.5K sketches for validation or testing), we still see consistent performance improvement of Sketch-R2CNN over CNN-only models, showing the generality of our proposed architecture.

Qualitative Results.
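The per-backbone gains can be recomputed directly from the accuracies reported in Table 3; the snippet below does exactly that arithmetic.

```python
# Accuracies (%) on the QuickDraw benchmark, copied from Table 3.
accuracy = {
    "Sketch-a-Net v2": 74.84,
    "ResNet50": 82.48,
    "Sketch-R2CNN (Sketch-a-Net v2)": 77.29,
    "Sketch-R2CNN (ResNet50)": 84.41,
}

# Improvement of Sketch-R2CNN over its CNN-only baseline per backbone.
gain_resnet = round(accuracy["Sketch-R2CNN (ResNet50)"]
                    - accuracy["ResNet50"], 2)
gain_sketchanet = round(accuracy["Sketch-R2CNN (Sketch-a-Net v2)"]
                        - accuracy["Sketch-a-Net v2"], 2)
print(gain_resnet, gain_sketchanet)  # 1.93 2.45
```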
Fig. 4 shows some qualitative recognition comparisons between the CNN-only method (ResNet50) and our Sketch-R2CNN (ResNet50). Through visualization, we observe that the attention maps produced by the RNN in Sketch-R2CNN can help the CNN focus on the more effective stroke parts of the inputs and ignore the interference of irrelevant strokes (e.g., the circle around the crab in Fig. 4) to make better classifications. In contrast, the CNN-only model cannot access the additional ordering and grouping cues existing in vector sketches and thus tends to struggle with sketches that have similar shapes but different category labels. Fig. 5 visualizes the attention maps produced by our method and the ones encoding sketching order (used in Attention-using-Sketching-Order in Table 2). We observe that our attention maps estimated by the RNN share a certain degree of similarity with the ones using sketching order, but the attention magnitudes from the RNN are more adaptively biased.
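The hand-crafted sketching-order encoding compared against in Fig. 5 can be sketched as follows: strokes drawn earlier get higher attention, non-stroke pixels stay at zero. The stroke coordinates, canvas size, and the linear earlier-is-higher weighting are illustrative assumptions.

```python
import numpy as np

# Hypothetical vector sketch: a list of strokes, each an (n_i, 2) array
# of integer (x, y) point coordinates on a small canvas.
strokes = [
    np.array([[2, 2], [2, 10], [2, 20]]),    # drawn first
    np.array([[10, 5], [12, 5], [14, 5]]),   # drawn second
    np.array([[20, 20], [22, 22]]),          # drawn last
]

def order_attention_map(strokes, size=28):
    """Encode sketching order as attention: the i-th of n strokes gets
    value (n - i) / n, so earlier strokes receive higher attention and
    non-stroke pixels remain zero."""
    attn = np.zeros((size, size))
    n = len(strokes)
    for i, stroke in enumerate(strokes):
        value = (n - i) / n
        for x, y in stroke:
            attn[y, x] = value
    return attn

attn = order_attention_map(strokes)
print(attn[2, 2], round(attn[5, 10], 2), round(attn[20, 20], 2))  # 1.0 0.67 0.33
```

Unlike this fixed schedule, the attention estimated by our RNN is learned end-to-end and can deviate from drawing order when later strokes are more discriminative.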
Limitation.
As shown in Fig. 6, in some cases the RNN in Sketch-R2CNN may fail to produce correct attention guidance for the subsequent CNN, leading to recognition failures (e.g., the pumpkin), possibly due to the inability to extract effective sequential features from inputs with similar temporal ordering and grouping cues as training sketches in different categories. Some sketches with seemingly ambiguous categories (e.g., the toaster) may also pose challenges to our method. It is expected that humans would make similar mistakes on such cases. One possible solution to address the ambiguity is to put the sketched objects in context (i.e., scenes) and integrate our method with the context-based recognition methods [47, 48].

Figure 4. Recognition comparisons between the CNN-only method (ResNet50) and our Sketch-R2CNN (ResNet50 as the CNN backbone). Failures are in red and successes are in green. Attention maps produced by our RNN are shown in the second row and are color-coded. Note that our RNN only predicts attention for stroke pixels; non-stroke pixels are set to have an attention value of zero and are not color-coded.

Figure 5. The first row shows color-coded attention maps produced by our Sketch-R2CNN (ResNet50) for specific object categories. Correspondingly, in the second row, we directly encode the sketching order as attention maps, with higher attention values for strokes drawn earlier. Note that non-stroke pixels are set to have an attention value of zero and are not color-coded.

Figure 6. Recognition failures of our Sketch-R2CNN (ResNet50).
5. Conclusion
In this work, we have proposed a novel single-branch attentive network architecture named Sketch-R2CNN for vector sketch recognition. Our RNN-Rasterization-CNN design consistently improves the recognition accuracy of CNN-only models by 1-2% on two existing large-scale sketch recognition benchmarks. The key enabler for joining the RNN and CNN together is a novel differentiable neural line rasterization module that performs in-network vector-to-raster sketch conversion. Applying Sketch-R2CNN to other tasks, such as sketch retrieval or sketch synthesis, that need descriptive line-drawing features could be interesting to explore in the future.
References

[1] C. Alvarado and R. Davis. SketchREAD: A multi-domain sketch recognition engine. In Proc. ACM UIST. ACM, 2004.
[2] R. Arandjelović and T. M. Sezgin. Sketch recognition by fusion of temporal and image-based features. Pattern Recogn., 44(6):1225–1234, 2011.
[3] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM TOG, 31(4):44:1–44:10, July 2012.
[4] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa. Sketch-based shape retrieval. ACM TOG, 31(4):31:1–31:10, July 2012.
[5] B. Graham, M. Engelcke, and L. van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In Proc. IEEE CVPR, 2018.
[6] B. Graham and L. van der Maaten. Submanifold sparse convolutional networks. CoRR, abs/1706.01307, 2017.
[7] A. Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.
[8] D. Ha and D. Eck. A neural representation of sketch drawings. In Proc. ICLR, 2018.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE CVPR, June 2016.
[10] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proc. IEEE CVPR, 2018.
[11] Z. Huang, H. Fu, and R. W. H. Lau. Data-driven segmentation and labeling of freehand sketches. ACM TOG, 33(6):175:1–175:10, Nov. 2014.
[12] H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In Proc. IEEE CVPR, 2018.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[15] J. J. LaViola, Jr. and R. C. Zeleznik. MathPad2: A system for the creation and exploration of mathematical sketches. ACM TOG, 23(3):432–440, Aug. 2004.
[16] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Neural Networks: Tricks of the Trade: Second Edition, pages 9–48, 2012.
[17] K. Li, K. Pang, J. Song, Y.-Z. Song, T. Xiang, T. M. Hospedales, and H. Zhang. Universal sketch perceptual grouping. In Proc. ECCV, 2018.
[18] L. Li, H. Fu, and C.-L. Tai. Fast sketch segmentation and labeling with deep learning. CoRR, abs/1807.11847, 2018.
[19] Y. Li, T. M. Hospedales, Y.-Z. Song, and S. Gong. Free-hand sketch recognition by multi-kernel feature learning. CVIU, 137:1–11, 2015.
[20] Y. Li, Y.-Z. Song, and S. Gong. Sketch recognition by ensemble matching of structured features. In Proc. BMVC, 2013.
[21] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proc. IEEE CVPR, July 2017.
[22] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proc. IEEE CVPR, 2017.
[23] T. Lu, C.-L. Tai, F. Su, and S. Cai. A new recognition model for electronic architectural drawings. CAD, 37(10):1053–1069, 2005.
[24] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, pages 2204–2212, 2014.
[25] H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. In Proc. IEEE CVPR, 2017.
[26] L. Olsen, F. F. Samavati, M. C. Sousa, and J. A. Jorge. Sketch-based modeling: A survey. Comput. & Graph., 33(1):85–103, 2009.
[27] T. Y. Ouyang and R. Davis. ChemInk: A natural real-time recognition system for chemical drawings. In Proc. ACM IUI. ACM, 2011.
[28] U. Riaz Muhammad, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Learning deep sketch abstraction. In Proc. IEEE CVPR, June 2018.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, Dec. 2015.
[30] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The Sketchy Database: Learning to retrieve badly drawn bunnies. ACM TOG, 35(4):119:1–119:12, July 2016.
[31] R. G. Schneider and T. Tuytelaars. Sketch classification and classification-driven analysis using Fisher Vectors. ACM TOG, 33(6):174:1–174:9, Nov. 2014.
[32] T. M. Sezgin and R. Davis. Sketch recognition in interspersed drawings using time-based graphical models. Comput. & Graph., 32(5):500–510, 2008.
[33] J. Song, K. Pang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Learning to sketch with shortcut cycle consistency. In Proc. IEEE CVPR, June 2018.
[34] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In Proc. IEEE ICCV, 2017.
[35] Z. Sun, C. Wang, L. Zhang, and L. Zhang. Free hand-drawn sketch segmentation. In Proc. ECCV, pages 626–639. Springer, 2012.
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. IEEE CVPR, 2015.
[37] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In Proc. IEEE CVPR, 2017.
[38] F. Wang, L. Kang, and Y. Li. Sketch-based 3D shape retrieval using convolutional neural networks. In Proc. IEEE CVPR, 2015.
[39] X. Wang, X. Chen, and Z. Zha. SketchPointNet: A compact network for robust sketch recognition. In Proc. ICIP, pages 2994–2998, 2018.
[40] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proc. IEEE CVPR, 2015.
[41] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.
[42] P. Xu, Y. Huang, T. Yuan, K. Pang, Y.-Z. Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo. SketchMate: Deep hashing for million-scale human sketch retrieval. In Proc. IEEE CVPR, June 2018.
[43] E. Yanık and T. M. Sezgin. Active learning for sketch recognition. Comput. & Graph., 52:93–105, 2015.
[44] Q. Yu, Y. Yang, F. Liu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Sketch-a-Net: A deep neural network that beats humans. IJCV, 122(3):411–425, May 2017.
[45] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Sketch-a-Net that beats humans. In Proc. BMVC, pages 7.1–7.12, 2015.
[46] H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, and X. Cao. SketchNet: Sketch classification with web images. In Proc. IEEE CVPR, 2016.
[47] J. Zhang, Y. Chen, L. Li, H. Fu, and C.-L. Tai. Context-based sketch classification. In Proc. Expressive, pages 3:1–3:10. ACM, 2018.
[48] C. Zou, Q. Yu, R. Du, H. Mo, Y.-Z. Song, T. Xiang, C. Gao, B. Chen, and H. Zhang. SketchyScene: Richly-annotated scene sketches. In Proc. ECCV, 2018.