Fast-AT: Fast Automatic Thumbnail Generation using Deep Neural Networks
Seyed A. Esmaeili    Bharat Singh    Larry S. Davis
University of Maryland, College Park
{ [email protected], [email protected], [email protected] }

Abstract
Fast-AT is an automatic thumbnail generation system based on deep neural networks. It is a fully-convolutional deep neural network, which learns specific filters for thumbnails of different sizes and aspect ratios. During inference, the appropriate filter is selected depending on the dimensions of the target thumbnail. Unlike most previous work, Fast-AT does not utilize saliency but addresses the problem directly. In addition, it eliminates the need to conduct region search on the saliency map. The model generalizes to thumbnails of different sizes, including those with extreme aspect ratios, and can generate thumbnails in real time. A data set of more than 70,000 thumbnail annotations was collected to train Fast-AT. We show competitive results in comparison to existing techniques.
1. Introduction
Thumbnails are used to facilitate browsing of a collection of images, make economic use of display space, and reduce transmission time. A thumbnail is a smaller version of an original image that is meant to effectively portray the original image (Figure 1). Social media websites such as Facebook, Twitter, and Pinterest have content from multiple user accounts which needs to be displayed on a fixed resolution display. A normal web page on Facebook contains hundreds of images, which are essentially thumbnails of larger images. Therefore, it is important to ensure that each thumbnail displays the most useful information present in the original image. Since images displayed on a web page vary significantly in size and aspect ratio, any thumbnail generation algorithm must be able to generate thumbnails over a range of scales and aspect ratios.

Figure 1. Illustration of the thumbnail problem. The original image is shown on the left with thumbnails of different aspect ratios on the right.

The standard operations used for creating thumbnails are cropping and scaling. Since thumbnails are ubiquitous and the manual generation of thumbnails is time consuming, significant research has been devoted to automatic thumbnail generation. Most methods [20, 3, 2] utilize a saliency map to identify regions in the image that could serve as good crops to create thumbnails. This leads to a two-step solution where saliency is first computed and then an optimization problem is solved to find the optimum crop. While a recent method [9] addresses the problem directly, it involves hand-crafted features and uses SVMs to score candidate crops. Moreover, the implementation requires 60 seconds to produce a single thumbnail.

We propose Fast-AT, a deep learning based approach for thumbnail generation that addresses the problem directly, in an end-to-end learning framework. Our work involves the following contributions:

• Fast-AT is based on an object detection framework, which takes dimensions of the target thumbnail into account for generating crops. Since it produces thumbnails using a feed-forward network, it can process 9 images per second on a GPU.

• By vector quantizing aspect ratios, Fast-AT learns different filters for different aspect ratios during training. During inference, the appropriate filter is selected depending on the dimensions of the target thumbnail.

• A data set of more than 70,000 thumbnail annotations was collected to train and evaluate Fast-AT.
2. Related Work
Since thumbnail creation involves reducing the image size, retargeting methods [17] such as seam carving and non-homogeneous warping can be used. However, these methods produce artifacts which are often pronounced, since most thumbnails are significantly smaller than the original images. Therefore, most thumbnail generation methods use a combination of cropping and scaling. Typically, automatic thumbnail generators utilize a saliency map as an indicator of important regions in the image to be cropped [20, 3, 21, 2]. Region search is then performed to find the smallest region of the image that has a total saliency above a certain threshold. A brute force approach to region search is computationally expensive; therefore, approximations have been investigated, such as greedy search [20], restriction of the search space to a set of fixed size rectangles [19], and binarization of the saliency map [3]. Recently, an algorithm that conducts region search in linear time has been reported [2].

However, saliency can ignore the semantics of the scene and does not take the target thumbnail size into account. Many methods address this shortcoming through a heuristic approach, such as selecting a crop that contains all detected faces [20] or using an algorithm that depends on both the saliency and the image class. Sun et al. [21] improve the saliency map by taking the thumbnail scale into account and preserving object completeness. A scale and object aware saliency map is computed by using a contrast sensitivity function [14] and an objectness measure [1]; then greedy search is conducted to find the optimum region, similar to [20]. However, the method does not impose aspect ratio restrictions on the selected region, and the final thumbnail images can contain objects that look significantly deformed. In addition, whereas [2] introduced an algorithm which produces regions with restricted aspect ratios, it mentions that the problem could be infeasible for a given overall saliency threshold value. Some other approaches attempt to crop the most aesthetic part of the image [23, 15].

Huang et al. were the first to directly address the problem [9]. A data set of images and their manually generated thumbnails was collected. However, only one thumbnail size of 160 × 120 was considered. The solution also involved scoring a large set of candidate crops and then selecting the crop with the largest score. The implementation, albeit an unoptimized CPU code, required 60 seconds to generate a thumbnail for a single image. In addition, the solution [9] is based on hand-crafted features and SVMs, which generally have inferior performance compared to recent deep learning based methods.

Deep convolutional neural networks have achieved impressive results on high level visual tasks such as image classification [11, 18, 8], object detection [7, 16, 4], and semantic segmentation [12]. These architectures have not only led to significantly better results but also systems that can be deployed in real time [16]. We present a solution based on a fully convolutional deep neural network that is learned end-to-end. We take into account varying thumbnail sizes, from 32 × 32 to 200 × 200 pixels. At test time the network can produce thumbnails at 105 ms per image and shows significant improvements over the existing baselines.
3. Data Collection
We started the annotation of images from the photo quality data set of [13] using Amazon Mechanical Turk (AMT). The set includes both high and low quality images and spans a number of categories such as humans, animals, and landscape. Target thumbnail sizes were divided into three groups: thumbnails between 32 and 64, 64 and 128, and 100 and 200 pixels, in both height and width. This leads to an aspect ratio range from 0.5 to 2. Each image was annotated three times, with a different target thumbnail size from each group.

The annotation was done through an interface that draws a bounding box on the original image with an aspect ratio equal to that of the thumbnail; users can only scale the box up or down and change its location. This bounding box represents the selected crop. It is scaled down to the thumbnail size and shown to the user at the same time. Restricting the bounding box (crop) to have an aspect ratio equal to the thumbnail's aspect ratio leads to more flexible annotation and avoids any possible deformation effects.

To make the interface more practical, the images were scaled down such that the height does not exceed 650 pixels and the width does not exceed 800 pixels. The Mechanical Turk workers were shown examples of good and bad thumbnails. The examples were intended to illustrate that good thumbnails capture a significant amount of content while at the same time being easy to recognize. After the data set was collected, the thumbnail images were manually swept through and bad annotations were excluded; this led to a total of 70,048 annotations over 28,064 images, with each image having at most 3 annotations.
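To make the interface rescaling rule concrete, the sketch below computes the display scale factor; `display_scale` and its argument names are our own illustration, not part of the released annotation tool.

```python
def display_scale(width, height, max_w=800, max_h=650):
    """Factor that shrinks an image so its height stays within 650 px
    and its width within 800 px; images already within the limits are
    left unscaled."""
    return min(1.0, max_w / width, max_h / height)

# Example: a 1600x1200 photo gets scale min(1, 0.5, 0.5417) = 0.5,
# i.e. it is displayed at 800x600.
```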
4. Does target thumbnail size matter?
An automatic thumbnail generation system receives two inputs: the image and the target thumbnail. Therefore, we study the dependence between the target thumbnail and the generated crop. It is clear that the generated crop should have an aspect ratio equal to the target thumbnail's aspect ratio. Selecting a crop of a different aspect ratio could cause pronounced deformations when scaling the crop down to the thumbnail size, as shown in Figure 2. It is worth noting that although deformations can be caused by selecting crops with aspect ratios different from that of the thumbnail, this has been ignored in some work [20, 21].

Figure 2. The above thumbnails were generated using the code from [21], which is agnostic to the thumbnail aspect ratio. The objects look clearly deformed in the thumbnails where the variation in aspect ratio is significant.

Intuitively, it is expected that smaller input thumbnail sizes would typically require smaller crops, since larger crops would be less recognizable when they are scaled down. To investigate this, we compare the average area of the annotated crop vs. the thumbnail size in the annotated dataset. However, we did not observe any correlation between the two, as shown in Figure 3. Thus, we concluded that to produce the optimal crop, the thumbnail size does not need to be taken into account, but the aspect ratio matters for this dataset. We still consider a model that takes both the aspect ratio and the size of the thumbnail into account in our experiments.
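The correlation check behind Figure 3 amounts to binning annotations by target thumbnail size and averaging the crop areas; a minimal sketch follows, where the annotation tuple layout and function name are our own assumptions.

```python
import numpy as np
from collections import defaultdict

def mean_crop_area_by_thumb_size(annotations, bin_width=16):
    """annotations: iterable of (thumb_w, thumb_h, crop_w, crop_h).
    Returns {thumbnail size bin -> mean crop area, normalized by the
    largest bin mean}, mirroring the quantity plotted in Figure 3."""
    bins = defaultdict(list)
    for thumb_w, thumb_h, crop_w, crop_h in annotations:
        size = max(thumb_w, thumb_h)            # one scalar per thumbnail
        bins[size // bin_width].append(crop_w * crop_h)
    means = {b: float(np.mean(a)) for b, a in bins.items()}
    peak = max(means.values())
    return {b * bin_width: m / peak for b, m in sorted(means.items())}
```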
5. Approach
Thumbnails are created by selecting a region in the image to be cropped (a bounding box), followed by scaling the bounding box down to the thumbnail size. We present a solution to this problem employing a deep convolutional neural network that learns the best bounding boxes to produce thumbnails. Since we formulate the problem as a bounding box prediction problem, it is closely connected to object detection. However, unlike object detection, the final predictions will not consist of bounding boxes with a discrete probability distribution across different classes, but involve two classes: one that is representative of the image and another which is not.

Early deep learning methods for object detection utilized proposal methods such as [22] that were time-consuming. Computation time was significantly reduced using region proposal networks (RPNs) [16] that learn to generate proposals. R-FCN [4], which was recently proposed, reduces the computational expense of forward propagating the pooled proposal features through two fully connected layers by introducing a new convolutional layer consisting of class-specific position-sensitive filters. Specifically, if there are C classes to be detected, then this new convolutional layer will generate k²(C + 1) feature maps. The k² position-sensitive score maps correspond to k × k evenly partitioned cells. Those k² feature maps are associated with different relative positions in the spatial grid, such as (top-left, ..., bottom-right), for every class; k = 3 corresponds to a 3 × 3 spatial grid and 9 position-sensitive filters per class. Every class (including the background) will have k² feature maps associated with it. Instead of forward propagating through two fully connected layers, position-sensitive pooling followed by score averaging is performed. This generates a (C + 1)-d vector on which a softmax function is applied to obtain responses across categories.

Figure 3. The plot above shows the average crop area for a given thumbnail size. The crop area is normalized by the maximum value. The average crop area is not generally smaller for smaller thumbnail sizes and it does not vary significantly.

An architecture for thumbnail generation should be fully convolutional, because including a fully connected layer requires a fixed input size. If there were a mismatch between the aspect ratio of an image and the fixed input size, the image would have to be cropped in addition to being scaled. Because the thumbnail crops (bounding boxes) could touch the boundaries of the image or even extend to the whole image, the pre-processing step of cropping a region of the image could lead to sub-optimal predictions, since part of the image has been removed. Therefore, an architecture similar to [18] that was used for the ImageNet localization challenge [5], which simply replaces the class scores by 4-d bounding box predictions, cannot be employed because of the fully connected layer at the end.

Another observation is that unlike object detection, the thumbnail generation network receives two inputs: the image and the thumbnail aspect ratio. Both RPN and R-FCN introduce task-specific filters. In the case of RPN, filter banks that specialize in predicting proposals of a specific scale are obtained by modifying the training policy. In the case of R-FCN, position-sensitive filters specialize through the position-sensitive pooling mechanism. In a similar manner, we modify R-FCN for thumbnail creation by introducing a set of aspect ratio-specific filter banks.
A set of A points is introduced in the aspect ratio range of [0.5, 2], representing aspect ratios that grow by a constant factor (a geometric sequence), i.e., it is of the form $S = \{0.5c, 0.5c^2, \ldots, 0.5c^A\}$. Note that $0.5c^0 = 0.5$ and $0.5c^{A+1} = 2$, leading to $c = \sqrt[A+1]{4}$. The filter banks in the last convolutional layer in R-FCN are modified into A pairs, with each pair having a total of 2k² filters. Each pair is associated with a single element in the set S. Similar to R-FCN, position-sensitive pooling followed by averaging is performed over that pair, and those two values are used to yield a softmax prediction of representativeness.

At training time, when an image-thumbnail size pair is received, the image is forward propagated through convolutional layers up to the last convolutional layer. The thumbnail aspect ratio is calculated and the element with the closest value is selected from S; the pair associated with that element factors in the training while the others are ignored. For this pair, the proposals are received and, similar to object detection, positive and negative labels are assigned to the proposals based on their intersection over union (IoU) with the ground truth. Specifically, a positive label is assigned if the IoU ≥ 0.5 and a negative label otherwise. Similarly, A aspect ratio-specific regressors are trained, one for each element in S; these are similar to class-specific regressors. For a given proposal, we employ the following loss:

$$L(s_i, t_i) = \sum_{i=1}^{A} l_i L_{cls}(s_i, s^*) + \lambda [s^* = 1 \wedge l_i] L_{reg}(t_i, t^*),$$

where $l_i$ is either ignore (= 0) or factor-in (= 1), namely

$$l_i = \begin{cases} 1 & \text{if } i = \arg\min_j |0.5c^j - \text{thumbnail aspect ratio}|, \\ 0 & \text{otherwise,} \end{cases}$$

$s_i$ is the representativeness score predicted by the i-th pair, $s^*$ is the ground truth label, and $L_{cls}$ is the cross entropy loss. $\lambda$ is a weight for the regression loss, which we set to 1. The regression loss is 0 for all but the nearest aspect ratio. For the filter corresponding to the nearest aspect ratio, $L_{reg}$ is the smooth L1 loss as defined in [6], $t_i$ is the bounding box prediction made by the i-th regressor, and $t^*$ is the ground truth bounding box. Both predictions are parametrized as in [6]. Figure 4 illustrates the architecture.

Since each regressor is responsible for a range of input thumbnail sizes, the predictions made by any regressor at test time could have an aspect ratio that differs from the target thumbnail aspect ratio. Therefore, the output bounding box has to be rectified to have an aspect ratio equal to the thumbnail's aspect ratio, to eliminate any possible deformations when scaling down. We employ a simple method where a new bounding box with an aspect ratio equal to that of the target thumbnail is placed at the center of the predicted box and is expanded until it touches the boundaries. Since the predicted box already has an aspect ratio close to that of the thumbnail, the difference between the rectified box and the predicted box is not significant, as shown in Figure 5(b).

Figure 4. Illustration of the Fast-AT architecture and training policy. The appropriate filter is decided based on the thumbnail aspect ratio.

Our implementation of Fast-AT is based on ResNet-101 [8]; a learning rate of 0.001, momentum of 0.9, and weight decay of 0.0005 are used with approximate joint training [16].
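As a minimal sketch of the aspect ratio machinery described above (the function names and corner-format box convention are our own, not the released implementation; we also read "expanded until it touches the boundaries" as the predicted box's boundaries):

```python
def aspect_ratio_set(A=5, lo=0.5, hi=2.0):
    """S = {lo*c, lo*c**2, ..., lo*c**A} with lo*c**(A+1) = hi, i.e.
    c = (hi/lo) ** (1 / (A + 1)); for A = 5 this gives roughly
    {0.63, 0.79, 1.00, 1.26, 1.59}."""
    c = (hi / lo) ** (1.0 / (A + 1))
    return [lo * c ** i for i in range(1, A + 1)]

def select_pair(S, thumb_w, thumb_h):
    """Index of the aspect-ratio-specific filter pair / regressor whose
    aspect ratio is closest to the target thumbnail's (the l_i mask is
    1 at this index and 0 elsewhere)."""
    target = thumb_w / float(thumb_h)
    return min(range(len(S)), key=lambda i: abs(S[i] - target))

def rectify(box, target_ar):
    """Largest box with the thumbnail's aspect ratio, centered inside
    the predicted box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2.0, y1 + h / 2.0
    if w / h > target_ar:
        w = h * target_ar        # predicted box too wide: shrink width
    else:
        h = w / target_ar        # predicted box too tall: shrink height
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```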
Among the baselines we consider is R-FCN without any modifications; in effect, it performs object detection between two classes. We find that R-FCN alone generates bounding boxes that are good representations of the original image. But since the architecture is agnostic to the input thumbnail dimensions, the generated thumbnails are of low quality, as shown in Figure 5(a). If we apply the same rectification to the generated boxes, to cancel the deformation effects, important parts of the images are not preserved, in contrast to our model's results, which are shown in Figure 5(b). This is because of the significant mismatch between the target thumbnail aspect ratio and the predicted box aspect ratio. Eliminating the rectification step would lead to deformed results, similar to what is shown in Figure 2.
6. Experiments
In comparing models we use the following metrics:

• offset: the distance between the center of the ground truth bounding box and the center of the predicted bounding box.

• rescaling factor (rescaling): defined as max(s_g/s_p, s_p/s_g), where s_g and s_p are the rescaling factors for the ground truth and predicted box, respectively [9].

• IoU: intersection over union between the predicted box and the ground truth.

• aspect ratio mismatch (mismatch): the square of the difference between the aspect ratio of the predicted box and the aspect ratio of the thumbnail.

Figure 5. R-FCN and Fast-AT predictions on test set images. (a): The original image is displayed with the R-FCN prediction in blue, the rectified box in red, and the resulting thumbnail below. Note how the resulting thumbnail is missing important parts of the original image. (b): The original image is displayed with the Fast-AT prediction in blue, the rectified box in red, and the resulting thumbnail below. The rectification does not introduce a significant change in the box predicted by Fast-AT.

Model                           offset   rescaling   IoU    mismatch
R-FCN                           56.2     1.192       0.64   0.102
Fast-AT (AR)                    55.0     1.149       0.68   0.010
Fast-AT (AR+TS)                 55.4     1.154       0.68   0.012
Fast-AT (AR, scale up of 350)   53.1     1.156       0.69   0.024

Table 1. Metrics computed using different models: R-FCN, Fast-AT with aspect ratio mapping (AR), Fast-AT with aspect ratio and thumbnail size mapping (AR+TS), and Fast-AT with aspect ratio mapping scaled up to 350.

The total dataset, consisting of 70,048 annotations over 28,064 images, was split into 24,154 images for training with 63,043 annotations (90% of the total annotations) and 3,910 images for testing with 7,005 annotations (10% of the total annotations). The training and test sets do not share any images. Comparative results between different models are shown in Table 1. The first model we use is R-FCN without modification; this architecture is agnostic to thumbnail dimensions, the number of classes is reduced to two, and the architecture is modified accordingly. We see that R-FCN alone has good performance in terms of all metrics except for aspect ratio mismatch. High values in this metric show that rectifying the bounding box will cause a significant change in the predicted box. We next consider our proposed model, where we map based on aspect ratio, with 5 divisions (A = 5). We see significant improvements in the metrics: the mean IoU has increased by 4%, and the offset and rescaling factor have been reduced. The aspect ratio mismatch has also been significantly reduced. We then extend the divisions to thumbnail sizes as well. In this case we divide the input thumbnail space into three branches: small thumbnails (32-64), medium (64-100), and large (100-200). Each branch is further divided based on aspect ratio as in the first model. This leads to a total of 3 × 5 = 15 regressors and 15 pairs of position-sensitive filter banks. This did not lead to an improvement over the model with only aspect ratio divisions.

Unlike object detection bounding boxes, the predicted bounding boxes for thumbnails can enclose multiple objects and may extend to the whole image. So while a network with a small receptive field may predict accurate bounding boxes for object detection, its predictions for thumbnail crops may be inaccurate. For object detection, Faster R-CNN [16] effectively reduces the receptive field of the network by scaling up the image so that the smallest dimension is 600. This step is implemented in R-FCN as well. We reduce the image dimension (minimum of height/width) from 600 to 350 in Fast-AT to investigate the effect of the receptive field. We observe a slight improvement in the offset and IoU, as shown in Table 1. The improvement is not large because we use ResNet-101 [8], which already has a large receptive field. Fast-AT's architecture is generic, and hence other models such as VGGNet [18] or ZFNet [24] can also be used. Such shallower models are likely to benefit significantly in thumbnail generation if their receptive fields are extended.

We further compare Fast-AT with aspect ratio divisions and Fast-AT with aspect ratio and thumbnail size divisions over thumbnails with a small size, below 64 × 64. We do not see any significant improvement (Table 2). This further confirms our initial conclusion that it is the thumbnail aspect ratio, not the thumbnail size, that matters.

Model             offset   rescaling   IoU    mismatch
Fast-AT (AR)      55.9     1.149       0.67   0.011
Fast-AT (AR+TS)   55.0     1.153       0.68   0.012

Table 2. Performance comparison between Fast-AT with aspect ratio mapping (AR) and Fast-AT with aspect ratio and thumbnail size mapping (AR+TS) at small thumbnail sizes below 64 × 64.

We also measure the metrics at extreme aspect ratios, i.e. aspect ratios below 0.7 and above 1.8; the results are shown in Table 3. We observe a significant drop in R-FCN's performance: IoU falls by about 6% and the mismatch almost doubles. At the same time, Fast-AT still performs well. This shows that Fast-AT can handle thumbnails of widely varying aspect ratios.
Model             offset   rescaling   IoU    mismatch
R-FCN             57.5     1.348       0.58   0.200
Fast-AT (AR)      49.8     1.18        0.68   0.013
Fast-AT (AR+TS)   50.7     1.183       0.68   0.014

Table 3. Performance comparison between R-FCN, Fast-AT with aspect ratio mapping (AR), and Fast-AT with aspect ratio and thumbnail size mapping (AR+TS) at extreme values of aspect ratio (above 1.8 and below 0.7).
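For concreteness, the four metrics used in these tables can be computed from corner-format boxes as in the sketch below (our own helper, assuming (x1, y1, x2, y2) boxes; note the thumbnail dimensions cancel out of the rescaling ratio, so box widths suffice):

```python
def thumbnail_metrics(pred, gt, thumb_ar):
    """Offset, rescaling factor, IoU, and aspect ratio mismatch for a
    predicted box vs. a ground truth box, both (x1, y1, x2, y2)."""
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]

    # offset: distance between box centers
    dx = (pred[0] + pred[2]) / 2.0 - (gt[0] + gt[2]) / 2.0
    dy = (pred[1] + pred[3]) / 2.0 - (gt[1] + gt[3]) / 2.0
    offset = (dx ** 2 + dy ** 2) ** 0.5

    # rescaling: max(s_g / s_p, s_p / s_g); each s is (box width /
    # thumbnail width), so the thumbnail width cancels in the ratio
    rescaling = max(gw / pw, pw / gw)

    # IoU
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    iou = inter / (pw * ph + gw * gh - inter)

    # mismatch: squared difference between the predicted box aspect
    # ratio and the target thumbnail aspect ratio
    mismatch = (pw / ph - thumb_ar) ** 2
    return offset, rescaling, iou, mismatch
```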
7. Evaluation
We compare our method to other methods by metric evaluations, by visual results, and through a user study. We compare against four methods:

• Scale and Object-Aware Saliency (SOAT): In this method, scale and object-aware saliency is computed and a greedy algorithm is used to conduct region search over the generated saliency map [21].

• Efficient Cropping: This method generates a saliency map and conducts region search in linear time. Unlike previous methods, the search can be restricted to regions with a specific aspect ratio. However, when the aspect ratio mismatch between the image and the thumbnail is significant, a solution may not exist. In such cases, we apply the method without the aspect ratio restriction. We run this method at a saliency threshold of 0.7 [2].

• Aesthetic Cropping: This method attempts to generate a crop with optimum aesthetic quality [23].

• Visual Representativeness and Foreground Recognizability (VRFR): This method is very similar to ours in objective. However, it can only generate thumbnails for a fixed size of 160 × 120 [9].

Since the code was not released for the aesthetic method and VRFR, our comparison with them is limited to a user study on their publicly available data set of 200 images with their generated thumbnails [9].
For comparing different methods, we use the same metrics that were used in the experiments section. In addition, we use the hit ratio $h_r$ and the background ratio $b_r$ [9], which are defined as:

$$h_r = \frac{|g \cap p|}{|g|} \quad \text{and} \quad b_r = \frac{|p| - |g \cap p|}{|g|},$$

where g is the ground truth box and p is the predicted box. The metrics are computed over an annotated test set consisting of 7,005 annotations over 3,910 images and averaged. Table 4 shows the performance of the different methods. Note that the offset is higher than that reported in [9]. Unlike the MIRFLICKR-25000 dataset [10] that was used in [9], the data set we use has images with larger variation in size, some with low quality, and it includes many images with multiple objects. In addition, our thumbnails have an aspect ratio that varies from 0.5 to 2. This makes the data set significantly more challenging and would explain the large increase in the offset values compared to the results reported in [9]. We find that our method performs the best in terms of offset, rescaling factor, and IoU. We note that efficient cropping has a non-zero aspect ratio mismatch, indicating that there were examples where the problem was infeasible when the aspect ratio restriction was imposed. This is expected given the wide variation of thumbnail aspect ratios in our data set. Unsurprisingly, SOAT, which is agnostic to the target aspect ratio, has the highest aspect ratio mismatch.

The hit ratio represents the percentage of the ground truth that was captured by the bounding box, and the background ratio represents the percentage of bounding box area that lies outside the ground truth. The optimal method should be close to the ground truth and therefore should have a large hit ratio and a small background ratio. We find that the performance of the different methods in terms of hit and background ratios is similar to the results reported in [9]. Namely, the saliency based methods focus on a relatively small region having large saliency. This leads to small crops, which explains the low values for the hit and background ratios. Our method, in comparison, is distinguished by a large hit ratio and a low background ratio. This shows that it closely matches the ground truth boxes.

Figure 6. Images and their generated thumbnails: The original image is on the left; to its right we display the thumbnails: top is SOAT, middle is efficient cropping, and bottom is Fast-AT.

Method                   offset   rescaling factor   IoU    aspect ratio mismatch   h_r     b_r
SOAT [21]                80.5     1.378              0.52   0.204                   68.7%   41.6%
Efficient Cropping [2]   88.3     1.329              0.52   0.176                   64.4%   34.3%
Fast-AT (aspect ratio)   55.0     1.148              0.68   0.010                   83.7%   37.1%

Table 4. Metrics evaluated on different thumbnail generation methods.
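A minimal sketch of these two ratios, using the same (x1, y1, x2, y2) box convention as the earlier metric sketch (the |g| denominator in b_r follows the formula above):

```python
def hit_and_background_ratios(pred, gt):
    """h_r = |g ∩ p| / |g| and b_r = (|p| - |g ∩ p|) / |g|."""
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy                                     # |g ∩ p|
    g_area = (gt[2] - gt[0]) * (gt[3] - gt[1])          # |g|
    p_area = (pred[2] - pred[0]) * (pred[3] - pred[1])  # |p|
    return inter / g_area, (p_area - inter) / g_area
```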
We show qualitative results in comparison to the other baselines in Figure 6. Saliency based methods succeed in preserving important content; in some examples, however, their final thumbnails can have pronounced deformations. This can be seen in many examples for SOAT and in some examples for efficient cropping. In addition, these methods ignore the semantics of the scene and may ignore important parts of the image. This can be seen in the third and fourth examples for SOAT and in the first and second examples for efficient cropping. At the same time, it can be seen from the examples that Fast-AT succeeds in each case. It preserves the content of the scene and predicts thumbnails that tightly enclose the most representative part of the image.
We performed a user study where users were shown the original image and the generated thumbnails. They were asked to select the best thumbnail among SOAT, Efficient Cropping, and Fast-AT. A total of 372 images were randomly picked from the test set. 30 Mechanical Turk users participated, and no user was allowed to vote on more than 30 images. We have included the results of this study in Table 5. Fast-AT clearly outperforms the other two methods.

SOAT [21]    Efficient Cropping [2]   Fast-AT
88 (23.7%)   86 (23.1%)               198 (53.2%)

Table 5. Number of votes for each method.
We performed another study over the released 200 images from [9] using the results of [21, 9, 23]. The results are shown in Table 6. Although VRFR [9] takes 60 seconds per image and works for only one thumbnail size (160 × 120), Fast-AT performs slightly better in the user study.

SOAT [21]   Aesthetic [23]   VRFR [9]      Fast-AT
34 (8.5%)   92 (23%)         135 (33.7%)   139 (34.7%)

Table 6. Number of votes for each method.

8. Failure Cases and Multiple Predictions

We also investigate failure cases of Fast-AT. We look for examples in the test set where the prediction has an IoU with the ground truth below 0.1. Figure 7(a) shows some examples; the ground truth box is in green and the prediction is in blue. We see that although the prediction is very different from the ground truth, in some cases it still captures representative regions of the original image. Furthermore, for some of these failure cases, we take the prediction with the second or third highest confidence. Figure 7(b) shows examples where the second or third predictions are close to the ground truth. This suggests that if the system is to be deployed, users could benefit if the system outputs a small set of top predictions instead of one. These predictions can be treated as a set of candidates from which the user picks the best solution. We also see a significant improvement in the performance of some metrics if the second prediction is also used. The third prediction did not lead to a significant improvement, as shown in Table 7.
Figure 7. Failure cases of Fast-AT. (a): The prediction can have low IoU with the ground truth but still capture a representative region. (b): We show that the second or third most confident prediction is close to the ground truth.
Model   offset   rescaling   IoU     mismatch
Top 1   55.0     1.149       0.677   0.010
Top 2   50.4     1.152       0.693   0.011
Top 3   50.3     1.152       0.693   0.011

Table 7. Performance of Fast-AT using the top 1, 2, and 3 predictions. The offset and IoU are significantly improved by using the top 2 predictions; the other metrics do not change significantly. Using the third prediction does not lead to significant improvement.
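A sketch of one way to read "using the top k predictions": we assume an oracle that keeps whichever of the k most confident boxes scores best against the ground truth, since the source does not spell out the selection rule.

```python
def best_of_top_k(scored_boxes, gt, k, metric):
    """scored_boxes: list of (confidence, box) pairs; metric(box, gt)
    returns a score where higher is better (e.g. IoU). Keeps the best
    scoring box among the k most confident predictions."""
    top_k = sorted(scored_boxes, key=lambda sb: sb[0], reverse=True)[:k]
    return max(metric(box, gt) for _, box in top_k)
```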
Another interesting case is when there is a significant aspect ratio mismatch between the representative part of the image and the thumbnail's aspect ratio. Because of the significant aspect ratio mismatch, the crop cannot capture all of the representative part of the image. We show that our algorithm is capable of producing multiple crops that cover different representative parts of the image. Figure 8 shows some examples. In the first three images (from the left), the region of interest is spread horizontally, but the thumbnail's aspect ratio is very small (tall thumbnail). The reverse is true for the rightmost image. The first row shows the bounding box prediction with the highest confidence and the second row shows the bounding box prediction with the second highest confidence. We see that these predictions cover different representative parts of the image.
Figure 8. In the above images there is a significant aspect ratio mismatch between the region of interest in the image and the thumbnail's aspect ratio. The prediction with the highest confidence is shown in the first row and the prediction with the second highest confidence is shown in the second row. The predictions are successful in covering different representative regions of the image.
9. Conclusion
We presented a solution to the automatic thumbnail generation problem that does not depend on saliency or heuristic considerations but rather attacks the problem directly. A large data set consisting of 70,048 annotations over 28,064 images was collected. A CNN designed to generate thumbnails in real time was trained using this set. Metric and qualitative evaluations have shown superior performance over existing methods. In addition, a user study has shown that our method is preferred over other baselines.
10. Acknowledgement
We thank the graduate students in our lab who annotated an initial dataset which was collected to check the feasibility of this approach. We are also grateful to the anonymous reviewers who provided valuable feedback for improving this paper.
References

[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.
[2] J. Chen, G. Bai, S. Liang, and Z. Li. Automatic image cropping: A computational complexity study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 507–515, 2016.
[3] G. Ciocca, C. Cusano, F. Gasparini, and R. Schettini. Self-adaptive image cropping for small displays. IEEE Transactions on Consumer Electronics, 53(4):1622–1627, 2007.
[4] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. arXiv preprint arXiv:1605.06409, 2016.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[6] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[9] J. Huang, H. Chen, B. Wang, and S. Lin. Automatic thumbnail generation based on visual representativeness and foreground recognizability. In Proceedings of the IEEE International Conference on Computer Vision, pages 253–261, 2015.
[10] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, pages 39–43. ACM, 2008.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[12] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[13] W. Luo, X. Wang, and X. Tang. Content-based photo quality assessment. In Proceedings of the IEEE International Conference on Computer Vision, pages 2206–2213. IEEE, 2011.
[14] J. Mannos and D. Sakrison. The effects of a visual fidelity criterion of the encoding of images. IEEE Transactions on Information Theory, 20(4):525–536, 1974.
[15] M. Nishiyama, T. Okabe, Y. Sato, and I. Sato. Sensation-based photo cropping. In Proceedings of the 17th ACM International Conference on Multimedia, pages 669–672. ACM, 2009.
[16] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[17] M. Rubinstein, D. Gutierrez, O. Sorkine, and A. Shamir. A comparative study of image retargeting. In ACM Transactions on Graphics (TOG), volume 29, page 160. ACM, 2010.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] F. Stentiford. Attention based auto image cropping. In Workshop on Computational Attention and Applications, ICVS, volume 1. Citeseer, 2007.
[20] B. Suh, H. Ling, B. B. Bederson, and D. W. Jacobs. Automatic thumbnail cropping and its effectiveness. In Proceedings of the 16th Annual ACM Symposium on User Interface Software and Technology, pages 95–104. ACM, 2003.
[21] J. Sun and H. Ling. Scale and object aware image thumbnailing. International Journal of Computer Vision, 104(2):135–153, 2013.
[22] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[23] J. Yan, S. Lin, S. Bing Kang, and X. Tang. Learning the change for automatic image cropping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 971–978, 2013.
[24] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.