PR Product: A Substitute for Inner Product in Neural Networks
Zhennan Wang†, Wenbin Zou†, Chen Xu∗
Shenzhen University
[email protected], {wzou, xuchen szu}@szu.edu.cn

Abstract
In this paper, we analyze the inner product of weight vector w and data vector x in neural networks from the perspective of vector orthogonal decomposition, and prove that the direction gradient of w decreases as the angle between them approaches 0 or π. We propose the Projection and Rejection Product (PR Product) to make the direction gradient of w independent of the angle and consistently larger than the one in the standard inner product, while keeping the forward propagation identical. As a reliable substitute for the standard inner product, the PR Product can be applied to many existing deep learning modules, so we develop the PR Product versions of the fully connected layer, convolutional layer and LSTM layer. In static image classification, experiments on the CIFAR10 and CIFAR100 datasets demonstrate that the PR Product can robustly enhance the ability of various state-of-the-art classification networks. On the task of image captioning, even without any bells and whistles, the PR Product version of our captioning model can compete with or outperform the state-of-the-art models on the MS COCO dataset. Code has been made available at: https://github.com/wzn0828/PR_Product.
1. Introduction
Models based on neural networks, especially deep convolutional neural networks (CNN) and recurrent neural networks (RNN), have achieved state-of-the-art results in various computer vision tasks [11, 10, 1]. Most of the optimization algorithms for these models rely on gradient-based learning, so it is necessary to analyze the gradient of the inner product between a weight vector $w \in \mathbb{R}^d$ and a data vector $x \in \mathbb{R}^d$, a basic operation in neural networks. Denoting the inner product by $P(w, x) = w^T x$, the gradient of $P(w, x)$ w.r.t. $w$ is exactly the data vector $x$, which can be orthogonally decomposed into the vector projection $P_x$ onto $w$ and the vector rejection $R_x$ from $w$, as shown in Figure 1 (a). The vector projection $P_x$ is parallel to the weight vector $w$ and will update the length of $w$ in the next training iteration; we call it the length gradient. The vector rejection $R_x$ is orthogonal to $w$ and will change the direction of $w$; we call it the direction gradient.

† The authors are with the College of Electronic and Information Engineering, the Shenzhen Key Laboratory of Advanced Machine Learning and Applications, and the Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University.
∗ The corresponding author is with the College of Mathematics and Statistics, Shenzhen University.

Figure 1. The orthogonal decomposition of the gradient w.r.t. the weight vector $w$ in two-dimensional space. (a) The case of the standard inner product. (b) The case of our proposed PR Product. For the length gradient, both are the vector projection $P_x$ of $x$ onto $w$. However, the direction gradient is changed from the vector rejection $R_x$ in (a) to $\|x\| E_{rx}$ in (b), where $E_{rx}$ represents the unit vector along $R_x$.

Driven by the orthogonal decomposition of the gradient w.r.t. $w$, a question arises: which is the key factor for optimization, the length gradient or the direction gradient? To answer this question, we optimize three 5-layer fully connected neural networks on Fashion-MNIST [36] with different variants of the inner product: the standard inner product, a variant without the length gradient, and a variant without the direction gradient. The top-1 accuracy is 88.42%, 88.32% and 38.59%, respectively. From these comparative experiments we observe that the direction gradient is the key factor for optimization and is far more critical than the length gradient, which might be unsurprising. However, the direction gradient is very small when $w$ and $x$ are nearly parallel, which hampers the update of the direction of the weight vector $w$.

On the other hand, in Euclidean space, the geometric definition of the inner product is the product of the Euclidean lengths of the two vectors and the cosine of the angle between them. That is, $P(w, x) = w^T x = \|w\| \|x\| \cos\theta$, where $\|\cdot\|$ denotes the Euclidean length of a vector and $\theta$ is the angle between $w$ and $x$ with the range $[0, 2\pi)$. From this formulation, we can see that $\theta$ is strongly connected with the direction of the weight vector $w$. The gradient of $P$ w.r.t. $\theta$ is $\partial P / \partial\theta = -\|w\| \|x\| \sin\theta$, which becomes small as $\theta$ approaches 0 or $\pi$ and thus hinders the optimization. Several recent investigations of backpropagation [4, 43] focus on modifying the gradient of the activation function.
However, few works propose variants of backpropagation for the inner product function. In this paper, we propose the Projection and Rejection Product (abbreviated as PR Product), which changes the backpropagation of the standard inner product to eliminate the dependence of the direction gradient of $w$, and of the gradient w.r.t. $\theta$, on the value of $\theta$. We first show that the standard inner product of $w$ and $x$ only contains the information of the vector projection $P_x$, which is the main cause of this dependence. In contrast, our proposed PR Product involves the information of both the vector projection $P_x$ and the vector rejection $R_x$: we rewrite the standard inner product into a different form and hold suitable components of that form fixed during the backward pass. We further analyze the gradients of the PR Product w.r.t. $\theta$ and $w$. For $\theta$, the absolute value of the gradient changes from $\|w\| \|x\| |\sin\theta|$ to $\|w\| \|x\|$. For $w$, the length of the direction gradient changes from $\|x\| |\sin\theta|$ to $\|x\|$, as shown in Figure 1.

There are several advantages of using the PR Product: (a) the PR Product gets a different backward pass while the forward pass remains exactly the same as the standard inner product; (b) compared with the behavior of the standard inner product, the PR Product increases the proportion of the direction gradient, which is the key factor for optimization; (c) as the PR Product maintains the linear property, it can be a reliable substitute for the inner product operation in the fully connected layer, convolutional layer and recurrent layer. By reliable, we mean it does not introduce any additional parameters and matches the original configurations such as the activation function, batch normalization and dropout.

We showcase the effectiveness of the PR Product on image classification and image captioning tasks. For both tasks, we replace all the fully connected layers, convolutional layers and recurrent layers of the backbone models with their PR Product versions. Experiments on image classification demonstrate that the PR Product can typically improve the accuracy of state-of-the-art classification models. Moreover, our analysis on image captioning confirms that the PR Product indeed changes the dynamics of neural networks. Without any tricks for improving performance, such as scene graphs or ensemble strategies, our PR Product version of the captioning model achieves results on par with the state-of-the-art models.

In summary, the main contributions of this paper are:
• We propose the PR Product, a reliable substitute for the standard inner product of weight vector w and data vector x in neural networks, which changes the backpropagation while keeping the forward propagation identical;
• We develop PR-FC, PR-CNN and PR-LSTM, which apply the PR Product to the fully connected layer, convolutional layer and LSTM layer respectively;
• Our experiments on image classification and image captioning suggest that the PR Product is generally effective and can become a basic operation of neural networks.
2. Related Work
Variants of Backpropagation.
Several recent investigations have considered variants of standard backpropagation. In particular, [22] presents a surprisingly simple backpropagation mechanism that assigns blame by multiplying error signals by random weights, instead of the synaptic weights on each neuron, and propagating them further downstream. [2] exhaustively considers many Hebbian learning algorithms. The straight-through estimator proposed in [4] heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument. [43] proposes Linear Backprop, which backpropagates error terms only linearly. Different from these methods, our proposed PR Product changes the gradient of the inner product function during backpropagation while keeping the forward propagation identical.
Image Classification.
Deep convolutional neural networks [18, 32, 11, 12, 45, 37, 13] have become the dominant machine learning approach for image classification. To train very deep networks, shortcut connections have become an essential part of modern architectures. For example, Highway Networks [33, 34] present shortcut connections with gating functions, while variants of ResNet [11, 12, 45, 37] use identity shortcut connections. DenseNet [13], a more recent network with several parallel shortcut connections, connects each layer to every other layer in a feed-forward fashion.
Image Captioning.
In the early stage of the vision-to-language field, template-based methods [7, 20] generate caption templates whose slots are filled in by the outputs of object detection, attribute classification and scene recognition, which results in captions that sound unnatural. Recently, inspired by advances in the NLP field, models based on the encoder-decoder architecture [17, 16, 35, 14] have achieved striking advances. These approaches typically use a pretrained CNN model as the image encoder, combined with an RNN decoder trained to predict the probability distribution over a set of possible words. To better incorporate the image information into the language processing, visual attention for image captioning was first introduced by [38], which allows the decoder to automatically focus on the image subregions that are important for the current time step. Because of the remarkable improvement in performance, many extensions of the visual attention mechanism [44, 5, 40, 9, 27, 1] have been proposed to push the limits of this framework for caption generation. Besides these extensions to the visual attention mechanism, several attempts [31, 26] have been made to adapt reinforcement learning to address the discrepancy between the training and testing objectives of image captioning. More recently, some methods [41, 15, 25, 39] exploit scene graphs to incorporate visual relationship knowledge into captioning models for better descriptive abilities.
3. The Projection and Rejection Product
In this section, we begin by briefly revisiting the standard inner product of weight vector w and data vector x. Then we formally propose the Projection and Rejection Product (PR Product), which involves the information of both the vector projection of x onto w and the vector rejection of x from w. Moreover, we analyze the gradients of the PR Product. Finally, we develop the PR Product versions of the fully connected layer, convolutional layer and LSTM layer. In the following, for simplicity of derivation, we only consider a single data vector $x \in \mathbb{R}^d$ and a single weight vector $w \in \mathbb{R}^d$, except for the last subsection.

In Euclidean space, the inner product P of the two Euclidean vectors w and x is defined by:

$$P(w, x) = w^T x = \|w\| \|x\| \cos\theta \qquad (1)$$

where $\|\cdot\|$ is the Euclidean length of a vector, and $\theta$ is the angle between $w$ and $x$ with the range $[0, 2\pi)$. From this formulation, we can observe that the angle $\theta$ explicitly affects the state of neural networks.

The gradient of P w.r.t. θ. Neither the weight vector $w$ nor the data vector $x$ is a function of $\theta$, so it is easy to get:

$$\frac{\partial P}{\partial \theta} = -\|w\| \|x\| \sin\theta \qquad (2)$$

The gradient of P w.r.t. w. From Equation (1) and Figure 1 (a), it is easy to obtain the gradient of P w.r.t. $w$:

$$\frac{\partial P}{\partial w} = x = P_x + R_x \qquad (3)$$

Here, $R_x$ is the direction gradient of $w$. From Figure 1 (a) and Equation (2), we can see that both the gradient of P w.r.t. $\theta$ and the length of $R_x$ are close to 0 when $\theta$ is close to 0 or $\pi$, which hampers the optimization of neural networks.

From Figure 1 (a), we can easily get the length of $P_x$:

$$\|P_x\| = \|x\| |\cos\theta| \qquad (4)$$

And the length of $R_x$ is:

$$\|R_x\| = \|x\| |\sin\theta| \qquad (5)$$

So Equation (1) can be reformulated as:

$$P(w, x) = \begin{cases} -\|w\| \|P_x\|, & \text{if } \pi/2 \le \theta < 3\pi/2 \\ \|w\| \|P_x\|, & \text{otherwise} \end{cases} = \mathrm{sign}(\cos\theta)\, \|w\| \|P_x\| \qquad (6)$$

where sign(∗) denotes the sign of ∗. We can observe that this formulation only contains the information of the vector projection of $x$ onto $w$, $P_x$. As shown in Figure 1, the vector projection $P_x$ changes very little when $\theta$ is near 0 or $\pi$, which may block the optimization of neural networks. Although the length of the rejection vector $R_x$ is small when $\theta$ is close to 0 or $\pi$, it varies greatly there and thus is able to support the optimization of neural networks. That is the basic motivation for the proposed PR Product.

In order to take advantage of the vector rejection, the simplest way is to replace $\|P_x\|$ in Equation (6) with $\|P_x\| + \|R_x\|$. But the trends of $\|P_x\|$ and $\|R_x\|$ with $\theta$ are inconsistent, so we employ $\|x\| - \|R_x\|$ to involve the information of the vector rejection. In addition, we utilize two coefficients to maintain the linear property, which are held fixed during the backward pass.
To be more detailed, we derive the PR Product as follows:

$$\begin{aligned} P_R(w, x) &= \mathrm{sign}(\cos\theta)\, \|w\| \left[ \overline{\frac{\|R_x\|}{\|x\|}}\, \|P_x\| + \overline{\frac{\|P_x\|}{\|x\|}} \left( \|x\| - \|R_x\| \right) \right] \\ &= \|w\| \left[ \overline{|\sin\theta|}\, \|P_x\|\, \mathrm{sign}(\cos\theta) + \overline{\cos\theta} \left( \|x\| - \|R_x\| \right) \right] \\ &= \|w\| \|x\| \left[ \overline{|\sin\theta|} \cos\theta + \overline{\cos\theta} \left( 1 - |\sin\theta| \right) \right] \end{aligned} \qquad (7)$$

For clarity, we denote the proposed product function by $P_R$. Note that $\overline{*}$ denotes detaching ∗ from the neural network; by detaching, we mean that ∗ is treated as a constant rather than a variable during the backward pass. Compared with the standard inner product formulation (Equation (6) or (1)), this formulation involves not only the information of the vector projection $P_x$ but also that of the vector rejection $R_x$, without any additional parameters. We call this formulation the Projection and Rejection Product, or PR Product for brevity.

Although the PR Product does not change the outcome during the forward pass, it changes the gradients during the backward pass compared with the standard inner product. In the following, we theoretically derive the gradients of $P_R$ w.r.t. $\theta$ and $w$ during backpropagation.
The gradient of $P_R$ w.r.t. θ. We just need to calculate the gradients of the trigonometric functions, except for the detached ones in Equation (7). When $\theta$ is in the range $[0, \pi)$, the gradient of $P_R$ w.r.t. $\theta$ is:

$$\frac{\partial P_R}{\partial \theta} = \|w\| \|x\| \left( -\sin^2\theta - \cos^2\theta \right) = -\|w\| \|x\| \qquad (8)$$

When $\theta$ is in the range $[\pi, 2\pi)$, the gradient of $P_R$ w.r.t. $\theta$ is:

$$\frac{\partial P_R}{\partial \theta} = \|w\| \|x\| \left( \sin^2\theta + \cos^2\theta \right) = \|w\| \|x\| \qquad (9)$$

We use the following unified form to express the above two cases:

$$\frac{\partial P_R}{\partial \theta} = \|w\| \|x\|\, \mathrm{sign}(-\sin\theta) \qquad (10)$$

Compared with the standard inner product (Equation (2)), the PR Product changes the gradient w.r.t. $\theta$ from a smooth function to a hard one. One advantage of this is that the gradient w.r.t. $\theta$ does not decrease as $\theta$ gets close to 0 or $\pi$, providing continuous power for the optimization of neural networks.

The gradient of $P_R$ w.r.t. w. Above we discussed the gradient w.r.t. $\theta$, an implicit variable in neural networks. In this part, we explicitly look at the differences between the gradients of the standard inner product and our proposed PR Product w.r.t. $w$. For the PR Product, we derive the gradient of $P_R$ w.r.t. $w$ from Equation (7) and Equation (10) as follows:

$$\begin{aligned} \frac{\partial P_R}{\partial w} &= \frac{w}{\|w\|} \|x\| \cos\theta + \|w\| \|x\|\, \mathrm{sign}(-\sin\theta)\, \frac{\partial \theta}{\partial w} \\ &= P_x + \|w\| \|x\|\, \mathrm{sign}(-\sin\theta)\, \frac{\mathrm{d}\theta}{\mathrm{d}\cos\theta} \frac{\partial \cos\theta}{\partial w} \\ &= P_x + \frac{\|w\| \|x\|}{|\sin\theta|}\, \frac{\partial}{\partial w} \left( \frac{w^T x}{\|w\| \|x\|} \right) \\ &= P_x + \frac{\|w\| \|x\|}{|\sin\theta|} \cdot \frac{(I - M_w)\, x}{\|w\| \|x\|} \\ &= P_x + \frac{R_x}{|\sin\theta|} = P_x + \|x\| \frac{R_x}{\|R_x\|} = P_x + \|x\| E_{rx}, \quad \text{with } M_w = \frac{w w^T}{\|w\|^2} \end{aligned} \qquad (11)$$

where $M_w$ is the projection matrix that projects onto the weight vector $w$, so that $M_w x = P_x$, and $E_{rx}$ is the unit vector along the vector rejection $R_x$. Similar to Equation (3), $P_x$ is the length gradient part and $\|x\| E_{rx}$ is the direction gradient part. For the length gradient, the cases in P and $P_R$ are identical. For the direction gradient part, however, the one in $P_R$ is consistently larger than the one in P, except for the almost impossible cases when $\theta$ equals $\pi/2$ or $3\pi/2$. So $P_R$ increases the proportion of the direction gradient. In addition, the length of the direction gradient in $P_R$ is independent of the value of $\theta$. Figure 1 compares the gradients of the two formulations w.r.t. $w$.

The PR Product is a reliable substitute for the standard inner product operation, so it can be applied to many existing deep learning modules, such as the fully connected layer (FC), the convolutional layer (CNN) and the LSTM layer. We denote the module X equipped with the PR Product by PR-X. In this section, we show the implementation of PR-FC, PR-CNN and PR-LSTM.
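Before turning to the individual modules, the following is a minimal PyTorch sketch of the PR Product itself (our own illustrative helper under assumed tensor shapes, not the authors' released code). It computes Equation (7) for data vectors x and a weight matrix W whose columns are the weight vectors; the trigonometric coefficients are detached, so the forward value coincides with the standard inner product while the backward pass follows Equation (7):

```python
import torch

def pr_product(x: torch.Tensor, W: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """PR Product between data vectors x of shape (..., d) and weight columns W of shape (d, n)."""
    inner = x @ W                                       # standard inner product, (..., n)
    x_norm = x.norm(dim=-1, keepdim=True)               # ||x||, (..., 1)
    w_norm = W.norm(dim=0)                              # ||w_i|| for each column, (n,)
    cos = inner / (x_norm * w_norm + eps)               # cos(theta)
    abs_sin = (1.0 - cos.pow(2)).clamp(min=0).sqrt()    # |sin(theta)|
    # Detached coefficients act as constants during backpropagation (Equation (7)).
    return w_norm * x_norm * (abs_sin.detach() * cos + cos.detach() * (1.0 - abs_sin))
```

Up to the small eps guard against division by zero, the returned values equal x @ W exactly, so the routine can stand in for an inner product without touching the surrounding architecture.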
PR-FC.
To get PR-FC, we just replace the inner product of the data vector x and each weight vector in the weight matrix with the PR Product. Suppose the weight matrix W consists of a set of n column vectors, $W = (w_1, w_2, \ldots, w_n)$; then the output vector of PR-FC can be calculated as follows:

$$\mathrm{PR\text{-}FC}(W, x) = \left( P_R(w_1, x),\, P_R(w_2, x),\, \ldots,\, P_R(w_n, x) \right) + b \qquad (12)$$

where b represents an additive bias vector, if any.
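Building on the pr_product helper above, PR-FC can be sketched as a drop-in replacement for PyTorch's nn.Linear (the class name PRLinear is our own, hypothetical choice):

```python
import torch
from torch import nn

class PRLinear(nn.Linear):
    """nn.Linear whose weight rows act as the w_i of Equation (12)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # nn.Linear stores weight as (out_features, in_features); transposing
        # makes each column one weight vector w_i.
        out = pr_product(x, self.weight.t())
        if self.bias is not None:
            out = out + self.bias
        return out
```

Because the layer adds no parameters and leaves forward values unchanged, it can be swapped in without retuning the activation function, batch normalization or dropout settings.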
To apply the PR Product to CNN, we convert the weight tensor of the convolutional kernel and the data tensor in the sliding window into vectors in Euclidean space, and then use the PR Product to calculate the output. Suppose the size of the convolution kernel w is $(k, k, C_{in})$; the output at position (i, j) is then:

$$\mathrm{PR\text{-}CNN}(w, x)_{ij} = P_R\left( \mathrm{flatten}(w),\, \mathrm{flatten}(x_{[ij]}) \right) + b \qquad (13)$$

where $\mathrm{flatten}(w),\, \mathrm{flatten}(x_{[ij]}) \in \mathbb{R}^{k \cdot k \cdot C_{in}}$, $x_{[ij]}$ represents the data tensor in the sliding window corresponding to output position (i, j), and b represents an additive bias, if any.

Model              CIFAR10   CIFAR100
ResNet110          6.23      28.08
PR-ResNet110
PreResNet110       5.99      27.08
PR-PreResNet110
WRN-28-10          4.34
PR-WRN-28-10

Table 1. Error rates on CIFAR10 and CIFAR100. The best results are highlighted in bold for the models with the same backbone architectures. All values are reported in percentage. The PR Product versions typically outperform the corresponding backbone models.
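To make this construction concrete, a PR convolution can be sketched with torch.nn.functional.unfold, which extracts exactly the flattened sliding windows that Equation (13) operates on. This is our simplified illustration assuming groups=1 and integer padding, not an optimized layer; the class name is ours:

```python
import torch
from torch import nn
import torch.nn.functional as F

class PRConv2d(nn.Conv2d):
    """nn.Conv2d computing each output via the PR Product of Equation (13)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B = x.shape[0]
        # Columns of flattened sliding windows: (B, C_in*k*k, L), with L = H_out * W_out.
        cols = F.unfold(x, self.kernel_size, self.dilation, self.padding, self.stride)
        w = self.weight.flatten(1)                           # (C_out, C_in*k*k)
        out = pr_product(cols.transpose(1, 2).reshape(-1, w.shape[1]), w.t())
        out = out.reshape(B, -1, w.shape[0]).transpose(1, 2)  # (B, C_out, L)
        if self.bias is not None:
            out = out + self.bias.view(1, -1, 1)
        h_out = (x.shape[2] + 2 * self.padding[0]
                 - self.dilation[0] * (self.kernel_size[0] - 1) - 1) // self.stride[0] + 1
        w_out = (x.shape[3] + 2 * self.padding[1]
                 - self.dilation[1] * (self.kernel_size[1] - 1) - 1) // self.stride[1] + 1
        return out.reshape(B, w.shape[0], h_out, w_out)
```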
PR-LSTM. To get the PR Product version of LSTM, we just replace all the perceptrons in each gate function with PR-FC. For each element of the input sequence, each layer computes the following function:

$$\begin{aligned} i_t &= \sigma\left( \mathrm{PR\text{-}FC}(W_{ii}, x_t) + \mathrm{PR\text{-}FC}(W_{hi}, h_{t-1}) + b_i \right) \\ f_t &= \sigma\left( \mathrm{PR\text{-}FC}(W_{if}, x_t) + \mathrm{PR\text{-}FC}(W_{hf}, h_{t-1}) + b_f \right) \\ g_t &= \tanh\left( \mathrm{PR\text{-}FC}(W_{ig}, x_t) + \mathrm{PR\text{-}FC}(W_{hg}, h_{t-1}) + b_g \right) \\ o_t &= \sigma\left( \mathrm{PR\text{-}FC}(W_{io}, x_t) + \mathrm{PR\text{-}FC}(W_{ho}, h_{t-1}) + b_o \right) \\ c_t &= f_t * c_{t-1} + i_t * g_t \\ h_t &= o_t * \tanh(c_t) \end{aligned} \qquad (14)$$

where σ is the sigmoid function and ∗ is the Hadamard product.
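Since every gate is a sum of two PR-FCs, one PR-LSTM step can be sketched by stacking the four gates into two PRLinear layers; this grouping is equivalent to Equation (14) because the PR Product treats each output column independently. The cell below is our illustrative code, not the authors' implementation:

```python
import torch
from torch import nn

class PRLSTMCell(nn.Module):
    """One step of Equation (14), with the i, f, g, o gates stacked."""
    def __init__(self, d_in: int, d_hid: int):
        super().__init__()
        self.ih = PRLinear(d_in, 4 * d_hid)               # carries the gate biases
        self.hh = PRLinear(d_hid, 4 * d_hid, bias=False)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        i, f, g, o = (self.ih(x_t) + self.hh(h_prev)).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)
        h_t = o * torch.tanh(c_t)
        return h_t, c_t
```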
In the following, we conduct experiments on image classification to validate the effectiveness of PR-CNN, and then show the effectiveness of PR-FC and PR-LSTM on the image captioning task.

4. Experiments on Image Classification
We employ various classic networks, such as ResNet [11], PreResNet [12], WideResNet [45] and DenseNet-BC [13], as the backbone networks in our experiments.
Figure 2. Decoder module used in our captioning model. The input to the Attention PR-LSTM consists of the global image representation $v_g$ and the embedding of the previously generated word $W_e \Pi_t$. The input to the Language PR-LSTM consists of the attended image representation $\hat{v}_t$ concatenated with the output of the Attention PR-LSTM. The dotted arrows represent the transfer of the hidden states of the PR-LSTM layers.

In particular, we consider ResNet with 110 layers, denoted ResNet110; PreResNet with 110 layers, denoted PreResNet110; WideResNet with 28 layers and a widening factor of 10, denoted WRN-28-10; and DenseNet-BC with 100 layers and a growth rate of 12, denoted DenseNet-BC-100-12. For ResNet110 and PreResNet110, we use the classic basic block. To get the corresponding PR Product version models, all the fully connected layers and convolutional layers in the backbone models are replaced with our PR-FC and PR-CNN respectively, and we denote the resulting models by PR-X: PR-ResNet110, PR-PreResNet110, PR-WRN-28-10 and PR-DenseNet-BC-100-12.

We conduct our image classification experiments on the CIFAR datasets [19], which consist of 50k training and 10k test images of 32×32 pixels, labeled with 10 and 100 categories (the CIFAR10 and CIFAR100 datasets). We train on the training set and evaluate on the test set. We follow the simple data augmentation in [21] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image. Note that our focus is on the effectiveness of the proposed PR Product, not on pushing state-of-the-art results, so we do not use any further data augmentation or training tricks to improve accuracy.

Product   B1     B2     B3     B4     M      RL     C       S
P         76.7   60.8   47.3   36.8   28.1   56.9   116.0
R         76.3   60.4   46.7   36.0   27.7   56.5   113.3   20.6
PR
P∗
R∗
PR∗

Table 2. Performance comparison of different products on the test portion of the Karpathy splits of the MS COCO dataset, where Bn is short for BLEU-n, M for METEOR, RL for ROUGE-L, C for CIDEr, and S for SPICE. The top part is for cross-entropy training, and the bottom part is for CIDEr optimization (marked with ∗). All values are reported in percentage, with the highest value of each entry highlighted in boldface.

For fair comparison, not only the PR-X models but also the corresponding backbone models are trained from scratch, so our results may differ slightly from those presented in the original papers due to hyper-parameters such as random number seeds. The strategies and hyper-parameters used to train the respective backbone models, such as the optimization solver, learning rate schedule, parameter initialization method, random seed for initialization, batch size and weight decay, are adopted to train the corresponding PR-X models. The results are shown in Table 1 and some training curves are shown in the supplementary material, from which we can see that the PR-X models typically improve over the corresponding backbone models on both CIFAR10 and CIFAR100. On average, the PR Product reduces the top-1 error by 0.27% on CIFAR10 and 0.16% on CIFAR100. It is worth emphasizing that the PR-X models do not introduce any additional parameters and keep the same hyper-parameters as the corresponding backbone models.
5. Experiments on Image Captioning
We utilize the widely used encoder-decoder framework [1, 27] as our backbone model for image captioning.
Encoder.
We use the Bottom-Up model proposed in [1] to generate the regional representations and the global representation of a given image I. The Bottom-Up model employs Faster R-CNN [29] in conjunction with ResNet-101 [11] to generate a variably-sized set of k representations, $A = \{a_1, \ldots, a_k\}$, such that each representation encodes a salient region of the image. We use the globally average-pooled image representation $a_g = \frac{1}{k} \sum_i a_i$ as our global image representation. For modeling convenience, we use a single layer of PR-FC with rectifier activation function to transform the representation vectors into new vectors with dimension d:

$$v_i = \mathrm{ReLU}\left( \mathrm{PR\text{-}FC}(W_a, a_i) \right), \quad v_i \in \mathbb{R}^d \qquad (15)$$

$$v_g = \mathrm{ReLU}\left( \mathrm{PR\text{-}FC}(W_g, a_g) \right), \quad v_g \in \mathbb{R}^d \qquad (16)$$

where $W_a$ and $W_g$ are the weight parameters. The transformed $V = \{v_1, \ldots, v_k\}$ is our defined set of regional image representations, and $v_g$ is our defined global image representation.

Figure 3. The minimum of |sin θ| of the hidden-to-hidden transfer part in the Attention LSTM (Max of |cos θ| and Min of |sin θ| over iterations, for the P Product and the PR Product).

Decoder. For decoding the image representations V and $v_g$ into a sentence description, as shown in Figure 2, we utilize a visual attention model with two PR-LSTM layers, following recent methods [1, 28, 41]; the two layers are referred to as the Attention PR-LSTM and the Language PR-LSTM respectively. We initialize the hidden state and memory cell of each PR-LSTM to zero.

Given the output $h^1_t$ of the Attention PR-LSTM, we generate the attended regional image representation $\hat{v}_t$ through the attention model broadly adopted in recent work [5, 27, 1]. Here, we use the PR Product version of the visual attention model, expressed as follows:

$$\begin{aligned} f_1 &= \tanh\left( \mathrm{PR\text{-}FC}(W_v, V) + \mathrm{PR\text{-}FC}(W_{h1}, h^1_t) \right) \\ f_2 &= \mathrm{PR\text{-}FC}(W_z, f_1) \\ \alpha_t &= \mathrm{softmax}(f_2) \\ \hat{v}_t &= \sum_{i=1}^{k} \alpha_{t,i}\, v_i \end{aligned} \qquad (17)$$

where $W_v$, $W_{h1}$ and $W_z$ are learned parameters, $f_1$ and $f_2$ are the outputs of the first and second layers of the attention model respectively, $\alpha_t$ is the attention weight over the k regional image representations, and $\hat{v}_t$ is the attended image representation at time step t.
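A compact sketch of this attention step, reusing the hypothetical PRLinear layer defined earlier (tensor shapes and member names are our assumptions, not the paper's code):

```python
import torch
from torch import nn

class PRAttention(nn.Module):
    """PR Product visual attention of Equation (17)."""
    def __init__(self, d: int, d_att: int):
        super().__init__()
        self.W_v = PRLinear(d, d_att, bias=False)    # transforms each region v_i
        self.W_h1 = PRLinear(d, d_att, bias=False)   # transforms the hidden state
        self.W_z = PRLinear(d_att, 1, bias=False)    # scores each region

    def forward(self, V: torch.Tensor, h1: torch.Tensor) -> torch.Tensor:
        # V: (B, k, d) regional representations; h1: (B, d) Attention PR-LSTM output.
        f1 = torch.tanh(self.W_v(V) + self.W_h1(h1).unsqueeze(1))  # (B, k, d_att)
        alpha = torch.softmax(self.W_z(f1).squeeze(-1), dim=1)     # (B, k) weights
        return (alpha.unsqueeze(-1) * V).sum(dim=1)                # attended v̂_t
```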
Dataset. We evaluate our proposed method on the MS COCO dataset [23], which contains 123,287 images labeled with at least 5 captions each. There are 82,783 training images and 40,504 validation images, and 40,775 images are provided as the test set for online evaluation. For offline evaluation, we use a set of 5,000 images for validation, a set of 5,000 images for test, and the remainder for training, as given in [16]. We truncate captions longer than 16 words and then build a vocabulary of the words that occur at least 5 times in the training set, resulting in 9,487 words.

Model                   BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
LSTM-A [42]             73.5    56.6    42.9    32.4    25.5    53.9     99.8   18.5
SCN-LSTM Σ [8]          74.1    57.8    44.4    34.1    26.1    -        104.1  -
Adaptive [27]           74.2    58.0    43.9    33.2    26.6    -        108.5  -
SCST:Att2all Σ [31]     -       -       -       32.2    26.7    54.8     104.7  -
Up-Down [1]             77.2    -       -       36.2    27.0    56.4     113.5  20.3
Stack-Cap [9]           76.2    60.4    46.4    35.2    26.5    -        109.1  -
ARNet [6]               74.0    57.6    44.0    33.5    26.1    54.6     103.4  19.0
NBT [28]                75.5    -       -       34.7    27.1    -        107.2  20.1
GCN-LSTM_sem [41]               -       -       36.8    27.9    57.0
EmbeddingReward∗ [30]   71.3    53.9    40.3    30.4    25.1    52.5     93.7   -
LSTM-A∗ [42]            78.6    -       -       35.5    27.3    56.8     118.3  20.8
SCST:Att2all Σ∗ [31]    -       -       -       35.4    27.1    56.6     117.5  -
Up-Down∗ [1]            79.8    -       -       36.3    27.7    56.9     120.1  21.4
Stack-Cap∗ [9]          78.6    62.5    47.9    36.1    27.4    56.9     120.4  20.9
GCN-LSTM∗_sem [41]      80.5    -       -       38.2    28.5    58.3     127.6  22.0
CAVP∗ [25]              -       -       -       38.6    28.3    58.5     126.3  21.6
SGAE∗ [39]                      -       -       38.4    28.4

Table 3. Performance compared with the state-of-the-art methods on the Karpathy test split of MS COCO. Σ indicates an ensemble. The top part is for cross-entropy training, and the bottom part is for REINFORCE-based optimization (marked with ∗). All values are reported in percentage, with the highest value of each entry highlighted in boldface.

Implementation Details.
In the captioning model, we set the number of hidden units in each LSTM or PR-LSTM to 512, the embedding dimension of a word to 512, and the embedding dimension of the image representation to 512. All of our models are trained according to the following recipe. We first train under the cross-entropy loss using the ADAM optimizer with a momentum parameter of 0.9, annealing the learning rate with a cosine decay schedule and increasing the probability of feeding back a sample of the word posterior by 0.05 every 5 epochs until we reach a feedback probability of 0.25 [3]. We then run REINFORCE training to optimize the CIDEr metric, again using ADAM with a cosine decay schedule and a momentum parameter of 0.9. During CIDEr optimization and testing, we use a beam size of 5. Note that in all our model variants, the untransformed image representations A and $a_g$ from the Encoder are fixed and not fine-tuned. As our focus is on the effectiveness of the proposed PR Product, we simply use the widely adopted backbone model and settings, without any additional tricks for improving performance, such as scene graphs or ensemble strategies. The effectiveness of PR Product.
To test the effectiveness of the PR Product, we first compare the performance of models using the following different substitutes for the inner product on the Karpathy split of the MS COCO dataset:

• P Product: This is just the standard inner product. In Euclidean geometry it is also called the projection product, so we abbreviate it as the P Product.

• R Product: Contrary to the P Product, the R Product only involves the information of the vector rejection of x from w. To keep the same range and sign as the P Product, we formulate the R Product as follows:

$$R(w, x) = \mathrm{sign}(\cos\theta)\, \|w\| \left( \|x\| - \|R_x\| \right) \qquad (18)$$

• PR Product: This is the proposed PR Product. Evidently, the PR Product is the combination of the P Product and the R Product, with the following relationship (a numeric sanity check follows this list):

$$P_R(w, x) = \overline{|\sin\theta|}\, P(w, x) + \overline{|\cos\theta|}\, R(w, x) \qquad (19)$$
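Under the pr_product sketch given earlier, the two key properties claimed above are easy to check numerically: the forward pass equals the standard inner product, and the direction-gradient part of the gradient w.r.t. w has length ‖x‖ regardless of the angle. This is our own sanity check, not an experiment from the paper:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)
W = torch.randn(8, 3)
print(torch.allclose(pr_product(x, W), x @ W, atol=1e-5))  # True: identical forward pass

w = torch.randn(8, 1, requires_grad=True)
xv = torch.randn(1, 8)
pr_product(xv, w).sum().backward()
g = w.grad.squeeze()
w_hat = w.detach().squeeze() / w.detach().norm()
rejection = g - (g @ w_hat) * w_hat   # direction-gradient part of Equation (11)
print(torch.allclose(rejection.norm(), xv.norm(), atol=1e-4))  # True: length ||x||
```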
To furtherverify the effectiveness of our proposed method, we alsocompare the PR Product version of our captioning modelwith some state-of-the-art methods on Karpathy’s split ofMS COCO dataset. Results are reported in Table 3, ofwhich the top part is for cross-entropy loss and the bottompart is for CIDEr optimization.Among those methods, SCN-LSTM [8] andSCST:Att2all [31] use the ensemble strategy. GCN-LSTM [41], CAVP [25] and SGAE [39] exploit informationof visual scene graphs. Even though we do not use anyof the above means of improving performance, our PRProduct version of captioning model achieves the best performance in most of the metrics, regardless of cross-entropy training or CIDEr optimization. In addition, wealso report our results on the official MS COCO evaluationserver in Table 4. As the scene graph models can greatlyimprove the performance, for fair comparison, we onlyreport the results of methods without scene graph models.It is noteworthy that we just use the same model as reportedin Table 3, without retraining on the whole training andvalidation images of MS COCO dataset. We can seethat our single model achieves competitive performancecompared with the state-of-the-art models, even thoughsome models exploit ensemble strategy.
6. Conclusion
In this paper, we propose the PR Product, a reliable substitute for the inner product of weight vector w and data vector x, which involves the information of both the vector projection $P_x$ and the vector rejection $R_x$. The length of the direction gradient of the PR Product w.r.t. w is consistently larger than the one in the standard inner product. In particular, we present the PR Product versions of the fully connected layer, convolutional layer and LSTM layer. Applying these PR Product modules to image classification and image captioning, the results demonstrate the robust effectiveness of the proposed PR Product. As it is a basic operation in neural networks, we will apply the PR Product to other tasks such as object detection. Acknowledgement.
This work was supported in part by the NSFC Project under Grants 61771321 and 61872429, in part by the Guangdong Key Research Platform of Universities under Grant 2018WCXTD015, in part by the Science and Technology Program of Shenzhen under Grants KQJSCX20170327151357330, JCYJ20170818091621856, and JSGG20170822153717702, and in part by the Interdisciplinary Innovation Team of Shenzhen University.

References

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[2] Pierre Baldi and Peter Sadowski. A theory of local learning, the learning channel, and the optimality of backpropagation. Neural Networks, 83:51–74, 2016.
[3] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
[4] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[5] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6298–6306, 2017.
[6] Xinpeng Chen, Lin Ma, Wenhao Jiang, Jian Yao, and Wei Liu. Regularizing RNNs for caption generation by reconstructing the past with the present. arXiv preprint arXiv:1803.11439, 2018.
[7] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15–29. Springer, 2010.
[8] Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[9] Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. Stack-captioning: Coarse-to-fine learning for image captioning. arXiv preprint arXiv:1709.03376, 2017.
[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[13] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[14] Wenhao Jiang, Lin Ma, Yu-Gang Jiang, Wei Liu, and Tong Zhang. Recurrent fusion network for image captioning. arXiv preprint arXiv:1807.09986, 2018.
[15] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1219–1228, 2018.
[16] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[17] Ryan Kiros, Ruslan Salakhutdinov, and Rich Zemel. Multimodal neural language models. In International Conference on Machine Learning, pages 595–603, 2014.
[18] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
[19] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[20] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.
[21] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
[22] Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276, 2016.
[23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[24] Chang Liu, Fuchun Sun, Changhu Wang, Feng Wang, and Alan Yuille. MAT: A multimodal attentive translator for image captioning. arXiv preprint arXiv:1702.05658, 2017.
[25] Daqing Liu, Zheng-Jun Zha, Hanwang Zhang, Yongdong Zhang, and Feng Wu. Context-aware visual policy network for sequence-level image captioning. In ACM Multimedia, pages 1416–1424. ACM, 2018.
[26] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of SPIDEr. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[27] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[28] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7219–7228, 2018.
[29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[30] Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 290–298, 2017.
[31] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
[34] Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2377–2385, 2015.
[35] Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton van den Hengel. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 203–212, 2016.
[36] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[37] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[38] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[39] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning, 2018.
[40] Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, and Ruslan R. Salakhutdinov. Review networks for caption generation. In Advances in Neural Information Processing Systems, pages 2361–2369, 2016.
[41] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684–699, 2018.
[42] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. Boosting image captioning with attributes. In IEEE International Conference on Computer Vision, pages 22–29, 2017.
[43] Mehrdad Yazdani. Linear backprop in non-linear networks. In Advances in Neural Information Processing Systems Workshop, 2018.
[44] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4651–4659, 2016.
[45] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

A. Training Curves on CIFAR10
Figures 4-7 show the training curves of some of the classification models on CIFAR10 used in the paper, from which we can see that the PR Product versions achieve consistently lower error rates than the standard inner product versions.
Figure 4. Training curves of ResNet on CIFAR10 (test error and training loss vs. epoch, for ResNet and PR-ResNet).

Figure 5. Training curves of PreResNet on CIFAR10 (test error and training loss vs. epoch, for PreResNet and PR-PreResNet).

Figure 6. Training curves of WRN on CIFAR10 (test error and training loss vs. epoch, for WRN and PR-WRN).

Figure 7. Training curves of DenseNet on CIFAR10 (test error and training loss vs. epoch, for DenseNet and PR-DenseNet).
B. The Minimum of |sin θ|

We plot the minimum of |sin θ| in some layers of our captioning model, as shown in Figures 8-16. From these plots, we can observe that the minimum of |sin θ| in the PR Product version is larger than the one in the P Product version for most of the layers, which means the weight vectors and data vectors in the PR Product are more orthogonal. We argue this is the reason the PR Product takes effect.
Figure 8. The minimum of |sin θ| of the $a_i$-to-$v_i$ transfer part in the Encoder.

Figure 9. The minimum of |sin θ| of the $a_g$-to-$v_g$ transfer part in the Encoder.

Figure 10. The minimum of |sin θ| of the $W_e \Pi_t$-to-hidden transfer part in the Attention LSTM.

Figure 11. The minimum of |sin θ| of the $v_g$-to-hidden transfer part in the Attention LSTM.

Figure 12. The minimum of |sin θ| of the hidden-to-hidden transfer part in the Attention LSTM.

Figure 13. The minimum of |sin θ| of the $\hat{v}_t$-to-hidden transfer part in the Language LSTM.

Figure 14. The minimum of |sin θ| of the $h^1_t$-to-hidden transfer part in the Language LSTM.

Figure 15. The minimum of |sin θ| of the hidden-to-hidden transfer part in the Language LSTM.

Figure 16. The minimum of |sin θ| of the output layer (the softmax layer in the Decoder of our captioning model).

C. Examples of Image Captioning
To intuitively illustrate the advantage of the PR Product, we show some examples of image captioning in Figure 17. The images are sampled from the Karpathy test split of the MS COCO dataset. All three models (P version, R version, and PR version) are trained with the cross-entropy loss and then fine-tuned with CIDEr optimization. The results show that the PR Product contributes to the descriptiveness of the sentences and demonstrate that the PR Product is effective.

Figure 17. Examples of image captioning (GT: ground truth; P, R, PR: captions from the P, R and PR Product models):

GT: two teddy bears lie propped up against a wall
P: a teddy bear with a ball in front of it
R: a teddy bear holding a group of balloons
PR: two teddy bears sitting next to each other

GT: a group of people gathered at the bottom of a snow mountain
P: a group of people on skis on a ski lift
R: a group of people riding skis on a ski lift
PR: a group of people on skis on a snow covered mountain

GT: several people are flying kites in an open field
P: a man standing in a field flying a kite
R: a man flying a kite in a field
PR: a group of people flying kites in a field

GT: a man riding skis across a snow covered slope
P: a person standing on skis in the snow
R: a person standing on skis in the snow
PR: a person riding skis on a snow covered slope

GT: a group of people in a pool with floating plates of food
P: a group of people sitting at a table
R: a group of people sitting around a table
PR: a group of people in a pool

GT: a large passenger jet flying through a cloudy sky
P: a large airplane flying in the sky
R: a large airplane flying in the sky
PR: a large airplane flying in a cloudy sky

GT: a bowl of broccoli sits beside a lemon wedge
P: a white plate of broccoli and orange slices on a table
R: a white plate of broccoli and a table
PR: a white plate of broccoli and a lemon wedge

GT: a pile of broccoli sitting next to other vegetables
P: a bunch of vegetables in a market
R: a bunch of vegetables on display in a market
PR: a pile of broccoli and vegetables in a market

GT: two very worn suitcases stacked on top of each other resting on a table
P: a old suitcase with stickers on top of it
R: a suitcase with stickers on it on a wall
PR: two suitcases stacked on top of each other

GT: a shelf filled with lots of different pairs of shoes
P: a bunch of cats sitting in a shelf
R: a group of cats sitting on top of a shelf
PR: a group of shoes sitting on top of a shelf

GT: a 'No Parking' sign is attached to a traffic cone on a sidewalk
P: a sign on the side of a street
R: a street sign on the side of a road
PR: a no parking sign on the side of a street

GT: a table with a plate of cut pizza, two plates of salad, and a can of soda
P: a plate of food on a table with a salad
R: a table with plates of food on it
PR: a table with plates of food and a can of soda