Natural Language Person Search Using Deep Reinforcement Learning
Ankit Shah, Language Technologies Institute, Carnegie Mellon University, [email protected]
Tyler Vuong, Electrical and Computer Engineering, Carnegie Mellon University, [email protected]
Abstract
A recent success in deep reinforcement learning is an agent learning to play Go and beating the world champion without any prior knowledge of the game. In that task, the agent must decide which action to take based on the positions of the pieces. Person search using natural language descriptions of images has recently been explored for video surveillance applications [2]. We note that [3] provides an end-to-end approach for object retrieval using deep reinforcement learning without constraints on which objects are detected. However, we believe that for real-world applications such as person search, defining constraints that specifically identify a person, as opposed to starting from general object detection, will bring benefits in both performance and computational cost. In our task, deep reinforcement learning localizes the person in an image by reshaping a bounding box. Deep reinforcement learning with appropriate constraints looks only for the relevant person in the image, as opposed to an unconstrained approach in which every individual object in the image is ranked. For person search, the agent tries to form a tight bounding box around the person in the image who matches the description. The bounding box is initialized to the full image, and at each time step the agent decides how to change the current bounding box so that it bounds the person more tightly, based on the description of the person and the pixel values inside the current bounding box. After the agent takes an action, it receives a reward based on the Intersection over Union (IoU) of the current bounding box and the ground-truth box. Once the agent believes that the bounding box covers the person, it indicates that the person has been found.
1. Introduction
Current state-of-the-art techniques for person search [2] assume the availability of a dataset in which an image or a set of images is cropped to the same size as the candidate person being searched for. Further, state-of-the-art techniques for object detection analyze large numbers of region proposals. Other relevant methods, such as end-to-end object detection with deep reinforcement learning [3], are able to detect persons; however, the authors note that their work can be improved by adding constraints for specific tasks, such as person identification in our case. [5] discusses the benefits of large-scale person re-identification and explores a solution based on convolutional neural networks that generates a compact descriptor for a coarse search, achieving high performance on two existing datasets. We observe that [4] explores a deep learning framework that jointly performs pedestrian detection and person re-identification, with a new online instance matching loss function used to validate the results. Such papers are published regularly, yet little work uses deep reinforcement learning, which we explore here. Our task performs person search using natural language: we evaluate our results on each image on a case-by-case basis, given a description of the person we are trying to find. We find a policy for changing the size of the bounding box so as to quickly identify the person in an image. Further, we analyze the system based on how quickly it recognizes the person of interest in the provided image. Mean average precision over all the samples in our dataset is used to measure the performance of our approach. We additionally measure the average number of actions needed to successfully find the person, along with the average Intersection over Union (the percentage overlap between boxes) [4].
2. Task Description
One application of deep reinforcement learning is having an agent learn how to play Atari games without any prior knowledge of the game. For games, the state is the current frame, and based on the raw or preprocessed pixel values the agent must decide which action to take in order to maximize some score. In our task, we are given an image and a natural language description query of the person the agent is trying to find in the image. The agent is initialized with a bounding box around the whole image, and at each step the agent can either shrink the box, expand the box, move it in any of the four directions, or terminate. After the agent takes an action, we generate a reward based on how the new bounding box and the old bounding box compare with the ground-truth bounding box. After each action, the agent updates the network. We use deep reinforcement learning to help the agent find the bounding box around the person of interest. In our case, we use the natural language description of the person rather than a photo of the person. The problems we must address are how to treat the description query and map it into a form we can work with, how to evaluate the candidate bounding box, which deep reinforcement learning methods to use, what inputs to give the agent, and which reward mechanism to use. To handle the natural language description, we use a pre-existing sentence embedding model, Paragraph2Vec, and to obtain image features of the bounded region we use AlexNet, a model trained on ImageNet. After we have the sentence features and the image features, we learn a new mapping that projects the features into the same dimensional space. Reward mechanisms for Pong consist of giving a good score to certain actions if the agent wins and a bad score to certain actions if the agent loses; some assign scores to actions based on a weighted average. For our reward mechanism, if the action takes the agent to a new bounding box with a higher IoU than the previous bounding box, we give a positive reward. If the IoU decreases, we give a negative reward. We describe the procedure in detail in the experiments and subsequent sections; a sketch of the feature extraction is given below.
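As a concrete illustration, the following minimal sketch shows one plausible way to obtain the two feature vectors described above, assuming PyTorch/torchvision for the AlexNet fc7 features and gensim's Doc2Vec as a stand-in for Paragraph2Vec; the model file name, preprocessing, and tokenization are assumptions, not details reported in the paper.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from gensim.models.doc2vec import Doc2Vec

# AlexNet pretrained on ImageNet; keep the classifier up to fc7 (4096-D output).
alexnet = models.alexnet(pretrained=True).eval()
fc7 = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_features(crop):
    """Return a 4096-D feature vector for a PIL image crop of the current box."""
    x = preprocess(crop).unsqueeze(0)                # 1 x 3 x 224 x 224
    with torch.no_grad():
        conv = alexnet.features(x)
        conv = alexnet.avgpool(conv).flatten(1)      # 1 x 9216
        return fc7(conv).squeeze(0)                  # 4096-D

# Hypothetical pre-trained sentence model producing 100-D description vectors.
sent_model = Doc2Vec.load("paragraph2vec_descriptions.model")

def sentence_features(description):
    """Return a 100-D embedding of the natural language description."""
    return torch.tensor(sent_model.infer_vector(description.lower().split()))
```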
3. Our Framework
We formulate the task of person search as a Markov Decision Process (MDP). An MDP describes how an agent can be in a set of states, take certain actions, and receive rewards based on the action taken at a given state, with a discount factor used in calculating the cumulative discounted reward; it is denoted (S, A, R, γ), where S is the state space, A is the action space, and R is the reward function. The agent interacts with the environment, and at each time step t it takes action a_t based on the current state s_t, transitions to the new state s_{t+1}, and receives reward r_t from the environment. The agent needs a policy function π(s) that specifies which action to take at a given state. One way to obtain it is to find the optimal action-value function Q*(s, a), so that the agent can follow a greedy policy: at each state, simply choose the action with the maximum state-action value. Applying methods such as value iteration to converge to the optimal action-value function is impractical due to the large state space, so we use a function approximator for the state-action values. Neural networks are universal function approximators, so the idea is to have a neural network learn the optimal state-action values, which then define the policy at each state. The goal of our agent is to land a tight bounding box around the target person who matches the description. In our case, the state s consists of the description of the person the agent is trying to find and the region inside the bounding box at time t, which the agent can transform by taking an action from A. We define the following 9 possible actions: shrink width, shrink height, expand width, expand height, move up, move down, move left, move right, and terminate, as illustrated in Figure 1. These actions are all permissible for our agent and are sufficient to explore the complete environment; a sketch of the corresponding box transformations is given below.

Figure 1. Set of all possible actions.
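To make the action space concrete, the following sketch shows one plausible implementation of the nine actions as transformations of a box (x1, y1, x2, y2); the step fraction alpha is an assumed value, not one reported in the paper.

```python
from enum import IntEnum

class Action(IntEnum):
    SHRINK_WIDTH = 0
    SHRINK_HEIGHT = 1
    EXPAND_WIDTH = 2
    EXPAND_HEIGHT = 3
    MOVE_UP = 4
    MOVE_DOWN = 5
    MOVE_LEFT = 6
    MOVE_RIGHT = 7
    TERMINATE = 8

def transform_box(box, action, img_w, img_h, alpha=0.1):
    """Apply one action to a box (x1, y1, x2, y2); alpha is the step size
    as a fraction of the current box dimensions (an assumed value)."""
    x1, y1, x2, y2 = box
    dw, dh = alpha * (x2 - x1), alpha * (y2 - y1)
    if action == Action.SHRINK_WIDTH:
        x1, x2 = x1 + dw, x2 - dw
    elif action == Action.SHRINK_HEIGHT:
        y1, y2 = y1 + dh, y2 - dh
    elif action == Action.EXPAND_WIDTH:
        x1, x2 = x1 - dw, x2 + dw
    elif action == Action.EXPAND_HEIGHT:
        y1, y2 = y1 - dh, y2 + dh
    elif action == Action.MOVE_UP:
        y1, y2 = y1 - dh, y2 - dh
    elif action == Action.MOVE_DOWN:
        y1, y2 = y1 + dh, y2 + dh
    elif action == Action.MOVE_LEFT:
        x1, x2 = x1 - dw, x2 - dw
    elif action == Action.MOVE_RIGHT:
        x1, x2 = x1 + dw, x2 + dw
    # TERMINATE leaves the box unchanged.
    # Clip to the image and keep the box non-degenerate.
    x1, y1 = max(0.0, x1), max(0.0, y1)
    x2, y2 = min(float(img_w), x2), min(float(img_h), y2)
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
```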
The action terminate indicates that the agent believes it has found the person who matches the description. The reward R depends on the IoU at state s_{t+1} compared with that at s_t, where s denotes the state. IoU (Intersection over Union) is the area of the intersection of the two bounding boxes (the predicted box and the ground-truth box) divided by the area of their union. We denote by b the bounding box at state s, by b' the bounding box at the new state s', and by g the ground-truth box:

$$R(s, s') = \operatorname{sign}\big(\operatorname{IoU}(b', g) - \operatorname{IoU}(b, g)\big) \tag{1}$$

$$R(\text{terminate}) = \begin{cases} +4 & \text{if } \operatorname{IoU}(b, g) \geq 0.5 \\ \text{negative reward} & \text{otherwise} \end{cases} \tag{2}$$

We formulate the above equations based on the following rationale. The agent receives a positive reward if it moves towards the target, which in this case means a higher IoU than at the previous state. [1] follows a similar approach, awarding a reward of +1 when moving towards the target and -1 when moving away from the target. We assign a higher reward for the termination case in which the agent correctly believes it has found the person, and a lower, negative reward if the agent terminates incorrectly, i.e., without finding the person or with an IoU value below 0.5.
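A minimal sketch of the IoU computation and the reward of Equations (1) and (2) follows; the magnitude of the negative termination reward is not specified in the text, so the value used here is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def reward(prev_box, new_box, gt_box, terminated):
    """Reward of Eq. (1) for movement actions and Eq. (2) for terminate."""
    if terminated:
        # +4 for a correct termination; the paper only states that the incorrect
        # case receives a negative reward, so -4 here is an assumed value.
        return 4.0 if iou(new_box, gt_box) >= 0.5 else -4.0
    delta = iou(new_box, gt_box) - iou(prev_box, gt_box)
    return float((delta > 0) - (delta < 0))  # sign(delta): +1, 0, or -1
```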
4. Deep Q Learning
A common problem in reinforcement learning is that learning becomes unstable or divergent when the action-value function is approximated with a non-linear function such as a neural network. Deep Q-learning uses a neural network to approximate the state-action values Q(s, a). Based on the Bellman equation, the current state-action value should equal the immediate reward plus the discounted maximum state-action value at the next state:

$$Q(s, a) = R + \gamma \max_{a'} Q(s', a')$$

At each time step, the agent looks at the current state S, uses a neural network to estimate the state-action values, takes an action A, receives a reward R, and transitions to a new state S'. The network weights are learned by minimizing the mean squared error between the predicted state-action value and the immediate reward plus the discounted maximum state-action value at the next state. All the information needed to perform this update is contained in the tuple (S, A, R, S'); a sketch of the update follows.
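The sketch below shows this update on a sampled batch of transitions, assuming PyTorch and a q_net module like the one sketched in the next section; the discount factor value and the convention of bootstrapping terminal transitions with zero are assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma=0.9):
    """One Q-learning step: fit Q(s, a) to r + gamma * max_a' Q(s', a')."""
    states, actions, rewards, next_states, done = batch
    # Predicted Q-values for the actions actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bellman target; terminal transitions (done = 1) bootstrap with zero.
        q_next = q_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - done)
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```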
Figure 2. Neural Network architecture to predict the next actions
Rather than optimizing the model based only on the current (S, A, R, S'), the idea of experience replay was introduced in order to remove correlated updates. Each time the agent chooses an action to transform the box, the current bounding box, action, immediate reward, and new bounding box are stored in the experience replay. The network then samples a batch from the experience replay and the model is optimized based on the Bellman equation. The agent follows an ε-greedy policy, which means that with probability ε the agent takes a random action, and with probability 1 - ε it takes the action with the highest state-action value. The value of epsilon decreases over time. Epsilon-greedy is used because of the exploration vs. exploitation trade-off: in the beginning, rather than letting the agent follow its own policy of taking the action with the highest state-action value, we make the agent take random actions, because the network's initial estimates are far from optimal. Once the network has trained enough, the agent starts to follow its own policy rather than taking random actions, hence the decreasing value of epsilon over time. A pre-trained image model, AlexNet, is used to transform the current bounding box into a 4096-D feature space, and a pre-trained sentence model, Paragraph2Vec, is used to transform the sentence description into a 100-D feature space. We concatenate the image representation and the sentence representation along with a 90-D vector that is simply the concatenation of one-hot encodings of the previous 10 actions. The output of the model is a 9-dimensional vector representing the state-action values for all 9 actions, as shown in Figure 2; a sketch of such a network follows.
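The following sketch is one network consistent with this description (4096-D image features + 100-D sentence features + 90-D action history in, 9 Q-values out); the hidden layer sizes are assumptions, since the paper does not report them.

```python
import torch
import torch.nn as nn

class PersonSearchQNet(nn.Module):
    """Q-network over the concatenated state vector
    [4096-D image features | 100-D sentence features | 90-D action history]."""

    def __init__(self, img_dim=4096, sent_dim=100, hist_dim=90, n_actions=9):
        super().__init__()
        # Hidden layer sizes are assumed; the paper fixes only the
        # input (4186-D) and output (9 Q-values) dimensions.
        self.net = nn.Sequential(
            nn.Linear(img_dim + sent_dim + hist_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, state):
        # state: batch x 4186 tensor; returns batch x 9 state-action values.
        return self.net(state)
```

A single state vector would be built as, for example, torch.cat([image_features(crop), sentence_features(description), action_history]), reusing the feature extractors sketched in Section 2, where action_history is the 90-D concatenation of one-hot encodings of the previous 10 actions.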
5. Training
To train our model, we take in one image at a time and use the same image multiple times before obtaining a new image. Each time we let the agent "play" an image, we consider that an episode. We initially let the agent play the same image for multiple episodes before moving to a new image. For each episode, the bounding box is initialized to the full image and the agent follows the epsilon-greedy policy described above. As the agent continues to play the same image, the value of epsilon decreases, which lets the agent start following its own policy rather than taking random actions. Each time the agent is at state s, takes action a, moves to state s', and receives reward r, the tuple (s, a, r, s') is pushed into the replay memory. After the agent takes an action and transitions into a new state, we randomly sample a batch from the replay memory and optimize our model on the sampled batch rather than the immediate transition. This allows for uncorrelated, random updates that prevent the network from becoming unstable. After the image has gone through a certain number of episodes, we obtain a new image. We repeat the same procedure for all images and continue until training is done. After a certain number of epochs, we start to decrease the initial value of epsilon and the number of episodes per image. A sketch of this training loop follows.
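Putting the pieces together, the sketch below outlines the per-image training loop described above, reusing the dqn_update sketch from Section 4. The replay capacity, batch size, episode count, maximum steps per episode, and epsilon schedule are all assumed values, and env is a hypothetical wrapper that applies the box transformations, crops the image, builds the state vector, and computes the reward.

```python
import random
from collections import deque
import torch

replay = deque(maxlen=10000)   # replay capacity is an assumed value

def run_episode(env, q_net, optimizer, epsilon, batch_size=64, max_steps=30):
    """One episode on one (image, description) pair with epsilon-greedy control."""
    state = env.reset()                       # box initialized to the full image
    for _ in range(max_steps):
        if random.random() < epsilon:
            action = random.randrange(9)      # explore
        else:
            with torch.no_grad():             # exploit the current policy
                action = q_net(state.unsqueeze(0)).argmax(dim=1).item()
        next_state, reward, done = env.step(action)
        replay.append((state, action, reward, next_state, float(done)))
        if len(replay) >= batch_size:
            # Optimize on a random batch from replay, not the immediate transition.
            s, a, r, s2, d = zip(*random.sample(replay, batch_size))
            batch = (torch.stack(s), torch.tensor(a),
                     torch.tensor(r, dtype=torch.float32),
                     torch.stack(s2), torch.tensor(d))
            dqn_update(q_net, optimizer, batch)
        state = next_state
        if done:
            break

def train_on_image(env, q_net, optimizer, n_episodes=15, eps_start=1.0, eps_end=0.1):
    """Play the same image for several episodes, decaying epsilon linearly."""
    for ep in range(n_episodes):
        epsilon = eps_start + (eps_end - eps_start) * ep / max(1, n_episodes - 1)
        run_episode(env, q_net, optimizer, epsilon)
```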
6. Experiments and Results
We perform various experiments to validate our approach. Given the amount of compute power at our disposal, we train on a smaller set of images and perform testing on 100 images to report the results. Training took about 12.5 hours on average for 25 epochs using an NVIDIA TITAN X GPU. Figures 3, 4, and 5 show the bounding box location at different steps of an episode.
Figure 3. Bounding box location at t (action) = 0
Figure 4. Bounding box location at t (action) = 7
Table 1. Person Search Results
We observe from Table 1 that the proportion of correct terminations increases as the number of training epochs increases. This suggests that we could obtain even better performance with a larger number of training epochs. We obtain an average IoU of 0.591 with 25 epochs of training, which is reasonable for this task. "Correctly Terminated" is the proportion of terminated bounding boxes that have an IoU greater than or equal to 0.5. "Avg IoU No Terminate" is the IoU of the bounding box at the very last time step when the agent does not use the terminate action. To further validate our approach, we perform experiments with regular descriptions, random descriptions, and no description as the sentence vector representation.
Figure 5. Bounding box location at t (action) = 15
Figure 6. Q(s, 9)
Table 2. Performance with Different Descriptions
Metric                  Regular Descriptions   Random Descriptions   No Descriptions
Total Terminated        .61                    .24                   .64
Correctly Terminated    .95                    .88                   .72
Avg IoU                 .485                   .358                  .438
Avg IoU Terminate       .591                   .573                  .527
Avg IoU No Terminate    .318                   .290                  .278
Avg Number of Actions   16.5                   20                    16

We observe from Table 2 that our model performs well, with regular descriptions achieving 95 percent correct terminations among the cases in which the model terminated. The difference in the proportion of terminations is large between providing the regular description and the random description as the input sentence feature vector.
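For concreteness, the following small sketch shows how the metrics in Tables 1 and 2 could be computed from per-image episode results; the record format (final IoU, termination flag, and action count per test image) is an assumption.

```python
def summarize(results, iou_threshold=0.5):
    """results: list of dicts with keys 'terminated' (bool), 'iou' (final IoU),
    and 'n_actions' (int) for each test image; the format is assumed."""
    terminated = [r for r in results if r["terminated"]]
    not_terminated = [r for r in results if not r["terminated"]]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {
        "total_terminated": len(terminated) / len(results),
        "correctly_terminated": mean([r["iou"] >= iou_threshold for r in terminated]),
        "avg_iou": mean([r["iou"] for r in results]),
        "avg_iou_terminate": mean([r["iou"] for r in terminated]),
        "avg_iou_no_terminate": mean([r["iou"] for r in not_terminated]),
        "avg_number_actions": mean([r["n_actions"] for r in results]),
    }
```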
7. Conclusion
We demonstrate that a deep reinforcement learning approach to natural language person search is possible and yields practical results. We observe that having the description helps the agent become more confident in finding the person over time. We obtain approximately 60 percent accuracy for finding the correct person with a reasonably small number of actions (16) on average. Our approach is not perfect, and even though the agent does not always terminate, it learns to crop out the background and focus only on people, which is a notable result in itself. We plan to extend our work by running more training epochs and by exploring methods such as Double DQN and related techniques to build a more robust person search system.

References

[1] M. Bellver, X. Giro-i-Nieto, F. Marques, and J. Torres. Hierarchical Object Detection with Deep Reinforcement Learning. ArXiv e-prints, Nov. 2016.
[2] S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang. Person Search with Natural Language Description. ArXiv e-prints, Feb. 2017.
[3] F. Wu, Z. Xu, and Y. Yang. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. ArXiv e-prints, Mar. 2017.
[4] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Joint Detection and Identification Feature Learning for Person Search. ArXiv e-prints, Apr. 2016.
[5] H. Yao, S. Zhang, D. Zhang, Y. Zhang, J. Li, Y. Wang, and Q. Tian. Large-scale Person Re-identification as Retrieval. Pages 1440–1445, July 2017.
8. Appendix