Object recognition and tracking using Haar-like Features Cascade Classifiers: Application to a quad-rotor UAV
Luis Arreola, Gesem Gudiño, and Gerardo Flores

Abstract — In this paper we develop a functional Unmanned Aerial Vehicle (UAV) capable of tracking an object using a machine learning vision system called the Haar feature-based cascade classifier. The image processing is performed on-board with a high-performance single-board computer. Based on the detected object and its position, the quadrotor must track it so as to remain centered on it and at a safe distance from it. The object in question is a human face; the experiments were conducted with a two-step detection, searching first for the upper body and then searching for the face inside the detected upper-body area. Once the human face is detected, the quadrotor must follow it automatically. Experiments were conducted that show the effectiveness of our methodology; these results are shown in a video.
I. INTRODUCTION
The use of quadrotors in real applications such as aerial photography, environment monitoring, farming, and structure inspection, among others, demands vision techniques that provide perception capabilities to the aerial drone [1], [2], [3], [4]. A camera-based system is one of the favorite solutions to these demands due to its passive and low-cost characteristics. This work presents an easy-to-implement application to this topic, programming object detection algorithms completely embedded in a quadrotor UAV that includes an on-board computer and a monocular camera. The entire system consists of detecting an object by taking a frame from the camera; the on-board computer then processes the image to detect the object using a Haar-like feature-based classifier. Since Haar classifiers are considered weak classifiers [5], a cascade training is implemented to obtain a robust detection. Once the object is detected, the on-board computer determines the position of the object with respect to (w.r.t.) the quadrotor, and hence it sends the corresponding information to the flight controller to ensure the correct tracking. For this problem, tracking refers to the capability of the quadrotor to detect a desired object, in this case a person, and follow it using the information obtained by the vision algorithms. The UAV used for this approach can be seen in Fig. 1.

There are several works that have explored object detection and tracking using UAVs [6], [7]. Next we cite several works related to our research.

Perception and Robotics Laboratory, Center for Research in Optics, León, Guanajuato, Mexico, 37150. Department of Industrial Electro-Mechanics, Universidad Tecnológica de León, León, Guanajuato, Mexico, 37670. (E-mail: [email protected], [email protected], gfl[email protected]). Corresponding author: Gerardo Flores. This work was supported in part by FORDECYT-CONACYT under grant 000000000292399 and by the Laboratorio Nacional de Óptica de la Visión of the National Council of Science and Technology in Mexico (CONACYT) under agreement 293411.
Fig. 1: The quad-rotor unmanned aerial vehicle used in this study. It has a Jetson TX2 computer and a monocular camera on-board to perform the person detection.

In [8], the object detection and tracking performed by a UAV is conducted on a previously recorded video. A face detection algorithm is presented in [9], where the image processing is embedded in an Odroid micro computer; there, the number of steps needed to determine the position of the object with respect to the UAV is reduced thanks to the simplicity of the strategies performed, which may result in a faster response of the UAV to the person's movement. In [10] a Jetson TX2 is also used together with object detection techniques in order to obtain better performance results.

Vision-based navigation has produced multiple strategies depending on the application, such as road following, power line inspection, and navigation in orchards [11], [12], [13]. In this work the general strategy consists of tracking a moving object, in particular a person's face, using the information of the continuous frames acquired by the camera. Each frame is compared against the cascade-trained feature classifiers to detect the person's face. The tracking consists of navigating the UAV so that the center of the image coincides with the centroid of the object, at a certain predefined distance. This operating principle is the same one used for vision-based autonomous landing, in which the camera must detect a preset landmark indicating where the quadrotor should land [14], [15]. Several experiments were conducted in this work. As briefly mentioned above, such experiments consist of the detection and tracking of a human face. The detection process consists of searching for an upper body (from the shoulders to the head), and then searching for the human face within the sub-image of the detected upper-body area. It is demonstrated through experiments that this approach makes the system more robust than without the upper-body detection step.

The remainder of this paper is organized as follows. In Section II the general approach and the architecture of the system are described. Section III presents the results of the combined upper-body and face detection, and the subsequent tracking by the UAV using the proposed approach. Finally, Section IV presents some concluding remarks and an outline of future directions of the presented research.

II. SYSTEM DESCRIPTION
In this section, the main hardware and software elements of the quadrotor tracking system are explained.
A. Hardware

1) UAV:
The main elements of our quadrotor experimental platform include: electronic control systems, a frame, Electronic Speed Controllers (ESCs), motors, propellers, a battery, a control board, and an Inertial Measurement Unit (IMU) embedded in an autopilot. The IMU is composed of a 3-axis accelerometer and a gyroscope; its function is to obtain the current information of the quadrotor's attitude. The quadrotor is also endowed with a GPS module that provides the quadrotor's absolute position. The quad-rotor used in this paper is shown in Fig. 1. This drone is equipped with all the primary elements listed above, including a Jetson TX2 single-board computer and a voltage regulator. The quad-rotor specifications are listed in Table I.
Quadrotor characteristics
Parameter    Value
Span         70 [cm]
Height       26 [cm]
Weight       ≈

TABLE I: Quadrotor UAV parameters.
2) NVIDIA Jetson TX2:
A Jetson TX2 developer kit is used to embed the vision and estimation algorithms. The training for the object detection is stored on the Jetson TX2. Depending on the object's position, the corresponding commands are sent to the autopilot so that the UAV tracks the person's face. The specifications of this single-board computer can be seen in [16].
B. Software
The general code is based on two main parts. The first describes all the training steps for the object detection task. The second part refers to the communication between the on-board computer and the autopilot in order to execute control commands on the quadrotor. When an object is detected, a bounding box surrounds it and its centroid is computed; the navigation commands are then calculated as a function of the centroid position so that the center of the image coincides with the centroid of the object. Once this step is done, the bounding box width determines whether the UAV needs to pitch forward to get closer to the object or not. If no object is detected, the UAV holds its position in hover. When an action is performed, the program waits for the next frame and repeats the loop.

Object tracking stops in two cases: a) when the UAV's battery reaches an established minimum; and b) when the user decides to finish the task, in which case the user can turn off the object tracking remotely. Once the program ends, for either of the two reasons mentioned above, the quadrotor enters a failsafe mode, which causes it to position itself five meters high w.r.t. the takeoff position; once there, it returns to the home position for landing. A minimal sketch of this overall loop is given below.
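The following sketch illustrates the control flow just described, assuming hypothetical helper functions (battery_ok, tracking_enabled, detect_object, hover, center_on, approach, enter_failsafe) that wrap the detection and the velocity commands covered in the rest of this section; it is an illustration of the loop structure, not the exact implementation.

    import cv2

    cap = cv2.VideoCapture(0)                    # on-board monocular camera
    while battery_ok() and tracking_enabled():   # the two stop conditions
        ok, frame = cap.read()
        if not ok:
            continue
        detection = detect_object(frame)         # Haar cascade detection (Section II-B)
        if detection is None:
            hover()                              # no object: hold position in hover
            continue
        x, y, w, h = detection                   # bounding box of the detected face
        cx, cy = x + w // 2, y + h // 2          # centroid of the bounding box
        center_on(cx, cy)                        # roll/thrust toward the image center
        approach(w)                              # pitch forward if the box is too small
    enter_failsafe()                             # climb to 5 m, return home, and land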
1) Haar Cascades Training:
The Haar feature-based cascade classifier is an effective object detection method based on image information. It is a machine learning approach in which a cascade function is trained from positive and negative images. Positive images are those that contain the object of interest; negative images are the opposite, i.e., images in which the object of interest does not appear. The object detection procedure classifies images based on the value of features, instead of working directly with pixels. The system uses the three kinds of features shown in Fig. 2.

Fig. 2: Different patterns considered as Haar features.

The value of a two-rectangle feature is the difference between the sums of the pixels within two rectangular regions (edge features); the regions must have the same size and shape, and are adjacent either horizontally or vertically. A three-rectangle feature computes the sum within two outer rectangles subtracted from the sum in a rectangle in the center (line feature). Finally, a four-rectangle feature computes the difference between two pairs of diagonally positioned rectangles. Given a base resolution of the detector of 24 x 24 pixels, this results in a large number of rectangle features, over 180,000. In order to compute these features rapidly, there exists a technique called the integral image. The integral image at a given pixel location (x, y) contains the sum of the pixels to the left of and above (x, y), and it is given by the following formula

    ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')                (1)

where ii(x, y) is the integral image at location (x, y) and i(x', y') is the original image at location (x', y') [17]. The integral image can be computed with the following pair of recurrences

    s(x, y) = s(x, y - 1) + i(x, y)
    ii(x, y) = ii(x - 1, y) + s(x, y)                      (2)

where s(x, y) is the cumulative sum of a row, with s(x, -1) = 0 and ii(-1, y) = 0. A direct implementation sketch of these recurrences is given below.
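As an illustration of Eqs. (1) and (2), the integral image and the rectangle sums it enables can be computed directly with NumPy; this is a sketch with our own variable names, not code from the paper.

    import numpy as np

    def integral_image(i):
        # Eq. (2): cumulative row sums s(x, y), then cumulative column sums ii(x, y)
        s = np.cumsum(i, axis=1)     # s(x, y) = s(x, y - 1) + i(x, y)
        return np.cumsum(s, axis=0)  # ii(x, y) = ii(x - 1, y) + s(x, y)

    def rect_sum(ii, x0, y0, x1, y1):
        # Sum of the pixels in the rectangle [x0, x1] x [y0, y1] using four lookups
        total = ii[x1, y1]
        if x0 > 0: total -= ii[x0 - 1, y1]
        if y0 > 0: total -= ii[x1, y0 - 1]
        if x0 > 0 and y0 > 0: total += ii[x0 - 1, y0 - 1]
        return total

With these two helpers, the value of a two-rectangle edge feature is simply the difference of two rect_sum calls over the adjacent regions.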
Initially, the algorithm needs a large number of positive images and negative images. The recommended number of images is on the order of a few thousand images for the positive set and five hundred images for the negative set [18], [19]. An example of the positive and negative sets of images is shown in Fig. 3.

Fig. 3: Example of positive and negative sets of images. In the positive set of images, a bounding box surrounds the object of interest. (a) Positive images. (b) Negative images.

1) Negative samples: Negative samples are the set of images where the object of interest, in this case a person's face, cannot be found; in other words, they contain everything the user does not want to detect. The negative images are taken from arbitrary images, must be prepared manually by the user, and are enumerated in a text file that lists each image's file name and directory information. An example of negative samples is shown in Fig. 3b. The images in this set can be of different sizes, but each image should be at least as large as the desired training window size. The set of negative window samples is used to tell the feature classifier what not to look for when trying to find the object of interest.

2) Positive samples:
There are two possible ways to generate the positive samples. One of them is to use the OpenCV utility opencv_createsamples; this method speeds up the process of defining what the model should look like by generating artificial images containing the desired object. The second method is to select the positive images manually. In this project the second method was chosen. As with the negative samples, the user manually prepares the set of images. To ensure a robust model, the samples should cover the wide range of variations that can occur within the object class. In the case of faces, the samples must consider different genders, emotions, races, and even beard styles. An example of positive samples is shown in Fig. 3a. For the upper body, the samples must consider different positions, haircut styles, and sizes. Inside the directory where the images are located, a .dat file needs to be included. Each line of this file corresponds to an image: the first element of the line is the name of the image, followed by the number of objects within the image, followed by numbers indicating the coordinates of the object(s) bounding rectangle(s), i.e., (x, y, width, height) [20].

The cascade classifier is an algorithm that achieves improved detection performance while reducing computation time. The overall form of a cascade classifier is that of a degenerate tree: a positive result from a first classifier triggers the evaluation of a second classifier, which has been adjusted to achieve high detection rates; if the second classifier gives a positive result, it triggers a third classifier, and so on. At any stage, if a negative result is produced by the algorithm, the sub-window is immediately rejected [21]. This is shown graphically in Fig. 4, and a minimal sketch of this evaluation is given below.

Fig. 4: Schematic description of a detection cascade.
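The early-rejection logic of the cascade can be sketched as follows, assuming each trained stage exposes a hypothetical passes(window) predicate; this illustrates the degenerate-tree evaluation rather than OpenCV's internal implementation.

    def cascade_detect(stages, window):
        # Evaluate a sub-window through every stage; reject on the first negative
        for stage in stages:
            if not stage.passes(window):  # negative result at any stage
                return False              # immediate rejection of the sub-window
        return True                       # the sub-window survived all stages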
Training Haar cascades. Once the two sets of images are prepared, the Haar cascade training is ready to be performed. Using OpenCV, the command to run is opencv_traincascade, which looks like:

    $ opencv_traincascade -data data -vec positives.vec -bg bg.txt -numPos 1800 -numNeg 900 -numStages 10 -w 20 -h 20
Here -data refers to the directory where the trained classifier should be stored; -vec refers to the vec-file with the positive samples; -bg refers to the background description file, i.e., the file listing the negative sample images; -numPos refers to the number of positive samples used in training at every stage; -numNeg refers to the number of negative samples used in training at every stage; -numStages refers to the number of cascade stages to be trained (the more stages are selected, the more robust the training will be, although it can take longer to compute depending on the capacity of the training computer); and finally -w and -h refer to the width and height of the training samples in pixels [22]. The output files depend on the number of stages the user selected. In the example, the directory data contains one .xml training file per stage, and also contains the final .xml file that represents the whole cascade training.
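Once training finishes, the resulting cascade can be loaded and run with OpenCV's standard detection API, as in the following sketch; the file names are illustrative, and the detectMultiScale parameters are typical values rather than the ones used in this work.

    import cv2

    cascade = cv2.CascadeClassifier('data/cascade.xml')  # trained cascade file
    img = cv2.imread('test.jpg')
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)         # detection runs on grayscale
    # Returns a list of (x, y, width, height) bounding boxes
    objects = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in objects:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)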
2) Dronekit-Python:
Dronekit-Python is a tool that runs on an on-board computer and allows users to create a communication channel between the UAV's on-board computer and the ArduPilot flight controller over a low-latency link. The on-board processing enhances the autopilot's efficiency significantly, improving the behavior of the vehicle and performing tasks that are computationally intensive or time-sensitive, such as computer vision and estimation algorithms or even path planning [23]. The main use of Dronekit-Python in this paper consists of assembling the data package that contains the information about the velocity and directions that the quadrotor needs to execute, given the information from the object detection. These data packages reach the flight controller using MAVLink, a communication protocol for small UAVs [24], [25]. The Dronekit function
Send_NED_Velocity is used for making a connection between the Jetson TX2 and the PixHack-V3 flight controller, the autopilot used in this work. The function is defined as follows (the body of the MAVLink message, beyond the first argument shown in the original listing, is reconstructed here following the DroneKit documentation [23]):

    import time
    from pymavlink import mavutil

    def Send_NED_Velocity(velocity_x, velocity_y, velocity_z, duration, vehicle):
        msg = vehicle.message_factory.set_position_target_local_ned_encode(
            0, 0, 0,                                    # time_boot_ms, target system/component
            mavutil.mavlink.MAV_FRAME_BODY_OFFSET_NED,  # body-relative NED frame (see below)
            0b0000111111000111,                         # type_mask: enable velocities only
            0, 0, 0,                                    # x, y, z positions (not used)
            velocity_x, velocity_y, velocity_z,         # NED velocities in m/s
            0, 0, 0,                                    # accelerations (not supported)
            0, 0)                                       # yaw, yaw_rate (not used)
        for _ in range(duration):
            vehicle.send_mavlink(msg)                   # resend once per second
            time.sleep(1)
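As a usage illustration (the values are ours, not the paper's), the following call would command the vehicle to pitch forward at 1 m/s for 5 seconds with no lateral or vertical velocity:

    Send_NED_Velocity(1.0, 0.0, 0.0, 5, vehicle)  # forward (body x) at 1 m/s for 5 s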
The above-mentioned function asks for several parameters, with velocity_x, velocity_y, velocity_z, and vehicle being the ones necessary for our application. The first three parameters represent the velocities in the North-East-Down (NED) directions, corresponding in this case to a pitch forward, a roll to the right, and a negative thrust; the reason is that the directions are taken with respect to the body frame. It is then necessary to specify the UAV position w.r.t. the body frame; for that, we use the frame MAV_FRAME_BODY_OFFSET_NED. The body frame is depicted in Fig. 5. Thanks to this parameter, a position can be specified x meters north, y meters east, and z meters down of the current UAV position, and the velocity directions are in the NED frame [26].

Fig. 5: NED axes with respect to the body's current frame.

III. EXPERIMENTS AND RESULTS
In this section the experiments and the results obtained from the methods shown in Section II are described, emphasizing the efficiency of the object detection and the object tracking performed autonomously by the UAV.
A. Object Tracking
The flowchart that describes the steps of the experiments in the rest of this section is shown in Fig. 6.

Fig. 6: Flowchart describing the experiment execution steps.
1) Object detection:
The object to be detected in this experiment is a human face. We chose a human face due to the number of specific parts of a face that can be used in the training stage; these parts (eyes, mouth, nose, etc.) make the recognition more robust. In this case, the process to detect a face consists of two separate training processes:
1) Upper body detection:
The Haar feature classifier algorithm is trained to detect the upper body of a human being. A rectangle is drawn covering the full area of the image where the upper body is detected, as shown in Fig. 7. This training by itself is not enough: no matter how well the training process is performed, the object detection may sometimes detect false objects.

Fig. 7: Upper body detection using Haar-cascade training.
2) Complete face detection:
After the upper body detection, the Haar feature classifier algorithm is trained to detect a complete face. In this way, the detection goes from something general to a more specific object, in this case the face. As in the previous case, rectangles are drawn around the detected object, as shown in Fig. 8.

Fig. 8: Face detection using Haar-cascade training.
3) Combining both training stages:
As mentioned before, no matter how good the training is, there will always be a probability that a false object is detected. To minimize this probability, the face detection is performed only inside the region where the upper body detection has succeeded, as shown in Fig. 9. This methodology ensures that no face will be detected outside of an upper body and no face will be detected without an upper body, so the probability of detecting false objects is reduced; a minimal sketch of this nested detection is given below.

Fig. 9: Combination of upper body and face detection using Haar-cascade training.
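The nested detection can be sketched as follows, assuming two cascades trained as described in Section II (the file names are illustrative):

    import cv2

    upper_body = cv2.CascadeClassifier('upper_body_cascade.xml')  # first-stage cascade
    face = cv2.CascadeClassifier('face_cascade.xml')              # second-stage cascade

    def detect_face(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in upper_body.detectMultiScale(gray, 1.1, 5):
            roi = gray[y:y + h, x:x + w]         # search for a face only inside the upper body
            for (fx, fy, fw, fh) in face.detectMultiScale(roi, 1.1, 5):
                return (x + fx, y + fy, fw, fh)  # face box mapped back to full-frame coordinates
        return None                              # no upper-body-plus-face pair found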
(a) Movements in a single direction: soft green means slow movement, while hard green means fast movement. (b) Movements in two directions at the same speed: orange means slow movement, while red means fast movement. (c) Movements in two directions at different velocities: soft blue means a fast roll movement and a slow thrust movement; hard blue means a fast thrust movement and a slow roll movement.

Fig. 10: Roll and thrust movements depending on the image zone where the object's centroid is detected.
2) UAV displacements:
The next part of the experiment consists of identifying the location of the object's centroid in the given image. When the centroid is not at the image center, it can be in one of three different zones, and the movement the UAV performs depends on which zone the centroid falls in; these movements can also be classified in three types, as shown in Fig. 10. The quadrotor velocity and direction are assigned depending on the centroid position, as shown in the next pseudocode:

    # (x, y): image coordinates of the object's centroid.
    # roll_f/roll_s: fast/slow roll commands; th_f/th_s: fast/slow thrust commands.
    # y_th denotes the vertical zone boundary; the leading forward velocity is taken
    # as 0 here, since only roll and thrust are commanded in these cases. The
    # preceding branches of the chain and the command for the first zone are not shown.
    elif x < 290 and y < y_th:
        ...
    elif x < 145 and y < y_th:
        Send_NED_Velocity(0, -roll_f, -th_f, 1, vehicle)
    elif x < 145 and y > y_th:
        Send_NED_Velocity(0, -roll_f, -th_s, 1, vehicle)
    elif x > 145 and y < y_th:
        Send_NED_Velocity(0, -roll_s, -th_f, 1, vehicle)
    elif x > 145 and y > y_th:
        Send_NED_Velocity(0, -roll_s, -th_s, 1, vehicle)

In this snippet there are two main conditions that need to be satisfied for each one of the cases shown in Fig. 10. The first condition determines where the object's centroid is w.r.t. the center of the image frame (top, bottom, right, left, or any corner). The second condition checks whether the centroid is near to or far from the image center, deciding whether to perform a slight or an aggressive maneuver and roll displacement. For example, if the object centroid is located at the image's top-left corner, the quadrotor will perform an aggressive roll and diminish its thrust; as the object gets closer to the center, the quadrotor decreases the velocity of the movement until the centroid reaches the very center of the image. The variables that control these movements in the code are roll_f and roll_s, for fast and slow movements in the roll angle, respectively, and th_f and th_s, which define fast and slow motor speeds, respectively.

Fig. 11 shows two sequences of images representing the object recognition and tracking performed by the UAV. As stated in Section II, there are some frames where false upper bodies are detected, but the real face is only recognized when a previous upper body is detected. The full video of the external view and the drone camera view can be found at the following link: https://youtu.be/SY-dss_jJA4

The video of the flight experiments shows how the face is detected and how the drone navigates following the face. The program analyzes images at a rate of about four frames per second, giving a reaction time between the object detection and the UAV movement of about 0.25 seconds. It can also be observed that bounding boxes appear in places where the algorithm detects false upper bodies, but in none of these cases is a face found.

IV. CONCLUSIONS
Object detection using the Haar feature-based cascade classifier method works relatively well for vision-based drone navigation. To ensure that the detection is reliable, the process must be meticulous and all the steps must be followed correctly; for example, as discussed in Section II, on the order of a few thousand positive images and several hundred negative images are required to guarantee a good training. The decision to perform a two-step detection, one step for the upper body and another for the face, yields better results in comparison with face-only detection. The probability of detecting a face inside a false upper body decreases drastically; in fact, during the experiments there was never a case where a complete object (upper body + face) was wrongly detected.

Future work includes: a) giving the UAV the capability to perform yaw movements in case the object rotates on its own axis; the most effective method could be detecting
when the two points that form the horizontal top line of the bounding box are not in the same row of the image, within a certain hysteresis; and b) using different methods for object detection, including deep learning techniques. Methods like convolutional neural networks should be addressed to compare their performance; in particular, TensorFlow could be implemented taking advantage of the GPU, so that the Jetson TX2 would be better exploited.

(a) Drone camera view 1. (b) External camera view 1. (c) Drone camera view 2. (d) External camera view 2.

Fig. 11: Sequence of ordered object tracking images taken from the drone perspective and from an external video recording.

REFERENCES
[1] A. M. de Oca, L. Arreola, A. Flores, J. Sanchez, and G. Flores, "Low-cost multispectral imaging system for crop monitoring," in 2018 International Conference on Unmanned Aircraft Systems (ICUAS), June 2018, pp. 443–451.
[2] J. M. Vazquez-Nicolas, E. Zamora, I. Gonzalez-Hernandez, R. Lozano, and H. Sossa, "Towards automatic inspection: crack recognition based on quadrotor UAV-taken images," in 2018 International Conference on Unmanned Aircraft Systems (ICUAS), June 2018, pp. 654–659.
[3] X. Li and L. Yang, "Design and implementation of UAV intelligent aerial photography system," vol. 2, Aug 2012, pp. 200–203.
[4] L. Arreola, A. Montes de Oca, A. Flores, J. Sanchez, and G. Flores, "Improvement in the UAV position estimation with low-cost GPS, INS and vision-based system: Application to a quadrotor UAV," in 2018 International Conference on Unmanned Aircraft Systems (ICUAS), June 2018, pp. 1248–1254.
[5] R. Lienhart and J. Maydt, "An extended set of Haar-like features for rapid object detection," in Proceedings. International Conference on Image Processing, vol. 1, Sept 2002, pp. I–I.
[6] D. Deneault, D. Schinstock, and C. Lewis, "Tracking ground targets with measurements obtained from a single monocular camera mounted on an unmanned aerial vehicle," May 2008, pp. 65–72.
[7] T. Gaspar, P. Oliveira, and C. Silvestre, "UAV-based marine mammals positioning and tracking system," Aug 2011, pp. 1050–1055.
[8] S. Zhang, "Object tracking in unmanned aerial vehicle (UAV) videos using a combined approach," in Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, vol. 2, March 2005, pp. ii/681–ii/684.
[9] J. I. Flores-Delgado, L. G. Martínez-Santos, R. Lozano, I. Gonzalez-Hernandez, and D. A. Mercado, "Embedded control using monocular vision: Face tracking," in 2017 International Conference on Unmanned Aircraft Systems (ICUAS), June 2017, pp. 1285–1291.
[10] S. Xu, A. Savvaris, S. He, H. Shin, and A. Tsourdos, "Real-time implementation of YOLO+JPDA for small scale UAV multiple object tracking," in 2018 International Conference on Unmanned Aircraft Systems (ICUAS), June 2018, pp. 1336–1341.
[11] O. Araar and N. Aouf, "Visual servoing of a quadrotor UAV for autonomous power lines inspection," in 22nd Mediterranean Conference on Control and Automation (MED), June 2014, pp. 1418–1424.
[12] "Vision-based UAV navigation in orchards," IFAC-PapersOnLine, vol. 49, no. 16, pp. 10–15, 2016. 5th IFAC Conference on Sensing, Control and Automation Technologies for Agriculture (AGRICONTROL 2016).
[13] L. R. G. Carrillo, G. R. F. Colunga, G. Sanahuja, and R. Lozano, "Quad rotorcraft switching control: An application for the task of path following," IEEE Transactions on Control Systems Technology, vol. 22, no. 4, pp. 1255–1267, July 2014.
[14] R. Polvara, M. Patacchiola, S. Sharma, J. Wan, A. Manning, R. Sutton, and A. Cangelosi, "Toward end-to-end control for UAV autonomous landing via deep reinforcement learning," in 2018 International Conference on Unmanned Aircraft Systems (ICUAS), June 2018, pp. 115–123.
[15] S. Lee, T. Shim, S. Kim, J. Park, K. Hong, and H. Bang, "Vision-based autonomous landing of a multi-copter unmanned aerial vehicle using reinforcement learning," June 2018.
[16] NVIDIA. Jetson TX2 Developer Kit. [Online].
[17] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, 2001.
[18] OpenCV dev team. (2011) Cascade classifier training.
[19] R. Ajna and T. Hersan. Training Haar cascades. [Online]. Available: https://memememememememe.me/post/training-haar-cascades/
[20] OpenCV. (2016) Cascade classifier training.
[21] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Dec 2001.
[22] sentdex. (2016) Training Haar cascade object detection - OpenCV with Python for image and video analysis 20.
[23] 3D Robotics. (2015) Guiding and controlling Copter. [Online]. Available: http://python.dronekit.io/guide/copter/guided_mode.html
[24] Free Software Foundation, Inc. (2007) MAVLink developer guide. [Online]. Available: https://mavlink.io/en/
[25] S. Atoev, K. Kwon, S. Lee, and K. Moon, "Data analysis of the MAVLink communication protocol," in 2017 International Conference on Information Science and Communications Technologies (ICISCT).