Hand Gesture Controlled Drones: An Open Source Library
Kathiravan Natarajan†, Student IEEE Member, Department of Computer Science, Texas A&M University – Commerce, Commerce, Texas, [email protected]
Truong-Huy D. Nguyen∗, IEEE Member, Department of Computer and Information Sciences, Fordham University, Bronx, New York, [email protected]
Mutlu Mete, IEEE Member, Department of Computer Science, Texas A&M University – Commerce, Commerce, Texas, [email protected]
† Supported by Texas A&M University–Commerce Graduate School and Department of Computer Science. ∗ Corresponding author.
Abstract—Drones are conventionally controlled using joysticks, remote controllers, mobile applications, and embedded computers. A few significant issues with these approaches are that drone control is limited by the range of electromagnetic radiation and is susceptible to interference noise. In this study we propose the use of hand gestures as a method to control drones. We investigate the use of computer vision methods to develop an intuitive way of agent-less communication between a drone and its operator. Computer vision-based methods rely on the ability of a drone's camera to capture surrounding images and use pattern recognition to translate images into meaningful and/or actionable information. The proposed framework involves a few key steps toward an ultimate action: segregating images from the front camera's video stream, building a robust and reliable recognizer of gestures in the segregated images, and finally converting classified gestures into actionable drone movements, such as takeoff, landing, and hovering. A set of five gestures is studied in this work. A Haar feature-based AdaBoost classifier [1] is employed for gesture recognition. We also consider the safety of the operator and of the drone's actions by estimating distance with computer vision. A series of experiments is conducted to measure gesture recognition accuracy under the major scene variabilities: illumination, background, and distance. Classification accuracies show that gestures posed in well-lit scenes, against clear backgrounds, and within 3 ft are recognized correctly over 90% of the time. Limitations of the current framework and feasible solutions for better gesture recognition are also discussed. The software library we developed and the hand gesture datasets are open-sourced at the project website.
1. Introduction
Drones, also known as unmanned aerial vehicles, are on the rise in recreational use and in a wide range of industrial applications, such as security, defense, agriculture, energy, insurance, and hydrology. Drones are essentially special flying robots that perform functions such as capturing images, recording videos, and sensing multimodal data from their environment. Based on shape and size, there are two types of drones: fixed-wing and multirotor. Because of their versatility and small size, multirotor drones can operate where humans cannot, collect multimodal data, and intervene when needed. Moreover, with the use of a guard hull, multirotor drones are very sturdy in collisions, which makes them even more valuable for exploring uncharted areas. At present, flying robots are used in different businesses such as parcel delivery systems [2]. For example, companies like Amazon Prime and UPS are using multirotor drones to deliver their parcels. The New York Police Department uses quadcopters in crime prevention [3]. For the purposes of agriculture monitoring, for instance, the use of multiple sensors such as video and thermal infrared cameras is of benefit [4]. Drones are especially useful in risky missions. For the sake of clarity, in the rest of this work we define a drone as a multirotor flying robot, excluding fixed-wing aircraft.
A visual camera is an indispensable sensor for current drones. The low cost, low power, and small size of image capturing and streaming devices make them a de facto feature of numerous drones on the market. The output of a drone's camera can be used in many ways depending on the application. In a common scenario, the camera output is directed to the drone operator, who may issue the drone a new instruction based on the current visual environment via a remote controller, which serves as an agent between the drone and its operator. In this work, we investigate an alternative method of controlling multirotor drones using hand gestures as the main channel of communication. We propose a framework that maps segregated images from the video stream to one of five commands/gestures. The camera can capture visual instructions from the operator, which eliminates the control device, leading the way for agent-less communication.
Haar features serve as fundamental masks to capture gradient changes in images. Each block of a mask can be scaled or rotated to capture predetermined targets. These advantages allow us to detect various gestures at many sizes. Therefore, a Haar feature-based AdaBoost classifier [1] is employed in the action planner. Safety issues are also considered while the drone automatically complies with the commands initiated by the operator's gestures. This project also presents a case study for image recognition-driven autonomous drones. Our key contributions in this project include:
1) a novel framework of drone control based on hand gestures;
2) a comparison of state-of-the-art computer vision approaches to hand gesture detection, applied to our hand gesture dataset;
3) a discussion of key challenges and lessons learned from building the framework's hand gesture recognition component.
This project uses one of many mediocre drones on the market: the Parrot AR.Drone 2.0 [5]. Both the software library and hand gesture datasets are open-sourced at [6].
2. Background
Before detailing our framework, we briefly summarize related work on drone control approaches and on attempts at employing gesture detection for this purpose.
Most commercial drones available on the market come with specially designed controllers, either as a dedicated signal transmitter or as software applications running on users' hand-held devices (such as mobile phones or tablets). In both cases, the controller sends commands with detailed movement information, such as move the drone x units towards a certain direction, through wireless channels (e.g., Wi-Fi or Bluetooth). Notable products include the DJI drones (models Phantom, Inspire, Matrice, etc.) [7] and Parrot's drones (models AR.Drone, Bebop, DISCO, Swing, Mambo, etc.) [8].
Recently there have been commercial products that introduce hand gestures as a viable control mechanism. To capture the gestures, there are two approaches.
• Using specially designed gloves: the controller is mounted on a glove worn by the user and detects in real time the yaw, pitch, and roll of the hand to translate into respective movements for the drone. Products include the Kd Interactive Aura Drone [9] and the MenKind Motion Control Drone [10].
• Using computer vision via the on-board camera: these devices use the on-board camera to detect in real time where the user's hand is and respond to it in intuitive ways. Products include the DJI Spark Drone [11].
The first approach presents an attempt to add new control dimensions, thus allowing more degrees of freedom for the drone controller. Instead of pressing predefined buttons, users can move their fingers or wave their hand(s) in specific ways that are recognized by sensors installed in the glove and then converted into digital commands. The transmission of commands is done over radio channels, so it follows the traditional control paradigm. The second approach, on the other hand, takes a more radical leap by employing real-time image analysis, performed on the drone itself, to recognize commands instead of sending them over radio channels.
In academia, there have been similar attempts to investigate alternative methods of controlling drones using body parts, such as hand gestures or full-body motions. Notably, Cauchard et al. [12] found that when interacting with drones using body language, drone operators feel natural using gestures like those used with a pet or other people, such as beckoning or waving. As such, natural user interfaces (NUIs) present an appealing way to enhance the user experience when interacting with drones, compared to the traditional remote-control device. In building an NUI for drone control, there are two main directions fellow researchers are working towards: with and without the help of aiding devices.
The first involves the use of a third-party device that can recognize non-verbal gestures reliably before mapping the detected gestures into suitable digital commands. Such devices include the Leap Motion Controller [13], [14] and the Microsoft Kinect [15], [16]. While the Leap Motion Controller is designed specifically to capture hand motions, the Kinect can capture full-body motion faithfully. Although this approach yields high accuracy in gesture or body motion detection, these devices need to be connected to a computer to work, so portability is a limiting factor.
In the second direction, body movement is detected in real time, using machine vision, to control the drone without any additional instrument. Researchers have examined the use of eye gaze [17], face poses, hand gestures, and combinations of them [18], [19]. Image-based hand gesture recognition problems have been studied extensively for decades.
Twenty-four basic signs of American Sign Language are detected and classified using a boosted cascade of classifiers trained with AdaBoost and informative Haar wavelet features. In this work, Dinh et al. [20] proposed a new feature, called Double L, for describing hand gestures, in addition to edge features, line features, and edge-surrounded features. Real-time hand gesture detection based on a bag of features and support vector machines was proposed in [21]. In training, the scale-invariant feature transform (SIFT) is used to extract key-points for all training images, and vector quantization maps key-points from each training image into a bag of words after K-means clustering. These histograms act as feature vectors, and an SVM model is trained for classification. Experiments were carried out with a web camera.
A study by Dardas et al. [22] detects and tracks hand gestures in cluttered backgrounds as well as in different lighting conditions. It uses skin detection and hand posture contour comparison algorithms, subtracting faces, and recognizes hand gestures using Principal Component Analysis. In each training stage, hand gesture images with various scales, angles, and lightings are trained. The training weights are calculated by projecting training images onto the eigenvectors. During testing, images containing hand gestures are projected onto the eigenvectors and the testing weights are calculated. Finally, Euclidean distances between training weights and test weights are used to classify hand gestures.
In another work, using Hu moment features, Meng et al. [23] proposed an algorithm for detecting the fingertip structure. First, features covering the skin region and the image were used to differentiate the background from the skin region in the space of saturation, brightness value, and hue. Later, an algorithm to find the region of interest was implemented, and fingertips were detected by approximating the contour. A seven-dimensional feature vector was created after the detection process. Finally, a distance matching criterion was used for hand gesture recognition. This algorithm improved accuracy by 2.7% compared to Hu moment feature recognition.
Detecting hand gestures in real time is a challenging task for several reasons, including how people perform hand gestures. Molchanov et al. [24] recently addressed these challenges with a three-dimensional recurrent convolutional neural network model with multi-modal input streams. The hypothesis is validated by testing on a multi-modal dynamic hand gesture dataset captured with depth, color, and stereo infrared sensors. This system achieved an accuracy of 83.8% on the complex dynamic hand gesture set.
A multi-class classification approach based on Weighted Linear Discriminant Analysis and the Gentle AdaBoost (GAB) algorithm was proposed by Tian et al. [25]. In this approach, Histogram of Oriented Gradients (HOG) features are extracted at random locations and a multi-class cascade classifier is trained for hand gesture detection.
3. Proposed Framework
In this section, we detail our framework for gesture-based drone control. The targeted drone types for this framework are multirotor helicopters equipped with a front-facing camera, such as the Parrot AR.Drone [5]. Figure 1 depicts one such drone, with four rotors on the sides of the body in charge of lifting the drone off the ground and moving it in different directions. A camera is mounted at the front of the drone's body, which allows recording of the environment within its field of view. The framework is depicted in Figure 2.
Figure 1. Stylized top-down view of a quadrotor drone, facing downwards, with a camera mounted at the front of the drone.
Figure 2. Gesture-based drone control framework.
The video stream is constantly recorded through the on-board camera of the drone and then segmented into sequences of still images. Each image is then analyzed through the hand gesture recognition process, which includes three main steps: feature extraction, hand region identification, and finally gesture classification. A command mapper transforms the detected gesture into a command, such as take off, land, or back off. An action planner takes the command as its input and computes the corresponding course of primitive actions to satisfy the command. While the planner is operating, it also considers the surrounding environment to avoid collisions and ensure the safety of both the drone and perceived obstacles.
The hand gestures we work with are shown in Figure 3. Note that gestures are recognized based on a certain orientation of the user's hand, i.e., either the right or the left hand is used for each gesture. The set of five gestures includes fist, palm, go symbol, v-shape, and little finger. These gestures were picked arbitrarily, but we made sure that each carefully chosen gesture yields many distinctive Haar features; they are also very common gestures in society and easy to pose. The reason for using only five gestures is to provide all the basic functionalities of the drone, such as moving the drone right, left, backward, and forward and taking pictures. Unquestionably, more functionalities can be implemented by training more hand gestures, but the scope of this paper is focused on achieving high accuracies on mediocre drones for the basic functionalities mentioned above.
Figure 3. The hand gestures to be classified in this study: fist with the right hand (3a), right-hand palm (3b), left-pointing left hand (go symbol) (3c), left-hand V-shape (3d), and left-hand little finger (3e).
We avoid three-finger and two-finger gestures, since they may translate into similar Haar features, which may lead to many errors in the classification step. As another example, the one-finger gesture and the fingers-crossed gesture may end up having similar Haar features. During the preliminary set of experiments, we chose the aforementioned hand gestures, assuming that each would have a distinct set of features allowing it to be classified correctly. For example, the go symbol is expected to possess a separate set of Haar features compared to the fist or the palm, which in turn reduces the number of misclassified images and improves accuracy. Haar features may, however, end up with similar values if the gestures look alike; therefore, we attempted to reduce the possibility of misclassification by choosing gestures with significantly different poses. In the rest of this study, the go symbol, v-shape, and little finger gestures are indicated by GS, VS, and LF, respectively.
In order to implement the complete framework, there are a number of key challenges that we need to address, namely gesture recognition, visual variability of the scene, and safety assurance of maneuvers.
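To make the command-mapping step described above concrete, the following Python sketch translates a recognized gesture label into a primitive drone command. The mapping table, command strings, and function name are illustrative assumptions for exposition, not the interface of the released library, and the specific gesture-to-command assignments are placeholders.

    # Illustrative gesture-to-command mapping (assumed names, not the
    # released library's API). A recognized gesture label is translated
    # into a primitive command that the action planner then executes.
    GESTURE_COMMANDS = {
        "fist": "takeoff",
        "palm": "land",
        "GS": "move_forward",
        "VS": "move_left",
        "LF": "take_picture",
    }

    def map_gesture_to_command(gesture_label):
        """Return the command for a recognized gesture, or None for void detections."""
        return GESTURE_COMMANDS.get(gesture_label)

    print(map_gesture_to_command("palm"))     # -> land
    print(map_gesture_to_command("unknown"))  # -> None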
In two related studies, Viola and Jones [1], [26] introduced the Haar feature-based cascade AdaBoost classifier, primarily for frontal face detection. Their method builds weak classifiers using Haar features extracted from various sub-windows/patches of the target image. AdaBoost (Adaptive Boosting), introduced in [27], classifies a feature vector by exploiting many subsequent weak learners and updates the weight of each weak classifier at the end of every training iteration. AdaBoost-based solutions require a set of real classifiers that learn from a training dataset and map testing data to one of the classification labels.
We used Haar features to represent each image of the dataset. Although Haar functions were introduced in 1910 [28], they were not popularized for image recognition problems until a broad analysis by Papageorgiou et al. [29]. A Haar-based feature places rectangular regions at various locations of the detection window, sums the pixel intensities within each region, and calculates the differences between these sums. These differences are then used to categorize the image. In our scenario, the feature extraction module uses the pattern generated by many local Haar features of a hand gesture. Later, the classifier maps the feature vector of a gesture either to one of the existing gesture labels or to void. The reasons for choosing Haar cascades over other algorithms are that the Haar cascade has a better detection rate on less clear images than other feature descriptors, such as HOG [30]; moreover, its implementation is simple, it achieves good accuracy with fewer training images, and it consumes less memory than GPU-enabled image classification systems such as Convolutional Neural Networks [31].
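As a minimal sketch of how such a cascade is applied to a single frame with OpenCV, the snippet below loads one trained model and runs multi-scale detection. The file names are placeholders, and the scaleFactor and minNeighbors values are typical settings rather than the exact parameters used in our experiments.

    import cv2

    # Load one of the trained Haar cascade models (file name is a placeholder).
    palm_cascade = cv2.CascadeClassifier("palm_cascade.xml")

    frame = cv2.imread("frame.jpg")                 # one segregated frame from the video stream
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # Haar features are computed on intensities

    # Scan the frame at multiple scales; each hit is a candidate hand region.
    detections = palm_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    # Draw the detected regions for inspection.
    for (x, y, w, h) in detections:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)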
The proposed study is designed for a user to control a drone in daily life, not in a special laboratory environment. For this very reason, we want to empirically measure the effects of scene variability while the classification framework is kept unchanged. To this end, three visual variables are tested: illumination, background, and distance of the target gesture. Illumination measures how well the scene is lit. In terms of the illumination variable, a scene (experimental environment) is categorized in a binary way, as dim lit or well lit. We did not use any special lighting tools while collecting images for the dataset; instead, various test cases were captured under sunlight and/or everyday fluorescent lights. The variability of the background is expressed with one of two categories, cluttered or clear (almost blank). A user in front of a loaded bookshelf, a natural scene, or many other objects is categorized as a cluttered scene, whereas a gesture posed in front of walls or doors is considered a clear background. Cluttered background problems were detailed in [32]. The last scene condition is the distance between the gesture-posing hand of the user and the drone's camera. The distance threshold is 3 ft: some gestures are presented within 3 ft, while the others are tested from more than 3 ft away. Figure 4 shows various scenes based on these conditions.
Figure 4. Scene variability demonstrated with the palm gesture. 4a: dim lit, clear background, more than 3 ft away; 4b: well lit, clear background, more than 3 ft away; 4c: dim lit, cluttered background, within 3 ft; 4d: well lit, cluttered background, within 3 ft; 4e: dim lit, clear background, within 3 ft; 4f: dim lit, cluttered background, more than 3 ft away.
TABLE 1. NUMBER OF IMAGES IN THE STUDY DATASET
Hand Gestures   Positives   Negatives
Fist            1570        900
Palm            1456        900
GS              1390        900
VS              1530        900
LF              1456        900
After a gesture is recognized and converted to a command, such as move to the left, the action planner on the drone kicks in to compute the most appropriate course of action that satisfies the recent command. In this process, it is imperative for the drone to carry out the action while ensuring the safety of itself, surrounding objects, and the environment. A collision with any of these entities potentially causes serious damage to the parties involved, which is highly undesirable. In our framework, the action planning module requires the drone to utilize its sensors (e.g., camera and proximity) to estimate the area where it can safely fly or hover.
Collision avoidance is a topic long addressed in robotics. Drones are much more susceptible to external factors that cause their movements to be unstable, such as wind or air flows, so collision avoidance for drones requires additional considerations for such factors. While some approaches rely on the on-board camera for this task [33], [34], [35], others propose the use of more advanced sensors, such as ultrasonic or laser range finders [36], [37]. One limitation of camera-based solutions is that they may perform poorly in the presence of optical noise, such as low lighting or foggy environments. Using more non-vision-based sensors helps alleviate this problem but adds to the overall weight of the drone, which may not always be feasible.
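One simple way to realize the vision-based distance check mentioned earlier is the standard pinhole-camera approximation, which estimates distance from the apparent size of a detected hand. The sketch below is only illustrative: the focal length and real hand width are assumed constants, not calibrated values from our setup.

    # Rough monocular distance estimate: distance = (real_width * focal_px) / pixel_width.
    # Both constants below are illustrative assumptions, not calibrated values.
    FOCAL_LENGTH_PX = 700.0   # assumed focal length of the front camera, in pixels
    HAND_WIDTH_M = 0.09       # assumed real width of a fist/palm, in meters

    def estimate_distance_m(bbox_width_px):
        """Approximate camera-to-hand distance from a detection's bounding-box width."""
        return (HAND_WIDTH_M * FOCAL_LENGTH_PX) / float(bbox_width_px)

    # Example: a 60-pixel-wide detection is roughly 1 m (about 3.4 ft) under these assumptions.
    print(round(estimate_distance_m(60), 2))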
4. Experiments
The Parrot AR.Drone 2.0 [5] is used throughout all the experiments. It is one of the early Wi-Fi-controlled drones, introduced by Parrot SA (Paris, France) in 2012, and it costs around $130 as of December 2017. The AR.Drone 2.0 is equipped with a 720 x 720 pixel camera, an ARM Cortex A8 1 GHz 32-bit processor, Wi-Fi connectivity, a gyroscope, an accelerometer, a magnetometer, a pressure sensor, and an altitude ultrasound sensor. A stylized top-down view of the AR.Drone 2.0 is shown in Figure 1.
Gesture recognition experiments are carried out on a machine with a 2.60 GHz CPU and 16 GB of memory running Ubuntu 14.04.5 LTS (Trusty Tahr). The drone control software is developed in Python 2.7 with OpenCV 3.3.0, an open-source computer vision library [38].
Training images are collected at a resolution of 720 x 720 pixels, the same as the drone's front camera resolution. Positive training images are hand gesture images collected from the drone's front camera. Negative training images, also called background images or background image samples, are collected randomly with the help of image search engines and contain no hand gestures. Should the size of a negative image be greater than 720 x 720 pixels, it is down-sampled to the size of the positive images. The number of positive and negative training samples for each gesture is given in Table 1. A total of 8302 images are used in the experiments.
We benefit from OpenCV's embedded tool to mark the bounding box and location of each gesture in the positive training images. It should be noted that although all five gestures are posed by the same user with the same right/left hand, a given gesture appears at many different orientations and scales. In a preprocessing step, gesture locations must be marked correctly to train a classifier. OpenCV provides an integrated annotation tool to manually describe the objects to be detected by the classifier. We created an annotation file which maintains the association between positive images and the coordinates of the bounding rectangles of the gestures. Following this step, we extracted features in OpenCV, which supports creating vector representations of training images using Haar features. While generating feature vectors from the images, we specify the sample size as 20 x 20 pixels, since Lienhart et al. [39] reported that a 20 x 20 sample size achieved the highest hit rate in a similar study. Upon extracting feature vectors, we train the boosted cascade of weak classifiers, AdaBoost, using all positive and negative feature vectors. Each gesture classifier is trained separately, which generates five different classification models. Once an image is streamed from the drone's camera to our software, each frame is mapped to the respective gesture or to none. Training each classifier takes around 15 minutes because of the small window size (20 x 20) used in the Haar feature extraction step. All training images and model files (in the form of .xml) are publicly available at the project web site [6]. In the context of AdaBoost, each resulting .xml file serves as a strong classifier composed of the weighted sum of weak classifiers. The numbers of training stages for the palm, fist, GS, VS, and LF gestures in the Haar cascade classifier are 4, 16, 8, 10, and 5, respectively.
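The per-frame mapping described above can be sketched as follows: each frame is converted to grayscale and tested against the five trained cascades, and the frame is labeled with the first gesture whose cascade fires, or with none. The model file names are placeholders, and a local webcam is used here as a stand-in for the drone's 720 x 720 front-camera stream; this is an illustrative simplification, not the released library's exact pipeline.

    import cv2

    # Placeholder model files for the five trained gesture cascades.
    CASCADE_FILES = {
        "fist": "fist.xml",
        "palm": "palm.xml",
        "GS": "go_symbol.xml",
        "VS": "v_shape.xml",
        "LF": "little_finger.xml",
    }
    cascades = {name: cv2.CascadeClassifier(path) for name, path in CASCADE_FILES.items()}

    def classify_frame(frame):
        """Map one frame to a gesture label, or None when no cascade fires."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for name, cascade in cascades.items():
            if len(cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)) > 0:
                return name
        return None

    cap = cv2.VideoCapture(0)  # stand-in for the drone's video stream
    ok, frame = cap.read()
    if ok:
        print(classify_frame(frame))
    cap.release()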
5. Discussion of Experimental Results
TABLE 2. CLASSIFICATION ACCURACIES FOR GESTURE DETECTION
Test Conditions   Palm   Fist   GS     VS     LF
DL, CTB, LT-3     0.92   0.89   0.86   0.84   0.86
DL, CTB, MT-3     0.66   0.70   0.60   0.65   0.69
DL, CLB, LT-3     0.97   0.91   0.87   0.90   0.88
DL, CLB, MT-3     0.70   0.74   0.69   0.65   0.59
WL, CTB, LT-3     0.90   0.89   0.91   0.86   0.81
WL, CTB, MT-3     0.69   0.81   0.73   0.66   0.70
WL, CLB, LT-3     0.99   0.99   0.96   0.95   0.90
WL, CLB, MT-3     0.84   0.81   0.80   0.80   0.76
DL: dim lit, WL: well lit, CTB: cluttered background, CLB: clear background, LT-3: within 3 ft, MT-3: more than 3 ft; GS: go symbol, VS: V-shape, LF: little finger. The row with the highest average accuracy (WL, CLB, LT-3) is given in bold.
The individual accuracy of each gesture is detailed in Table 2. The table also categorizes how the classifier performs under the variable scene conditions described in Section 3.2. The accuracy reported in Table 2 is the ratio of correctly classified gestures to the total number of test images of the same gesture. For example, in the palm gesture experiments with scene variables DL, CTB, LT-3, 4593 of 5000 palm gestures are correctly identified (4593/5000 ≈ 0.92).
Distance is observed to be the most significant scene variable. Gestures posed within 3 ft significantly outperform gestures posed more than 3 ft away. Referring to Table 2, regardless of illumination and background variability, the average accuracy of the LT-3 experiments is 0.94, while that of MT-3 is 0.71. The decline in accuracy with distance is common to all gestures. One of the sharpest accuracy declines is seen in the classification of the palm, where changing the distance variable from LT-3 to MT-3 drops the accuracy from 0.97 to 0.70. The distance variable causes a comparatively mild decrease in classification accuracy in the case of well-lit, clear-background experiments, from 0.80 to 0.70.
The second and third most significant scene variables are background and illumination, respectively. A cluttered background lowers accuracy in many pairwise comparisons, as in the within-3-ft, dim-lit fist experiments (DL, CTB, LT-3: 0.89 versus DL, CLB, LT-3: 0.91). Another example of similarly lowered accuracy is the go symbol gesture, where different backgrounds are tested in well-lit, within-3-ft poses (WL, CTB, LT-3: 0.91 versus WL, CLB, LT-3: 0.96).
Illumination, categorized as dim lit or well lit, is found to be the least significant scene variable. As expected, the effect of the lighting condition is evident for almost all gestures, except in a few cases for the little finger and the fist. The accuracy of the little finger drops from 0.86 to 0.81 between DL, CTB, LT-3 and WL, CTB, LT-3. In the same pairwise comparison for the fist, changing the illumination variable from DL to WL does not increase the accuracy (DL, CTB, LT-3: 0.89 versus WL, CTB, LT-3: 0.89).
The overall best average classification accuracy, 0.95, is obtained with scene variables WL, CLB, LT-3, as given in the bold row of Table 2.
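For reproducibility, an accuracy figure such as those in Table 2 can be estimated by running one trained cascade over a folder of test images that all contain the corresponding gesture under one scene condition, and counting the fraction of images with at least one detection. The folder layout and file names in this sketch are assumptions for illustration, not the structure of the released dataset.

    import os
    import cv2

    def accuracy_for_condition(cascade_path, image_dir):
        """Fraction of test images in image_dir where the cascade fires at least once."""
        cascade = cv2.CascadeClassifier(cascade_path)
        total, correct = 0, 0
        for fname in os.listdir(image_dir):
            img = cv2.imread(os.path.join(image_dir, fname))
            if img is None:
                continue  # skip non-image files
            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            hits = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            total += 1
            correct += 1 if len(hits) > 0 else 0
        return correct / float(total) if total else 0.0

    # Example: palm gesture, dim lit, cluttered background, within 3 ft.
    print(accuracy_for_condition("palm.xml", "test/palm/DL_CTB_LT3"))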
6. Conclusion
The goal of this study is to enable a hand gesture-based control mechanism with the maximum possible accuracy even on mediocre drones, which are easily outperformed by state-of-the-art drones with built-in high-resolution cameras (e.g., 4K and higher). In this empirical study, we focused on software development for the AR.Drone. We presented an image recognition-based communication framework to control drones with hand gestures. The framework is successfully tested using a mediocre drone, the Parrot AR.Drone 2.0. A set of five gestures was carefully selected to build a dataset of 8302 images. Each image is represented by a set of Haar features in a cascaded AdaBoost algorithm. Classification results showed that the distance between the drone and its operator is the most important indicator of success; this applies to all five gestures. Experimental tests resulted in an average accuracy of 0.90 when the operator posed gestures within 3 ft, regardless of the illumination and background variability of the scene. We found that the accuracy of the framework is highest when the operator poses within 3 ft, in a well-lit scene, and against a clear background. This controlling distance can be further improved by utilizing better cameras, such as those supporting 4K resolution, which allow capturing images at good resolution from longer distances, or by implementing the same framework on state-of-the-art drones with better imaging capabilities. With the HD camera available on mediocre drones, hand gesture recognition at distances between 3 and 5 ft is highly accurate, and this controlling distance can likewise be improved by enabling higher-resolution cameras on drones. With the current hand gesture-based control mechanism, we envision that drones can be sent to any feasible distance to perform operations before flying back to the controller for further close-range interactions. To explore the effects of different hand poses and their deviations, an in-depth statistical analysis of the applicability of the framework in different environment settings is planned as future work.
Figure 5. Misclassified gestures. See Section 5 for further discussion.
Acknowledgments
The authors thank Texas A&M University–Commerce Graduate School and Department of Computer Science for the travel and publication support.
References
[1] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1. IEEE, 2001, pp. I–I.
[2] V. Gatteschi, F. Lamberti, G. Paravati, A. Sanna, C. Demartini, A. Lisanti, and G. Venezia, "New frontiers of delivery services using drones: A prototype system exploiting a quadcopter for autonomous drug shipments," in Computer Software and Applications Conference (COMPSAC), 2015 IEEE 39th Annual.
Remote Sensing
[12] Cauchard et al., in Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp '15). New York, NY, USA: ACM Press, 2015, pp. 361–365. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2750858.2805823
[13] B. Sanders, D. Vincenzi, and Y. Shen, "Investigation of Gesture Based UAV Control," in Advances in Human Factors in Robots and Unmanned Systems, J. Chen, Ed. Springer, Cham, Jul. 2017, pp. 205–215. [Online]. Available: http://link.springer.com/10.1007/978-3-319-60384-1_20
[14] A. Sarkar, K. A. Patel, R. K. G. Ram, and G. K. Capoor, "Gesture control of drone using a motion controller," IEEE, Mar. 2016, pp. 1–5. [Online]. Available: http://ieeexplore.ieee.org/document/7462401/
[15] A. Sanna, F. Lamberti, G. Paravati, and F. Manuri, "A Kinect-based natural interface for quadrotor control," Entertainment Computing.
[16] In Proceedings of the 2013 International Conference on Intelligent User Interfaces (IUI '13). New York, NY, USA: ACM Press, 2013, p. 257. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2449396.2449429
[17] J. P. Hansen, A. Alapetite, I. S. MacKenzie, and E. Møllenbach, "The use of gaze to control drones," in Proceedings of the Symposium on Eye Tracking Research and Applications (ETRA '14). New York, NY, USA: ACM Press, 2014, pp. 27–34. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2578153.2578156
[18] J. Nagi, A. Giusti, G. A. Di Caro, and L. M. Gambardella, "Human Control of UAVs using Face Pose Estimates and Hand Gestures," in Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction (HRI '14). New York, NY, USA: ACM Press, 2014, pp. 252–253. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2559636.2559833
[19] V. M. Monajjemi, J. Wawerla, R. Vaughan, and G. Mori, "HRI in the sky: Creating and commanding teams of UAVs with a vision-mediated gestural interface," IEEE, Nov. 2013, pp. 617–623. [Online]. Available: http://ieeexplore.ieee.org/document/6696415/
[20] T. B. Dinh, V. B. Dang, D. A. Duong, T. T. Nguyen, and D.-D. Le, "Hand gesture classification using boosted cascade of classifiers," in Research, Innovation and Vision for the Future, 2006 International Conference on. IEEE, 2006, pp. 139–144.
[21] N. H. Dardas and N. D. Georganas, "Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques," IEEE Transactions on Instrumentation and Measurement, vol. 60, no. 11, pp. 3592–3607, 2011.
[22] N. H. Dardas and E. M. Petriu, "Hand gesture detection and recognition using principal component analysis," in Computational Intelligence for Measurement Systems and Applications (CIMSA), 2011 IEEE International Conference on. IEEE, 2011, pp. 1–6.
[23] Guoqing Meng and M. Wang, "Hand gesture recognition based on fingertip detection," Dec. 2013, pp. 107–111.
[24] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, "Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4207–4215.
[25] F. Tian, Q.-C. Hu, and T.-N. Zhang, "A hand gesture detection for multi-class cascade classifier based on gradient," in Instrumentation and Measurement, Computer, Communication and Control (IMCCC), 2015 Fifth International Conference on. IEEE, 2015, pp. 1364–1368.
[26] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[27] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in European Conference on Computational Learning Theory. Springer, 1995, pp. 23–37.
[28] A. Haar, "Zur Theorie der orthogonalen Funktionensysteme," Ph.D. dissertation (Göttingen, 1909); Mathematische Annalen, vol. 69, pp. 331–371, 1910.
[29] C. P. Papageorgiou, M. Oren, and T. Poggio, "A general framework for object detection," in Computer Vision, 1998, Sixth International Conference on. IEEE, 1998, pp. 555–562.
[30] P. Negri, X. Clady, and L. Prevost, "Benchmarking Haar and histograms of oriented gradients features applied to vehicle detection," in ICINCO-RA (1), 2007, pp. 359–364.
[31] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[32] M. J. Bravo and H. Farid, "Object recognition in dense clutter," Perception & Psychophysics, vol. 68, no. 6, pp. 911–918, 2006.
[33] T. Mori and S. Scherer, "First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles," IEEE, May 2013, pp. 1750–1757. [Online]. Available: http://ieeexplore.ieee.org/document/6630807/
[34] H. Alvarez, L. M. Paz, J. Sturm, and D. Cremers, "Collision Avoidance for Quadrotors with a Monocular Camera." Springer, Cham, 2016, pp. 195–209. [Online]. Available: http://link.springer.com/10.1007/978-3-319-23778-7_14
[35] C. Fu, M. A. Olivares-Mendez, R. Suarez-Fernandez, and P. Campoy, "Monocular Visual-Inertial SLAM-Based Collision Avoidance Strategy for Fail-Safe UAV Using Fuzzy Logic Controllers," Journal of Intelligent & Robotic Systems, vol. 73, no. 1–4, pp. 513–533, Jan. 2014. [Online]. Available: http://link.springer.com/10.1007/s10846-013-9918-3
[36] A. Moses, M. J. Rutherford, M. Kontitsis, and K. P. Valavanis, "UAV-borne X-band radar for collision avoidance," Robotica (article S0263574713000659).
[37] J. F. Roberts, T. Stirling, J.-C. Zufferey, and D. Floreano, "Quadrotor Using Minimal Sensing For Autonomous Indoor Flight," European Micro Air Vehicle Conference and Flight Competition (EMAV2007), 2007. [Online]. Available: https://infoscience.epfl.ch/record/111485/
[38] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
[39] R. Lienhart, A. Kuranov, and V. Pisarevsky, "Empirical analysis of detection cascades of boosted classifiers for rapid object detection,"