Mastering Large Scale Multi-label Image Recognition with High Efficiency over Camera Trap Images
Hakuna Ma-data Challenge: 1st Place Submission Description

Miroslav Valan
Swedish Museum of Natural History, Stockholm University, Savantic AB, PiVa AI
[email protected]
Lukáš Picek
University of West Bohemia, PiVa AI
[email protected]
Abstract
Camera traps are crucial in biodiversity-motivated studies; however, dealing with the large number of images while annotating these data sets is a tedious and time-consuming task. To speed up this process, Machine Learning approaches are a reasonable asset. In this article we propose an easy, accessible, light-weight, fast and efficient approach based on our winning submission to the "Hakuna Ma-data - Serengeti Wildlife Identification challenge". Our system outperformed the human-level performance. We show that, given relatively large data sets, it is effective to look at each image only once with little or no augmentation. By utilizing such a simple, yet effective baseline we were able to avoid over-fitting without extensive regularization techniques and to train a top scoring system on very limited hardware featuring a single GPU (1080Ti), despite the large training set (6.7M images and 6TB).
1. Introduction
Camera traps are an immense resource in ecological research and conservation efforts, addressing the need for accurate assessments of wildlife populations. They are used for passively collecting animal behavior data with minimal if any disturbance. Such a device is activated remotely with a sensor - based on motion or an infrared sensor - or a beam trigger based on light, laser or sound. In addition to animals, sensors are often falsely triggered, for instance by moving grass, thus increasing the cost of processing an already large number of images several times (4x in the case of the Snapshot Serengeti) [1].

For these reasons, a system that can automatically analyze camera trap images, while maintaining the accuracy of humans, would save years of human effort and unlock new research opportunities in biodiversity. Recent work [2, 3] showed that modern Convolutional Neural Networks (CNNs) achieve very promising results and are closing the gap with the human-level performance (3.5% error rate) [4]. Furthermore, when such a CNN is utilized within a human-in-the-loop approach, the error rate might be reduced to 0.22%.

The main contribution of this paper is a simple yet effective training method that learns representations from camera trap images by looking at every image only once, while using no augmentations. Our proposed system won the Hakuna Ma-data Challenge. Unlike previous work [2, 3] on the same Snapshot Serengeti data set [1], where empty images are 1) filtered and, if an animal is present, 2) classified as one of the animals, our system uses a single step, while being able to predict multiple species on the same image. In addition, our approach provides significant performance boosts, being the first ever to outperform humans on this task with low complexity models suitable for embedded edge devices.
2. Hakuna Ma-data Challenge
Snapshot Serengeti is the largest camera trap project to date, running continuously in Serengeti National Park, Tanzania, for more than 45 years [1]. The project is currently using a network of 225 cameras with passive infrared sensors that trigger every time a warm object moves in front of them. Each such event has one to several images (three on average) and is provided as a sequence to annotators. In total, 7.1M images from 2.65M sequences are currently annotated, with an estimated effort of close to 20 man-years of work [1].

Figure 1. Example images from the Snapshot Serengeti data set. The examples show a range of difficulties, from severe occlusions on the left, through tiny objects (small size or object far from camera), close-up shots, motion blur, and multiple classes on the same image.

The Hakuna Ma-data challenge was designed to test current state-of-the-art approaches over the Snapshot Serengeti data set. The task was to predict the presence of animals and categorize them into one of the 54 categories. The categories could be of different granularity: mostly at the species level, but also higher taxonomic ranks (reptiles, other birds) or from the same species (lion female and lion male).

The organizer provided 6.7M images for training while keeping approximately 0.5M for testing. Competitors had no access to the test images nor their annotations, and needed to submit a Dockerized solution that ran on the server. The training set is largely imbalanced, with 3/4 of the images being empty with no animal present on them. The presence of different categories in the training set was also hugely imbalanced toward wildebeests, zebras and Thomson's gazelles.
3. Methodology
The proposed system is based on 3 architectures that were fine-tuned from publicly available checkpoints pretrained on ImageNet [5]: SE-ResNext50 [6][7][8], EfficientNet B1 and EfficientNet B3 [9]. All the networks shared the parameters from Table 1 and used the one cycle training policy [10, 11, 12].

Parameter                  Value
Optimizer                  Adam [13]
Warm start learning rate   600 iterations * BS
Maximum learning rate      0.0001
End learning rate          0.000001
Learning rate decay type   cosine annealing
Table 1. Optimizer hyper-parameters identical to all networks.
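The schedule from Table 1 (warm start followed by cosine annealing from the maximum to the end learning rate) can be sketched as below; the total step count passed in is an illustrative placeholder, not the exact value used in training:

```python
import math

def lr_at(step, total_steps, warmup_steps=600,
          max_lr=1e-4, end_lr=1e-6):
    """Warm-up followed by cosine annealing, as in Table 1.

    Linearly ramps the learning rate up to max_lr over the
    warm-up steps, then decays it to end_lr with cosine
    annealing over the remaining steps.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase already completed.
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end_lr + 0.5 * (max_lr - end_lr) * (1 + math.cos(math.pi * t))
```

With a single-epoch budget, `total_steps` is simply the number of iterations needed to see every training image once.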
Other parameters, such as Batch size, Gradient Accumulation and Sampling strategy, were set differently for each model, as listed in Table 2.
Table 2. Networks and hyper-parameters used in the experiments. BS = Batch Size. GA = Gradient Accumulation.
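Gradient Accumulation (GA in Table 2) lets a single 1080Ti emulate a larger effective batch (BS x GA) by summing gradients over several micro-batches before one optimizer update. A toy sketch with a hand-derived squared-error gradient; the function and values are ours for illustration, not the actual training code:

```python
def train_step_with_accumulation(w, micro_batches, lr=0.1, ga=4):
    """One optimizer step accumulated over `ga` micro-batches.

    Gradients of a toy loss 0.5*(w*x - y)^2 are summed over the
    micro-batches and averaged before a single weight update,
    emulating a larger effective batch on limited GPU memory.
    """
    grad_sum, n = 0.0, 0
    for batch in micro_batches[:ga]:
        for x, y in batch:
            grad_sum += (w * x - y) * x  # d/dw of 0.5*(w*x - y)^2
            n += 1
    return w - lr * grad_sum / n
```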
Specifically, models were trained for a single epoch only, using two sampling strategies. In one strategy we used random sampling from all seasons, and in the other we processed the data in chunks (subsets) composed of one or more seasons. Seasons 9 and 10 were purposely used at the end of the training to allow the model to get a better understanding of the recent changes in the vegetation.

For the final submission we used an ensemble of two SE-ResNext50 and single B1 and B3 models, combining their predictions with the mean, geometric mean and category-aware average.

We use the sigmoid function to enable detection of multiple animal categories on the same image. For EfficientNets we kept the default Dropout [14] of 0.2 and 0.3 for B1 and B3, respectively, and for SE-ResNext-50 we set it to 0.0.

Due to the circular nature of the data (both time and date) and the varying weather conditions (clouds, shades, wind), we could avoid most of the augmentations. We used only random horizontal flips.
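A softmax head would force the 54 categories to compete, so a per-category sigmoid is used instead. A minimal sketch of such a multi-label decision; the category names and logits below are hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, categories, threshold=0.5):
    """Turn per-category logits into independent probabilities.

    Unlike softmax, sigmoid scores do not compete with each
    other, so several categories (e.g. zebra AND wildebeest)
    can exceed the threshold on the same image.
    """
    probs = {c: sigmoid(z) for c, z in zip(categories, logits)}
    present = [c for c, p in probs.items() if p >= threshold]
    return probs, present

# Hypothetical logits for three of the 54 categories:
probs, present = predict_labels([2.0, 1.2, -3.0],
                                ["zebra", "wildebeest", "empty"])
```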
4. Results
Our final submission was a weighted ensemble of 4 models with a total of 6 forward passes per test image. For each image we calculated the arithmetic mean for every animal category and the geometric mean for the empty category, which we refer to as the "class aware average" in Table 3. For sequences with 2 and more images we obtained the final predictions by calculating the arithmetic mean.

(ours)    0.00531
Baseline

Table 3. Results of the top ten teams in the Hakuna Ma-data challenge, calculated as mean aggregated logarithmic loss (AggLogLoss).
Our submission to the challenge achieved the best score in terms of Aggregated Logarithmic Loss (AggLogLoss). The AggLogLoss is defined as follows: for each possible category in a sequence, the binary log loss is computed, and the results are summed. The sum of the binary losses represents the total loss for the sequence. The results of the top 10 teams, together with the baseline score, are listed in Table 3.

In Table 4, we show the dynamic leaderboard progression of tricks and techniques which contributed to our final solution. The analysis was conducted after the competition was finalized and did not include the 2% of corrupted images. Table 3 shows the Private Score on the Hakuna Ma-data challenge, where it can be seen that our system improved the baseline score 9x.

We further compare our system against previously published work on the earlier version of the data set. Our system shows significant improvements (see Table 5), reducing the previously reported error rates. In addition, our system performed better than single human annotators, trailing only the human-in-the-loop approach.
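The metric can be sketched as follows; the epsilon clipping is our assumption, since the organizers' exact numerical handling of hard 0 and 1 predictions is not specified here:

```python
import math

def agg_log_loss(y_true, y_pred, eps=1e-15):
    """Mean aggregated log loss over sequences.

    y_true, y_pred: lists (one entry per sequence) of dicts
    mapping each possible category to a {0,1} label / predicted
    probability.  The binary log losses of all categories are
    summed per sequence; the per-sequence sums are then averaged.
    """
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        seq_loss = 0.0
        for cat, t in truth.items():
            p = min(max(pred[cat], eps), 1.0 - eps)  # avoid log(0)
            seq_loss += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        total += seq_loss
    return total / len(y_true)
```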
5. Discussion
Comparison to previous work. Previously proposed systems [2, 3] have two main drawbacks for real-life application. First, they use two-stage solutions: a) detect empty images and b) classify animals; our contribution is a single-step solution. Secondly, previous work was not capable of handling the presence of multiple species on the same image, as it used a softmax output; we utilized the sigmoid function, which enables multi-label classification. Table 5 shows that our proposed system surpasses previously reported results by a significant margin while using only a single epoch, compared to 50 and 70 epochs in [2] and [3], respectively.
Augmentation not needed.
Submission evaluation            AggLogLoss
EfficientNet B1                  0.00789
EfficientNet B3                  0.01102
SE-ResNext-50 s9                 0.00830
SE-ResNext-50 s10                0.00800
SE-ResNext-50 random             0.00610
SE-ResNext-50 + flip             0.00592
Ensemble 1 (mean)                0.00555
Ensemble 2 (gmean)               0.00582
Ensemble 3 (class aware avrg.)   0.00550

Table 4. Post-competition evaluation on Season 11. 2% of the images were found corrupted and are excluded from this analysis.

Our system features models with low complexity which are fast in terms of both training and inference time. We used no augmentation except random horizontal flips, which allowed us to train for only one epoch. Excluding common augmentation techniques from our pipeline (random crop, shear and rotation) is essential, because many objects of interest are not centered and are only partially visible (e.g., an animal entering the camera's field of view); cropping them would introduce noise and would require more time to converge.
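The "class aware average" (Ensemble 3 in Table 4: arithmetic mean for animal categories, geometric mean for the empty category) can be sketched as below; the model outputs are hypothetical:

```python
import math

def class_aware_average(model_probs, empty_key="empty"):
    """Combine per-model probability dicts into one prediction.

    Animal categories use the arithmetic mean of the model
    outputs; the 'empty' category uses the geometric mean,
    which stays low whenever any single model doubts that the
    image is empty.
    """
    n = len(model_probs)
    out = {}
    for cat in model_probs[0]:
        vals = [m[cat] for m in model_probs]
        if cat == empty_key:
            out[cat] = math.prod(vals) ** (1.0 / n)
        else:
            out[cat] = sum(vals) / n
    return out
```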
Surpassing human performance on embedded edge devices.
While comparing our system, we significantly improved the current state-of-the-art on Snapshot Serengeti data in all the cases. Additionally, our system performed better than single human annotators on recognizing images without animals, trailing only the human-in-the-loop approach.

Study                  Acc      Empty Acc   Epochs
Willi [3]              93.4%    96.0%       70
Norouzzadeh [2]        93.8%    96.8%       50
Human [4]              96.6%    96.6%       -
Human-in-the-loop [3]  99.78%   99.78%      -
Ours                   94.3%    97%*        1

Table 5. Comparison to other studies. * denotes the result provided by the challenge organizers.
6. Conclusion
The article describes our winning submission to the "Hakuna Ma-data - Serengeti Wildlife Identification challenge". The proposed system achieved an outstanding score of 0.00531 AggLogLoss while recognizing empty images with 97% Accuracy.

Furthermore, we briefly discuss the benefits that biologists might obtain by using such a system, including but not limited to: 1) reducing the manual annotation workload, thus increasing the efficiency of biology research, and 2) minimizing the number of images that have to be processed, as there is a lightweight model available which can clean false positives directly on the camera trap device.

All the source code and model weights are freely available at https://github.com/drivendataorg/hakuna-madata/tree/master/1st%20Place
7. Future Work
In future work we would like to study the problem of poor generalisation to new locations or geographic regions. We expect that our approach should not suffer from generalisation issues, since the models are trained for only a single epoch. Additionally, we would like to create a system that can handle novel classes. Lastly, deployment of a system with human-level accuracy, small power consumption and low complexity directly on the edge device might help to reduce the number of images taken.
Acknowledgements
This work was supported by the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 642241 to MV. LP was supported by the Ministry of Education, Youth and Sports of the Czech Republic project No. LO1506, and by the grant of the UWB project No. SGS-2019-027.

References

[1] A. Swanson, M. Kosmala, C. Lintott, R. Simpson, A. Smith, and C. Packer, "Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna," Scientific Data, vol. 2, p. 150026, 2015.
[2] M. S. Norouzzadeh, A. Nguyen, M. Kosmala, A. Swanson, M. S. Palmer, C. Packer, and J. Clune, "Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning," Proceedings of the National Academy of Sciences, vol. 115, no. 25, pp. E5716–E5725, 2018.
[3] M. Willi, R. T. Pitman, A. W. Cardoso, C. Locke, A. Swanson, A. Boyer, M. Veldthuis, and L. Fortson, "Identifying animal species in camera trap images using deep learning and citizen science," Methods in Ecology and Evolution, vol. 10, no. 1, pp. 80–91, 2019.
[4] A. Swanson, M. Kosmala, C. Lintott, and C. Packer, "A generalized approach for producing, quantifying, and validating citizen science data from wildlife images," Conservation Biology, vol. 30, no. 3, pp. 520–531, 2016.
[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, pp. 211–252, Dec. 2015.
[6] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[7] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500, 2017.
[8] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[9] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," arXiv preprint arXiv:1905.11946, 2019.
[10] L. N. Smith, "A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay," arXiv preprint arXiv:1803.09820, 2018.
[11] L. N. Smith, "No more pesky learning rate guessing games," CoRR, vol. abs/1506.01186, 2015.
[12] L. N. Smith and N. Topin, "Super-convergence: Very fast training of residual networks using large learning rates," CoRR, vol. abs/1708.07120, 2017.
[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.