Video-based Hierarchical Species Classification for Longline Fishing Monitoring
Jie Mei, Jenq-Neng Hwang, Suzanne Romain, Craig Rose, Braden Moore, Kelsey Magrane
University of Washington, Seattle, WA 98195, USA
{jiemei, hwang}@uw.edu
https://ipl-uw.github.io/

EM Research and Development, National Oceanic and Atmospheric Administration (NOAA) Affiliate, Pacific States Marine Fisheries Commission, Seattle, WA 98115, USA
{suzanne.romain, craig.rose, braden.j.moore, kelsey.magrane}@noaa.gov

Abstract.
The goal of electronic monitoring (EM) of longline fishing is to monitor the fish-catching activities on fishing vessels, either for regulatory compliance or for catch counting. Video-based hierarchical classification allows for inexpensive and efficient species identification of longline catches, where fish undergo severe deformation and self-occlusion during the catching process. More importantly, the flexibility of hierarchical classification mitigates the laborious effort of human review by providing confidence scores at different hierarchical levels. Related works either use cascaded models for hierarchical classification, make predictions per image, or predict an overlapping hierarchical structure of the dataset in advance. In contrast, given a known non-overlapping hierarchical data structure provided by fisheries scientists, our method enforces that structure and introduces an efficient training and inference strategy for video-based fisheries data. Our experiments show that the proposed method significantly outperforms a classic flat classification system, and our ablation study justifies our contributions in CNN model design, training strategy, and video-based inference schemes for the hierarchical fish species classification task.
Keywords:
Electronic monitoring · Hierarchical classification · Video-based classification · Longline fishing.
1 Introduction

Automated imagery analysis techniques have drawn increasing attention in fisheries science and industry [1–3, 7, 8, 14–16, 20], because they are more scalable and deployable than conventional manual survey and monitoring approaches. One of the emerging fisheries monitoring methods is electronic monitoring (EM), which can effectively take advantage of automated imagery analysis for fisheries activities [7]. The goal of EM is to monitor fish captures on fishing vessels, either for catch counting or regulatory compliance. Fisheries managers need to assess the amount of fish caught by species and size to monitor catch quotas by vessel or fishery. Such data are also used in analyses to evaluate the status of fish stocks. Managers also need to detect the retention of specific fish species, or sizes of particular species, that are not allowed to be kept. Therefore, accurate detection, segmentation, length measurement, and species identification are critically needed in EM systems.
Especially in EM systems, a hierarchical classifier is more meaningful for fisheries than a flat classifier with a standard softmax output layer. A hierarchical classifier can predict coarse-level groups and fine-level species at the same time. If the system predicts some images with high confidence in one coarse-level group but with low confidence in the corresponding fine-level species, the hierarchical classifier stops prediction of those images at the correct coarse-level group, allowing fisheries personnel to assign the corresponding experts to review those images and obtain the correct fine-level labels.

To address these hierarchical classification needs, in this paper we develop a video-based hierarchical species classification system for longline fishing monitoring, where fish are caught on hooks and viewed as they are pulled up from the sea and over the rail of the fishing vessel, as shown in Fig. 1.
Fig. 1.
Longline Fishing: Each column is a sequence of an individual fish caught on a longline hook as it is being pulled up from the sea and over the rail of the fishing vessel.
The proposed hierarchical prediction, which allows a coarse-level prediction to be the final output if the fine-level confidence score is too low, improves accuracy on tail-class species when the training data follow a long-tail (imbalanced) distribution.
Our contributions can be summarized as follows: 1) a CNN architecture that enforces an effective hierarchical data structure; 2) an efficient training strategy; 3) two robust video-based hierarchical inference schemes.

The remaining sections of this paper are organized as follows. Section 2 reviews related work on flat classifiers with the standard softmax output layer and on hierarchical classifiers. Section 3 describes the proposed system in detail. The experimental results, including the ablation study, are presented and discussed in Section 4. Finally, Section 5 concludes this work.
2 Related Work

2.1 Flat Classifiers

We use 'flat classifiers' to refer to all deep learning classification systems with softmax as the final layer to normalize the outputs over all classes, without introducing any hierarchical level of prediction. AlexNet [11] was the first CNN-based winner of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), introducing a 1000-way softmax layer for classifying the 1000 object classes. The subsequent ILSVRC winners, VGGNet [12], GoogLeNet [13], and ResNet [5], continued to use softmax as the final layer to achieve good performance. To this day, flat classifiers with a softmax operation as the final layer are the dominant design for classification tasks.
2.2 Hierarchical Classifiers

A hierarchical classifier outputs confidence scores at every level of the hierarchical data structure. One obvious advantage is that if the confidence score of a sample is too low at the fine level but very high at the coarse level, the coarse-level prediction can be used as the final prediction. In contrast, flat classifiers have no alternative when the confidence score of the final prediction is too low.

Hand-crafted features are used in [6] for hierarchical fish species classification. Hierarchical medical image classification [10] and text classification [9] use cascaded flat classifiers as their hierarchical classifiers, with one flat classifier per level. They stack CNN-based models with flat classifiers without considering any hierarchical architecture design, which increases computational complexity. HD-CNN [18] introduces confidence-score multiplication operations to enforce a hierarchical data structure, but the model uses the same feature maps for both the coarse and fine levels, so it learns an overlapping hierarchy of the training data. B-CNN [19] uses different feature maps for different levels' predictions without enforcing any hierarchical data structure in the architecture. Deep RTC [17] adopts hierarchical classification to deal with long-tailed recognition, improving the accuracy of tail classes. It adopts a simple confidence-score thresholding method, which we also adopt in our approach, to decide whether to output a fine-level or a coarse-level prediction. However, Deep RTC predicts an overlapping hierarchical data structure in the first place, which differs from our setting.
3 Proposed System

3.1 Dataset

The hierarchical dataset used to train our system was professionally labeled and provided by the Fisheries Monitoring and Analysis (FMA) Division of the Alaska Fisheries Science Center (AFSC) of NOAA. Researchers can contact AFSC directly for permission to access this dataset and the corresponding hierarchical data structure.
Fig. 2.
Hierarchical Data Structure: The dataset, labeled and provided by NOAA fisheries scientists, includes frames and the corresponding labels: bounding-box location, start- and end-frame IDs of each individual fish, coarse-level group ground truth, and fine-level species ground truth. The sample images shown here are randomly chosen from the dataset.
To construct the dataset used for our system, we use the labeled bounding-box locations to crop objects from the raw videos, and the labeled start- and end-frame IDs of each individual fish to divide the raw videos into individual tracks (video clips). There are 6 coarse-level groups and 31 fine-level species in this hierarchical dataset (see Fig. 2). Our dataset is challenging because some fine-level species are very similar to one another. The total number of frames is 186,592 (see Fig. 3(a)), and the total number of video clips/tracks is 3,021 (see Fig. 3(b)). Each video clip contains one individual fish pulled up from the sea surface onto the fishing vessel during longline fishing activities.

Fig. 3.

Dataset Distribution: (a) image–species distribution; (b) track–species distribution. In both figures, the black numbers in the left column are the numbers of images or tracks for training, while the green numbers are for evaluation. The data are split 80%/20% between training and evaluation.
3.2 Hierarchical Classification Model

Instead of using cascaded flat classifiers for species identification in longline fishing, and inspired by the success of Mask R-CNN [4], which feeds shared feature maps extracted from a backbone to different heads for object classification and instance segmentation at the same time, our proposed architecture is an end-to-end trainable network consisting of two parts: a backbone and several hierarchical classification heads (see Fig. 4). Inspired by B-CNN [19], we use ResNet101 as the backbone to extract shallow feature maps from images for 'Head-1' and shared deeper feature maps for the other 6 classification heads. Head-1 performs coarse-level (6 groups) classification, and Head-2 to Head-7 perform fine-level (31 species in total) predictions.
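To make the head layout concrete, the following plain-Python sketch (with random logits standing in for real network outputs; the per-group species counts are illustrative, not the dataset's actual grouping) builds one coarse softmax over the 6 groups and six fine-level softmaxes, then combines them by the confidence-score product described in the next subsection:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

random.seed(0)

# Species counts per fine-level head (illustrative split; the real dataset
# has 6 groups and 31 species in total).
species_per_group = [4, 3, 9, 6, 7, 2]

# Head-1: coarse scores P_{1,j-1} over the 6 groups.
coarse = softmax([random.gauss(0, 1) for _ in range(6)])

# Heads 2..7: fine scores P_{j,i} within each group.
fine = [softmax([random.gauss(0, 1) for _ in range(n)])
        for n in species_per_group]

# Final scores P'_{j,i} = P_{1,j-1} * P_{j,i}  (Eq. 1).
final = [[coarse[j] * p for p in fine[j]] for j in range(6)]

# The 31 products form a valid distribution over all species (Eq. 2).
total = sum(sum(row) for row in final)
print(abs(total - 1.0) < 1e-9)  # True
```

Because each fine-level softmax sums to 1 within its group, multiplying by the coarse scores redistributes exactly the coarse probability mass, so the 31 products always sum to 1.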
Enforcing Hierarchical Data Structure
We use confidence-score multiplication operations to enforce the hierarchical data structure in our system. The final confidence score of a specific species is the product of the confidence score of the corresponding coarse-level group and the confidence score of that specific (fine-level) species:

P'_{j,i} = P_{1,j-1} · P_{j,i},  j ∈ [2, 7],  (1)

where P_{1,j-1} is the confidence score of the (j-1)-th group at the coarse level, P_{j,i} is the confidence score of the i-th species of the (j-1)-th group, and P'_{j,i} is the final confidence score of the i-th species of the (j-1)-th group. As a result, the final confidence score P'_{j,i} combines the coarse-level and fine-level scores, so the CNN architecture enforces the hierarchical data structure when the products are used to compute the training loss. The training loss is meaningful when using P'_{j,i} because the final layer of every head is a softmax, so that

Σ_i P_{j,i} = 1 for each j, and Σ_{j=2}^{7} Σ_i P'_{j,i} = 1.  (2)

Fig. 4.

Hierarchical Architecture: We call our 7 classification heads 'Hierarchical Heads'. Head-1 handles the 6 coarse-level groups and uses shallower feature maps extracted from the backbone, while the remaining 6 heads handle the fine levels: Head-2 for the 'Skates' group, Head-3 for 'Sharks', Head-4 for 'Roundfish', Head-5 for 'Flatfishes', Head-6 for 'Rockfishes', and Head-7 for 'Invertebrates'. All fine-level heads use shared deeper feature maps from the same backbone. Head-1 has two fully connected layers followed by a softmax layer; each of the 6 fine-level heads has one fully connected layer followed by a softmax layer.

Efficient Training Strategy
During image-based training, each input image has both a labeled coarse-level ground truth and a fine-level ground truth. For our architecture, we experimented with two options for using these two ground truths.

The first option trains 'Head-1' and the fine-level head corresponding to the ground-truth coarse-level group. Since the corresponding fine-level head is selected by the ground-truth coarse-level group, losses are calculated only for these two heads:
Loss_1 = − Σ_i y_{1,i} · log(P_{1,i}) − Σ_i y_{j,i} · log(P'_{j,i}),  (3)

where the first summation is the cross-entropy loss of 'Head-1' and y_{1,i} is the coarse-level ground truth. The second summation is the cross-entropy loss of the corresponding fine-level head using the final predictions P'_{j,i} after the confidence-score multiplication operations; y_{j,i} is the ground truth among the species within this fine-level head. This regular loss does not involve P'_{j,i} from the other heads; therefore, it does not fully enforce the hierarchical data structure during training and only trains two heads at a time.

The second option trains 'Head-1' and all fine-level heads using the final predictions P'_{j,i} after the confidence-score multiplication operations. This is a more efficient training strategy because the multiplication operations fully enforce the hierarchical data structure during training and all heads are trained simultaneously:

Loss_2 = − Σ_i y_{1,i} · log(P_{1,i}) − Σ_{j=2}^{7} Σ_i y'_{j,i} · log(P'_{j,i}),  (4)

where y'_{j,i} denotes the ground truth among the 31 species. Given one input image, we can calculate the cross entropy over all final predictions P'_{j,i} because, after the confidence-score multiplication operations, these products still sum to 1.

Video-based Inference Schemes
Although we use image-based training, where the training loss is calculated on each individual input image, two video-based (track-based) inference methods are implemented and compared. Since our system outputs the confidence scores of the 31 species, P'_{j,i}, for each input frame, the first scheme picks, as the prediction for each track, the species with the maximum average confidence score over all frames of that track.

The second scheme picks the species with the maximum confidence score for every frame of a track, then uses a majority vote over frames to select one species as the prediction for that track; the average confidence score is then computed over only the frames corresponding to the selected species. We report performance under both video-based inference schemes, along with image-based confidence scores, in the following section. The two schemes can be summarized as

p_{1,i} = (1/T) · Σ_t P_{1,i,t},  p_{2,i} = (1/T) · Σ_t P'_{j,i,t},  j ∈ [2, 7],  (5)

where t is the frame index and P'_{j,i,t} is P'_{j,i} at the t-th frame. In the first scheme, T is the total number of frames in a video clip (a track from the start frame to the end frame of one catch), while in the second scheme it is the number of frames corresponding to the selected species in that clip. As a result, p_{1,i} is the video-based average confidence score over the 6 groups and p_{2,i} is the video-based average confidence score over the 31 species.
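As a concrete sketch of the two inference schemes, the plain-Python functions below operate on a toy track of per-frame species scores (3 frames, 3 species standing in for the real 31-species P'_{j,i,t} outputs; all values are illustrative):

```python
from collections import Counter

def avg_confidence_inference(track_scores):
    """Scheme 1: average the species scores over all T frames of the track,
    then pick the species with the maximum average score."""
    T = len(track_scores)
    n = len(track_scores[0])
    avg = [sum(frame[i] for frame in track_scores) / T for i in range(n)]
    best = max(range(n), key=lambda i: avg[i])
    return best, avg[best]

def majority_vote_inference(track_scores):
    """Scheme 2: take the argmax species per frame, majority-vote a species,
    then average the scores of only the frames that voted for it."""
    votes = [max(range(len(f)), key=lambda i: f[i]) for f in track_scores]
    winner, _ = Counter(votes).most_common(1)[0]
    selected = [f[winner] for f, v in zip(track_scores, votes) if v == winner]
    return winner, sum(selected) / len(selected)

# Toy track: 3 frames x 3 species.
track = [[0.6, 0.3, 0.1],
         [0.2, 0.7, 0.1],
         [0.5, 0.4, 0.1]]

print(avg_confidence_inference(track))   # species 1 wins on average score
print(majority_vote_inference(track))    # species 0 wins the frame vote
```

Note that the two schemes can disagree on a track, as in this toy example: frame 2's strong score for species 1 dominates the average, while species 0 wins the per-frame majority vote.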
4 Experimental Results

We use a video-based data split: each short video clip (track) is associated with one individual fish, and all frames from 80% of the tracks are used as training data for image-based training. All frames from the remaining 20% of tracks are the evaluation data (see Fig. 3). As a result, the training and evaluation images come entirely from tracks of different individual fish. All hyper-parameters, such as training epochs, learning rate, and data augmentation, are kept the same across the competing approaches below.
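A minimal sketch of this track-level split (with integer track IDs standing in for the real labeled tracks; the function name and seed are illustrative), which guarantees that frames of the same individual fish never appear in both the training and evaluation sets:

```python
import random

def split_by_track(track_ids, train_frac=0.8, seed=42):
    """Split at the track (individual fish) level, not the frame level,
    so no fish contributes frames to both sets."""
    ids = sorted(set(track_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = int(len(ids) * train_frac)
    return set(ids[:cut]), set(ids[cut:])

tracks = list(range(3021))            # the dataset has 3,021 tracks
train_ids, eval_ids = split_by_track(tracks)
print(len(train_ids), len(eval_ids))  # 2416 605
```

Splitting by track rather than by frame avoids leakage: consecutive frames of one fish are nearly identical, so a frame-level split would inflate evaluation accuracy.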
The dominant species classification architecture extracts deep features using a CNN followed by a flat classifier. For the baseline, we therefore use ResNet101 as the backbone and two fully connected layers followed by a 31-way softmax layer as the flat classifier head, a classic deep learning classification architecture. During training, we use only the fine-level ground truth to calculate the cross-entropy loss on the flat classifier's output confidence scores over the 31 species, with no coarse-level predictions.
From Table 1, we can see that the accuracy of the baseline is far below that of our hierarchical method.

Using all frames from the remaining 20% of tracks for evaluation, we apply the following evaluation methods, computing both image-based accuracy and video-based (track-based) accuracy, denoted in the 'Unit' column of Table 1. We also calculate classification accuracy at the coarse level against the coarse-level ground truth, denoted 'Level-1' in Table 1.

Moreover, with confidence scores at both the coarse and fine levels, we can pick the species with the maximum fine-level confidence score within the group with the maximum coarse-level confidence score as the final prediction, denoted 'Level-2 A' in Table 1. Alternatively, with the final confidence scores over the 31 species, we can directly pick the species with the maximum product of coarse- and fine-level confidence scores, denoted 'Level-2 B' in Table 1. For these two metrics ('Level-2 A' and 'Level-2 B') in the video-based schemes, we further use either the maximum average confidence score (denoted 'video') or majority vote (denoted 'video∗') to report performance, as discussed under 'Video-based Inference Schemes' in Section 3.2.

Finally, with the final confidence scores P'_{j,i}, j ∈ [2, 7], a prediction can stop at the coarse level, as discussed in the 'Video-based Inference Schemes' section, using a threshold. This metric, which is able to stay at the coarse level, is denoted 'Level-2 C' in Table 1. Theoretically, the ceiling of 'Level-2 C' is 'Level-1' if all samples stop at the coarse level. We therefore use a greedy search to find a threshold for each scheme in Table 1 such that, after stopping at the coarse level, the overall video-based inference accuracy does not degrade. We fix these thresholds in image-based inference for every competing scheme.
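The 'Level-2 C' decision rule can be sketched as follows (plain Python; the threshold value and the toy score tables are illustrative, and the real thresholds come from the greedy search described above):

```python
def hierarchical_predict(coarse_scores, final_scores, threshold):
    """Output a fine-level species if its final (product) confidence clears
    the threshold; otherwise stop at the best coarse-level group."""
    # Best species over all groups, by final score P'_{j,i}.
    g, s = max(
        ((j, i) for j, row in enumerate(final_scores) for i in range(len(row))),
        key=lambda gi: final_scores[gi[0]][gi[1]],
    )
    if final_scores[g][s] >= threshold:
        return ("species", g, s)
    # Fall back to the most confident coarse-level group.
    return ("group", max(range(len(coarse_scores)),
                         key=lambda j: coarse_scores[j]))

# Toy scores: 3 groups; each group's final scores sum to its coarse score.
coarse = [0.7, 0.2, 0.1]
final = [[0.30, 0.25, 0.15], [0.12, 0.08], [0.10]]

print(hierarchical_predict(coarse, final, threshold=0.5))  # ('group', 0)
print(hierarchical_predict(coarse, final, threshold=0.2))  # ('species', 0, 0)
```

With a high threshold the ambiguous fine-level scores (0.30 vs. 0.25) cause the prediction to stop at the confident coarse group, which is exactly the behavior that lets uncertain samples be routed to human experts.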
Table 1.

Comparison with Flat Classifier and Ablation Study: 'video' denotes video-based inference using the average confidence score over the 31 species to pick one predicted species per track; 'video∗' denotes video-based inference using a majority vote to pick one species per track. The two numbers following the accuracy value in the 'Level-2 C' column are the total numbers of samples stopping at the coarse level and proceeding to the fine level, respectively.

Model | Unit | Level-1 | Level-2 A | Level-2 B | Level-2 C
Baseline | img | - | - | 78.3 | -
Scheme-1 | img | 86.3 | 77.4 | 77.4 | 82.0 (8567, 27393)
video ∗ video video ∗ video video ∗ video (293, 324)

From Table 1, we can see that video-based inference is always better than image-based inference for all competing schemes, and the two video-based inference methods, average confidence and majority vote, are comparable with each other. Scheme-3 is our full system shown in Fig. 4, which includes the confidence-score multiplication operations to enforce the hierarchical data structure and uses the efficient training strategy (Loss_2). Scheme-2 removes only the efficient training strategy and uses Loss_1 instead. Scheme-1 removes the confidence-score multiplication operations from the architecture but keeps the 7 heads; it also removes the efficient training strategy and instead uses standard cross-entropy losses on 'Head-1' and the fine-level head corresponding to the ground-truth coarse-level group. Scheme-1 shares the same architecture as B-CNN [19]. When evaluating Scheme-1 under 'Level-2 B' and 'Level-2 C', we must first multiply the coarse-level confidence scores with the fine-level confidence scores to obtain final confidence scores.

Detailed coarse-level and fine-level accuracy of Scheme-3 (our complete proposed system), based on the maximum average confidence score (denoted 'video' in Table 1), is shown in Fig. 5.

Fig. 5.
Detailed Accuracy on the Coarse Level and Fine Level of the complete proposed system: the orange bars are image-based inference and the blue bars are video-based inference. Panel (d) uses the 'Level-2 C' evaluation method. Most tail-class species stop at the coarse-level prediction, which yields the 5.5% improvement in overall video-based accuracy shown in Table 1.
Scheme-1 and Scheme-2 are implemented mainly for ablation purposes.
Comparing Scheme-1 with Scheme-2 in Table 1, we can see that the confidence-score multiplication operations effectively enforce the hierarchical data structure and improve performance even though Scheme-2 trains only two heads at a time. Comparing Scheme-2 with Scheme-3, we can see that our efficient training strategy (Loss_2) improves performance by fully enforcing the hierarchical data structure during training.

Under 'Level-2 C', the competing systems' final predictions can stop at the coarse level if the final confidence score is lower than the greedily searched threshold mentioned in the previous section. We refer to 'Level-2 C' as hierarchical prediction, which is one big advantage of hierarchical classifiers over flat classifiers: it allows fisheries managers to assign the corresponding experts to review those images within a given group and obtain the correct fine-level labels. Moreover, from Fig. 5(d), we can see that most tail-class species identifications stop at the coarse level, resulting in significantly higher overall accuracy under 'Level-2 C' than under 'Level-2 B' in Table 1. Our full system, Scheme-3, also has the greatest number of images and tracks proceeding to the fine level while achieving the best performance.
5 Conclusion

We proposed an efficient hierarchical CNN classifier that enforces a hierarchical data structure for fish species identification, combined with an efficient training strategy and two video-based inference schemes. Our experiments show that the integrated use of these three main strategies clearly improves accuracy. Additionally, hierarchical prediction allows images that cannot be confidently classified at the fine level to be confidently classified at a coarse level for later expert examination, which significantly improves overall accuracy on tail-class species identification by stopping at coarse-level predictions. Moreover, our method greatly outperforms the baseline flat classifier. Future work will be devoted to adding techniques such as data sampling or additional training losses for tail-class species identification; it would also be interesting to combine more strategies for long-tailed data with hierarchical classification.

References

[1] Chuang, M.C., Hwang, J.N., Rose, C.S.: Aggregated segmentation of fish from conveyor belt videos. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 1807–1811. IEEE (2013)
[2] Chuang, M.C., Hwang, J.N., Williams, K., Towler, R.: Tracking live fish from low-contrast and low-frame-rate stereo videos. IEEE Transactions on Circuits and Systems for Video Technology (1), 167–179 (2014)
[3] Gupta, S., Abu-Ghannam, N., Massini, R., Mottiar, Y., Altosaar, I., Peleg, M., Normand, M.D., Corradini, M.G.: Trends in application of imaging technologies to inspection of fish and
[4] He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. CoRR abs/1703.06870 (2017), http://arxiv.org/abs/1703.06870
[5] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
[6] Huang, P.X., Boom, B.J., Fisher, R.B.: Hierarchical classification with reject option for live fish recognition. Machine Vision and Applications (1), 89–102 (2015)
[7] Huang, T.W., Hwang, J.N., Romain, S., Wallace, F.: Live tracking of rail-based fish catching on wild sea surface. In: 2016 ICPR 2nd Workshop on Computer Vision for Analysis of Underwater Imagery (CVAUI). pp. 25–30. IEEE (2016)
[8] Huang, T.W., Hwang, J.N., Rose, C.S.: Chute based automated fish length measurement and water drop detection. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1906–1910. IEEE (2016)
[9] Kowsari, K., Brown, D.E., Heidarysafa, M., Meimandi, K.J., Gerber, M.S., Barnes, L.E.: HDLTex: Hierarchical deep learning for text classification. In: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). pp. 364–371. IEEE (2017)
[10] Kowsari, K., Sali, R., Ehsan, L., Adorno, W., Ali, A., Moore, S., Amadi, B., Kelly, P., Syed, S., Brown, D.: HMIC: Hierarchical medical image classification, a deep learning approach. Information (6), 318 (2020)
[11] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
[12] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[13] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–9 (2015)
[14] Wang, G., Hwang, J.N., Williams, K., Cutter, G.: Closed-loop tracking-by-detection for ROV-based multiple fish tracking. In: 2016 ICPR 2nd Workshop on Computer Vision for Analysis of Underwater Imagery (CVAUI). pp. 7–12. IEEE (2016)
[15] White, D.J., Svellingen, C., Strachan, N.J.: Automated measurement of species and length of fish by computer vision. Fisheries Research (2-3), 203–210 (2006)
[16] Williams, K., Lauffenburger, N., Chuang, M.C., Hwang, J.N., Towler, R.: Automated measurements of fish within a trawl using stereo images from a camera-trawl device (CamTrawl). Methods in Oceanography, 138–152 (2016)
[17] Wu, T.Y., Morgado, P., Wang, P., Ho, C.H., Vasconcelos, N.: Solving long-tailed recognition with deep realistic taxonomic classifier. arXiv preprint arXiv:2007.09898 (2020)
[18] Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., Di, W., Yu, Y.: HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (December 2015)
[19] Zhu, X., Bain, M.: B-CNN: Branch convolutional neural network for hierarchical classification. arXiv preprint arXiv:1709.09890 (2017)
[20] Zion, B.: The use of computer vision technologies in aquaculture – a review. Computers and Electronics in Agriculture 88