Video-based Hierarchical Species Classification for Longline Fishing Monitoring
Jie Mei, Jenq-Neng Hwang, Suzanne Romain, Craig Rose, Braden Moore, Kelsey Magrane
University of Washington, Seattle, WA 98195, USA
{jiemei, hwang}@uw.edu
https://ipl-uw.github.io/

EM Research and Development, National Oceanic and Atmospheric Administration (NOAA) Affiliate, Pacific States Marine Fisheries Commission, Seattle, WA 98115, USA
{suzanne.romain, craig.rose, braden.j.moore, kelsey.magrane}@noaa.gov

Abstract.
The goal of electronic monitoring (EM) of longline fishing is to monitor the fish-catching activities on fishing vessels, either for regulatory compliance or for catch counting. Video-based hierarchical classification allows for inexpensive and efficient species identification of longline catches, where fish undergo severe deformation and self-occlusion during the catching process. More importantly, the flexibility of hierarchical classification mitigates the laborious effort of human review by providing confidence scores at different hierarchical levels. Related works either use cascaded models for hierarchical classification, make predictions per image, or predict an overlapping hierarchical structure of the dataset in advance. In contrast, given a known non-overlapping hierarchical data structure provided by fisheries scientists, our method enforces that structure and introduces an efficient training and inference strategy for video-based fisheries data. Our experiments show that the proposed method significantly outperforms a classic flat classification system, and our ablation study justifies our contributions in CNN model design, training strategy, and video-based inference schemes for the hierarchical fish species classification task.
Keywords:
Electronic monitoring · Hierarchical classification · Video-based classification · Longline fishing.
1 Introduction

Automated imagery analysis techniques have drawn increasing attention in fisheries science and industry [1–3, 7, 8, 14–16, 20], because they are more scalable and deployable than conventional manual survey and monitoring approaches. One of the emerging fisheries monitoring methods is electronic monitoring (EM), which can effectively take advantage of automated imagery analysis for fisheries activities [7]. The goal of EM is to monitor fish captures on fishing vessels, either for catch counting or regulatory compliance. Fisheries managers need to assess the amount of fish caught by species and size to monitor catch quotas by vessel or fishery. Such data are also used in analyses to evaluate the status of fish stocks. Managers also need to detect the retention of specific fish species, or sizes of particular species, that are not allowed to be kept. Therefore, accurate detection, segmentation, length measurement, and species identification are critically needed in EM systems.
Especially in EM systems, a hierarchical classifier is more meaningful for fisheries than a flat classifier with a standard softmax output layer. A hierarchical classifier can predict coarse-level groups and fine-level species at the same time. If the system predicts some images with high confidence in one coarse-level group but with low confidence in the corresponding fine-level species, the hierarchical classifier stops prediction of those images at the correct coarse-level group, allowing fisheries personnel to assign the corresponding experts to review those images and obtain the correct fine-level labels.

To address these hierarchical classification needs, in this paper we develop a video-based hierarchical species classification system for longline fishing monitoring, where fish are caught on hooks and viewed as they are pulled up from the sea and over the rail of the fishing vessel, as shown in Fig. 1.
Fig. 1.
Longline Fishing: Each column is a sequence of an individual fish caught on a longline hook as it is being pulled up from the sea and over the rail of the fishing vessel.
The proposed hierarchical prediction, which allows a coarse-level prediction to be the final output if the fine-level confidence score is too low, improves accuracy on tail-class species when the training data follow a long-tail (imbalanced) distribution.
Our contributions can be summarized as follows: 1) a CNN architecture that enforces an effective hierarchical data structure; 2) an efficient training strategy; 3) two robust video-based hierarchical inference schemes.

The remaining sections of this paper are organized as follows. Section 2 reviews related work on flat classifiers with the standard softmax output layer and on hierarchical classifiers. Section 3 describes the proposed system in detail. The experimental results, including the ablation study, are presented and discussed in Section 4. Finally, Section 5 concludes this work.
2 Related Work

2.1 Flat Classifiers

We use 'flat classifiers' to refer to all deep learning classification systems with softmax as the final layer to normalize the outputs over all classes, without introducing any hierarchical level of prediction. AlexNet [11] was the first CNN-based winner of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), introducing a 1000-way softmax layer for classifying the 1000 object classes. The subsequent ILSVRC winners, VGGNet [12], GoogLeNet [13], and ResNet [5], continued to use softmax as the final layer to achieve good performance. To this day, flat classifiers with a softmax operation as the final layer are the dominant design for classification tasks.
2.2 Hierarchical Classifiers

A hierarchical classifier outputs confidence scores at every level of the hierarchical data structure. One obvious advantage is that if the confidence score of a sample is too low at the fine level but very high at the coarse level, the coarse-level prediction can be used as the final prediction. In contrast, flat classifiers have no alternative when the confidence score of the final prediction is too low.

Hand-crafted features are used in [6] for hierarchical fish species classification. Hierarchical medical image classification [10] and text classification [9] use cascaded flat classifiers as their hierarchical classifiers, with one flat classifier per level. They stack CNN-based models with flat classifiers without considering any hierarchical architecture design, which increases computational complexity. HD-CNN [18] introduces confidence-score multiplication operations to enforce a hierarchical data structure, but the model uses the same feature maps for both the coarse and fine levels, so it learns an overlapping hierarchy of the training data. B-CNN [19] uses different feature maps for different levels' predictions without enforcing any hierarchical data structure in the architecture. Deep RTC [17] adopts hierarchical classification to deal with long-tailed recognition, improving the accuracy of tail classes. It adopts a simple confidence-score thresholding method, which we also adopt in our approach, to decide whether to output a fine-level or a coarse-level prediction. However, Deep RTC predicts an overlapping hierarchical data structure in the first place, which differs from our setting.
3 Proposed System

3.1 Dataset

The hierarchical dataset used to train our system was professionally labeled and provided by the Fisheries Monitoring and Analysis (FMA) Division of the Alaska Fisheries Science Center (AFSC) of NOAA. Researchers can contact AFSC directly for permission to access this dataset and the corresponding hierarchical data structure.
Fig. 2.
Hierarchical Data Structure: The dataset, labeled and provided by NOAA fisheries scientists, includes frames and the corresponding labels: bounding-box location, start- and end-frame IDs of each individual fish, coarse-level group ground truth, and fine-level species ground truth. The sample images shown here are randomly chosen from the dataset.
To construct the dataset used for our system, we use the labeled bounding-box locations to crop objects from the raw videos, and the labeled start- and end-frame IDs of each individual fish to divide the raw videos into individual tracks (video clips). There are 6 coarse-level groups and 31 fine-level species in this hierarchical dataset (see Fig. 2). Our dataset is challenging because some fine-level species are very similar to one another. The total number of frames is 186,592 (see Fig. 3(a)), and the total number of video clips/tracks is 3,021 (see Fig. 3(b)). Each video clip contains one individual fish pulled up from the sea surface onto the fishing vessel during longline fishing activities.

Fig. 3.

Dataset Distribution: (a) image–species distribution; (b) track–species distribution. In both figures, the black numbers in the left column are the numbers of images or tracks for training, while the green numbers are for evaluation. The data are split 80%/20% between training and evaluation.
3.2 Hierarchical Classification Model

Instead of using cascaded flat classifiers for species identification in longline fishing, and inspired by the success of Mask R-CNN [4], which feeds shared feature maps extracted from a backbone to different heads for object classification and instance segmentation at the same time, our proposed architecture is an end-to-end trainable network consisting of two parts: a backbone and several hierarchical classification heads (see Fig. 4). Inspired by B-CNN [19], we use ResNet101 as the backbone to extract shallow feature maps from images for 'Head-1' and shared deeper feature maps for the other 6 classification heads. Head-1 performs coarse-level (6 groups) classification, and Head-2 to Head-7 perform fine-level (31 species in total) predictions.
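To make the head layout concrete, the following plain-Python sketch (with random logits standing in for real network outputs; the per-group species counts are illustrative, not the dataset's actual grouping) builds one coarse softmax over the 6 groups and six fine-level softmaxes, then combines them by the confidence-score product described in the next subsection:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

random.seed(0)

# Species counts per fine-level head (illustrative split; the real dataset
# has 6 groups and 31 species in total).
species_per_group = [4, 3, 9, 6, 7, 2]

# Head-1: coarse scores P_{1,j-1} over the 6 groups.
coarse = softmax([random.gauss(0, 1) for _ in range(6)])

# Heads 2..7: fine scores P_{j,i} within each group.
fine = [softmax([random.gauss(0, 1) for _ in range(n)])
        for n in species_per_group]

# Final scores P'_{j,i} = P_{1,j-1} * P_{j,i}  (Eq. 1).
final = [[coarse[j] * p for p in fine[j]] for j in range(6)]

# The 31 products form a valid distribution over all species (Eq. 2).
total = sum(sum(row) for row in final)
print(abs(total - 1.0) < 1e-9)  # True
```

Because each fine-level softmax sums to 1 within its group, multiplying by the coarse scores redistributes exactly the coarse probability mass, so the 31 products always sum to 1.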
Enforcing Hierarchical Data Structure
We use confidence-score multiplication operations to enforce the hierarchical data structure in our system. The final confidence score of a specific species is the product of the confidence score of the corresponding coarse-level group and the confidence score of that specific (fine-level) species:

P'_{j,i} = P_{1,j-1} · P_{j,i},  j ∈ [2, 7],  (1)

where P_{1,j-1} is the confidence score of the (j-1)-th group at the coarse level, P_{j,i} is the confidence score of the i-th species of the (j-1)-th group, and P'_{j,i} is the final confidence score of the i-th species of the (j-1)-th group. As a result, the final confidence score P'_{j,i} combines the coarse-level and fine-level scores, so the CNN architecture enforces the hierarchical data structure when the products are used to compute the training loss. The training loss is meaningful when using P'_{j,i} because the final layer of every head is a softmax, so that

Σ_i P_{j,i} = 1 for each j, and Σ_{j=2}^{7} Σ_i P'_{j,i} = 1.  (2)

Fig. 4.

Hierarchical Architecture: We call our 7 classification heads 'Hierarchical Heads'. Head-1 handles the 6 coarse-level groups and uses shallower feature maps extracted from the backbone, while the remaining 6 heads handle the fine levels: Head-2 for the 'Skates' group, Head-3 for 'Sharks', Head-4 for 'Roundfish', Head-5 for 'Flatfishes', Head-6 for 'Rockfishes', and Head-7 for 'Invertebrates'. All fine-level heads use shared deeper feature maps from the same backbone. Head-1 has two fully connected layers followed by a softmax layer; each of the 6 fine-level heads has one fully connected layer followed by a softmax layer.

Efficient Training Strategy
During image-based training, each input image has both a labeled coarse-level ground truth and a fine-level ground truth. For our architecture, we experimented with two options for using these two ground truths.

The first option trains 'Head-1' and the fine-level head corresponding to the ground-truth coarse-level group. Since the corresponding fine-level head is selected by the ground-truth coarse-level group, losses are calculated only for these two heads:
Loss_1 = − Σ_i y_{1,i} · log(P_{1,i}) − Σ_i y_{j,i} · log(P'_{j,i}),  (3)

where the first summation is the cross-entropy loss of 'Head-1' and y_{1,i} is the coarse-level ground truth. The second summation is the cross-entropy loss of the corresponding fine-level head using the final predictions P'_{j,i} after the confidence-score multiplication operations; y_{j,i} is the ground truth among the species within this fine-level head. This regular loss does not involve P'_{j,i} from the other heads; therefore, it does not fully enforce the hierarchical data structure during training and only trains two heads at a time.

The second option trains 'Head-1' and all fine-level heads using the final predictions P'_{j,i} after the confidence-score multiplication operations. This is a more efficient training strategy because the multiplication operations fully enforce the hierarchical data structure during training and all heads are trained simultaneously:

Loss_2 = − Σ_i y_{1,i} · log(P_{1,i}) − Σ_{j=2}^{7} Σ_i y'_{j,i} · log(P'_{j,i}),  (4)

where y'_{j,i} denotes the ground truth among the 31 species. Given one input image, we can calculate the cross entropy over all final predictions P'_{j,i} because, after the confidence-score multiplication operations, these products still sum to 1.

Video-based Inference Schemes
Although we use image-based training, where the training loss is calculated on each individual input image, two video-based (track-based) inference methods are implemented and compared. Since our system outputs the confidence scores of the 31 species, P'_{j,i}, for each input frame, the first scheme picks, as the prediction for each track, the species with the maximum average confidence score over all frames of that track.

The second scheme picks the species with the maximum confidence score for every frame of a track, then uses a majority vote over frames to select one species as the prediction for that track; the average confidence score is then computed over only the frames corresponding to the selected species. We report performance under both video-based inference schemes, along with image-based confidence scores, in the following section. The two schemes can be summarized as

p_{1,i} = (1/T) · Σ_t P_{1,i,t},  p_{2,i} = (1/T) · Σ_t P'_{j,i,t},  j ∈ [2, 7],  (5)

where t is the frame index and P'_{j,i,t} is P'_{j,i} at the t-th frame. In the first scheme, T is the total number of frames in a video clip (a track from the start frame to the end frame of one catch), while in the second scheme it is the number of frames corresponding to the selected species in that clip. As a result, p_{1,i} is the video-based average confidence score over the 6 groups and p_{2,i} is the video-based average confidence score over the 31 species.
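As a concrete sketch of the two inference schemes, the plain-Python functions below operate on a toy track of per-frame species scores (3 frames, 3 species standing in for the real 31-species P'_{j,i,t} outputs; all values are illustrative):

```python
from collections import Counter

def avg_confidence_inference(track_scores):
    """Scheme 1: average the species scores over all T frames of the track,
    then pick the species with the maximum average score."""
    T = len(track_scores)
    n = len(track_scores[0])
    avg = [sum(frame[i] for frame in track_scores) / T for i in range(n)]
    best = max(range(n), key=lambda i: avg[i])
    return best, avg[best]

def majority_vote_inference(track_scores):
    """Scheme 2: take the argmax species per frame, majority-vote a species,
    then average the scores of only the frames that voted for it."""
    votes = [max(range(len(f)), key=lambda i: f[i]) for f in track_scores]
    winner, _ = Counter(votes).most_common(1)[0]
    selected = [f[winner] for f, v in zip(track_scores, votes) if v == winner]
    return winner, sum(selected) / len(selected)

# Toy track: 3 frames x 3 species.
track = [[0.6, 0.3, 0.1],
         [0.2, 0.7, 0.1],
         [0.5, 0.4, 0.1]]

print(avg_confidence_inference(track))   # species 1 wins on average score
print(majority_vote_inference(track))    # species 0 wins the frame vote
```

Note that the two schemes can disagree on a track, as in this toy example: frame 2's strong score for species 1 dominates the average, while species 0 wins the per-frame majority vote.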
4 Experimental Results

We use a video-based data split: each short video clip (track) is associated with one individual fish, and all frames from 80% of the tracks are used as training data for image-based training. All frames from the remaining 20% of tracks are the evaluation data (see Fig. 3). As a result, the training and evaluation images come entirely from tracks of different individual fish. All hyper-parameters, such as training epochs, learning rate, and data augmentation, are kept the same across the competing approaches below.
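A minimal sketch of this track-level split (with integer track IDs standing in for the real labeled tracks; the function name and seed are illustrative), which guarantees that frames of the same individual fish never appear in both the training and evaluation sets:

```python
import random

def split_by_track(track_ids, train_frac=0.8, seed=42):
    """Split at the track (individual fish) level, not the frame level,
    so no fish contributes frames to both sets."""
    ids = sorted(set(track_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = int(len(ids) * train_frac)
    return set(ids[:cut]), set(ids[cut:])

tracks = list(range(3021))            # the dataset has 3,021 tracks
train_ids, eval_ids = split_by_track(tracks)
print(len(train_ids), len(eval_ids))  # 2416 605
```

Splitting by track rather than by frame avoids leakage: consecutive frames of one fish are nearly identical, so a frame-level split would inflate evaluation accuracy.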
The dominant species classification architecture extracts deep features using a CNN followed by a flat classifier. For the baseline, we therefore use ResNet101 as the backbone and two fully connected layers followed by a 31-way softmax layer as the flat classifier head, a classic deep learning classification architecture. During training, we use only the fine-level ground truth to calculate the cross-entropy loss on the flat classifier's output confidence scores over the 31 species, with no coarse-level predictions.
From Table 1, we can see that the accuracy of the baseline is far below that of our hierarchical method.

Using all frames from the remaining 20% of tracks for evaluation, we apply the following evaluation methods, computing both image-based accuracy and video-based (track-based) accuracy, denoted in the 'Unit' column of Table 1. We also calculate classification accuracy at the coarse level against the coarse-level ground truth, denoted 'Level-1' in Table 1.

Moreover, with confidence scores at both the coarse and fine levels, we can pick the species with the maximum fine-level confidence score within the group with the maximum coarse-level confidence score as the final prediction, denoted 'Level-2 A' in Table 1. Alternatively, with the final confidence scores over the 31 species, we can directly pick the species with the maximum product of coarse- and fine-level confidence scores, denoted 'Level-2 B' in Table 1. For these two metrics ('Level-2 A' and 'Level-2 B') in the video-based schemes, we further use either the maximum average confidence score (denoted 'video') or majority vote (denoted 'video∗') to report performance, as discussed under 'Video-based Inference Schemes' in Section 3.2.

Finally, with the final confidence scores P'_{j,i}, j ∈ [2, 7], a prediction can stop at the coarse level, as discussed in the 'Video-based Inference Schemes' section, using a threshold. This metric, which is able to stay at the coarse level, is denoted 'Level-2 C' in Table 1. Theoretically, the ceiling of 'Level-2 C' is 'Level-1' if all samples stop at the coarse level. We therefore use a greedy search to find a threshold for each scheme in Table 1 such that, after stopping at the coarse level, the overall video-based inference accuracy does not degrade. We fix these thresholds in image-based inference for every competing scheme.
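The 'Level-2 C' decision rule can be sketched as follows (plain Python; the threshold value and the toy score tables are illustrative, and the real thresholds come from the greedy search described above):

```python
def hierarchical_predict(coarse_scores, final_scores, threshold):
    """Output a fine-level species if its final (product) confidence clears
    the threshold; otherwise stop at the best coarse-level group."""
    # Best species over all groups, by final score P'_{j,i}.
    g, s = max(
        ((j, i) for j, row in enumerate(final_scores) for i in range(len(row))),
        key=lambda gi: final_scores[gi[0]][gi[1]],
    )
    if final_scores[g][s] >= threshold:
        return ("species", g, s)
    # Fall back to the most confident coarse-level group.
    return ("group", max(range(len(coarse_scores)),
                         key=lambda j: coarse_scores[j]))

# Toy scores: 3 groups; each group's final scores sum to its coarse score.
coarse = [0.7, 0.2, 0.1]
final = [[0.30, 0.25, 0.15], [0.12, 0.08], [0.10]]

print(hierarchical_predict(coarse, final, threshold=0.5))  # ('group', 0)
print(hierarchical_predict(coarse, final, threshold=0.2))  # ('species', 0, 0)
```

With a high threshold the ambiguous fine-level scores (0.30 vs. 0.25) cause the prediction to stop at the confident coarse group, which is exactly the behavior that lets uncertain samples be routed to human experts.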
Table 1.

Comparison with Flat Classifier and Ablation Study: 'video' denotes video-based inference using the average confidence score over the 31 species to pick one predicted species per track; 'video∗' denotes video-based inference using a majority vote to pick one species per track. The two numbers following the accuracy value in the 'Level-2 C' column are the total numbers of samples stopping at the coarse level and proceeding to the fine level, respectively.

Model | Unit | Level-1 | Level-2 A | Level-2 B | Level-2 C
Baseline | img | - | - | 78.3 | -
Scheme-1 | img | 86.3 | 77.4 | 77.4 | 82.0 (8567, 27393)
video ∗ video video ∗ video video ∗ video (293, 324)

From Table 1, we can see that video-based inference is always better than image-based inference for all competing schemes, and the two video-based inference methods, average confidence and majority vote, are comparable with each other. Scheme-3 is our full system shown in Fig. 4, which includes the confidence-score multiplication operations to enforce the hierarchical data structure and uses the efficient training strategy (Loss_2). Scheme-2 removes only the efficient training strategy and uses Loss_1 instead. Scheme-1 removes the confidence-score multiplication operations from the architecture but keeps the 7 heads; it also removes the efficient training strategy and instead uses standard cross-entropy losses on 'Head-1' and the fine-level head corresponding to the ground-truth coarse-level group. Scheme-1 shares the same architecture as B-CNN [19]. When evaluating Scheme-1 under 'Level-2 B' and 'Level-2 C', we must first multiply the coarse-level confidence scores with the fine-level confidence scores to obtain final confidence scores.

Detailed coarse-level and fine-level accuracy of Scheme-3 (our complete proposed system), based on the maximum average confidence score (denoted 'video' in Table 1), is shown in Fig. 5.

Fig. 5.
Detailed Accuracy on the Coarse Level and Fine Level of the complete proposed system: the orange bars are image-based inference and the blue bars are video-based inference. Panel (d) uses the 'Level-2 C' evaluation method. Most tail-class species stop at the coarse-level prediction, which yields the 5.5% improvement in overall video-based accuracy shown in Table 1.
Scheme-1 and Scheme-2 are implemented mainly for ablation purposes.
Comparing Scheme-1 with Scheme-2 in Table 1, we can see that the confidence-score multiplication operations effectively enforce the hierarchical data structure and improve performance even though Scheme-2 trains only two heads at a time. Comparing Scheme-2 with Scheme-3, we can see that our efficient training strategy (Loss_2) improves performance by fully enforcing the hierarchical data structure during training.

Under 'Level-2 C', the competing systems' final predictions can stop at the coarse level if the final confidence score is lower than the greedily searched threshold mentioned in the previous section. We refer to 'Level-2 C' as hierarchical prediction, which is one big advantage of hierarchical classifiers over flat classifiers: it allows fisheries managers to assign the corresponding experts to review those images within a given group and obtain the correct fine-level labels. Moreover, from Fig. 5(d), we can see that most tail-class species identifications stop at the coarse level, resulting in significantly higher overall accuracy under 'Level-2 C' than under 'Level-2 B' in Table 1. Our full system, Scheme-3, also has the greatest number of images and tracks proceeding to the fine level while achieving the best performance.
5 Conclusion

We proposed an efficient hierarchical CNN classifier that enforces a hierarchical data structure for fish species identification, combined with an efficient training strategy and two video-based inference schemes. Our experiments show that the integrated use of these three main strategies clearly improves accuracy. Additionally, hierarchical prediction allows images that cannot be confidently classified at the fine level to be confidently classified at a coarse level for later expert examination, which significantly improves overall accuracy on tail-class species identification by stopping at coarse-level predictions. Moreover, our method greatly outperforms the baseline flat classifier. Future work will be devoted to adding techniques such as data sampling or additional training losses for tail-class species identification; it would also be interesting to combine more strategies for long-tailed data with hierarchical classification.

References

[1] Chuang, M.C., Hwang, J.N., Rose, C.S.: Aggregated segmentation of fish from conveyor belt videos. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 1807–1811. IEEE (2013)
[2] Chuang, M.C., Hwang, J.N., Williams, K., Towler, R.: Tracking live fish from low-contrast and low-frame-rate stereo videos. IEEE Transactions on Circuits and Systems for Video Technology (1), 167–179 (2014)
[3] Gupta, S., Abu-Ghannam, N., Massini, R., Mottiar, Y., Altosaar, I., Peleg, M., Normand, M.D., Corradini, M.G.: Trends in application of imaging technologies to inspection of fish and
[4] He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. CoRR abs/1703.06870 (2017), http://arxiv.org/abs/1703.06870
[5] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
[6] Huang, P.X., Boom, B.J., Fisher, R.B.: Hierarchical classification with reject option for live fish recognition. Machine Vision and Applications (1), 89–102 (2015)
[7] Huang, T.W., Hwang, J.N., Romain, S., Wallace, F.: Live tracking of rail-based fish catching on wild sea surface. In: 2016 ICPR 2nd Workshop on Computer Vision for Analysis of Underwater Imagery (CVAUI). pp. 25–30. IEEE (2016)
[8] Huang, T.W., Hwang, J.N., Rose, C.S.: Chute based automated fish length measurement and water drop detection. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1906–1910. IEEE (2016)
[9] Kowsari, K., Brown, D.E., Heidarysafa, M., Meimandi, K.J., Gerber, M.S., Barnes, L.E.: HDLTex: Hierarchical deep learning for text classification. In: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). pp. 364–371. IEEE (2017)
[10] Kowsari, K., Sali, R., Ehsan, L., Adorno, W., Ali, A., Moore, S., Amadi, B., Kelly, P., Syed, S., Brown, D.: HMIC: Hierarchical medical image classification, a deep learning approach. Information (6), 318 (2020)
[11] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
[12] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[13] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–9 (2015)
[14] Wang, G., Hwang, J.N., Williams, K., Cutter, G.: Closed-loop tracking-by-detection for ROV-based multiple fish tracking. In: 2016 ICPR 2nd Workshop on Computer Vision for Analysis of Underwater Imagery (CVAUI). pp. 7–12. IEEE (2016)
[15] White, D.J., Svellingen, C., Strachan, N.J.: Automated measurement of species and length of fish by computer vision. Fisheries Research (2-3), 203–210 (2006)
[16] Williams, K., Lauffenburger, N., Chuang, M.C., Hwang, J.N., Towler, R.: Automated measurements of fish within a trawl using stereo images from a camera-trawl device (CamTrawl). Methods in Oceanography, 138–152 (2016)
[17] Wu, T.Y., Morgado, P., Wang, P., Ho, C.H., Vasconcelos, N.: Solving long-tailed recognition with deep realistic taxonomic classifier. arXiv preprint arXiv:2007.09898 (2020)
[18] Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., Di, W., Yu, Y.: HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (December 2015)
[19] Zhu, X., Bain, M.: B-CNN: Branch convolutional neural network for hierarchical classification. arXiv preprint arXiv:1709.09890 (2017)
[20] Zion, B.: The use of computer vision technologies in aquaculture – a review. Computers and Electronics in Agriculture 88