An Enhanced Prohibited Items Recognition Model
Tianze Rong, Hongxiang Cai, Yichao Xiong
Media Intelligence Technology Co.,Ltd {tianze.rong, hongxiang.cai, yichao.xiong}@media-smart.cn
February 25, 2021
Most security inspection involves prohibited items recognition: packages or bags are scanned by X-ray so that the items inside can be recognized and located without opening them. However, keeping professional staff to read X-ray images and recognize the items is expensive and time-consuming. Nowadays X-ray reading can be rapid and automatic, benefiting from the development of deep neural networks and computer vision. We investigated data in this field, taking SIXray as a typical dataset, to probe the characteristics of X-ray images and prohibited items recognition. In this work we found that the primary factor restricting model performance is the small scale of the objects, which limits the benefit of enhancing modules. We then investigated a bundle of data augmentations and enhancing modules to improve the performance of the model. Our model achieves state-of-the-art performance on SIXray: its best mAP is 0.899 on SIXray10 and 0.748 on SIXray100.
The description and analysis of the data can be representative of data from similar tasks. In this case, we analyzed the statistics of the SIXray dataset [1] and regard it as a typical pattern to research the bottleneck of prohibited items recognition via X-ray.
Amount of Images
SIXray has three subsets at different imbalance levels, called SIXray10, SIXray100 and SIXray1000. The imbalance level of SIXray1000 is almost as high as that of SIXray100, so we only investigate SIXray10 and SIXray100. Table 1 shows the sizes of the dataset and its subsets.

           SIXray10   SIXray100
Train Set     74959      749574
Test Set      13411      133201
Total         88370      882775
Table 1: Quantitative Statistics of SIXray
Statistics of Labels
Label imbalance is one of the dominant characteristics of prohibited items detection. It is intuitive to assume that prohibited items are much rarer than regular ones, so it is necessary to estimate the level of imbalance. Table 2 shows the amount of each label.

               Gun    Knife   Wrench   Pliers   Scissors   Negative
Counts         2705   1748    2012     3434     807        67464
Percentage(%)  3.60   2.33    2.68     4.58     1.08       90.0
Table 2: Label Distribution of SIXray on Train Set
Some of the prohibited items are fairly small and hard to recognize, so we investigated the dataset to estimate the scale of the objects. Fortunately, the authors of SIXray have provided detection annotations. For each bounding box we calculated the scale by the formula:

scale = √(width × height)

We then displayed the scales of the different categories in their own histograms, as in Figure 1. It is apparent that pliers and scissors are much smaller than the other categories. The most likely scale of scissors is about 50 pixels.

Figure 1: Scale Histograms of Categories
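A minimal sketch of the scale statistic above, assuming the annotations provide each box as a (width, height) pair; the function names and bin width are ours, not from the paper:

```python
import math
from collections import Counter

def box_scale(width, height):
    """Object scale as defined in the text: sqrt(width * height)."""
    return math.sqrt(width * height)

def scale_histogram(boxes, bin_width=10):
    """Bucket box scales into bins of `bin_width` pixels.

    `boxes` is assumed to be a list of (width, height) tuples taken
    from the detection annotations.
    """
    hist = Counter()
    for w, h in boxes:
        hist[int(box_scale(w, h) // bin_width) * bin_width] += 1
    return dict(hist)

# A roughly 50x50 px pair of scissors falls into the 50-pixel bin.
print(scale_histogram([(50, 50), (48, 55), (200, 180)]))  # {50: 2, 180: 1}
```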
According to [1], X-ray images are transparent: even an occluded item can still be seen. The absorption of X-rays traveling through an object obeys the Lambert-Beer law, which states that the attenuation of light (X-rays included) in a transparent object is linear:

A = εlc

where A is the absorbance of light, ε is a coefficient related to the attenuating species, l is the traveling length of the light, and c is the concentration of the attenuating species. Hence, it is natural to blend two images by linear overlay due to the transparency of X-ray images.

Random Flipping

Since an X-ray image is perspective and taken from the vertical view, the pose of packages can be variant and arbitrary. At least, vertical and horizontal flipping will not harm the semantics.
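A small numeric illustration of why linear overlay is plausible: under the Lambert-Beer law the absorbances of stacked transparent objects simply add. The function and coefficient values here are illustrative, not from the paper:

```python
def absorbance(epsilon, length, concentration):
    """Lambert-Beer law: A = epsilon * l * c."""
    return epsilon * length * concentration

# Two stacked objects: their absorbances add, which is why overlapping
# X-ray images can be synthesized by a linear overlay of pixel values.
a_first = absorbance(0.8, 2.0, 0.5)   # first object in the beam path
a_second = absorbance(0.3, 1.0, 1.0)  # second object behind it
a_total = a_first + a_second
print(a_total)
```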
Random Rotation
For the same reason as random flipping, but random rotation can produce more variety than random flipping.
Figure 2: Data Augmentation about Overlapping
Random Cropping
X-ray images mostly have a blank margin around the scanned content, and the object to be recognized is not necessarily placed right in the middle of the image; either the margin or the position of the object makes random cropping a viable augmentation. In addition, we sampled some images to check the possibility of cropping the object out, which seldom occurs.
Image Synthesis
Since the data is imbalanced, oversampling is a good way to ease the imbalance; moreover, adding two samples with weights is an option that benefits from the transparency of X-ray images. Therefore, we use weighted adding as an augmentation, which we call blending, as shown in Figure 2. MixUp is also an option to enlarge the effective dataset capacity:
Image_blend = Blend(Image_1, Image_2, λ) = λ Image_1 + (1 − λ) Image_2

Here λ is a hyper-parameter.

Rescoring Mechanism

We designed a rescoring mechanism against the imbalance between classes [2]. As shown in Table 2, the number of negative samples is larger than that of any single category of prohibited items, which means that decoupling the imbalance between positive and negative samples from the imbalance among the positive classes can somewhat ease the imbalance over the whole dataset. It can be regarded as a rather weak hierarchy. To represent the imbalance between positive and negative samples, we adopt the probability of a positive sample as the objectness score. Based on the objectness score, we obtain the probability of the corresponding class by the formula:

P(class_i) = P(class_i | object) · P(object)

Attention Mechanism

Prohibited items are mostly local and partial within the whole image; in a way, the majority of the image is uninformative or should be ignored. Attention mechanisms, especially spatial attention, can lead the model to focus on the local region and promote its performance. Spatial attention [3] is widely used as a plug-and-play module that distributes a weight to each pixel height-wise and width-wise; prohibited items should be weighted more in this case. Channel-wise attention [4] is another form of attention, on feature channels. The Convolutional Block Attention Module (CBAM) combines channel-wise and spatial attention.
Figure 3: CBAM Structure [5]
Architecture

We use an ordinary ResNet-34 [6] pre-trained on ImageNet [7], but with a sigmoid function to output per-class confidences for multi-label classification.
Loss Function
Intuitively, we adopted binary cross-entropy as the loss function, since this task is multi-label classification.
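A minimal sketch of multi-label binary cross-entropy averaged over classes, in plain Python for clarity; the clamping epsilon is our assumption (real frameworks handle numerical stability internally):

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Binary cross-entropy averaged over labels, for multi-label
    classification where each class gets an independent sigmoid output."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

# Confident, correct predictions give a small loss.
print(bce_loss([0.9, 0.1], [1, 0]))
```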
Optimizer
The optimizer is Nesterov Accelerated Gradient (NAG) with a learning rate of 0.01 and no learning rate scheduler. The momentum parameter is set to 0.9.
Data Augmentation
The input data is randomly flipped in both the vertical and horizontal directions with a probability of 0.5. Besides, all images are resized to (224, 224).
Training Procedure
The batch size is 128 and we train for 60 epochs; this training duration was selected as the best setting. The metrics of our baseline are shown in Table 3.

AP(%)          Gun   Knife   Wrench   Pliers   Scissors   mean
ResNet34 [1]
Table 3: Metrics on Baseline Setting Trained on SIXray10
Random Crop

Instead of directly resizing to (224, 224), we first resize the image to (256, 256), then randomly crop it to (224, 224).
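The resize-then-crop step can be sketched as follows; the helper name is ours, and a real pipeline would slice pixel arrays, while here we only pick the crop window:

```python
import random

def random_crop_box(img_w, img_h, crop_w, crop_h, rng=random):
    """Pick the top-left corner of a random crop window inside the image;
    (256, 256) -> (224, 224) leaves 32 pixels of slack per axis."""
    x0 = rng.randint(0, img_w - crop_w)
    y0 = rng.randint(0, img_h - crop_h)
    return x0, y0, x0 + crop_w, y0 + crop_h

x0, y0, x1, y1 = random_crop_box(256, 256, 224, 224)
print((x1 - x0, y1 - y0))  # (224, 224)
```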
Random Rotate
We rotate the image by a random angle within a symmetric range, and resize the image so that all pixels remain inside the image region.

MixUp
MixUp [8] mixes two groups of images with a partition coefficient λ drawn from a beta distribution B(α, β). We add two different images via:

Image^i_blend = λ Image^i_original + (1 − λ) Image^i_shuffled
The loss is calculated by the formula:

Loss = λ Loss_original + (1 − λ) Loss_shuffled
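A sketch of the MixUp procedure above, assuming images flattened to plain lists; the helper names are ours:

```python
import random

def sample_lambda(alpha, beta, rng=random):
    """MixUp draws the mixing coefficient from a Beta(alpha, beta) distribution."""
    return rng.betavariate(alpha, beta)

def mixup_pair(img_a, img_b, lam):
    """Pixel-wise convex combination of two flattened images."""
    return [lam * a + (1 - lam) * b for a, b in zip(img_a, img_b)]

def mixup_loss(loss_a, loss_b, lam):
    """The losses are combined with the same coefficient as the images."""
    return lam * loss_a + (1 - lam) * loss_b

lam = sample_lambda(0.4, 0.4, random.Random(0))
mixed = mixup_pair([1.0, 0.0], [0.0, 1.0], lam)
```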
Blending
Due to the overlapping property, we directly add two images with a constant coefficient λ; here Image^i_shuffled denotes the i-th image of a batch after shuffling:

Image^i_blend = λ Image^i_original + (1 − λ) Image^i_shuffled
The corresponding label is:

label_blend = label_original | label_shuffled

where | is a dimension-wise OR.

Method    Baseline   Random Crop   MixUp(0.2, 0.2)   MixUp(0.4, 0.4)   Blend(0.5)
mAP(%)    82.3       82.5          81.9              82.2              66.0
Table 4: Metrics on Different Data Augmentation

As stated before, prohibited items can be targeted by an attention mechanism. We employed the Convolutional Block Attention Module (CBAM) [5] as the attention mechanism in our model. CBAM is an attention module with both channel-wise and pixel-wise attention, but we modified the implementation from the original one; the structure is shown in Figure 4.

Figure 4: Implementation of CBAM Structure

          Baseline   CBAM
mAP(%)    82.3       83.0
Table 5: Result of CBAM
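The blending augmentation described above, with a constant coefficient λ and labels combined by dimension-wise OR, can be sketched as follows; helper names are ours, with images flattened to lists:

```python
def blend_images(img_a, img_b, lam=0.5):
    """Weighted add of two flattened X-ray images; the transparency of
    X-ray imaging makes this linear overlay physically plausible."""
    return [lam * a + (1 - lam) * b for a, b in zip(img_a, img_b)]

def blend_labels(label_a, label_b):
    """Dimension-wise OR: the blend contains an item if either source does."""
    return [int(a or b) for a, b in zip(label_a, label_b)]

print(blend_labels([1, 0, 0], [0, 0, 1]))  # [1, 0, 1]
```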
After the experiments mentioned above, we found it counter-intuitive that these normal, universal methods for improving models all failed. We checked most of the potential factors to find the bottleneck. It can be clearly seen from the baseline in Table 3 and the P-R curve in Figure 5 that the scissors category performs worse than the others, and pliers has the second-lowest performance; in particular, the recall of scissors drops sharply on the curve. Associating this with the object statistics, scissors is exactly the smallest category and pliers the second smallest. We therefore changed the input scale to (384, 384) and (512, 512) and enlarged the crop size in equal ratio.
After the scale-related experiments, we found that the image scale may be a performance bottleneck holding the accuracy down. We ran a set of experiments to release this restriction and revalidate the methods that did not work before. The notation used is described below:
Figure 5: P-R curve of Baseline

Input Size   Crop Size   mAP(%)
256          224         82.3
384          336         85.8
512          448
Table 6: Result under Different Input Scale

• Input scale: the edge length of images after resizing.
• Crop scale: the edge length of images after random cropping.
• Flip: random flips in the vertical and horizontal directions with probability 0.5.
• Rotation: random rotation within a symmetric angle range.
• Synthesis: images synthesized with MixUp or blending.
• CBAM: a CBAM module on the classification head.

Finally, we settled on the following configuration:
• Scale: input scale is (512, 512).
• Random Crop: cropping scale is (448, 448).
• Random Flip: random flips in the vertical and horizontal directions with probability 0.5.
• CBAM module on the classification head.
• MixUp: alpha and beta are 0.4.
25, 2021Input Scale Crop Scale Flip Rotation Synthesis CBAM mAP224 (cid:88) (cid:88)
256 224 (cid:88) (cid:88)
256 224 (cid:88) (cid:88) (cid:88)
MixUp(0.2, 0.2) 81.9256 224 (cid:88)
MixUp(0.4, 0.4) 82.2256 224 (cid:88)
Blend(0.5) 66.0384 336 (cid:88)
512 448 (cid:88)
512 448 (cid:88) (cid:88)
512 448 (cid:88) (cid:88)
512 448 (cid:88)
MixUp(0.4, 0.4)
512 448 (cid:88)
Blend(0.5) 86.5512 448 (cid:88) (cid:88)
MixUp(0.4, 0.4) 86.5512 448 (cid:88)
MixUp(0.4, 0.4) (cid:88)
Table 7: Results and Conditions of All Experiment
We tested the best model trained on SIXray10 and a model whose setting is inherited from the best SIXray10 setting but trained on SIXray100.

AP(%)                  Gun    Knife   Wrench   Pliers   Scissors   mean
ResNet34 [1]           83.1   78.8    30.5     55.2     16.1       52.7
DenseNet-CHR [1]       82.1   78.8    43.2     66.8     28.8       60.0
Trained on SIXray100   82.0   85.8
Table 8: Metrics on Baseline Setting Trained on SIXray100
Analyzing the baseline results on SIXray100, the model trained on SIXray100 is even worse than the one trained on SIXray10, against the common sense that the bigger the data, the better the model works. Since the other variables are controlled, it is reasonable to believe that the degradation is due to the higher imbalance level. The solution we designed is the rescoring mechanism.

We modified the output FC layer to adapt to the rescoring mechanism. According to the labels of SIXray, the dimension of the output layer should be 5, equal to the number of categories; we modified it to 6. The surplus output is the objectness, which predicts the probability that there is a prohibited item in the image, without classification. The final probability of each category is the product of the five class components and the objectness, as in the formula mentioned before:

P(class_i) = P(class_i | object) · P(object)

AP(%)                 Gun    Knife   Wrench   Pliers   Scissors   mean
Trained on SIXray10   85.1
Table 9: Metrics of Rescoring Mechanism
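A sketch of the rescoring head described above, assuming the six raw network outputs are logits passed through sigmoids; the function names are ours:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rescore(logits):
    """Split 6 raw outputs into 5 conditional class logits plus one
    objectness logit, then combine them as
    P(class_i) = P(class_i | object) * P(object)."""
    *class_logits, obj_logit = logits
    p_object = sigmoid(obj_logit)
    return [sigmoid(z) * p_object for z in class_logits]

# With all-zero logits, every conditional probability is 0.5 and the
# objectness is 0.5, so each final class probability is 0.25.
print(rescore([0.0] * 6))  # [0.25, 0.25, 0.25, 0.25, 0.25]
```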
Conclusively, adopting an input scale of 512, a crop scale of 448, random flipping, CBAM, and MixUp with alpha and beta of 0.4 as the basic configuration achieves an mAP of 89.9% on SIXray10. SIXray100 additionally needs the rescoring mechanism, which achieves an mAP of 74.8% on SIXray100.
References

[1] Caijing Miao, Lingxi Xie, Fang Wan, Chi Su, Hongye Liu, Jianbin Jiao, and Qixiang Ye. SIXray: A large-scale security inspection X-ray benchmark for prohibited item discovery in overlapping images. 2020.
[2] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In IEEE Conference on Computer Vision & Pattern Recognition, pages 6517–6525, 2017.
[3] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. pages 2921–2929, 2016.
[4] Jie Hu, Li Shen, Gang Sun, and Samuel Albanie. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99), 2017.
[5] Sanghyun Woo, Jongchan Park, Joon Young Lee, and In So Kweon. CBAM: Convolutional block attention module. 2018.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.