STCNet: Spatio-Temporal Cross Network for Industrial Smoke Detection
Abstract—Industrial smoke emissions pose a serious threat to natural ecosystems and human health. Prior works have shown that using computer vision techniques to identify smoke is a low-cost and convenient method. However, industrial smoke detection is a challenging task because industrial emission particles often decay rapidly outside the stacks or facilities, and steam is visually very similar to smoke. To overcome these problems, a novel Spatio-Temporal Cross Network (STCNet) is proposed to recognize industrial smoke emissions. The proposed STCNet involves a spatial pathway to extract texture features and a temporal pathway to capture smoke motion information. We assume that the spatial and temporal pathways can guide each other: the spatial path can easily recognize obvious distractions such as trees and buildings, while the temporal path can highlight the obscure traces of smoke movement, so mutual guidance between the two should benefit detection performance. In addition, we design an efficient and concise spatio-temporal dual pyramid architecture to ensure better fusion of multi-scale spatio-temporal information. Finally, extensive experiments show that our STCNet achieves clear improvements on the challenging RISE industrial smoke detection dataset, outperforming the best competitor by 6.2%. The code will be available at: https://github.com/Caoyichao/STCNet.

Index Terms—smoke detection; spatio-temporal; dual pyramid.

I. INTRODUCTION

Industrial smoke emissions may cause adverse effects on human health and the ecological environment. Large amounts of air pollutants may cause or contribute to an increase in mortality or serious illness, or may pose a present or potential hazard to human health. Smoke detection technology based on computer vision can help regulators obtain visual evidence and help enterprises implement self-monitoring. In addition, smoke is the main manifestation of early fires.
Smoke detection is an efficient approach for large-range fire monitoring. In the industrial smoke detection task, plants usually emit not only smoke but also a lot of steam. Steam and smoke have very similar appearances, which poses a great challenge to smoke detection in this scene. Unlike some smoke datasets (in which steam and smoke are not deliberately distinguished), the RISE dataset [29] makes a clear distinction between steam and smoke.
This work was supported by the National Natural Science Foundation of China (No. 61871123), the Key Research and Development Program in Jiangsu Province (No. BE2016739) and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions. We thank the Big Data Center of Southeast University for providing facility support for the numerical calculations in this paper.

A more realistic industrial smoke emission dataset makes practical smoke detection methods possible. So far, there is a large body of literature on recognizing the specific features of smoke. According to the dimension of the input data, existing methods can be divided into image-based and video-based. Image-based smoke detection methods tend to detect smoke areas from a single frame. Video-based methods usually learn not only spatial features from single frames but also temporal information from videos. In some cases, an image-based method is a good choice when stable and reliable image sequences are not available. Tian et al. [2] proposed to detect and separate smoke from a single image frame by convex optimization. Yuan et al. [37] proposed to combine local binary pattern (LBP)-like features, kernel principal component analysis (KPCA), and Gaussian process regression (GPR) for smoke detection. There are also studies applying convolutional neural networks (CNNs) to smoke detection and recognition. Yin et al. [38] proposed a deep normalization and convolutional neural network (DNCNN) with 14 layers for smoke recognition, in which batch normalization is used to speed up the training process and boost the accuracy of smoke recognition. However, the dynamic characteristics of smoke often play an important role in the recognition process. When human eyes distinguish smoke in a video, dynamic features are often used as key reference information.
If the recognition model can learn context information from sequence data, its recognition accuracy will theoretically improve. From the motion point of view, a higher-order linear dynamical system (h-LDS) descriptor was proposed as a dynamic texture descriptor for video smoke identification [5]. Some researchers have tried to apply deep learning to smoke detection. Lin et al. [6] proposed a joint detection framework for video smoke detection, in which Faster R-CNN is employed to generate suspected smoke boxes and a 3D CNN is used to extract spatio-temporal features of the clip. However, this method also has weak points, such as considerable computational complexity. Despite these efforts, video-based smoke recognition is still a challenging task. Early smoke objects are usually small in size, and the variance of smoke color, texture, and interference is large, which makes industrial smoke detection very difficult. The key to accurate smoke detection is the ability to learn spatio-temporal features. Given the aforementioned concerns, we propose a novel industrial smoke detection framework, denoted as Spatio-Temporal Cross Network (STCNet). Inspired by the two-stream methods [39][21], this framework cross-fuses multi-scale spatial and temporal features in the forward pass of the smoke detection process. The main contributions of this paper are as follows: 1) A novel video smoke detection architecture utilizing residual frames is proposed, which can effectively focus on subtle smoke objects. 2) The Spatio-Temporal Cross Network (STCNet) integrates spatial feature learning with temporal modeling through a dual pyramid, achieving collaborative promotion of the two-path network. 3) Experimental results on the challenging RISE dataset demonstrate that our proposed method achieves state-of-the-art results on the smoke detection task.

Yichao Cao, Student Member, IEEE, Qingfei Tang, Xiaobo Lu, Fan Li, and Jinde Cao, Fellow, IEEE

Yichao Cao, Xiaobo Lu and Jinde Cao are with the School of Automation, Southeast University, Nanjing 210096, China. Qingfei Tang was with Northeastern University, Shenyang, China; he is now with Nanjing Enbo Technology Co., Ltd., Nanjing 210007, China. Fan Li is with the School of Information Science and Engineering, Southeast University, Nanjing 210096, China.

The rest of this paper is organized as follows. Section II summarizes related works. The proposed architecture for smoke representation is described in Section III. Detailed performance studies and analysis are conducted in Section IV. Finally, conclusions and discussions are drawn in Section V. Our model code will be released if the paper is accepted.

II. RELATED WORK
In this section, some representative smoke detection methods are reviewed. Although there is little literature on video smoke detection, there is substantial literature on video understanding, and it is natural to use video understanding methods for smoke recognition. Therefore, we devote a subsection to introducing and discussing video understanding methods in detail.

A. Smoke Detection
The success of existing approaches for smoke detection relies on robust smoke feature description. Both smoke feature descriptors and deep learning methods can be divided into two categories according to the dimension of smoke features: image-based methods [2], [3], [4], [17], [18], [19], [20] and video-based methods [5], [6]. To motivate the rationale for the proposed method, some representative works are reviewed from three aspects: image-based, video-based and deep learning-based. Image-based methods usually focus on smoke texture, color, shape and edge. Yuan [18] proposed a double mapping framework for smoke detection: the first mapping calculates histograms of edge orientation, edge magnitude and Local Binary Pattern (LBP) bits, and densities of edge magnitude, LBP bits, color intensity and saturation; the second mapping computes the statistical characteristics of mean, variance, skewness, kurtosis and Hu moments. Some researchers formulated the smoke detection task as a sparse representation and convex optimization problem [2], [19], [20]. Tian et al. [2] proposed to separate quasi-smoke and quasi-background components with dual over-complete dictionaries, in which the respective sparse coefficients are concatenated for smoke detection. Based on the airlight-albedo ambiguity model, Long et al. [17] proposed to detect smoke and predict its thickness distribution through transmission. Although image-based methods have achieved impressive results, they do not meet the requirements of practical applications, since they ignore the motion information of the smoke. The dynamic characteristics of smoke often play an important role in the recognition process of human vision. Dimitropoulos et al. [5] proposed a higher-order linear dynamical system (h-LDS) descriptor for multidimensional dynamic smoke texture analysis.
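As an illustration of the hand-crafted texture descriptors mentioned above, a minimal 8-neighbor LBP can be computed in a few lines. This is an illustrative numpy sketch of the basic operator, not the exact LBP variant used in [18] or [37]:

```python
import numpy as np

def lbp_8(img: np.ndarray) -> np.ndarray:
    """Basic 8-neighbor Local Binary Pattern: each interior pixel gets an
    8-bit code, one bit per neighbor, set when the neighbor is >= the center."""
    c = img[1:-1, 1:-1]  # interior (center) pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        # shifted view of the image aligned with the interior region
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= ((nb >= c).astype(np.uint8) << bit).astype(np.uint8)
    return code
```

In flat (textureless) regions every neighbor equals the center, so the code saturates at 255; a pixel brighter than all its neighbors yields 0, which is what makes LBP histograms sensitive to local texture.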
In recent years, deep learning methods have achieved competitive results on various tasks, such as visual recognition [40], speech recognition [41] and natural language processing [42]. Researchers designed a deep normalization and convolutional neural network (DNCNN) for smoke detection, which is a superior alternative to traditional hand-crafted methods [38]. Zhao et al. [43] demonstrated the effectiveness of using saliency detection and a deep convolutional neural network for localization and recognition of wildfire in unmanned aerial vehicle (UAV) images. Given a smoke candidate patch, the dark channel was reported to carry more elaborate information about the smoke, and the detailed features of dark channel images have been used as cues for smoke detection [4]. One of the difficulties in smoke recognition is the limited number of smoke samples for training. To ease this limitation, Xu et al. [44] proposed a framework based on the fast detectors SSD and MS-CNN for smoke detection using synthetic smoke image samples. Recently, some video-based smoke detection methods use deep learning [5]. Lin et al. [6] proposed a joint smoke detection framework to locate and recognize smoke from videos, in which a Faster R-CNN is used to generate suspected smoke region proposals and a 3D CNN is used to extract temporal information. Although this method achieved better smoke detection performance than image-based methods, its large computational cost limits practical applications.

B. Video Understanding
Although smoke detection from video is still a challenging task, great breakthroughs have been made in video understanding [10]-[16]. Long-term Recurrent Convolutional Networks (LRCNs) [10] combined a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) for activity recognition and video description: a 2D CNN processes individual frames and outputs image feature representations for a stack of LSTMs. Zhou et al. [11] proposed the Temporal Relation Network (TRN) to learn and reason about temporal dependencies between video frames at multiple time scales; TRN aims to describe the temporal relations between the spatial features extracted by a 2D CNN. The Temporal Shift Module (TSM) [12] approaches video understanding from another side: it shifts part of the channels along the temporal dimension, both forward and backward, to exchange information among neighboring frames. The above methods extract spatial features through a 2D CNN and then rely on different temporal modeling methods in the middle or output layers of the 2D CNN. Another research direction in video understanding is 3D CNN methods and their (2+1)D CNN variants. Tran et al. [15] proposed to use a 3D CNN to model appearance and motion information simultaneously for videos. Carreira et al. [16] introduced the Two-Stream Inflated 3D ConvNet (I3D), which is based on 2D ConvNet inflation for video, and proved that, after pre-training on a large video dataset, the performance of I3D models can be considerably improved. In this work, I3D models and their variants are taken as baseline methods on the video smoke detection dataset. Here, we explore solving smoke detection using video classification methods.
The goal of this work is to explore whether the spatial and temporal branches in a smoke detection model can work together to improve its ability to model smoke characteristics. The key motivations of STCNet are: a) subtle smoke motion features may be highlighted by a simple and efficient residual frame calculation; b) a multi-scale spatio-temporal dual pyramid may better integrate spatial and temporal information in the middle layers of the two-path network; c) a coherent two-path network can take both temporal and spatial characteristics into account in the smoke detection process.

III. SPATIO-TEMPORAL CROSS NETWORKS
In this section, we give a detailed description of the proposed video smoke detection method. First, the intuition behind the proposed method is introduced. Second, we present the architecture of the Spatio-Temporal Cross Network. After that, the multi-level spatio-temporal representation structure is detailed, which is very important for decomposing the spatial and temporal components of smoke videos. Finally, we describe the spatio-temporal feature cross operation.

A. Intuition
For the video smoke detection task, it is natural to apply existing video understanding methods directly to detect smoke from videos. However, experiments suggest that general video understanding frameworks are not good at dealing with light smoke objects; they seem to pay more attention to obvious motion information. For the RISE dataset studied in this paper, industrial smoke objects usually have no obvious contours or texture features. Moreover, some samples are hard to recognize from a single image, even for human eyes. In addition, due to the lack of effective smoke feature descriptors, it is difficult to generate an optical flow representation of smoke. Detailed experimental results will be shown in Section Ⅳ-C. Inspired by two-stream methods, the proposed STCNet uses residual frames to focus on subtle temporal features of moving smoke areas. The intuition behind STCNet is that the spatial and temporal branches may guide each other and recognize smoke objects cooperatively. We assume that residual frames and RGB frames focus on motion information and texture semantic information, respectively, and that cross fusion of this information may help the detection model make the final prediction.

B. Overview of Methodology
With the assumption that the magnitudes of residual frames usually correlate with smoke motion regions, we devise the Spatio-Temporal Cross Network (STCNet) shown in Figure 3. For the smoke video detection task, an input video is split into N subsections of the same size, and one RGB frame is sampled from each subsection (N = 8 in this work). By jointly processing a small number of frames sampled from the whole video, the most relevant information about smoke objects can be captured, while processing fewer frames reduces the inference time of the model. The proposed method is a two-path architecture, consisting of a spatial path using a CNN to extract smoke texture features and a temporal path using an identical CNN (without weight sharing) to compute smoke motion features. Let Frame_i ∈ ℝ^{C×T×H×W} and ResFrame_i ∈ ℝ^{C×T×H×W} be the i-th RGB frames and i-th residual frames in the spatial and temporal networks respectively, with C, T, H, W being the number of channels, the temporal length, and the height and width of the image. The channel number C is 3 for both RGB frames and residual frames. The RGB frames and residual frames are processed by the spatial and temporal networks, respectively. We assume that the two paths can guide each other: the spatial path can easily filter out obvious distractions such as swaying trees, while the temporal path can highlight the subtle traces of smoke; mutual guidance should help improve the performance of the model. Therefore, we perform multi-scale feature cross fusion between the two branches.

Spatial path
The backbone design is very important in the whole framework. In recent years, many well-known network structures have been designed for image classification, such as ResNet [22], ResNeXt [23], and so on. Following these successful structures, we adapt them to design our Spatio-Temporal Cross Network for smoke detection. For the spatial path, the stacked frames for one batch have shape C × T × H × W. However, it should be noted that there is no temporal interaction between the stacked frames in the inference of the spatial branch. The design goal of the spatial path is to focus on the texture and appearance of smoke regions; therefore, each input frame is processed as an independent individual. A multi-scale spatial feature pyramid is extracted by the spatial CNN backbone. The details of the spatio-temporal feature pyramid will be reported in the following.

Residual frames
For the industrial smoke detection task, one of the most challenging problems is that smoke features are not as intuitive as those of general object recognition. Therefore, we obtain residual frames by subtracting adjacent frames to highlight the changing regions between frames. In order to preserve color and long-term dependence information, stacked residual frames of the RGB channels are used as the temporal path input. Assuming the i-th RGB frame is formulated as Frame_i, the i-th residual frame can be defined as:

ResFrame_i = α · |Frame_i − Frame_{i+1}|,  if α · |Frame_i − Frame_{i+1}| < β
ResFrame_i = β,                            otherwise

where α is an expanding coefficient that highlights the frame differences, and β limits the maximum residual pixel value (β = 255 in experiments) to prevent numerical overflow. Some RGB frames and the corresponding residual frames are shown in Figure 1. The first row shows the normal RGB images, in which the red arrow indicates the approximate location of smoke. The smoke area in residual frames usually contains light cyan components. Compared with RGB frames, the information in residual images is relatively sparse and mainly concentrated on moving objects. The subtle smoke features are enhanced by the expanding coefficient α, and the maximum pixel value is then limited to 255 by β. These operations not only highlight the characteristics of smoke, but also suppress interference such as steam. For each frame, we only need to compute the difference with the next frame; the computational cost is negligible compared with convolutional neural network latency or optical flow calculation.

Figure 1. Input RGB frames (the top row) in the RISE dataset and corresponding residual frames (the bottom row). For clarity, red and blue arrows are used to mark smoke and steam (non-smoke), respectively, in the input images.

Temporal path
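The residual-frame rule above can be sketched in a few lines. Note that α = 5 is an assumed value chosen for illustration; the paper only fixes β = 255:

```python
import numpy as np

def residual_frame(frame_i: np.ndarray, frame_next: np.ndarray,
                   alpha: float = 5.0, beta: float = 255.0) -> np.ndarray:
    """Amplify the absolute difference of adjacent RGB frames by alpha
    and clip at beta, per the piecewise definition above.
    alpha = 5 is an assumed expanding coefficient (not given in the paper);
    beta = 255 matches the reported experimental setting."""
    diff = alpha * np.abs(frame_i.astype(np.float32) - frame_next.astype(np.float32))
    return np.minimum(diff, beta).astype(np.uint8)
```

Computing the difference in float before clipping avoids the wrap-around that subtracting uint8 arrays directly would cause.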
In parallel to the spatial path, the temporal path is another CNN designed to capture smoke motion features. After the residual frames are obtained, a convolutional neural network can extract motion features from the frame differences. To reduce the complexity of the network structure design, the architecture of the temporal network is the same as that of the spatial network; to capture different spatial and temporal information, the weights of the spatial and temporal branches are different. The spatial path focuses on the smoke appearance and background information, while the temporal branch focuses on the moving regions. Our spatio-temporal cross network architecture is generic and can be instantiated with different backbones, such as SE-ResNeXt and MobileNetV2. Generally, a complex backbone obtains better recognition performance, while the advantage of a lightweight model lies in computational speed. An STCNet example using SE-ResNeXt-50 as the backbone is specified in Table Ⅰ.

C. Multi-level Structure
A deep CNN generates feature maps layer by layer, and with pooling layers the feature maps have different sizes and depths. The high-resolution in-network feature maps carry low-level features with detailed spatial information, while the low-resolution feature maps carry high-level semantic features with stronger representational capacity for object recognition. The Feature Pyramid Network (FPN) [45] is a common technique in object detection and image classification tasks to improve the feature representation ability of a CNN.
TABLE Ⅰ
AN EXAMPLE INSTANTIATION OF THE STCNET.

Stage  | Spatial path                          | Temporal path                         | Output size (T×C×H×W)
data   | -                                     | -                                     | T×3×224×224
conv1  | 7×7, 64, stride 2                     | 7×7, 64, stride 2                     | T×64×112×112
pool1  | 3×3 max, stride 2                     | 3×3 max, stride 2                     | T×64×56×56
res1   | [1×1, 128; 3×3, 128; 1×1, 256] ×3     | [1×1, 128; 3×3, 128; 1×1, 256] ×3     | T×256×56×56
res2   | [1×1, 256; 3×3, 256; 1×1, 512] ×4     | [1×1, 256; 3×3, 256; 1×1, 512] ×4     | T×512×28×28
res3   | [1×1, 512; 3×3, 512; 1×1, 1024] ×6    | [1×1, 512; 3×3, 512; 1×1, 1024] ×6    | T×1024×14×14
res4   | [1×1, 1024; 3×3, 1024; 1×1, 2048] ×3  | [1×1, 1024; 3×3, 1024; 1×1, 2048] ×3  | T×2048×7×7
fuse   | conv                                  | conv                                  | T×256×7×7
cls    | conv, adaptive average pool           |                                       |
out    | fully connected layer                 |                                       |

In STCNet, in order to effectively fuse spatial and temporal features at multiple scales, a dual pyramid fusion structure of spatio-temporal features was designed for the dual paths. Our method takes T RGB and residual frames as input, and outputs spatial and temporal feature maps at several scales with a scaling step of 2. There are often many residual blocks generating feature maps of the same size, and we combine these blocks into the same stage (as shown in Table Ⅰ). For our multi-level structure, the output of the last layer of each residual stage in the spatial and temporal paths is chosen to build the dual feature pyramid.
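The per-stage output sizes implied by the stage layout above can be checked with a small helper. This is an illustrative sketch: the strides {4, 8, 16, 32} and channel widths {256, 512, 1024, 2048} follow the table and the pyramid description, and the function name is our own:

```python
def pyramid_shapes(input_hw=(224, 224), strides=(4, 8, 16, 32),
                   channels=(256, 512, 1024, 2048)):
    """Per-stage (C, H, W) output sizes of the four residual stages for one
    224x224 frame. The first three levels build the dual pyramid; the
    stride-32 (7x7) map feeds the classification head instead."""
    h, w = input_hw
    return [(c, h // s, w // s) for c, s in zip(channels, strides)]
```

For a 224×224 input this yields 56×56, 28×28, 14×14 and 7×7 maps, matching the scaling step of 2 between adjacent pyramid levels.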
Figure 2. The network architecture of the multi-scale spatio-temporal feature pyramid (SE-ResNeXt-50 as backbone).

Specifically, each residual stage of SE-ResNeXt-50 is denoted as {res1, res2, res3, res4}. Their outputs have strides of {4, 8, 16, 32} with respect to the input frame. The first three pairs of feature maps are selected to build the dual pyramid. The last 7×7 feature map is fused for the final classification head prediction, rather than serving as the top layer of the pyramid. In addition, we also design a dual pyramid variant, in which connections fuse from the temporal to the spatial path. Detailed results will be reported in Section Ⅳ-D.

D. Cross Operation
As shown in Figure 2, a spatio-temporal dual pyramid is constructed to improve smoke feature recognition ability. In this structure, the feature maps of the spatial and temporal paths are summed with each other to participate in model inference. We assume that the dual pathways can guide each other and make progress together in this way. Specifically, the sum fusion of two feature maps can be defined as:

SF^{sum}_{i,j,c} = TF^{sum}_{i,j,c} = SF_{i,j,c} + TF_{i,j,c}

where 1 ≤ i ≤ H, 1 ≤ j ≤ W, 1 ≤ c ≤ C, and SF, TF ∈ ℝ^{C×H×W}. Since the feature maps to be fused come from the same locations of neural networks with the same architecture, the sum fusion is element-wise addition between two feature maps. Although the summed features in the two paths are the same, the focuses of the two branches are different. In Section Ⅳ-E, we will show the activated regions on the 7×7 feature maps of the two branches by Grad-CAM [46], which helps us analyze whether the model works as expected. In order to make the research more convincing, we discuss different fusion methods between the two pathways. Conv fusion is another common feature fusion method; we also tried changing the orange arrows and the addition operation in Figure 2 into convolutional fusion. However, two problems were exposed: slow convergence and reduced accuracy. Therefore, identity mapping is used for the orange arrows in this work.
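The sum fusion is symmetric: after the cross operation, both paths continue from the same map. A minimal sketch of one such cross operation (illustrative, not the released code):

```python
import numpy as np

def cross_sum(sf: np.ndarray, tf: np.ndarray):
    """Element-wise sum fusion: SF'_{i,j,c} = TF'_{i,j,c} = SF_{i,j,c} + TF_{i,j,c}.
    The two maps must share the same C x H x W shape, which holds because the
    spatial and temporal backbones are architecturally identical."""
    assert sf.shape == tf.shape, "cross fusion requires same-scale feature maps"
    fused = sf + tf
    return fused, fused.copy()  # next inputs to the spatial and temporal paths
```

In the full dual pyramid this operation is applied at every selected level, with identity mapping (rather than a learned convolution) on the lateral connections.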
Figure 3. Framework of the proposed Spatio-Temporal Cross Network (STCNet) for industrial smoke detection.
IV. EXPERIMENTS
In this section, we evaluate the proposed methods on the RISE industrial smoke emission dataset. Experiments were conducted on a personal computer with an Intel Core i7-9700 CPU and an NVIDIA RTX 2080 Ti GPU, using the PyTorch framework.
A. RISE Dataset
The RISE video smoke dataset [29] is the first large-scale video dataset for recognizing industrial smoke emissions. It contains 12,567 clips with 19 distinct views from cameras on three sites monitoring three different industrial facilities. The clips come from 30 days in the daytime, spanning four seasons across two years. RISE is a challenging video classification dataset, as it covers various characteristics of smoke emissions, including opacity and color, under diverse weather (e.g., haze, fog, snow, cloud) and lighting conditions. Moreover, RISE involves distractions from various types of steam, which can be similar to smoke and challenging to distinguish.

B. Implementation Details
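Training, as detailed below, uses plain SGD with momentum; under the stated hyperparameters (lr = 0.001, momentum = 0.9, weight decay = 0.0005), a single parameter update amounts to the following generic sketch (not the authors' training loop):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One SGD step with heavy-ball momentum and L2 weight decay, using the
    hyperparameters reported in this subsection."""
    g = grad + weight_decay * w          # weight decay folded into the gradient
    velocity = momentum * velocity + g   # momentum accumulation
    return w - lr * velocity, velocity
```

The weight decay term here follows the common convention of adding λ·w to the gradient before the momentum update; framework implementations may differ in this detail.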
Stochastic Gradient Descent (SGD) with a mini-batch size of 3 is used to optimize the model weights. The weight decay is 0.0005 and the momentum is 0.9. We start from a learning rate of 0.001 and initialize the backbone with an SE-ResNeXt-50 pre-trained on ImageNet, chosen for its balance between accuracy and efficiency. In addition, the proposed framework has also been tested with other backbone models, such as MobileNetV2 [47].

C. Comparisons on RISE
We compare the proposed method with primary video understanding methods [10] [12] [14] [16] [35]. The comparison results are shown in Table Ⅱ. The abbreviations ND and FP denote no data augmentation and frame perturbation, respectively; TSM, LSTM, NL and TC denote the Temporal Shift Module [12], Long Short-Term Memory layers [10], the Non-Local module [14], and Timeception layers [35], respectively. The results in Table Ⅱ show that a plain CNN model based on SE-ResNeXt-50 already outperforms the original baseline methods, indicating that a stronger backbone is helpful for smoke recognition. When adopting 8 frames as input, our STCNet gains a 6.2% higher F-score than the RGB-I3D-TC method, which confirms the remarkable ability of STCNet for smoke temporal modeling. Data augmentation can further improve the performance of STCNet; in the training stage, the same data augmentation methods as in [29] (horizontal flipping, random resizing and cropping, perspective transformation, area erasing, and color jittering) are used by default. Moreover, the parameters, FLOPs, latency and throughput of each method are reported in Table Ⅲ. The proposed STCNet can process 42.57 videos per second. Although the STCNet model with the SE-ResNeXt backbone has no obvious advantage in processing speed, the STCNet based on MobileNetV2 achieves a much faster speed than the other methods: 109.7 videos per second, as shown in Table Ⅲ.
TABLE Ⅱ
F-SCORES FOR COMPARING DIFFERENT METHODS ON THE RISE DATASET.

Model                 S    S    S    S    S    S    Average
Flow-SVM             .42  .59  .47  .63  .52  .47   .517
Flow-I3D             .55  .58  .51  .68  .65  .50   .578
RGB-SVM              .57  .70  .67  .67  .57  .53   .618
RGB-I3D [16]         .80  .84  .82  .87  .82  .75   .817
RGB-I3D-ND [16]      .76  .79  .81  .86  .76  .68   .777
RGB-I3D-FP [16]      .76  .81  .82  .87  .81  .71   .797
RGB-I3D-TSM [12]     .81  .84  .82  .87  .80  .74   .813
RGB-I3D-LSTM [10]    .80  .84  .82  .85  .83  .74   .813
RGB-I3D-NL [14]      .81  .84  .83  .87  .81  .74   .817
RGB-I3D-TC [35]      .81  .84  .84  .87  .81  .77   .823
Plain SE-ResNeXt     .83  .82  .84  .85  .78  .83   .826
STCNet (MobileNetV2) .86  .88  .87  .89  .84  .86   .868
STCNet (SE-ResNeXt)  .88  .89  .90  .90  .86  .88   .885
TABLE Ⅲ
COMPARISON WITH OTHER METHODS ON THE RISE DATASET.

Model               Backbone        Params  FLOPs  Latency   Throughput    Average
RGB-I3D [16]        Inception I3D   12.3M   62.7G  30.56 ms  32.71 vid/s   .817
RGB-I3D-TSM [12]    Inception I3D   12.3M   62.7G  31.85 ms  31.40 vid/s   .813
RGB-I3D-LSTM [10]   Inception I3D   38.0M   62.9G  31.01 ms  32.25 vid/s   .813
RGB-I3D-NL [14]     Inception I3D   12.3M   62.7G  30.32 ms  32.98 vid/s   .817
RGB-I3D-TC [35]     Inception I3D   12.3M   62.7G  30.41 ms  32.88 vid/s   .823
Plain SE-ResNeXt    SE-ResNeXt-50   26.6M   34.4G  22.10 ms  45.25 vid/s   .826
STCNet (Proposed)   MobileNetV2     -       -      -         109.7 vid/s   .868
STCNet (Proposed)   SE-ResNeXt-50   27.2M   34.6G  23.49 ms  42.57 vid/s   .885

D. Comparison with other variants
At the beginning of the architecture design, we tried several schemes to gradually explore the optimal detection architecture. Here, three STCNet variants are designed for comparison, denoted STCNet-A (Figure 4), STCNet-B (Figure 5) and STCNet-C (Figure 6). Among them, STCNet-A is a typical two-stream network, which sums the output feature maps of the two pathways to predict the probability; the difference is that there is no optical flow input, but rather the residual frames calculated as in Section Ⅲ-B. STCNet-B is equipped with a unidirectional feature fusion method, which only fuses temporal features into the spatial path; the aim of STCNet-B is to verify the efficiency of the fusion operation from the spatial to the temporal path. In addition, STCNet-C performs spatio-temporal feature fusion only after the first residual block, without our multi-scale feature fusion operation. The three variant models are trained and tested on the RISE dataset with the same hyperparameters. The results are reported in Table Ⅳ. With the multi-scale fusion architecture, STCNet improves by 1.3% over STCNet-A. After adding the multi-scale fusion operation from the temporal path to the spatial path of STCNet-A, the F-score of STCNet-B improves to 0.882, but it is still 0.3% behind the bidirectionally fused STCNet. Finally, the F-score of STCNet-C, which uses only shallow feature fusion, drops by 0.7%. These ablation experiments prove the effectiveness of the multi-scale spatio-temporal feature fusion operation in STCNet.
Figure 4. Framework of STCNet-A (similar to the two-stream network [39]).

Figure 5. Framework of STCNet-B, in which internal connections fuse from the temporal to the spatial path.

Figure 6. Framework of STCNet-C, which performs single-scale feature fusion near the residual frame input.
TABLE Ⅳ
F-SCORES FOR COMPARING DIFFERENT VARIANTS ON THE RISE DATASET.

Model     S    S    S    S    S    S    Average
STCNet-A .87  .88  .89  .89  .84  .86   .872
STCNet-B .88  .89  .90  .90  .85  .87   .882
STCNet-C .88  .89  .90  .88  .85  .86   .877
STCNet   .88  .89  .90  .90  .86  .88   .885

E. Visualization of the Two Paths
To diagnose where each pathway of STCNet attends, we apply Gradient-weighted Class Activation Mapping (Grad-CAM [46]) to visualize the active regions in the input frames.
Figure 7. Grad-CAM visualization for the spatial and temporal pathways. For clarity, red and blue arrows mark smoke and steam (non-smoke), respectively, in the input images. The second row shows the visualization for the spatial path, and the last row for the temporal path.
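Given a convolutional layer's activations and the gradients of the class score with respect to them, the Grad-CAM heatmap used for Figure 7 can be sketched in a few lines (a minimal numpy version of the published formulation, not the paper's implementation):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from a conv layer's activations and gradients.

    activations, gradients: (K, H, W) arrays -- the feature maps A_k and
    the gradients of the class score w.r.t. them.
    Returns an (H, W) heatmap, ReLU'd and normalized to [0, 1].
    """
    weights = gradients.mean(axis=(1, 2))             # alpha_k: global-average-pooled gradients
    cam = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A_k
    cam = np.maximum(cam, 0.0)                        # ReLU keeps positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: two channels, only channel 0 supports the "smoke" class.
acts = np.zeros((2, 4, 4)); acts[0, 1, 1] = 2.0
grads = np.zeros((2, 4, 4)); grads[0] = 1.0
cam = grad_cam(acts, grads)
print(cam.shape, cam[1, 1])  # (4, 4) 1.0
```

The heatmap is then upsampled to the input resolution and overlaid on the frame, which is how the per-pathway attention maps in Figure 7 are produced.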
Figure 8. False negative cases in the testing set. The undetected smoke regions are marked by red arrows.
Figure 9. False positive cases in the testing set. Areas that cause false positives are marked with red dotted arrows.
Finally, we show some false negative and false positive cases in Figure 8 and Figure 9, respectively, and analyze the deficiencies of the proposed method through them. In Figure 8, red arrows mark smoke regions that cannot be detected. The smoke in the first and third rows is almost completely obscured by steam, and the smoke in the second and third rows is small in amount and short in duration; these may be the reasons why the model fails to detect it. Some false positive cases are shown in Figure 9, where red dotted arrows mark the regions that cause false alarms. In the first two rows, light steam leads to false detections. The third cause of false alarms is large moving fog, which may mislead the model into wrong predictions. In fact, detecting steam or smoke that covers a small area or lasts only a short time remains a challenging task.

CONCLUSION
In this work, a novel Spatio-Temporal Cross Network (STCNet) was designed and verified for the video smoke detection task. Inspired by two-stream methods, a spatio-temporal dual pyramid architecture was proposed to enable efficient spatio-temporal information fusion. Extensive experimental results on a challenging smoke dataset demonstrated the remarkable ability of STCNet in smoke detection. However, the task of industrial smoke detection is far from solved, and many challenges remain, such as the discrimination between smoke and steam. In further research, we will study fine-grained classification between smoke and steam in videos.

REFERENCES
[1] E. Jang, Y. Kang, J. Im, D.-W. Lee, J. Yoon, and S.-K. Kim, "Detection and Monitoring of Forest Fires Using Himawari-8 Geostationary Satellite Data in South Korea," Remote Sensing, vol. 11, no. 3, Art. no. 3, Jan. 2019, doi: 10.3390/rs11030271.
[2] H. Tian, W. Li, P. O. Ogunbona, and L. Wang, "Detection and Separation of Smoke From Single Image Frames," IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1164–1177, Mar. 2018, doi: 10.1109/TIP.2017.2771499.
[3] F. Yuan, L. Zhang, X. Xia, Q. Huang, and X. Li, "A Wave-Shaped Deep Neural Network for Smoke Density Estimation," IEEE Transactions on Image Processing, 2019, doi: 10.1109/TIP.2019.2946126.
[4] Y. Liu, W. Qin, K. Liu, F. Zhang, and Z. Xiao, "A Dual Convolution Network Using Dark Channel Prior for Image Smoke Classification," IEEE Access, vol. 7, pp. 60697–60706, 2019, doi: 10.1109/ACCESS.2019.2915599.
[5] K. Dimitropoulos, P. Barmpoutis, and N. Grammalidis, "Higher Order Linear Dynamical Systems for Smoke Detection in Video Surveillance Applications," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 5, pp. 1143–1154, May 2017, doi: 10.1109/TCSVT.2016.2527340.
[6] G. Lin, Y. Zhang, G. Xu, and Q. Zhang, "Smoke Detection on Video Sequences Using 3D Convolutional Neural Networks," Fire Technology, Feb. 2019, doi: 10.1007/s10694-019-00832-w.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems 25, 2012, pp. 1097–1105.
[8] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-Based Models for Speech Recognition," in Advances in Neural Information Processing Systems 28, 2015, pp. 577–585.
[9] Y. Goldberg, "Neural Network Methods for Natural Language Processing," Synthesis Lectures on Human Language Technologies, vol. 10, no. 1, pp. 1–309, Apr. 2017.
[10] J. Donahue et al., "Long-Term Recurrent Convolutional Networks for Visual Recognition and Description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2625–2634.
[11] B. Zhou, A. Andonian, A. Oliva, and A. Torralba, "Temporal Relational Reasoning in Videos," in Computer Vision – ECCV 2018, 2018, pp. 831–846.
[12] J. Lin, C. Gan, and S. Han, "TSM: Temporal Shift Module for Efficient Video Understanding," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), Oct. 2019, pp. 7082–7092, doi: 10.1109/ICCV.2019.00718.
[13] M. Zolfaghari, K. Singh, and T. Brox, "ECO: Efficient Convolutional Network for Online Video Understanding," in Computer Vision – ECCV 2018, 2018, pp. 713–730.
[14] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local Neural Networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, Jun. 2018, pp. 7794–7803, doi: 10.1109/CVPR.2018.00813.
[15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning Spatiotemporal Features with 3D Convolutional Networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 4489–4497, doi: 10.1109/ICCV.2015.510.
[16] J. Carreira and A. Zisserman, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6299–6308.
[17] C. Long et al., "Transmission: A New Feature for Computer Vision Based Smoke Detection," in Artificial Intelligence and Computational Intelligence, 2010, pp. 389–396.
[18] F. N. Yuan, "A double mapping framework for extraction of shape-invariant features based on multi-scale partitions with AdaBoost for video smoke detection," Pattern Recognition, vol. 45, no. 12, pp. 4326–4336, Dec. 2012.
[19] H. Tian, W. Li, L. Wang, and P. Ogunbona, "Smoke Detection in Video: An Image Separation Approach," International Journal of Computer Vision, vol. 106, no. 2, pp. 192–209, Jan. 2014.
[20] H. Tian, W. Li, P. Ogunbona, and L. Wang, "Single Image Smoke Detection," in Computer Vision – ACCV 2014, 2015, pp. 87–101.
[21] A. Sobral and A. Vacavant, "A comprehensive review of background subtraction algorithms evaluated with synthetic and real videos," Computer Vision and Image Understanding, vol. 122, pp. 4–21, May 2014, doi: 10.1016/j.cviu.2013.12.005.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[23] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated Residual Transformations for Deep Neural Networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Jul. 2017, pp. 5987–5995, doi: 10.1109/CVPR.2017.634.
[24] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as Points," arXiv:1904.07850, 2019.
[25] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7263–7271.
[26] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv:1804.02767, Apr. 2018. [Online]. Available: http://arxiv.org/abs/1804.02767
[27] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "CenterNet: Keypoint Triplets for Object Detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6569–6578.
[28] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[29] Y.-C. Hsu et al., "RISE Video Dataset: Recognizing Industrial Smoke Emissions," arXiv:2005.06111, May 2020. [Online]. Available: http://arxiv.org/abs/2005.06111
[30] N. Ketkar, Introduction to PyTorch. Berkeley, CA: Apress, 2017, pp. 195–208. [Online]. Available: https://doi.org/10.1007/978-1-4842-2766-4_12
[31] O. Barnich and M. Van Droogenbroeck, "ViBe: A Universal Background Subtraction Algorithm for Video Sequences," IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1709–1724, Jun. 2011, doi: 10.1109/TIP.2010.2101613.
[32] L. Wang et al., "Temporal Segment Networks: Towards Good Practices for Deep Action Recognition," in Computer Vision – ECCV 2016, 2016, pp. 20–36, doi: 10.1007/978-3-319-46484-8_2.
[33] T.-Y. Lin et al., "Microsoft COCO: Common Objects in Context," in Computer Vision – ECCV 2014, 2014, pp. 740–755.
[34] A. Garcia-Garcia et al., "A review on deep learning techniques applied to semantic segmentation," arXiv:1704.06857, 2017.
[35] N. Hussein, E. Gavves, and A. W. M. Smeulders, "Timeception for Complex Action Recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 254–263.
[36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal Loss for Dense Object Detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.
[37] F. Yuan, X. Xia, J. Shi, H. Li, and G. Li, "Non-Linear Dimensionality Reduction and Gaussian Process Based Classification Method for Smoke Detection," IEEE Access, vol. 5, pp. 6833–6841, 2017.
[38] Z. Yin, B. Wan, F. Yuan, X. Xia, and J. Shi, "A Deep Normalization and Convolutional Neural Network for Image Smoke Detection," IEEE Access, vol. 5, pp. 18429–18438, 2017.
[39] K. Simonyan and A. Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos," in Advances in Neural Information Processing Systems 27, 2014, pp. 568–576.
[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems 25, 2012, pp. 1097–1105.
[41] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-Based Models for Speech Recognition," in Advances in Neural Information Processing Systems 28, 2015, pp. 577–585.
[42] Y. Goldberg, "Neural Network Methods for Natural Language Processing," Synthesis Lectures on Human Language Technologies, vol. 10, no. 1, pp. 1–309, Apr. 2017.
[43] Y. Zhao, J. Ma, X. Li, and J. Zhang, "Saliency Detection and Deep Learning-Based Wildfire Identification in UAV Imagery," Sensors, vol. 18, no. 3, Feb. 2018.
[44] G. Xu, Q. Zhang, D. Liu, G. Lin, J. Wang, and Y. Zhang, "Adversarial Adaptation From Synthesis to Reality in Fast Detector for Smoke Detection," IEEE Access, vol. 7, pp. 29471–29483, 2019.
[45] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2117–2125.
[46] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 618–626, doi: 10.1109/ICCV.2017.74.
[47] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," arXiv:1801.04381, Mar. 2019. [Online]. Available: http://arxiv.org/abs/1801.04381