Localizing Anomalies from Weakly-Labeled Videos
Hui Lv, Chuanwei Zhou, Chunyan Xu, Zhen Cui, Jian Yang
Abstract—Video anomaly detection under video-level labels is currently a challenging task. Previous works have made progress on discriminating whether a video sequence contains anomalies. However, most of them fail to accurately localize the anomalous events within videos in the temporal domain. In this paper, we propose a Weakly Supervised Anomaly Localization (WSAL) method focusing on temporally localizing anomalous segments within anomalous videos. Inspired by the appearance difference in anomalous videos, the evolution of adjacent temporal segments is evaluated for the localization of anomalous segments. To this end, a high-order context encoding model is proposed to not only extract semantic representations but also measure the dynamic variations, so that the temporal context can be effectively utilized. In addition, in order to fully utilize the spatial context information, the immediate semantics are directly derived from the segment representations. The dynamic variations as well as the immediate semantics are efficiently aggregated to obtain the final anomaly scores. An enhancement strategy is further proposed to deal with noise interference and the absence of localization guidance in anomaly detection. Moreover, to meet the diversity requirement of anomaly detection benchmarks, we also collect a new traffic anomaly (TAD) dataset which specializes in traffic conditions, differing greatly from the current popular anomaly detection evaluation benchmarks. Extensive experiments are conducted to verify the effectiveness of the different components, and our proposed method achieves new state-of-the-art performance on the UCF-Crime and TAD datasets.
Index Terms—Anomaly Detection, Anomaly Localization, Weak Supervision, Traffic Anomaly Dataset.
I. INTRODUCTION
Anomaly detection, which aims to recognize behaviors or appearance patterns that do not conform to usual patterns [1], [2], [3], is of great importance for alerting to potential risks or dangers. With the large-scale deployment of surveillance, an urgent requirement of intelligent systems is to automatically filter out possibly abnormal events. Anomaly detection is typically tackled under constrained supervision, where only normal data or limited annotations are provided in the training phase [5], [6], [4], [7], [8], [9]. As anomalous events rarely happen in real-life situations, which brings in the scarcity of annotations, several methods [7], [8], [9] have been proposed to model the shared pattern among normal videos in the training phase and detect outliers as anomalies during testing. However, these methods often fail to identify anomalies when facing complicated or unseen scenes.
Corresponding author: Zhen Cui, [email protected]. Email address: {hubrthui, cwzhou, cyx, csjyang}@njust.edu.cn (H. Lv, C. Zhou, C. Xu, J. Yang). H. Lv, C. Zhou, C. Xu, Z. Cui and J. Yang are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China. The dataset and the benchmark test codes, as well as experimental results, will be made public as soon as the paper is accepted.
Fig. 1. Anomaly localization comparisons. Left: A comparison on a Burglary case of UCF-Crime (the x-axis corresponds to frames and the y-axis to the anomaly score). Groundtruth is shown in the top-left, followed by three methods: Sultani et al. [4], Zhong et al. [5] and ours. Right: ROC curves of frame-level anomaly localization on all anomalous videos.

Recently, researchers [4] have chosen to leverage video-level labels for developing robust anomaly detectors. The release of the UCF-Crime dataset [4] activated this direction, which encourages detectors to make the best of the weak video-level signals. Although a large gain has been observed in this domain [6], [5], there is still no efficient way to temporally localize anomalous frames.

In previous methods, the performance on the overall test set is calculated and reported as the evaluation result. However, in this case the temporal anomaly localization capability of detectors remains somewhat unrevealed. Since the whole test set contains both normal and anomalous videos, the superior performance on normal videos conceals the poor accuracy of anomaly localization within anomalous videos. To reveal the problem therein, we conduct a statistical analysis on the anomalous data of the UCF-Crime test set. ROC curves of two state-of-the-art (SOTA) methods, as well as ours, are plotted in Figure 1. The details of the corresponding metrics can be found in Section IV-B. A test sample (video name: Burglary079) is also shown in the left part of the figure. We find that the localization accuracies of the two methods on anomalous videos are 54.25% and 59.02% respectively, in terms of AUC. It is worth mentioning that an AUC of 50% can be obtained by random binary prediction of anomalies. To sum up, there exists a large space for improving the temporal localization of anomalies.

To facilitate the localization property of anomaly detection, we propose a Weakly-Supervised Anomaly Localization (WSAL) method to detect anomalies with video-level labels. In our WSAL model, we investigate two aspects of the anomaly: the semantics and the context. Anomalies are defined as uncommon activities that differ from the usual pattern. Thus, the extracted semantics can act as a direct cue to infer anomalies. Based on this point, existing methods [4], [5] treat each video as frame-by-frame images or direct optical flows and extract fine-grained semantic representations for further anomaly detection. In this manner, however, the temporal evolution across consecutive frames is not adequately exploited. For example, in the long temporal domain, a sudden change of the dynamic variation uncovers the anomaly itself.

On the other hand, owing to the rough supervisory signals at the video level, anomaly detectors are prone to false alarms or missed detections. For instance, drastic environment changes as well as noise interruptions caused by hardware failure may lead to unwanted high probabilities from anomaly detectors. These influences in long and untrimmed videos ought to be suppressed or excluded from the anomalies. Toward this end, we put forward a noise simulation strategy to tackle the inevitable interference lying in untrimmed videos, whose quality cannot be guaranteed. Moreover, we introduce hand-crafted anomalies, similar to actual anomalies, to provide pseudo location signals as guidance for the model learning process.
The above two strategies make up our enhancement strategy to boost the weakly-supervised learning and strengthen the robustness of anomaly detection. Thoroughly, we equip the raw video data with the augmentations of video noises and hand-crafted anomalies. As a consequence, the weak labels are expanded with pseudo location signals as auxiliary supervision.

So far, there are few datasets available for anomaly detection, and most of them have small scales or constrained scenarios, like UCSD Peds [10], Avenue [11], ShanghaiTech [8], and Street Scene [12]. Also, these datasets were initially used for semi-supervised anomaly detection with normal training samples. For the problem under the video-level scenario, only the UCF-Crime [4] dataset is publicly available to our knowledge. Thus, we build a new large-scale traffic anomaly detection (TAD) dataset with long surveillance videos under traffic scenes. The proposed dataset consists of realistic anomalies on roads with various appearance and motion patterns, which meets the diversity requirement of anomaly detection benchmarks. In addition, we implement and compare different SOTA anomaly detection approaches on the UCF-Crime and our TAD dataset. We hope the newly collected benchmark will boost the development of anomaly detection in the research domain and in real-life applications. The main contributions of this paper are as follows:

1) Deeply delving into anomaly detection, we propose a weakly-labeled anomaly localization method, in which we employ a high-order context encoding model to encode temporal variations as well as high-level semantic information for weakly-supervised anomaly detection;

2) We introduce a weak-supervision enhancement strategy by simulating video noises and building virtual indicative locations to suppress or exclude the interruptions of false-anomaly signals;

3) We build a new weakly-labeled traffic anomaly detection dataset with extensive benchmark tests, and report new SOTA results on the proposed TAD dataset as well as the UCF-Crime dataset.
The rest of the paper is organized as follows: In Section II, we review the literature of anomaly detection in surveillance videos. In Section III, we introduce the proposed WSAL method in detail. In Section IV, we conduct experiments to compare our proposed method with other SOTA methods, along with elaborated ablation studies to fully analyze the different components. Finally, we conclude the paper in Section V.

II. RELATED WORK
The techniques of anomaly detection in surveillance videos have long been developed as a tool for mining unusual patterns in videos [13], [14], [15], [10], [16]. The family can be divided into two categories, based on how and how much supervision is accessible. The details are discussed in the following.

Video anomaly detectors were originally designed in an unsupervised manner [9], [17], [18], [19], [20], where only normal samples are available in the training phase without any labels. They first involve modeling normal behavior and then detecting samples that deviate from it. Motion trajectory, as one of the common basic factors, has been utilized to detect anomalies in [21], [22], [15]. Although such methods can be easily implemented and have a fast execution speed, tracking is prone to failure in crowded or cluttered scenes. An alternative approach is to tackle the original task as a problem of novelty detection, e.g., sparse coding [23], [11], [13], distance-based methods [24], the mixture of dynamic models on texture [25] and the mixture of probabilistic PCA [26]. These models are generally built on low-level features (e.g., the histogram of oriented gradients (HOG) and the histogram of oriented flows (HOF)) extracted from densely sampled image patches. Several recent approaches have investigated learning-based features using autoencoders [27], [28], which minimize reconstruction errors on the normal patterns in the training process. Shi et al. [29] proposed to modify the original LSTM into ConvLSTM and used it for precipitation forecasting. Liu et al. [7] designed a future prediction network to infer the coming frames and detect anomalies according to the quality of the predicted frames.
Despite the advances in developing unsupervised anomaly detection approaches, these detectors easily fall down when dealing with complicated or unseen environments.

Compared with approaches that build their detection models on normal behavior only, various methods based on the weak supervision situation have been introduced; they employ both normal and abnormal data along with video-level annotations for building robust anomaly detection models [30], [4], [31], [6], [5]. Among them, Multiple Instance Learning (MIL) is introduced for pattern modeling under weak supervision [4], [31], [6]. Sultani et al. [4] consider anomaly detection as a MIL problem with a novel ranking loss function. Later, by extending it, Zhu et al. [6] introduce the attention mechanism for better localizing anomalies. Due to the absence of anomaly positions in the training phase, these two methods cannot predict anomalous frames well. For this, Zhong et al. [5] attempt to construct supervised signals of anomaly positions through iteratively refining them. However, these methods focus on predicting segment labels while neglecting the modeling of hidden temporal context information. In order to explore the anomaly information in video sequences, we propose high-order context encoding through modeling variations on the context of sequences, and incorporate this cue with semantic cues to better localize anomalies. Besides, we introduce a weak-supervision enhancement strategy to suppress false-anomaly signals.

III. THE PROPOSED METHOD
In this section, we introduce our Weakly-Supervised Anomaly Localization (WSAL) method in detail. We first give the basic formulation for anomaly localization, and the core modules of our WSAL are then elaborated thoroughly.
A. Formulation
The purpose of anomaly detection is to estimate the anomaly status of a video and localize the anomalies in the video sequence if they exist. In the weakly supervised scenario, a video sequence $\mathcal{X}$ and its corresponding video-level annotation $y \in \{0, 1\}$ are given, where the case `$y=1$' means there exists an anomaly in this sequence, while `$y=0$' indicates that there is no anomaly in $\mathcal{X}$. We start by dividing the entire video into several segments of equal length, denoted as $\mathcal{X} = (X_1, X_2, \cdots, X_m)$. The goal of video segmenting is to alleviate the computation burden resulting from almost-repetitive video frames. For the $i$-th segment $X_i$, we first use a classical convolutional network to extract features for each frame, and the segment feature $x_i$ is obtained by aggregating the features of all frames within the segment. As a consequence, the sequence $\mathcal{X}$ can now be represented by the $m$-tuple of features $(x_1, x_2, \cdots, x_m)$. We can then use this $m$-tuple to determine whether the current video contains any anomaly or not, by assigning each segment in the video an anomaly score indicating the probability of being anomalous.

To predict the state (normal or abnormal) of a video, we derive a novel function to describe the video by estimating the anomalous margin within a video. Formally,

$$S(\mathcal{X}) = \max_{i,j=1,\dots,m} f\big(\psi(x_{i-k}, \dots, x_i, \dots, x_{i+k}),\ \psi(x_{j-k}, \dots, x_j, \dots, x_{j+k})\big), \quad (1)$$

where
- $\psi$ is a high-order function that encodes an anchored segment as well as its adjacent $k$ segments in the temporal context. To mine the anomalies, we consider two aspects of information: spatial semantics and temporal variations. The function $\psi$ is modeled with a high-order dynamic regression to generate semantic features and predict variations within the local window $[-k, k]$. Please see Section III-B for more details.
- $f$ is a margin distance metric measuring the anomaly score margin between the segment positions $i$ and $j$. The closer the predicted anomaly scores are, the smaller the distance is.
- $S(\cdot)$ is the score of a video, computed as the maximum relative distance over pairwise positions. The scores of normal videos are expected to be smaller than those of anomalous videos. Thus, the maximum-distance strategy constrains entire normal videos to be smoother than anomalous videos, which complies with the conventional assumption.
- The max function is chosen to capture the largest score margin, which represents the extent of abnormality in a video. Since anomaly scores are all close to zero in a normal video, the score margin takes a small value, whereas in an anomalous video the anomalies, lying within a normal background, lead to a large score margin.

Given a batch of training data $\{\mathcal{X}^{(1)}, \mathcal{X}^{(2)}, \cdots, \mathcal{X}^{(n)}\}$ and the corresponding video labels $\{y^{(1)}, y^{(2)}, \cdots, y^{(n)}\}$, we define a margin loss function as:

$$\zeta(\{\mathcal{X}^{(i)}\}|_{i=1}^{n}) = \max\Big\{0,\ 1 - \frac{1}{n_1}\sum_{i=1}^{n}\big[S(\mathcal{X}^{(i)})\,\big|\,y^{(i)}=1\big] + \frac{1}{n_0}\sum_{j=1}^{n}\big[S(\mathcal{X}^{(j)})\,\big|\,y^{(j)}=0\big]\Big\}, \quad (2)$$

where $n_1$, $n_0$ are the total numbers of anomalous and normal samples. As the function only depends on video-level labels, the learning process belongs to the case of weak supervision.

In addition, we augment the training samples to generate two types of data: noise data $\{\ddot{\mathcal{X}}^{(i)}\}|_{i=1}^{\ddot{n}}$ and pseudo-location data $\{\breve{\mathcal{X}}^{(i)}\}|_{i=1}^{\breve{n}}$, where $\ddot{n}$ and $\breve{n}$ are the numbers of pseudo samples. The former helps the detector reduce misjudgments in which some noised normal videos are predicted with anomaly labels, whilst the latter provides direct guidance to localize anomalous frames. Let $\{\mathcal{X}'^{(i)}\}|_{i=1}^{n'} = \{\ddot{\mathcal{X}}^{(i)}\}|_{i=1}^{\ddot{n}} \cup \{\breve{\mathcal{X}}^{(i)}\}|_{i=1}^{\breve{n}}$ denote all augmentation samples, where $n' = \ddot{n} + \breve{n}$. Finally, we derive the objective function to optimize as:

$$\zeta = \zeta_O(\{\mathcal{X}^{(i)}\}|_{i=1}^{n}) + \lambda\,\zeta_A(\{\mathcal{X}'^{(i)}\}|_{i=1}^{n'}), \quad (3)$$

where $\lambda$ is the balance factor between the original and the augmented data.
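As a concrete illustration, the video score of Eq. (1) with an L1 margin for $f$ (the choice made later in Section III-B) and the batch margin loss of Eq. (2) can be sketched as follows. This is a minimal NumPy sketch under our own function names, assuming the per-segment anomaly scores produced by $\psi$ are already available:

```python
import numpy as np

def video_score(seg_scores):
    """Eq. (1) with an L1 margin f: the video score is the maximum
    pairwise gap between per-segment anomaly scores, i.e. max - min."""
    s = np.asarray(seg_scores, dtype=float)
    return s.max() - s.min()

def margin_loss(anom_videos, norm_videos):
    """Eq. (2): hinge loss separating the mean video scores of the
    anomalous and normal batches by a margin of 1. Each argument is a
    list of per-segment score arrays."""
    s_anom = np.mean([video_score(v) for v in anom_videos])
    s_norm = np.mean([video_score(v) for v in norm_videos])
    return max(0.0, 1.0 - s_anom + s_norm)
```

A perfectly separated batch (anomalous score margins near 1, normal margins near 0) drives the hinge to zero, which matches the smoothness assumption on normal videos stated above.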
The loss function $\zeta_O$, defined on the original weakly-labeled data, uses the margin loss $\zeta$ in Equation (2), and its details are given in Section III-B. The loss $\zeta_A$ is imposed on the noise data as well as the pseudo-location data, and will be introduced in Section III-C.

In the testing process, given a video, we obtain the anomaly status of each segment by aggregating the consensus of the spatial semantics and the dynamic variations defined in the following.

B. High-order Context Encoding
Previous approaches [4], [6], [5] directly infer the anomaly scores from the input visual features in an intuitive way, while neglecting the guidance of the temporal context for anomaly localization. Intuitively, the rarely occurring anomalies among the normal patterns lead to significant changes in the time domain. Therefore, the dynamic variations in the time series are able to indicate the existence of anomalies. Inspired by this, we propose to leverage the temporal context information for the immediate spatial semantics and the dynamic temporal variations, and to aggregate both cues for accurately locating anomalies.

In the beginning, we design a High-order Context Encoding (HCE) model to extract high-level semantic features and encode the variations in the time series. The input is the feature vectors $(x_1, \cdots, x_m)$ extracted from consecutive segments. The regression process is formulated as:

$$\widetilde{x}_t = \sum_{j=-k,\cdots,k,\ j\neq 0} W_j \widetilde{x}_{t+j} + W_0 x_t + b_0, \quad (4)$$

where $W_j$ is a projection function on the $j$-th segment and $b_0$ is a bias term. The output encodes the context information of the anchored segment and the adjacent segments, i.e., $(\widetilde{x}_{t-k}, \cdots, \widetilde{x}_{t-1}, x_t, \widetilde{x}_{t+1}, \cdots, \widetilde{x}_{t+k})$. The intuition is that the $t$-th high-order feature vector collects fruitful information from its $k$ neighbors, which facilitates the mining of both immediate spatial semantics and local dynamic variations. Actually, the regression can be stacked into a hierarchical structure by taking the output $\widetilde{x}$ as the input in a recursive manner. In practice, we find that a simple one-layer regression performs well.

The neighbor size $k$ controls the temporal context modeled in each local segment $\widetilde{x}_t$. Then, to exploit the immediate semantic information of the anchored segment, we use a fully connected layer, activated by a sigmoid function, to obtain an anomaly score.
Formally:

$$\psi_{sem}(\widetilde{x}_t) = \sigma(w_{sem}\widetilde{x}_t + b_{sem}), \quad (5)$$

where $\psi_{sem}(\widetilde{x}_t)$ represents the semantic score, $w_{sem}$ and $b_{sem}$ are the weight and bias of the fully connected layer, and $\sigma$ stands for the sigmoid function.

To measure the variation between two adjacent segments, we take the cosine similarity measurement: $\cos(\widetilde{x}_{t-1}, \widetilde{x}_t) = \widetilde{x}_{t-1}^{\top}\widetilde{x}_t / (\|\widetilde{x}_{t-1}\|\,\|\widetilde{x}_t\|)$. The corresponding distance metric is $1 - \cos(\widetilde{x}_{t-1}, \widetilde{x}_t)$, which has a large value for dramatic variations. Then the second-order discrepancy of local variations is computed as an indicator of anomaly, which becomes:

$$\psi_{var}(\widetilde{x}_t) = \big(2 - \cos(\widetilde{x}_{t-1}, \widetilde{x}_t) - \cos(\widetilde{x}_t, \widetilde{x}_{t+1})\big)/4, \quad (6)$$

where we divide the score by four to normalize the scalar into $[0, 1]$.

Then, we obtain the singularity of a sequence from the dual context cues, with the margin measurement $f$ as the L1-distance:

$$S_{sem}(\mathcal{X}) = \max_{i,j=1,\cdots,m} |\psi_{sem}(\widetilde{x}_i) - \psi_{sem}(\widetilde{x}_j)|, \quad (7)$$

$$S_{var}(\mathcal{X}) = \max_{i,j=1,\cdots,m} |\psi_{var}(\widetilde{x}_i) - \psi_{var}(\widetilde{x}_j)|. \quad (8)$$

By plugging the above singularity tuple into Equation (2), the acquired margin losses of the dual context are denoted as $\zeta_{sem}$ and $\zeta_{var}$. Since the scores of normal events are targeted to 0, and those of anomalous events are sparse (anomalies are scarce), we place a sparsity constraint on the loss function. Adding the sparsity constraint with weight $\beta$, the margin loss of the dual context becomes:

$$\zeta_O = \zeta_{sem}(\{\mathcal{X}^{(i)}\}|_{i=1}^{n}) + \zeta_{var}(\{\mathcal{X}^{(i)}\}|_{i=1}^{n}) + \frac{\beta}{n}\sum_{i=1}^{n}\sum_{t=1}^{m}\big(|s^{sem}_t| + |s^{var}_t|\big). \quad (9)$$
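For concreteness, the scoring functions of Eqs. (4)–(6) can be sketched as below. This is a NumPy sketch under our own naming; the context-encoding variant here regresses on the raw neighbor features in a single pass, a simplification of the recursive definition in Eq. (4):

```python
import numpy as np

def hce_encode(x, W, W0, b0):
    """Eq. (4), one-pass variant: encode each segment from its raw
    neighbors. x: (m, d) segment features; W: dict {offset j != 0:
    (d, d) matrix}; W0: (d, d); b0: (d,)."""
    m, _ = x.shape
    out = np.zeros_like(x)
    for t in range(m):
        acc = x[t] @ W0.T + b0
        for j, Wj in W.items():
            if 0 <= t + j < m:  # drop neighbors falling outside the video
                acc += x[t + j] @ Wj.T
        out[t] = acc
    return out

def sem_score(xt, w_sem, b_sem):
    """Eq. (5): immediate semantic anomaly score (sigmoid output)."""
    return 1.0 / (1.0 + np.exp(-(w_sem @ xt + b_sem)))

def var_score(x_prev, x_t, x_next):
    """Eq. (6): second-order dynamic-variation score, normalized to [0, 1]."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return (2.0 - cos(x_prev, x_t) - cos(x_t, x_next)) / 4.0
```

Note that `var_score` reaches its maximum of 1 only when both adjacent cosine similarities equal −1, i.e., the encoded feature flips direction twice, which is exactly the second-order discrepancy the text describes.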
Fig. 2. Framework of WSAL. Video clips are organized at the segment level and input into the backbone model. The extracted features are processed by the HCE module to generate anomaly scores from the cues of immediate semantics and dynamic variations. Then the predicted scores are aggregated and supervised with a novel MIL margin objective function using the video-level labels. In addition, we introduce the Enhanced Weak Supervision strategy for data augmentation and generating pseudo anomaly signals. Better viewed in color.
C. Enhanced Weak Supervision
Noise Simulation. As mentioned in Section I, noise in videos leads to serious interference for anomaly detection, especially localization. Due to unavoidable external factors, noisy artifacts such as lens jitter tend to exist in the videos, which results in misjudgments. To mitigate this issue, we introduce a noise simulation strategy in which we fuse the raw videos with varying degrees of video noise, such as blur, picture interruption, as well as lens jitter. Specifically, we augment the normal video sequences with three kinds of video noise simulations: motion blur (with randomized kernel size and angle), black/blue/purple blocks (covering a random fraction of the raw image size), and random scaling (up to ±20% on the x- and y-axis independently). We randomly choose m segments in a normal video sequence to augment, and the augmented data are still treated as normal.

Given the simulated noise data $\{\ddot{\mathcal{X}}^{(i)}\}|_{i=1}^{\ddot{n}}$ and the corresponding label set $\{\ddot{y}^{(i)}\}|_{i=1}^{\ddot{n}}$, we apply a supervised constraint on the predicted anomaly states $\{\ddot{s}^{(i)}_t\}|_{i=1}^{\ddot{n}}$:

$$\zeta_{nse}(\{\ddot{\mathcal{X}}^{(i)}\}|_{i=1}^{\ddot{n}}) = \frac{1}{\ddot{n}}\sum_{i=1}^{\ddot{n}}\sum_{t=1}^{m}\big(\ddot{s}^{(i)}_t - \ddot{y}^{(i)}\big)^2, \quad (10)$$

$$\text{s.t.}\quad \ddot{s}^{(i)}_t = \frac{1}{2}\big([\ddot{s}^{(i)}_t]_{sem} + [\ddot{s}^{(i)}_t]_{var}\big). \quad (11)$$

Hand-crafted Anomaly. The noise simulation strategy introduced above is able to alleviate false alarms on normal videos. However, for anomalous videos, there is still a lack of sufficient data for model training. In particular, there exists no explicit location supervision in those anomalous videos, which brings great challenges for effective anomaly localization. To mitigate this issue, we introduce hand-crafted anomalies to boost the anomaly localization performance by creating explicit location instructions for anomaly localization. We name the hand-crafted anomalies pseudo-location data $\{\breve{\mathcal{X}}^{(i)}\}|_{i=1}^{\breve{n}}$. Specifically, we first randomly choose a pair of normal and abnormal videos.
Then several random segments of the normal video are deleted, and meanwhile several segments of the abnormal video are extracted. Finally, the extracted segments from the abnormal video are combined with the remaining normal video segments to form a pseudo anomalous sequence. The segments extracted from the abnormal videos are viewed as anomalous, since these substitutes differ from the distribution of the substituted video due to the different scenes.

Since the abnormal and normal segments are fused with a random weight, simply assigning a fixed score (e.g., 1) to the simulated abnormal video would bring in a degenerate solution, because such a signal can encourage the remaining normal segments to have a high anomaly score along with the pseudo-location data. To mitigate this issue, we propose a simple yet effective technique that merely pushes the fused segments $\{\breve{X}_i\}|_{i=1}^{\breve{n}}$ to have a higher score than the others. The supervision constraint is derived as:

$$\zeta_{loc}(\{\breve{\mathcal{X}}^{(i)}\}|_{i=1}^{\breve{n}}) = \frac{1}{\breve{n}}\sum_{i=1}^{\breve{n}}\sum_{t\in\mathcal{I}}\max\big(0,\ \max_{j\notin\mathcal{I}}\{\breve{s}^{(i)}_j\} - \breve{s}^{(i)}_t\big), \quad (12)$$

where $\breve{s}^{(i)}_t$ denotes the anomaly score estimated by the HCE module and $\mathcal{I}$ is the collection of indexes of the pseudo-location segments of the hand-crafted anomalies. Integrating the above two augmentation techniques, the objective function of the weak-supervision enhancement strategy becomes:

$$\zeta_A = \zeta_{nse}(\{\ddot{\mathcal{X}}^{(i)}\}|_{i=1}^{\ddot{n}}) + \zeta_{loc}(\{\breve{\mathcal{X}}^{(i)}\}|_{i=1}^{\breve{n}}). \quad (13)$$

Combining Eqn. (9) and Eqn. (13), we finally arrive at the overall objective function denoted by Eqn. (3).

D. Traffic Anomaly Detection (TAD) Dataset
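The construction of a hand-crafted anomaly and the location hinge of Eq. (12) can be sketched as follows. This is a NumPy sketch with hypothetical helper names; in particular, the hinge direction follows the stated goal of pushing the fused segments above the remaining ones:

```python
import numpy as np

def make_pseudo_anomaly(normal_segs, abnormal_segs, n_swap, rng):
    """Replace n_swap random segments of a normal video with random
    segments from an abnormal one; return the fused sequence and the
    index set I of the inserted (pseudo-anomalous) segments."""
    fused = [s.copy() for s in normal_segs]
    dst = rng.choice(len(fused), size=n_swap, replace=False)
    src = rng.choice(len(abnormal_segs), size=n_swap, replace=False)
    for i, j in zip(dst, src):
        fused[i] = abnormal_segs[j].copy()
    return fused, {int(i) for i in dst}

def loc_loss(scores, I):
    """Eq. (12) for one video: hinge pushing the scores of the
    inserted segments above the maximum score of the remaining ones."""
    rest = max(s for t, s in enumerate(scores) if t not in I)
    return sum(max(0.0, rest - scores[t]) for t in I)
```

The loss vanishes as soon as every inserted segment outscores the best remaining segment, so the surrounding normal segments are never forced toward a high absolute score, which is exactly the degenerate solution the paragraph above avoids.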
So far, most existing video anomaly datasets are prepared for the unsupervised case, e.g., UCSD Pedestrian 1&2 [10], Subway Entrance & Exit [2], Avenue [11], etc. These unsupervised datasets are either small in scale or constrained to limited scenes. For example, videos in Avenue are short, and some of the anomalies are performed by actors (e.g., throwing paper), which is unrealistic. Different from them, the UCF-Crime [4] dataset is a newly released large-scale dataset proposed for the weak supervision case. Long untrimmed surveillance videos, covering 13 real-world anomalies, are collected in the dataset. It has a total of 1,900 surveillance videos, consisting of 1,610 training videos and 290 test videos. Note that only video-level annotations are provided for the training set, while frame-level annotations are available for evaluation on the test set. The comparison of video anomaly detection datasets is shown in Table I.
TABLE I
A COMPARISON OF ANOMALY DETECTION DATASETS

Dataset | Target domain | #Videos | #Frames | #Scenes
Ours (TAD) | Traffic | 500 | 540,272 | 400+
Although the datasets mentioned above have greatly promoted the development of anomaly detection methods, there is still a lack of sufficiently diverse benchmarks for evaluation. To further meet the benchmark diversity requirement, we here propose a new anomaly detection dataset which specializes in traffic scenes, differing greatly from the datasets mentioned before. Traffic video monitoring plays an essential role in early warning and emergency assistance for car accidents. There is an urgent need to design effective anomaly detection systems for surveillance videos on roads. In traffic scenes, many factors, such as vehicles moving at high speed and various road conditions, add to the difficulty of anomaly detection. So far, there is no specific dataset for traffic anomaly detection. Although UCF-Crime contains road accident videos, most anomalies in traffic scenarios are not covered by this dataset. Basically, a large-scale and complex dataset is of great importance for devising and evaluating various methods. It is our desire to push the study of anomaly detection towards usage in real traffic applications. Hence, we are motivated to construct a new large-scale dataset under traffic scenes. The collected TAD dataset consists of 500 long untrimmed videos which cover seven categories of real-world anomalies on roads, including Vehicle Accidents, Illegal Turns, Illegal Occupations, Retrograde Motion, Pedestrian on Road, Road Spills and The Else (i.e., the remaining anomalies with fewer instances are put together as one category). Some cases of the anomalies are shown in Figure 3. The proposed dataset is comprehensive, including realistic videos from various scenarios, weather conditions and daytime periods.
Data collection.
Traffic videos from various countries are collected and annotated under a detailed and unified plan. Raw videos are downloaded from YouTube or the Google website. The collected videos are mostly recorded by CCTV cameras mounted on roads. We remove videos which fall into any of the following conditions: manually edited, prank videos, or compilations. Videos with ambiguous anomalies are also excluded.
Data partition and Annotation.
Our TAD dataset contains a total of about 25 hours of video, with an average of 1,075 frames per clip. The anomalies occur at random positions in each clip, spanning about 80 frames on average, and there are one to two random anomalies in a video sequence. Finally, 500 traffic surveillance videos are saved and annotated for anomaly detection, with 250 abnormal and 250 normal videos respectively. The whole dataset is randomly partitioned into a training set and a test set. Both the training and test sets contain normal and abnormal videos and all seven kinds of anomalies at various temporal locations in the anomalous videos. Following the weak supervision setting of [4], the training set is equipped with video-level annotations, and frame-level annotations are provided for the test set.

Our proposed TAD dataset contains totally different abnormal scenarios from the current benchmarks. We believe that it can be used to better evaluate the effects of different anomaly detection algorithms from another perspective. We hope our TAD dataset can serve as a standard benchmark for better promoting the development of anomaly detection methods.

Fig. 3. Examples of different anomalies in the collected TAD dataset: Accidents, Illegal Turns, Illegal Occupations, Retrograde Motion, Pedestrian on Road, Road Spills, The Else, and Normal.

Fig. 4. ROC curves with various anomaly detection methods on the UCF-Crime dataset and the TAD dataset: (a) ROC curves on UCF-Crime; (b) ROC curves on our TAD.
IV. EXPERIMENTS
A. Implementation Details
As in [5], we adopt the Temporal Segment Network (TSN) [?], a powerful action feature extractor, as our backbone network. We use the BN-Inception version of TSN to extract features for our proposed WSAL method. We extract features from the global average pooling layer (1024-dim). For the UCF-Crime dataset, we use the model weights finetuned on UCF-Crime as in [5] to extract features, while on our TAD dataset we only use the model weights pretrained on the Kinetics-400 dataset. We first divide each video into non-overlapping segments empirically, as in previous works [5], [6], [4], for a fair comparison. Hence, for each video, we have an m × 1024 feature matrix. During the training phase, we randomly select 30 positive and 30 negative bags as a mini-batch. We employ the Adagrad [32] optimizer. The weight β of the sparsity constraint in the margin loss is set as in [4], [6], and the weight λ of the weak-supervision enhancement strategy is tuned for the best performance. We train the model for a fixed number of iterations, halving the learning rate twice during training. All hyper-parameters are the same for both the UCF-Crime and TAD datasets.
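The segment-level feature preparation described above can be sketched as below. This is a NumPy sketch with our own function name, and mean pooling is assumed as the frame-aggregation step (the paper only states that frame features are aggregated per segment):

```python
import numpy as np

def segment_features(frame_feats, m):
    """Split per-frame backbone features (n_frames, d) into m
    non-overlapping segments and mean-pool each one, yielding the
    (m, d) per-video feature matrix (d = 1024 for BN-Inception TSN)."""
    chunks = np.array_split(np.asarray(frame_feats, dtype=float), m, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])
```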
B. Evaluation Metrics
For anomaly detection [25], [7], the Receiver Operating Characteristic (ROC) is used as a standard evaluation metric. It is calculated by gradually changing the threshold on the predicted anomaly scores. Then the Area Under the Curve (AUC) is accumulated into a single score for performance evaluation; a higher value indicates better anomaly detection performance. Following the previous works [6], [4], [5], we apply ROC curves and frame-level AUC for the anomaly detection performance comparison. Due to the lack of frame-level annotations on the training split and the verification split for ablation studies, we use video-level AUC as the measurement for tuning the hyper-parameters. In addition, we also use the ROC and AUC on the anomaly subset as the evaluation metric for anomaly localization ability.
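Frame-level AUC as used here can also be computed without explicit thresholding, via the equivalent rank statistic. The following is a self-contained sketch (the function name is ours); in practice a library routine such as scikit-learn's `roc_auc_score` would give the same value:

```python
import numpy as np

def frame_auc(labels, scores):
    """Frame-level ROC AUC via the rank statistic: the probability
    that a random anomalous frame scores higher than a random normal
    frame, counting ties as 0.5."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]   # anomalous frames
    neg = scores[labels == 0]   # normal frames
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

This also makes the 50%-AUC baseline in Section I concrete: a random scorer wins against a random normal frame half the time.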
C. Comparison with SOTA Methods
On UCF-Crime dataset.
For a fair comparison, we reproduce the methods of [4] and [5] by running their publicly released code. Other statistical results are drawn from the work [4]. We compare our WSAL with several anomaly detection methods. Specifically, a binary SVM classifier is set as the baseline method; in this case, the anomalous and normal videos are treated as two separate classes. The models from Lu et al. [11] and Hasan et al. [17] are two unsupervised methods, trained with the normal videos in the UCF-Crime training set. The remaining methods, Sultani et al. [4], Zhu et al. [6] and Zhong et al. [5], are SOTA weakly-supervised methods. As shown in Table II, on the whole test set, which contains both normal and abnormal videos, we boost the best overall AUC from 82.12% to 85.38%, a large margin. In Figure 4(a), we plot the ROC curves of the SOTA methods on the whole UCF-Crime dataset, which vividly shows the superiority of our proposed WSAL method over the other SOTA methods. As for the anomaly subset, our proposed method exceeds the SOTA detectors by 8.36% over [5] and 13.13% over [4], achieving significant progress from the anomaly localization perspective.

We draw the following conclusions from the above experimental results: 1) The SVM classifier fails to distinguish the anomalous and normal videos, mainly because the normal patterns
Frame
Explosion033 Normal_Videos_048Burglary033
Fig. 5. Visualization of predictions on the UCF-Crime testcases. The x-axis denotes the video frame
TABLE II
QUANTITATIVE COMPARISON ON THE UCF-CRIME DATASET. THE * SYMBOL INDICATES THE METHOD IS TRAINED WITH NORMAL VIDEOS ONLY.

Method              | Overall AUC (%) | Anomaly Subset AUC (%)
SVM                 | 50.00           | 50.00
Hasan et al.* [17]  | 50.60           | -
Lu et al.* [11]     | 65.51           | -
Sultani et al. [4]  | 75.41           | 54.25
Zhu et al. [6]      | 79.10           | 62.18
Zhong et al. [5]    | 82.12           | 59.02
Ours                | 85.38           | 67.38

take the dominant position in both normal and anomalous videos, making it difficult for the classifier to capture the rare anomalies; 2) by encoding the normal patterns and building the corresponding semantic boundary, the unsupervised methods [17] and [11] achieve better results than the SVM classifier; 3) owing to the benefit of weak labels, the weakly-supervised methods [4], [6] and [5] are superior to the above approaches. Nevertheless, previous weakly-supervised methods infer the anomaly status from high-level semantic features alone, neglecting an important property of the anomaly: the dynamic evolution lying in the time series. The considerable gain in anomaly localization promotes the improvement of overall anomaly detection accuracy. These superior results demonstrate that spotting anomalous segments is a key component of anomaly detection. Some predicted results of our method on test cases are shown in Figure 5.
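The dynamic-evolution property discussed above can be illustrated with a minimal sketch: score each temporal segment by how much its feature vector differs from its predecessor, so that sudden changes stand out. This is a deliberately simplified stand-in for the paper's high-order context encoding; the function name, the plain L2 difference, and the min-max normalization are all our own assumptions:

```python
import numpy as np

def dynamic_variation_scores(features):
    """Score each temporal segment by its feature change relative to
    the previous segment: a crude proxy for the dynamic-evolution cue
    (sudden appearance changes produce high scores).
    features: (T, D) array of per-segment features."""
    f = np.asarray(features, dtype=float)
    # Forward differences between adjacent segments, padded so the
    # output keeps one score per segment.
    diff = np.linalg.norm(np.diff(f, axis=0), axis=1)
    diff = np.concatenate([diff[:1], diff])
    # Min-max normalise to [0, 1] so scores are comparable across videos.
    rng = diff.max() - diff.min()
    return (diff - diff.min()) / rng if rng > 0 else np.zeros_like(diff)

# A sudden feature jump at segment 3 yields the peak score there.
feats = np.array([[0, 0], [0, 0], [0, 0], [5, 5], [5, 5]])
print(dynamic_variation_scores(feats).argmax())  # 3
```

A semantics-only detector would score segments 3 and 4 identically here; the variation cue instead isolates the moment of change, which is the intuition behind combining both cues.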
On the proposed TAD dataset.
To compare the performance of different methods under other circumstances, we conduct comparison experiments on the TAD dataset. We compare our WSAL model with four SOTA anomaly detection methods, including two unsupervised methods (Luo et al. [8] and Liu et al. [7]) and two weakly-supervised methods (Sultani et al. [4] and Zhu et al. [6]). For the unsupervised models, we follow their implementations and train on the training subset where only normal videos are provided. All models are re-trained with the same features extracted using TSN, except [7], which takes RGB frames as input. The quantitative AUC comparisons are reported in Table III and the corresponding ROC curves are drawn in Figure 4(b). Similar to the results on UCF-Crime, the weakly-supervised methods obtain much better performance than the unsupervised ones. As both normal and abnormal training samples are provided, the weakly-supervised methods gain a much better understanding of the intrinsic nature of anomalies. Our WSAL method also achieves better performance with a
TABLE III
QUANTITATIVE COMPARISON ON OUR TAD DATASET. THE * SYMBOL INDICATES THE METHOD IS TRAINED WITH NORMAL VIDEOS ONLY.
Method             | Overall AUC (%) | Anomaly Subset AUC (%)
Luo et al.* [8]    | 57.89           | 55.84
Liu et al.* [7]    | 69.13           | 55.38
Sultani et al. [4] | 81.42           | 55.97
Zhu et al. [6]     | 83.08           | 56.89
Ours

gain of % AUC over the previous SOTA [6]. The prominent advances on these two large-scale and comprehensive benchmarks prove the superiority of our method in detecting and localizing anomalies.

D. Ablation Studies
To comprehensively study the impact of the different components we propose, we conduct various ablation studies in this part. All experiments are performed on the UCF-Crime dataset, and all hyper-parameters are kept the same as in the full WSAL method unless otherwise stated.
Analysis of the dual context ensemble.
To verify the effectiveness of the proposed ensemble mechanism between the immediate semantics and the dynamic variations, we construct three variants of the WSAL method in which only the immediate semantics, only the dynamic variations, or both are adopted. Detailed comparisons are presented in the first three rows of Table IV. When only the immediate semantics are exploited, the algorithm already achieves satisfactory accuracies of . % and . % w.r.t. anomaly detection and anomaly localization, meaning that the immediate semantics alone provide useful information for the tasks. If only the dynamic variations are adopted, the performances are boosted by . % and . %, respectively. One case is visualized in Figure 6: there is only one clear and sharp peak in the prediction of the dynamic variations cue, compared with the results of the immediate semantics cue. This demonstrates that the dynamic variations are good at capturing the sudden occurrence of an anomaly even when the immediate semantics cue brings in uncertainty. By aggregating the immediate semantics and dynamic variations cues, the detection performance becomes more robust under various circumstances, with higher detection and localization accuracy. This owes to the complementary characteristics of the two cues, which represent different aspects of the anomaly.
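The aggregation described above can be sketched as a simple per-segment fusion of the two score streams. The convex combination below is an assumption made for illustration (the paper's actual ensemble mechanism may differ), as are the function name and the example score values:

```python
import numpy as np

def fuse_cues(semantic_scores, variation_scores, alpha=0.5):
    """Combine per-segment anomaly scores from the immediate-semantics
    cue and the dynamic-variations cue. A plain convex combination is
    assumed here as one possible ensemble rule."""
    s = np.asarray(semantic_scores, dtype=float)
    v = np.asarray(variation_scores, dtype=float)
    return alpha * s + (1.0 - alpha) * v

semantic = np.array([0.2, 0.9, 0.3, 0.8])   # noisy cue with two peaks
variation = np.array([0.1, 0.2, 0.1, 0.9])  # single sharp peak
fused = fuse_cues(semantic, variation)
print(fused.argmax())  # 3: the spurious semantic peak at index 1 is damped
```

The toy example mirrors the behaviour reported for the ensemble: a false peak present in only one cue is suppressed, while a peak confirmed by both cues survives.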
Fig. 6. A visualization case of the dual context cues on the UCF-Crime dataset. The light orange region denotes the ground-truth anomaly. From top to bottom, the curves represent the anomaly scores of the immediate semantics cue, the dynamic variations cue, and the consensus of the two cues, respectively. A more robust and smooth prediction is observed from the dual context model.

TABLE IV
ABLATION STUDIES OF OUR WSAL METHOD ON THE UCF-CRIME DATASET. THE ABBREVIATIONS IN THE TABLE ARE AS FOLLOWS. IS: IMMEDIATE SEMANTICS; DV: DYNAMIC VARIATIONS; HCE: HIGH-ORDER CONTEXT ENCODING; NS: NOISE SUPPRESSION; HA: HAND-CRAFTED ANOMALY.
Analysis of HCE model.
We study the influence of the High-order Context Encoding on the new training and verification splits of UCF-Crime. Since only video-level labels are available in the verification split, video-level AUC is measured by aggregating the segment-level model predictions as in Formula 1 and then calculating the AUC. For the incorporation of temporal context, an appropriate temporal window size k is critical for the final performance. We gradually increase the window size k from 0 to 3; the results are listed in Table V. When the window size k grows from 0 to 1, the accuracy of the video-level predictions improves drastically, with a performance gain of 1.6%. This means that appropriate aggregation of the temporal context possesses great potential for anomaly detection: the rich information in the temporal neighborhood facilitates the learning of anomaly semantics as well as the encoding of the temporal evolution. Finally, we choose k = 2 as a trade-off between model size and performance, since the accuracy gain slows down when the window size increases further.

Enhanced weak supervision.
We conduct studies to verify the effectiveness of the proposed video noise augmentation and hand-crafted anomaly separately. The detailed results are
Fig. 7. A visualization case of video noise on the UCF-Crime dataset. The light blue region in the video sequence contains video noise. The red curves in the top-right denote the ground-truth anomalies. The curve in the middle-right shows the result of the baseline method [5], and the bottom-right curve shows the result of our method. In this case, the noise comes from lens jitter; the drastic view change easily leads to false detections by the basic model.

TABLE V
ANALYSIS OF THE WINDOW SIZE IN THE HCE MODEL ON THE UCF-CRIME DATASET.
Window Size         | 0     | 1     | 2     | 3
Video-level AUC (%) | 93.39 | 95.01 | 95.65 | 95.73

reported in Table IV. We first augment the training process by adding video noise simulations. The anomaly detection AUC is boosted from . % to . %, and the anomaly localization accuracy is further boosted by . %. One case is plotted in Figure 7: our noise simulation strategy helps alleviate the interference caused by lens jitter. Note that the anomaly localization improvement is non-trivial; it clearly demonstrates that the proposed noise augmentation strategy aids the dynamic variation module in better capturing the real anomaly and achieving a better understanding of the intrinsics of anomalies. When we manually synthesize anomalies to aid the training process, our method achieves . % and . % performance gains over the training strategy without the augmentations. These improvements demonstrate that our synthetic anomaly data provide extra useful supervision, indicating that a larger anomaly detection dataset is needed for sufficient training of anomaly detection methods. When both augmentation strategies are combined, the proposed method achieves much better performance than either strategy alone. This indicates that the two strategies are beneficial to understanding the anomaly concept, by suppressing the interference coming from the environment as well as hardware failures, and by generating pseudo signals that simulate the occurrence of anomalies.

V. CONCLUSION
In this work, we focused on anomaly localization in surveillance videos and proposed a weakly supervised anomaly localization network that deeply explores the temporal context in consecutive segments. Our model encodes temporal dynamic variations as well as high-level semantic information, and leverages both for anomaly detection and localization. Furthermore, we devised a weak supervision enhancement strategy: the accuracy of anomaly localization is greatly improved under the introduced supervision of video noise augmentation and pseudo-location data. We also collected a new traffic anomaly detection dataset for evaluating methods under realistic road scenarios. SOTA methods were evaluated on the UCF-Crime dataset and our TAD dataset, and the experimental results showed that the proposed anomaly detector performs significantly better than previous methods.

REFERENCES
[1] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys (CSUR), 2009.
[2] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, "Robust real-time unusual event detection using multiple fixed-location monitors," TPAMI, 2008.
[3] Y. Benezeth, P.-M. Jodoin, V. Saligrama, and C. Rosenberger, "Abnormal events detection based on spatio-temporal co-occurences," in CVPR, 2009.
[4] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in CVPR, 2018.
[5] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, "Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection," in CVPR, 2019.
[6] Y. Zhu and S. Newsam, "Motion-aware feature for improved video anomaly detection," BMVC, 2019.
[7] W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection - a new baseline," in CVPR, 2018.
[8] W. Luo, W. Liu, and S. Gao, "A revisit of sparse coding based anomaly detection in stacked rnn framework," in ICCV, 2017.
[9] Y. S. Chong and Y. H. Tay, "Abnormal event detection in videos using spatiotemporal autoencoder," in International Symposium on Neural Networks, 2017.
[10] W. Li, V. Mahadevan, and N. Vasconcelos, "Anomaly detection and localization in crowded scenes," TPAMI, 2013.
[11] C. Lu, J. Shi, and J. Jia, "Abnormal event detection at 150 fps in matlab," in ICCV, 2013.
[12] B. Ramachandra and M. Jones, "Street scene: A new dataset and evaluation protocol for video anomaly detection," in WACV, 2020.
[13] B. Zhao, L. Fei-Fei, and E. P. Xing, "Online detection of unusual events in videos via dynamic sparse coding," in CVPR, 2011.
[14] L. Kratz and K. Nishino, "Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models," in CVPR, 2009.
[15] S. Wu, B. E. Moore, and M. Shah, "Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes," in CVPR, 2010.
[16] B. Antić and B. Ommer, "Video parsing for abnormality detection," in ICCV, 2011.
[17] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, "Learning temporal regularity in video sequences," in CVPR, 2016.
[18] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, "Learning regularity in skeleton trajectories for anomaly detection in videos," in CVPR, 2019.
[19] A. Del Giorno, J. A. Bagnell, and M. Hebert, "A discriminative framework for anomaly detection in large videos," in ECCV, 2016.
[20] H. Park, J. Noh, and B. Ham, "Learning memory-guided normality for anomaly detection," in CVPR, 2020.
[21] R. Bensch, N. Scherf, J. Huisken, T. Brox, and O. Ronneberger, "Spatiotemporal deformable prototypes for motion anomaly detection," IJCV, 2017.
[22] A. Basharat, A. Gritai, and M. Shah, "Learning object motion patterns for anomaly detection and improved object detection," in CVPR, 2008.
[23] Y. Cong, J. Yuan, and J. Liu, "Sparse reconstruction cost for abnormal event detection," in CVPR, 2011.
[24] V. Saligrama and Z. Chen, "Video anomaly detection based on local statistical aggregates," in CVPR, 2012.
[25] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, "Anomaly detection in crowded scenes," in CVPR, 2010.
[26] J. Kim and K. Grauman, "Observe locally, infer globally: a space-time mrf for detecting abnormal activities with incremental updates," in CVPR, 2009.
[27] M. Sabokrou, M. Fathy, M. Hoseini, and R. Klette, "Real-time anomaly detection and localization in crowded scenes," in CVPR Workshops, 2015.
[28] D. Xu, E. Ricci, Y. Yan, J. Song, and N. Sebe, "Learning deep representations of appearance and motion for anomalous event detection," BMVC, 2015.
[29] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional lstm network: A machine learning approach for precipitation nowcasting," in NIPS, 2015.
[30] K. Adhiya, S. Kolhe, and S. S. Patil, "Tracking and identification of suspicious and abnormal behaviors using supervised machine learning technique," in Proceedings of the International Conference on Advances in Computing, Communication and Control, 2009.
[31] C. He, J. Shao, and J. Sun, "An anomaly-introduced learning method for abnormal event detection," Multimedia Tools and Applications, 2018.
[32] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization,"