Meta-AAD: Active Anomaly Detection with Deep Reinforcement Learning
Daochen Zha, Kwei-Herng Lai, Mingyang Wan, Xia Hu
Department of Computer Science and Engineering, Texas A&M University
{daochen.zha, khlai037, w1996, xiahu}@tamu.edu

Abstract—High false-positive rate is a long-standing challenge for anomaly detection algorithms, especially in high-stake applications. To identify the true anomalies, in practice, analysts or domain experts will be employed to investigate the top instances one by one in a ranked list of anomalies identified by an anomaly detection system. This verification procedure generates informative labels that can be leveraged to re-rank the anomalies so as to help the analyst discover more true anomalies given a time budget. Some re-ranking strategies have been proposed to approximate the above sequential decision process. Specifically, existing strategies have focused on making the top instances more likely to be anomalous based on the feedback. Then they greedily select the top-1 instance for query. However, these greedy strategies could be sub-optimal since some low-ranked instances could be more helpful in the long-term. Motivated by this, in this work, we study whether modeling long-term performance can benefit active anomaly detection. This is a challenging task because it is unclear how long-term performance could be quantified. In addition, the query selection has a huge decision space, which is difficult to model. To address these challenges, we propose Active Anomaly Detection with Meta-Policy (Meta-AAD), a novel framework that learns a meta-policy for query selection. Specifically, Meta-AAD leverages deep reinforcement learning to train the meta-policy to select the most proper instance so as to explicitly optimize the number of discovered anomalies throughout the querying process. Meta-AAD is easy to deploy since a trained meta-policy can be directly applied to any new dataset without further tuning. Extensive experiments on 24 benchmark datasets demonstrate that Meta-AAD significantly outperforms the state-of-the-art re-ranking strategies and the unsupervised baseline. The empirical analysis shows that the trained meta-policy is transferable and inherently achieves a balance between long-term and short-term rewards.
Keywords—Anomaly Detection, Active Learning, Deep Reinforcement Learning, Meta-Learning, Human-in-the-Loop
I. INTRODUCTION
Anomaly detection aims to identify the data objects or behaviors that significantly deviate from the majority. Anomaly detection has essential applications in various domains, such as fraud detection, cybersecurity attack detection, and medical diagnosis [1]. Numerous anomaly detection algorithms have been proposed, but they are usually unsupervised with assumptions on the anomaly patterns [2], [3]. The discrepancy between the assumptions and the real-world scenarios can lead to high false-positive rates since users may have different interests and definitions of the anomalies.

In this work, we consider an alternative approach to reduce false-positive rates by involving humans in the loop. In many traditional anomaly detection scenarios, an analyst will be asked to investigate the top instances from a ranked list of anomalies to identify as many true anomalies as she can until the time budget is used up. In practice, this human feedback can be leveraged to help the analyst identify more anomalies. We consider a scenario where the anomaly detector selects one of the instances at a time to query the analyst. It then adjusts its decision functions by leveraging the label from the analyst. Figure 1 shows a toy example of how human feedback is leveraged to improve the detector. We can see that human feedback can help the anomaly detector promote the instances of interest and discourage the instances out of interest. As a result, the analyst will be presented with more true anomalies under a time budget.

Fig. 1: Evolution of the decision of Meta-AAD on toy data. Data in the blue area are more likely to be presented to the analyst. In (a), the meta-policy prefers the instances that are far away from the majority, which is similar to an unsupervised anomaly detector. In (b) and (c), with more queries, the decision pattern evolves: the probability decreases in the regions around the normal instances (yellow) and increases in the regions around anomalies (red). (Panels: (a) initial state; (b) after 15 queries; (c) after 30 queries.)

Some re-ranking strategies have been proposed to approximate the above sequential decision process by greedily optimizing the immediate performance [4], [5], [6], [7]. Specifically, they adjust the anomaly scores based on the human feedback, aiming to rank anomalous instances higher. Then they greedily select the top-1 instance for the query, i.e., the one that is most likely to be anomalous. This greedy choice may benefit the immediate performance; however, it can be sub-optimal in the long-term. For example, some uncertain instances could be very helpful for correcting anomaly patterns [8]. Although these instances can be lower-ranked and harm the immediate performance, they may benefit the anomaly detector and help the analyst discover more anomalies in future iterations. Thus, we are motivated to study whether modeling long-term performance can benefit active anomaly detection.

However, it is non-trivial to achieve this goal due to the following challenges. First, it is unclear how we can quantify the long-term performance. In the current iteration, we can only predict the intermediate outcome, i.e., whether the instances are likely to be anomalous or not, but are not clear about future benefits. Moreover, it is also difficult to balance long-term and short-term performance in different scenarios. Second, the decision space is very large since we need to examine all the instances and select one of them for the query. This makes it hard to design the selection strategy, particularly for large or high-dimensional data. Third, different datasets have various data distributions and different sizes of decision spaces. We need a simple and transferable selection strategy that can be adopted across different datasets, which brings further challenges in designing the strategy.

To address these challenges, we propose Active Anomaly Detection with Meta-Policy (Meta-AAD), which learns a meta-policy to explicitly optimize the number of discovered anomalies. Specifically, we formulate active anomaly detection as a Markov decision process and leverage deep reinforcement learning to train the meta-policy to select the most proper instance in each iteration. The meta-policy is optimized to maximize the discounted cumulative reward, which combines short-term and long-term rewards. Extensive experiments demonstrate the effectiveness of Meta-AAD, particularly in the long-term. Moreover, Meta-AAD can be easily deployed since the trained meta-policy can be directly applied to any new dataset without further tuning. The main contributions of this work are as follows.
• We identify the importance of optimizing long-term performance for active anomaly detection.
• We propose Meta-AAD, a novel framework that leverages deep reinforcement learning to train a meta-policy to inherently optimize long-term performance.
• To enable the training of the meta-policy, we propose a practical solution that extracts transferable meta-features and optimizes the meta-policy on data streams.
• We instantiate our framework with Proximal Policy Optimization (PPO) [9]. Extensive experiments on 24 benchmark datasets demonstrate that Meta-AAD outperforms the state-of-the-art alternatives and the unsupervised baseline. Our empirical analysis shows that Meta-AAD can transfer across various datasets and inherently achieve a balance between long-term and short-term rewards.

II. PRELIMINARIES
In this section, we formulate the problem of active anomaly detection with meta-policy. We then provide background on the Markov Decision Process (MDP) and Deep Reinforcement Learning (DRL). After that, we describe a naive approach to training the meta-policy with DRL and discuss its limitations. The main symbols used in this work are summarized in Table I. Code is available at https://github.com/daochenzha/Meta-AAD.
TABLE I: Main Symbols and definitions.
Symbol | Definition
n | The number of instances.
d | The feature dimension of each instance.
l | The dimension of the transferable features.
X ∈ R^{n×d} | A dataset with n instances and d features.
G ∈ R^{n×l} | Transferable features with dimension l.
y ∈ R^n | The n labels of the dataset, where y_i ∈ {−1, 1}.
ŷ ∈ R^n | The state vector, where ŷ_i ∈ {−1, 0, 1}.
c ∈ R^n | The anomaly scores given by an unsupervised detector.
S | The state space in the Markov Decision Process (MDP).
A | The action space in the MDP.
R | The reward function in the MDP.
γ | The discount factor in the MDP.
A. Problem Formulation
We consider anomaly detection problems represented by a set of instances X = {x_1, x_2, ..., x_n} ∈ R^{n×d}, where n denotes the number of instances and d denotes the feature dimension. Each instance x_i is a d-dimensional vector {x_{i,1}, x_{i,2}, ..., x_{i,d}}. A feature x_{i,j} can be real-valued or categorical. Let y ∈ R^n be the ground-truths that correspond to the n instances in the dataset, where y_i ∈ {−1, 1}: −1 indicates that the instance is anomalous, and 1 indicates that the instance is normal. Anomaly detection aims at partitioning the instances into an anomaly set A = {x_1, x_2, ..., x_a} and a normality set N = {x_1, x_2, ..., x_b}, where a and b are the numbers of anomalous and normal instances, respectively. Usually, the set A accounts for a minority of the data, i.e., a ≪ b.

Conventional unsupervised anomaly detectors assign anomaly scores c ∈ R^n to all the instances based on X, i.e., they learn a mapping f : X → c, such that lower scores indicate that the instances are more likely to be anomalous. Given the anomaly scores, we can obtain an anomaly ranking, where the anomalous instances are expected to be higher ranked than the normal instances. However, such a ranking is usually not perfect since many of the higher-ranked instances
may actually be normal, and some lower-ranked instances could also turn out to be anomalous. Therefore, in practice, we usually require analyst (human) effort to investigate the higher-ranked instances and decide whether they are truly anomalous or not.

Based on the notations and intuitions above, we formally describe the problem of active anomaly detection with meta-policy as follows. Given a dataset X, at each step, a meta-policy will select one of the instances x_i for query, and a human will give a label indicating whether x_i is truly anomalous or not. Formally, let ŷ ∈ R^n be a state vector that corresponds to the n instances in the dataset. Here, ŷ_i ∈ {−1, 0, 1}, where −1 indicates that the instance has been selected for query and is indeed an anomaly, 1 indicates that the instance has been selected for query but turns out to be normal, and 0 indicates that the instance has not been presented to the analyst yet. The state vector ŷ is initialized with zeros for all the instances, i.e., no instance has been chosen for query at the initial state. The state of the selected instance is updated to 1 or −1 at each query step based on the feedback of the human. Given a budget of T queries, our goal is to learn a meta-policy (trained on some other labeled datasets) to decide the instance to query at each step, i.e., a mapping π : {X × ŷ} → {1, 2, ..., n}, such that the number of discovered true anomalies among the chosen instances is maximized when the budget T is used up.

Fig. 2: An overview of Meta-AAD. In training, we shuffle the data and feed them to the meta-policy in a streaming manner. The meta-policy is rewarded based on the labels. The trained meta-policy can then be directly applied to a new unlabeled dataset. In each iteration, the meta-policy chooses one of the instances and queries an analyst (human).

B. Markov Decision Process & Deep Reinforcement Learning
Markov Decision Process (MDP) describes a framework for sequential decision making. An MDP is defined as M = (S, A, P_T, R, γ), where S denotes the set of states, A denotes the set of actions, P_T : S × A × S → R^+ denotes the state transition function, R : S → R denotes the immediate reward function, and γ ∈ (0, 1] is a discount factor that balances short-term and long-term rewards. At each timestep t, the agent takes action a_t ∈ A according to the current state s_t ∈ S, and observes the next state s_{t+1} as well as a reward r_t = R(s_{t+1}). Our goal is to learn a policy π : S → A that maximizes the expected discounted cumulative reward E_π[Σ_{t=0}^∞ γ^t r_t].

Deep reinforcement learning (DRL) describes a family of algorithms for solving MDPs with deep neural networks [10]. Contemporary DRL algorithms often learn a state value function V(s_t) = E_{a_t, s_{t+1}, ...}[Σ_{l=0}^∞ γ^l R(s_{t+l})] [11], [9] or a state-action value function Q(s_t, a_t) = E_{s_{t+1}, a_{t+1}, ...}[Σ_{l=0}^∞ γ^l R(s_{t+l})] [10], [12] with deep neural networks to decide the most rewarding action at each state.
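To make the role of the discount factor γ concrete, the short sketch below (our own illustration, not from the paper; the reward sequence is a toy example) computes the discounted cumulative return for a finite episode and shows how a small γ emphasizes immediate rewards while a γ close to 1 weighs delayed rewards almost as much as immediate ones.

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{T-1} gamma^t * r_t for a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A toy episode: a small immediate reward now vs. a large reward later.
rewards = [0.1, 0.0, 0.0, 1.0]

for gamma in (0.1, 0.6, 0.99):
    print(f"gamma={gamma}: return={discounted_return(rewards, gamma):.3f}")
# With gamma=0.1 the early reward of 0.1 dominates the return; with
# gamma=0.99 the delayed reward of 1.0 contributes almost in full.
```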
C. Limitations of a Naive Approach

One may come up with a naive approach to training the meta-policy with deep reinforcement learning. Specifically, the active learning process could be naturally treated as an MDP if we consider the state as the state vector and the action as the queried instance, i.e., S = {X × ŷ}, A = {1, 2, ..., n}. Then, by appropriately defining a reward function, we can directly model the process as an MDP and train a policy to optimize performance with deep reinforcement learning algorithms.

However, this approach is infeasible because of two limitations. First, the state and action spaces are too large. The state dimension and action dimension are O(nd) and O(n), respectively, since at each iteration we observe the information of all the n instances and need to select one of the n instances for query. However, state-of-the-art deep reinforcement learning algorithms usually do not perform well on large state and action spaces [13], [14]. In our preliminary experiments, we also observe that the above naive method fails to train an effective meta-policy. Second, even if we could train a meta-policy, it would be difficult to transfer it to another dataset since the state and action spaces differ across datasets. The meta-policy will be of practical value only when it can be transferred. Therefore, this naive approach cannot be directly applied to our problem. In the following sections, we discuss how we address the above issues to enable stable meta-policy training.

III. METHODOLOGY
In this section, we elaborate on Active Anomaly Detection with Meta-Policy (Meta-AAD). An overview of Meta-AAD is illustrated in Figure 2. In the training stage, we extract transferable features as states (Section III-A). We then shuffle the data and feed them into the meta-policy in a streaming manner so that the state and action spaces can be significantly reduced (Section III-B). The meta-policy is trained with deep reinforcement learning based on some labeled datasets (Section III-C). Finally, the trained meta-policy can be directly applied to any new unlabeled dataset for active anomaly detection without further tuning (Section III-D).
A. Extracting Transferable Meta-Features
In this subsection, we aim to extract transferable meta-features that can be used across different datasets, i.e., we aim at defining a mapping g : {X × ŷ} → G ∈ R^{n×l}, where l is the dimension of the extracted features, such that G is less dependent on the dataset.

Intuitively, there are three types of information that are critical for deciding which instance to query. The first is the anomaly scores output by the anomaly detector. Anomaly scores indicate which instances are far away from the majority and thus help the meta-policy discover more anomalous instances. Second, the labeled anomalous instances are helpful. After several queries, we may be able to identify some anomalous instances; properly promoting the instances that are similar to these known anomalous instances will improve the performance. Third, labeled normal instances are also useful. Similarly, discouraging the instances that are similar to the known normal instances may decrease the false positives. Based on the intuitions above, we empirically extract the following features, with a total of six features.
• Detector features: The anomaly scores c output by unsupervised anomaly detectors. Any off-the-shelf anomaly detection algorithm can serve as the detector.
• Anomaly features: Features indicating the relatedness to the labeled anomalous instances. In this work, we extract three features for this purpose. We standardize the original features X and calculate the minimum and the mean Euclidean distances to the labeled anomalous instances. In addition, we introduce a binary feature indicating whether there exists an anomalous instance among the k-nearest neighbors.
• Normality features: Similarly, we use the minimum and the mean Euclidean distances to the labeled normal instances as the normality features.
Note that our framework allows flexible choices of features. For example, we may be able to improve the performance by using an ensemble of unsupervised anomaly detectors or more fine-grained anomaly and normality features. To keep our contribution focused, we adopt these simple features in all our experiments, which lead to reasonable performance in our empirical results. How to better model the transferable information will be an interesting direction for future work to further enhance the meta-policy.
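As a concrete illustration of the six meta-features described above, the sketch below derives them with scikit-learn's IsolationForest and Euclidean distances. It is a simplified reading of the description, not the authors' released implementation; the function name, the value of k, and the handling of empty label sets are our own assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def extract_meta_features(X, y_state, k=5):
    """Map a dataset and its query-state vector to transferable meta-features.

    X:       (n, d) raw features.
    y_state: (n,) vector with -1 (labeled anomaly), 1 (labeled normal), 0 (unlabeled).
    Returns a (n, 6) array: [detector score, min/mean distance to labeled anomalies,
    anomaly-in-kNN flag, min/mean distance to labeled normal instances].
    """
    n = X.shape[0]
    X_std = StandardScaler().fit_transform(X)

    # Detector feature: anomaly scores from any off-the-shelf detector.
    detector = IsolationForest(random_state=0).fit(X)
    scores = detector.score_samples(X)  # in sklearn, lower = more anomalous

    def dist_feats(ref_idx):
        # Min and mean Euclidean distance to a set of labeled reference instances.
        if len(ref_idx) == 0:
            return np.zeros(n), np.zeros(n)
        dists = np.linalg.norm(X_std[:, None, :] - X_std[None, ref_idx, :], axis=2)
        return dists.min(axis=1), dists.mean(axis=1)

    anom_idx = np.where(y_state == -1)[0]
    norm_idx = np.where(y_state == 1)[0]
    a_min, a_mean = dist_feats(anom_idx)
    n_min, n_mean = dist_feats(norm_idx)

    # Binary flag: is any labeled anomaly among the k nearest neighbors?
    knn_flag = np.zeros(n)
    if len(anom_idx) > 0:
        all_dists = np.linalg.norm(X_std[:, None, :] - X_std[None, :, :], axis=2)
        np.fill_diagonal(all_dists, np.inf)
        knn = np.argsort(all_dists, axis=1)[:, :k]
        knn_flag = np.isin(knn, anom_idx).any(axis=1).astype(float)

    return np.column_stack([scores, a_min, a_mean, knn_flag, n_min, n_mean])
```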
By mapping the original features to the above transferable features, we obtain the same feature dimension in different datasets, i.e., l is the same. However, the new features are not yet ready to be used for training since different datasets have different numbers of instances n. We address this remaining issue in the next subsection.

B. Learning from Data Streams

The transferable features G ∈ R^{n×l} obtained in the previous section and the action space A = {1, 2, ..., n} are still too large for a learning algorithm. Moreover, the size of these spaces is proportional to the size of the dataset, which makes the meta-policy impossible to transfer.

To enable the training of a transferable meta-policy, we propose to instead operate on data streams. Specifically, consider the transferable features of a training dataset G^{train} and its corresponding labels y^{train}. In each episode, we randomly shuffle G^{train} and y^{train} to obtain a permutation, denoted as G^{train'} and y^{train'}. Instead of giving all the data to the meta-policy, we feed the meta-policy one instance at a time. In this streaming setting, the state, action, and reward of the Markov Decision Process (MDP) are defined as follows.
• State S: The transferable features of the currently observed instance G^{train'}_i ∈ R^l, where i is the instance index.
• Action A: Actions can be 0 or 1, where 1 indicates that the current instance should be queried, while 0 indicates that the current instance should be ignored.
• Reward R: If the meta-policy queries an instance, we give a positive reward of 1 if the instance is indeed anomalous, and a small negative reward of −0.1 if the instance is normal. We give a reward of 0 if the meta-policy ignores an instance. The reward function is critical to describe the desired behaviors; we empirically study the impact of different reward choices in the experiments (see the bottom of Figure 4).
The above MDP describes an active learning procedure in a streaming setting. Intuitively, the meta-policy is encouraged to take action 1 if the current instance is anomalous and action 0 if it is normal. In this sense, the meta-policy is taught to discover more anomalies under a budget. We note that a meta-policy trained in the streaming setting could be sub-optimal when applied to the batch setting since the two MDPs have different objectives. Nonetheless, we find in practice that this concern is greatly outweighed by the benefits that the streaming setting brings: it significantly reduces the state and action spaces and makes the training of a transferable meta-policy feasible.
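A minimal environment expressing the streaming MDP above could look as follows. This is a sketch under our own assumptions (class and method names are illustrative, not the official code); it exposes one pre-computed meta-feature vector per step, with actions 1 (query) and 0 (skip), and rewards +1 / −0.1 / 0 as described.

```python
import numpy as np

class StreamingAADEnv:
    """Streaming active anomaly detection MDP (illustrative sketch).

    Each step exposes the meta-features of one instance; action 1 queries it,
    action 0 skips it. Rewards: +1 for querying a true anomaly, -0.1 for
    querying a normal instance, 0 for skipping.
    """

    def __init__(self, meta_features, labels, anomaly_reward=1.0, normal_penalty=-0.1):
        self.G = np.asarray(meta_features)   # (n, l) transferable features
        self.y = np.asarray(labels)          # (n,) with -1 = anomaly, 1 = normal
        self.anomaly_reward = anomaly_reward
        self.normal_penalty = normal_penalty
        self.order = None
        self.t = 0

    def reset(self, rng=None):
        # Shuffle the instances to obtain a new permutation for this episode.
        rng = rng or np.random.default_rng()
        self.order = rng.permutation(len(self.y))
        self.t = 0
        return self.G[self.order[self.t]]

    def step(self, action):
        idx = self.order[self.t]
        if action == 1:
            reward = self.anomaly_reward if self.y[idx] == -1 else self.normal_penalty
        else:
            reward = 0.0
        self.t += 1
        done = self.t >= len(self.order)
        obs = None if done else self.G[self.order[self.t]]
        return obs, reward, done, {}
```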
Algorithm 1 Training meta-policy with PPO
Input: A set of features {X_i}_{i=1}^N and the corresponding labels {y_i}_{i=1}^N, rollout steps T
Output: The trained meta-policy
  Initialize meta-policy π_θ, θ_old ← θ
  for iteration = 1, 2, ... until convergence do
    if iteration = 1 or the episode is over then
      Randomly sample {X', y'} from {X_i}_{i=1}^N, {y_i}_{i=1}^N
    end if
    Run π_{θ_old} with {X', y'} based on the MDP defined in Section III-B for T timesteps
    Compute advantages Â_1, ..., Â_T based on Equation (1)
    Update θ based on Equation (3)
    θ_old ← θ
  end for
  return π_θ

C. Training Meta-Policy with Deep Reinforcement Learning
Given the MDP defined in Section III-B, we can train the meta-policy with any deep reinforcement learning (DRL) algorithm. In this work, we instantiate our framework with Proximal Policy Optimization (PPO) [9]. We note that there are more advanced algorithms, such as [15], which we will explore in the future.

The meta-policy is described as a parametric policy π_θ(a|s), where s is an l-dimensional feature vector, a ∈ {0, 1}, Σ_{a∈{0,1}} π_θ(a|s) = 1, and θ denotes the parameters of the network. Our goal is to maximize the discounted cumulative reward E_π[Σ_{t=0}^∞ γ^t r_t]. PPO is an actor-critic algorithm, where the critic approximates the state values and the actor is the policy. Specifically, the critic of PPO trains a deep neural network to approximate V(s) through interacting with the environment. Then a generalized advantage estimator [16] is used:

Â_t = δ_t + Σ_{t'=1}^{T−1} (γλ)^{t'} δ_{t+t'},    (1)

where δ_t = r_t + γV(s_{t+1}) − V(s_t), T is the total number of timesteps in an episode, γ is the discount factor, and λ is a hyper-parameter to control the bias-variance trade-off. Intuitively, advantage values measure how much an action is better than the other actions. Based on the estimated advantages, the actor is updated by a clipped surrogate objective:

L^{CLIP}_t(θ) = Ê_t[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)],    (2)

where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t), π_{θ_old} is the policy before the update, clip(r_t(θ), 1 − ε, 1 + ε) clips r_t(θ) into the range [1 − ε, 1 + ε], and ε is a hyper-parameter to control the clip range. The clipping objective makes sure that the new policy will not deviate too much from the old policy, which enables stable policy improvement. In training, we use a combined loss to simultaneously update the value function:

L_t(θ) = Ê_t[L^{CLIP}_t(θ) − c_1 L^{VF}_t(θ) + c_2 · entropy(π_θ(·|s_t))],    (3)

where L^{VF}_t(θ) is a squared-error loss (V_θ(s_t) − V^{target}_t)^2, V^{target}_t is estimated based on the collected data, entropy(·) is a term to encourage exploration, and c_1 and c_2 are hyper-parameters. The expectation in Equation (3) can be approximated by sampling data from the environment.

The training procedure of the meta-policy is summarized in Algorithm 1. We assume the availability of several labeled datasets. In each episode, we randomly choose a dataset, shuffle the instances, and traverse the dataset from the beginning in a streaming manner.
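To make the training loop concrete, here is a hedged sketch of how Algorithm 1 could be driven with the PPO2 implementation from stable-baselines, which the experiments section reports using. The gym wrapper, the placeholder data, and the specific gamma and timestep values are illustrative assumptions, not the authors' released code.

```python
import gym
import numpy as np
from gym import spaces
from stable_baselines import PPO2

class StreamEnv(gym.Env):
    """Gym wrapper around the streaming MDP of Section III-B (sketch only)."""

    def __init__(self, G, y):
        super().__init__()
        self.G, self.y = G.astype(np.float32), y
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=(G.shape[1],), dtype=np.float32)
        self.action_space = spaces.Discrete(2)   # 0 = skip, 1 = query
        self.order, self.t = None, 0

    def reset(self):
        self.order, self.t = np.random.permutation(len(self.y)), 0
        return self.G[self.order[self.t]]

    def step(self, action):
        idx = self.order[self.t]
        reward = 0.0
        if action == 1:                           # query the analyst
            reward = 1.0 if self.y[idx] == -1 else -0.1
        self.t += 1
        done = self.t >= len(self.order)
        obs = self.G[self.order[self.t]] if not done else np.zeros_like(self.G[0])
        return obs, reward, done, {}

# Placeholder meta-features and labels purely for illustration.
G = np.random.rand(500, 6).astype(np.float32)
y = np.where(np.random.rand(500) < 0.05, -1, 1)

# Hyper-parameters follow the stable-baselines PPO2 defaults mentioned in the
# implementation details; gamma and total_timesteps are example values only.
model = PPO2("MlpPolicy", StreamEnv(G, y), gamma=0.6, n_steps=128,
             ent_coef=0.01, learning_rate=2.5e-4, vf_coef=0.5,
             lam=0.95, cliprange=0.2, verbose=0)
model.learn(total_timesteps=30000)
```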
Algorithm 2 Application of trained meta-policy
Input: Unlabeled dataset X ∈ R^{n×d}, trained meta-policy π_θ
Output: The detected anomalies
  Initialize state vector ŷ = {0}_{i=1}^n and anomaly list A = {}
  for iteration = 1, 2, ... until the budget is used up do
    Obtain transferable features G ∈ R^{n×l} from {X, ŷ}
    Compute π_θ(a = 1|s) based on G as p ∈ R^n
    Query the instance with the highest probability
    if the instance is anomalous then
      Put the instance into A
    end if
    Update ŷ based on the human feedback
  end for
  return A

D. Application of Meta-Policy
Once the meta-policy is trained, we can directly apply it to any new unlabeled dataset without further tuning. However, we note that there are some major differences between the application and the training. Instead of feeding one feature vector to the meta-policy at a time, we give all the features to the meta-policy to compute the probabilities for all the instances. Specifically, when applying the meta-policy to an unlabeled dataset X ∈ R^{n×d}, we first extract the transferable features G ∈ R^{n×l} according to X and the current state vector ŷ ∈ R^n. We then compute π_θ(a = 1|G_i) for all i ∈ {1, 2, ..., n} and obtain the probabilities p ∈ R^n. Finally, we choose the instance with the highest probability for the query, i.e., arg max_i p_i. Intuitively, the instance that is very likely to be selected in the streaming setting is also very likely to be chosen in this batch setting. The above procedure is summarized in Algorithm 2.

Note that π_θ(a = 1|G_i) is fundamentally different from the adjusted anomaly scores. In previous methods [5], [4], [6], [7], the anomaly scores are adjusted to promote the anomalous instances to the top. The main goal of the adjustment is to make the top-1 instance more likely to be anomalous so as to maximize the immediate performance. In contrast, the probability of the meta-policy plays a significantly different role. The probability is learned with the objective of maximizing the discounted cumulative reward, which is a combination of immediate and long-term rewards. That is, the long-term performance is inherently incorporated into the probabilities and the top-1 selection strategy.
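The batch-mode deployment described above can be sketched as the following query loop. The helper names (extract_meta_features, query_analyst) are assumptions for illustration; with a stable-baselines model, π_θ(a = 1|s) can be read off via its action_probability method, though this detail is an assumption rather than part of the paper.

```python
import numpy as np

def active_anomaly_detection(X, model, extract_meta_features, query_analyst, budget=100):
    """Apply a trained meta-policy to an unlabeled dataset (sketch of Algorithm 2).

    X: (n, d) unlabeled data. model: trained meta-policy (e.g., a stable-baselines
    PPO2 object). extract_meta_features and query_analyst are user-supplied callables.
    """
    n = X.shape[0]
    y_state = np.zeros(n)      # 0 = unlabeled, -1 = labeled anomaly, 1 = labeled normal
    detected = []

    for _ in range(budget):
        G = extract_meta_features(X, y_state)     # (n, l) transferable features
        # Probability of action 1 (query) for every instance under the meta-policy.
        probs = np.array([np.asarray(model.action_probability(g)).ravel()[1] for g in G])
        probs[y_state != 0] = -np.inf             # never re-query labeled instances
        i = int(np.argmax(probs))                 # top-1 selection

        is_anomaly = query_analyst(i)             # human feedback for instance i
        y_state[i] = -1 if is_anomaly else 1
        if is_anomaly:
            detected.append(i)
    return detected
```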
IV. EXPERIMENTS

In this section, we conduct extensive experiments to evaluate Meta-AAD. We mainly focus on the following research questions.
• RQ1: How does the meta-policy select the query, and how does its decision evolve in different stages (Section IV-B)?
• RQ2: How does Meta-AAD compare with the state-of-the-art alternatives and the unsupervised baseline (Section IV-C)?
• RQ3: How will Meta-AAD perform with different features, numbers of labeled datasets, and reward functions (Section IV-D)?
• RQ4: How many computational resources are needed to train a meta-policy (Section IV-E)?
• RQ5: How does Meta-AAD balance long-term and short-term rewards (Section IV-E)?
A. Experimental Settings
Datasets and evaluation metric.
To demonstrate the generality of Meta-AAD, we select datasets with various sizes, feature dimensions, and anomaly ratios from ODDS (http://odds.cs.stonybrook.edu/). Table II summarizes the statistics of the datasets. We also use a toy dataset from [5] for better visualization. For the evaluation metric, we use the anomaly discovery curve [17], which plots the number of discovered anomalies with respect to the number of queries. A perfect result is a line with a slope of 1, i.e., all the queries are anomalous. The worst case is a line with a slope of 0, i.e., all the queries are normal. Following [6], we set the maximum budget to 100 queries for all the datasets.
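For reference, the anomaly discovery curve can be computed directly from the sequence of query outcomes; the short sketch below is our own helper, not part of the benchmark code, and simply returns the cumulative count of discovered anomalies after each query.

```python
import numpy as np

def anomaly_discovery_curve(query_outcomes):
    """Cumulative number of true anomalies after each query.

    query_outcomes: iterable of booleans, True if the i-th queried instance
    turned out to be anomalous. A perfect detector yields a curve of slope 1,
    the worst case a curve of slope 0.
    """
    return np.cumsum(np.asarray(query_outcomes, dtype=int))

# Example: 5 queries, of which the 1st, 2nd, and 4th were true anomalies.
print(anomaly_discovery_curve([True, True, False, True, False]))  # [1 2 2 3 3]
```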
Baselines. We compare Meta-AAD with the state-of-the-art methods as well as an unsupervised baseline as follows.
• AAD. Active Anomaly Detection [5] is a state-of-the-art method based on node re-weighting.
• FIF. Feedback-Guided Isolation Forest [6] is a recently proposed active anomaly detector via online optimization.
• SSDO. Semi-Supervised Detection of Outliers [18] is a recent semi-supervised point-wise anomaly detector. We are interested in studying how semi-supervised methods perform in the active learning setting since they are also designed to leverage label information.
• Unsupervised. We also include Isolation Forest (IF) [2] as an unsupervised baseline.
While Meta-AAD can be generally applied to any unsupervised anomaly detector or an ensemble of detectors, for a fair comparison we follow the previous work [5], [6] and use Isolation Forest (IF) [2] with the same hyper-parameters as in [5], [6]. For SSDO and the unsupervised baseline, we select the top-1 anomalous instance in each iteration.
Implementation details.
For training the meta-policy, we use the PPO implementation in OpenAI Baselines (via the stable-baselines fork, https://github.com/hill-a/stable-baselines). Following the default settings, we set the rollout steps T = 128, the entropy coefficient c_2 = 0.01, the learning rate 2.5 × 10^{-4}, the value function coefficient c_1 = 0.5, λ = 0.95, and the clip range ε = 0.2. Recall that γ is the hyper-parameter that balances long-term and short-term rewards; we set it empirically to an intermediate value (its effect is studied in Section IV-E). We train the meta-policy on the top 12 datasets of Table II (in alphabetical order) and apply it to the bottom 12 datasets; we do it reversely to evaluate the top datasets. The meta-policy is trained for a fixed number of timesteps and with a fixed episode length, using the same hyper-parameters across all the datasets. For the base detector, Isolation Forest, we use the implementation in sklearn (https://scikit-learn.org/) with the default hyper-parameter settings. We use the original implementations of FIF (https://github.com/siddiqmd/FeedbackIsolationForest), AAD (https://github.com/shubhomoydas/ad_examples), and SSDO (https://github.com/Vincent-Vercruyssen/anomatools) released by their authors. For FIF, we try both linear and log-likelihood losses and report the best result. For SSDO, we find it beneficial to use Isolation Forest for the queries at the beginning and then switch to SSDO once we have hit at least one anomaly; we report the results with this strategy since we observe that it outperforms randomly selecting instances at the beginning. All the experiments are repeated multiple times, and the average results and standard errors are reported.

TABLE II: Statistics of the datasets.
Dataset | Points | Dim. | Anomalies | Anomaly%
Annthyroid | 7200 | 6 | 534 | 7.4
Arrhythmia | 452 | 274 | 66 | 15.0
Breastw | 683 | 9 | 239 | 35.0
Cardio | 1831 | 21 | 176 | 9.6
Glass | 214 | 9 | 9 | 4.2
Ionosphere | 351 | 33 | 126 | 36.0
Letter | 1600 | 32 | 100 | 6.3
Lympho | 148 | 18 | 6 | 4.1
Mammography | 11183 | 6 | 260 | 2.3
Mnist | 7603 | 100 | 700 | 9.2
Musk | 3062 | 166 | 97 | 3.2
Optdigits | 5216 | 64 | 150 | 3.0
Pendigits | 6870 | 16 | 156 | 2.3
Pima | 768 | 8 | 268 | 35
Satellite | 6435 | 36 | 2036 | 32.0
Satimage-2 | 5803 | 36 | 71 | 1.2
Shuttle | 49097 | 9 | 3511 | 7.0
Speech | 3686 | 400 | 61 | 1.7
Thyroid | 3772 | 6 | 93 | 2.5
Vertebral | 240 | 6 | 30 | 12.5
Vowels | 1456 | 12 | 50 | 3.4
Wbc | 278 | 30 | 21 | 5.6
Wine | 129 | 13 | 10 | 7.7
Yeast | 1364 | 8 | 64 | 4.7

B. A Case Study on the Toy Data
To study
RQ1, we visualize the evolution of the decision of Meta-AAD on the toy data from [5] (see Figure 1), which is a small dataset with 2-dimensional features. We use the meta-policy pre-trained on the top 12 datasets in Table II. We visualize the output probability of action 1 in the meta-policy, i.e., the probability of being selected for the query. Note that this probability is similar to an anomaly score, but it is based on a different objective: the top instances are expected not only to have good immediate performance, i.e., they should be very likely to be anomalous, but also to benefit the performance in the long-term.
Fig. 3: Performance comparison of Meta-AAD against the state-of-the-art alternatives and the unsupervised baseline (anomaly discovery curves on the 24 benchmark datasets).

In the initial state, the meta-policy tends to choose the instances that are far away from the majority, which is similar to the behavior of unsupervised anomaly detectors. We expect that the meta-policy has learned to give more weight to the detector features in the initial state, when we do not yet have labeled samples. We can also observe that, with more queries, the decision pattern evolves. On the one hand, the probability decreases in the regions around the labeled normal instances (the yellow instance in the bottom-left corner). On the other hand, the probability increases in the regions around labeled anomalies (the red triangles on the right-hand side). This behavior aligns with previous active anomaly detectors [5], [6]. Instead of adjusting anomaly scores, however, the meta-policy is optimized to maximize the discounted cumulative reward, which can better model the long-term performance compared with the previous methods.
C. Performance on Benchmark Datasets
To answer
RQ2, we compare Meta-AAD against the baselines on the real-world datasets. The anomaly discovery curves are illustrated in Figure 3. To better understand the performance, we rank the algorithms by the number of discovered anomalies under 20, 40, 60, 80, and 100 queries, report the average rankings, and highlight the improvement of Meta-AAD over the second-best method in Table III.

TABLE III: Average rankings of the number of discovered anomalies under different numbers of queries across the benchmarks, and the improvement of Meta-AAD over the second-best state-of-the-art method, for Unsupervised [2], SSDO [18], AAD [5], FIF [6], and Meta-AAD. The improvement grows with more queries, i.e., Meta-AAD delivers stronger performance in the long-term. A marker denotes the cases where Meta-AAD is significantly better than the baseline w.r.t. the Wilcoxon signed-rank test.

We make the following observations. First, all the active anomaly detectors perform significantly better than the unsupervised baseline and the semi-supervised method. Specifically, Meta-AAD, FIF, and AAD discover more anomalies with the same number of queries on most of the datasets and perform similarly on the others. This is expected since labeled instances provide useful information that can help us discover more anomalies. We observe that SSDO performs slightly better than the unsupervised baseline but is far behind the active methods.
A possible explanation is that SSDO optimizes a different objective and thus has sub-optimal performance in the active learning setting.

Second, Meta-AAD consistently delivers better performance than the state-of-the-art alternatives across the datasets. With very few exceptions, Meta-AAD improves upon the baselines. For example, Meta-AAD achieves large improvements on Letter and Speech, and clear improvements on Arrhythmia, Ionosphere, and Pima, compared with the best alternative. In the other tasks, Meta-AAD also achieves better or similar performance. Note that Meta-AAD achieves this performance without any training or tuning on the target datasets, and thus it is easy to use in applications. The above results demonstrate the effectiveness of training a meta-policy for active anomaly detection.

Third, Meta-AAD tends to be stronger in the long-term. In Table III, we observe that Meta-AAD is ranked higher and higher with more queries. Specifically, with 20 queries, the average rank of Meta-AAD shows only a minor improvement over FIF; with 100 queries, the gap becomes considerably larger. This suggests that Meta-AAD can better model long-term rewards. We speculate that deep reinforcement learning inherently models and balances short-term and long-term performance, which benefits the anomaly detector in the long-term.

D. Ablation Studies
To better understand where the performance comes from, we answer RQ3 with ablation studies (see Figure 4). We focus on Annthyroid, Mammography, and Satimage-2.

First, we study the impact of using different features. Recall that we have three types of features, i.e., detector features, anomaly features, and normality features. We remove each of them in turn and plot the curves in the top row of Figure 4. We observe that each type of feature contributes to the final performance. Using all three types of features leads to the best performance. This suggests the proposed three types of features may be complementary for training a good meta-policy.

Second, we investigate the impact of using different numbers of training datasets. To study whether the performance will drop if we train the meta-policy with fewer data, we report the results with different numbers of training datasets (middle row of Figure 4).
Fig. 4: Ablation study of Meta-AAD. We show the learning curves on Annthyroid, Mammography, and Satimage-2 when dropping different features (top row), using different numbers of training datasets (middle row), and using different negative rewards for a missed query (bottom row).

Specifically, we randomly drop some datasets and train the meta-policy on the resulting subset. We repeat the process several times and report the average performance. We observe that, although the performance tends to be more robust with more training datasets, we can train a strong meta-policy even with just one dataset. This suggests that the proposed features are indeed transferable and that the proposed training strategy of the meta-policy is effective.

Third, we are interested in how the reward will impact performance. Recall that we give a positive reward of 1 for a discovered anomaly, a negative reward of −0.1 for selecting a normal instance, and a reward of 0 for not querying. Here, we vary the negative reward with the other rewards fixed (bottom of Figure 4). Different negative rewards lead to different ratios between positive and negative rewards, which defines the desired behavior of the meta-policy. We argue that the choice of rewards should depend on the situation. For example, if examining an instance requires a lot of effort, a larger negative reward is preferred. On the contrary, if we do not need much effort to check an instance, a smaller negative reward could be better. As for the anomaly discovery curves, we observe that overly large negative rewards worsen the performance, and a small negative reward of −0.1 works well across the datasets. To summarize, we find that the default choices work well across different datasets, delivering good performance even with few training data, which suggests that Meta-AAD could be a general framework for various scenarios.

Fig. 5: The average number of discovered anomalies across all the datasets given 100 queries, with respect to the number of training steps (left) and different γ values (right).

E. Analysis of the Meta-Policy
We study
RQ4 by plotting the average performance with 100 queries across the datasets with respect to the number of training steps of the meta-policy in the left-hand side of Figure 5. We observe that the policy converges very fast. We note that, on a personal computer with a single process, training the meta-policy usually takes less than a minute. Therefore, the training of the meta-policy is computationally efficient.

We investigate RQ5 by showing the average performance with 100 queries using different γ in the right-hand side of Figure 5. Recall that γ is a hyper-parameter to balance short-term and long-term performance. In the extreme cases, γ = 0 suggests that we only care about short-term performance, while γ close to 1 suggests that long-term performance matters (γ cannot be larger than 1 due to the nature of reinforcement learning algorithms). We observe that giving too much preference to either long-term or short-term rewards harms the performance. We suggest that γ should be specified based on our needs, i.e., whether we care more about long-term or short-term performance. In the conducted experiments, we use the same intermediate value of γ across all the datasets.

V. RELATED WORK
Anomaly detection.
Anomaly detection has been extensively studied in the past decades, e.g., density-based approaches [3], distance-based approaches [19], [20], and ensembles [2], [21], [22]. Anomaly detection algorithms have also been developed for various types of data, such as categorical data [23], multi-dimensional data [2], time-series data [24], and graph data [25]. Most of these algorithms are unsupervised, with strong assumptions about the anomaly patterns [26]. However, these algorithms may not work well when the assumptions do not hold. On the contrary, our Meta-AAD rarely relies on such assumptions; it instead aligns anomaly patterns with human interests by leveraging human feedback.
Semi-supervised anomaly detection.
Semi-supervised learning methods [27], [28] have been studied in the context of anomaly detection. Semi-supervised anomaly detection assumes that a small set of labeled instances can be used to improve the performance [29]. In [30], a small set of anomalous instances is leveraged to re-weight the anomaly scores with belief propagation. [31] improves representation learning by using a few anomalous instances. [32] incorporates label information with support vector data description. AI2 [33] ensembles unsupervised and supervised anomaly detectors. AutoML methods use a set of labeled instances to perform automated algorithm selection and neural architecture search [34], [35]. More recently, [36] proposes a semi-supervised anomaly detection approach for deep neural networks. However, these methods are designed for the batch setting, which could be sub-optimal in the active learning setting.
Active anomaly detection.
Active learning in anomaly detection is much more challenging than traditional active learning [37], [38] because of the imbalanced data. Instead of assuming a batch of labeled data, active anomaly detection interacts with humans and recomputes the anomaly scores based on the feedback [4], [5], [39], [40]. These methods usually define an optimization problem based on the human feedback and re-weight the instances at each iteration. [41] proposes to adaptively adjust the ensemble for active anomaly detection. [6] proposes to incorporate feedback by leveraging online convex optimization to improve efficiency and simplicity. [17] proposes to use contextual multi-armed bandits and clustering techniques to identify the anomalies in attributed networks in an interactive manner. OJRANK [7] re-ranks the instances in each iteration based on the top-1 feedback. While these prior methods incorporate humans in the loop, they all adopt a greedy strategy of selecting the top-1 anomalous instance in each iteration, which fails to model long-term performance. In contrast, our Meta-AAD builds upon deep reinforcement learning, which inherently models and optimizes long-term performance. Moreover, the previous methods require complicated optimization to re-weight the instances in each iteration, whereas the trained meta-policy of Meta-AAD is easy to use since it can be directly applied to different datasets without further training or tuning.
Learning meta-policy.
Deep reinforcement learning algorithms have shown promise in various domains [10], [42]. The idea of meta-policy learning is to train a reinforcement learning agent to make decisions with the objective of optimizing the overall performance of the task. Some recent studies on deep reinforcement learning have demonstrated the effectiveness of the meta-policy [43], [44], [45]. Related studies in graph neural networks [46] and natural language processing [47] also show the effectiveness of meta-policy learning. Besides the difference in objectives, these studies are limited to the same or parallel datasets, whereas we demonstrate that the meta-policy in Meta-AAD can be generally transferred across various datasets.

VI. CONCLUSIONS AND FUTURE WORK
In this work, we propose Meta-AAD, a framework for incorporating human feedback into anomaly detection. The meta-policy in Meta-AAD is trained with deep reinforcement learning to optimize long-term performance. We instantiate our framework with PPO and evaluate it on benchmark datasets. The empirical results demonstrate that Meta-AAD outperforms state-of-the-art alternatives. We further conduct an extensive analysis of our framework. We find that a single configuration performs well across different datasets and that Meta-AAD can inherently balance long-term and short-term rewards, which suggests that Meta-AAD could be a general framework for active anomaly detection.

For future work, we would like to conduct more studies on how we can better extract transferable meta-features. In this work, we empirically choose the features. We are interested in exploring more features to improve Meta-AAD or make the performance more robust. We would also like to try other deep reinforcement learning algorithms. Finally, we will explore the possibility of applying Meta-AAD to other tasks, such as time series, graphs, and images.

ACKNOWLEDGEMENT
The work is, in part, supported by NSF (IIS-1750074, CNS-1816497, IIS-1718840). The views and conclusions in this paper are those of the authors and should not be interpreted as representing any funding agencies.
REFERENCES
[1] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.
[2] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in ICDM, 2008.
[3] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," in SIGMOD, 2000.
[4] S. Das, W.-K. Wong, T. Dietterich, A. Fern, and A. Emmott, "Incorporating expert feedback into active anomaly discovery," in ICDM, 2016.
[5] S. Das, W.-K. Wong, A. Fern, T. G. Dietterich, and M. A. Siddiqui, "Incorporating feedback into tree-based anomaly detection," arXiv preprint arXiv:1708.09441, 2017.
[6] M. A. Siddiqui, A. Fern, T. G. Dietterich, R. Wright, A. Theriault, and D. W. Archer, "Feedback-guided anomaly discovery via online optimization," in KDD, 2018.
[7] H. Lamba and L. Akoglu, "Learning on-the-job to re-rank anomalies from top-1 feedback," in SDM, 2019.
[8] B. Settles, "Active learning literature survey," University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 2009.
[9] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[11] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in ICML, 2015.
[12] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in ICLR, 2016.
[13] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin, "Deep reinforcement learning in large discrete action spaces," arXiv preprint arXiv:1512.07679, 2015.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in ICML, 2018.
[16] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
[17] K. Ding, J. Li, and H. Liu, "Interactive anomaly detection on attributed networks," in WSDM, 2019.
[18] V. Vercruyssen, M. Wannes, V. Gust, M. Koen, B. Ruben, and D. Jesse, "Semi-supervised anomaly detection with an application to water analytics," in ICDM, 2018.
[19] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," in SIGMOD, 2000.
[20] F. Angiulli and C. Pizzuti, "Fast outlier detection in high dimensional spaces," in ECML PKDD, 2002.
[21] J. Chen, S. Sathe, C. Aggarwal, and D. Turaga, "Outlier detection with autoencoder ensembles," in SDM, 2017.
[22] G. Pang, L. Cao, L. Chen, D. Lian, and H. Liu, "Sparse modeling-based sequential ensemble learning for effective outlier detection in high-dimensional numeric data," in AAAI, 2018.
[23] L. Akoglu, H. Tong, J. Vreeken, and C. Faloutsos, "Fast and reliable anomaly detection in categorical data," in CIKM, 2012.
[24] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, "Outlier detection for temporal data: A survey," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 9, pp. 2250–2267, 2013.
[25] L. Akoglu, H. Tong, and D. Koutra, "Graph based anomaly detection and description: A survey," Data Mining and Knowledge Discovery, vol. 29, no. 3, pp. 626–688, 2015.
[26] Y. Zhao, Z. Nasrullah, and Z. Li, "PyOD: A Python toolbox for scalable outlier detection," arXiv preprint arXiv:1901.01588, 2019.
[27] X. J. Zhu, "Semi-supervised learning literature survey," University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 2005.
[28] D. Zha and C. Li, "Multi-label dataless text classification with topic modeling," Knowledge and Information Systems, vol. 61, no. 1, pp. 137–160, 2019.
[29] Y. Zhao and M. K. Hryniewicki, "XGBOD: Improving supervised outlier detection with unsupervised representation learning," in IJCNN, 2018.
[30] A. Tamersoy, K. Roundy, and D. H. Chau, "Guilt by association: Large scale malware detection by mining file-relation graphs," in KDD, 2014.
[31] G. Pang, L. Cao, L. Chen, and H. Liu, "Learning representations of ultrahigh-dimensional data for random distance-based outlier detection," in KDD, 2018.
[32] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld, "Toward supervised anomaly detection," Journal of Artificial Intelligence Research, vol. 46, pp. 235–262, 2013.
[33] K. Veeramachaneni, I. Arnaldo, V. Korrapati, C. Bassias, and K. Li, "AI^2: Training a big data machine to defend," in BigDataSecurity, 2016.
[34] Y. Li, D. Zha, P. Venugopal, N. Zou, and X. Hu, "PyODDS: An end-to-end outlier detection system with automated machine learning," in WWW, 2020.
[35] Y. Li, Z. Chen, D. Zha, K. Zhou, H. Jin, H. Chen, and X. Hu, "AutoOD: Automated outlier detection via curiosity-guided search and self-imitation learning," arXiv preprint arXiv:2006.11321, 2020.
[36] L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K.-R. Müller, and M. Kloft, "Deep semi-supervised anomaly detection," arXiv preprint arXiv:1906.02694, 2019.
[37] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, "Active learning with statistical models," Journal of Artificial Intelligence Research, vol. 4, pp. 129–145, 1996.
[38] H. T. Nguyen and A. Smeulders, "Active learning using pre-clustering," in ICML, 2004.
[39] J. He and J. G. Carbonell, "Nearest-neighbor-based active learning for rare category detection," in NeurIPS, 2008.
[40] D. Zhou, J. He, H. Yang, and W. Fan, "SPARC: Self-paced network representation for few-shot rare category characterization," in KDD, 2018.
[41] S. Das, M. R. Islam, N. K. Jayakodi, and J. R. Doppa, "Active anomaly detection via ensembles," arXiv preprint arXiv:1809.06477, 2018.
[42] D. Zha, K.-H. Lai, Y. Cao, S. Huang, R. Wei, J. Guo, and X. Hu, "RLCard: A toolkit for reinforcement learning in card games," arXiv preprint arXiv:1910.04376, 2019.
[43] D. Zha, K.-H. Lai, K. Zhou, and X. Hu, "Experience replay optimization," in IJCAI, 2019.
[44] Z. Xu, H. P. van Hasselt, and D. Silver, "Meta-gradient reinforcement learning," in NeurIPS, 2018.
[45] K.-H. Lai, D. Zha, Y. Li, and X. Hu, "Dual policy distillation," in IJCAI, 2020.
[46] K.-H. Lai, D. Zha, K. Zhou, and X. Hu, "Policy-GNN: Aggregation optimization for graph neural networks," in KDD, 2020.
[47] L. Duong, H. Afshar, D. Estival, G. Pink, P. Cohen, and M. Johnson, "Active learning for deep semantic parsing," in