
Anomaly Mining - Past, Present and Future

 

Abstract


Anomaly mining is an important problem with numerous applications in real-world domains such as environmental monitoring, cybersecurity, finance, healthcare, and medicine, to name a few. In this article, I focus on two areas: (1) point-cloud and (2) graph-based anomaly mining. I aim to present a broad view of each area and discuss the main classes of research problems, recent trends, and future directions. I conclude with key take-aways and overarching open problems.

Disclaimer. I try to provide an overview of past and recent trends in two areas within four pages. Undoubtedly, this is my personal view of the trends, which could be organized differently. For brevity, I omit all technical details and refer to the corresponding papers. Again, due to the space limit, it is not possible to include all (or even the most relevant) references, but only a few representative examples.

1 Point-cloud Anomaly Mining

Point-cloud data consists of points that reside in a feature space, each of which can be seen as a d-dimensional vector. Anomalous points are typically referred to as outliers, and in this section I adopt this terminology. Outlier mining has a very large literature, where most attention has been given to outlier detection (OD) under various settings [Aggarwal, 2013]. There exists a large pool of detectors that are distance-, density-, statistical-, cluster-, angle-, and depth-based, among many others [Chandola et al., 2009]. Most detection models assume outliers to be scattered, isolated points, while some specifically aim to detect collective outliers that can be seen as micro-clusters [Han et al., 2012]. Another class of detectors targets contextual outliers, which stand out within a specific context [Liang and Parthasarathy, 2016; Macha et al., 2018]. These can also be seen as conditional outliers [Song et al., 2007].
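As a concrete (if simplistic) illustration of the distance-based family mentioned above, the following sketch scores each point by its distance to its k-th nearest neighbor, a classic distance-based outlier score. The function name and toy data are my own for illustration, not taken from any of the cited papers.

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each point by its distance to its k-th nearest neighbor.

    Larger scores indicate more isolated points (likelier outliers).
    """
    # Pairwise Euclidean distances (n x n).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Sorting each row puts the point itself (distance 0) at index 0,
    # so the k-th nearest neighbor's distance sits at index k.
    sorted_d = np.sort(dists, axis=1)
    return sorted_d[:, k]

# A tight cluster of 30 points plus one far-away point (index 30).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(30, 2)), [[5.0, 5.0]]])
scores = knn_outlier_scores(X, k=5)
print(int(np.argmax(scores)))  # → 30 (the appended far-away point)
```

Density- and cluster-based detectors refine the same idea by comparing a point's local neighborhood to those of its neighbors, rather than using the raw distance alone.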
In addition, dynamic/streaming point-cloud OD has been studied at large [Gupta et al., 2013], as outliers often arise in settings where data is collected and monitored over time.

∗ A position paper for the IJCAI 2021 Early Career Spotlight Talk.

In the rest of this section, I discuss some of the trending classes of problems in outlier mining, organized into four lines of work: (1) user-centric OD, (2) deep learning based OD, (3) automating OD, and (4) fairness-aware OD.

1.1 User-centric Outlier Detection

User-centric outlier mining comprises two related topics: (i) explanations and (ii) human-in-the-loop detection (HILD). Explaining the detected anomalies is crucial in settings where outliers need to be vetted by human analysts. The purpose of vetting could be root-cause analysis/troubleshooting or sanity-checking/justification. An example of the former scenario is when an analyst identifies faults or inefficiencies in a production line or data center through OD and aims to fix the issues generating these outliers. The aim of the latter scenario is to distinguish statistical outliers from domain-relevant ones, where, e.g., in claims auditing, not all outliers are necessarily associated with fraud. Relatedly, HILD aims to leverage human feedback for sieving mere statistical outliers out of domain-relevant ones, to eliminate false positives and thereby increase the detection rate. These two problems are intertwined, since explanations could be presented to human analysts for acquiring effective feedback during HILD. Although the vast body of work on outlier explanation is recent, the earliest example dates back several decades [Knorr and Ng, 1999], which provided what is called "intensional knowledge" by identifying minimal subspaces in which outliers stand out.
Most existing work in this area is discriminative: since explanation follows detection, which outputs (outlier/inlier) labels, these methods aim to identify subspaces that well separate the outliers from the inliers [Dang et al., 2014; Kuo and Davidson, 2016; Liu et al., 2018]. While these have focused on providing a separate explanation for each outlier, others aim to provide explanations for groups of outliers [Macha and Akoglu, 2018; Gupta et al., 2018], with the intent to reduce the information overload on the analyst. On the other hand, interactive OD mainly aims to leverage the (ground-truth) labels provided by a human analyst during an auditing process to maximize the total number of true anomalies shown within a given auditing budget [Das et al., 2016]. In addition to detection precision, others also factor human effort into the overall objective [Lamba and Akoglu, 2019; Chai et al., 2020]. Some of the remaining challenges in user-centric OD in

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)
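The budgeted-auditing setup described above can be sketched as a minimal greedy loop: show the analyst the highest-scoring unvetted points and count the true anomalies surfaced. This is a simplistic baseline of my own, assuming the analyst's verdicts are available as labels; methods such as [Das et al., 2016] go further and re-rank the remaining points after each verdict.

```python
import numpy as np

def audit_loop(scores, labels, budget):
    """Greedy human-in-the-loop audit baseline.

    `scores` are detector outlier scores; `labels` stands in for the
    analyst's verdicts (1 = domain-relevant anomaly, 0 = mere
    statistical outlier). Returns the number of true anomalies
    surfaced within the auditing budget.
    """
    order = np.argsort(scores)[::-1]  # vet from most to least anomalous
    shown = order[:budget]
    return int(labels[shown].sum())

scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7])
labels = np.array([1,   0,   0,   0,   1])  # hypothetical analyst verdicts
print(audit_loop(scores, labels, budget=3))  # → 2
```

Note that this baseline only records the feedback; interactive OD methods close the loop by using each verdict to update the detector, so that later queries are more likely to hit domain-relevant anomalies.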

Pages 4932-4936
DOI 10.24963/ijcai.2021/697
Language English
