Anomaly detection

In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior.[1] Such examples may arouse suspicions of being generated by a different mechanism,[2] or appear inconsistent with the remainder of that set of data.[3]

Anomaly detection finds application in many domains including cybersecurity, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud to name only a few. Anomalies were initially searched for clear rejection or omission from the data to aid statistical analysis, for example to compute the mean or standard deviation. They were also removed to better predictions from models such as linear regression, and more recently their removal aids the performance of machine learning algorithms. However, in many applications anomalies themselves are of interest and are the observations most desirous in the entire data set, which need to be identified and separated from noise or irrelevant outliers.

Three broad categories of anomaly detection techniques exist.[1] Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involves training a classifier. However, this approach is rarely used in anomaly detection due to the general unavailability of labelled data and the inherent unbalanced nature of the classes. Semi-supervised anomaly detection techniques assume that some portion of the data is labelled. This may be any combination of the normal or anomalous data, but more often than not, the techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance to be generated by the model. Unsupervised anomaly detection techniques assume the data is unlabelled and are by far the most commonly used due to their wider and relevant application.

  1. ^ a b Cite error: The named reference ChandolaSurvey was invoked but never defined (see the help page).
  2. ^ Hawkins, Douglas M. (1980). Identification of Outliers. Springer. ISBN 978-0-412-21900-9. OCLC 6912274.
  3. ^ Barnett, Vic; Lewis, Lewis (1978). Outliers in statistical data. Wiley. ISBN 978-0-471-99599-9. OCLC 1150938591.

Developed by StudentB