Anomaly Detection involves identifying rare or unusual data points that deviate significantly from the norm. It is an unsupervised learning task because we typically don't have labels indicating which examples are anomalies. It is used in scenarios where catching unexpected behavior is critical, such as fraud prevention, fault detection, and cybersecurity.
The Gaussian Distribution (Normal Distribution) is a continuous probability distribution that is symmetric around the mean. Many real-world measurements closely approximate this bell curve, which the Central Limit Theorem explains for quantities that arise as the sum of many small independent effects. Its density, for mean μ and variance σ², is:
p(x) = (1 / √(2πσ²)) * exp( - (x - μ)² / (2σ²) )
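As a quick illustration, the density formula above can be written directly in Python. This is a minimal sketch using only the standard library; the function name `gaussian_pdf` and the sample inputs are chosen here for illustration and do not come from any particular library.

```python
import math

def gaussian_pdf(x: float, mu: float, sigma2: float) -> float:
    """Density of a Gaussian with mean mu and variance sigma2, evaluated at x.

    Note that sigma2 is the variance (sigma squared), not the standard deviation.
    """
    if sigma2 <= 0:
        raise ValueError("variance must be positive")
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma2))

# The density peaks at the mean; for mu=0, sigma2=1 the peak is 1/sqrt(2*pi).
peak = gaussian_pdf(0.0, mu=0.0, sigma2=1.0)

# Symmetry around the mean: points equally far from mu get the same density.
left = gaussian_pdf(-1.5, mu=0.0, sigma2=1.0)
right = gaussian_pdf(1.5, mu=0.0, sigma2=1.0)
```

Values far from μ receive a very small density, which is what the anomaly threshold exploits.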
We assume each feature in the dataset follows a Gaussian distribution. This allows us to model the likelihood of each feature value individually.
1. Estimate μ and σ² for each feature using training data
2. Compute p(x) for each example
3. If p(x) < ε, mark it as an anomaly

Note: For datasets with multiple correlated features, use a multivariate Gaussian distribution instead of assuming independence between variables.
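The steps above can be sketched in plain Python as follows. This sketch assumes independent features, so p(x) is the product of the per-feature densities, and it uses the maximum-likelihood variance (dividing by the number of examples). The function names, the training values, and the threshold `epsilon` are hypothetical and only illustrate the flow.

```python
import math

def fit_gaussian(rows):
    """Step 1: estimate the mean and variance of each feature (column)."""
    m = len(rows)
    n = len(rows[0])
    mu = [sum(r[j] for r in rows) / m for j in range(n)]
    sigma2 = [sum((r[j] - mu[j]) ** 2 for r in rows) / m for j in range(n)]
    return mu, sigma2

def density(x, mu, sigma2):
    """Step 2: p(x) as the product of independent per-feature Gaussian densities.

    Assumes every variance is positive; a constant feature would need special handling.
    """
    p = 1.0
    for xj, mj, vj in zip(x, mu, sigma2):
        p *= math.exp(-((xj - mj) ** 2) / (2.0 * vj)) / math.sqrt(2.0 * math.pi * vj)
    return p

def is_anomaly(x, mu, sigma2, epsilon):
    """Step 3: flag the example when its density falls below the threshold."""
    return density(x, mu, sigma2) < epsilon

# Hypothetical training data with two features per example (illustrative values).
train = [[10.0, 5.0], [10.2, 5.1], [9.8, 4.9], [10.1, 5.0], [9.9, 5.0]]
mu, sigma2 = fit_gaussian(train)

epsilon = 1e-3  # placeholder threshold, not a recommended value
typical = is_anomaly([10.0, 5.0], mu, sigma2, epsilon)   # close to the training data
outlier = is_anomaly([14.0, 2.0], mu, sigma2, epsilon)   # far from the training data
```

With many features the product of densities can underflow to zero; summing log-densities and comparing against log ε is a common way around that.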
Epsilon (ε) is the anomaly threshold. It is the cut-off value for p(x). If a data point's probability is lower than ε, it is considered anomalous.
- p(x) < ε → data is likely anomalous
- p(x) ≥ ε → data is likely normal

Use a labeled validation set to test different ε values. For each value, calculate the F1 score (the harmonic mean of precision and recall) and choose the ε that maximizes this score.
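One way to carry out this threshold search is sketched below. It assumes the densities p(x) for the validation examples have already been computed and that labels use 1 for anomaly and 0 for normal. The helper names, the validation values, and the candidate thresholds are all hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall), with 1 as the anomaly class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_val, y_val, candidates):
    """Return the candidate threshold with the highest F1 on the validation set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        preds = [1 if p < eps else 0 for p in p_val]
        score = f1_score(y_val, preds)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Hypothetical validation densities and labels (1 = anomaly, 0 = normal).
p_val = [0.30, 0.25, 0.002, 0.28, 0.0005, 0.04]
y_val = [0, 0, 1, 0, 1, 0]
candidates = [0.001, 0.01, 0.05, 0.1]

best_eps, best_f1 = select_epsilon(p_val, y_val, candidates)
```

In this toy data, a threshold that is too small misses one anomaly and a threshold that is too large flags a normal example, so the search settles on the value in between.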
Practical applications of anomaly detection span security, health, fraud, finance, and industry. The table below summarizes the main ideas:

| Topic | Description |
|---|---|
| Gaussian Distribution | Models feature probability using mean and variance |
| Epsilon Threshold | Used to classify data points as normal or anomalous |
| Use Cases | Applies to security, health, fraud, finance, and industry |
| Algorithm Steps | Estimate distribution → compute p(x) → choose ε → detect anomalies |