7. Anomaly Detection

Anomaly Detection involves identifying rare or unusual data points that deviate significantly from the norm. It is usually treated as an unsupervised learning task because labeled anomalies are rare or unavailable; when a few labeled examples do exist, they are reserved for tuning and evaluation rather than training. It is used in scenarios where catching unexpected behavior is critical, such as fraud prevention, fault detection, and cybersecurity.

7.1 Gaussian Distribution

📘 What is a Gaussian Distribution?

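The Gaussian Distribution (Normal Distribution) is a continuous probability distribution that is symmetric around the mean. Many real-world measurements approximate this bell curve, in part because quantities formed by summing many small independent effects tend toward a Gaussian (the Central Limit Theorem). Not every feature is Gaussian, so it is worth checking the shape of each feature before relying on this model.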

Formula:
p(x) = (1 / √(2πσ²)) * exp( - (x - μ)² / (2σ²) )

We assume each feature in the dataset follows a Gaussian distribution. This allows us to model the likelihood of each feature value individually.

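🧮 How to compute p(x):

For each feature j, estimate the mean μ_j and variance σ_j² from the training data, then evaluate the formula above at that feature's value. Treating the features as independent, p(x) is the product of the per-feature densities: p(x) = p(x_1; μ_1, σ_1²) × ... × p(x_n; μ_n, σ_n²).

Example: If the average server CPU temperature is 70°C with σ = 5°C, then a reading of 95°C lies (95 - 70) / 5 = 5 standard deviations above the mean. Such a value is highly unlikely under normal conditions, which indicates a potential anomaly.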
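To make the CPU temperature example concrete, the following minimal sketch evaluates the Gaussian formula at a typical reading and at the 95°C reading. The function name gaussian_pdf is an illustrative choice, not something defined elsewhere in these notes.

```python
import math


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)


# Values taken from the example: mean 70°C, standard deviation 5°C.
mu, sigma = 70.0, 5.0

p_typical = gaussian_pdf(70.0, mu, sigma)  # density at the mean, approximately 0.0798
p_hot = gaussian_pdf(95.0, mu, sigma)      # density at 95°C, on the order of 3e-7
z_hot = (95.0 - mu) / sigma                # 5 standard deviations above the mean

print(f"p(70) = {p_typical:.4f}")
print(f"p(95) = {p_hot:.2e}")
print(f"z-score of 95°C = {z_hot:.1f}")
```

The density at 95°C is several orders of magnitude smaller than at the mean, which is why a reasonably chosen threshold would flag it.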

Note: For datasets with multiple correlated features, use a multivariate Gaussian distribution instead of assuming independence between variables.
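As a rough illustration of the note above, the sketch below fits a multivariate Gaussian with NumPy and scores two points that are equally far from the mean in each feature taken alone, but only one of which respects the correlation between the features. The data is synthetic and the function names are hypothetical choices for this example.

```python
import numpy as np


def fit_multivariate_gaussian(X):
    """Estimate the mean vector and full covariance matrix from rows of X."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False, bias=True)  # maximum-likelihood covariance
    return mu, sigma


def multivariate_pdf(X, mu, sigma):
    """Multivariate Gaussian density for each row of X."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(sigma)
    # Squared Mahalanobis distance of every row: diff_i^T * inv * diff_i
    maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** n * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * maha)


# Synthetic, positively correlated training data (an assumption for illustration).
rng = np.random.default_rng(0)
X_train = rng.multivariate_normal(
    mean=[50.0, 60.0], cov=[[25.0, 20.0], [20.0, 25.0]], size=500
)
mu, sigma = fit_multivariate_gaussian(X_train)

# Both points are about 2 standard deviations from the mean in each feature,
# but the second one moves against the correlation.
points = np.array([[60.0, 70.0], [60.0, 50.0]])
p = multivariate_pdf(points, mu, sigma)
print(p)
```

A per-feature model that assumes independence would score these two points almost identically, whereas the multivariate model assigns a much lower density to the second point.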

7.2 Threshold (Epsilon)

🧮 What is Epsilon?

Epsilon (ε) is the anomaly threshold. It is the cut-off value for p(x). If a data point's density p(x) is lower than ε, it is considered anomalous.

📊 Choosing Epsilon:

Use a labeled validation set, containing normal examples and a few known anomalies, to test different ε values. For each value, calculate the F1 score (the harmonic mean of precision and recall) and choose the ε that maximizes it.

Trade-off:
- Lower ε → Fewer points are flagged: fewer false alarms, but real anomalies may be missed
- Higher ε → More points are flagged: more anomalies are caught, but false alarms increase
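One way to sketch the F1-based selection of ε is a simple scan over candidate thresholds. The validation densities and labels below are made-up values for illustration (label 1 marks a known anomaly), and the function names are hypothetical.

```python
import numpy as np


def f1_at_threshold(p_val, y_val, epsilon):
    """F1 score when every point with p(x) < epsilon is flagged as an anomaly."""
    preds = p_val < epsilon
    tp = np.sum(preds & (y_val == 1))   # anomalies correctly flagged
    fp = np.sum(preds & (y_val == 0))   # normal points wrongly flagged
    fn = np.sum(~preds & (y_val == 1))  # anomalies that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_epsilon(p_val, y_val, num_steps=1000):
    """Scan evenly spaced thresholds and keep the first one with the best F1."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), num_steps):
        f1 = f1_at_threshold(p_val, y_val, eps)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1


# Hypothetical validation set: densities p(x) and ground-truth labels.
p_val = np.array([0.09, 0.08, 0.075, 0.06, 0.05, 0.001, 0.0005])
y_val = np.array([0, 0, 0, 0, 0, 1, 1])

eps, f1 = select_epsilon(p_val, y_val)
print(f"best epsilon = {eps:.5f}, F1 = {f1:.2f}")
```

Because densities can span many orders of magnitude, scanning thresholds on a logarithmic scale may work better in practice; the linear scan here is only the simplest variant.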

7.3 Use Cases

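Some practical applications of anomaly detection include:
- Fraud prevention in finance
- Fault detection in industrial systems
- Cybersecurity
- Health monitoring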

🚀 Steps to Implement Anomaly Detection:

  1. Collect normal (non-anomalous) training data
  2. Estimate μ and σ² for each feature
  3. Compute p(x) for each example
  4. Select ε using a validation set
  5. Predict anomalies: if p(x) < ε → anomaly
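The steps above can be sketched end to end with NumPy. This is a minimal illustration under several assumptions: the training data is synthetic, the second feature is a hypothetical addition to the CPU temperature example, and ε is a fixed placeholder standing in for a value that would normally be tuned on a labeled validation set (step 4).

```python
import numpy as np


def estimate_gaussian(X):
    """Step 2: per-feature mean and variance."""
    return X.mean(axis=0), X.var(axis=0)


def density(X, mu, var):
    """Step 3: p(x) as the product of independent per-feature Gaussian densities."""
    per_feature = np.exp(-((X - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return per_feature.prod(axis=1)


# Step 1: normal training data (synthetic; columns are CPU temperature in °C
# and a hypothetical second metric).
rng = np.random.default_rng(42)
X_train = rng.normal(loc=[70.0, 40.0], scale=[5.0, 8.0], size=(1000, 2))

mu, var = estimate_gaussian(X_train)

# Step 4 (placeholder): in practice, choose epsilon by maximizing F1 on a
# labeled validation set. The value here is an illustrative assumption.
epsilon = 1e-5

# Step 5: flag new points whose density falls below epsilon.
X_new = np.array([
    [71.0, 42.0],  # close to typical behavior
    [95.0, 41.0],  # temperature about 5 standard deviations above the mean
])
p_new = density(X_new, mu, var)
is_anomaly = p_new < epsilon
print(p_new, is_anomaly)
```

With many features, multiplying densities can underflow; summing log-densities and comparing against log ε is a common alternative.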

Topic                   Description
---------------------   ------------------------------------------------------------------
Gaussian Distribution   Models feature probability using mean and variance
Epsilon Threshold       Used to classify data points as normal or anomalous
Use Cases               Applies to security, health, fraud, finance, and industry
Algorithm Steps         Estimate distribution → compute p(x) → choose ε → detect anomalies