7. Anomaly Detection

Anomaly Detection involves identifying rare or unusual data points that deviate significantly from the norm. It is usually treated as an unsupervised learning task because labeled anomalies are scarce or unavailable; when a small labeled set exists, it is reserved for tuning and evaluation rather than training. It is used in scenarios where catching unexpected behavior is critical, such as fraud prevention, fault detection, and cybersecurity.

7.1 Gaussian Distribution

📘 What is a Gaussian Distribution?

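The Gaussian Distribution (Normal Distribution) is a continuous probability distribution that is symmetric around the mean. Many real-world measurements are approximately bell-shaped, in part because the Central Limit Theorem says that sums or averages of many independent effects tend toward a normal distribution. Not every feature is Gaussian, so it is worth checking each feature's histogram before relying on this assumption.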

      Formula:

      p(x) = (1 / √(2πσ²)) * exp( - (x - μ)² / (2σ²) )

μ (mu): Mean (average value)
σ² (sigma squared): Variance (spread of data)
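
As a minimal sketch, the formula above can be written directly in Python. The function name gaussian_pdf and the sample inputs are illustrative choices, not part of the original notes.

```python
import math

def gaussian_pdf(x: float, mu: float, var: float) -> float:
    """Density of a univariate Gaussian with mean mu and variance var at x."""
    if var <= 0:
        raise ValueError("variance must be positive")
    coeff = 1.0 / math.sqrt(2.0 * math.pi * var)
    exponent = -((x - mu) ** 2) / (2.0 * var)
    return coeff * math.exp(exponent)

# Illustrative inputs: a standard normal (mu = 0, var = 1).
peak = gaussian_pdf(0.0, mu=0.0, var=1.0)  # density at the mean
tail = gaussian_pdf(3.0, mu=0.0, var=1.0)  # density three standard deviations away
print(peak, tail)
```

The density is highest at the mean and falls off quickly as x moves away from it, which is what makes a low p(x) a useful anomaly signal.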

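We assume each feature in the dataset follows a Gaussian distribution and, in the simplest model, that the features are independent of each other. This lets us model the density of each feature value individually and multiply the per-feature densities to obtain p(x) for the whole example.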

🧮 How to compute p(x):

1. Estimate μ and σ² for each feature using training data
2. Compute p(x) for each example as the product of its per-feature densities
3. If p(x) < ε, mark the example as an anomaly
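
The three steps above might look like the following NumPy sketch. The helper names (estimate_gaussian, density), the tiny training array, and the value of epsilon are all hypothetical and chosen only for illustration; features are assumed independent.

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and variance (maximum-likelihood estimate, dividing by m)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return mu, var

def density(X, mu, var):
    """p(x) as the product of per-feature Gaussian densities (independence assumed)."""
    per_feature = np.exp(-((X - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return per_feature.prod(axis=1)

# Hypothetical training data: rows are examples, columns are two features.
X_train = np.array([[70.0, 0.50], [72.0, 0.55], [68.0, 0.45], [71.0, 0.52], [69.0, 0.48]])
mu, var = estimate_gaussian(X_train)

# Two new examples: one typical, one far from the training data.
X_new = np.array([[70.0, 0.50], [95.0, 0.90]])
p = density(X_new, mu, var)

epsilon = 1e-3  # placeholder threshold, not a recommended value
is_anomaly = p < epsilon
print(p, is_anomaly)
```

Note that p(x) is a probability density rather than a probability, so values above 1 are possible when a feature has a small variance. Only the comparison against ε matters.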

      Example: If the average server CPU temperature is 70°C with σ = 5°C, then a reading of 95°C lies five standard deviations above the mean. Such a value is highly unlikely under normal conditions, indicating a potential anomaly.
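
To put a number on "highly unlikely", the short sketch below computes how many standard deviations the reading is from the mean and the two-sided Gaussian tail probability of a deviation at least that large. It assumes the temperature really is Gaussian with the stated mean and standard deviation; the variable names are illustrative.

```python
import math

mu, sigma = 70.0, 5.0  # mean and standard deviation from the example
x = 95.0               # observed reading

z = (x - mu) / sigma   # number of standard deviations from the mean

# Two-sided tail probability P(|X - mu| >= |x - mu|) for a Gaussian,
# expressed through the complementary error function.
tail_prob = math.erfc(abs(z) / math.sqrt(2.0))
print(f"z = {z:.1f}, tail probability = {tail_prob:.2e}")
```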
    

Note: For datasets with multiple correlated features, use a multivariate Gaussian distribution instead of assuming independence between variables.
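
A rough sketch of the multivariate density is shown below. The function name multivariate_gaussian_pdf and the example mean and covariance are assumptions made for illustration; in practice both parameters would be estimated from training data.

```python
import numpy as np

def multivariate_gaussian_pdf(X, mu, cov):
    """Density of N(mu, cov) evaluated at each row of X."""
    X = np.atleast_2d(X)
    n = mu.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance matrix must have a positive determinant")
    diff = X - mu
    # Solve cov @ y = diff.T instead of forming the inverse explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    mahalanobis_sq = np.sum(diff * solved, axis=1)
    log_norm = -0.5 * (n * np.log(2.0 * np.pi) + logdet)
    return np.exp(log_norm - 0.5 * mahalanobis_sq)

# Hypothetical parameters: two strongly correlated features.
mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9], [0.9, 1.0]])

# Both points are equally far from the mean in each feature taken separately.
along = multivariate_gaussian_pdf(np.array([1.5, 1.5]), mu, cov)[0]     # follows the correlation
against = multivariate_gaussian_pdf(np.array([1.5, -1.5]), mu, cov)[0]  # breaks the correlation
print(along, against)
```

A per-feature model would give these two points the same density because each coordinate is equally unusual on its own. The multivariate model assigns a much lower density to the point that breaks the correlation.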

7.2 Threshold (Epsilon)

🧮 What is Epsilon?

Epsilon (ε) is the anomaly threshold. It is the cut-off value for p(x). If a data point's probability is lower than ε, it is considered anomalous.

If p(x) < ε → data is likely anomalous
If p(x) ≥ ε → data is likely normal

📊 Choosing Epsilon:

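Use a labeled validation set that contains a few known anomalies to test different ε values. For each value, calculate the F1 score (the harmonic mean of precision and recall) and choose the ε that maximizes it. Plain accuracy is a poor guide here because anomalies are rare, so a model that flags nothing would still score highly.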

      Trade-off:

      - Lower ε → Flags fewer points, so fewer false alarms (but may miss real anomalies)

      - Higher ε → Flags more points, so more anomalies are caught (but may cause more false alarms)
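
One way to sketch the ε search is to scan candidate thresholds between the smallest and largest validation density and keep the one with the best F1. The function name select_epsilon, the number of candidates, and the validation values below are hypothetical.

```python
import numpy as np

def select_epsilon(p_val, y_val, num_candidates=1000):
    """Return the (epsilon, F1) pair with the highest F1 on a labeled validation set.

    p_val: densities p(x) for the validation examples.
    y_val: ground-truth labels, 1 = anomaly, 0 = normal.
    """
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), num_candidates):
        pred = p_val < eps                   # True means "flagged as anomaly"
        tp = np.sum(pred & (y_val == 1))     # anomalies correctly flagged
        fp = np.sum(pred & (y_val == 0))     # normal points wrongly flagged
        fn = np.sum(~pred & (y_val == 1))    # anomalies missed
        if tp == 0:
            continue  # no true positives, so F1 is 0 or undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Hypothetical validation densities and labels (two known anomalies).
p_val = np.array([0.30, 0.25, 0.28, 0.001, 0.27, 0.002, 0.22])
y_val = np.array([0, 0, 0, 1, 0, 1, 0])
eps, f1 = select_epsilon(p_val, y_val)
print(eps, f1)
```

Linearly spaced candidates are a simplification. When densities span many orders of magnitude, scanning the observed density values or working with log p(x) may resolve the threshold better.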

7.3 Use Cases

Some practical applications of anomaly detection include:

Fraud Detection: Catch abnormal transactions in banking or e-commerce
Cybersecurity: Identify unusual login locations, traffic spikes, or access patterns
Manufacturing: Detect machine breakdowns or sensor anomalies
Healthcare: Identify unusual heart rate, temperature, or blood pressure patterns
Finance: Spot irregular market behavior or trading patterns

🚀 Steps to Implement Anomaly Detection:

1. Collect normal (non-anomalous) training data
2. Estimate μ and σ² for each feature
3. Compute p(x) for each example
4. Select ε using a validation set
5. Predict anomalies: if p(x) < ε → anomaly
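
The steps above can be tied together in one short end-to-end sketch. Everything here is an assumption for illustration: the data is synthetic (two independent Gaussian features standing in for temperature and load), the anomalies are injected by hand, and ε candidates are taken from the observed validation densities.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the synthetic data is reproducible

# 1. Collect normal training data (synthetic stand-in with two features).
X_train = rng.normal(loc=[70.0, 0.5], scale=[5.0, 0.05], size=(500, 2))

# 2. Estimate mu and sigma^2 for each feature.
mu, var = X_train.mean(axis=0), X_train.var(axis=0)

# 3. Compute p(x) as a product of per-feature Gaussian densities.
def p(X):
    return np.prod(np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var), axis=1)

# 4. Select epsilon on a labeled validation set (100 normal, 5 injected anomalies).
X_val = np.vstack([
    rng.normal(loc=[70.0, 0.5], scale=[5.0, 0.05], size=(100, 2)),
    rng.normal(loc=[110.0, 1.0], scale=[2.0, 0.02], size=(5, 2)),
])
y_val = np.concatenate([np.zeros(100, dtype=bool), np.ones(5, dtype=bool)])
p_val = p(X_val)

def f1(eps):
    pred = p_val < eps
    tp = np.sum(pred & y_val)
    fp = np.sum(pred & ~y_val)
    fn = np.sum(~pred & y_val)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

epsilon = max(np.unique(p_val), key=f1)  # candidates: the observed validation densities

# 5. Predict: flag new points whose density falls below epsilon.
X_new = np.array([[70.0, 0.5], [120.0, 1.5]])
flags = p(X_new) < epsilon
print(epsilon, flags)
```

Real data rarely separates this cleanly, so the selected ε and the resulting F1 should be confirmed on a separate test set before the detector is trusted.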

Topic	Description
Gaussian Distribution	Models feature probability using mean and variance
Epsilon Threshold	Used to classify data points as normal or anomalous
Use Cases	Applies to security, health, fraud, finance, and industry
Algorithm Steps	Estimate distribution → compute p(x) → choose ε → detect anomalies