6. Unsupervised Learning
6.1 Clustering
Definition: Clustering is an unsupervised learning technique that groups similar data points into clusters without labeled data.
Explanation: It helps discover hidden patterns or groupings in data and is widely used in customer segmentation, social network analysis, and image compression.
- Grouping news articles by topic.
- Segmenting customers by purchasing behavior.
- No labeled data required.
- Output is groups/clusters with high intra-group similarity.
- Popular algorithms: K-Means, DBSCAN, and hierarchical clustering.
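As a minimal sketch of clustering in practice, the snippet below runs DBSCAN (one of the algorithms listed above) on a small synthetic dataset. It assumes scikit-learn is installed; the dataset and the eps/min_samples values are illustrative choices only.

```python
# A minimal sketch of clustering unlabeled data with DBSCAN, assuming scikit-learn.
# Unlike K-Means, DBSCAN needs no preset cluster count; points in sparse regions
# are marked as noise (label -1).
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)  # unlabeled 2-D points

db = DBSCAN(eps=0.2, min_samples=5)   # eps: neighborhood radius, min_samples: density threshold
labels = db.fit_predict(X)            # one cluster index per point, -1 marks noise

print("clusters found:", len(set(labels) - {-1}))
```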
6.1.1 K-Means
Definition: K-Means is a centroid-based clustering algorithm that partitions data into K clusters based on proximity to cluster centers.
Explanation: The algorithm minimizes intra-cluster distances by assigning each point to the nearest centroid and then recalculating centroids until convergence.
- Customer segmentation in marketing.
- Image color quantization.
- Requires K to be predefined.
- Sensitive to initial centroids.
- Fast and scalable.
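To make the assign-and-update loop from the explanation concrete, here is a from-scratch sketch using only NumPy. The random initialization, fixed iteration cap, and lack of empty-cluster handling are simplifications for illustration.

```python
# A from-scratch sketch of the K-Means loop: assign points to the nearest centroid,
# recompute centroids, repeat until the centroids stop moving. Assumes only NumPy.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        # (Empty clusters are not handled in this simplified sketch.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```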
6.1.1.1 Elbow Method
Definition: The Elbow Method is a technique for choosing the number of clusters (K) by plotting within-cluster variance against the number of clusters; the "elbow", the point where adding more clusters yields only diminishing reductions in variance, indicates a good K.
- Used in K-Means to determine the ideal K in customer segmentation.
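A sketch of the Elbow Method using scikit-learn's KMeans and matplotlib (both assumed to be installed); the synthetic dataset and the range of K values are illustrative.

```python
# Elbow Method sketch: inertia_ is the within-cluster sum of squared distances.
# Plotting it against K, the bend ("elbow") suggests a reasonable number of clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("within-cluster sum of squares (inertia)")
plt.show()   # the bend around K = 4 suggests the chosen K for this data
```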
6.1.1.2 K-Means++
Definition: K-Means++ is an enhanced version of K-Means that improves initial centroid selection by spreading the starting centroids apart, avoiding the poor clustering results that purely random initialization can produce.
- Used in clustering large-scale location data with better consistency.
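The core of K-Means++ is its seeding rule: after a uniformly random first centre, each subsequent centre is sampled with probability proportional to its squared distance from the nearest centre chosen so far. Below is a NumPy-only sketch of that rule; the function name and parameters are illustrative.

```python
# K-Means++ seeding sketch: far-away points are more likely to become new centres,
# which spreads the initial centroids apart. Assumes only NumPy.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]            # first centre: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centre.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                      # far points get higher probability
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

In scikit-learn this seeding is the default behaviour of KMeans (init="k-means++").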
6.1.2 Hierarchical Clustering
Definition: Hierarchical clustering builds a tree of clusters by either merging or splitting them recursively.
- Taxonomy of species.
- Document categorization.
6.1.2.1 Agglomerative Clustering
Definition: A bottom-up approach where each data point starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Grouping species by genetic similarity in biology.
- Merging social network users based on mutual friends.
- Does not require specifying the number of clusters (K) up front; K can be chosen afterwards by cutting the dendrogram.
- Produces a dendrogram to visualize cluster merging.
- Slower than K-Means on large datasets due to repeated distance calculations.
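A minimal agglomerative (bottom-up) clustering sketch, assuming scikit-learn; the Ward linkage and the blob dataset are illustrative choices.

```python
# Agglomerative clustering sketch: clusters are merged bottom-up.
# Ward linkage merges the pair of clusters that least increases total variance.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])

# Alternatively, pass n_clusters=None with distance_threshold=... to cut the
# merge tree at a height instead of fixing K in advance.
```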
6.1.2.2 Divisive Clustering
Definition: A top-down approach that starts with all data in one cluster and recursively splits it into smaller clusters.
- Separating academic departments from a university-wide dataset.
- Breaking down a customer base from general to specific segments.
- Works in the opposite direction to agglomerative clustering: it starts with one large cluster and splits downward.
- Less commonly used due to higher computational complexity.
- Useful when the natural grouping is known to start large.
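There is no single canonical divisive routine, so the sketch below illustrates the top-down idea in a simplified, bisecting style: everything starts in one cluster and the largest cluster is repeatedly split in two with K-Means until the desired count is reached. It assumes scikit-learn and NumPy; the split choice and stopping rule are assumptions made for illustration.

```python
# Simplified divisive (top-down) clustering sketch: repeatedly bisect the largest
# cluster with 2-means until k clusters remain.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def divisive(X, k, seed=0):
    clusters = [np.arange(len(X))]                     # start: one cluster with all points
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)                          # take the largest cluster ...
        halves = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        clusters += [idx[halves == 0], idx[halves == 1]]   # ... and split it in two
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print(np.bincount(divisive(X, k=4)))                   # points per final cluster
```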
6.1.2.3 Dendrograms
Definition: A dendrogram is a tree-like diagram that records the sequences of merges or splits in hierarchical clustering.
- Visualizing how different types of wines group based on taste features.
- Showing user communities in a social media clustering analysis.
- Shows the entire clustering process and merging steps.
- Cutting the dendrogram at a specific height reveals K clusters.
- Helps evaluate natural grouping and hierarchy in the data.
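A sketch of building and cutting a dendrogram with SciPy's hierarchical-clustering utilities (scikit-learn and matplotlib are also assumed, for the data and the plot); the cut height of 10 is an arbitrary illustrative value.

```python
# Dendrogram sketch: linkage records every merge and its distance,
# dendrogram draws the tree, and fcluster cuts it at a chosen height.
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

Z = linkage(X, method="ward")        # full record of the merge sequence
dendrogram(Z)                        # tree-like diagram of the merges
plt.ylabel("merge distance")
plt.show()

labels = fcluster(Z, t=10, criterion="distance")   # cut the tree at height 10
print(len(set(labels)), "clusters at this cut height")
```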
6.2 Association
Definition: Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large datasets.
Explanation: It finds frequent patterns, associations, or correlations in data and is often used for market basket analysis to identify products that are frequently bought together.
- "People who buy bread often buy butter."
- Amazon recommendation rules.
- Uses support, confidence, and lift to evaluate rules.
- Apriori and FP-Growth are common algorithms.
- Highly interpretable rules.
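As a small worked example of the three metrics above, the plain-Python sketch below computes support, confidence, and lift for the rule "bread → butter" on a toy set of transactions (the transactions themselves are made-up illustrative data).

```python
# Support, confidence, and lift for the rule "bread -> butter" on toy transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

sup_rule = support({"bread", "butter"})       # P(bread and butter)
confidence = sup_rule / support({"bread"})    # P(butter | bread)
lift = confidence / support({"butter"})       # > 1 indicates a positive association

print(f"support={sup_rule:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

For realistic datasets, frequent itemsets are mined with algorithms such as Apriori or FP-Growth (named above) rather than by scanning every transaction for every candidate rule.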