DATA 140: Lecture 24
Clustering and the k-Means Method
T/Th 12:30 PM to 1:45 PM, Greenlaw Room 101
Youzuo Lin, Associate Professor
School of Data Science and Society, UNC at Chapel Hill
Announcement: Remaining Weeks Schedule
- Thur (Nov 21): Regular Lecture (Clustering)
- Tue (Nov 26): Regular Lecture (Zoom); Titus will set up the meeting.
- Thur (Nov 28): No Lecture (Thanksgiving)
- Tue (Dec 3): Final Review (including a sample exam)
- Thur (Dec 5): No Lecture (Reading Day reserved)
- Mon (Dec 9): In-person FINAL Exam (4 PM to 7 PM at GL 101)
Announcement: Revised Grade Policy
- Participation: 10% → 15% (send us emails if you volunteered and shared in class)
- Assignments: 15%
- Quizzes: 20%
- Midterm Exam 1: 15% → 20%
- Midterm Exam 2: 15%
- Final Exam: 25% → 30%
Review: Classification
Definition: Classification is about modeling the relationship between input features X and a categorical output Y: Y = f(X).
Input Data (X) → Classifier f(·) → Output Label (Y)
Review: Steps in Building a Classification Model
1. Data Collection: gather labeled data for the task.
2. Feature Selection: choose relevant features.
3. Model Design and Training: design the classification model and use training data to teach it to recognize patterns.
4. Model Testing: evaluate the model's performance on unseen (testing) data.
5. Making Predictions: use the trained model to predict the label of new, unlabeled input data.
Review: Decision Tree
Decision Tree Classifier: a tree-like structure where nodes split data based on features.
- Root (stump): the start of the decision tree.
- Node: a condition with multiple outcomes.
- Leaf: the final decision.
Review: Gini Impurity
Gini impurity measures how "mixed" or "impure" a group of data is.
- A pure group has all data points in the same category.
- A mixed group has multiple categories.
Understanding the Gini value: the smaller the value, the better the split; 0 means a pure group.
Formula: G = 1 − Σ_{i=1}^{C} p_i^2, where p_i is the proportion of items in category i and C is the number of categories.
Goal in decision trees: minimize impurity at each split to create more homogeneous groups.
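The Gini formula above translates directly into a few lines of Python; the function name and the count-based interface here are my own choices for illustration:

```python
def gini_impurity(counts):
    """G = 1 - sum(p_i^2), where p_i is the share of category i."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

# A pure group scores 0; a 50/50 two-category split scores 0.5.
print(gini_impurity([10, 0]))  # 0.0
print(gini_impurity([5, 5]))   # 0.5
```

Lower values mean a better split, matching the goal of minimizing impurity at each node.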
Review: Entropy
Entropy measures the level of disorder or uncertainty in a dataset.
- High entropy = high uncertainty (data is very mixed).
- Low entropy = low uncertainty (data is more pure).
Understanding the entropy value: the smaller the value, the better the split; 0 means a pure group.
Formula: H = − Σ_{i=1}^{C} p_i log2(p_i), where p_i is the proportion of items in category i and C is the number of categories.
Goal in decision trees: reduce entropy at each split to create groups that are more certain and homogeneous.
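The entropy formula can be sketched the same way (again, the function name and interface are illustrative):

```python
import math

def entropy(counts):
    """H = -sum(p_i * log2(p_i)) over categories with p_i > 0."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

# A pure group has entropy 0; an even two-way split has entropy 1 bit.
print(entropy([10, 0]))  # 0.0
print(entropy([5, 5]))   # 1.0
```

Categories with zero count are skipped, since p log2(p) tends to 0 as p approaches 0.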
CONTENTS
01. Introduction to Clustering
02. K-Means Method
03. Best Practices for ML Prompting (Scott)
04. Group Activities
Types of Machine Learning: Taxonomy
[Figure: taxonomy of machine learning types, including anomaly detection]
What is Clustering: Customer Segmentation
Goal: to create 3 marketing strategies.
Customers: age (in years) and program engagement (in weeks).
Question: how do we group "similar" customers?
What is Clustering: Customer Segmentation
Question: how do we group "similar" customers using a computer?
[Figure: scatterplot of customers, Age (yrs) vs. Engagement (weeks)]
What is Clustering, and Why Use It?
Definition: Clustering is the task of dividing a dataset into groups, or "clusters," where data points in the same cluster are more similar to each other than to those in other clusters.
Uses of clustering:
1. Exploratory data analysis: to uncover hidden structure in data.
2. Dimensionality reduction: to preprocess data and reduce complexity.
3. Anomaly detection: to identify data points that don't belong to any cluster, marking them as anomalies.
Types of Clustering Methods
Partitioning clustering methods: divide data into non-overlapping groups, where each data point belongs to exactly one cluster.
- Strengths: fast and easy to implement.
- Weaknesses: requires the number of clusters (k) to be specified beforehand; struggles with irregular or non-spherical clusters.
Other clustering methods: hierarchical clustering, density-based clustering, model-based clustering.
How k-Means Works: Lin's Noodle Haven Grand Opening
My need: to determine the locations of three new restaurants to serve the community best. What would you suggest?
[Figure: map of customers with candidate locations Lin 1, Lin 2, Lin 3]
How k-Means Works: Lin's Noodle Haven Grand Opening
AI thought process: start with three random locations.
How k-Means Works: Lin's Noodle Haven Grand Opening
AI Logic 1: people go to the closest Noodle Haven.
AI Logic 2: relocate each restaurant to the center of its local community.
The two steps then repeat: reassign customers to the closest Noodle Haven, relocate to the center again, and so on until the locations settle.
[Figures: successive iterations of assigning customers to Lin 1, Lin 2, Lin 3 and recentering the locations]
Clustering Concepts: Finding Patterns with k-Means
- Cluster (C): a group of data points that are more similar to each other than to points in other groups.
- Centroid (μ): the center of a cluster, representing the "average" position of all points in the group.
How k-Means Works: Number of Clusters
Question: how many clusters?
[Figure: the same data partitioned two ways, e.g., two clusters vs. three clusters]
How k-Means Works: Elbow Method
Definition: the elbow method is a graphical approach to finding the ideal number of clusters; here, compactness is measured by the maximum distance between two points within the same cluster.
Four steps:
1. Perform clustering: apply k-means for k = 1 to k_max (e.g., k_max = 6). Here k_max depends on the complexity of the data.
2. Measure compactness: for each clustering k, calculate the maximum pairwise distance within each cluster.
3. Visualize.
4. Find the elbow.
How k-Means Works: Elbow Method
Steps 1 & 2: perform clustering for each k and calculate the longest within-cluster distance.
[Figures: the same data clustered with k = 1, 2, 3, 4, 5, and 6 clusters]
How k-Means Works: Elbow Method
Steps 3 & 4: visualize the distance against the number of clusters and find the elbow.
[Figure: maximum within-cluster distance vs. number of clusters (1 to 6), with the elbow marked]
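The four steps can be sketched end to end in Python with NumPy. Everything here (the minimal kmeans helper, the synthetic blob data, the seed) is illustrative, and the compactness measure is the slides' maximum within-cluster pairwise distance rather than the more common within-cluster sum of squares:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=100):
    # Step 1: pick k initial centroids at random from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # Step 4: stop at convergence.
            break
        centroids = new
    return labels, centroids

def max_intracluster_distance(X, labels):
    # The slides' compactness measure: the longest pairwise distance
    # found inside any single cluster.
    best = 0.0
    for j in np.unique(labels):
        pts = X[labels == j]
        if len(pts) < 2:
            continue
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        best = max(best, d.max())
    return best

# Three well-separated blobs; the elbow should appear near k = 3.
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])
for k in range(1, 7):
    labels, _ = kmeans(X, k)
    print(k, round(max_intracluster_distance(X, labels), 2))
```

Plotting the printed (k, distance) pairs reproduces the elbow plot: the distance drops sharply until the true number of clusters and flattens afterwards.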
Understanding the Elbow Method
The elbow marks the balance between simplicity and performance:
- Too many clusters may overfit the data, splitting meaningful groups unnecessarily.
- Too few clusters may underfit the data, missing important patterns.
Question: what is the most overfitted scenario?
How k-Means Works: The Math Behind It (Assign Points)
For each data point x_i, assign it to the cluster with the nearest centroid μ_k:
C_k = { x_i : ||x_i − μ_k|| ≤ ||x_i − μ_j||, ∀ j }
Assign each data point to the cluster of the nearest centroid by calculating and comparing distances.
How k-Means Works: The Math Behind It (Recompute Centroids)
Update each cluster's centroid by computing the mean of all points assigned to it:
μ_k = (1 / |C_k|) Σ_{x_i ∈ C_k} x_i
Calculate the new centroid as the average position of all points in the cluster.
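The two formulas map directly onto a few lines of NumPy; the toy points and starting centroids below are made up for illustration:

```python
import numpy as np

def assign(X, centroids):
    # C_k = { x_i : ||x_i - mu_k|| <= ||x_i - mu_j|| for all j }
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)  # index of the nearest centroid per point

def update(X, labels, k):
    # mu_k = (1 / |C_k|) * sum of all points assigned to cluster k
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.8, 8.2]])
mu = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = assign(X, mu)
print(labels)                # [0 0 1 1]
print(update(X, labels, 2))  # cluster means: [[1.1 0.9], [7.9 8.1]]
```

k-means alternates these two calls until the labels (and hence the centroids) stop changing.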
k-Means Animation
[Animation: k-means iterations on sample data]
Summary: Key Steps in k-Means Clustering
Four steps:
1. Initialization: randomly choose k initial centroids from the dataset.
2. Assign points to clusters: assign each data point to the nearest centroid based on distance.
3. Update centroids: calculate the new centroids as the average position of all points in each cluster.
4. Repeat until convergence: continue assigning points and updating centroids until the centroids stabilize or a set iteration limit is reached.
Group Activity: Plan the Perfect Meal at Lin's Noodle Haven
Group work: you are a chef at Lin's Noodle Haven, a popular Chinese restaurant. This weekend, you need to create a special menu for a themed dining event. Use k-means clustering to group dishes and craft a balanced, delicious, and budget-friendly meal plan.
Objective: use k-means clustering to group Chinese dishes based on features like calories, spiciness, prep time, and price, then design a themed meal plan balancing dietary preferences, budget, and other constraints.
- Step 1 -- Visualize the Dishes: plot the dishes on a 2D scatterplot using two features, such as "Calories" (x-axis) and "Spiciness" (y-axis).
- Step 2 -- Perform Clustering: use k-means clustering to group dishes into categories (e.g., k = 3 for "Spicy Favorites," "Savory Classics," and "Quick Snacks").
- Step 3 -- Create Your Meal Plan: select one or more dishes from each cluster to design a balanced meal. The meal should fit a theme (e.g., "Sichuan Spice Night," "Quick Comfort Food").
- Step 4 -- Present Your Meal Plan: groups share their meal plans and explain how they chose dishes to fit the theme, and how clustering helped them categorize dishes and make decisions.
Group Activity: Plan the Perfect Chinese Meal with k-Means!

| Dish                | Calories | Spiciness (1-10) | Prep Time (mins) | Price ($) |
|---------------------|----------|------------------|------------------|-----------|
| Kung Pao Chicken    | 450      | 8                | 30               | 12.99     |
| Mapo Tofu           | 350      | 9                | 25               | 10.99     |
| Hot and Sour Soup   | 200      | 6                | 20               | 6.99      |
| Peking Duck         | 600      | 2                | 45               | 24.99     |
| Spring Rolls        | 150      | 3                | 15               | 5.99      |
| Fried Rice          | 500      | 1                | 25               | 8.99      |
| Sichuan Dumplings   | 300      | 7                | 20               | 9.99      |
| Egg Tart            | 250      | 0                | 10               | 3.99      |
| Sweet and Sour Pork | 550      | 4                | 30               | 11.99     |
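As a head start on Steps 1 and 2 of the activity, here is one way to cluster the table's dishes on (calories, spiciness) with a minimal k-means. The min-max feature scaling and the hand-picked starting centroids are my own choices, so your clusters may come out differently:

```python
import numpy as np

# Dish features from the table: (calories, spiciness 1-10).
dishes = {
    "Kung Pao Chicken":    (450, 8),
    "Mapo Tofu":           (350, 9),
    "Hot and Sour Soup":   (200, 6),
    "Peking Duck":         (600, 2),
    "Spring Rolls":        (150, 3),
    "Fried Rice":          (500, 1),
    "Sichuan Dumplings":   (300, 7),
    "Egg Tart":            (250, 0),
    "Sweet and Sour Pork": (550, 4),
}
names = list(dishes)
X = np.array([dishes[n] for n in names], dtype=float)

# Scale each feature to [0, 1] so calories do not dominate spiciness.
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Fixed starting centroids (hand-picked for reproducibility).
mu = X[[0, 3, 4]]  # Kung Pao Chicken, Peking Duck, Spring Rolls
for _ in range(20):
    # Assign each dish to its nearest centroid, then recenter.
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    mu = np.array([X[labels == j].mean(axis=0) for j in range(3)])

for j in range(3):
    print(f"Cluster {j}:", [n for n, l in zip(names, labels) if l == j])
```

With this setup the spicy dishes, the hearty savory dishes, and the light quick dishes end up in separate clusters, which lines up with the suggested "Spicy Favorites" / "Savory Classics" / "Quick Snacks" labels from Step 2.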