DATA 140: Lecture 24
Clustering and the k-Means Method
T/Th 12:30 PM to 1:45 PM, Greenlaw Room 101
Youzuo Lin, Associate Professor
School of Data Science and Society, UNC at Chapel Hill
Announcement: Remaining Weeks Schedule
- Thur (Nov 21): Regular Lecture (Clustering)
- Tue (Nov 26): Regular Lecture (Zoom); Titus will set up the meeting.
- Thur (Nov 28): No Lecture (Thanksgiving)
- Tue (Dec 3): Final Review (including a sample exam)
- Thur (Dec 5): No Lecture (Reading Day reserved)
- Mon (Dec 9): In-person FINAL Exam (4 PM to 7 PM at GL 101)
Announcement: Revised Grade Policy
- Participation: 10% → 15% (send us emails if you volunteered and shared in class)
- Assignments: 15%
- Quizzes: 20%
- Midterm Exam 1: 15% → 20%
- Midterm Exam 2: 15%
- Final Exam: 25% → 30%
Review: Classification
Definition: Classification is about modeling the relationship between input features X and a categorical output Y: Y = f(X).
Input Data (X) → Classifier f(·) → Output Label (Y)
Review: Steps in Building a Classification Model
1. Data Collection: gather labeled data for the task.
2. Feature Selection: choose relevant features.
3. Model Design and Training: design the classification model and use training data to teach it to recognize patterns.
4. Model Testing: evaluate the model's performance on unseen (testing) data.
5. Making Predictions: use the trained model to predict the label of new, unlabeled input data.
Review: Decision Tree
Decision Tree Classifier: a tree-like structure where nodes split data based on features.
- Root (stump): the start of the decision tree.
- Node: a condition with multiple outcomes.
- Leaf: the final decision.
Review: Gini Impurity
Gini impurity measures how "mixed" or "impure" a group of data is.
- A pure group has all data points in the same category.
- A mixed group has multiple categories.
Understanding the Gini value: the smaller the value, the better the split; 0 means a pure group.
Formula: G = 1 − Σ_{i=1}^{C} p_i^2, where p_i is the proportion of items in category i and C is the number of categories.
Goal in decision trees: minimize impurity at each split to create more homogeneous groups.
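The Gini formula above translates directly into a few lines of Python; the function name and the count-based interface here are my own choices for illustration:

```python
def gini_impurity(counts):
    """G = 1 - sum(p_i^2), where p_i is the share of category i."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

# A pure group scores 0; a 50/50 two-category split scores 0.5.
print(gini_impurity([10, 0]))  # 0.0
print(gini_impurity([5, 5]))   # 0.5
```

Lower values mean a better split, matching the goal of minimizing impurity at each node.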
Review: Entropy
Entropy measures the level of disorder or uncertainty in a dataset.
- High entropy = high uncertainty (data is very mixed).
- Low entropy = low uncertainty (data is more pure).
Understanding the entropy value: the smaller the value, the better the split; 0 means a pure group.
Formula: H = − Σ_{i=1}^{C} p_i log2(p_i), where p_i is the proportion of items in category i and C is the number of categories.
Goal in decision trees: reduce entropy at each split to create groups that are more certain and homogeneous.
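The entropy formula can be sketched the same way (again, the function name and interface are illustrative):

```python
import math

def entropy(counts):
    """H = -sum(p_i * log2(p_i)) over categories with p_i > 0."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

# A pure group has entropy 0; an even two-way split has entropy 1 bit.
print(entropy([10, 0]))  # 0.0
print(entropy([5, 5]))   # 1.0
```

Categories with zero count are skipped, since p log2(p) tends to 0 as p approaches 0.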
CONTENTS
01. Introduction to Clustering
02. K-Means Method
03. Best Practices for ML Prompting (Scott)
04. Group Activities
Types of Machine Learning: Taxonomy
[Figure: taxonomy of machine learning types, including anomaly detection]
What is Clustering: Customer Segmentation
Goal: to create 3 marketing strategies.
Customers: age (in years) and program engagement (in weeks).
Question: how do we group "similar" customers?
What is Clustering: Customer Segmentation
Question: how do we group "similar" customers using a computer?
[Figure: scatterplot of customers, Age (yrs) vs. Engagement (weeks)]
What is Clustering, and Why Use It?
Definition: Clustering is the task of dividing a dataset into groups, or "clusters," where data points in the same cluster are more similar to each other than to those in other clusters.
Uses of clustering:
1. Exploratory data analysis: to uncover hidden structure in data.
2. Dimensionality reduction: to preprocess data and reduce complexity.
3. Anomaly detection: to identify data points that don't belong to any cluster, marking them as anomalies.
Types of Clustering Methods
Partitioning clustering methods: divide data into non-overlapping groups, where each data point belongs to exactly one cluster.
- Strengths: fast and easy to implement.
- Weaknesses: requires the number of clusters (k) to be specified beforehand; struggles with irregular or non-spherical clusters.
Other clustering methods: hierarchical clustering, density-based clustering, model-based clustering.
How k-Means Works: Lin's Noodle Haven Grand Opening
My need: to determine the locations of three new restaurants to serve the community best. What would you suggest?
[Figure: map of customers with candidate locations Lin 1, Lin 2, Lin 3]
How k-Means Works: Lin's Noodle Haven Grand Opening
AI thought process: start with three random locations.
How k-Means Works: Lin's Noodle Haven Grand Opening
AI Logic 1: people go to the closest Noodle Haven.
AI Logic 2: relocate each restaurant to the center of its local community.
The two steps then repeat: reassign customers to the closest Noodle Haven, relocate to the center again, and so on until the locations settle.
[Figures: successive iterations of assigning customers to Lin 1, Lin 2, Lin 3 and recentering the locations]
Clustering Concepts: Finding Patterns with k-Means
- Cluster (C): a group of data points that are more similar to each other than to points in other groups.
- Centroid (μ): the center of a cluster, representing the "average" position of all points in the group.
How k-Means Works: Number of Clusters
Question: how many clusters?
[Figure: the same data partitioned two ways, e.g., two clusters vs. three clusters]
How k-Means Works: Elbow Method
Definition: the elbow method is a graphical approach to finding the ideal number of clusters; here, compactness is measured by the maximum distance between two points within the same cluster.
Four steps:
1. Perform clustering: apply k-means for k = 1 to k_max (e.g., k_max = 6). Here k_max depends on the complexity of the data.
2. Measure compactness: for each clustering k, calculate the maximum pairwise distance within each cluster.
3. Visualize.
4. Find the elbow.
How k-Means Works: Elbow Method
Steps 1 & 2: perform clustering for each k and calculate the longest within-cluster distance.
[Figures: the same data clustered with k = 1, 2, 3, 4, 5, and 6 clusters]
How k-Means Works: Elbow Method
Steps 3 & 4: visualize the distance against the number of clusters and find the elbow.
[Figure: maximum within-cluster distance vs. number of clusters (1 to 6), with the elbow marked]
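The four steps can be sketched end to end in Python with NumPy. Everything here (the minimal kmeans helper, the synthetic blob data, the seed) is illustrative, and the compactness measure is the slides' maximum within-cluster pairwise distance rather than the more common within-cluster sum of squares:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=100):
    # Step 1: pick k initial centroids at random from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # Step 4: stop at convergence.
            break
        centroids = new
    return labels, centroids

def max_intracluster_distance(X, labels):
    # The slides' compactness measure: the longest pairwise distance
    # found inside any single cluster.
    best = 0.0
    for j in np.unique(labels):
        pts = X[labels == j]
        if len(pts) < 2:
            continue
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        best = max(best, d.max())
    return best

# Three well-separated blobs; the elbow should appear near k = 3.
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])
for k in range(1, 7):
    labels, _ = kmeans(X, k)
    print(k, round(max_intracluster_distance(X, labels), 2))
```

Plotting the printed (k, distance) pairs reproduces the elbow plot: the distance drops sharply until the true number of clusters and flattens afterwards.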
Understanding the Elbow Method
The elbow marks the balance between simplicity and performance:
- Too many clusters may overfit the data, splitting meaningful groups unnecessarily.
- Too few clusters may underfit the data, missing important patterns.
Question: what is the most overfitted scenario?
How k-Means Works: The Math Behind It (Assign Points)
For each data point x_i, assign it to the cluster with the nearest centroid μ_k:
C_k = { x_i : ||x_i − μ_k|| ≤ ||x_i − μ_j||, ∀ j }
Assign each data point to the cluster of the nearest centroid by calculating and comparing distances.
How k-Means Works: The Math Behind It (Recompute Centroids)
Update each cluster's centroid by computing the mean of all points assigned to it:
μ_k = (1 / |C_k|) Σ_{x_i ∈ C_k} x_i
Calculate the new centroid as the average position of all points in the cluster.
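The two formulas map directly onto a few lines of NumPy; the toy points and starting centroids below are made up for illustration:

```python
import numpy as np

def assign(X, centroids):
    # C_k = { x_i : ||x_i - mu_k|| <= ||x_i - mu_j|| for all j }
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)  # index of the nearest centroid per point

def update(X, labels, k):
    # mu_k = (1 / |C_k|) * sum of all points assigned to cluster k
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.8, 8.2]])
mu = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = assign(X, mu)
print(labels)                # [0 0 1 1]
print(update(X, labels, 2))  # cluster means: [[1.1 0.9], [7.9 8.1]]
```

k-means alternates these two calls until the labels (and hence the centroids) stop changing.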
k-Means Animation
[Animation: k-means iterations on sample data]
Summary: Key Steps in k-Means Clustering
Four steps:
1. Initialization: randomly choose k initial centroids from the dataset.
2. Assign points to clusters: assign each data point to the nearest centroid based on distance.
3. Update centroids: calculate the new centroids as the average position of all points in each cluster.
4. Repeat until convergence: continue assigning points and updating centroids until the centroids stabilize or a set iteration limit is reached.
Group Activity: Plan the Perfect Meal at Lin's Noodle Haven
Group work: you are a chef at Lin's Noodle Haven, a popular Chinese restaurant. This weekend, you need to create a special menu for a themed dining event. Use k-means clustering to group dishes and craft a balanced, delicious, and budget-friendly meal plan.
Objective: use k-means clustering to group Chinese dishes based on features like calories, spiciness, prep time, and price, then design a themed meal plan balancing dietary preferences, budget, and other constraints.
- Step 1 -- Visualize the Dishes: plot the dishes on a 2D scatterplot using two features, such as "Calories" (x-axis) and "Spiciness" (y-axis).
- Step 2 -- Perform Clustering: use k-means clustering to group dishes into categories (e.g., k = 3 for "Spicy Favorites," "Savory Classics," and "Quick Snacks").
- Step 3 -- Create Your Meal Plan: select one or more dishes from each cluster to design a balanced meal. The meal should fit a theme (e.g., "Sichuan Spice Night," "Quick Comfort Food").
- Step 4 -- Present Your Meal Plan: groups share their meal plans and explain how they chose dishes to fit the theme, and how clustering helped them categorize dishes and make decisions.
Group Activity: Plan the Perfect Chinese Meal with k-Means!

| Dish                | Calories | Spiciness (1-10) | Prep Time (mins) | Price ($) |
|---------------------|----------|------------------|------------------|-----------|
| Kung Pao Chicken    | 450      | 8                | 30               | 12.99     |
| Mapo Tofu           | 350      | 9                | 25               | 10.99     |
| Hot and Sour Soup   | 200      | 6                | 20               | 6.99      |
| Peking Duck         | 600      | 2                | 45               | 24.99     |
| Spring Rolls        | 150      | 3                | 15               | 5.99      |
| Fried Rice          | 500      | 1                | 25               | 8.99      |
| Sichuan Dumplings   | 300      | 7                | 20               | 9.99      |
| Egg Tart            | 250      | 0                | 10               | 3.99      |
| Sweet and Sour Pork | 550      | 4                | 30               | 11.99     |
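As a head start on Steps 1 and 2 of the activity, here is one way to cluster the table's dishes on (calories, spiciness) with a minimal k-means. The min-max feature scaling and the hand-picked starting centroids are my own choices, so your clusters may come out differently:

```python
import numpy as np

# Dish features from the table: (calories, spiciness 1-10).
dishes = {
    "Kung Pao Chicken":    (450, 8),
    "Mapo Tofu":           (350, 9),
    "Hot and Sour Soup":   (200, 6),
    "Peking Duck":         (600, 2),
    "Spring Rolls":        (150, 3),
    "Fried Rice":          (500, 1),
    "Sichuan Dumplings":   (300, 7),
    "Egg Tart":            (250, 0),
    "Sweet and Sour Pork": (550, 4),
}
names = list(dishes)
X = np.array([dishes[n] for n in names], dtype=float)

# Scale each feature to [0, 1] so calories do not dominate spiciness.
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Fixed starting centroids (hand-picked for reproducibility).
mu = X[[0, 3, 4]]  # Kung Pao Chicken, Peking Duck, Spring Rolls
for _ in range(20):
    # Assign each dish to its nearest centroid, then recenter.
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    mu = np.array([X[labels == j].mean(axis=0) for j in range(3)])

for j in range(3):
    print(f"Cluster {j}:", [n for n, l in zip(names, labels) if l == j])
```

With this setup the spicy dishes, the hearty savory dishes, and the light quick dishes end up in separate clusters, which lines up with the suggested "Spicy Favorites" / "Savory Classics" / "Quick Snacks" labels from Step 2.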