Implementing Decision Trees and SVM Classifiers in Python

CMPE 462 Assignment 2
Boğaziçi University, Department of Computer Engineering
Deadline: May 14th, 2024 by midnight
Spring 2024

In this assignment, you will implement the following models. You can use libraries such as NumPy, SciPy, and Matplotlib in your experiments. If the task requires implementation from scratch, you are not allowed to use a library. If training and test splits are not provided in the datasets, please randomly split your data into training and test sets. Please submit a PDF report containing the link to your code, your answers, and references. Please cite all the resources used in the assignment. If you ever use an AI tool such as ChatGPT, please acknowledge it. Each group member should be able to answer questions regarding any of the sections below. Please submit one report per group.

1 Decision Trees (30 pts)

In this task, please use the dataset you used in the Naive Bayes task of the first assignment: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

1. Train a decision tree using scikit-learn's function. Please tune the depth of the tree and visualize the learned tree using the library. You can use the default splitting criterion, which is Gini impurity.

2. Compare its test performance with the Naive Bayes classifier you trained in the first assignment.

3. Using the decision tree, obtain the most significant features. Select the most significant 5, 10, 15, and 20 features, train a linear classifier of your choice for each, and compare the performance. Comment on the effect of this feature selection approach on the performance.
4. Train a random forest using all the original features and compare its test performance with the decision tree in 1. Please plot the change in test and training performances with the varying number of trees in the forest.

2 Support Vector Machines (40 pts)

You will implement the SVM classifier for the MNIST [1] dataset in this task. MNIST has 50,000 training and 10,000 test images of 10 classes. Please consider the digits 2, 3, 8, and 9 in this section. Thus, the total number of samples will be 20,000 (5,000 for each class) in the training set and 4,000 in the test set.

1. Please flatten the gray-scale images and feed these vectors directly to your soft-margin SVM model.

(a) Please train a 4-class linear SVM using one-vs-all. Please train the primal formulation of SVM from scratch using a quadratic programming solver. Please clearly write the expressions you feed to the solver. Please tune the hyperparameters and report your training and test accuracy.

(b) Please train a 4-class SVM using scikit-learn's soft-margin primal SVM function with a linear kernel. Please tune the hyperparameters and report your training and test accuracy. Compare the results with part (a) regarding classification accuracy and training time.

(c) Please train a 4-class non-linear SVM using one-vs-all. Please train the dual formulation of SVM from scratch using a quadratic programming solver. Please clearly write the expressions you feed to the solver. You may choose any kernel you like. Please tune the hyperparameters and report your training and test accuracy.

(d) Please train a 4-class SVM using scikit-learn's soft-margin dual SVM function with a non-linear kernel. You may choose any kernel you like. Please tune the hyperparameters and report your training and test accuracy. Compare the results with part (c) regarding classification accuracy and training time.

2. Please extract features from the images. You may try any feature extraction technique you like.
However, please explain the reason behind your choice. Repeat the experiments in 1(a)-(d) with the extracted features and compare the performance in terms of accuracy and training time.

[1] https://en.wikipedia.org/wiki/MNIST_database
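As a starting point for part 1(a), the soft-margin primal SVM can be written as a quadratic program over the stacked variable z = [w, b, ξ]: minimize (1/2)‖w‖² + C·Σξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0. The sketch below is only illustrative: it solves this program on a tiny synthetic two-class problem with `scipy.optimize.minimize` (SLSQP) standing in for a dedicated QP solver, and it omits the one-vs-all wrapper and hyperparameter tuning the assignment asks for. The toy data, C value, and solver choice are all assumptions, not part of the assignment.

```python
import numpy as np
from scipy.optimize import minimize

# assumed toy data: two well-separated Gaussian blobs (NOT MNIST)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, size=(10, 2)),
               rng.normal(-2.0, 0.5, size=(10, 2))])
y = np.array([1.0] * 10 + [-1.0] * 10)
n, d = X.shape
C = 1.0  # soft-margin penalty (would be tuned in the real task)

# variable vector z = [w (d entries), b (1 entry), xi (n entries)]
def objective(z):
    w, xi = z[:d], z[d + 1:]
    return 0.5 * w @ w + C * xi.sum()

# constraints fed to the solver:
#   y_i (w . x_i + b) - 1 + xi_i >= 0   (margin with slack)
#   xi_i >= 0                            (slack non-negativity)
cons = [
    {"type": "ineq", "fun": lambda z: y * (X @ z[:d] + z[d]) - 1.0 + z[d + 1:]},
    {"type": "ineq", "fun": lambda z: z[d + 1:]},
]

res = minimize(objective, np.zeros(d + 1 + n), constraints=cons, method="SLSQP")
w, b = res.x[:d], res.x[d]
acc = (np.sign(X @ w + b) == y).mean()  # training accuracy on the toy set
```

For MNIST-scale data, a solver built for QPs (e.g. one accepting the standard P, q, G, h matrices) would replace SLSQP, with the same objective and constraints written in matrix form.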
3. Please find the support vectors using one of the dual SVM models you trained and inspect the images. Please discuss whether there is any visual difference between the support vectors and the other images.

3 Clustering (30 pts)

In this task, please use the 4-class MNIST data you used in the second task.

1. Is normalizing the data points before running k-means important? Please explain.

2. Please implement the k-means algorithm from scratch using Euclidean distance. Find 4 clusters using the flattened images. Repeat this experiment with the features you extracted. Please compare the clustering outputs using the external (clustering accuracy) and internal (SSE) metrics.

3. Please repeat step 2 with cosine similarity instead of Euclidean distance. Did you observe a significant difference in the clustering results?
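A from-scratch Euclidean k-means, as required in task 2 of the Clustering section, can be sketched as plain Lloyd's iteration: assign each point to its nearest center, recompute centers as cluster means, and stop when the centers no longer move. The demo below runs on two assumed synthetic blobs rather than MNIST, and uses random-sample initialization; both are illustrative choices, not requirements of the assignment.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm with squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared distance from every point to every center -> (n, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute centers; keep the old center if a cluster went empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    sse = ((X - centers[labels]) ** 2).sum()  # internal metric from the task
    return labels, centers, sse

# assumed toy data: two far-apart blobs instead of the flattened MNIST images
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, size=(15, 2)),
               rng.normal(10.0, 0.2, size=(15, 2))])
labels, centers, sse = kmeans(X, k=2)
```

For the cosine-similarity variant in task 3, only the assignment step changes: normalize rows and assign each point to the center with the largest dot product instead of the smallest squared distance.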
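The "clustering accuracy" external metric named in the Clustering section is usually computed by matching cluster ids to class labels with the best one-to-one assignment (Hungarian algorithm) and then scoring as ordinary accuracy. A minimal sketch, assuming SciPy's `linear_sum_assignment` is acceptable for the matching step (the clustering itself, not the metric, is what must be from scratch):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one cluster-to-class mapping."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # contingency matrix: rows = clusters, columns = true classes
    cont = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            cont[i, j] = np.sum((y_pred == c) & (y_true == t))
    # Hungarian matching maximizes matched counts (minimize the negation)
    rows, cols = linear_sum_assignment(-cont)
    return cont[rows, cols].sum() / len(y_true)

# perfectly recovered clusters, just with swapped ids, still score 1.0
acc = clustering_accuracy(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0]))
```

Because cluster ids are arbitrary, this metric is invariant to relabeling, which makes it suitable for comparing the Euclidean and cosine k-means runs the tasks ask for.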