Predicting Gujarat Lions' Match Outcomes in IPL Data Analysis

School

Pak-Austria Fachhochschule Institute of Applied Sciences and Technology

Course

MANA 77A

Subject

Statistics

Date

Dec 11, 2024

Uploaded by ElderResolveDeer26

Name: Hoor-ul-ain Wajid
Registration Number: B22F0068DS004
BS Data Science
Machine Learning for Structured Data
Assignment 3

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

data = pd.read_csv(r"C:\Users\PMLS\Downloads\matches.csv")
data.head()

   id  season       city        date                        team1  \
0   1    2017  Hyderabad  2017-04-05          Sunrisers Hyderabad
1   2    2017       Pune  2017-04-06               Mumbai Indians
2   3    2017     Rajkot  2017-04-07                Gujarat Lions
3   4    2017     Indore  2017-04-08       Rising Pune Supergiant
4   5    2017  Bangalore  2017-04-08  Royal Challengers Bangalore

                         team2                  toss_winner  toss_decision  \
0  Royal Challengers Bangalore  Royal Challengers Bangalore          field
1       Rising Pune Supergiant       Rising Pune Supergiant          field
2        Kolkata Knight Riders        Kolkata Knight Riders          field
3              Kings XI Punjab              Kings XI Punjab          field
4             Delhi Daredevils  Royal Challengers Bangalore            bat

   result  dl_applied                       winner  win_by_runs  \
0  normal           0          Sunrisers Hyderabad           35
1  normal           0       Rising Pune Supergiant            0
2  normal           0        Kolkata Knight Riders            0
3  normal           0              Kings XI Punjab            0
4  normal           0  Royal Challengers Bangalore           15

   win_by_wickets  player_of_match                                      venue  \
0               0     Yuvraj Singh  Rajiv Gandhi International Stadium, Uppal
1               7        SPD Smith    Maharashtra Cricket Association Stadium
2              10          CA Lynn     Saurashtra Cricket Association Stadium
3               6       GJ Maxwell                     Holkar Cricket Stadium
4               0        KM Jadhav                      M Chinnaswamy Stadium

          umpire1        umpire2  umpire3
0     AY Dandekar       NJ Llong      NaN
1  A Nand Kishore         S Ravi      NaN
2     Nitin Menon      CK Nandan      NaN
3    AK Chaudhary  C Shamshuddin      NaN
4             NaN            NaN      NaN

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 636 entries, 0 to 635
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               636 non-null    int64
 1   season           636 non-null    int64
 2   city             629 non-null    object
 3   date             636 non-null    object
 4   team1            636 non-null    object
 5   team2            636 non-null    object
 6   toss_winner      636 non-null    object
 7   toss_decision    636 non-null    object
 8   result           636 non-null    object
 9   dl_applied       636 non-null    int64
 10  winner           633 non-null    object
 11  win_by_runs      636 non-null    int64
 12  win_by_wickets   636 non-null    int64
 13  player_of_match  633 non-null    object
 14  venue            636 non-null    object
 15  umpire1          635 non-null    object
 16  umpire2          635 non-null    object
 17  umpire3          0 non-null      float64

dtypes: float64(1), int64(5), object(12)
memory usage: 89.6+ KB

# Fill missing values (assign back instead of chained inplace fillna)
data['city'] = data['city'].fillna(data['city'].mode()[0])
data['player_of_match'] = data['player_of_match'].fillna("Unknown")
data['umpire1'] = data['umpire1'].fillna("Unknown")
data['umpire2'] = data['umpire2'].fillna("Unknown")
# Drop umpire3 column (it has no non-null values)
data.drop(['umpire3'], axis=1, inplace=True)

Discussion:
I filtered the dataset down to the matches that Gujarat Lions played against other teams, because the goal is to predict whether Gujarat Lions win or lose against each opponent.

# Filter dataset for matches of other teams with Gujarat Lions
dataset_filtered = data[
    ((data['team1'] == 'Gujarat Lions') & (data['team2'] != 'Gujarat Lions')) |
    ((data['team1'] != 'Gujarat Lions') & (data['team2'] == 'Gujarat Lions'))
].copy()  # copy so later column assignments do not act on a view of data
dataset_filtered.head()

    id  season       city        date                        team1  \
2    3    2017     Rajkot  2017-04-07                Gujarat Lions
5    6    2017  Hyderabad  2017-04-09                Gujarat Lions
12  13    2017     Rajkot  2017-04-14       Rising Pune Supergiant
15  16    2017     Mumbai  2017-04-16                Gujarat Lions
19  20    2017     Rajkot  2017-04-18  Royal Challengers Bangalore

                    team2            toss_winner  toss_decision  result  \
2   Kolkata Knight Riders  Kolkata Knight Riders          field  normal
5     Sunrisers Hyderabad    Sunrisers Hyderabad          field  normal
12          Gujarat Lions          Gujarat Lions          field  normal
15         Mumbai Indians         Mumbai Indians          field  normal
19          Gujarat Lions          Gujarat Lions          field  normal

    dl_applied                       winner  win_by_runs  win_by_wickets  \
2            0        Kolkata Knight Riders            0              10
5            0          Sunrisers Hyderabad            0               9
12           0                Gujarat Lions            0               7
15           0               Mumbai Indians            0               6
19           0  Royal Challengers Bangalore           21               0

    player_of_match                                      venue         umpire1  \
2           CA Lynn     Saurashtra Cricket Association Stadium     Nitin Menon
5       Rashid Khan  Rajiv Gandhi International Stadium, Uppal      A Deshmukh
12           AJ Tye     Saurashtra Cricket Association Stadium  A Nand Kishore
15           N Rana                           Wankhede Stadium  A Nand Kishore
19         CH Gayle     Saurashtra Cricket Association Stadium          S Ravi

      umpire2
2   CK Nandan
5    NJ Llong
12     S Ravi
15     S Ravi
19  VK Sharma

dataset_filtered["team2"].unique()

array(['Kolkata Knight Riders', 'Sunrisers Hyderabad', 'Gujarat Lions',
       'Mumbai Indians', 'Rising Pune Supergiant', 'Delhi Daredevils',
       'Royal Challengers Bangalore'], dtype=object)

# Define the mapping dictionary
team_mapping = {
    'Gujarat Lions': 0,
    'Rising Pune Supergiant': 1,
    'Royal Challengers Bangalore': 2,
    'Kolkata Knight Riders': 3,
    'Kings XI Punjab': 4,
    'Rising Pune Supergiants': 5,
    'Mumbai Indians': 6,
    'Sunrisers Hyderabad': 7,
    'Delhi Daredevils': 8,
}

# Apply mapping using .loc to avoid SettingWithCopyWarning
dataset_filtered.loc[:, 'team1'] = dataset_filtered['team1'].map(team_mapping)
dataset_filtered.loc[:, 'team2'] = dataset_filtered['team2'].map(team_mapping)
dataset_filtered.loc[:, 'toss_winner'] = dataset_filtered['toss_winner'].map(team_mapping)
print(dataset_filtered.head())

    id  season       city        date  team1  team2  toss_winner  toss_decision  \
2    3    2017     Rajkot  2017-04-07      0      3            3          field
5    6    2017  Hyderabad  2017-04-09      0      7            7          field
12  13    2017     Rajkot  2017-04-14      1      0            0          field
15  16    2017     Mumbai  2017-04-16      0      6            6          field
19  20    2017     Rajkot  2017-04-18      2      0            0          field

    result  dl_applied                       winner  win_by_runs  \
2   normal           0        Kolkata Knight Riders            0
5   normal           0          Sunrisers Hyderabad            0
12  normal           0                Gujarat Lions            0
15  normal           0               Mumbai Indians            0
19  normal           0  Royal Challengers Bangalore           21

    win_by_wickets  player_of_match                                      venue  \
2               10          CA Lynn     Saurashtra Cricket Association Stadium
5                9      Rashid Khan  Rajiv Gandhi International Stadium, Uppal
12               7           AJ Tye     Saurashtra Cricket Association Stadium
15               6           N Rana                           Wankhede Stadium
19               0         CH Gayle     Saurashtra Cricket Association Stadium

           umpire1    umpire2
2      Nitin Menon  CK Nandan
5       A Deshmukh   NJ Llong
12  A Nand Kishore     S Ravi
15  A Nand Kishore     S Ravi
19          S Ravi  VK Sharma

dataset_filtered = dataset_filtered.copy()
# Label: 0 = Gujarat Lions win, 1 = any other outcome
dataset_filtered['win_label'] = dataset_filtered['winner'].apply(lambda x: 0 if x == 'Gujarat Lions' else 1)
print(dataset_filtered['win_label'].value_counts())

win_label
1    17
0    13
Name: count, dtype: int64

# Create a copy of the filtered dataset to avoid modifying the original DataFrame
dataset_filtered = dataset_filtered.copy()
# Now, drop the columns
dataset_filtered.drop(['id', 'date', 'player_of_match', 'winner', 'umpire1', 'umpire2'], axis=1, inplace=True)

label_encoder = LabelEncoder()
categorical_columns = ["toss_decision", 'city', 'venue', 'result']
for col in categorical_columns:
    dataset_filtered[col] = label_encoder.fit_transform(dataset_filtered[col])

# Normalizing the data
# Note: this scales the columns of `data`, not `dataset_filtered`,
# so the features in X below remain unscaled.
scaler = StandardScaler()
numerical_columns = ['win_by_runs', 'win_by_wickets', 'dl_applied']
data[numerical_columns] = scaler.fit_transform(data[numerical_columns])

X = dataset_filtered.drop('win_label', axis=1)
y = dataset_filtered['win_label']
X

     season  city  team1  team2  toss_winner  toss_decision  result  dl_applied  \
2      2017     8      0      3            3              1       0           0
5      2017     3      0      7            7              1       0           0
12     2017     8      1      0            0              1       0           0
15     2017     6      0      6            6              1       0           0
19     2017     8      2      0            0              1       0           0
22     2017     5      3      0            0              1       0           0

25     2017     8      4      0            0              1       0           0
29     2017     0      2      0            0              1       0           0
33     2017     8      0      6            0              0       1           0
37     2017     7      0      1            1              1       0           0
40     2017     2      0      8            8              1       0           0
45     2017     1      4      0            0              1       0           0
48     2017     4      0      8            8              1       0           0
51     2017     4      0      7            7              1       0           0
578    2016     1      4      0            0              1       0           0
581    2016     8      5      0            5              0       0           0
584    2016     6      6      0            0              1       0           0
590    2016     8      0      7            7              1       0           0
594    2016     8      2      0            2              0       0           0
598    2016     2      0      8            8              1       0           0
600    2016     7      5      0            0              1       0           0
603    2016     8      4      0            0              1       0           0
606    2016     8      0      8            8              1       0           0
609    2016     3      0      7            7              1       0           0
613    2016     5      3      0            0              1       0           0
619    2016     0      2      0            0              1       0           0
626    2016     4      3      0            0              1       0           0
629    2016     4      6      0            0              1       0           0
632    2016     0      0      2            2              1       0           0
634    2016     2      0      7            7              1       0           0

     win_by_runs  win_by_wickets  venue
2              0              10      7
5              0               9      6
12             0               7      7
15             0               6      8
19            21               0      7
22             0               4      0
25            26               0      7
29             0               7      3
33             0               0      7
37             0               5      4
40             0               7      1
45             0               6      5
48             0               2      2
51             0               8      2
578            0               5      5
581            0               7      7
584            0               3      8
590            0              10      7
594            0               6      7
598            1               0      1
600            0               3      4
603           23               0      7
606            0               8      7
609            0               5      6
613            0               5      0
619          144               0      3
626            0               6      2
629            0               6      2
632            0               4      3
634            0               4      1

y

2      1
5      1
12     0
15     1
19     1
22     0
25     1
29     0
33     1
37     1
40     1
45     0
48     1
51     1
578    0
581    0
584    0
590    1
594    0
598    0
600    0
603    1
606    1
609    1
613    0
619    1
626    0
629    0
632    1
634    1
Name: win_label, dtype: int64

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

SVM

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the SVM model
svm_model = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
# Train the model
svm_model.fit(X_train, y_train)

SVC(random_state=42)

# Make predictions
y_pred = svm_model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("SVM Accuracy:", accuracy)

SVM Accuracy: 0.6666666666666666

print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.67      1.00      0.80         4

    accuracy                           0.67         6
   macro avg       0.33      0.50      0.40         6
weighted avg       0.44      0.67      0.53         6

C:\Users\PMLS\anaconda3\Lib\site-packages\sklearn\metrics\_classification.py:1469: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Discussion:
The SVM's accuracy is strongly affected by class imbalance. Gujarat Lions wins (class 0) are the minority class, with 13 of the 30 matches, and the model predicted class 1 for every test sample, so precision and recall for class 0 are both 0. This can be addressed by oversampling the minority class, undersampling the majority class, or using class weights.

Decision Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Initialize the Decision Tree Classifier
dt_model = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=42)
# Train the model
dt_model.fit(X_train, y_train)
# Make predictions
y_pred = dt_model.predict(X_test)

print("\nAccuracy Score:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy Score: 0.8333333333333334
Classification Report:
              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       1.00      0.75      0.86         4

    accuracy                           0.83         6
   macro avg       0.83      0.88      0.83         6
weighted avg       0.89      0.83      0.84         6

Discussion
Strengths: The model predicts class 1 (the other team wins) with perfect precision, and it identifies every Gujarat Lions win (class 0) with perfect recall.
Weaknesses: Precision for class 0 is lower (0.67), so a predicted Gujarat Lions win is less reliable: of the three matches predicted as class 0, only two were actual Gujarat Lions wins. Correspondingly, recall for class 1 is 0.75, meaning one of the four matches won by the other team was misclassified.

Naive Bayes

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize the Gaussian Naive Bayes model
nb_model = GaussianNB()
# Train the model
nb_model.fit(X_train, y_train)
# Make predictions
y_pred = nb_model.predict(X_test)
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy Score: 0.8333333333333334
Classification Report:
              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       1.00      0.75      0.86         4

    accuracy                           0.83         6
   macro avg       0.83      0.88      0.83         6
weighted avg       0.89      0.83      0.84         6

Discussion
Naive Bayes achieved 83% accuracy overall. It is strong on class 0 (Gujarat Lions wins), with perfect recall (1.00), but its precision for that class is lower (0.67) because of one false positive. For class 1 (other team wins), precision is perfect (1.00) at some cost in recall (0.75): every class 1 prediction is correct, but one of the four class 1 matches is missed. The weighted-average F1-score (0.84) indicates good overall performance despite the class imbalance.

KNN

from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier with k=1
knn_model = KNeighborsClassifier(n_neighbors=1)
# Train the model
knn_model.fit(X_train, y_train)
# Make predictions
y_pred = knn_model.predict(X_test)
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy Score: 0.8333333333333334
Classification Report:
              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       1.00      0.75      0.86         4

    accuracy                           0.83         6
   macro avg       0.83      0.88      0.83         6
weighted avg       0.89      0.83      0.84         6

from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier with k=3
knn_model = KNeighborsClassifier(n_neighbors=3)
# Train the model
knn_model.fit(X_train, y_train)
# Make predictions
y_pred = knn_model.predict(X_test)
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy Score: 0.6666666666666666
Classification Report:
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         2
           1       1.00      0.50      0.67         4

    accuracy                           0.67         6
   macro avg       0.75      0.75      0.67         6
weighted avg       0.83      0.67      0.67         6

Discussion
With k=1: The model performs well on both classes. Recall for class 0 (Gujarat Lions win) is perfect, and class 1 (other team wins) has high precision (1.00) and recall (0.75). k=1 is the better choice for this dataset, as it yields higher accuracy (0.83 vs. 0.67) and better precision for class 0 (0.67 vs. 0.50).
With k=3: The model starts to over-smooth its predictions, which lowers recall for class 1 (0.50) and overall accuracy (0.67). Precision for class 1 remains perfect, but recall and F1-score drop because the model misses more true instances. The worse performance is likely due to increased bias: as the model considers more neighbors, its predictions become overly generalized.
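
Because the test set holds only six matches, the comparison between k=1 and k=3 above can swing on one or two predictions. The sketch below shows one possible way to compare values of k with stratified cross-validation instead of a single split. It is an illustration under stated assumptions: it runs on synthetic stand-in data (30 samples, 13 of class 0 and 17 of class 1, mirroring the class counts above) rather than the match data, and the names `X_demo`, `y_demo` and `compare_k` are hypothetical, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the real match features).
# 13 samples of class 0 and 17 of class 1, as in the win_label counts.
rng = np.random.default_rng(42)
y_demo = np.array([0] * 13 + [1] * 17)
X_demo = rng.normal(size=(30, 4)) + y_demo[:, None] * 1.5


def compare_k(X, y, k_values=(1, 3, 5), n_splits=5):
    """Return {k: (mean accuracy, std)} from stratified k-fold cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    results = {}
    for k in k_values:
        # Scaling sits inside the pipeline so it is refit on each training fold.
        model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[k] = (scores.mean(), scores.std())
    return results


knn_scores = compare_k(X_demo, y_demo)
for k, (mean_acc, std_acc) in knn_scores.items():
    print(f"k={k}: mean accuracy {mean_acc:.2f} (+/- {std_acc:.2f})")
```

To try this on the notebook's data, one would pass `X` and `y` in place of `X_demo` and `y_demo`. Averaging over several folds gives each value of k more than six test predictions to be judged on, and the standard deviation hints at how much the result depends on the particular split.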