Abstract
The sinking of the RMS Titanic caused the death of thousands of passengers and crew is one of the deadliest maritime disasters in history. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there were some elements of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. The objective is to apply different machine learning models to complete the analysis of what sorts of people were likely to survive. The result of applying machine learning algorithms are compared and analysed on the basis of accuracy.
Keywords- Titanic, Logistic Regression, Random
…show more content…
The iceberg collision ripped open Titanic’s hull in several places. Titanic carried thousands of people of all ages, genders and class that fateful night, and only a few hundred escaped in lifeboats and rest died in the icy water. The dead included a large number of men whose place was given to the many women and children on board. The dead primarily consisted of men in the ship’s second class.
Machine learning techniques are applied to predict which passengers survived the sinking of the Titanic. Features like ticket fare, age, sex, class will be used to make the predictions. Predictive analysis is a procedure that incorporates the use of computational methods to determine important and useful patterns in large data. Using the machine learning algorithms, survival is predicted on different combinations of features.
The objective is to perform exploratory data analytics to mine various information in the dataset available at kaggle and to know effect of each field on survival of passengers by applying analytics between every field of dataset with “Survival” field. The prediction the output for newer data sets by applying machine learning algorithm is done. The data analysis will be done on applied algorithms and accuracy will be checked. The different algorithms are compared on the basis of accuracy and the best performing model is suggested with respect to used dataset.
Data
…show more content…
Logistic regression is used to describe data and explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. It is used to solve binary classification problem, some of the real life examples are spam detection- predicting if an email is spam or not, health-Predicting if a given mass of tissue is benign or malignant, marketing- predicting if a given user will buy an insurance product or not.
Random Forest
Random forest algorithm is supervised classification algorithm. As the name suggest, this algorithm creates the forest with a number of trees. The higher the number of trees in the forest gives the higher accuracy results. Random forest algorithm can be used for both classification and regression problems. For instance, it will take random samples of 100 observation and 5 randomly chosen initial variables to build a model. It will repeat the process (say) 10 times and then make a final prediction on each observation. Final prediction is a function (mean) of each