Please see the GitHub repository.
This repository presents my submission to the Titanic: Machine Learning from Disaster Kaggle competition.
In this competition, the goal is to solve a binary classification problem: predict which passengers survived the tragedy.
Kaggle provides two datasets: a training set (labels known) and a test set (labels unknown). The goal is to submit a file with our predicted labels indicating whether each passenger survived or not.
We have access to 9 features (numerical, text, categorical). The big challenge with this competition is the small size of the data: the training set contains only 891 samples and the test set contains 418 samples.
Therefore, the main issue is to design an algorithm that generalizes well enough to avoid over-fitting. To do so, a set of features is engineered, and an ensemble modeling method with voting is then used to obtain the most generalizable model.
This is a binary classification problem, with 2 labels:
Kaggle offers 2 datasets:
Goal: For each passenger, predict the label (0 or 1).
The evaluation metric is the accuracy score.
The project is decomposed into 3 parts:
The framework of this notebook is:
For this competition, the current Kaggle Leaderboard accuracy I reached is 0.79904.
In the training dataframe, we observe that the 2 labels are slightly imbalanced (61% labeled as 0). We also see that we have access to 16 different features per passenger.
4 of the features have missing values:
The notebook details how the NaNs are treated.
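As an illustration, here is a minimal imputation sketch with pandas; the exact strategy (which columns are filled, and with which statistics) is an assumption, and the notebook documents the actual treatment.

```python
import pandas as pd

# Load the Kaggle training data (path assumed).
train = pd.read_csv("train.csv")

# Hypothetical imputation strategy -- the notebook documents the actual one:
# a numerical feature filled with its median, a categorical feature with its mode,
# and a mostly-empty column turned into a presence flag rather than filled.
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])
train["HasCabin"] = train["Cabin"].notna().astype(int)
```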
For this type of feature, we can observe the average survival rate of passengers within each category (a minimal sketch is given after the conclusion below). The observed features are:
Conclusion:
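A minimal sketch of this kind of categorical analysis; the grouping column shown here (Sex) is just one example of an observed feature:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Average survival rate within each category of a feature (example: Sex).
survival_by_category = train.groupby("Sex")["Survived"].mean()
print(survival_by_category)
```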
For this type of feature, we can observe the distribution of the passengers given the survival label (see the sketch after the conclusion below). The observed features are:
Conclusion:
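A sketch of how such a distribution can be plotted; the feature shown (Age) is just an example of one of the observed numerical features:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

# Distribution of a numerical feature (example: Age) for each survival label.
sns.kdeplot(data=train, x="Age", hue="Survived", common_norm=False)
plt.title("Age distribution given survival")
plt.show()
```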
For this type of feature, we do not directly perform an analysis: a first transformation is needed. The observed features are:
Engineering:
In this part, I designed the features following the analysis of the previous part. I ended up with the following features:
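As an example of the kind of transformation applied to a text feature, here is a hedged sketch extracting a passenger's title from the Name column; this is a common Titanic transformation given as an illustration, and the actual engineered features are listed in the notebook.

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Extract the title (Mr, Mrs, Miss, ...) from names such as
# "Braund, Mr. Owen Harris" -- one common transformation of the Name text feature.
train["Title"] = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Group the rare titles into a single bucket (illustrative threshold).
counts = train["Title"].value_counts()
rare_titles = list(counts[counts < 10].index)
train["Title"] = train["Title"].replace(rare_titles, "Rare")
print(train["Title"].value_counts())
```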
### Simple Models & Selection
I chose several classifiers and compared them using k-fold cross-validation.
K-fold cross-validation is important here (given the small dataset) because it lets us train and test each model 10 times, which reduces the chance that a result is due to over-fitting or simply luck.
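A minimal sketch of this 10-fold evaluation with scikit-learn; the feature matrix below is a placeholder standing in for the engineered features, and the classifier is just one of the candidates:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")

# Placeholder feature matrix: a few numerical columns with NaNs filled,
# standing in for the engineered features of the notebook.
X = train[["Pclass", "Age", "SibSp", "Parch", "Fare"]].fillna(0)
y = train["Survived"]

# 10-fold cross-validation: the model is trained and tested 10 times.
clf = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```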
The models tried for this binary classification problem are:
The different algorithms perform differently. We need to look not only at the average accuracy but also at its standard deviation across folds: we want a model that performs well on average, without performing very poorly on some folds and very well on others. I therefore use a hand-crafted criterion that takes both into account, defined as the mean accuracy minus half of its standard deviation. In other words, we pick the algorithm with the best accuracy, penalized when its deviation is too large.
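A sketch of this selection criterion applied to per-fold accuracies; the model names and score values below are illustrative, not the notebook's results:

```python
import numpy as np

def criterion(scores):
    """Mean cross-validation accuracy minus half of its standard deviation."""
    scores = np.asarray(scores)
    return scores.mean() - 0.5 * scores.std()

# Illustrative per-fold accuracies for two models (not the notebook's results).
cv_scores = {
    "GradientBoosting": [0.82, 0.80, 0.84, 0.79, 0.83, 0.81, 0.80, 0.85, 0.78, 0.82],
    "RandomForest":     [0.86, 0.74, 0.83, 0.77, 0.85, 0.79, 0.88, 0.72, 0.84, 0.76],
}

# Rank the models by the criterion, best first.
ranked = sorted(cv_scores, key=lambda name: criterion(cv_scores[name]), reverse=True)
print(ranked)
```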
Gradient Boosting produces the best outcome in terms of this criterion. We can directly use it to produce the submission file.
Alternatively, we can combine several models and ensemble their results to produce a better outcome: this method is called 'ensemble modeling'.
To select the best ones, I use the criterion defined above: the mean accuracy minus half of its standard deviation.
But first, let's select the 5 best algorithms in order to perform hyper-parameter optimization:
### Hyper-Parameter Optimization
In this part, I tried to improve the accuracy of the selected classifiers through hyper-parameter optimization. To do so, I used Scikit-Learn's grid search tool.
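A minimal sketch of such a grid search with scikit-learn's GridSearchCV; the parameter grid and the placeholder features below are assumptions, not the ones actually used in the notebook:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train.csv")
X = train[["Pclass", "Age", "SibSp", "Parch", "Fare"]].fillna(0)  # placeholder features
y = train["Survived"]

# Hypothetical parameter grid -- the notebook defines the grids actually searched.
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 4],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=10,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```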
### Ensemble Modeling
The classifiers' predictions are quite correlated, which is a good sign! However, we still observe some differences between the 5 classifiers. This is good for us, because we can leverage a voting system to improve our prediction.
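A hedged sketch of such a voting ensemble with scikit-learn; the choice of base classifiers and features below is illustrative, while the notebook specifies the 5 tuned classifiers actually combined:

```python
import pandas as pd
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
X = train[["Pclass", "Age", "SibSp", "Parch", "Fare"]].fillna(0)  # placeholder features
y = train["Survived"]

# Illustrative base models -- the notebook combines the 5 tuned classifiers.
voting = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average predicted probabilities across models
)
voting.fit(X, y)
print(voting.predict(X[:5]))
```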
My submitted file is: titanic-predictions.csv.