Kaggle challenges us to learn data analysis and machine learning from the data of the Titanic shipwreck, to try to predict survival, and to get familiar with ML basics.
This material is intended to cover most of the data analysis and ML techniques in Python rather than to properly compete on Kaggle. That is why it follows the natural flow of an ML project and contains many notes and links about the techniques, to make consultation and reference easy and so it can be extended over time.
In this way the material can be used for reference, and its methods can be applied to other similar classification problems; but to apply it in the competition, or even to a real case, it will be necessary to make some choices and changes.
Competition Description:
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, Kaggle asks you to complete the analysis of what sorts of people were likely to survive. In particular, they ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Table of Contents
- 1 Preparing environment and uploading data
- 2 Exploratory Data Analysis (EDA) & Feature Engineering
- 3 Select Features
- 4 Additional Feature Engineering: Feature transformation
- 5 Modeling – Hyper Parametrization
- 5.1 Simplify Get Results
- 5.2 Logistic Regression
- 5.3 SGDClassifier
- 5.4 Linear Support Vector Classification
- 5.5 Gaussian Process Classifier (GPC)
- 5.6 Random Forest Classifier
- 5.7 AdaBoost classifier
- 5.8 K-Nearest Neighbors
- 5.9 Multi-layer Perceptron classifier
- 5.10 Gradient Boosting for Classification
- 5.11 XGBoost (eXtreme Gradient Boosting)
- 6 Finalize The Model: Stacking the Models
- 7 Conclusion
Preparing environment and uploading data¶
You can download this Python notebook and the data from my GitHub repository. The data can also be downloaded from Kaggle here.
Import Packages¶
import os
import warnings
warnings.simplefilter(action = 'ignore', category=FutureWarning)
warnings.filterwarnings('ignore')
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn # ignore annoying warnings (from sklearn and seaborn)
import numpy as np
import pandas as pd
import pylab
import seaborn as sns
sns.set(style="ticks", color_codes=True, font_scale=1.5)
from matplotlib import pyplot as plt
from matplotlib.ticker import FormatStrFormatter
from matplotlib.colors import ListedColormap
%matplotlib inline
import mpl_toolkits
from mpl_toolkits.mplot3d import Axes3D
import model_evaluation_utils as meu
from scipy.stats import skew, norm, probplot, boxcox
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import f_classif, chi2, SelectKBest, SelectFromModel
from boruta import BorutaPy
from rfpimp import *
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, cross_val_predict, train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
#from sklearn.base import BaseEstimator, TransformerMixin, clone, ClassifierMixin
from sklearn.ensemble import VotingClassifier
from itertools import combinations
Load Datasets¶
I start by loading the datasets with pandas and concatenating them.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
Test_ID = test.PassengerId
test.insert(loc=1, column='Survived', value=-1)
data = pd.concat([train, test], ignore_index=True)
Exploratory Data Analysis (EDA) & Feature Engineering¶
Take a First Look of our Data:¶
I created the function below to simplify the analysis of the general characteristics of the data. Inspired by R's str
function, it returns the types, counts, distinct values, null counts, missing ratio and unique values of each field/feature.
If the study involves supervised learning, the function can also return the correlation with the target; for this we just need to provide the dependent variable to the pred
parameter.
Also, if the return is stored in a variable, you can evaluate it in more detail, inspect a specific field, or sort it from different perspectives.
def rstr(df, pred=None):
    obs = df.shape[0]
    types = df.dtypes
    counts = df.apply(lambda x: x.count())
    uniques = df.apply(lambda x: [x.unique()])
    nulls = df.apply(lambda x: x.isnull().sum())
    distincts = df.apply(lambda x: x.unique().shape[0])
    missing_ratio = (df.isnull().sum() / obs) * 100
    skewness = df.skew()
    kurtosis = df.kurt()
    print('Data shape:', df.shape)
    if pred is None:
        cols = ['types', 'counts', 'distincts', 'nulls', 'missing_ratio', 'uniques', 'skewness', 'kurtosis']
        str = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis], axis=1)
    else:
        corr = df.corr()[pred]
        str = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis, corr], axis=1, sort=False)
        corr_col = 'corr ' + pred
        cols = ['types', 'counts', 'distincts', 'nulls', 'missing_ratio', 'uniques', 'skewness', 'kurtosis', corr_col]
    str.columns = cols
    dtypes = str.types.value_counts()
    print('___________________________\nData types:\n', str.types.value_counts())
    print('___________________________')
    return str
details = rstr(data.loc[: ,'Survived' : 'Embarked'], 'Survived')
details.sort_values(by='corr Survived', ascending=False)
Data Dictionary
- Survived: 0 = No, 1 = Yes. I use -1 so I can separate the test data from the training data.
- Fare: The passenger fare
- Parch: # of parents / children aboard the Titanic
- SibSp: # of siblings / spouses aboard the Titanic
- Age: Age in years
- Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- Name: Name of the passenger
- Sex: Sex of the passenger male and female
- Ticket: Ticket number
- Cabin: Cabin number
- Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
The points of attention we have here are:
- Fare: Kaggle states that this is the passenger fare, but by inspecting groups of Tickets we discover that it is the total amount paid for a ticket, and that some tickets cover a group of passengers.
- Parch: The dataset defines family relations in this way…
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children traveled only with a nanny, therefore parch=0 for them.
- SibSP: The dataset defines family relations in this way…
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
- Age: has 20% nulls, so we need a more efficient way of filling them than a single value such as the median.
- Name: categorical data with a high number of distinct values, as expected. The first reaction is to drop this column, but we can use it to try different feature engineering techniques and see if we can extract some valuable data. Besides that, did you notice that there are two pairs of passengers with the same name?
- Ticket: another categorical feature, but in this case it has only 71% distinct values and no nulls. So, it is possible that some passengers traveled in groups and shared the same ticket. Beyond that, we can check whether we can extract other interesting information from it.
- Cabin: a categorical feature with a high number of distinct values (187) and nulls (77.5%).
I use it to try different techniques for extracting information of value and for null imputation, but given the high rate of nulls the recommendation would be to simplify the filling or even to exclude this attribute.
First look at some stats of the Numeric Data¶
For the main statistics of our numeric data we use the describe function (like R's summary):
print('Data is not balanced! Only {:2.2%} survived'.format(train.Survived.describe()[1]))
display(data.loc[: ,'Pclass' : 'Embarked'].describe().transpose())
print('Survived: [1] Survived; [0] Died; [-1] Test Data set:\n',data.Survived.value_counts())
def charts(feature, df):
    print('\n ____________________________ Plots of', feature, 'per Survived and Dead: ____________________________')
    # Pie of all Data
    fig = plt.figure(figsize=(20,5))
    f1 = fig.add_subplot(131)
    cnt = df[feature].value_counts()
    g = plt.pie(cnt, labels=cnt.index, autopct='%1.1f%%', shadow=True, startangle=90)
    # Count Plot By Survived and Dead
    f = fig.add_subplot(132)
    g = sns.countplot(x=feature, hue='Survived', hue_order=[1,0], data=df, ax=f)
    # Percent stacked Plot
    survived = df[df['Survived']==1][feature].value_counts()
    dead = df[df['Survived']==0][feature].value_counts()
    df2 = pd.DataFrame([survived, dead])
    df2.index = ['Survived', 'Dead']
    df2 = df2.T
    df2 = df2.fillna(0)
    df2['Total'] = df2.Survived + df2.Dead
    df2.Survived = df2.Survived / df2.Total
    df2.Dead = df2.Dead / df2.Total
    df2.drop(['Total'], axis=1, inplace=True)
    f = fig.add_subplot(133)
    df2.plot(kind='bar', stacked=True, ax=f)
    del df2, g, f, cnt, dead, fig
Ticket¶
Since Ticket is transactional and categorical data, the first thought is to drop this feature, but note that it carries some hidden information. At first look, save for a few cases, we could state that:
- families and groups of people who traveled together bought the same ticket.
- people with alphanumeric tickets had some special treatment (crew family, employees, VIPs, free tickets, etc.)
So, we start by creating a new feature that quantifies the number of passengers per ticket, and join this
quantity to each passenger with the same ticket.
same_ticket = data.Ticket.value_counts()
data['qtd_same_ticket'] = data.Ticket.apply(lambda x: same_ticket[x])
del same_ticket
charts('qtd_same_ticket', data[data.Survived>=0])
As we can see above:
- the majority (54%) bought only one ticket per passenger, and have a lower survival rate than passengers who bought tickets for 2, 3, 4, 5 and 8 people.
- the survival rate grows between 1 and 4 and drops a lot at 5. From the bar chart we can see that after 5 the number of samples is too low (84 out of 891, 9.4%, a quarter of which are 5), and the data is skewed with a long right tail. We can reduce this tail by binning everything above 4 into the same ordinal; that is better to prevent overfitting, but we lose some other interesting cases (see the next bullet). As an alternative we can apply a Box-Cox transform to this measure.
- the case of 11 people with the same ticket is probably one huge family, all of whose members in the training data died. Let’s check this below.
data[(data.qtd_same_ticket==11)]
We confirm our hypothesis, and we notice that Fare is not the price paid per passenger but the total amount of the ticket. Since our data is per passenger, this information carries some distortion: a single passenger who bought a ticket alone for 69.55 pounds is different from 11 passengers who bought a special group ticket at 6.32 pounds per passenger. This suggests creating a new feature that represents the real fare per passenger.
Back to the number of people with the same ticket: if we keep this and the model captures the pattern, it will probably predict that the respective test samples also died! However, even if that is true, it can be a sign of overfitting, because we only have 1.2% of these cases in the training samples.
In order to increase representativeness while losing the minimum of information, since we have only 44 (4.9%) training samples that bought tickets for 4 people and 101 (11.3%) for 3, we bin the quantities 3 and 4 together as 3 (16.3%) and everything over 4 as 5 (84 samples). Let’s see the results below.
data['qtd_same_ticket_bin'] = data.qtd_same_ticket.apply(lambda x: 3 if (x>2 and x<5) else (5 if x>4 else x))
charts('qtd_same_ticket_bin', data[data.Survived>=0])
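As mentioned in the bullets above, an alternative to binning would be a Box-Cox transform to reduce the right skew of qtd_same_ticket. A minimal sketch, not used in the rest of this notebook (boxcox was imported from scipy.stats above and requires strictly positive values, which this count already satisfies):
# Sketch only: Box-Cox as an alternative to binning the ticket counts.
qtd_bc, bc_lambda = boxcox(data.qtd_same_ticket)
print('Box-Cox lambda: {:.3f} | skew before: {:.2f} | after: {:.2f}'.format(
      bc_lambda, data.qtd_same_ticket.skew(), pd.Series(qtd_bc).skew()))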
Another option is to create a binary feature that indicates whether the passenger shared a ticket or not.
print('Percent. survived from unique ticket: {:3.2%}'.\
format(data.Survived[(data.qtd_same_ticket==1) & (data.Survived>=0)].sum()/
data.Survived[(data.qtd_same_ticket==1) & (data.Survived>=0)].count()))
print('Percent. survived from same tickets: {:3.2%}'.\
format(data.Survived[(data.qtd_same_ticket>1) & (data.Survived>=0)].sum()/
data.Survived[(data.qtd_same_ticket>1) & (data.Survived>=0)].count()))
data['same_tckt'] = data.qtd_same_ticket.apply(lambda x: 1 if (x> 1) else 0)
charts('same_tckt', data[data.Survived>=0])
In this case we lose the information that the chances of survival increase from 1 to 4 and fall from 5. In addition, cases 1 and 0 of the two measures are exactly the same. So we will not use this option, and move on to work on Fare.
Finally, we have one more piece of information to extract directly from Ticket, to check the possible special treatment!
data.Ticket.str.findall('[A-z]').apply(lambda x: ''.join(map(str, x))).value_counts().head(7)
data['distinction_in_tikect'] =\
(data.Ticket.str.findall('[A-z]').apply(lambda x: ''.join(map(str, x)).strip('[]')))
data.distinction_in_tikect = data.distinction_in_tikect.\
apply(lambda y: 'Without' if y=='' else y if (y in ['PC', 'CA', 'A', 'SOTONOQ', 'STONO', 'WC', 'SCPARIS']) else 'Others')
charts('distinction_in_tikect', data[(data.Survived>=0)])
By the results, passengers with the PC distinction in their tickets had the best survival rate. Without, Others and CA are very close and can be grouped into one category; we can do the same between STONO and SCPARIS, and between A, SOTONOQ and WC.
data.distinction_in_tikect = data.distinction_in_tikect.\
apply(lambda y: 'Others' if (y in ['Without', 'Others', 'CA']) else\
'Low' if (y in ['A', 'SOTONOQ', 'WC']) else\
'High' if (y in ['STONO', 'SCPARIS']) else y)
charts('distinction_in_tikect', data[(data.Survived>=0)])
Fare¶
First, we treat the single null Fare case, then we take a look at the distribution of Fare (remember that it is the total fare of the Ticket).
# Fill null with median of most likely type passenger
data.loc[data.Fare.isnull(), 'Fare'] = data.Fare[(data.Pclass==3) & (data.qtd_same_ticket==1) & (data.Age>60)].median()
fig = plt.figure(figsize=(20,5))
f = fig.add_subplot(121)
g = sns.distplot(data[(data.Survived>=0)].Fare)
f = fig.add_subplot(122)
g = sns.boxplot(y='Fare', x='Survived', data=data[data.Survived>=0], notch = True)
Let’s take a look at how the fare per passenger behaves and how much it differs from the total:
data['passenger_fare'] = data.Fare / data.qtd_same_ticket
fig = plt.figure(figsize=(20,6))
a = fig.add_subplot(141)
g = sns.distplot(data[(data.Survived>=0)].passenger_fare)
a = fig.add_subplot(142)
g = sns.boxplot(y='passenger_fare', x='Survived', data=data[data.Survived>=0], notch = True)
a = fig.add_subplot(143)
g = pd.qcut(data.Fare[(data.Survived==0)], q=[.0, .25, .50, .75, 1.00]).value_counts().plot(kind='bar', ax=a, title='Died')
a = fig.add_subplot(144)
g = pd.qcut(data.Fare[(data.Survived>0)], q=[.0, .25, .50, .75, 1.00]).value_counts().plot(kind='bar', ax=a, title='Survived')
plt.tight_layout(); plt.show()
From the comparison, we can see that:
- the distributions are not exactly the same, with two peaks slightly apart in the passenger fare.
- Class and how much was paid per passenger make a difference!
Although the number of survivors among the quartiles is approximately the same, as expected, when we look at passenger fares it becomes more apparent that the mortality rate is higher at the lower fares, since the top of Q4 of the dead is at the same height as the median plus a confidence interval of the fare paid by survivors.
- the number of outliers is lower in the fare per passenger, especially among survivors.
We cannot rule out these outliers, because there are cases of the same type in the test data set. In addition, these differences in values are probably due to first class tickets with additional fees for certain exclusive services and cargo.
Below, you can see that for the largest outlier everyone in the train data set survived, and there is one case to predict (Passenger Id 1235,
the matriarch of one son and two companions). Among all outlier cases of survivors, we see that all are first class; different from the largest outlier, 27% actually died, and we have 18 cases to predict.
print('Passengers with highest passenger fare:')
display(data[data.passenger_fare>120])
print('\nSurivived of passenger fare more than 50:\n',
pd.pivot_table(data.loc[data.passenger_fare>50, ['Pclass', 'Survived']], aggfunc=np.count_nonzero,
columns=['Survived'] , index=['Pclass']))
Note that if we leave it this way and the model succeeds in capturing this pattern of the largest outlier, we are again looking at a model at risk of overfitting (0.03% of cases).
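One way to hedge against that risk (not done in this notebook) would be to cap the extreme passenger fares, for example at the 99th percentile, so the model cannot key on a handful of samples. A minimal sketch, storing the result in a hypothetical capped column:
# Sketch only: winsorize passenger_fare at the 99th percentile (hypothetical column).
fare_cap = data.passenger_fare.quantile(0.99)
data['passenger_fare_capped'] = data.passenger_fare.clip(upper=fare_cap)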
Pclass¶
Notwithstanding the fact that class 3 has the greatest magnitude, as we saw with fare per passenger, we notice that the survival rate is greater with a greater fare per passenger. This suggests some socioeconomic discrimination. It is confirmed when we look at the data distribution over the classes and see in the percent bar chart a clearly aggressive decrease in survival rate from the first to the third class.
charts('Pclass', data[(data.Survived>=0)])
SibSp¶
charts('SibSp', data[(data.Survived>=0)])
Since more than 2 siblings has too few cases and the lowest survival rate, we can aggregate all these cases into a single bin in order to increase representativeness and lose the minimum of information.
data['SibSp_bin'] = data.SibSp.apply(lambda x: 6 if x > 2 else x)
charts('SibSp_bin', data[(data.Survived>=0)])
Parch¶
charts('Parch', data[data.Survived>=0])
As we did with siblings, we aggregate the Parch cases with 3 or more, even though 3 Parch has the highest survival rate.
data['Parch_bin'] = data.Parch.apply(lambda x: x if x< 3 else 4)
charts('Parch_bin', data[(data.Survived>=0)])
Family and non-relatives¶
If you investigate the data, you will notice that the total number of family members can be obtained by the sum of Parch and SibSp plus 1 (1 for the person of the respective record). So, let’s create the family feature and see what we get.
data['family'] = data.SibSp + data.Parch + 1
charts('family', data[data.Survived>=0])
As we can see, family groups of up to 4 people were more likely to survive than people without relatives on board.
However, from 5 family members we see a drastic fall, and the 7-member cases level with the passengers without family.
You may be led to think that this distortion clearly has some relation to social condition. Better to look at the right data!
charts('Pclass', data[(data.family>4) & (data.Survived>=0)])
Yes, we have more cases in the third class, but on the other hand what we see is that cases with more than 4 relatives were rarer. On a more careful look, you will see that from 6 family members on we only have third class (25 in training, 10 in test). So we confirmed that a large number of family members made a difference, yes, if you were from the upper classes.
You may have a feeling of déjà vu, and yes, this metric is very similar to one we have already created: the number of passengers with the same ticket.
So what’s the difference? Here we have only the number of people aboard with a family relationship plus the passenger; in the previous one we have people reportedly grouped, family members or not. So, in cases where relatives bought tickets separately, the family feature groups them while the ticket separates them. On the other hand, the family feature does not consider travelers with their non-family companions, employees or friends, while the other does.
With this, we can now obtain the number of fellows or companions per passenger: the number of non-relatives who traveled with the passenger.
data['non_relatives'] = data.qtd_same_ticket - data.family
charts('non_relatives', data[data.Survived>=0])
Here you see negative numbers: they occur when the family is larger than the ticket group, that is, when relatives traveled on separate tickets.
Sex¶
As everybody knows, in this case women had a significantly higher survival rate than men.
charts('Sex', data[(data.Survived>=0)])
Embarked¶
First, we check the 2 null Embarked cases to find the most likely pattern to consider when filling them with the respective mode.
Next, we take a look at the Embarked data. As we can see, the passengers that embarked at Cherbourg had the best survival rate, while most of the passengers embarked at Southampton and had the worst survival rate.
display(data[data.Embarked.isnull()])
data.loc[data.Embarked.isnull(), 'Embarked'] = data[(data.Cabin.str.match('B2')>0) & (data.Pclass==1)].Embarked.mode()[0]
charts('Embarked', data[(data.Survived>=0)])
Name¶
The Name feature has too much variance and is not significant by itself, but it has some valuable information to extract and check, such as:
- Personal Titles
- Existence of nicknames
- Existence of references to another person
- Family names
def Personal_Titles(df):
    df['Personal_Titles'] = df.Name.str.findall('Mrs\.|Mr\.|Miss\.|Maste[r]|Dr\.|Lady\.|Countess\.|'
                                                + 'Sir\.|Rev\.|Don\.|Major\.|Col\.|Jonkheer\.|'
                                                + 'Capt\.|Ms\.|Mme\.|Mlle\.').apply(lambda x: ''.join(map(str, x)).strip('[]'))
    df.Personal_Titles[df.Personal_Titles=='Mrs.'] = 'Mrs'
    df.Personal_Titles[df.Personal_Titles=='Mr.'] = 'Mr'
    df.Personal_Titles[df.Personal_Titles=='Miss.'] = 'Miss'
    df.Personal_Titles[df.Personal_Titles==''] = df[df.Personal_Titles==''].Sex.apply(lambda x: 'Mr' if (x=='male') else 'Mrs')
    df.Personal_Titles[df.Personal_Titles=='Mme.'] = 'Mrs'
    df.Personal_Titles[df.Personal_Titles=='Ms.'] = 'Mrs'
    df.Personal_Titles[df.Personal_Titles=='Lady.'] = 'Royalty'
    df.Personal_Titles[df.Personal_Titles=='Mlle.'] = 'Miss'
    df.Personal_Titles[(df.Personal_Titles=='Miss.') & (df.Age>-1) & (df.Age<15)] = 'Kid'
    df.Personal_Titles[df.Personal_Titles=='Master'] = 'Kid'
    df.Personal_Titles[df.Personal_Titles=='Don.'] = 'Royalty'
    df.Personal_Titles[df.Personal_Titles=='Jonkheer.'] = 'Royalty'
    df.Personal_Titles[df.Personal_Titles=='Capt.'] = 'Technical'
    df.Personal_Titles[df.Personal_Titles=='Rev.'] = 'Technical'
    df.Personal_Titles[df.Personal_Titles=='Sir.'] = 'Royalty'
    df.Personal_Titles[df.Personal_Titles=='Countess.'] = 'Royalty'
    df.Personal_Titles[df.Personal_Titles=='Major.'] = 'Technical'
    df.Personal_Titles[df.Personal_Titles=='Col.'] = 'Technical'
    df.Personal_Titles[df.Personal_Titles=='Dr.'] = 'Technical'
Personal_Titles(data)
display(pd.pivot_table(data[['Personal_Titles', 'Survived']], aggfunc=np.count_nonzero,
columns=['Survived'] , index=['Personal_Titles']).T)
charts('Personal_Titles', data[(data.Survived>=0)])
As you can see above, I opted to merge some titles, but for now I keep two small sets (Technical and Royalty), because there are interesting survival rate variations.
Next, we identify the names with mentions to other people or with nicknames and create a boolean feature.
data['distinction_in_name'] =\
((data.Name.str.findall('\(').apply(lambda x: ''.join(map(str, x)).strip('[]'))=='(')
| (data.Name.str.findall(r'"[A-z"]*"').apply(lambda x: ''.join(map(str, x)).strip('""'))!=''))
data.distinction_in_name = data.distinction_in_name.apply(lambda x: 1 if x else 0)
charts('distinction_in_name', data[(data.Survived>=0)])
It is interesting to note that those who have some type of reference or distinction in their names had a higher survival rate.
Next, we find 872 surnames in this dataset. Even grouping the single occurrences into one category, we have 229 surnames with more than one member. It is a huge and very sparse categorical feature. Most of the values have too few samples to be significant for most algorithms without risk of overfitting. In addition, there are 18 surname cases with more than one member that occur exclusively in the test data set.
So, we create this feature with the single-member surnames aggregated into one category, and use it with models that can handle it to check whether we get better results. Alternatively, we can use dimensionality reduction methods.
print('Total of different surnames aboard:',
      ((data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()>1).shape[0])
print('More than one person aboard with the same surname:',
      ((data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()>1).sum())
surnames = (data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()
data['surname'] = (data.Name.str.findall(r'[A-z]*\,').\
    apply(lambda x: ''.join(map(str, x)).strip(','))).apply(lambda x: x if surnames[x]>1 else 'Alone')
test_surnames = set(data.surname[data.Survived>=0].unique().tolist())
print('Surnames with more than one member aboard that happens only in the test data set:',
240-len(test_surnames))
train_surnames = set(data.surname[data.Survived<0].unique().tolist())
print('Surnames with more than one member aboard that happens only in the train data set:',
240-len(train_surnames))
both_surnames = test_surnames.intersection(train_surnames)
data.surname = data.surname.apply(lambda x : x if test_surnames.issuperset(set([x])) else 'Exclude')
del surnames, both_surnames, test_surnames, train_surnames
CabinByTicket = data.loc[~data.Cabin.isnull(), ['Ticket', 'Cabin']].groupby(by='Ticket').agg(min)
before = data.Cabin.isnull().sum()
data.loc[data.Cabin.isnull(), 'Cabin'] = data.loc[data.Cabin.isnull(), 'Ticket'].\
apply(lambda x: CabinByTicket[CabinByTicket.index==x].min())
print('Cabin nulls reduced:', (before - data.Cabin.isnull().sum()))
del CabinByTicket, before
data.Cabin[data.Cabin.isnull()] = 'N999'
data['Cabin_Letter'] = data.Cabin.str.findall('[^a-z]\d\d*')
data['Cabin_Number'] = data.apply(lambda x: 0 if len(str(x.Cabin))== 1 else int(int(x.Cabin_Letter[0][1:])/10), axis=1)
data.Cabin_Letter = data.apply(lambda x: x.Cabin if len(str(x.Cabin))== 1 else x.Cabin_Letter[0][0], axis=1)
display(data[['Fare', 'Cabin_Letter']].groupby(['Cabin_Letter']).agg([np.median, np.mean, np.count_nonzero, np.max, np.min]))
Cabin T does not exist in the test dataset. This passenger is from first class and his passenger fare is the same as that of 5 other first class passengers. So, we change it to ‘C’ to keep the same distribution among the six.
display(data[data.Cabin=='T'])
display(data.Cabin_Letter[data.passenger_fare==35.5].value_counts())
data.Cabin_Letter[data.Cabin_Letter=='T'] = 'C'
Fill the Cabin letter nulls of the third class with the most common pattern of the same passenger fare range, choosing ranges narrow enough to leave one or as few candidate patterns as possible.
data.loc[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()
data.loc[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()
data.loc[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()
data.loc[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
Fill the Cabin letter nulls of the second class with the most common pattern of the same passenger fare range, following the same approach.
data.loc[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
Fill the Cabin letter nulls of the first class with the most common pattern of the same passenger fare range, following the same approach.
data.loc[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
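The repeated fill pattern above could be factored into a small helper. This is only a sketch of how the same logic might be parameterized (the fare ranges, and the choice of mode vs. min for the cabin number in each range, would still have to be supplied by hand); it is not used below:
# Sketch only: parameterize the repeated Cabin fill logic used above.
def fill_cabin_by_fare(df, pclass, fare_min, fare_max, number_agg='mode'):
    known = ((df.passenger_fare >= fare_min) & (df.passenger_fare < fare_max) &
             (df.Pclass == pclass) & (df.Cabin != 'N999'))
    unknown = ((df.passenger_fare >= fare_min) & (df.passenger_fare < fare_max) &
               (df.Pclass == pclass) & (df.Cabin == 'N999'))
    if known.sum() == 0 or unknown.sum() == 0:
        return df  # nothing to learn from or nothing to fill in this range
    df.loc[unknown, 'Cabin_Letter'] = df.loc[known, 'Cabin_Letter'].mode()[0]
    if number_agg == 'mode':
        df.loc[unknown, 'Cabin_Number'] = df.loc[known, 'Cabin_Number'].mode()[0]
    else:
        df.loc[unknown, 'Cabin_Number'] = df.loc[known, 'Cabin_Number'].min()
    return df
# Example usage, equivalent to one of the second class fills above:
# data = fill_cabin_by_fare(data, pclass=2, fare_min=8.59, fare_max=10.5)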
As you can see below, we achieved a good result after filling the nulls, but we need to be careful since the feature originally had many nulls. In addition, the cabin may actually have made more difference in the deaths caused by the impact than among those who drowned.
charts('Cabin_Letter', data[(data.Survived>=0)])
Rescue of family relationships¶
After some work, we notice that it is difficult to understand SibSp and Parch in isolation, and it is difficult to extract family relationships directly from this data without a closer look.
So, in this configuration we do not have clear family relationships, and this information is essential for imputing the null ages with a better distribution and accuracy.
Let’s start to rescue:
For the first treatment, I discovered when checking the results that I had not applied any relationship to one case. Looking at the details below, we can see that it is a family with more than one ticket whose son has no age. So, I manually set this one case as a son, since the other members are the father and the mother, and the son has the pattern SibSp 0 and Parch 2.
display(data[data.Name.str.findall('Bourke').apply(lambda x: ''.join(map(str, x)).strip('[]'))=='Bourke'])
family_w_age = data.Ticket[(data.Parch>0) & (data.SibSp>0) & (data.Age.isnull())].unique().tolist()
data[data.Ticket.isin(family_w_age)].sort_values('Ticket')
data['sons'] = data.apply(lambda x : \
1 if ((x.Ticket in (['2661', '2668', 'A/5. 851', '4133'])) & (x.SibSp>0)) else 0, axis=1)
data.sons += data.apply(lambda x : \
1 if ((x.Ticket in (['CA. 2343'])) & (x.SibSp>1)) else 0, axis=1)
data.sons += data.apply(lambda x : \
1 if ((x.Ticket in (['W./C. 6607'])) & (x.Personal_Titles not in (['Mr', 'Mrs']))) else 0, axis=1)
data.sons += data.apply(lambda x: 1 if ((x.Parch>0) & (x.Age>=0) & (x.Age<20)) else 0, axis=1)
data.sons.loc[data.PassengerId==594] = 1 # Son with a different pattern (family with two tickets)
data.sons.loc[data.PassengerId==1252] = 1 # Case of 'CA. 2343' and last rule
data.sons.loc[data.PassengerId==1084] = 1 # Case of 'A/5. 851' and last rule
data.sons.loc[data.PassengerId==1231] = 1 # Case of 'A/5. 851' and last rule
charts('sons', data[(data.Survived>=0)])
We observe that only 12.1% are sons, and they had a better survival rate than the others.
Next, we rescue the parents, and check cases where we have both (mother and father), and cases where we have only one aboard.
data['parents'] = data.apply(lambda x : \
1 if ((x.Ticket in (['2661', '2668', 'A/5. 851', '4133'])) & (x.SibSp==0)) else 0, axis=1)
data.parents += data.apply(lambda x : \
1 if ((x.Ticket in (['CA. 2343'])) & (x.SibSp==1)) else 0, axis=1)
data.parents += data.apply(lambda x : 1 if ((x.Ticket in (['W./C. 6607'])) & (x.Personal_Titles in (['Mr', 'Mrs']))) \
else 0, axis=1)
# Identify parents and care nulls ages
data.parents += data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp>0) & (x.Age>19) & (x.Age<=45) ) else 0, axis=1)
charts('parents', data[(data.Survived>=0)])
data['parent_alone'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp==0) & (x.Age>19) & (x.Age<=45) ) else 0, axis=1)
charts('parent_alone', data[(data.Survived>=0)])
We can see that the two cases are very similar, and it is not significant to keep these two pieces of information separately.
Before putting them together, as I had learned when assembling the sons, I made a visual inspection and discovered some cases of sons and parents that required different assignment rules. Since I did this visually and it is not a rule for a pipeline, I proceeded with the settings manually.
t_p_alone = data.Ticket[data.parent_alone==1].tolist()
data[data.Ticket.isin(t_p_alone)].sort_values('Ticket')[96:]
data.parent_alone.loc[data.PassengerId==141] = 1
data.parent_alone.loc[data.PassengerId==541] = 0
data.sons.loc[data.PassengerId==541] = 1
data.parent_alone.loc[data.PassengerId==1078] = 0
data.sons.loc[data.PassengerId==1078] = 1
data.parent_alone.loc[data.PassengerId==98] = 0
data.sons.loc[data.PassengerId==98] = 1
data.parent_alone.loc[data.PassengerId==680] = 0
data.sons.loc[data.PassengerId==680] = 1
data.parent_alone.loc[data.PassengerId==915] = 0
data.sons.loc[data.PassengerId==915] = 1
data.parent_alone.loc[data.PassengerId==333] = 0
data.sons.loc[data.PassengerId==333] = 1
data.parent_alone.loc[data.PassengerId==119] = 0
data.sons[data.PassengerId==119] = 1
data.parent_alone.loc[data.PassengerId==319] = 0
data.sons.loc[data.PassengerId==319] = 1
data.parent_alone.loc[data.PassengerId==103] = 0
data.sons.loc[data.PassengerId==103] = 1
data.parents.loc[data.PassengerId==154] = 0
data.sons.loc[data.PassengerId==1084] = 1
data.parents.loc[data.PassengerId==581] = 0
data.sons.loc[data.PassengerId==581] = 1
data.parent_alone.loc[data.PassengerId==881] = 0
data.sons.loc[data.PassengerId==881] = 1
data.parent_alone.loc[data.PassengerId==1294] = 0
data.sons.loc[data.PassengerId==1294] = 1
data.parent_alone.loc[data.PassengerId==378] = 0
data.sons.loc[data.PassengerId==378] = 1
data.parent_alone.loc[data.PassengerId==167] = 1
data.parent_alone.loc[data.PassengerId==357] = 0
data.sons.loc[data.PassengerId==357] = 1
data.parent_alone.loc[data.PassengerId==918] = 0
data.sons.loc[data.PassengerId==918] = 1
data.parent_alone.loc[data.PassengerId==1042] = 0
data.sons.loc[data.PassengerId==1042] = 1
data.parent_alone.loc[data.PassengerId==540] = 0
data.sons.loc[data.PassengerId==540] = 1
data.parents += data.parent_alone
charts('parents', data[(data.Survived>=0)])
Next, we rescue the grandparents and the grandparents alone. We found the same situation with fewer cases, and decided to put all parents and grandparents into one feature and leave it to age to distinguish them.
data['grandparents'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp>0) & (x.Age>19) & (x.Age>45) ) else 0, axis=1)
charts('grandparents', data[(data.Survived>=0)])
data['grandparent_alone'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp==0) & (x.Age>45) ) else 0, axis=1)
charts('grandparent_alone', data[(data.Survived>=0)])
data.parents += data.grandparent_alone + data.grandparents
charts('parents', data[(data.Survived>=0)])
Next, we identify the relatives aboard:
data['relatives'] = data.apply(lambda x: 1 if ((x.SibSp>0) & (x.Parch==0)) else 0, axis=1)
charts('relatives', data[(data.Survived>=0)])
And then the companions: people who traveled with a family but have no family relationship with them.
data['companions'] = data.apply(lambda x: 1 if ((x.SibSp==0) & (x.Parch==0) & (x.same_tckt==1)) else 0, axis=1)
charts('companions', data[(data.Survived>=0)])
Finally, we rescue the passengers that traveled alone.
data['alone'] = data.apply(lambda x: 1 if ((x.SibSp==0) & (x.Parch==0) & (x.same_tckt==0)) else 0, axis=1)
charts('alone', data[(data.Survived>=0)])
As we can see, people with a family relationship, or even just companions, had better and very similar survival rates compared with people who traveled alone.
Now we can work on the null age issues and then on the age information itself.
Age¶
We start with the number of null cases by Survived, to remember that it is too high.
Then we plot the distributions of Age, to check how well it fits a normal distribution and to see the distortion introduced when a single value (zero) is applied to the null cases.
Next, we make a scatter plot of Age and siblings, and see that age decreases as the number of siblings increases, but with a great range.
fig = plt.figure(figsize=(20, 10))
fig1 = fig.add_subplot(221)
g = sns.distplot(data.Age.fillna(0), fit=norm, label='Nulls as Zero')
g = sns.distplot(data[~data.Age.isnull()].Age, fit=norm, label='Without Nulls')
plt.legend(loc='upper right')
print('Survived without Age:')
display(data[data.Age.isnull()].Survived.value_counts())
fig2 = fig.add_subplot(222)
g = sns.scatterplot(data = data[(~data.Age.isnull())], y='Age', x='SibSp', hue='Survived')
From the tables below, we can see that our effort to extract Personal Titles and rescue family relationships produces better medians to apply to the null ages.
print('Mean and median ages by siblings:')
data.loc[data.Age.isnull(), 'Age'] = -1
display(data.loc[(data.Age>=0), ['SibSp', 'Age']].groupby('SibSp').agg([np.mean, np.median]).T)
print('\nMedian ages by Personal_Titles:')
Ages = { 'Age' : {'median'}}
display(data[data.Age>=0][['Age', 'Personal_Titles', 'parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone']].\
groupby('Personal_Titles').agg(Ages).T)
print('\nMedian ages by Personal Titles and Family Relationships:')
display(pd.pivot_table(data[data.Age>=0][['Age', 'Personal_Titles', 'parents', 'grandparents',
'sons', 'relatives', 'companions','alone']],
aggfunc=np.median,
index=['parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone'] ,
columns=['Personal_Titles']))
print('\nNulls ages by Personal Titles and Family Relationships:')
display(data[data.Age<0][['Personal_Titles', 'parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone']].\
groupby('Personal_Titles').agg([sum]))
So, we apply to the null ages the respective median of the same personal title and same family relationship; but before that, we create a binary feature to keep the information about the presence of nulls.
data['Without_Age'] = data.Age.apply(lambda x: 0 if x>0 else 1)
data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Miss')] = \
data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Miss')].median()
data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Mr')] = \
data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Mr')].median()
data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Mrs')] = \
data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Mrs')].median()
data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Kid')] = \
data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Kid')].median()
data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Technical')] = \
data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Technical')].median()
data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Miss')] = \
data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Miss')].median()
data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Mr')] = \
data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Mr')].median()
data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Mrs')] = \
data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Mrs')].median()
data.Age.loc[(data.Age<0) & (data.parents==1) & (data.Personal_Titles=='Mr')] = \
data.Age[(data.Age>=0) & (data.parents==1) & (data.Personal_Titles=='Mr')].median()
data.Age.loc[(data.Age<0) & (data.parents==1) & (data.Personal_Titles=='Mrs')] = \
data.Age[(data.Age>=0) & (data.parents==1) & (data.Personal_Titles=='Mrs')].median()
data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Kid')] = \
data.Age[(data.Age>=0) & (data.Personal_Titles=='Kid')].median()
data.Age.loc[(data.Age.isnull()) & (data.sons==1) & (data.Personal_Titles=='Kid')] = \
data.Age[(data.Age>=0) & (data.Personal_Titles=='Kid')].median()
data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Miss')] = \
data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Miss')].median()
data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Mr')] = \
data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Mr')].median()
data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Mrs')] = \
data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Mrs')].median()
data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Miss')] = \
data.Age[(data.Age>=0) & (data.relatives==1) & (data.Personal_Titles=='Miss')].median()
data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Mr')] = \
data.Age[(data.Age>=0) & (data.relatives==1) & (data.Personal_Titles=='Mr')].median()
data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Mrs')] = \
data.Age[(data.Age>=0) & (data.relatives==1) & (data.Personal_Titles=='Mrs')].median()
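For reference, the manual fills above could be expressed more compactly with a groupby/transform over the personal title and relationship flags. A sketch of that alternative (stored in a hypothetical Age_alt_fill column so it does not overwrite the fills above; run before the manual fills it would replace them, and the medians may differ slightly):
# Sketch only: median age per Personal_Titles and relationship-flag group.
group_keys = ['Personal_Titles', 'parents', 'sons', 'relatives', 'companions', 'alone']
age_tmp = data.Age.where(data.Age >= 0)                      # treat the -1 marker as missing
group_median = age_tmp.groupby([data[k] for k in group_keys]).transform('median')
data['Age_alt_fill'] = age_tmp.fillna(group_median)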
Finally, we check how the age distribution looks after filling the nulls.
print('Age correlation with survived:',data.corr()['Survived'].Age)
g = sns.distplot(data.Age, fit=norm, label='With nulls filled')
plt.legend(loc='upper right')
plt.show()
To have a better understanding of age, its proportions and its relation to the survival rate, we bin it as follows:
def binningAge(df):
    # Binning Age based on custom ranges
    bin_ranges = [0, 1.7, 8, 15, 18, 25, 55, 65, 100]
    bin_names = [0, 1, 2, 3, 4, 5, 6, 7]
    df['Age_bin_custom_range'] = pd.cut(np.array(df.Age), bins=bin_ranges)
    df['Age_bin_custom_label'] = pd.cut(np.array(df.Age), bins=bin_ranges, labels=bin_names)
    return df
data = binningAge(data)
display(data[['Age', 'Age_bin_custom_range', 'Age_bin_custom_label']].sample(5))
display(pd.pivot_table(data[['Age_bin_custom_range', 'Survived']], aggfunc=np.count_nonzero,
index=['Survived'] , columns=['Age_bin_custom_range']))
charts('Age_bin_custom_label', data[(data.Survived>=0)])
One hot encode and drop provisional and useless features¶
One hot encode the categorical, non-ordinal data and drop the useless features.
data['genre'] = data.Sex.apply(lambda x: 1 if x=='male' else 0)
data.drop(['Name', 'Cabin', 'Ticket', 'Sex', 'same_tckt', 'qtd_same_ticket', 'parent_alone', 'grandparents',
'grandparent_alone', 'Age_bin_custom_range'], axis=1, inplace=True) # , 'Age', 'Parch', 'SibSp',
data = pd.get_dummies(data, columns = ['Cabin_Letter', 'Personal_Titles', 'Embarked', 'distinction_in_tikect'])
data = pd.get_dummies(data, columns = ['surname']) # 'Age_bin_custom_label'
data.drop(['surname_Exclude'], axis=1, inplace=True)
SciPy's pearsonr method computes both the correlation and the p-value for the correlation, roughly showing the probability of an uncorrelated system creating a correlation value of this magnitude.
from scipy.stats import pearsonr
pearsonr(data.loc[:, 'Pclass'], data.Survived)
Select Features¶
All of the features we find in the dataset might not be useful in building a machine learning model to make the necessary prediction. Using some of the features might even make the predictions worse.
Often in data science we have hundreds or even millions of features and we want a way to create a model that only includes the most important features. This has several benefits:
- It reduces the variance of the model, and therefore overfitting.
- It reduces the complexity of a model and makes it easier to interpret.
- It improves the accuracy of a model if the right subset is chosen.
- Finally, it reduces the computational cost (and time) of training a model.
So, an alternative way to reduce the complexity of the model and avoid overfitting is dimensionality reduction via feature selection, which is especially useful for unregularized models. There are two main categories of dimensionality reduction techniques: feature selection and feature extraction. Using feature selection, we select a subset of the original features. In feature extraction, we derive information from the feature set to construct a new feature subspace.
There are various methodologies and techniques that you can use to subset your feature space and help your models perform better and more efficiently. So, let's get started.
First check for any correlations between features¶
Correlation is a statistical term which in common usage refers to how close two features are to having a linear relationship with each other. Pearson's correlation measures the linear correlation between two features; the resulting value lies in [-1, 1], with -1 meaning perfect negative correlation (as one feature increases, the other decreases), +1 meaning perfect positive correlation and 0 meaning no linear correlation between the two features.
Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of the two features.
There are five assumptions that are made with respect to Pearson’s correlation:
- The feature must be either interval or ratio measurements.
- The variables must be approximately normally distributed.
- There is a linear relationship between the two variables.
- Outliers are either kept to a minimum or are removed entirely
- There is homoscedasticity of the data. Homoscedasticity basically means that the variances along the line of best fit remain similar as you move along the line.
One obvious drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to a linear relationship. If the relation is non-linear, Pearson correlation can be close to zero even if there is a one-to-one correspondence between the two variables. For example, the correlation between x and x² is zero when x is centered on 0.
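A quick numerical check of this claim (a minimal sketch using SciPy's pearsonr):
import numpy as np
from scipy.stats import pearsonr

x = np.linspace(-1, 1, 201)      # values centered on 0
r, p = pearsonr(x, x ** 2)       # perfect non-linear dependence, no linear one
print(round(r, 6))               # ~0.0: Pearson sees no linear relationship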
Furthermore, relying only on the correlation value on interpreting the relationship of two variables can be highly misleading, so it is always worth plotting the data as we did on the EDA phase.
The following guidelines help when interpreting Pearson's correlation coefficient (r):
Strength of Association | r (Positive) | r (Negative) |
---|---|---|
Small | 0.1 to 0.3 | -0.1 to -0.3 |
Medium | 0.3 to 0.5 | -0.3 to -0.5 |
Large | 0.5 to 1.0 | -0.5 to -1.0 |
The correlation matrix is identical to a covariance matrix computed from standardized data. The correlation matrix is a square matrix that contains the Pearson product-moment correlation coefficients (often abbreviated as Pearson’s r), which measure the linear dependence between pairs of features. Pearson’s correlation coefficient can simply be calculated as the covariance between two features x and y (numerator) divided by the product of their standard deviations (denominator):
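In symbols (a reconstruction of the formula this sentence refers to):
$$ r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x\,\sigma_y} = \frac{\sum_{i}(x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i}(x_i - \mu_x)^2}\,\sqrt{\sum_{i}(y_i - \mu_y)^2}} $$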
The covariance between standardized features is in fact equal to their linear correlation coefficient.
Let's check the highest correlations with Survived. I will now create a correlation matrix to quantify the linear relationships between the features. To do this I use pandas' corr and seaborn's heatmap function to plot the correlation matrix as a heat map.
corr = data.loc[:, 'Survived':].corr()
top_corr_cols = corr[abs(corr.Survived)>=0.06].Survived.sort_values(ascending=False).keys()
top_corr = corr.loc[top_corr_cols, top_corr_cols]
dropSelf = np.zeros_like(top_corr)
dropSelf[np.triu_indices_from(dropSelf)] = True
plt.figure(figsize=(15, 15))
sns.heatmap(top_corr, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=True, fmt=".2f", mask=dropSelf)
sns.set(font_scale=0.8)
plt.show()
Let's see whether we have more surnames with a correlation to Survived between 0.05 and 0.06.
display(corr[(abs(corr.Survived)>=0.05) & (abs(corr.Survived)<0.06)].Survived.sort_values(ascending=False).keys())
del corr, dropSelf, top_corr
Drop the features with highest correlations to other Features:¶
Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. And as you can see above, it is easy to find the highest collinearities (Personal_Titles_Mrs, Personal_Titles_Mr and Fare).
You should always be concerned about the collinearity, regardless of the model/method being linear or not, or the main task being prediction or classification.
Assume a number of linearly correlated covariates/features present in the data set and Random Forest as the method. Obviously, random selection per node may pick only (or mostly) collinear features which may/will result in a poor split, and this can happen repeatedly, thus negatively affecting the performance.
Now, the collinear features may be less informative of the outcome than the other (non-collinear) features and as such they should be considered for elimination from the feature set anyway. However, assume that the features are ranked high in the ‘feature importance’ list produced by RF. As such they would be kept in the data set unnecessarily increasing the dimensionality. So, in practice, I’d always, as an exploratory step (out of many related) check the pairwise association of the features, including linear correlation.
Identify and treat multicollinearity:¶
Multicollinearity is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model, leading to unreliable and unstable estimates of regression coefficients. To make matters worse, multicollinearity can emerge even when isolated pairs of variables are not collinear.
To identify it, we start with the coefficient of determination, r², which is the square of the Pearson correlation coefficient r. The coefficient of determination, with respect to correlation, is the proportion of the variance that is shared by both variables. It gives a measure of the amount of variation that can be explained by the model (the correlation is the model). It is sometimes expressed as a percentage (e.g., 36% instead of 0.36) when we discuss the proportion of variance explained by the correlation. However, you should not write r² = 36%, or any other percentage; you should write it as a proportion (e.g., r² = 0.36).
The Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. It may be calculated for each predictor by regressing that predictor on all the other predictors and obtaining the R² of that regression; the VIF is then 1/(1 - R²), i.e. the ratio of the variance of a given coefficient in the full model to the variance it would have if that predictor were completely uncorrelated with the others. Thus, a VIF of 1.8 tells us that the variance (the square of the standard error) of a particular coefficient is 80% larger than it would be if that predictor were completely uncorrelated with all the other predictors. The VIF has a lower bound of 1 but no upper bound. Authorities differ on how high the VIF has to be to constitute a problem (e.g. 2.50 (R² equal to 0.6), sometimes 5 (R² equal to 0.8), or greater than 10 (R² equal to 0.9), and so on).
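In symbols, for predictor j regressed on all the other predictors (with R_j² the R-squared of that auxiliary regression):
$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} $$
So, for example, R_j² ≈ 0.44 gives the VIF of about 1.8 mentioned above.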
But there are several situations in which multicollinearity can be safely ignored:
- Interaction terms and higher-order terms (e.g., squared and cubed predictors) are correlated with main effect terms because they include the main effect terms. Oops! Sometimes we do use polynomials to solve problems, indeed! But keep calm: in these cases, standardizing the predictors removes the multicollinearity.
- Indicator variables, like dummies or one-hot encodings, that represent a categorical variable with three or more categories. If the proportion of cases in the reference category is small, the indicators will necessarily have high VIFs, even if the categorical variable is not associated with other variables in the regression model. But you still need to check whether some dummy is collinear or multicollinear with features outside of its own dummy group.
- Control features, if the features of interest do not have high VIFs. Here's the thing about multicollinearity: it's only a problem for the features that are collinear. It increases the standard errors of their coefficients, and it may make those coefficients unstable in several ways. But as long as the collinear features are only used as control features, and they are not collinear with your features of interest, there's no problem: the coefficients of the features of interest are not affected, and the performance of the control features as controls is not impaired.
So, generally, we could run the same model twice, once with severe multicollinearity and once with moderate multicollinearity. This provides a great head-to-head comparison and it reveals the classic effects of multicollinearity. However, when standardizing your predictors doesn’t work, you can try other solutions such as:
- removing highly correlated predictors
- linearly combining predictors, such as adding them together
- running entirely different analyses, such as partial least squares regression or principal components analysis
When considering a solution, keep in mind that all remedies have potential drawbacks. If you can live with less precise coefficient estimates, or a model that has a high R-squared but few significant predictors, doing nothing can be the correct decision because it won’t impact the fit.
Given the potential for correlation among the predictors, we'll display the variance inflation factors (VIF), which indicate the extent to which multicollinearity is present in a regression analysis. Variables with a high VIF need to be removed from the model, and deleting one variable at a time and then re-checking the VIF of the model is the best way to do this.
So, I start the analysis by removing the 3 features with the highest collinearities, plus the surnames other than my control surname_Alone whose correlation with Survived is below 0.05, and then run the VIF.
#Step 1: Remove the higest correlations and run a multiple regression
cols = [ 'family',
'non_relatives',
'surname_Alone',
'surname_Baclini',
'surname_Carter',
'surname_Richards',
'surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
'surname_Moor', 'surname_Chambers', 'surname_Hamalainen',
'surname_Dick', 'surname_Taylor', 'surname_Doling', 'surname_Gordon',
'surname_Beane', 'surname_Hippach', 'surname_Bishop',
'surname_Mellinger', 'surname_Yarred',
'Pclass',
'Age',
'SibSp',
'Parch',
#'Fare',
'qtd_same_ticket_bin',
'passenger_fare',
#'SibSp_bin',
#'Parch_bin',
'distinction_in_name',
'Cabin_Number',
'sons',
'parents',
'relatives',
'companions',
'alone',
'Without_Age',
'Age_bin_custom_label',
'genre',
'Cabin_Letter_A',
'Cabin_Letter_B',
'Cabin_Letter_C',
'Cabin_Letter_D',
'Cabin_Letter_E',
'Cabin_Letter_F',
'Cabin_Letter_G',
'Personal_Titles_Kid',
'Personal_Titles_Miss',
#'Personal_Titles_Mr',
#'Personal_Titles_Mrs',
'Personal_Titles_Royalty',
'Personal_Titles_Technical',
'Embarked_C',
'Embarked_Q',
'Embarked_S',
'distinction_in_tikect_High',
'distinction_in_tikect_Low',
'distinction_in_tikect_Others',
'distinction_in_tikect_PC'
]
y_train = data.Survived[data.Survived>=0]
scale = StandardScaler(with_std=False)
df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
features = "+".join(cols)
df2 = pd.concat([y_train, df], axis=1)
# get y and X dataframes based on this regression:
y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')
#Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
#Step 3: Inspect VIF Factors
display(vif.sort_values('VIF Factor'))
From the results, I conclude that we can safely keep the Embarked dummies, but we need to work on the remaining features whose VIF is reported as inf. You can see that surname_Alone has a VIF of 2.2; we're going to treat it as our baseline and exclude it from our fit. This is done to prevent multicollinearity, or the dummy variable trap, caused by including a dummy variable for every single category. Let's try removing the dummy alone, which is pretty similar, and check whether this solves the other dummies from its category:
#Step 1: Remove one feature with VIF on Inf from the same category and run a multiple regression
cols.remove('alone')
y_train = data.Survived[data.Survived>=0]
scale = StandardScaler(with_std=False)
df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
features = "+".join(cols)
df2 = pd.concat([y_train, df], axis=1)
# get y and X dataframes based on this regression:
y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')
#Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
#Step 3: Inspect VIF Factors
display(vif.sort_values('VIF Factor'))
To solve Cabin_Letter, we can try removing only the lowest-frequency letter 'A' and see whether we can accept the VIFs of the other cabin letters:
#Step 1: Remove one feature with VIF on Inf from the same category and run a multiple regression
cols.remove('Cabin_Letter_A')
y_train = data.Survived[data.Survived>=0]
scale = StandardScaler(with_std=False)
df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
features = "+".join(cols)
df2 = pd.concat([y_train, df], axis=1)
# get y and X dataframes based on this regression:
y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')
#Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
#Step 3: Inspect VIF Factors
display(vif.sort_values('VIF Factor'))
Now our focus is on distinction_in_tikect: since 'High' has fewer observations, let's try dropping it (the Parch and SibSp bins were already left out of the feature list above).
cols.remove('distinction_in_tikect_High')
y_train = data.Survived[data.Survived>=0]
scale = StandardScaler(with_std=False)
df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
features = "+".join(cols)
df2 = pd.concat([y_train, df], axis=1)
# get y and X dataframes based on this regression:
y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')
#Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
#Step 3: Inspect VIF Factors
display(vif.sort_values('VIF Factor'))
As we can see, we now have to remove one of family, Parch and SibSp. Note that non_relatives and qtd_same_ticket_bin already have relatively acceptable VIFs; the first is directly calculated from family and the second is very close to it, as we have seen. So let's discard family.
cols.remove('family')
y_train = data.Survived[data.Survived>=0]
scale = StandardScaler(with_std=False)
df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
features = "+".join(cols)
df2 = pd.concat([y_train, df], axis=1)
# get y and X dataframes based on this regression:
y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')
#Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
#Step 3: Inspect VIF Factors
display(vif.sort_values('VIF Factor'))
Yes, these VIFs are acceptable, and we can proceed to the next step.
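As a side note, since we ran the same three steps four times, a small helper function keeps this iterative removal tidy (a sketch, reusing the data, cols, StandardScaler, dmatrices and variance_inflation_factor already in scope above):
def run_vif(cols):
    # Center the training rows, rebuild the design matrix and return the sorted VIFs
    y_train = data.Survived[data.Survived >= 0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived >= 0, cols]), columns=cols)
    df2 = pd.concat([y_train, df], axis=1)
    y, X = dmatrices('Survived ~' + "+".join(cols), data=df2, return_type='dataframe')
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    return vif.sort_values('VIF Factor')

# e.g.: cols.remove('family'); display(run_vif(cols))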
Feature Selection by Filter Methods¶
Filter methods use statistical measures to evaluate a subset of features and are generally used as a preprocessing step. These methods are also known as univariate feature selection: they examine each feature individually to determine the strength of its relationship with the dependent variable. They are simple to run and understand and are in general particularly good for gaining a better understanding of the data, but not necessarily for optimizing the feature set for better generalization.
So, the features are selected on the basis of their scores in various statistical tests of their relationship with the outcome variable. 'Correlation' is a loose term here; for basic guidance, you can refer to the following table for choosing the appropriate test.
Feature/Response | Continuous | Categorical |
---|---|---|
Continuous | Pearson’s Correlation | LDA |
Categorical | Anova | Chi-Square |
One thing that should be kept in mind is that filter methods do not remove multicollinearity. So, you must deal with multicollinearity of features as well before training models for your data.
There are a lot of different options for univariate selection. Some examples are:
- Model Based Ranking
- Mutual information and maximal information coefficient (MIC).
I did not approach the latter, because there has been some critique about MIC’s statistical power, i.e. the ability to reject the null hypothesis when the null hypothesis is false. This may or may not be a concern, depending on the particular dataset and its noisiness. If you have interest on this, in python, MIC is available in the minepy library.
Feature Selection by Model based ranking¶
We can use an arbitrary machine learning method to build a predictive model for the response variable using each individual feature, and measure the performance of each model.
In fact, this is already put to use with Pearson's correlation coefficient, since it is equivalent to the standardized regression coefficient used for prediction in linear regression. But this method is not good for selecting features with a non-linear relation to the dependent variable. For this there are a number of alternatives, for example tree-based methods (decision trees, random forests), linear models with basis expansion, etc. Tree-based methods are probably among the easiest to apply, since they can model non-linear relations well and don't require much tuning. The main thing to avoid is overfitting, so the depth of the tree(s) should be relatively small, and cross-validation should be applied.
rf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=101)
scores = []
for i in range(df.shape[1]):
    score = cross_val_score(rf, df.iloc[:, i:i+1], y_train, scoring="accuracy", cv=10)
    scores.append((round(np.mean(score), 3), cols[i]))
MBR = pd.DataFrame(sorted(scores, reverse=True), columns=['Score', 'Feature'])
g = MBR.iloc[:15, :].plot(x='Feature', kind='barh', figsize=(20,10), fontsize=12, grid=True)
plt.show()
MBR = MBR.iloc[:15, 1]
Feature Selection by SelectKBest:¶
In scikit-learn we find a variety of implementations oriented to classification tasks that select features according to the k highest scores; see some of them below:
- f_classif computes the ANOVA F-value for the provided sample.
- chi2 computes chi-squared stats between each non-negative feature and class.
- mutual_info_classif estimates mutual information for a discrete target variable.
The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
Another important point: if you use sparse data (for example, if we keep the one-hot encoding of surnames), chi2 and mutual_info_classif will deal with the data without making it dense.
Let’s see the SelectKBest of f_classif and chi2 for our data:
cols = pd.Index(cols)
skb = SelectKBest(score_func=f_classif, k=10)
skb.fit(df, y_train)
select_features_kbest = skb.get_support()
feature_f_clas = cols[select_features_kbest]
feature_f_clas_scores = [(item, score) for item, score in zip(cols, skb.scores_)]
print('Total features selected by f_classif Statistical Methods',len(feature_f_clas))
fig = plt.figure(figsize=(20,7))
f1 = fig.add_subplot(121)
g = pd.DataFrame(sorted(feature_f_clas_scores, key=lambda x: -x[1])[:len(feature_f_clas)], columns=['Feature','F-Class Score']).\
plot(x='Feature', kind='barh', title= 'F Class Score', fontsize=18, ax=f1, grid=True)
scale = MinMaxScaler()
df2 = scale.fit_transform(data.loc[data.Survived>=0, cols])
skb = SelectKBest(score_func=chi2, k=10)
skb.fit(df2, y_train)
select_features_kbest = skb.get_support()
feature_chi2 = cols[select_features_kbest]
feature_chi2_scores = [(item, score) for item, score in zip(cols, skb.scores_)]
print('Total features selected by chi2 Statistical Methods',len(feature_chi2))
f2 = fig.add_subplot(122)
g = pd.DataFrame(sorted(feature_chi2_scores, key=lambda x: -x[1])[:len(feature_chi2)], columns=['Feature','Chi2 Score']).\
plot(x='Feature', kind='barh', title= 'Chi2 Score', fontsize=18, ax=f2, grid=True)
SMcols = set(feature_f_clas).union(set(feature_chi2))
print("Extra features select by f_class:\n", set(feature_f_clas).difference(set(feature_chi2)), '\n')
print("Extra features select by chi2:\n", set(feature_chi2).difference(set(feature_f_clas)), '\n')
print("Intersection features select by f_class and chi2:\n",set(feature_f_clas).intersection(set(feature_chi2)), '\n')
print('Total number of features selected:', len(SMcols))
print(SMcols)
plt.tight_layout(); plt.show()
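For completeness, mutual_info_classif plugs into SelectKBest in exactly the same way (a quick sketch reusing df, cols and y_train from above):
from sklearn.feature_selection import SelectKBest, mutual_info_classif

skb_mi = SelectKBest(score_func=mutual_info_classif, k=10)
skb_mi.fit(df, y_train)
feature_mi = cols[skb_mi.get_support()]
print('Total features selected by mutual information:', len(feature_mi))
print(list(feature_mi))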
Wrapper Methods¶
In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem.
The two main disadvantages of these methods are :
- The increasing overfitting risk when the number of observations is insufficient.
- These methods are usually computationally very expensive.
Backward Elimination¶
In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no improvement is observed on removing a feature.
Below we will see two raw implementations of backward elimination: one that selects by p-values and another based on the accuracy of a model we submit to it.
Backward Elimination By P-values¶
The p-value, or probability value, or asymptotic significance, is the probability, for a given statistical model and assuming the null hypothesis is true, of obtaining a statistical summary (such as a test statistic) greater than or equal in magnitude to the observed results.
The null hypothesis is a general statement that there is no relationship between two measured phenomena.
For example, if the correlation is very small and, furthermore, the p-value is high, it is very likely that such a correlation would be observed in a dataset of this size purely by chance.
But you need to be careful how you interpret the statistical significance of a correlation. If your correlation coefficient has been determined to be statistically significant this does not mean that you have a strong association. It simply tests the null hypothesis that there is no relationship. By rejecting the null hypothesis, you accept the alternative hypothesis that states that there is a relationship, but with no information about the strength of the relationship or its importance.
Since removal of different features from the dataset will have different effects on the p-value for the dataset, we can remove different features and measure the p-value in each case. These measured p-values can be used to decide whether to keep a feature or not.
Next we fit a logit regression to check the result and select features based on their p-values:
logit_model=sm.Logit(y_train,df)
result=logit_model.fit(method='bfgs', maxiter=2000)
print(result.summary())
As expected, the p-values of the dummies are high. As before, I exclude the features with the highest p-values one by one and re-run until only p-values below 0.1 remain, but here I use a backward elimination process.
pv_cols = cols.values
def backwardElimination(x, Y, sl, columns):
    numVars = x.shape[1]
    for i in range(0, numVars):
        regressor = sm.Logit(Y, x).fit(method='bfgs', maxiter=2000, disp=False)
        maxVar = max(regressor.pvalues)  #.astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor.pvalues[j].astype(float) == maxVar):
                    columns = np.delete(columns, j)
                    x = x.loc[:, columns]

    print(regressor.summary())
    print('\nSelect {:d} features from {:d} by best p-values.'.format(len(columns), len(pv_cols)))
    print('The max p-value from the features selected is {:.3f}.'.format(maxVar))

    # odds ratios and 95% CI
    conf = np.exp(regressor.conf_int())
    conf['Odds Ratios'] = np.exp(regressor.params)
    conf.columns = ['2.5%', '97.5%', 'Odds Ratios']
    display(conf)

    return columns, regressor
SL = 0.1
df2 = scale.fit_transform(data.loc[data.Survived>=0, pv_cols])
df2 = pd.DataFrame(df2, columns = pv_cols)
pv_cols, Logit = backwardElimination(df2, y_train, SL, pv_cols)
From the results, we can highlight:
- We are fairly confident about some relationships with the probability of survival:
  - there is an inverse relationship with the class, age and gender (genre) of the passenger;
  - there is a positive relationship, from stronger to weaker, with non_relatives, kids, passenger fare and distinction in name.
- From the coefficients:
  - from non_relatives, we confirm that fellows or companions of families had better chances of survival;
  - as we saw in the EDA, passengers who embarked on S had a higher fatality rate;
  - the model can only express a single straight-line effect per variable, weighted by the others, so it indicates that the greater the number of passengers with the same ticket, the greater the likelihood of death. As seen in the ticket EDA this is only partly true: this coefficient does not capture the fact that the chances of survival actually increase from 1 to 3 passengers on the same ticket. As you can see above, the model balances this through the family-relationship variables. This is a good example of why you should not skip the EDA phase, and above all should not rely on a conclusion based on a single fact or point of view. Remember, models are not full truths; they have limited precision, confidence and accuracy!
Take the exponential of each coefficient to generate the odds ratios. This tells you how a one-unit increase or decrease in a variable affects the odds of surviving. For example, we can expect the odds of surviving to decrease by about 69.5% if the passenger embarked on S. Go back to the embarked EDA and see that this matches the stacked bar chart.
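As a quick illustration of that arithmetic (the coefficient value below is hypothetical, standing in for the Embarked_S coefficient in the summary above):
import numpy as np

coef_embarked_s = -1.19                               # hypothetical logit coefficient for Embarked_S
odds_ratio = np.exp(coef_embarked_s)                  # ~0.304
print('Odds change: {:.1%}'.format(odds_ratio - 1))   # ~ -69.6%: the odds of surviving drop by about 70%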
You must be wondering: after all, how accurate is this model?
Although we did not do cross-validation or even a train/validation split, let's take a look: we retrieve the probabilities generated by the model and take the opportunity to see how we can plot the results of models that return probabilities, so we can refine our perception far beyond simply evaluating p-values, coefficients, accuracy, etc.
pred = Logit.predict(df2[pv_cols])
train = data[data.Survived>=0]
train['proba'] = pred
train['Survived'] = y_train
y_pred = pred.apply(lambda x: 1 if x > 0.5 else 0)
print('Accuracy: {0:2.2%}'.format(accuracy_score(y_true=y_train, y_pred=y_pred)))
def plot_proba(continous, predict, discret, data):
    grouped = pd.pivot_table(data, values=[predict], index=[continous, discret], aggfunc=np.mean)
    colors = 'rbgyrbgy'
    for col in data[discret].unique():
        plt_data = grouped.loc[grouped.index.get_level_values(1)==col]
        plt.plot(plt_data.index.get_level_values(0), plt_data[predict], color=colors[int(col)])
    plt.xlabel(continous)
    plt.ylabel("Probabilities")
    plt.legend(np.sort(data[discret].unique()), loc='upper left', title=discret)
    plt.title("Probabilities with " + continous + " and " + discret)
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(231)
plot_proba('non_relatives', 'Survived', 'Pclass', train)
ax = fig.add_subplot(232)
plot_proba('non_relatives', 'Survived', 'genre', train)
ax = fig.add_subplot(233)
plot_proba('non_relatives', 'Survived', 'qtd_same_ticket_bin', train)
ax = fig.add_subplot(234)
plot_proba('qtd_same_ticket_bin', 'Survived', 'distinction_in_name', train)
ax = fig.add_subplot(235)
plot_proba('qtd_same_ticket_bin', 'Survived', 'Embarked_S', train)
ax = fig.add_subplot(236)
plot_proba('qtd_same_ticket_bin', 'Survived', 'parents', train)
plt.show()
Backward Elimination By Accuracy – A Sequential Backward Selection¶
Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The motivation behind feature selection algorithms is to automatically select a subset of features that are most relevant to the problem, to improve computational efficiency or to reduce the generalization error of the model by removing irrelevant features or noise, which can be useful for algorithms that don't support regularization.
Greedy algorithms make locally optimal choices at each stage of a combinatorial search problem and generally yield a suboptimal solution to the problem in contrast to exhaustive search algorithms, which evaluate all possible combinations and are guaranteed to find the optimal solution. However, in practice, an exhaustive search is often computationally not feasible, whereas greedy algorithms allow for a less complex, computationally more efficient solution.
SBS aims to reduce the dimensionality of the initial feature subspace with a minimum decay in performance of the classifier to improve upon computational efficiency. In certain cases, SBS can even improve the predictive power of the model if a model suffers from overfitting.
SBS sequentially removes features from the full feature subset until the new feature subspace contains the desired number of features. In order to determine which feature is to be removed at each stage, we need to define criterion function J that we want to minimize. The criterion calculated by the criterion function can simply be the difference in performance of the classifier after and before the removal of a particular feature. Then the feature to be removed at each stage can simply be defined as the feature that maximizes this criterion.
From interactive executions, I already know that I need to remove the surnames with less than 0.06 correlation.
So, let's see an example of SBS on our data:
class SBS():
    def __init__(self, estimator, k_features, scoring=accuracy_score, test_size=0.25, random_state=101):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.test_size, random_state=self.random_state)
        dim = X_train.shape[1]
        self.indices_ = list(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, X_test, y_test, self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores = []
            subsets = []
            for p in combinations(self.indices_, r=dim-1):
                score = self._calc_score(X_train, y_train, X_test, y_test, list(p))
                scores.append(score)
                subsets.append(list(p))

            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1
            self.scores_.append(scores[best])

        self.k_score_ = self.scores_[-1]
        return self

    def transform(self, X):
        return X.iloc[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train.iloc[:, indices], y_train)
        y_pred = self.estimator.predict(X_test.iloc[:, indices])
        score = self.scoring(y_test, y_pred)
        return score
knn = KNeighborsClassifier(n_neighbors=3)
sbs = SBS(knn, k_features=1)
df2 = df.drop(['surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
'surname_Moor', 'surname_Chambers', 'surname_Hamalainen',
'surname_Dick', 'surname_Taylor', 'surname_Doling', 'surname_Gordon',
'surname_Beane', 'surname_Hippach', 'surname_Bishop',
'surname_Mellinger', 'surname_Yarred'], axis = 1)
sbs.fit(df2, y_train)
print('Best Score:',max(sbs.scores_))
k_feat = [len(k) for k in sbs.subsets_]
fig = plt.figure(figsize=(10,5))
plt.plot(k_feat, sbs.scores_, marker='o')
#plt.ylim([0.7, max(sbs.scores_)+0.01])
plt.xlim([1, len(sbs.subsets_)])
plt.xticks(np.arange(1, len(sbs.subsets_)+1))
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid(b=1)
plt.show()
print('First best accuracy with:\n', list(df2.columns[sbs.subsets_[np.argmax(sbs.scores_)]]))
best_idx = max(np.arange(len(sbs.scores_))[np.array(sbs.scores_) == max(sbs.scores_)])
SBS = list(df2.columns[sbs.subsets_[best_idx]])
print('\nBest accuracy with {0:2d} features:\n{1:}'.format(len(SBS), SBS))
Select Features by Recursive Feature Elimination¶
The goal of Recursive Feature Elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.
RFE is based on the idea to repeatedly construct a model and choose either the best or worst performing feature, setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted.
Another option is the Sequential Feature Selector (SFS) from mlxtend, a separate Python library designed to work well with scikit-learn, which provides a selector that works a bit differently (a quick sketch is shown below).
RFE is computationally less complex, using the feature's weight coefficients (e.g., linear models) or feature importances (tree-based algorithms) to eliminate features recursively, whereas SFS eliminates (or adds) features based on a user-defined classifier/regression performance metric.
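For reference, a backward SFS with mlxtend would look roughly like this (a sketch, assuming the mlxtend package is installed; the cell after it runs scikit-learn's RFE instead):
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier

sfs = SFS(KNeighborsClassifier(n_neighbors=3), k_features=10, forward=False,
          floating=False, scoring='accuracy', cv=5, n_jobs=-1)
sfs = sfs.fit(df.values, y_train.values)
print('Selected features:', list(df.columns[list(sfs.k_feature_idx_)]))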
from sklearn.feature_selection import RFE
lr = LogisticRegression()
rfe = RFE(estimator=lr, step=1)
rfe.fit(df, y_train)
FRFE = cols[rfe.ranking_==1]
print('\nFeatures selected:\n',FRFE)
print('\n Total Features selected:',len(FRFE))
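scikit-learn also offers RFECV, which combines RFE with cross-validation to choose the number of features automatically (a sketch reusing df, cols and y_train from above):
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

rfecv = RFECV(estimator=LogisticRegression(), step=1, cv=10, scoring='accuracy')
rfecv.fit(df, y_train)
print('Optimal number of features:', rfecv.n_features_)
print('Features selected:\n', list(cols[rfecv.support_]))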
Select Features by Embedded Methods¶
In addition to returning performance itself, some models have, as part of their internal process, a step that selects the features that best fit their objective, and they return feature importances as well. Thus, they provide a straightforward path to feature selection and combine the qualities of filter and wrapper methods.
Some of the most popular examples of these methods are LASSO, RIDGE, SVM, Regularized trees, Memetic algorithm, and Random multinomial logit.
In the case of Random Forest and other tree-based models, we have these basic approaches implemented in the packages:
- Gini/Entropy Importance or Mean Decrease in Impurity (MDI)
- Permutation Importance or Mean Decrease in Accuracy
- Permutation with Shadow Features
- Gradient Boosting
Other models address the multicollinearity problem by adding constraints or a penalty term to regularize the fit. When there are multiple correlated features, as is the case with very many real-life datasets, the model becomes unstable, meaning that small changes in the data can cause large changes in the model coefficients, making model interpretation very difficult; the regularization terms mitigate this.
This applies to regression models like LASSO and RIDGE. In classifier cases, you can use SGDClassifier where you can set the loss parameter to ‘log’ for Logistic Regression or ‘hinge’ for SVM. In SGDClassifier you can set the penalty to either of ‘l1’, ‘l2’ or ‘elasticnet’ which is a combination of both.
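A minimal sketch of that idea, using SGDClassifier with an elastic-net penalty and letting SelectFromModel keep the features whose coefficients survive the regularization (the parameter values here are illustrative, not tuned):
from sklearn.linear_model import SGDClassifier
from sklearn.feature_selection import SelectFromModel

sgd = SGDClassifier(loss='log', penalty='elasticnet', alpha=1e-3, l1_ratio=0.7,
                    max_iter=1000, random_state=101)
sfm = SelectFromModel(sgd)
sfm.fit(df, y_train)
print('Features kept by the regularized model:\n', list(cols[sfm.get_support()]))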
Let’s start with more details and examples:
Feature Selection by Mean Decrease Impurity¶
There are two things to keep in mind when using the impurity-based ranking:
- Feature selection based on impurity reduction is biased towards preferring variables with more categories.
- It can lead to the incorrect conclusion that one of the features is a strong predictor while the others in the same group are unimportant, while actually they are very close in terms of their relationship with the independent variable.
The second one refers to the case when the dataset has two or more correlated features: from the point of view of the model, any of these correlated features can be used as the predictor, with no concrete preference of one over the others. But once one of them is used, the importance of the others is significantly reduced, since effectively the impurity they could remove has already been removed by the first feature. As a consequence, they will have a lower reported importance. This is not an issue when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features, but it can mislead us when interpreting the data. The effect of this phenomenon is somewhat reduced thanks to the random selection of features at each node creation, but in general the effect is not removed completely.
Random Forest is one of them; the reason is that the tree-based strategies used by random forests naturally rank features by how well they improve the purity of the node. This is the mean decrease in impurity over all trees, where for classification the impurity is typically either Gini impurity or information gain/entropy, and for regression it is variance. Thus, when training a tree, it can be computed how much each feature decreases the weighted impurity; for a forest, the impurity decrease from each feature can be averaged and the features ranked according to this measure.
Random forests are a popular method for feature ranking, since they are so easy to apply: in general they require very little feature engineering and parameter tuning and mean decrease impurity is exposed in most random forest libraries. But they come with their own gotchas, especially when data interpretation is concerned. With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories. As long as the gotchas are kept in mind, there really is no reason not to try them out on your data.
Then, we run a quick Random Forest to select the most important features:
rfc = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=101)
rfc.fit(df, y_train)
feature_importances = [(feature, score) for feature, score in zip(cols, rfc.feature_importances_)]
MDI = cols[rfc.feature_importances_>0.010]
print('Total features selected by Random Forest:',len(MDI))
g = pd.DataFrame(sorted(feature_importances, key=lambda x: -x[1])[:len(MDI)], columns=['Feature','Importance']).\
plot(x='Feature', kind='barh', figsize=(20,7), fontsize=18, grid=True)
plt.show()
Feature Selection by Mean Decrease Accuracy¶
The general idea is to permute the values of each feature and measure how much the permutation decreases the accuracy of the model. Clearly, for an unimportant feature the permutation should have little to no effect on model accuracy, while permuting an important feature should significantly decrease it.
This method is not directly exposed in sklearn, but it is straightforward to implement. Start by recording a baseline accuracy (classifier) or R² score (regressor) by passing a validation set or the out-of-bag (OOB) samples through the Random Forest. Then permute the column values of a single predictor feature, pass all test samples back through the Random Forest, and recompute the accuracy or R². The importance of that feature is the difference between the baseline and the drop in overall accuracy or R² caused by permuting the column. The permutation mechanism is much more computationally expensive than the mean decrease in impurity mechanism, but the results are more reliable.
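A bare-bones version of that permutation scheme could look like this (a sketch; permutation_importance here is our own helper, not a library function, and it assumes an already fitted classifier):
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

def permutation_importance(model, X_valid, y_valid, metric=accuracy_score):
    # Importance = baseline score minus the score after shuffling one column at a time
    baseline = metric(y_valid, model.predict(X_valid))
    importances = {}
    for col in X_valid.columns:
        saved = X_valid[col].copy()
        X_valid[col] = np.random.permutation(X_valid[col].values)
        importances[col] = baseline - metric(y_valid, model.predict(X_valid))
        X_valid[col] = saved          # restore the original column
    return pd.Series(importances).sort_values(ascending=False)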
The rfpimp package (in the src dir) provides it for Random Forest; let's see:
X_train, X_test, y, y_test = train_test_split(df, y_train , test_size=0.20, random_state=101)
# Add column of random numbers
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, n_jobs=-1, oob_score=True, random_state=101)
rf.fit(X_train, y)
imp = importances(rf, X_test, y_test, n_samples=-1) # permutation
MDA = imp[imp!=0].dropna().index
if 'random' in MDA:
    MDA = MDA.drop('random')
print('%d features are selected.' % len(MDA))
plot_importances(imp[imp!=0].dropna(), figsize=(20,7))
Feature Selection by Permutation with Shadow Features¶
Boruta randomly permutes variables like Permutation Importance does, but it permutes all variables at the same time and concatenates the shuffled features with the original ones. The concatenated result is used to fit the model.
Daniel Homola, who also wrote the Python version of Boruta, BorutaPy, gave a wonderful overview of the Boruta algorithm in his blog post:
“The shuffled features (a.k.a. shadow features) are basically noises with identical marginal distribution w.r.t the original feature. We count the times a variable performs better than the ‘best’ noise and calculate the confidence towards it being better than noise (the p-value) or not. Features which are confidently better are marked ‘confirmed’, and those which are confidently on par with noises are marked ‘rejected’. Then we remove those marked features and repeat the process until all features are marked or a certain number of iteration is reached.”
Although Boruta is a feature selection algorithm, we can use the order of confirmation/rejection as a way to rank the importance of features.
# NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
X = df.values
y = y_train.values.ravel()
# define random forest classifier, with utilising all cores and
# sampling in proportion to y labels
#rf = RandomForestClassifier(n_estimators=10, min_samples_leaf=5, n_jobs=-1, oob_score=True, random_state=101)
rf = ExtraTreesClassifier(n_estimators=100, max_depth=4, n_jobs=-1, oob_score=True, bootstrap=True, random_state=101)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=0, random_state=101)
# find all relevant features
feat_selector.fit(X, y)
shadow = cols[feat_selector.support_]
# check selected features
print('Features selected:',shadow)
# call transform() on X to filter it down to selected features
print('Data transformed has %d features' % feat_selector.n_features_) #feat_selector.transform(X).shape[1])
print('Check the selector ranking:')
display(pd.concat([pd.DataFrame(cols, columns=['Columns']),
pd.DataFrame(feat_selector.ranking_, columns=['Rank'])], axis=1).sort_values(by=['Rank']))
Feature Selection by Gradient Boosting¶
For the LightGBM model, the importance can be calculated in two ways: if 'split', the result contains the number of times the feature is used in the model; if 'gain', the result contains the total gain of the splits which use the feature.
On the XGBoost model the importance is calculated by:
- ‘weight’: the number of times a feature is used to split the data across all trees.
- ‘gain’: the average gain across all splits the feature is used in.
- ‘cover’: the average coverage across all splits the feature is used in.
- ‘total_gain’: the total gain across all splits the feature is used in.
- ‘total_cover’: the total coverage across all splits the feature is used in.
The first measure is split-based and is very similar to the one given by Gini importance, but it doesn't take the number of samples into account.
The second measure is gain-based. It’s basically the same as the Gini Importance implemented in R packages and in scikit-learn with Gini impurity replaced by the objective used by the gradient boosting model.
The cover, implemented exclusively in XGBoost, is counting the number of samples affected by the splits based on a feature.
In the XGBoost API these measures are exposed through Booster.get_score(fmap='', importance_type='weight'), which returns the importance of each feature under the chosen importance type.
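For instance, once an XGBClassifier has been fitted (like the model trained in the cell further below), both measures can be pulled from the underlying booster:
booster = model.get_booster()                               # assumes the fitted XGBClassifier `model`
imp_split = booster.get_score(importance_type='weight')     # split-based counts
imp_gain = booster.get_score(importance_type='gain')        # gain-based importances
print(sorted(imp_gain.items(), key=lambda kv: -kv[1])[:10]) # top 10 by gain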
The default measure of both XGBoost and LightGBM is the split-based one. I think this measure will be problematic if there are one or two features with strong signals and a few features with weak signals. The model will exploit the strong features in the first few trees and use the rest of the features to improve on the residuals, so the strong features will not look as important as they actually are. While setting a lower learning rate and using early stopping should alleviate the problem, also checking the gain-based measure may be a good idea.
Note that these measures are purely calculated using training data, so there’s a chance that a split creates no improvement on the objective in the holdout set. This problem is more severe than in the random forest since gradient boosting models are more prone to over-fitting.
Feature importance scores can be used for feature selection in scikit-learn.
This is done using the SelectFromModel class that takes a model and can transform a dataset into a subset with selected features.
This class can take a previously trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. This threshold is used when you call the transform() method on the SelectFromModel instance to consistently select the same features on the training dataset and the test dataset.
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
# split data into train and test sets
X_train, X_test, y, y_test = train_test_split(df, y_train, test_size=0.30, random_state=101)
# fit model on all training data
model = XGBClassifier(importance_type='gain', scale_pos_weight=((len(y)-y.sum())/y.sum()))
model.fit(X_train, y)
fig=plt.figure(figsize=(20,5))
ax = fig.add_subplot(121)
g = plot_importance(model, height=0.5, ax=ax)
# Using each unique importance as a threshold
thresholds = np.sort(np.unique(model.feature_importances_)) #np.sort(model.feature_importances_[model.feature_importances_>0])
best = 0
colsbest = 31
my_model = model
threshold = 0
for thresh in thresholds:
# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier(importance_type='gain', scale_pos_weight=((len(y)-y.sum())/y.sum()))
selection_model.fit(select_X_train, y)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, prediction