Kaggle challenges us to learn data analysis and machine learning from the data the Titanic shipwreck, and try predict survival and get familiar with ML basics.

So, this material is intended to cover most of the techniques of data analysis and ML in Python, than to properly compete in Kaggle. That is why it following the natural flow of ML and contains many texts and links regarding the techniques, made your conference and references easy. as it can be extended over time.

In this way the material can be used for consultation and apply the methods to other similar classification cases, but for its application in the competition, or even to a real case, it will be necessary to make some choices and changes.

Competition Description:

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, Kaggle ask you to complete the analysis of what sorts of people were likely to survive. In particular, they ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Table of Contents

Preparing environment and uploading data

You can download the this python notebook and data from my github repository. The data can download on Kaggle here.

Import Packages

In [1]:
import os
    import warnings
    warnings.simplefilter(action = 'ignore', category=FutureWarning)
    warnings.filterwarnings('ignore')
    def ignore_warn(*args, **kwargs):
        pass

    warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

    import numpy as np
    import pandas as pd
    import pylab 
    import seaborn as sns
    sns.set(style="ticks", color_codes=True, font_scale=1.5)
    from matplotlib import pyplot as plt
    from matplotlib.ticker import FormatStrFormatter
    from matplotlib.colors import ListedColormap
    %matplotlib inline
    import mpl_toolkits
    from mpl_toolkits.mplot3d import Axes3D
    import model_evaluation_utils as meu

    from scipy.stats import skew, norm, probplot, boxcox
    from patsy import dmatrices
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from sklearn.feature_selection import f_classif, chi2, SelectKBest, SelectFromModel
    from boruta import BorutaPy
    from rfpimp import *

    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
    from sklearn.preprocessing import StandardScaler, PolynomialFeatures, MinMaxScaler
    from sklearn.pipeline import Pipeline, make_pipeline
    from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, cross_val_predict, train_test_split
    from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score

    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.svm import SVC, LinearSVC
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier
    from sklearn.ensemble.gradient_boosting import GradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.neighbors import KNeighborsClassifier
    import xgboost as xgb
    from xgboost import XGBClassifier
    from xgboost import plot_importance

    #from sklearn.base import BaseEstimator, TransformerMixin, clone, ClassifierMixin
    from sklearn.ensemble import VotingClassifier
    from itertools import combinations
    

Load Datasets

I start with load the datasets with pandas, and concatenate them.

In [2]:
train = pd.read_csv('train.csv') 

    test = pd.read_csv('test.csv') 
    Test_ID = test.PassengerId
    test.insert(loc=1, column='Survived', value=-1)

    data = pd.concat([train, test], ignore_index=True)
    

Exploratory Data Analysis (EDA) & Feature Engineering

Take a First Look of our Data:

I created the function below to simplify the analysis of general characteristics of the data. Inspired on the str function of R, this function returns the types, counts, distinct, count nulls, missing ratio and uniques values of each field/feature.

If the study involve some supervised learning, this function can return the study of the correlation, for this we just need provide the independent variable to the pred parameter.

Also, if your return is stored in a variable you can evaluate it in more detail, specific of a field, or sort them from different perspectives

In [3]:
def rstr(df, pred=None): 
        obs = df.shape[0]
        types = df.dtypes
        counts = df.apply(lambda x: x.count())
        uniques = df.apply(lambda x: [x.unique()])
        nulls = df.apply(lambda x: x.isnull().sum())
        distincts = df.apply(lambda x: x.unique().shape[0])
        missing_ration = (df.isnull().sum()/ obs) * 100
        skewness = df.skew()
        kurtosis = df.kurt() 
        print('Data shape:', df.shape)
        
        if pred is None:
            cols = ['types', 'counts', 'distincts', 'nulls', 'missing ration', 'uniques', 'skewness', 'kurtosis']
            str = pd.concat([types, counts, distincts, nulls, missing_ration, uniques, skewness, kurtosis], axis = 1)

        else:
            corr = df.corr()[pred]
            str = pd.concat([types, counts, distincts, nulls, missing_ration, uniques, skewness, kurtosis, corr], axis = 1, sort=False)
            corr_col = 'corr '  + pred
            cols = ['types', 'counts', 'distincts', 'nulls', 'missing_ration', 'uniques', 'skewness', 'kurtosis', corr_col ]
        
        str.columns = cols
        dtypes = str.types.value_counts()
        print('___________________________\nData types:\n',str.types.value_counts())
        print('___________________________')
        return str
    

In [4]:
details = rstr(data.loc[: ,'Survived' : 'Embarked'], 'Survived')
    details.sort_values(by='corr Survived', ascending=False)
    

Data shape: (1309, 11)
    ___________________________
    Data types:
     object     5
    int64      4
    float64    2
    Name: types, dtype: int64
    ___________________________
    

Out[4]:
types counts distincts nulls missing_ration uniques skewness kurtosis corr Survived
Survived int64 1309 3 0 0.000000 [[0, 1, -1]] 0.097425 -1.263022 1.000000
Fare float64 1308 282 1 0.076394 [[7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51… 4.367709 27.027986 0.081545
Parch int64 1309 8 0 0.000000 [[0, 1, 2, 5, 3, 4, 6, 9]] 3.669078 21.541079 0.028196
SibSp int64 1309 7 0 0.000000 [[1, 0, 3, 4, 2, 5, 8]] 3.844220 20.043251 0.012470
Age float64 1046 99 263 20.091673 [[22.0, 38.0, 26.0, 35.0, nan, 54.0, 2.0, 27.0… 0.407675 0.146948 -0.049620
Pclass int64 1309 3 0 0.000000 [[3, 1, 2]] -0.598647 -1.315079 -0.126769
Name object 1309 1307 0 0.000000 [[Braund, Mr. Owen Harris, Cumings, Mrs. John … NaN NaN NaN
Sex object 1309 2 0 0.000000 [[male, female]] NaN NaN NaN
Ticket object 1309 929 0 0.000000 [[A/5 21171, PC 17599, STON/O2. 3101282, 11380… NaN NaN NaN
Cabin object 295 187 1014 77.463713 [[nan, C85, C123, E46, G6, C103, D56, A6, C23 … NaN NaN NaN
Embarked object 1307 4 2 0.152788 [[S, C, Q, nan]] NaN NaN NaN

Data Dictionary

  • Survived: 0 = No, 1 = Yes. I use -1 to can separate test data from training data.
  • Fare: The passenger fare
  • Parch: # of parents / children aboard the Titanic
  • SibSp: # of siblings / spouses aboard the Titanic
  • Age: Age in years
  • Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
  • Name: Name of the passenger
  • Sex: Sex of the passenger male and female
  • Ticket: Ticket number
  • Cabin: Cabin number
  • Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

The points of attention we have here are:

  • Fare: Kaggle affirm that is the passenger fare, but with some data inspection in group of Tickets we discover that is the total amount fare paid for a ticket, and the existence of tickets for a group of passengers.
  • Parch: The dataset defines family relations in this way…
    • Parent = mother, father
    • Child = daughter, son, stepdaughter, stepson
    • Some children traveled only with a nanny, therefore parch=0 for them.
  • SibSP: The dataset defines family relations in this way…
    • Sibling = brother, sister, stepbrother, stepsister
    • Spouse = husband, wife (mistresses and fiancés were ignored)
  • Age: Have 20% of nulls, so we have to find a more efficient way of filling them out with just a single value, like the median for example.
  • Name: is a categorical data with a high distinct values as expect. The first reaction is drop this column, but we can use it to training different data engineering techniques, to see if we can get some valuable data. Besides that, notice that has two pairs of passengers with the same name?
  • Ticket Other categorical data, but in this case it have only 71% of distinct values and don’t have nulls. So, it is possible that some passenger voyage in groups and use the same ticket. Beyond that, we can check to if we can extract other interesting thought form it.
  • Cabin: the high number of distinct values (187) and nulls (77.5%).
    This is a categorical data of which I use to train different techniques to extract some information of value and null input, but with given the high rate of null the recommended would be to simplify the filling or even to exclude this attribute.

First see of some stats of Numeric Data

So, for the main statistics of our numeric data describe the function (like the summary of R)

In [5]:
print('Data is not balanced! Has {:2.2%} survives'.format(train.Survived.describe()[1]))
    display(data.loc[: ,'Pclass' : 'Embarked'].describe().transpose())
    print('Survived: [1] Survived; [0] Died; [-1] Test Data set:\n',data.Survived.value_counts())
    

Data is not balanced! Has 38.38% survives
    

count mean std min 25% 50% 75% max
Pclass 1309.0 2.294882 0.837836 1.00 2.0000 3.0000 3.000 3.0000
Age 1046.0 29.881138 14.413493 0.17 21.0000 28.0000 39.000 80.0000
SibSp 1309.0 0.498854 1.041658 0.00 0.0000 0.0000 1.000 8.0000
Parch 1309.0 0.385027 0.865560 0.00 0.0000 0.0000 0.000 9.0000
Fare 1308.0 33.295479 51.758668 0.00 7.8958 14.4542 31.275 512.3292

Survived: [1] Survived; [0] Died; [-1] Test Data set:
      0    549
    -1    418
     1    342
    Name: Survived, dtype: int64
    

EDA and Feature Engineering by Feature

Standard Data Visualization for Discrete or Binning Data

In [6]:
def charts(feature, df):
        print('\n           ____________________________ Plots of', feature, 'per Survived and Dead: ____________________________')
        # Pie of all Data
        fig = plt.figure(figsize=(20,5))
        f1 = fig.add_subplot(131)
        cnt = df[feature].value_counts()
        g = plt.pie(cnt, labels=cnt.index, autopct='%1.1f%%', shadow=True, startangle=90)
        
        # Count Plot By Survived and Dead
        f = fig.add_subplot(132)
        g = sns.countplot(x=feature, hue='Survived', hue_order=[1,0], data=df, ax=f)

        # Percent stacked Plot
        survived = df[df['Survived']==1][feature].value_counts()
        dead = df[df['Survived']==0][feature].value_counts()
        df2 = pd.DataFrame([survived,dead])
        df2.index = ['Survived','Dead']
        df2 = df2.T
        df2 = df2.fillna(0)
        df2['Total'] = df2.Survived + df2.Dead
        df2.Survived = df2.Survived/df2.Total
        df2.Dead = df2.Dead/df2.Total
        df2.drop(['Total'], axis=1, inplace=True)
        f = fig.add_subplot(133)
        df2.plot(kind='bar', stacked=True, ax=f)
        del df2, g, f, cnt, dead, fig
    

Ticket

Since Ticket is a transaction and categorical data, the first insight is drop this feature, but we may note that it has some hidden value information. At first look, safe few cases, we could affirms that:

  • families and group of persons that traveled together bought the same ticket.
  • People with alphanumerics Tickets has some special treatment (crew family, employees, VIP, free tickets, etc.)

So, we start by a new feature creation to quantify the number of passengers by ticket, and join this
quantity to each passenger with the same ticket.

In [7]:
same_ticket = data.Ticket.value_counts()
    data['qtd_same_ticket'] = data.Ticket.apply(lambda x: same_ticket[x])
    del same_ticket
    charts('qtd_same_ticket', data[data.Survived>=0])
    

               ____________________________ Plots of qtd_same_ticket per Survived and Dead: ____________________________
    

As we can see above:

  • the majority (54%) bought only one ticket per passenger, and have lower survival rate than passengers that bought tickets for 2, 3, 4, 5 and 8 people.
  • the survival rate is growing between 1 and 4, dropped a lot at 5. From the bar chart we can see that after 5 the number of samples is too low (84 out of 891, 9.4%, 1/4 of this is 5), and this data is skewed with a long tail to right. We can reduce this tail by binning all data after 4 in the same ordinal, its better to prevent overfitting, but we lose some others interesting case, see the next bullet. As alternative we can apply a box cox at this measure.
  • The case of 11 people with same ticket probably is a huge family that all samples on the training data died. Let’s check this below.

In [8]:
data[(data.qtd_same_ticket==11)]
    

Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked qtd_same_ticket
159 160 0 3 Sage, Master. Thomas Henry male NaN 8 2 CA. 2343 69.55 NaN S 11
180 181 0 3 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.55 NaN S 11
201 202 0 3 Sage, Mr. Frederick male NaN 8 2 CA. 2343 69.55 NaN S 11
324 325 0 3 Sage, Mr. George John Jr male NaN 8 2 CA. 2343 69.55 NaN S 11
792 793 0 3 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.55 NaN S 11
846 847 0 3 Sage, Mr. Douglas Bullen male NaN 8 2 CA. 2343 69.55 NaN S 11
863 864 0 3 Sage, Miss. Dorothy Edith “Dolly” female NaN 8 2 CA. 2343 69.55 NaN S 11
1079 1080 -1 3 Sage, Miss. Ada female NaN 8 2 CA. 2343 69.55 NaN S 11
1233 1234 -1 3 Sage, Mr. John George male NaN 1 9 CA. 2343 69.55 NaN S 11
1251 1252 -1 3 Sage, Master. William Henry male 14.5 8 2 CA. 2343 69.55 NaN S 11
1256 1257 -1 3 Sage, Mrs. John (Annie Bullen) female NaN 1 9 CA. 2343 69.55 NaN S 11

We confirm our hypothesis, and we notice that Fare is not the price of each passenger, but the price of each ticket, its total amount. Since our data is per passenger, this information has some distortion, because only one passenger that bought a ticket alone of 69.55 pounds is different from 11 passenger that bought a special ticket, with discount for group, by 6.32 pounds per passenger. It suggest to create a new feature that represents the real fare by passenger.

Back to the quantity of persons with same ticket, if we keep this and the model can capture this pattern, you’ll probably predict that the respective test samples also died! However, even if true, can be a sign of overfitting, because we only have 1.2% of these cases in the training samples.

In order to increase representativeness and lose the minimum of information, since we have only 44 (4.9%) training samples that bought tickets for 4 people and 101 (11.3%) of 3, we binning the quantity of 3 and 4 together as 3 (16,3%, over than 5 as 5 (84 samples). Let’s see the results below.

In [9]:
data['qtd_same_ticket_bin'] = data.qtd_same_ticket.apply(lambda x: 3 if (x>2 and x<5) else (5 if x>4 else x))
    charts('qtd_same_ticket_bin', data[data.Survived>=0])
    

               ____________________________ Plots of qtd_same_ticket_bin per Survived and Dead: ____________________________
    

Other option, is create a binary feature that indicates if the passenger use a same ticket or not (not share his ticket)

In [10]:
print('Percent. survived from unique ticket: {:3.2%}'.\
          format(data.Survived[(data.qtd_same_ticket==1) & (data.Survived>=0)].sum()/
                 data.Survived[(data.qtd_same_ticket==1) & (data.Survived>=0)].count()))
    print('Percent. survived from same tickets: {:3.2%}'.\
          format(data.Survived[(data.qtd_same_ticket>1) & (data.Survived>=0)].sum()/
                 data.Survived[(data.qtd_same_ticket>1) & (data.Survived>=0)].count()))

    data['same_tckt'] = data.qtd_same_ticket.apply(lambda x: 1 if (x> 1) else 0)
    charts('same_tckt', data[data.Survived>=0])
    

Percent. survived from unique ticket: 27.03%
    Percent. survived from same tickets: 51.71%

               ____________________________ Plots of same_tckt per Survived and Dead: ____________________________
    

In this case we lose information that the chances of survival increase from 1 to 4, and fall from 5. In addition, cases 1 and 0 of the two measures are exactly the same. Then we will not use this option, and go work on Fare.

Finally, we have one more information to extract directly from Ticket, and check the possible special treatment!

In [11]:
data.Ticket.str.findall('[A-z]').apply(lambda x: ''.join(map(str, x))).value_counts().head(7)
    

Out[11]:
           957
    PC          92
    CA          68
    A           39
    SOTONOQ     24
    STONO       21
    WC          15
    Name: Ticket, dtype: int64

In [12]:
data['distinction_in_tikect'] =\
       (data.Ticket.str.findall('[A-z]').apply(lambda x: ''.join(map(str, x)).strip('[]')))

    data.distinction_in_tikect = data.distinction_in_tikect.\
      apply(lambda y: 'Without' if y=='' else y if (y in ['PC', 'CA', 'A', 'SOTONOQ', 'STONO', 'WC', 'SCPARIS']) else 'Others')

    charts('distinction_in_tikect', data[(data.Survived>=0)])
    

               ____________________________ Plots of distinction_in_tikect per Survived and Dead: ____________________________
    

By the results, passengers with PC distinction in their tickets had best survival rate. Without, Others and CA is very close and can be grouped in one category and the we can do the same between STONO and SCAPARIS, and between A, SOTONOQ and WC.

In [13]:
data.distinction_in_tikect = data.distinction_in_tikect.\
      apply(lambda y: 'Others' if (y in ['Without', 'Others', 'CA']) else\
            'Low' if (y in ['A', 'SOTONOQ', 'WC']) else\
            'High' if (y in ['STONO', 'SCPARIS']) else y)

    charts('distinction_in_tikect', data[(data.Survived>=0)])
    

               ____________________________ Plots of distinction_in_tikect per Survived and Dead: ____________________________
    

Fare

First, we treat the unique null fare case, then we take a look of the distribution of Fare (remember that is the total amount Fare of the Ticket).

In [14]:
# Fill null with median of most likely type passenger
    data.loc[data.Fare.isnull(), 'Fare'] = data.Fare[(data.Pclass==3) & (data.qtd_same_ticket==1) & (data.Age>60)].median()

    fig = plt.figure(figsize=(20,5))
    f = fig.add_subplot(121)
    g = sns.distplot(data[(data.Survived>=0)].Fare)
    f = fig.add_subplot(122)
    g = sns.boxplot(y='Fare', x='Survived', data=data[data.Survived>=0], notch = True)
    

Let’s take a look at how the fare per passenger is and how much it differs from the total

In [15]:
data['passenger_fare'] = data.Fare / data.qtd_same_ticket

    fig = plt.figure(figsize=(20,6))
    a = fig.add_subplot(141)
    g = sns.distplot(data[(data.Survived>=0)].passenger_fare)
    a = fig.add_subplot(142)
    g = sns.boxplot(y='passenger_fare', x='Survived', data=data[data.Survived>=0], notch = True)
    a = fig.add_subplot(143)
    g = pd.qcut(data.Fare[(data.Survived==0)], q=[.0, .25, .50, .75, 1.00]).value_counts().plot(kind='bar', ax=a, title='Died')
    a = fig.add_subplot(144)
    g = pd.qcut(data.Fare[(data.Survived>0)], q=[.0, .25, .50, .75, 1.00]).value_counts().plot(kind='bar', ax=a, title='Survived')
    plt.tight_layout(); plt.show()
    


From the comparison, we can see that:

  • the distributions are not exactly the same, with two spouts slightly apart on passenger fare.
  • Class and how much paid per passenger make differences!

    Although the number of survivors among the quartiles is approximately the same as expected, when we look at passenger fares, it is more apparent that the mortality rate is higher in the lower Fares, since the top of Q4 died is at the same height as the median plus a confidence interval of the fare paid by survivors.

  • the number of outliers is lower in the fare per passenger, especially among survivors.

    We can not rule out these outliers if there are cases of the same type in the test data set. In addition, these differences in values may be due to probably first class with additional fees for certain exclusives and cargo.

Below, you can see that the largest outlier all survival in the train data set, and has one case (1235 Passenger Id,
the matriarch of one son and two companions) to predict. Among all outlier cases of survivors, we see that all cases are first class, and different from the largest outlier, 27% actually died, and we have 18 cases to predict.

In [16]:
print('Passengers with higets passenger fare:')
    display(data[data.passenger_fare>120])
    print('\nSurivived of passenger fare more than 50:\n',
        pd.pivot_table(data.loc[data.passenger_fare>50, ['Pclass', 'Survived']], aggfunc=np.count_nonzero, 
                           columns=['Survived'] , index=['Pclass']))
    

Passengers with higets passenger fare:
    

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked qtd_same_ticket qtd_same_ticket_bin same_tckt distinction_in_tikect passenger_fare
258 259 1 1 Ward, Miss. Anna female 35.0 0 0 PC 17755 512.3292 NaN C 4 3 1 PC 128.0823
679 680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C 4 3 1 PC 128.0823
737 738 1 1 Lesurer, Mr. Gustave J male 35.0 0 0 PC 17755 512.3292 B101 C 4 3 1 PC 128.0823
1234 1235 -1 1 Cardeza, Mrs. James Warburton Martinez (Charlo… female 58.0 0 1 PC 17755 512.3292 B51 B53 B55 C 4 3 1 PC 128.0823

    Surivived of passenger fare more than 50:
     Survived  -1   0   1
    Pclass              
    1         18   4  22
    

Note that if we leave this way, if the model succeeds in capturing this pattern of largest outlier we are again thinking of a model that is at risk of overfitting (0.03% of cases).

Pclass

Notwithstanding the fact that class 3 presents greater magnitude, as we see with Fare by passenger, we notice that survival rate is greater with greater fare by passenger. Its make to think that has some socioeconomic discrimination. It is confirmed when we saw the data distribution over the classes, and see the percent bar has a clearer aggressive decreasing survival rate through the first to the third classes.

In [17]:
charts('Pclass', data[(data.Survived>=0)])
    

               ____________________________ Plots of Pclass per Survived and Dead: ____________________________
    

SibSp

In [18]:
charts('SibSp', data[(data.Survived>=0)])
    

               ____________________________ Plots of SibSp per Survived and Dead: ____________________________
    

Since more than 2 siblings has too few cases and lowest survival rate, we can aggregate all this case into unique bin in order to increase representativeness and lose the minimum of information.

In [19]:
data['SibSp_bin'] = data.SibSp.apply(lambda x: 6 if x > 2 else x)
    charts('SibSp_bin', data[(data.Survived>=0)])
    

               ____________________________ Plots of SibSp_bin per Survived and Dead: ____________________________
    

Parch

In [20]:
charts('Parch', data[data.Survived>=0])
    

               ____________________________ Plots of Parch per Survived and Dead: ____________________________
    

As we did with siblings, we will aggregate the Parch cases with more than 3, even with the highest survival rate with 3 Parch.

In [21]:
data['Parch_bin'] = data.Parch.apply(lambda x: x if x< 3 else 4)
    charts('Parch_bin', data[(data.Survived>=0)])
    

               ____________________________ Plots of Parch_bin per Survived and Dead: ____________________________
    

Family and non-relatives

If you investigate the data, you will notice that total family members It can be obtained by the sum of Parch and SibSp plus 1 (1 for the person of respective record). So, let’s create the Family and see what we get.

In [22]:
data['family'] = data.SibSp + data.Parch + 1
    charts('family', data[data.Survived>=0])
    

               ____________________________ Plots of family per Survived and Dead: ____________________________
    

As we can see, family groups of up to 4 people were more likely to survive than people without relatives on board.
However from 5 family members we see a drastic fall and the leveling of the 7-member cases with the unfamiliar ones.
You may be led to think that this distortion clearly has some relation to the social condition. Better see the right data!

In [23]:
charts('Pclass', data[(data.family>4) & (data.Survived>=0)])
    

               ____________________________ Plots of Pclass per Survived and Dead: ____________________________
    


Yes, we have more cases in the third class, but on the other hand, what we see is that the numbers of cases with more than 4 relatives were rarer. n a more careful look, you will see that from 6 family members we only have third class (25 in training, 10 in test). So we confirmed that a large number of family members made a difference, yes, if you were from upper classes

You must have a feeling of déjà vu, and yes, this metric is very similar to the one we have already created, the amount of passengers with the same ticket.

So what’s the difference. At first you have only the amount of people aboard with family kinship plus herself, in the previous you have people reportedly grouped, family members or not. So, in cases where relatives bought tickets separately we see the family considering them, but the ticket separating them. On the other hand, as a family we do not consider travelers with their non-family companions, employees or friends, while in the other yes.

With this, we can now obtain the number of fellows or companions per passenger. This is the number of non-relatives who traveled with the passenger

In [24]:
data['non_relatives'] = data.qtd_same_ticket - data.family
    charts('non_relatives', data[data.Survived>=0])
    

               ____________________________ Plots of non_relatives per Survived and Dead: ____________________________
    

Here you see negative numbers because there are groups of travelers with the number of unrelated members larger than those with kinship.

Sex

As everybody knows, in that case women has more significant survival rate than men.

In [25]:
charts('Sex', data[(data.Survived>=0)])
    

               ____________________________ Plots of Sex per Survived and Dead: ____________________________
    

Embarked

First, we check the 2 embarked null cases to find the most likely pattern to considerate to fill with the respective mode.

In sequence, we take a look at the Embarked data. As we can see, the passengers that embarked from Cherbourg had best survival rates and most of the passengers embarked from Southampton and had the worst survival rate.

In [26]:
display(data[data.Embarked.isnull()])
    data.loc[data.Embarked=='NA', 'Embarked'] = data[(data.Cabin.str.match('B2')>0) & (data.Pclass==1)].Embarked.mode()[0]
    charts('Embarked', data[(data.Survived>=0)])
    

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked qtd_same_ticket qtd_same_ticket_bin same_tckt distinction_in_tikect passenger_fare SibSp_bin Parch_bin family non_relatives
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 NaN 2 2 1 Others 40.0 0 0 1 1
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 NaN 2 2 1 Others 40.0 0 0 1 1

2 rows × 21 columns

               ____________________________ Plots of Embarked per Survived and Dead: ____________________________
    

Name


Name feature has too much variance and is not significant, but has some value information to extracts and checks, like:

  • Personal Titles
  • Existence of nicknames
  • Existence of references to another person
  • Family names

In [27]:
def Personal_Titles(df):
        df['Personal_Titles'] = df.Name.str.findall('Mrs\.|Mr\.|Miss\.|Maste[r]|Dr\.|Lady\.|Countess\.|'
                                                    +'Sir\.|Rev\.|Don\.|Major\.|Col\.|Jonkheer\.|'
                                                    + 'Capt\.|Ms\.|Mme\.|Mlle\.').apply(lambda x: ''.join(map(str, x)).strip('[]'))

        df.Personal_Titles[df.Personal_Titles=='Mrs.'] = 'Mrs'
        df.Personal_Titles[df.Personal_Titles=='Mr.'] = 'Mr'
        df.Personal_Titles[df.Personal_Titles=='Miss.'] = 'Miss'
        df.Personal_Titles[df.Personal_Titles==''] = df[df.Personal_Titles==''].Sex.apply(lambda x: 'Mr' if (x=='male') else 'Mrs')
        df.Personal_Titles[df.Personal_Titles=='Mme.'] = 'Mrs' 
        df.Personal_Titles[df.Personal_Titles=='Ms.'] = 'Mrs'
        df.Personal_Titles[df.Personal_Titles=='Lady.'] = 'Royalty'
        df.Personal_Titles[df.Personal_Titles=='Mlle.'] = 'Miss'
        df.Personal_Titles[(df.Personal_Titles=='Miss.') & (df.Age>-1) & (df.Age<15)] = 'Kid' 
        df.Personal_Titles[df.Personal_Titles=='Master'] = 'Kid'
        df.Personal_Titles[df.Personal_Titles=='Don.'] = 'Royalty'
        df.Personal_Titles[df.Personal_Titles=='Jonkheer.'] = 'Royalty'
        df.Personal_Titles[df.Personal_Titles=='Capt.'] = 'Technical'
        df.Personal_Titles[df.Personal_Titles=='Rev.'] = 'Technical'
        df.Personal_Titles[df.Personal_Titles=='Sir.'] = 'Royalty'
        df.Personal_Titles[df.Personal_Titles=='Countess.'] = 'Royalty'
        df.Personal_Titles[df.Personal_Titles=='Major.'] = 'Technical'
        df.Personal_Titles[df.Personal_Titles=='Col.'] = 'Technical'
        df.Personal_Titles[df.Personal_Titles=='Dr.'] = 'Technical'

    Personal_Titles(data)
    display(pd.pivot_table(data[['Personal_Titles', 'Survived']], aggfunc=np.count_nonzero, 
                           columns=['Survived'] , index=['Personal_Titles']).T)

    charts('Personal_Titles', data[(data.Survived>=0)])
    

Personal_Titles Kid Miss Mr Mrs Royalty Technical
Survived
-1 42.0 156.0 480.0 148.0 NaN 10.0
0 17.0 55.0 436.0 26.0 2.0 13.0
1 46.0 258.0 162.0 202.0 6.0 10.0

               ____________________________ Plots of Personal_Titles per Survived and Dead: ____________________________
    

As you can see above, I opted to add some titles, but at first keep 2 small sets (Technical and Royalty), Because there are interesting survival rate variations.

Next, we identify the names with mentions to other people or with nicknames and create a boolean feature.

In [28]:
data['distinction_in_name'] =\
       ((data.Name.str.findall('\(').apply(lambda x: ''.join(map(str, x)).strip('[]'))=='(')
        | (data.Name.str.findall(r'"[A-z"]*"').apply(lambda x: ''.join(map(str, x)).strip('""'))!=''))

    data.distinction_in_name = data.distinction_in_name.apply(lambda x: 1 if x else 0)

    charts('distinction_in_name', data[(data.Survived>=0)])
    

               ____________________________ Plots of distinction_in_name per Survived and Dead: ____________________________
    

It is interesting to note that those who have some type of reference or distinction in their names had a higher survival rate.

Next, we find 872 surnames in this dataset. Even adding loners in a single category, we have 229 with more than one member. It’s a huge categorical data to work, and it is to much sparse. The most of then has too few samples to really has significances to almost of algorithms, without risk to occurs overfitting. In addition, there are 18 surnames cases with more than one member exclusively in the test data set.

So, we create this feature with aggregation of unique member into one category and use this at models that could work on it to check if we get better results. Alternatively, we can use dimensionality reduction methods.

In [29]:
print('Total of differents surnames aboard:',
          ((data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()>1).shape[0])
    print('More then one persons aboard with smae surnames:',
          ((data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()>1).sum())

    surnames = (data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()

    data['surname'] = (data.Name.str.findall(r'[A-z]*\,').\
     apply(lambda x: ''.join(map(str, x)).strip(','))).apply(lambda x: x if surnames.get_value(x)>1 else 'Alone')

    test_surnames = set(data.surname[data.Survived>=0].unique().tolist())
    print('Surnames with more than one member aboard that happens only in the test data set:', 
          240-len(test_surnames))

    train_surnames = set(data.surname[data.Survived<0].unique().tolist())
    print('Surnames with more than one member aboard that happens only in the train data set:', 
          240-len(train_surnames))

    both_surnames = test_surnames.intersection(train_surnames)

    data.surname = data.surname.apply(lambda x : x if test_surnames.issuperset(set([x])) else 'Exclude')

    del surnames, both_surnames, test_surnames, train_surnames
    

Total of differents surnames aboard: 872
    More then one persons aboard with smae surnames: 239
    Surnames with more than one member aboard that happens only in the test data set: 18
    Surnames with more than one member aboard that happens only in the train data set: 76
    

Cabin

This information has to many nulls, but when it exist we can know what is the deck of the passenger, and some distinguish passengers from the same class.

Let’s start applying the same cabin to null cases where there are samples with cabins for the same ticket.

In [30]:
CabinByTicket = data.loc[~data.Cabin.isnull(), ['Ticket', 'Cabin']].groupby(by='Ticket').agg(min)
    before = data.Cabin.isnull().sum()
    data.loc[data.Cabin.isnull(), 'Cabin'] = data.loc[data.Cabin.isnull(), 'Ticket'].\
       apply(lambda x: CabinByTicket[CabinByTicket.index==x].min())
    print('Cabin nulls reduced:', (before - data.Cabin.isnull().sum()))
    del CabinByTicket, before
    

Cabin nulls reduced: 16
    

In [31]:
data.Cabin[data.Cabin.isnull()] = 'N999'
    data['Cabin_Letter'] = data.Cabin.str.findall('[^a-z]\d\d*')
    data['Cabin_Number'] = data.apply(lambda x: 0 if len(str(x.Cabin))== 1 else np.int(np.int(x.Cabin_Letter[0][1:])/10), axis=1)
    data.Cabin_Letter = data.apply(lambda x: x.Cabin if len(str(x.Cabin))== 1 else x.Cabin_Letter[0][0], axis=1)

    display(data[['Fare', 'Cabin_Letter']].groupby(['Cabin_Letter']).agg([np.median, np.mean, np.count_nonzero, np.max, np.min]))
    

Fare
median mean count_nonzero amax amin
Cabin_Letter
A 35.07710 41.244314 21.0 81.8583 0.0000
B 84.38335 131.484438 65.0 512.3292 0.0000
C 90.00000 112.161300 104.0 263.0000 25.7000
D 52.55420 53.007339 46.0 113.2750 12.8750
E 51.86250 49.557091 47.0 134.5000 7.2292
F 19.50000 22.303571 14.0 39.0000 7.7500
G 10.46250 11.291667 9.0 16.7000 7.6500
N 10.50000 16.861973 985.0 133.6500 0.0000
T 35.50000 35.500000 1.0 35.5000 35.5000

Doesn’t exist Cabin T in test dataset. This passenger is from first class and his passenger fare is the same from others 5 first class passengers. So, changed to ‘C’ to made same distribution between the six.

In [32]:
display(data[data.Cabin=='T'])

    display(data.Cabin_Letter[data.passenger_fare==35.5].value_counts())

    data.Cabin_Letter[data.Cabin_Letter=='T'] = 'C'
    

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare passenger_fare SibSp_bin Parch_bin family non_relatives Personal_Titles distinction_in_name surname Cabin_Letter Cabin_Number
339 340 0 1 Blackwell, Mr. Stephen Weart male 45.0 0 0 113784 35.5 35.5 0 0 1 0 Mr 0 Alone T 0

1 rows × 26 columns

B    2
    A    2
    C    1
    T    1
    Name: Cabin_Letter, dtype: int64

Fill Cabins letters NAs of third class with most common patterns of the same passenger fare range with one or lessen possible cases.

In [33]:
data.loc[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()

    data.loc[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()

    data.loc[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()

    data.loc[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
    

Fill Cabins letters NAs of second class with most common patterns of the same passenger fare range with one or lessen possible cases.

In [34]:
data.loc[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
    

Fill Cabins letters NAs of first class with most common patterns of the same passenger fare range with one or lessen possible cases.

In [35]:
data.loc[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
    

See below that we conquered a good results after filling nulls, but we need attention since they have too many nulls originally. In addition, the cabin may actually have made more difference in the deaths caused by the impact and not so much among those who drowned.

In [36]:
charts('Cabin_Letter', data[(data.Survived>=0)])
    

               ____________________________ Plots of Cabin_Letter per Survived and Dead: ____________________________
    

Rescue of family relationships

After some work, we notice that is difficult to understand SibSp and Patch isolated, and is difficult to extract directly families relationships from this data without a closer look.

So, in that configuration we not have clearly families relationships, and this information is primary to use for apply ages to ages with null with better distribution and accuracy.

Let’s start to rescue:

The first treatment, I discovered when check the results and noticed that I didn’t apply any relationship to a one case. Look the details below, we can see that is a case of a family with more than one ticket and the son has no age. So, I just manually applied this one case as a son, since the others member the father and the mother, and the son has the pattern 0 in SibSp and Parch 2

In [37]:
display(data[data.Name.str.findall('Bourke').apply(lambda x: ''.join(map(str, x)).strip('[]'))=='Bourke'])
    family_w_age = data.Ticket[(data.Parch>0) & (data.SibSp>0) & (data.Age==-1)].unique().tolist()
    data[data.Ticket.isin(family_w_age)].sort_values('Ticket')
    

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare passenger_fare SibSp_bin Parch_bin family non_relatives Personal_Titles distinction_in_name surname Cabin_Letter Cabin_Number
188 189 0 3 Bourke, Mr. John male 40.0 1 1 364849 15.50 7.75 1 1 3 -1 Mr 0 Bourke F 3
593 594 0 3 Bourke, Miss. Mary female NaN 0 2 364848 7.75 7.75 0 2 3 -2 Miss 0 Bourke F 3
657 658 0 3 Bourke, Mrs. John (Catherine) female 32.0 1 1 364849 15.50 7.75 1 1 3 -1 Mrs 1 Bourke F 3

3 rows × 26 columns

Out[37]:

In [38]:
data['sons'] = data.apply(lambda x : \
                              1 if ((x.Ticket in (['2661', '2668', 'A/5. 851', '4133'])) & (x.SibSp>0)) else 0, axis=1)

    data.sons += data.apply(lambda x : \
                            1 if ((x.Ticket in (['CA. 2343'])) & (x.SibSp>1)) else 0, axis=1)


    data.sons += data.apply(lambda x : \
                            1 if ((x.Ticket in (['W./C. 6607'])) & (x.Personal_Titles not in (['Mr', 'Mrs']))) else 0, axis=1)

    data.sons += data.apply(lambda x: 1 if ((x.Parch>0) & (x.Age>=0) & (x.Age<20)) else 0, axis=1)
    data.sons.loc[data.PassengerId==594] = 1 # Sun with diferente pattern (family with two tickets)
    data.sons.loc[data.PassengerId==1252] = 1 # Case of 'CA. 2343' and last rule
    data.sons.loc[data.PassengerId==1084] = 1 # Case of 'A/5. 851' and last rule
    data.sons.loc[data.PassengerId==1231] = 1 # Case of 'A/5. 851' and last rule

    charts('sons', data[(data.Survived>=0)])
    

               ____________________________ Plots of sons per Survived and Dead: ____________________________
    

We observe that has only 12.1% of sons, and their had better survival rate than others.

Next, we rescue the parents, and check cases where we have both (mother and father), and cases where we have only one aboard.

In [39]:
data['parents'] = data.apply(lambda x : \
                                  1 if ((x.Ticket in (['2661', '2668', 'A/5. 851', '4133'])) & (x.SibSp==0)) else 0, axis=1)

    data.parents += data.apply(lambda x : \
                                  1 if ((x.Ticket in (['CA. 2343'])) & (x.SibSp==1)) else 0, axis=1)

    data.parents += data.apply(lambda x : 1 if ((x.Ticket in (['W./C. 6607'])) & (x.Personal_Titles in (['Mr', 'Mrs']))) \
                                    else 0, axis=1)

    # Identify parents and care nulls ages
    data.parents += data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp>0) & (x.Age>19) & (x.Age<=45) ) else 0, axis=1)
    charts('parents', data[(data.Survived>=0)])
    

               ____________________________ Plots of parents per Survived and Dead: ____________________________
    

In [40]:
data['parent_alone'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp==0) & (x.Age>19) & (x.Age<=45) ) else 0, axis=1)
    charts('parent_alone', data[(data.Survived>=0)])
    

               ____________________________ Plots of parent_alone per Survived and Dead: ____________________________
    

We can notice that the both cases are to similar and it is not significant to has this two information separately.

Before I put them together, as I had learned in assembling the sons, I made a visual inspection and discovered some cases of sons and parents that required different rules for assigning them. As I did visually and this is not a rule for a pipeline, I proceeded with the settings manually.

In [41]:
t_p_alone = data.Ticket[data.parent_alone==1].tolist()

    data[data.Ticket.isin(t_p_alone)].sort_values('Ticket')[96:]

    data.parent_alone.loc[data.PassengerId==141] = 1

    data.parent_alone.loc[data.PassengerId==541] = 0
    data.sons.loc[data.PassengerId==541] = 1

    data.parent_alone.loc[data.PassengerId==1078] = 0
    data.sons.loc[data.PassengerId==1078] = 1

    data.parent_alone.loc[data.PassengerId==98] = 0
    data.sons.loc[data.PassengerId==98] = 1

    data.parent_alone.loc[data.PassengerId==680] = 0
    data.sons.loc[data.PassengerId==680] = 1

    data.parent_alone.loc[data.PassengerId==915] = 0
    data.sons.loc[data.PassengerId==915] = 1

    data.parent_alone.loc[data.PassengerId==333] = 0
    data.sons.loc[data.PassengerId==333] = 1

    data.parent_alone.loc[data.PassengerId==119] = 0
    data.sons[data.PassengerId==119] = 1

    data.parent_alone.loc[data.PassengerId==319] = 0
    data.sons.loc[data.PassengerId==319] = 1

    data.parent_alone.loc[data.PassengerId==103] = 0
    data.sons.loc[data.PassengerId==103] = 1

    data.parents.loc[data.PassengerId==154] = 0
    data.sons.loc[data.PassengerId==1084] = 1

    data.parents.loc[data.PassengerId==581] = 0
    data.sons.loc[data.PassengerId==581] = 1

    data.parent_alone.loc[data.PassengerId==881] = 0
    data.sons.loc[data.PassengerId==881] = 1

    data.parent_alone.loc[data.PassengerId==1294] = 0
    data.sons.loc[data.PassengerId==1294] = 1

    data.parent_alone.loc[data.PassengerId==378] = 0
    data.sons.loc[data.PassengerId==378] = 1

    data.parent_alone.loc[data.PassengerId==167] = 1
    data.parent_alone.loc[data.PassengerId==357] = 0
    data.sons.loc[data.PassengerId==357] = 1

    data.parent_alone.loc[data.PassengerId==918] = 0
    data.sons.loc[data.PassengerId==918] = 1

    data.parent_alone.loc[data.PassengerId==1042] = 0
    data.sons.loc[data.PassengerId==1042] = 1

    data.parent_alone.loc[data.PassengerId==540] = 0
    data.sons.loc[data.PassengerId==540] = 1

    data.parents += data.parent_alone 
    charts('parents', data[(data.Survived>=0)])
    

               ____________________________ Plots of parents per Survived and Dead: ____________________________
    

Next, we rescue the grandparents and grandparents alone. We found the same situations with less cases and decided put all parents and grandparents in one feature and leave to age distinguish them.

In [42]:
data['grandparents'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp>0) & (x.Age>19) & (x.Age>45) ) else 0, axis=1)
    charts('grandparents', data[(data.Survived>=0)])
    

               ____________________________ Plots of grandparents per Survived and Dead: ____________________________
    

In [43]:
data['grandparent_alone'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp==0) & (x.Age>45) ) else 0, axis=1)
    charts('grandparent_alone', data[(data.Survived>=0)])
    

               ____________________________ Plots of grandparent_alone per Survived and Dead: ____________________________
    

In [44]:
data.parents += data.grandparent_alone + data.grandparents
    charts('parents', data[(data.Survived>=0)])
    

               ____________________________ Plots of parents per Survived and Dead: ____________________________
    

Next, we identify the relatives aboard:

In [45]:
data['relatives'] = data.apply(lambda x: 1 if ((x.SibSp>0) & (x.Parch==0)) else 0, axis=1)
    charts('relatives', data[(data.Survived>=0)])
    

               ____________________________ Plots of relatives per Survived and Dead: ____________________________
    

And then, the companions, persons who traveled with a family but do not have family relationship with them.

In [46]:
data['companions'] = data.apply(lambda x: 1 if ((x.SibSp==0) & (x.Parch==0) & (x.same_tckt==1)) else 0, axis=1)
    charts('companions', data[(data.Survived>=0)])
    

               ____________________________ Plots of companions per Survived and Dead: ____________________________
    

Finally, we rescue the passengers that traveled alone.

In [47]:
data['alone'] = data.apply(lambda x: 1 if ((x.SibSp==0) & (x.Parch==0) & (x.same_tckt==0)) else 0, axis=1)
    charts('alone', data[(data.Survived>=0)])
    

               ____________________________ Plots of alone per Survived and Dead: ____________________________
    

As we can see, people with a family relationship, even if only as companions, had better survival rates and very close, than people who traveled alone.

Now we can work on issues of nulls ages and then on own information of age.

Age

We start with the numbers of nulls case by survived to remember that is too high.

Then, we plot the distributions of Ages, to check how is fit into the normal and see the distortions when apply a unique value (zero) to the nulls cases.

Next, we made the scatter plot of Ages and siblings, and see hat age decreases with the increase in the number of siblings, but with a great range

In [48]:
fig = plt.figure(figsize=(20, 10))
    fig1 = fig.add_subplot(221)
    g = sns.distplot(data.Age.fillna(0), fit=norm, label='Nulls as Zero')
    g = sns.distplot(data[~data.Age.isnull()].Age, fit=norm, label='Withou Nulls')
    plt.legend(loc='upper right')
    print('Survived without Age:')
    display(data[data.Age.isnull()].Survived.value_counts())
    fig2 = fig.add_subplot(222)
    g = sns.scatterplot(data = data[(~data.Age.isnull())], y='Age', x='SibSp',  hue='Survived')
    

Survived without Age:
    

 0    125
    -1     86
     1     52
    Name: Survived, dtype: int64

From the tables below, we can see that our enforce to get Personal Titles and rescue family relationships produce better medians to apply on nulls ages.

In [49]:
print('Mean and median ages by siblings:')
    data.loc[data.Age.isnull(), 'Age'] = -1
    display(data.loc[(data.Age>=0), ['SibSp', 'Age']].groupby('SibSp').agg([np.mean, np.median]).T)

    print('\nMedian ages by Personal_Titles:')
    Ages = { 'Age' : {'median'}}
    display(data[data.Age>=0][['Age', 'Personal_Titles', 'parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone']].\
            groupby('Personal_Titles').agg(Ages).T)

    print('\nMedian ages by Personal Titles and Family Relationships:')
    display(pd.pivot_table(data[data.Age>=0][['Age', 'Personal_Titles', 'parents', 'grandparents', 
                                              'sons', 'relatives', 'companions','alone']],
                           aggfunc=np.median, 
                           index=['parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone'] , 
                           columns=['Personal_Titles']))

    print('\nNulls ages by Personal Titles and Family Relationships:')
    display(data[data.Age<0][['Personal_Titles', 'parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone']].\
            groupby('Personal_Titles').agg([sum]))
    

Mean and median ages by siblings:
    

SibSp 0 1 2 3 4 5 8
Age mean 30.921766 31.058071 23.569444 16.3125 8.772727 10.166667 14.5
median 28.000000 30.000000 21.500000 14.5000 7.000000 10.500000 14.5

    Median ages by Personal_Titles:
    

Personal_Titles Kid Miss Mr Mrs Royalty Technical
Age median 4.0 22.0 29.0 35.0 40.0 49.5

    Median ages by Personal Titles and Family Relationships:
    

Age
Personal_Titles Kid Miss Mr Mrs Royalty Technical
parents grandparents sons relatives companions alone
0 0 0 0 0 1 NaN 23.0 28.75 36.0 39.0 50.0
1 0 NaN 30.0 28.00 40.0 33.0 NaN
1 0 0 12.0 23.0 30.00 33.0 48.5 48.5
1 0 0 0 4.0 8.5 18.00 19.0 NaN NaN
1 0 0 0 0 0 NaN 24.0 38.00 35.5 NaN 29.0
1 0 0 0 0 NaN NaN 56.00 50.0 NaN 61.5

    Nulls ages by Personal Titles and Family Relationships:
    

parents grandparents sons relatives companions alone
sum sum sum sum sum sum
Personal_Titles
Kid 0 0 8 0 0 1
Miss 0 0 10 7 3 30
Mr 2 0 3 16 20 135
Mrs 7 0 0 11 3 7
Technical 0 0 0 0 0 1

So, we apply to the nulls ages the respectively median of same personal title and same family relationship, but before, we create a binary feature to maintain the information of the presence of nulls.

In [50]:
data['Without_Age'] = data.Age.apply(lambda x: 0 if x>0 else 1)

    data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Miss')] = \
       data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Miss')].median()

    data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Mr')] = \
       data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Mr')].median()

    data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Mrs')] = \
       data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Mrs')].median()

    data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Kid')] = \
       data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Kid')].median()

    data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Technical')] = \
       data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Technical')].median()

    data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Miss')] = \
       data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Miss')].median()

    data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Mr')] = \
       data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Mr')].median()

    data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Mrs')] = \
       data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Mrs')].median()

    data.Age.loc[(data.Age<0) & (data.parents==1) & (data.Personal_Titles=='Mr')] = \
       data.Age[(data.Age>=0) & (data.parents==1) & (data.Personal_Titles=='Mr')].median()

    data.Age.loc[(data.Age<0) & (data.parents==1) & (data.Personal_Titles=='Mrs')] = \
       data.Age[(data.Age>=0) & (data.parents==1) & (data.Personal_Titles=='Mrs')].median()

    data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Kid')] = \
       data.Age[(data.Age>=0) & (data.Personal_Titles=='Kid')].median()
    data.Age.loc[(data.Age.isnull()) & (data.sons==1) & (data.Personal_Titles=='Kid')] = \
       data.Age[(data.Age>=0) & (data.Personal_Titles=='Kid')].median()

    data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Miss')] = \
       data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Miss')].median()

    data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Mr')] = \
       data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Mr')].median()

    data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Mrs')] = \
       data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Mrs')].median()

    data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Miss')] = \
       data.Age[(data.Age>=0) & (data.relatives==1) & (data.Personal_Titles=='Miss')].median()

    data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Mr')] = \
       data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Mr')].median()

    data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Mrs')] = \
       data.Age[(data.Age>=0) & (data.relatives==1) & (data.Personal_Titles=='Mrs')].median()
    

Finally, we check how age distribution lines after fill the nulls.

In [51]:
print('Age correlation with survived:',data.corr()['Survived'].Age)
    g = sns.distplot(data.Age, fit=norm, label='With nulls filled')
    plt.legend(loc='upper right')
    plt.show()
    

Age correlation with survived: -0.04046792444732172
    

To have better understanding of age, its proportion and its relation to survival ration, we binning it as follow

In [52]:
def binningAge(df):
        # Binning Age based on custom ranges
        bin_ranges = [0, 1.7, 8, 15, 18, 25, 55, 65, 100] 
        bin_names = [0, 1, 2, 3, 4, 5, 6, 7]
        df['Age_bin_custom_range'] = pd.cut(np.array(df.Age), bins=bin_ranges)
        df['Age_bin_custom_label'] = pd.cut(np.array(df.Age), bins=bin_ranges, labels=bin_names)
        return df

    data = binningAge(data)
    display(data[['Age', 'Age_bin_custom_range', 'Age_bin_custom_label']].sample(5))
    display(pd.pivot_table(data[['Age_bin_custom_range', 'Survived']], aggfunc=np.count_nonzero, 
                           index=['Survived'] , columns=['Age_bin_custom_range']))
    

Age Age_bin_custom_range Age_bin_custom_label
485 8.50 (8.0, 15.0] 2
1077 21.00 (18.0, 25.0] 4
754 48.00 (25.0, 55.0] 5
1282 51.00 (25.0, 55.0] 5
1275 28.75 (25.0, 55.0] 5

Age_bin_custom_range (0.0, 1.7] (1.7, 8.0] (8.0, 15.0] (15.0, 18.0] (18.0, 25.0] (25.0, 55.0] (55.0, 65.0] (65.0, 100.0]
Survived
-1 16 28 30 58 200 466 34 4
0 2 18 24 46 114 317 21 7
1 24 52 28 44 146 366 22 2

In [53]:
charts('Age_bin_custom_label', data[(data.Survived>=0)])
    

               ____________________________ Plots of Age_bin_custom_label per Survived and Dead: ____________________________
    

One hot encode and drop provisory and useless features

One hot encode categorical and non ordinal data and drop useless features.

In [54]:
data['genre'] = data.Sex.apply(lambda x: 1 if x=='male' else 0)
    data.drop(['Name', 'Cabin', 'Ticket', 'Sex', 'same_tckt', 'qtd_same_ticket', 'parent_alone', 'grandparents', 
               'grandparent_alone', 'Age_bin_custom_range'], axis=1, inplace=True) # , 'Age', 'Parch', 'SibSp',
    data = pd.get_dummies(data, columns = ['Cabin_Letter', 'Personal_Titles', 'Embarked', 'distinction_in_tikect'])

    data = pd.get_dummies(data, columns = ['surname']) # 'Age_bin_custom_label'
    data.drop(['surname_Exclude'], axis=1, inplace=True)
    

Scipy‘s pearson method computes both the correlation and p-value for the correlation, roughly showing the probability of an uncorrelated system creating a correlation value of this magnitude.
import numpy as np
from scipy.stats import pearson
pearson(data.loc[:, ‘Pclass’], data.Survived)

Select Features

All of the features we find in the dataset might not be useful in building a machine learning model to make the necessary prediction. Using some of the features might even make the predictions worse.

Often in data science we have hundreds or even millions of features and we want a way to create a model that only includes the most important features. This has three benefits.

  1. It reduces the variance of the model, and therefore overfitting.
  2. It reduces the complexity of a model and makes it easier to interpret.
  3. It improves the accuracy of a model if the right subset is chosen.
  4. Finally, it reduces the computational cost (and time) of training a model.

So, an alternative way to reduce the complexity of the model and avoid overfitting is dimensionality reduction via feature selection, which is especially useful for unregularized models. There are two main categories of dimensionality reduction techniques: feature selection and feature extraction. Using feature selection, we select a subset of the original features. In feature extraction, we derive information from the feature set to construct a new feature subspace.

Exist various methodologies and techniques that you can use to subset your feature space and help your models perform better and efficiently. So, let’s get started.

First check for any correlations between features

Correlation is a statistical term which in common usage refers to how close two features are to having a linear relationship with each other. The Pearson’s correlation which measures linear correlation between two features, the resulting value lies in [-1;1], with -1 meaning perfect negative correlation (as one feature increases, the other decreases), +1 meaning perfect positive correlation and 0 meaning no linear correlation between the two features.

Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of the two features.

There are five assumptions that are made with respect to Pearson’s correlation:

  • The feature must be either interval or ratio measurements.
  • The variables must be approximately normally distributed.
  • There is a linear relationship between the two variables.
  • Outliers are either kept to a minimum or are removed entirely
  • There is homoscedasticity of the data. Homoscedasticity basically means that the variances along the line of best fit remain similar as you move along the line.

One obvious drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to a linear relationship. If the relation is non-linear, Pearson correlation can be close to zero even if there is a 1-1 correspondence between the two variables. For example, correlation between x and x2 is zero, when x is centered on 0.

Furthermore, relying only on the correlation value on interpreting the relationship of two variables can be highly misleading, so it is always worth plotting the data as we did on the EDA phase.

The following guidelines interpreting Pearson’s correlation coefficient (r):

Strength of Association r Positive r Negative
Small .1 to .3 -0.1 to -0.3
Medium .3 to .5 -0.3 to -0.5
Large .5 to 1.0 -0.5 to -1.0

The correlation matrix is identical to a covariance matrix computed from standardized data. The correlation matrix is a square matrix that contains the Pearson product-moment correlation coefficients (often abbreviated as Pearson’s r), which measure the linear dependence between pairs of features. Pearson’s correlation coefficient can simply be calculated as the covariance between two features x and y (numerator) divided by the product of their standard deviations (denominator):

The covariance between standardized features is in fact equal to their linear correlation coefficient.

Let’s check what are the highest correlations with survived, I will now create a correlation matrix to quantify the linear relationship between the features. To do this I use NumPy’s corrcoef and seaborn’s heatmap function to plot the correlation matrix array as a heat map.

In [55]:
corr = data.loc[:, 'Survived':].corr()
    top_corr_cols = corr[abs(corr.Survived)>=0.06].Survived.sort_values(ascending=False).keys()
    top_corr = corr.loc[top_corr_cols, top_corr_cols]
    dropSelf = np.zeros_like(top_corr)
    dropSelf[np.triu_indices_from(dropSelf)] = True
    plt.figure(figsize=(15, 15))
    sns.heatmap(top_corr, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=True, fmt=".2f", mask=dropSelf)
    sns.set(font_scale=0.8)
    plt.show()
    

Let’s see we have more surnames between 0.05 and 0.06 of correlation wit Survived.

In [56]:
display(corr[(abs(corr.Survived)>=0.05) & (abs(corr.Survived)<0.06)].Survived.sort_values(ascending=False).keys())
    del corr, dropSelf, top_corr
    

Index(['surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
           'surname_Moor', 'surname_Chambers', 'surname_Hamalainen',
           'surname_Dick', 'surname_Taylor', 'surname_Doling', 'surname_Gordon',
           'surname_Beane', 'surname_Hippach', 'surname_Bishop',
           'surname_Mellinger', 'surname_Yarred', 'Personal_Titles_Royalty'],
          dtype='object')

Drop the features with highest correlations to other Features:

Colinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. And as you see above, it is easy to find highest collinearities (Personal_Titles_Mrs, Personal_Titles_Mr and Fare.

You should always be concerned about the collinearity, regardless of the model/method being linear or not, or the main task being prediction or classification.

Assume a number of linearly correlated covariates/features present in the data set and Random Forest as the method. Obviously, random selection per node may pick only (or mostly) collinear features which may/will result in a poor split, and this can happen repeatedly, thus negatively affecting the performance.

Now, the collinear features may be less informative of the outcome than the other (non-collinear) features and as such they should be considered for elimination from the feature set anyway. However, assume that the features are ranked high in the ‘feature importance’ list produced by RF. As such they would be kept in the data set unnecessarily increasing the dimensionality. So, in practice, I’d always, as an exploratory step (out of many related) check the pairwise association of the features, including linear correlation.

Identify and treat multicollinearity:

Multicollinearity is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model, leading to unreliable and unstable estimates of regression coefficients. To make matters worst multicollinearity can emerge even when isolated pairs of variables are not collinear.

To identify, we need start with the coefficient of determination, r2, is the square of the Pearson correlation coefficient r. The coefficient of determination, with respect to correlation, is the proportion of the variance that is shared by both variables. It gives a measure of the amount of variation that can be explained by the model (the correlation is the model). It is sometimes expressed as a percentage (e.g., 36% instead of 0.36) when we discuss the proportion of variance explained by the correlation. However, you should not write r2 = 36%, or any other percentage. You should write it as a proportion (e.g., r2 = 0.36).

Already the Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. It is may be calculated for each predictor by doing a linear regression of that predictor on all the other predictors, and then obtaining the R2 from that regression. It is calculated by taking the the ratio of the variance of all a given model’s betas divide by the variance of a single beta if it were fit alone [1/(1-R2)]. Thus, a VIF of 1.8 tells us that the variance (the square of the standard error) of a particular coefficient is 80% larger than it would be if that predictor was completely uncorrelated with all the other predictors. The VIF has a lower bound of 1 but no upper bound. Authorities differ on how high the VIF has to be to constitute a problem (e.g.: 2.50 (R2 equal to 0.6), sometimes 5 (R2 equal to .8), or greater than 10 (R2 equal to 0.9) and so on).

But there are several situations in which multicollinearity can be safely ignored:

  • Interaction terms and higher-order terms (e.g., squared and cubed predictors) are correlated with main effect terms because they include the main effects terms. Ops! Sometimes we use polynomials to solve problems, indeed! But keep calm, in these cases, standardizing the predictors can removed the multicollinearity.
  • Indicator, like dummy or one-hot-encode, that represent a categorical variable with three or more categories. If the proportion of cases in the reference category is small, the indicator will necessarily have high VIF’s, even if the categorical is not associated with other variables in the regression model. But, you need check if some dummy is collinear or has multicollinearity with other features outside of their dummies.
  • *Control feature if the feature of interest do not have high VIF’s. Here’s the thing about multicollinearity: it’s only a problem for the features that are collinear. It increases the standard errors of their coefficients, and it may make those coefficients unstable in several ways. But so long as the collinear feature are only used as control feature, and they are not collinear with your feature of interest, there’s no problem. The coefficients of the features of interest are not affected, and the performance of the control feature as controls is not impaired.

So, generally, we could run the same model twice, once with severe multicollinearity and once with moderate multicollinearity. This provides a great head-to-head comparison and it reveals the classic effects of multicollinearity. However, when standardizing your predictors doesn’t work, you can try other solutions such as:

  • removing highly correlated predictors
  • linearly combining predictors, such as adding them together
  • running entirely different analyses, such as partial least squares regression or principal components analysis

When considering a solution, keep in mind that all remedies have potential drawbacks. If you can live with less precise coefficient estimates, or a model that has a high R-squared but few significant predictors, doing nothing can be the correct decision because it won’t impact the fit.

Given the potential for correlation among the predictors, we’ll have display the variance inflation factors (VIF), which indicate the extent to which multicollinearity is present in a regression analysis. Hence such variables need to be removed from the model. Deleting one variable at a time and then again checking the VIF for the model is the best way to do this.

So, I start the analysis removed the 3 features with he highest collinearities and the surnames different from my control surname_Alone and correlation with survived less then 0.05, and run VIF.

In [57]:
#Step 1: Remove the higest correlations and run a multiple regression
    cols = [ 'family',
             'non_relatives',
             'surname_Alone',
             'surname_Baclini',
             'surname_Carter',
             'surname_Richards',
             'surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
             'surname_Moor', 'surname_Chambers', 'surname_Hamalainen',
             'surname_Dick', 'surname_Taylor', 'surname_Doling', 'surname_Gordon',
             'surname_Beane', 'surname_Hippach', 'surname_Bishop',
             'surname_Mellinger', 'surname_Yarred', 
             'Pclass',
             'Age',
             'SibSp',
             'Parch',
             #'Fare',
             'qtd_same_ticket_bin',
             'passenger_fare',
             #'SibSp_bin',
             #'Parch_bin',
             'distinction_in_name',
             'Cabin_Number',
             'sons',
             'parents',
             'relatives',
             'companions',
             'alone',
             'Without_Age',
             'Age_bin_custom_label',
             'genre',
             'Cabin_Letter_A',
             'Cabin_Letter_B',
             'Cabin_Letter_C',
             'Cabin_Letter_D',
             'Cabin_Letter_E',
             'Cabin_Letter_F',
             'Cabin_Letter_G',
             'Personal_Titles_Kid',
             'Personal_Titles_Miss',
             #'Personal_Titles_Mr',
             #'Personal_Titles_Mrs',
             'Personal_Titles_Royalty',
             'Personal_Titles_Technical',
             'Embarked_C',
             'Embarked_Q',
             'Embarked_S',
             'distinction_in_tikect_High',
             'distinction_in_tikect_Low',
             'distinction_in_tikect_Others',
             'distinction_in_tikect_PC'
    ]

    y_train = data.Survived[data.Survived>=0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    features = "+".join(cols)
    df2 = pd.concat([y_train, df], axis=1)

    # get y and X dataframes based on this regression:
    y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')

    #Step 2: Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    #Step 3: Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    

VIF Factor features
0 1.000000 Intercept
17 1.027080 surname_Beane
20 1.039424 surname_Mellinger
15 1.048763 surname_Doling
12 1.056317 surname_Hamalainen
14 1.061651 surname_Taylor
11 1.063774 surname_Chambers
9 1.071730 surname_Goldenberg
10 1.072810 surname_Moor
8 1.081285 surname_Beckwith
13 1.082148 surname_Dick
18 1.083090 surname_Hippach
19 1.091573 surname_Bishop
6 1.091714 surname_Richards
7 1.097315 surname_Harper
21 1.101053 surname_Yarred
5 1.107265 surname_Carter
48 1.121313 Personal_Titles_Technical
4 1.133671 surname_Baclini
35 1.382636 Without_Age
29 1.594563 Cabin_Number
47 1.768952 Personal_Titles_Royalty
16 1.865907 surname_Gordon
45 2.156407 Personal_Titles_Kid
3 2.201757 surname_Alone
28 2.954703 distinction_in_name
27 3.750101 passenger_fare
22 4.436090 Pclass
23 4.742809 Age
46 5.008117 Personal_Titles_Miss
37 6.403387 genre
36 7.591511 Age_bin_custom_label
2 7.644149 non_relatives
26 20.666651 qtd_same_ticket_bin
50 39.868704 Embarked_Q
49 74.967792 Embarked_C
51 97.344161 Embarked_S
54 inf distinction_in_tikect_Others
53 inf distinction_in_tikect_Low
52 inf distinction_in_tikect_High
1 inf family
44 inf Cabin_Letter_G
43 inf Cabin_Letter_F
41 inf Cabin_Letter_D
40 inf Cabin_Letter_C
39 inf Cabin_Letter_B
38 inf Cabin_Letter_A
34 inf alone
33 inf companions
32 inf relatives
31 inf parents
30 inf sons
24 inf SibSp
25 inf Parch
42 inf Cabin_Letter_E
55 inf distinction_in_tikect_PC

From the results, I conclude that can safe maintain the dummies of Embarked, but need work in the remaining features where’s the VIF stated as inf. You can see that surname Alone has a VIF of 2.2. We’re going to treat it as our baseline and exclude it from our fit. This is done to prevent multicollinearity, or the dummy variable trap caused by including a dummy variable for every single category. let’s try remove the dummy alone, that is pretty similar, and check if it solves the other dummies from its category:

In [58]:
#Step 1: Remove one feature with VIF on Inf from the same category and run a multiple regression
    cols.remove('alone')

    y_train = data.Survived[data.Survived>=0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    features = "+".join(cols)
    df2 = pd.concat([y_train, df], axis=1)

    # get y and X dataframes based on this regression:
    y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')

    #Step 2: Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    #Step 3: Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    

VIF Factor features
0 1.000000 Intercept
17 1.027080 surname_Beane
20 1.039424 surname_Mellinger
15 1.048763 surname_Doling
12 1.056317 surname_Hamalainen
14 1.061651 surname_Taylor
11 1.063774 surname_Chambers
9 1.071730 surname_Goldenberg
10 1.072810 surname_Moor
8 1.081285 surname_Beckwith
13 1.082148 surname_Dick
18 1.083090 surname_Hippach
19 1.091573 surname_Bishop
6 1.091714 surname_Richards
7 1.097315 surname_Harper
21 1.101053 surname_Yarred
5 1.107265 surname_Carter
47 1.121313 Personal_Titles_Technical
4 1.133671 surname_Baclini
34 1.382636 Without_Age
29 1.594563 Cabin_Number
46 1.768952 Personal_Titles_Royalty
16 1.865907 surname_Gordon
44 2.156407 Personal_Titles_Kid
33 2.185638 companions
3 2.201757 surname_Alone
28 2.954703 distinction_in_name
32 3.021953 relatives
27 3.750101 passenger_fare
22 4.436090 Pclass
31 4.540579 parents
23 4.742809 Age
45 5.008117 Personal_Titles_Miss
30 6.128749 sons
36 6.403387 genre
35 7.591511 Age_bin_custom_label
2 7.644149 non_relatives
26 20.666651 qtd_same_ticket_bin
49 39.868704 Embarked_Q
48 74.967792 Embarked_C
50 97.344161 Embarked_S
52 inf distinction_in_tikect_Low
51 inf distinction_in_tikect_High
1 inf family
41 inf Cabin_Letter_E
42 inf Cabin_Letter_F
53 inf distinction_in_tikect_Others
40 inf Cabin_Letter_D
39 inf Cabin_Letter_C
38 inf Cabin_Letter_B
37 inf Cabin_Letter_A
24 inf SibSp
25 inf Parch
43 inf Cabin_Letter_G
54 inf distinction_in_tikect_PC

To solve Cabin Letter, we can try remove only the lowest frequency ‘A’, and see if we can accept the VIF’s of others Cabins:

In [59]:
#Step 1: Remove one feature with VIF on Inf from the same category and run a multiple regression
    cols.remove('Cabin_Letter_A')

    y_train = data.Survived[data.Survived>=0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    features = "+".join(cols)
    df2 = pd.concat([y_train, df], axis=1)

    # get y and X dataframes based on this regression:
    y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')

    #Step 2: Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    #Step 3: Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    

VIF Factor features
0 1.000000 Intercept
17 1.027080 surname_Beane
20 1.039424 surname_Mellinger
15 1.048763 surname_Doling
12 1.056317 surname_Hamalainen
14 1.061651 surname_Taylor
11 1.063774 surname_Chambers
9 1.071730 surname_Goldenberg
10 1.072810 surname_Moor
8 1.081285 surname_Beckwith
13 1.082148 surname_Dick
18 1.083090 surname_Hippach
19 1.091573 surname_Bishop
6 1.091714 surname_Richards
7 1.097315 surname_Harper
21 1.101053 surname_Yarred
5 1.107265 surname_Carter
46 1.121313 Personal_Titles_Technical
4 1.133671 surname_Baclini
34 1.382636 Without_Age
29 1.594563 Cabin_Number
45 1.768952 Personal_Titles_Royalty
16 1.865907 surname_Gordon
43 2.156407 Personal_Titles_Kid
33 2.185638 companions
3 2.201757 surname_Alone
28 2.954703 distinction_in_name
32 3.021953 relatives
27 3.750101 passenger_fare
22 4.436090 Pclass
39 4.520122 Cabin_Letter_D
31 4.540579 parents
23 4.742809 Age
37 4.871771 Cabin_Letter_B
44 5.008117 Personal_Titles_Miss
38 5.836347 Cabin_Letter_C
30 6.128749 sons
36 6.403387 genre
35 7.591511 Age_bin_custom_label
2 7.644149 non_relatives
42 9.446441 Cabin_Letter_G
41 16.107676 Cabin_Letter_F
40 17.938593 Cabin_Letter_E
26 20.666651 qtd_same_ticket_bin
48 39.868704 Embarked_Q
47 74.967792 Embarked_C
49 97.344161 Embarked_S
25 inf Parch
24 inf SibSp
1 inf family
50 inf distinction_in_tikect_High
51 inf distinction_in_tikect_Low
52 inf distinction_in_tikect_Others
53 inf distinction_in_tikect_PC

Now our focus is on distinct in name, since “High” has less observations, let’s try dropped it and drop the bins of Parch and SibSp.

In [60]:
cols.remove('distinction_in_tikect_High')

    y_train = data.Survived[data.Survived>=0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    features = "+".join(cols)
    df2 = pd.concat([y_train, df], axis=1)

    # get y and X dataframes based on this regression:
    y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')

    #Step 2: Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    #Step 3: Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    

VIF Factor features
0 1.000000 Intercept
17 1.027080 surname_Beane
20 1.039424 surname_Mellinger
15 1.048763 surname_Doling
12 1.056317 surname_Hamalainen
14 1.061651 surname_Taylor
11 1.063774 surname_Chambers
9 1.071730 surname_Goldenberg
10 1.072810 surname_Moor
8 1.081285 surname_Beckwith
13 1.082148 surname_Dick
18 1.083090 surname_Hippach
19 1.091573 surname_Bishop
6 1.091714 surname_Richards
7 1.097315 surname_Harper
21 1.101053 surname_Yarred
5 1.107265 surname_Carter
46 1.121313 Personal_Titles_Technical
4 1.133671 surname_Baclini
34 1.382636 Without_Age
29 1.594563 Cabin_Number
45 1.768952 Personal_Titles_Royalty
16 1.865907 surname_Gordon
43 2.156407 Personal_Titles_Kid
33 2.185638 companions
3 2.201757 surname_Alone
28 2.954703 distinction_in_name
32 3.021953 relatives
50 3.271151 distinction_in_tikect_Low
27 3.750101 passenger_fare
52 4.229762 distinction_in_tikect_PC
22 4.436090 Pclass
39 4.520122 Cabin_Letter_D
31 4.540579 parents
23 4.742809 Age
37 4.871771 Cabin_Letter_B
44 5.008117 Personal_Titles_Miss
51 5.273404 distinction_in_tikect_Others
38 5.836347 Cabin_Letter_C
30 6.128749 sons
36 6.403387 genre
35 7.591511 Age_bin_custom_label
2 7.644149 non_relatives
42 9.446441 Cabin_Letter_G
41 16.107676 Cabin_Letter_F
40 17.938593 Cabin_Letter_E
26 20.666651 qtd_same_ticket_bin
48 39.868704 Embarked_Q
47 74.967792 Embarked_C
49 97.344161 Embarked_S
24 inf SibSp
25 inf Parch
1 inf family

As we can see, we now have to remove one between family, parch and SibSp. Note that non_relatives qtd_same_ticket_bin are already with relatively acceptable VIF’s. The first is directly calculated from the family and the second is very close as we have seen. So let’s discard the family.

In [61]:
cols.remove('family')

    y_train = data.Survived[data.Survived>=0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    features = "+".join(cols)
    df2 = pd.concat([y_train, df], axis=1)

    # get y and X dataframes based on this regression:
    y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')

    #Step 2: Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    #Step 3: Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    

VIF Factor features
0 1.000000 Intercept
16 1.027080 surname_Beane
19 1.039424 surname_Mellinger
14 1.048763 surname_Doling
11 1.056317 surname_Hamalainen
13 1.061651 surname_Taylor
10 1.063774 surname_Chambers
8 1.071730 surname_Goldenberg
9 1.072810 surname_Moor
7 1.081285 surname_Beckwith
12 1.082148 surname_Dick
17 1.083090 surname_Hippach
18 1.091573 surname_Bishop
5 1.091714 surname_Richards
6 1.097315 surname_Harper
20 1.101053 surname_Yarred
4 1.107265 surname_Carter
45 1.121313 Personal_Titles_Technical
3 1.133671 surname_Baclini
33 1.382636 Without_Age
28 1.594563 Cabin_Number
44 1.768952 Personal_Titles_Royalty
15 1.865907 surname_Gordon
42 2.156407 Personal_Titles_Kid
32 2.185638 companions
2 2.201757 surname_Alone
27 2.954703 distinction_in_name
31 3.021953 relatives
49 3.271151 distinction_in_tikect_Low
26 3.750101 passenger_fare
51 4.229762 distinction_in_tikect_PC
21 4.436090 Pclass
38 4.520122 Cabin_Letter_D
30 4.540579 parents
22 4.742809 Age
36 4.871771 Cabin_Letter_B
23 4.913654 SibSp
43 5.008117 Personal_Titles_Miss
50 5.273404 distinction_in_tikect_Others
37 5.836347 Cabin_Letter_C
29 6.128749 sons
35 6.403387 genre
24 6.585030 Parch
34 7.591511 Age_bin_custom_label
1 7.644149 non_relatives
41 9.446441 Cabin_Letter_G
40 16.107676 Cabin_Letter_F
39 17.938593 Cabin_Letter_E
25 20.666651 qtd_same_ticket_bin
47 39.868704 Embarked_Q
46 74.967792 Embarked_C
48 97.344161 Embarked_S

Yea, we can accept, and we can proceed to the next step.

Feature Selection by Filter Methods

Filter methods use statistical methods for evaluation of a subset of features, they are generally used as a preprocessing step. These methods are also known as univariate feature selection, they examines each feature individually to determine the strength of the relationship of the feature with the dependent variable. These methods are simple to run and understand and are in general particularly good for gaining a better understanding of data, but not necessarily for optimizing the feature set for better generalization.

So, the features are selected on the basis of their scores in various statistical tests for their correlation with the outcome variable. The correlation is a subjective term here. For basic guidance, you can refer to the following table for defining correlation co-efficients.

Feature/Response Continuous Categorical
Continuous Pearson’s Correlation LDA
Categorical Anova Chi-Square

One thing that should be kept in mind is that filter methods do not remove multicollinearity. So, you must deal with multicollinearity of features as well before training models for your data.

There are lot of different options for univariate selection. Some examples are:

  • Model Based Ranking
  • Mutual information and maximal information coefficient (MIC).

I did not approach the latter, because there has been some critique about MIC’s statistical power, i.e. the ability to reject the null hypothesis when the null hypothesis is false. This may or may not be a concern, depending on the particular dataset and its noisiness. If you have interest on this, in python, MIC is available in the minepy library.

Feature Selection by Model based ranking

We can use an arbitrary machine learning method to build a predictive model for the response variable using each individual feature, and measure the performance of each model.

In fact, this is already put to use with Pearson’s correlation coefficient, since it is equivalent to standardized regression coefficient that is used for prediction in linear regression. But this method it is not good to select features with non-linear relation to dependent variable. For this, there are a number of alternatives, for example tree based methods (decision trees, random forest), linear model with basis expansion etc. Tree based methods are probably among the easiest to apply, since they can model non-linear relations well and don’t require much tuning. The main thing to avoid is overfitting, so the depth of tree(s) should be relatively small, and cross-validation should be applied.

In [62]:
rf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=101)
    scores = []
    for i in range(df.shape[1]):
         score = cross_val_score(rf, df.iloc[:, i:i+1], y_train, scoring="accuracy", cv=10)
         scores.append((round(np.mean(score), 3), cols[i]))
    MBR = pd.DataFrame(sorted(scores, reverse=True), columns=['Score', 'Feature'])
    g = MBR.iloc[:15, :].plot(x='Feature', kind='barh', figsize=(20,10), fontsize=12, grid=True)
    plt.show()
    MBR = MBR.iloc[:15, 1]
    

Feature Selection by SelectKBest:

On scikit-learn we find variety of implementation oriented to classifications tasks to select features according to the k highest scores, see below some of that:

  • f_classif compute the ANOVA F-value for the provided sample.
  • chi2 compute chi-squared stats between each non-negative feature and class.
  • mutual_info_classif estimate mutual information for a discrete target variable.

The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.

Other important point is if you use sparse data, for example if we continue consider hot-encode of surnames, chi2 and mutual_info_classif will deal with the data without making it dense.

Let’s see the SelectKBest of f_classif and chi2 for our data:

In [63]:
cols = pd.Index(cols)

    skb = SelectKBest(score_func=f_classif, k=10)
    skb.fit(df, y_train)

    select_features_kbest = skb.get_support()
    feature_f_clas = cols[select_features_kbest]
    feature_f_clas_scores = [(item, score) for item, score in zip(cols, skb.scores_)]
    print('Total features slected by f_classif Statistical Methods',len(feature_f_clas))
    fig = plt.figure(figsize=(20,7))
    f1 = fig.add_subplot(121)
    g = pd.DataFrame(sorted(feature_f_clas_scores, key=lambda x: -x[1])[:len(feature_f_clas)], columns=['Feature','F-Calss Score']).\
    plot(x='Feature', kind='barh', title= 'F Class Score', fontsize=18, ax=f1, grid=True)

    scale = MinMaxScaler()
    df2 = scale.fit_transform(data.loc[data.Survived>=0, cols])
    skb = SelectKBest(score_func=chi2, k=10)
    skb.fit(df2, y_train)
    select_features_kbest = skb.get_support()
    feature_chi2 = cols[select_features_kbest]
    feature_chi2_scores = [(item, score) for item, score in zip(cols, skb.scores_)]
    print('Total features slected by chi2 Statistical Methods',len(feature_chi2))
    f2 = fig.add_subplot(122)
    g = pd.DataFrame(sorted(feature_chi2_scores, key=lambda x: -x[1])[:len(feature_chi2)], columns=['Feature','Chi2 Score']).\
    plot(x='Feature', kind='barh',  title= 'Chi2 Score', fontsize=18, ax=f2, grid=True)

    SMcols = set(feature_f_clas).union(set(feature_chi2))
    print("Extra features select by f_class:\n", set(feature_f_clas).difference(set(feature_chi2)), '\n')
    print("Extra features select by chi2:\n", set(feature_chi2).difference(set(feature_f_clas)), '\n')
    print("Intersection features select by f_class and chi2:\n",set(feature_f_clas).intersection(set(feature_chi2)), '\n')
    print('Total number of features selected:', len(SMcols))
    print(SMcols)

    plt.tight_layout(); plt.show()
    

Total features slected by f_classif Statistical Methods 10
    Total features slected by chi2 Statistical Methods 10
    Extra features select by f_class:
     {'passenger_fare', 'Embarked_S'} 

    Extra features select by chi2:
     {'distinction_in_tikect_PC', 'Cabin_Letter_C'} 

    Intersection features select by f_class and chi2:
     {'Embarked_C', 'Cabin_Letter_E', 'Cabin_Letter_B', 'genre', 'distinction_in_tikect_Low', 'Pclass', 'distinction_in_name', 'Personal_Titles_Miss'} 

    Total number of features selected: 12
    {'passenger_fare', 'Embarked_C', 'Cabin_Letter_E', 'Cabin_Letter_B', 'genre', 'distinction_in_tikect_Low', 'Pclass', 'distinction_in_name', 'Embarked_S', 'distinction_in_tikect_PC', 'Cabin_Letter_C', 'Personal_Titles_Miss'}
    

Wrapper Methods

In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from your subset. The problem is essentially reduced to a search problem.

The two main disadvantages of these methods are :

  • The increasing overfitting risk when the number of observations is insufficient.
  • These methods are usually computationally very expensive.

Backward Elimination

In backward elimination, we start with all the features and removes the least significant feature at each iteration which improves the performance of the model. We repeat this until no improvement is observed on removal of features.

We will see below row implementation of backward elimination, one to select by P-values and other based on the accuracy of a model the we submitted to it.

Backward Elimination By P-values

The P-value, or probability value, or asymptotic significance, is a probability value for a given statistical model that, if the null hypothesis is true, a set of statistical observations more commonly known as the statistical summary is greater than or equal in magnitude to the observed results.

The null hypothesis is a general statement that there is no relationship between two measured phenomena.

For example, if the correlation is very small and furthermore, the p-value is high meaning that it is very likely to observe such correlation on a dataset of this size purely by chance.

But you need to be careful how you interpret the statistical significance of a correlation. If your correlation coefficient has been determined to be statistically significant this does not mean that you have a strong association. It simply tests the null hypothesis that there is no relationship. By rejecting the null hypothesis, you accept the alternative hypothesis that states that there is a relationship, but with no information about the strength of the relationship or its importance.

Since removal of different features from the dataset will have different effects on the p-value for the dataset, we can remove different features and measure the p-value in each case. These measured p-values can be used to decide whether to keep a feature or not.

Next we make the test of a logit regression to check the result and select features based on its the P-value:

In [64]:
logit_model=sm.Logit(y_train,df)
    result=logit_model.fit(method='bfgs', maxiter=2000)
    print(result.summary())
    

Optimization terminated successfully.
             Current function value: 0.363678
             Iterations: 285
             Function evaluations: 289
             Gradient evaluations: 289
                               Logit Regression Results                           
    ==============================================================================
    Dep. Variable:               Survived   No. Observations:                  891
    Model:                          Logit   Df Residuals:                      840
    Method:                           MLE   Df Model:                           50
    Date:                Sat, 22 Sep 2018   Pseudo R-squ.:                  0.4539
    Time:                        18:38:46   Log-Likelihood:                -324.04
    converged:                       True   LL-Null:                       -593.33
                                            LLR p-value:                 4.183e-83
    ================================================================================================
                                       coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------------------------
    non_relatives                    0.6521      0.248      2.626      0.009       0.165       1.139
    surname_Alone                    0.2531      0.317      0.797      0.425      -0.369       0.875
    surname_Baclini                 35.0286    5.2e+05   6.74e-05      1.000   -1.02e+06    1.02e+06
    surname_Carter                   0.3247      1.478      0.220      0.826      -2.572       3.221
    surname_Richards                24.5135    1.8e+05      0.000      1.000   -3.53e+05    3.53e+05
    surname_Harper                   1.6077      1.493      1.077      0.282      -1.319       4.534
    surname_Beckwith                25.2653   1.48e+05      0.000      1.000    -2.9e+05     2.9e+05
    surname_Goldenberg              18.1282   6212.320      0.003      0.998   -1.22e+04    1.22e+04
    surname_Moor                    12.1274    708.078      0.017      0.986   -1375.680    1399.934
    surname_Chambers                20.4606    1.8e+04      0.001      0.999   -3.53e+04    3.53e+04
    surname_Hamalainen              12.1778    769.487      0.016      0.987   -1495.989    1520.345
    surname_Dick                    23.5492   5.64e+04      0.000      1.000    -1.1e+05    1.11e+05
    surname_Taylor                  26.5937   2.03e+05      0.000      1.000   -3.97e+05    3.97e+05
    surname_Doling                  11.1800    568.824      0.020      0.984   -1103.695    1126.054
    surname_Gordon                  21.6263   1.37e+04      0.002      0.999   -2.67e+04    2.68e+04
    surname_Beane                   29.3099   9.09e+05   3.23e-05      1.000   -1.78e+06    1.78e+06
    surname_Hippach                  9.4228    329.299      0.029      0.977    -635.992     654.838
    surname_Bishop                  13.9600    991.219      0.014      0.989   -1928.793    1956.713
    surname_Mellinger               12.1126    747.967      0.016      0.987   -1453.876    1478.101
    surname_Yarred                  17.7443   6003.748      0.003      0.998   -1.17e+04    1.18e+04
    Pclass                          -1.0271      0.258     -3.982      0.000      -1.533      -0.522
    Age                             -0.0297      0.016     -1.848      0.065      -0.061       0.002
    SibSp                           -0.1467      0.216     -0.679      0.497      -0.570       0.277
    Parch                            0.1022      0.302      0.338      0.736      -0.491       0.695
    qtd_same_ticket_bin             -0.6804      0.361     -1.885      0.059      -1.388       0.027
    passenger_fare                   0.0444      0.019      2.320      0.020       0.007       0.082
    distinction_in_name              1.2501      0.419      2.987      0.003       0.430       2.070
    Cabin_Number                     0.0699      0.037      1.876      0.061      -0.003       0.143
    sons                             0.6661      0.697      0.956      0.339      -0.700       2.032
    parents                          0.3338      0.712      0.469      0.639      -1.061       1.729
    relatives                        0.2784      0.485      0.574      0.566      -0.672       1.229
    companions                      -0.2517      0.508     -0.496      0.620      -1.247       0.744
    Without_Age                     -0.2149      0.292     -0.737      0.461      -0.787       0.357
    Age_bin_custom_label             0.0610      0.209      0.291      0.771      -0.349       0.471
    genre                           -3.0048      0.526     -5.715      0.000      -4.035      -1.974
    Cabin_Letter_B                  -0.2343      0.786     -0.298      0.766      -1.774       1.306
    Cabin_Letter_C                  -0.3873      0.727     -0.533      0.594      -1.812       1.037
    Cabin_Letter_D                   0.6481      0.738      0.878      0.380      -0.799       2.095
    Cabin_Letter_E                   0.6878      0.752      0.914      0.361      -0.787       2.162
    Cabin_Letter_F                   1.1108      0.756      1.468      0.142      -0.372       2.593
    Cabin_Letter_G                   0.8240      0.896      0.919      0.358      -0.933       2.581
    Personal_Titles_Kid              3.7469      0.723      5.179      0.000       2.329       5.165
    Personal_Titles_Miss             0.0229      0.551      0.042      0.967      -1.056       1.102
    Personal_Titles_Royalty         -1.6139      2.035     -0.793      0.428      -5.602       2.374
    Personal_Titles_Technical        0.0623      0.663      0.094      0.925      -1.236       1.361
    Embarked_C                      -3.1700     15.027     -0.211      0.833     -32.623      26.283
    Embarked_Q                      -3.3883     15.032     -0.225      0.822     -32.851      26.074
    Embarked_S                      -3.6679     15.026     -0.244      0.807     -33.119      25.783
    distinction_in_tikect_Low       -1.9666      0.786     -2.501      0.012      -3.508      -0.426
    distinction_in_tikect_Others    -1.1537      0.546     -2.114      0.035      -2.223      -0.084
    distinction_in_tikect_PC        -1.8562      0.726     -2.558      0.011      -3.278      -0.434
    ================================================================================================
    

As expect, P-values of dummies is high. Like before, I excluded one by one of the features with the highest P-value and run again until get only P-values below to 0.1, but here I use a backward elimination process.

In [65]:
pv_cols = cols.values

    def backwardElimination(x, Y, sl, columns):
        numVars = x.shape[1]
        for i in range(0, numVars):
            regressor = sm.Logit(Y, x).fit(method='bfgs', maxiter=2000, disp=False)
            maxVar = max(regressor.pvalues) #.astype(float)
            if maxVar > sl:
                for j in range(0, numVars - i):
                    if (regressor.pvalues[j].astype(float) == maxVar):
                        columns = np.delete(columns, j)
                        x = x.loc[:, columns]
                        
        print(regressor.summary())
        print('\nSelect {:d} features from {:d} by best p-values.'.format(len(columns), len(pv_cols)))
        print('The max p-value from the features selecte is {:.3f}.'.format(maxVar))
        
        # odds ratios and 95% CI
        conf = np.exp(regressor.conf_int())
        conf['Odds Ratios'] = np.exp(regressor.params)
        conf.columns = ['2.5%', '97.5%', 'Odds Ratios']
        display(conf)
        
        return columns, regressor

    SL = 0.1
    df2 = scale.fit_transform(data.loc[data.Survived>=0, pv_cols])
    df2 = pd.DataFrame(df2, columns = pv_cols)

    pv_cols, Logit = backwardElimination(df2, y_train, SL, pv_cols)
    

                           Logit Regression Results                           
    ==============================================================================
    Dep. Variable:               Survived   No. Observations:                  891
    Model:                          Logit   Df Residuals:                      876
    Method:                           MLE   Df Model:                           14
    Date:                Sat, 22 Sep 2018   Pseudo R-squ.:                  0.4146
    Time:                        18:38:51   Log-Likelihood:                -347.36
    converged:                       True   LL-Null:                       -593.33
                                            LLR p-value:                 4.732e-96
    ================================================================================================
                                       coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------------------------
    non_relatives                    9.8711      1.312      7.525      0.000       7.300      12.442
    Pclass                          -2.0428      0.354     -5.777      0.000      -2.736      -1.350
    Age                             -2.2167      0.782     -2.834      0.005      -3.750      -0.683
    qtd_same_ticket_bin             -3.7108      0.575     -6.451      0.000      -4.838      -2.583
    passenger_fare                   3.0945      1.683      1.839      0.066      -0.204       6.393
    distinction_in_name              1.1604      0.307      3.783      0.000       0.559       1.762
    sons                             1.0338      0.443      2.332      0.020       0.165       1.903
    parents                          1.1074      0.457      2.425      0.015       0.212       2.002
    relatives                        0.5993      0.305      1.963      0.050       0.001       1.198
    genre                           -2.9115      0.252    -11.566      0.000      -3.405      -2.418
    Personal_Titles_Kid              3.4269      0.576      5.948      0.000       2.298       4.556
    Embarked_S                      -0.3645      0.221     -1.649      0.099      -0.798       0.069
    distinction_in_tikect_Low       -2.1724      0.739     -2.941      0.003      -3.620      -0.724
    distinction_in_tikect_Others    -1.3182      0.475     -2.777      0.005      -2.249      -0.388
    distinction_in_tikect_PC        -2.0141      0.654     -3.078      0.002      -3.297      -0.732
    ================================================================================================

    Select 15 features from 51 by best p-values.
    The max p-value from the features selecte is 0.099.
    

2.5% 97.5% Odds Ratios
non_relatives 1480.171472 253274.069118 19362.051846
Pclass 0.064837 0.259310 0.129665
Age 0.023516 0.504884 0.108964
qtd_same_ticket_bin 0.007920 0.075521 0.024457
passenger_fare 0.815520 597.572483 22.075600
distinction_in_name 1.749180 5.821927 3.191175
sons 1.179098 6.704660 2.811664
parents 1.236503 7.407207 3.026389
relatives 1.000811 3.312990 1.820900
genre 0.033212 0.089092 0.054396
Personal_Titles_Kid 9.951006 95.205077 30.779641
Embarked_S 0.450403 1.071116 0.694574
distinction_in_tikect_Low 0.026771 0.484601 0.113901
distinction_in_tikect_Others 0.105529 0.678613 0.267607
distinction_in_tikect_PC 0.037006 0.481179 0.133440

From the results, we can highlight:

  • we’re very confident about some relationship between the probability of being survived:

    • there is an inverse relationship with class, Age and genre of the passenger.
    • there is an positive relationship, from greater to low, with non relatives, Kids, passenger fare and distinction in name of the passenger.
  • From the coefficient:

    • Non relatives, we confirm that fellows or companions of families had better chances to survived.
    • As we saw in the EDA, passengers who embarked on S had a higher fatality rate.
    • The model is only able to translate a single straight line by variable, obviously pondered by the others. Then he points out that the greater the number of passengers with the same ticket, the greater the likelihood of death. As seen through the ticket EDA, this is partly true, however, this coefficient does not adequately capture the fact that the chances of survival increase from 1 to 3 passengers with the same ticket. As you can see above, the model balances this through the variables of family relationships. However, this is a good example of why you should not ignore the EDA phase, and above all do not rely on a conclusion based on a single fact or point of view. Remember, models are not full truths, and possess precision, confident and accuracy!

Take the exponential of each of the coefficients to generate the odds ratios. This tells you how a 1 unit increase or decrease in a variable affects the odds of being survived. For example, we can expect the odds of being survived to decrease by about 69.5% if the passenger embarked on S. Go back to the embarked EDA and see that this hits the stacked bar chart.

You must be wondering, after all, this model has how much of accuracy?

Although we did not do cross validation or even split, let’s take a look, redeem the probabilities generated by it and take advantage to see how we can plot the results of models that return probabilities, so we can refine our perception much more than simply evaluate p-value, coefficients, accuracy, etc.

In [66]:
pred = Logit.predict(df2[pv_cols])
    train = data[data.Survived>=0]
    train['proba'] = pred
    train['Survived'] = y_train
    y_pred = pred.apply(lambda x: 1 if x > 0.5 else 0)
    print('Accurancy: {0:2.2%}'.format(accuracy_score(y_true=y_train, y_pred=y_pred)))

    def plot_proba(continous, predict, discret, data):
        grouped = pd.pivot_table(data, values=[predict], index=[continous, discret], aggfunc=np.mean)
        colors = 'rbgyrbgy'
        for col in data[discret].unique():
            plt_data = grouped.ix[grouped.index.get_level_values(1)==col]
            plt.plot(plt_data.index.get_level_values(0), plt_data[predict], color=colors[int(col)])
        plt.xlabel(continous)
        plt.ylabel("Probabilities")
        plt.legend(np.sort(data[discret].unique()), loc='upper left', title=discret)
        plt.title("Probabilities with " + continous + " and " + discret)

    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(231)
    plot_proba('non_relatives', 'Survived', 'Pclass', train)
    ax = fig.add_subplot(232)
    plot_proba('non_relatives', 'Survived', 'genre', train)
    ax = fig.add_subplot(233)
    plot_proba('non_relatives', 'Survived', 'qtd_same_ticket_bin', train)
    ax = fig.add_subplot(234)
    plot_proba('qtd_same_ticket_bin', 'Survived', 'distinction_in_name', train)
    ax = fig.add_subplot(235)
    plot_proba('qtd_same_ticket_bin', 'Survived', 'Embarked_S', train)
    ax = fig.add_subplot(235)
    plot_proba('qtd_same_ticket_bin', 'Survived', 'Embarked_S', train)
    ax = fig.add_subplot(236)
    plot_proba('qtd_same_ticket_bin', 'Survived', 'parents', train)
    plt.show()
    

Accurancy: 83.73%
    

Backward Elimination By Accuracy – A Sequential Backward Selection

Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The motivation behind feature selection algorithms is to automatically select a subset of features that are most relevant to the problem to improve computational efficiency or reduce the generalization error of the model by removing irrelevant features or noise, which can be useful for algorithms that don’t
support regularization
.

Greedy algorithms make locally optimal choices at each stage of a combinatorial search problem and generally yield a suboptimal solution to the problem in contrast to exhaustive search algorithms, which evaluate all possible combinations and are guaranteed to find the optimal solution. However, in practice, an exhaustive search is often computationally not feasible, whereas greedy algorithms allow for a less complex, computationally more efficient solution.

SBS aims to reduce the dimensionality of the initial feature subspace with a minimum decay in performance of the classifier to improve upon computational efficiency. In certain cases, SBS can even improve the predictive power of the model if a model suffers from overfitting.

SBS sequentially removes features from the full feature subset until the new feature subspace contains the desired number of features. In order to determine which feature is to be removed at each stage, we need to define criterion function J that we want to minimize. The criterion calculated by the criterion function can simply be the difference in performance of the classifier after and before the removal of a particular feature. Then the feature to be removed at each stage can simply be defined as the feature that maximizes this criterion.

From interactive executions, I already know that I need remove the surnames with less than 0.06 of correlation.

So, let’s see a example of SBS in our data,

In [67]:
class SBS():
        def __init__(self, estimator, k_features, scoring=accuracy_score, test_size=0.25, random_state=101):
            self.scoring = scoring
            self.estimator = clone(estimator)
            self.k_features = k_features
            self.test_size = test_size
            self.random_state = random_state

        def fit(self, X, y):
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.test_size, random_state=self.random_state)
            dim = X_train.shape[1]
            self.indices_ = list(range(dim))
            self.subsets_ = [self.indices_]
            score = self._calc_score(X_train, y_train, X_test, y_test, self.indices_)
            self.scores_ = [score]
            
            while dim > self.k_features:
                scores = []
                subsets = []
                for p in combinations(self.indices_, r=dim-1):
                    score = self._calc_score(X_train, y_train, X_test, y_test, list(p))
                    scores.append(score)
                    subsets.append(list(p))
                    
                best = np.argmax(scores)
                self.indices_ = subsets[best]
                self.subsets_.append(self.indices_)
                dim -= 1
                self.scores_.append(scores[best])
                
            self.k_score_ = self.scores_[-1]
            return self

        def transform(self, X):
            return X.iloc[:, self.indices_]
        
        def _calc_score(self, X_train, y_train, X_test, y_test, indices):
            self.estimator.fit(X_train.iloc[:, indices], y_train)
            y_pred = self.estimator.predict(X_test.iloc[:, indices])
            score = self.scoring(y_test, y_pred)
            return score
        
    knn = KNeighborsClassifier(n_neighbors=3)
    sbs = SBS(knn, k_features=1)
    df2 = df.drop(['surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
                    'surname_Moor', 'surname_Chambers', 'surname_Hamalainen',
                    'surname_Dick', 'surname_Taylor', 'surname_Doling', 'surname_Gordon',
                    'surname_Beane', 'surname_Hippach', 'surname_Bishop',
                    'surname_Mellinger', 'surname_Yarred'], axis = 1)

    sbs.fit(df2, y_train)

    print('Best Score:',max(sbs.scores_))

    k_feat = [len(k) for k in sbs.subsets_]
    fig = plt.figure(figsize=(10,5))
    plt.plot(k_feat, sbs.scores_, marker='o')
    #plt.ylim([0.7, max(sbs.scores_)+0.01])
    plt.xlim([1, len(sbs.subsets_)])
    plt.xticks(np.arange(1, len(sbs.subsets_)+1))
    plt.ylabel('Accuracy')
    plt.xlabel('Number of features')
    plt.grid(b=1)
    plt.show()

    print('First best accuracy with:\n',list(df.columns[sbs.subsets_[np.argmax(sbs.scores_)]]))

    SBS = list(df.columns[list(sbs.subsets_[max(np.arange(0, len(sbs.scores_))[(sbs.scores_==max(sbs.scores_))])])])

    print('\nBest accuracy with {0:2d} features:\n{1:}'.format(len(SBS), SBS))
    

Best Score: 0.8340807174887892
    

First best accuracy with:
     ['non_relatives', 'surname_Goldenberg', 'surname_Chambers', 'surname_Hamalainen', 'surname_Dick', 'surname_Beane', 'surname_Mellinger', 'surname_Yarred', 'Age', 'SibSp', 'qtd_same_ticket_bin', 'Cabin_Number', 'relatives', 'Age_bin_custom_label']

    Best accuracy with 14 features:
    ['non_relatives', 'surname_Goldenberg', 'surname_Chambers', 'surname_Hamalainen', 'surname_Dick', 'surname_Beane', 'surname_Mellinger', 'surname_Yarred', 'Age', 'SibSp', 'qtd_same_ticket_bin', 'Cabin_Number', 'relatives', 'Age_bin_custom_label']
    

Select Features by Recursive Feature Elimination

The goal of Recursive Feature Elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.

RFE is based on the idea to repeatedly construct a model and choose either the best or worst performing feature, setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted.

Other option is sequential Feature Selector (SFS) from mlxtend, a separate Python library that is designed to work well with scikit-learn, also provides a S that works a bit differently.

RFE is computationally less complex using the feature’s weight coefficients (e.g., linear models) or feature importances (tree-based algorithms) to eliminate features recursively, whereas SFSs eliminate (or add) features based on a user-defined classifier/regression performance metric.

In [68]:
from sklearn.feature_selection import RFE

    lr = LogisticRegression()
    rfe = RFE(estimator=lr,  step=1)
    rfe.fit(df, y_train)

    FRFE = cols[rfe.ranking_==1]
    print('\nFeatures selected:\n',FRFE)
    print('\n Total Features selected:',len(FRFE))
    

    Features selected:
     Index(['non_relatives', 'surname_Baclini', 'surname_Richards',
           'surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
           'surname_Moor', 'surname_Chambers', 'surname_Dick', 'surname_Taylor',
           'surname_Beane', 'surname_Bishop', 'surname_Yarred', 'Pclass',
           'qtd_same_ticket_bin', 'distinction_in_name', 'sons', 'parents',
           'relatives', 'genre', 'Personal_Titles_Kid', 'Embarked_S',
           'distinction_in_tikect_Low', 'distinction_in_tikect_Others',
           'distinction_in_tikect_PC'],
          dtype='object')

     Total Features selected: 25
    

Select Features by Embedded Methods

In addition to the return of the performance itself, some models has in their internal process some step to features select that best fit their proposal, and returns the features importance too. Thus, they provide two straightforward methods for feature selection and combine the qualities’ of filter and wrapper methods.

Some of the most popular examples of these methods are LASSO, RIDGE, SVM, Regularized trees, Memetic algorithm, and Random multinomial logit.

In the case of Random Forest, some other models base on trees, we have two basic approaches implemented in the packages:

  1. Gini/Entropy Importance or Mean Decrease in Impurity (MDI)
  2. Permutation Importance or Mean Decrease in Accuracy
  3. Permutation with Shadow Features
  4. Gradient Boosting

Others models has concerns om multicollinearity problem and adding additional constraints or penalty to regularize. When there are multiple correlated features, as is the case with very many real life datasets, the model becomes unstable, meaning that small changes in the data can cause large changes in the model, making model interpretation very difficult on the regularization terms.

This applies to regression models like LASSO and RIDGE. In classifier cases, you can use SGDClassifier where you can set the loss parameter to ‘log’ for Logistic Regression or ‘hinge’ for SVM. In SGDClassifier you can set the penalty to either of ‘l1’, ‘l2’ or ‘elasticnet’ which is a combination of both.

Let’s start with more details and examples:

Feature Selection by Mean Decrease Impurity

There are a two things to keep in mind when using the impurity based ranking:

  • Feature selection based on impurity reduction is biased towards preferring variables with more categories.
  • It can lead to the incorrect conclusion that one of the features is a strong predictor while the others in the same group are unimportant, while actually they are very close in terms of their relationship with the independent variable.

The second one refers to when the dataset has two or more correlated features, then from the point of view of the model, any of these correlated features can be used as the predictor, with no concrete preference of one over the others. But once one of them is used, the importance of others is significantly reduced since effectively the impurity they can remove is already removed by the first feature. As a consequence, they will have a lower reported importance. This is not an issue when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. But when interpreting the data. The effect of this phenomenon is somewhat reduced thanks to random selection of features at each node creation, but in general the effect is not removed completely.

The Random Forests is one of them, the reason is because the tree-based strategies used by random forests naturally ranks by how well they improve the purity of the node. This mean decrease in impurity over all trees, wheres for classification, it is typically either Gini impurity or information gain/entropy and for *regression it is variance. Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure.

Random forests are a popular method for feature ranking, since they are so easy to apply: in general they require very little feature engineering and parameter tuning and mean decrease impurity is exposed in most random forest libraries. But they come with their own gotchas, especially when data interpretation is concerned. With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories. As long as the gotchas are kept in mind, there really is no reason not to try them out on your data.

Them, we run a quick Random Forest to select the features most importants:

In [69]:
rfc = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=101)
    rfc.fit(df, y_train)

    feature_importances = [(feature, score) for feature, score in zip(cols, rfc.feature_importances_)]

    MDI = cols[rfc.feature_importances_>0.010]
    print('Total features slected by Random Forest:',len(MDI))

    g = pd.DataFrame(sorted(feature_importances, key=lambda x: -x[1])[:len(MDI)], columns=['Feature','Importance']).\
    plot(x='Feature', kind='barh', figsize=(20,7), fontsize=18, grid=True)
    plt.show()
    

Total features slected by Random Forest: 15
    

Feature Selection by Mean Decrease Accuracy

The general idea is to permute the values of each feature and measure how much the permutation decreases the accuracy of the model. Clearly, for unimportant feature, the permutation should have little to no effect on model accuracy, while permuting important feature should significantly decrease it.

This method is not directly exposed in sklearn, but it is straightforward to implement it. Start record a baseline accuracy (classifier) or R2 score (regressor) by passing a validation set or the out-of-bag (OOB) samples through the random forest. Permute the column values of a single predictor feature and then pass all test samples back through the random forest and recompute the accuracy or R2. The importance of that feature is the difference between the baseline and the drop in overall accuracy or R2 caused by permuting the column. The permutation mechanism is much more computationally expensive than the mean decrease in impurity mechanism, but the results are more reliable.

The rfpimp package in the src dir provided it For Random Forest, let’s see:

In [70]:
X_train, X_test, y, y_test = train_test_split(df, y_train , test_size=0.20,  random_state=101)

    # Add column of random numbers
    X_train['random'] = np.random.random(size=len(X_train))
    X_test['random'] = np.random.random(size=len(X_test))

    rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, n_jobs=-1, oob_score=True, random_state=101)
    rf.fit(X_train, y)

    imp = importances(rf, X_test, y_test, n_samples=-1) # permutation
    MDA = imp[imp!=0].dropna().index
    if 'random' in MDA:
       MDA =  MDA.drop('random')
    print('%d features are selected.' % len(MDA))
    plot_importances(imp[imp!=0].dropna(), figsize=(20,7))
    

16 features are selected.
    

Feature Selection by Permutation with Shadow Features

Boruta randomly permutes variables like Permutation Importance does, but performs on all variables at the same time and concatenates the shuffled features with the original ones. The concatenated result is used to fit the model.

Daniel Homola, who also wrote the Python version of Boruta, BorutaPy, gave an wonderful overview of the Boruta algorithm in his blog post
“The shuffled features (a.k.a. shadow features) are basically noises with identical marginal distribution w.r.t the original feature. We count the times a variable performs better than the ‘best’ noise and calculate the confidence towards it being better than noise (the p-value) or not. Features which are confidently better are marked ‘confirmed’, and those which are confidently on par with noises are marked ‘rejected’. Then we remove those marked features and repeat the process until all features are marked or a certain number of iteration is reached.”

Although Boruta is a feature selection algorithm, we can use the order of confirmation/rejection as a way to rank the importance of features.

In [71]:
# NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
    X = df.values
    y = y_train.values.ravel()

    # define random forest classifier, with utilising all cores and
    # sampling in proportion to y labels
    #rf = RandomForestClassifier(n_estimators=10, min_samples_leaf=5, n_jobs=-1, oob_score=True, random_state=101)
    rf = ExtraTreesClassifier(n_estimators=100, max_depth=4, n_jobs=-1, oob_score=True, bootstrap=True, random_state=101)

    # define Boruta feature selection method
    feat_selector = BorutaPy(rf, n_estimators='auto', verbose=0, random_state=101)

    # find all relevant features - 5 features should be selected
    feat_selector.fit(X, y)

    shadow = cols[feat_selector.support_]
    # check selected features - first 5 features are selected
    print('Features selected:',shadow)

    # call transform() on X to filter it down to selected features
    print('Data transformaded has %d features' % feat_selector.n_features_) #feat_selector.transform(X).shape[1])
    print('Check the selector ranking:')
    display(pd.concat([pd.DataFrame(cols, columns=['Columns']), 
               pd.DataFrame(feat_selector.ranking_, columns=['Rank'])], axis=1).sort_values(by=['Rank']))
    

Features selected: Index(['surname_Alone', 'Pclass', 'Age', 'SibSp', 'Parch',
           'qtd_same_ticket_bin', 'passenger_fare', 'distinction_in_name', 'sons',
           'relatives', 'Age_bin_custom_label', 'genre', 'Cabin_Letter_B',
           'Cabin_Letter_C', 'Cabin_Letter_E', 'Cabin_Letter_G',
           'Personal_Titles_Kid', 'Personal_Titles_Miss', 'Embarked_C',
           'Embarked_S', 'distinction_in_tikect_Low', 'distinction_in_tikect_PC'],
          dtype='object')
    Data transformaded has 22 features
    Check the selector ranking:
    

Columns Rank
25 passenger_fare 1
22 SibSp 1
23 Parch 1
24 qtd_same_ticket_bin 1
26 distinction_in_name 1
28 sons 1
30 relatives 1
33 Age_bin_custom_label 1
34 genre 1
35 Cabin_Letter_B 1
36 Cabin_Letter_C 1
38 Cabin_Letter_E 1
40 Cabin_Letter_G 1
41 Personal_Titles_Kid 1
42 Personal_Titles_Miss 1
45 Embarked_C 1
47 Embarked_S 1
48 distinction_in_tikect_Low 1
21 Age 1
20 Pclass 1
50 distinction_in_tikect_PC 1
1 surname_Alone 1
0 non_relatives 2
32 Without_Age 2
37 Cabin_Letter_D 2
27 Cabin_Number 3
29 parents 3
31 companions 3
39 Cabin_Letter_F 5
49 distinction_in_tikect_Others 6
46 Embarked_Q 7
2 surname_Baclini 8
4 surname_Richards 9
3 surname_Carter 10
9 surname_Chambers 11
44 Personal_Titles_Technical 12
13 surname_Doling 13
19 surname_Yarred 14
10 surname_Hamalainen 15
6 surname_Beckwith 16
11 surname_Dick 17
15 surname_Beane 18
12 surname_Taylor 19
8 surname_Moor 20
18 surname_Mellinger 20
7 surname_Goldenberg 22
16 surname_Hippach 22
17 surname_Bishop 24
14 surname_Gordon 25
43 Personal_Titles_Royalty 26
5 surname_Harper 27

Feature Selection by Gradient Boosting

The LightGBM model the importance is calculated from, if ‘split’, result contains numbers of times the feature is used in a model, if ‘gain’, result contains total gains of splits which use the feature.

On the XGBoost model the importance is calculated by:

  • ‘weight’: the number of times a feature is used to split the data across all trees.
  • ‘gain’: the average gain across all splits the feature is used in.
  • ‘cover’: the average coverage across all splits the feature is used in.
  • ‘total_gain’: the total gain across all splits the feature is used in.
  • ‘total_cover’: the total coverage across all splits the feature is used in.

First measure is split-based and is very similar with the one given by for Gini Importance. But it doesn’t take the number of samples into account.

The second measure is gain-based. It’s basically the same as the Gini Importance implemented in R packages and in scikit-learn with Gini impurity replaced by the objective used by the gradient boosting model.

The cover, implemented exclusively in XGBoost, is counting the number of samples affected by the splits based on a feature.

get_score(fmap=”, importance_type=’weight’)
Get feature importance of each feature. Importance type can be defined as:

The default measure of both XGBoost and LightGBM is the split-based one. I think this measure will be problematic if there are one or two feature with strong signals and a few features with weak signals. The model will exploit the strong features in the first few trees and use the rest of the features to improve on the residuals. The strong features will look not as important as they actually are. While setting lower learning rate and early stopping should alleviate the problem, also checking gain-based measure may be a good idea.

Note that these measures are purely calculated using training data, so there’s a chance that a split creates no improvement on the objective in the holdout set. This problem is more severe than in the random forest since gradient boosting models are more prone to over-fitting.

Feature importance scores can be used for feature selection in scikit-learn.

This is done using the SelectFromModel class that takes a model and can transform a dataset into a subset with selected features.

This class can take a previous trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. This threshold is used when you call the transform() method on the SelectFromModel instance to consistently select the same features on the training dataset and the test dataset.

In [72]:
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

    # split data into train and test sets
    X_train, X_test, y, y_test = train_test_split(df, y_train, test_size=0.30, random_state=101)

    # fit model on all training data
    model = XGBClassifier(importance_type='gain', scale_pos_weight=((len(y)-y.sum())/y.sum()))
    model.fit(X_train, y)
    fig=plt.figure(figsize=(20,5))
    ax = fig.add_subplot(121)
    g = plot_importance(model, height=0.5, ax=ax)

    # Using each unique importance as a threshold
    thresholds = np.sort(np.unique(model.feature_importances_)) #np.sort(model.feature_importances_[model.feature_importances_>0])
    best = 0
    colsbest = 31
    my_model = model
    threshold = 0

    for thresh in thresholds:
        # select features using threshold
        selection = SelectFromModel(model, threshold=thresh, prefit=True)
        select_X_train = selection.transform(X_train)
        # train model
        selection_model = XGBClassifier(importance_type='gain', scale_pos_weight=((len(y)-y.sum())/y.sum()))
        selection_model.fit(select_X_train, y)
        # eval model
        select_X_test = selection.transform(X_test)
        y_pred = selection_model.predict(select_X_test)
        predictions = [round(value) for value in y_pred]
        accuracy = accuracy_score(y_test, predictions)
        print("Thresh={:1.3f}, n={:d}, Accuracy: {:2.2%}".format(thresh, select_X_train.shape[1], accuracy))
        if (best <= accuracy):
            best = accuracy
            colsbest = select_X_train.shape[1]
            my_model = selection_model
            threshold = thresh
            
    ax = fig.add_subplot(122)
    g = plot_importance(my_model,height=0.5, ax=ax, 
                        title='The best accuracy: {:2.2%} with {:d} features'.\
                        format(best, colsbest))

    feature_importances = [(score, feature) for score, feature in zip(model.feature_importances_, cols)]
    XGBest = pd.DataFrame(sorted(sorted(feature_importances, reverse=True)[:colsbest]), columns=['Score', 'Feature'])
    g = XGBest.plot(x='Feature', kind='barh', figsize=(20,7), fontsize=14, grid= True,
         title='Original feature importance from selected features')
    plt.tight_layout(); plt.show()
    XGBestCols