Kaggle challenges us to learn data analysis and machine learning from the data the Titanic shipwreck, and try predict survival and get familiar with ML basics.

So, this material is intended to cover most of the techniques of data analysis and ML in Python, than to properly compete in Kaggle. That is why it following the natural flow of ML and contains many texts and links regarding the techniques, made your conference and references easy. as it can be extended over time.

In this way the material can be used for consultation and apply the methods to other similar classification cases, but for its application in the competition, or even to a real case, it will be necessary to make some choices and changes.

Competition Description:

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, Kaggle ask you to complete the analysis of what sorts of people were likely to survive. In particular, they ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Table of Contents

Preparing environment and uploading data

You can download the this python notebook and data from my github repository. The data can download on Kaggle here.

Import Packages

In [1]:
import os
    import warnings
    warnings.simplefilter(action = 'ignore', category=FutureWarning)
    warnings.filterwarnings('ignore')
    def ignore_warn(*args, **kwargs):
        pass

    warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

    import numpy as np
    import pandas as pd
    import pylab 
    import seaborn as sns
    sns.set(style="ticks", color_codes=True, font_scale=1.5)
    from matplotlib import pyplot as plt
    from matplotlib.ticker import FormatStrFormatter
    from matplotlib.colors import ListedColormap
    %matplotlib inline
    import mpl_toolkits
    from mpl_toolkits.mplot3d import Axes3D
    import model_evaluation_utils as meu

    from scipy.stats import skew, norm, probplot, boxcox
    from patsy import dmatrices
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from sklearn.feature_selection import f_classif, chi2, SelectKBest, SelectFromModel
    from boruta import BorutaPy
    from rfpimp import *

    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
    from sklearn.preprocessing import StandardScaler, PolynomialFeatures, MinMaxScaler
    from sklearn.pipeline import Pipeline, make_pipeline
    from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, cross_val_predict, train_test_split
    from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score

    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.svm import SVC, LinearSVC
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier
    from sklearn.ensemble.gradient_boosting import GradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.neighbors import KNeighborsClassifier
    import xgboost as xgb
    from xgboost import XGBClassifier
    from xgboost import plot_importance

    #from sklearn.base import BaseEstimator, TransformerMixin, clone, ClassifierMixin
    from sklearn.ensemble import VotingClassifier
    from itertools import combinations
    

Load Datasets

I start with load the datasets with pandas, and concatenate them.

In [2]:
train = pd.read_csv('train.csv') 

    test = pd.read_csv('test.csv') 
    Test_ID = test.PassengerId
    test.insert(loc=1, column='Survived', value=-1)

    data = pd.concat([train, test], ignore_index=True)
    

Exploratory Data Analysis (EDA) & Feature Engineering

Take a First Look of our Data:

I created the function below to simplify the analysis of general characteristics of the data. Inspired on the str function of R, this function returns the types, counts, distinct, count nulls, missing ratio and uniques values of each field/feature.

If the study involve some supervised learning, this function can return the study of the correlation, for this we just need provide the independent variable to the pred parameter.

Also, if your return is stored in a variable you can evaluate it in more detail, specific of a field, or sort them from different perspectives

In [3]:
def rstr(df, pred=None): 
        obs = df.shape[0]
        types = df.dtypes
        counts = df.apply(lambda x: x.count())
        uniques = df.apply(lambda x: [x.unique()])
        nulls = df.apply(lambda x: x.isnull().sum())
        distincts = df.apply(lambda x: x.unique().shape[0])
        missing_ration = (df.isnull().sum()/ obs) * 100
        skewness = df.skew()
        kurtosis = df.kurt() 
        print('Data shape:', df.shape)
        
        if pred is None:
            cols = ['types', 'counts', 'distincts', 'nulls', 'missing ration', 'uniques', 'skewness', 'kurtosis']
            str = pd.concat([types, counts, distincts, nulls, missing_ration, uniques, skewness, kurtosis], axis = 1)

        else:
            corr = df.corr()[pred]
            str = pd.concat([types, counts, distincts, nulls, missing_ration, uniques, skewness, kurtosis, corr], axis = 1, sort=False)
            corr_col = 'corr '  + pred
            cols = ['types', 'counts', 'distincts', 'nulls', 'missing_ration', 'uniques', 'skewness', 'kurtosis', corr_col ]
        
        str.columns = cols
        dtypes = str.types.value_counts()
        print('___________________________\nData types:\n',str.types.value_counts())
        print('___________________________')
        return str
    

In [4]:
details = rstr(data.loc[: ,'Survived' : 'Embarked'], 'Survived')
    details.sort_values(by='corr Survived', ascending=False)
    

Data shape: (1309, 11)
    ___________________________
    Data types:
     object     5
    int64      4
    float64    2
    Name: types, dtype: int64
    ___________________________
    

Out[4]:
types counts distincts nulls missing_ration uniques skewness kurtosis corr Survived
Survived int64 1309 3 0 0.000000 [[0, 1, -1]] 0.097425 -1.263022 1.000000
Fare float64 1308 282 1 0.076394 [[7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51… 4.367709 27.027986 0.081545
Parch int64 1309 8 0 0.000000 [[0, 1, 2, 5, 3, 4, 6, 9]] 3.669078 21.541079 0.028196
SibSp int64 1309 7 0 0.000000 [[1, 0, 3, 4, 2, 5, 8]] 3.844220 20.043251 0.012470
Age float64 1046 99 263 20.091673 [[22.0, 38.0, 26.0, 35.0, nan, 54.0, 2.0, 27.0… 0.407675 0.146948 -0.049620
Pclass int64 1309 3 0 0.000000 [[3, 1, 2]] -0.598647 -1.315079 -0.126769
Name object 1309 1307 0 0.000000 [[Braund, Mr. Owen Harris, Cumings, Mrs. John … NaN NaN NaN
Sex object 1309 2 0 0.000000 [[male, female]] NaN NaN NaN
Ticket object 1309 929 0 0.000000 [[A/5 21171, PC 17599, STON/O2. 3101282, 11380… NaN NaN NaN
Cabin object 295 187 1014 77.463713 [[nan, C85, C123, E46, G6, C103, D56, A6, C23 … NaN NaN NaN
Embarked object 1307 4 2 0.152788 [[S, C, Q, nan]] NaN NaN NaN

Data Dictionary

  • Survived: 0 = No, 1 = Yes. I use -1 to can separate test data from training data.
  • Fare: The passenger fare
  • Parch: # of parents / children aboard the Titanic
  • SibSp: # of siblings / spouses aboard the Titanic
  • Age: Age in years
  • Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
  • Name: Name of the passenger
  • Sex: Sex of the passenger male and female
  • Ticket: Ticket number
  • Cabin: Cabin number
  • Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

The points of attention we have here are:

  • Fare: Kaggle affirm that is the passenger fare, but with some data inspection in group of Tickets we discover that is the total amount fare paid for a ticket, and the existence of tickets for a group of passengers.
  • Parch: The dataset defines family relations in this way…
    • Parent = mother, father
    • Child = daughter, son, stepdaughter, stepson
    • Some children traveled only with a nanny, therefore parch=0 for them.
  • SibSP: The dataset defines family relations in this way…
    • Sibling = brother, sister, stepbrother, stepsister
    • Spouse = husband, wife (mistresses and fiancés were ignored)
  • Age: Have 20% of nulls, so we have to find a more efficient way of filling them out with just a single value, like the median for example.
  • Name: is a categorical data with a high distinct values as expect. The first reaction is drop this column, but we can use it to training different data engineering techniques, to see if we can get some valuable data. Besides that, notice that has two pairs of passengers with the same name?
  • Ticket Other categorical data, but in this case it have only 71% of distinct values and don’t have nulls. So, it is possible that some passenger voyage in groups and use the same ticket. Beyond that, we can check to if we can extract other interesting thought form it.
  • Cabin: the high number of distinct values (187) and nulls (77.5%).
    This is a categorical data of which I use to train different techniques to extract some information of value and null input, but with given the high rate of null the recommended would be to simplify the filling or even to exclude this attribute.

First see of some stats of Numeric Data

So, for the main statistics of our numeric data describe the function (like the summary of R)

In [5]:
print('Data is not balanced! Has {:2.2%} survives'.format(train.Survived.describe()[1]))
    display(data.loc[: ,'Pclass' : 'Embarked'].describe().transpose())
    print('Survived: [1] Survived; [0] Died; [-1] Test Data set:\n',data.Survived.value_counts())
    

Data is not balanced! Has 38.38% survives
    

count mean std min 25% 50% 75% max
Pclass 1309.0 2.294882 0.837836 1.00 2.0000 3.0000 3.000 3.0000
Age 1046.0 29.881138 14.413493 0.17 21.0000 28.0000 39.000 80.0000
SibSp 1309.0 0.498854 1.041658 0.00 0.0000 0.0000 1.000 8.0000
Parch 1309.0 0.385027 0.865560 0.00 0.0000 0.0000 0.000 9.0000
Fare 1308.0 33.295479 51.758668 0.00 7.8958 14.4542 31.275 512.3292

Survived: [1] Survived; [0] Died; [-1] Test Data set:
      0    549
    -1    418
     1    342
    Name: Survived, dtype: int64
    

EDA and Feature Engineering by Feature

Standard Data Visualization for Discrete or Binning Data

In [6]:
def charts(feature, df):
        print('\n           ____________________________ Plots of', feature, 'per Survived and Dead: ____________________________')
        # Pie of all Data
        fig = plt.figure(figsize=(20,5))
        f1 = fig.add_subplot(131)
        cnt = df[feature].value_counts()
        g = plt.pie(cnt, labels=cnt.index, autopct='%1.1f%%', shadow=True, startangle=90)
        
        # Count Plot By Survived and Dead
        f = fig.add_subplot(132)
        g = sns.countplot(x=feature, hue='Survived', hue_order=[1,0], data=df, ax=f)

        # Percent stacked Plot
        survived = df[df['Survived']==1][feature].value_counts()
        dead = df[df['Survived']==0][feature].value_counts()
        df2 = pd.DataFrame([survived,dead])
        df2.index = ['Survived','Dead']
        df2 = df2.T
        df2 = df2.fillna(0)
        df2['Total'] = df2.Survived + df2.Dead
        df2.Survived = df2.Survived/df2.Total
        df2.Dead = df2.Dead/df2.Total
        df2.drop(['Total'], axis=1, inplace=True)
        f = fig.add_subplot(133)
        df2.plot(kind='bar', stacked=True, ax=f)
        del df2, g, f, cnt, dead, fig
    

Ticket

Since Ticket is a transaction and categorical data, the first insight is drop this feature, but we may note that it has some hidden value information. At first look, safe few cases, we could affirms that:

  • families and group of persons that traveled together bought the same ticket.
  • People with alphanumerics Tickets has some special treatment (crew family, employees, VIP, free tickets, etc.)

So, we start by a new feature creation to quantify the number of passengers by ticket, and join this
quantity to each passenger with the same ticket.

In [7]:
same_ticket = data.Ticket.value_counts()
    data['qtd_same_ticket'] = data.Ticket.apply(lambda x: same_ticket[x])
    del same_ticket
    charts('qtd_same_ticket', data[data.Survived>=0])
    

               ____________________________ Plots of qtd_same_ticket per Survived and Dead: ____________________________
    

As we can see above:

  • the majority (54%) bought only one ticket per passenger, and have lower survival rate than passengers that bought tickets for 2, 3, 4, 5 and 8 people.
  • the survival rate is growing between 1 and 4, dropped a lot at 5. From the bar chart we can see that after 5 the number of samples is too low (84 out of 891, 9.4%, 1/4 of this is 5), and this data is skewed with a long tail to right. We can reduce this tail by binning all data after 4 in the same ordinal, its better to prevent overfitting, but we lose some others interesting case, see the next bullet. As alternative we can apply a box cox at this measure.
  • The case of 11 people with same ticket probably is a huge family that all samples on the training data died. Let’s check this below.

In [8]:
data[(data.qtd_same_ticket==11)]
    

Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked qtd_same_ticket
159 160 0 3 Sage, Master. Thomas Henry male NaN 8 2 CA. 2343 69.55 NaN S 11
180 181 0 3 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.55 NaN S 11
201 202 0 3 Sage, Mr. Frederick male NaN 8 2 CA. 2343 69.55 NaN S 11
324 325 0 3 Sage, Mr. George John Jr male NaN 8 2 CA. 2343 69.55 NaN S 11
792 793 0 3 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.55 NaN S 11
846 847 0 3 Sage, Mr. Douglas Bullen male NaN 8 2 CA. 2343 69.55 NaN S 11
863 864 0 3 Sage, Miss. Dorothy Edith “Dolly” female NaN 8 2 CA. 2343 69.55 NaN S 11
1079 1080 -1 3 Sage, Miss. Ada female NaN 8 2 CA. 2343 69.55 NaN S 11
1233 1234 -1 3 Sage, Mr. John George male NaN 1 9 CA. 2343 69.55 NaN S 11
1251 1252 -1 3 Sage, Master. William Henry male 14.5 8 2 CA. 2343 69.55 NaN S 11
1256 1257 -1 3 Sage, Mrs. John (Annie Bullen) female NaN 1 9 CA. 2343 69.55 NaN S 11

We confirm our hypothesis, and we notice that Fare is not the price of each passenger, but the price of each ticket, its total amount. Since our data is per passenger, this information has some distortion, because only one passenger that bought a ticket alone of 69.55 pounds is different from 11 passenger that bought a special ticket, with discount for group, by 6.32 pounds per passenger. It suggest to create a new feature that represents the real fare by passenger.

Back to the quantity of persons with same ticket, if we keep this and the model can capture this pattern, you’ll probably predict that the respective test samples also died! However, even if true, can be a sign of overfitting, because we only have 1.2% of these cases in the training samples.

In order to increase representativeness and lose the minimum of information, since we have only 44 (4.9%) training samples that bought tickets for 4 people and 101 (11.3%) of 3, we binning the quantity of 3 and 4 together as 3 (16,3%, over than 5 as 5 (84 samples). Let’s see the results below.

In [9]:
data['qtd_same_ticket_bin'] = data.qtd_same_ticket.apply(lambda x: 3 if (x>2 and x<5) else (5 if x>4 else x))
    charts('qtd_same_ticket_bin', data[data.Survived>=0])
    

               ____________________________ Plots of qtd_same_ticket_bin per Survived and Dead: ____________________________
    

Other option, is create a binary feature that indicates if the passenger use a same ticket or not (not share his ticket)

In [10]:
print('Percent. survived from unique ticket: {:3.2%}'.\
          format(data.Survived[(data.qtd_same_ticket==1) & (data.Survived>=0)].sum()/
                 data.Survived[(data.qtd_same_ticket==1) & (data.Survived>=0)].count()))
    print('Percent. survived from same tickets: {:3.2%}'.\
          format(data.Survived[(data.qtd_same_ticket>1) & (data.Survived>=0)].sum()/
                 data.Survived[(data.qtd_same_ticket>1) & (data.Survived>=0)].count()))

    data['same_tckt'] = data.qtd_same_ticket.apply(lambda x: 1 if (x> 1) else 0)
    charts('same_tckt', data[data.Survived>=0])
    

Percent. survived from unique ticket: 27.03%
    Percent. survived from same tickets: 51.71%

               ____________________________ Plots of same_tckt per Survived and Dead: ____________________________
    

In this case we lose information that the chances of survival increase from 1 to 4, and fall from 5. In addition, cases 1 and 0 of the two measures are exactly the same. Then we will not use this option, and go work on Fare.

Finally, we have one more information to extract directly from Ticket, and check the possible special treatment!

In [11]:
data.Ticket.str.findall('[A-z]').apply(lambda x: ''.join(map(str, x))).value_counts().head(7)
    

Out[11]:
           957
    PC          92
    CA          68
    A           39
    SOTONOQ     24
    STONO       21
    WC          15
    Name: Ticket, dtype: int64

In [12]:
data['distinction_in_tikect'] =\
       (data.Ticket.str.findall('[A-z]').apply(lambda x: ''.join(map(str, x)).strip('[]')))

    data.distinction_in_tikect = data.distinction_in_tikect.\
      apply(lambda y: 'Without' if y=='' else y if (y in ['PC', 'CA', 'A', 'SOTONOQ', 'STONO', 'WC', 'SCPARIS']) else 'Others')

    charts('distinction_in_tikect', data[(data.Survived>=0)])
    

               ____________________________ Plots of distinction_in_tikect per Survived and Dead: ____________________________
    

By the results, passengers with PC distinction in their tickets had best survival rate. Without, Others and CA is very close and can be grouped in one category and the we can do the same between STONO and SCAPARIS, and between A, SOTONOQ and WC.

In [13]:
data.distinction_in_tikect = data.distinction_in_tikect.\
      apply(lambda y: 'Others' if (y in ['Without', 'Others', 'CA']) else\
            'Low' if (y in ['A', 'SOTONOQ', 'WC']) else\
            'High' if (y in ['STONO', 'SCPARIS']) else y)

    charts('distinction_in_tikect', data[(data.Survived>=0)])
    

               ____________________________ Plots of distinction_in_tikect per Survived and Dead: ____________________________
    

Fare

First, we treat the unique null fare case, then we take a look of the distribution of Fare (remember that is the total amount Fare of the Ticket).

In [14]:
# Fill null with median of most likely type passenger
    data.loc[data.Fare.isnull(), 'Fare'] = data.Fare[(data.Pclass==3) & (data.qtd_same_ticket==1) & (data.Age>60)].median()

    fig = plt.figure(figsize=(20,5))
    f = fig.add_subplot(121)
    g = sns.distplot(data[(data.Survived>=0)].Fare)
    f = fig.add_subplot(122)
    g = sns.boxplot(y='Fare', x='Survived', data=data[data.Survived>=0], notch = True)
    

Let’s take a look at how the fare per passenger is and how much it differs from the total

In [15]:
data['passenger_fare'] = data.Fare / data.qtd_same_ticket

    fig = plt.figure(figsize=(20,6))
    a = fig.add_subplot(141)
    g = sns.distplot(data[(data.Survived>=0)].passenger_fare)
    a = fig.add_subplot(142)
    g = sns.boxplot(y='passenger_fare', x='Survived', data=data[data.Survived>=0], notch = True)
    a = fig.add_subplot(143)
    g = pd.qcut(data.Fare[(data.Survived==0)], q=[.0, .25, .50, .75, 1.00]).value_counts().plot(kind='bar', ax=a, title='Died')
    a = fig.add_subplot(144)
    g = pd.qcut(data.Fare[(data.Survived>0)], q=[.0, .25, .50, .75, 1.00]).value_counts().plot(kind='bar', ax=a, title='Survived')
    plt.tight_layout(); plt.show()
    


From the comparison, we can see that:

  • the distributions are not exactly the same, with two spouts slightly apart on passenger fare.
  • Class and how much paid per passenger make differences!

    Although the number of survivors among the quartiles is approximately the same as expected, when we look at passenger fares, it is more apparent that the mortality rate is higher in the lower Fares, since the top of Q4 died is at the same height as the median plus a confidence interval of the fare paid by survivors.

  • the number of outliers is lower in the fare per passenger, especially among survivors.

    We can not rule out these outliers if there are cases of the same type in the test data set. In addition, these differences in values may be due to probably first class with additional fees for certain exclusives and cargo.

Below, you can see that the largest outlier all survival in the train data set, and has one case (1235 Passenger Id,
the matriarch of one son and two companions) to predict. Among all outlier cases of survivors, we see that all cases are first class, and different from the largest outlier, 27% actually died, and we have 18 cases to predict.

In [16]:
print('Passengers with higets passenger fare:')
    display(data[data.passenger_fare>120])
    print('\nSurivived of passenger fare more than 50:\n',
        pd.pivot_table(data.loc[data.passenger_fare>50, ['Pclass', 'Survived']], aggfunc=np.count_nonzero, 
                           columns=['Survived'] , index=['Pclass']))
    

Passengers with higets passenger fare:
    

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked qtd_same_ticket qtd_same_ticket_bin same_tckt distinction_in_tikect passenger_fare
258 259 1 1 Ward, Miss. Anna female 35.0 0 0 PC 17755 512.3292 NaN C 4 3 1 PC 128.0823
679 680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C 4 3 1 PC 128.0823
737 738 1 1 Lesurer, Mr. Gustave J male 35.0 0 0 PC 17755 512.3292 B101 C 4 3 1 PC 128.0823
1234 1235 -1 1 Cardeza, Mrs. James Warburton Martinez (Charlo… female 58.0 0 1 PC 17755 512.3292 B51 B53 B55 C 4 3 1 PC 128.0823

    Surivived of passenger fare more than 50:
     Survived  -1   0   1
    Pclass              
    1         18   4  22
    

Note that if we leave this way, if the model succeeds in capturing this pattern of largest outlier we are again thinking of a model that is at risk of overfitting (0.03% of cases).

Pclass

Notwithstanding the fact that class 3 presents greater magnitude, as we see with Fare by passenger, we notice that survival rate is greater with greater fare by passenger. Its make to think that has some socioeconomic discrimination. It is confirmed when we saw the data distribution over the classes, and see the percent bar has a clearer aggressive decreasing survival rate through the first to the third classes.

In [17]:
charts('Pclass', data[(data.Survived>=0)])
    

               ____________________________ Plots of Pclass per Survived and Dead: ____________________________
    

SibSp

In [18]:
charts('SibSp', data[(data.Survived>=0)])
    

               ____________________________ Plots of SibSp per Survived and Dead: ____________________________
    

Since more than 2 siblings has too few cases and lowest survival rate, we can aggregate all this case into unique bin in order to increase representativeness and lose the minimum of information.

In [19]:
data['SibSp_bin'] = data.SibSp.apply(lambda x: 6 if x > 2 else x)
    charts('SibSp_bin', data[(data.Survived>=0)])
    

               ____________________________ Plots of SibSp_bin per Survived and Dead: ____________________________
    

Parch

In [20]:
charts('Parch', data[data.Survived>=0])
    

               ____________________________ Plots of Parch per Survived and Dead: ____________________________
    

As we did with siblings, we will aggregate the Parch cases with more than 3, even with the highest survival rate with 3 Parch.

In [21]:
data['Parch_bin'] = data.Parch.apply(lambda x: x if x< 3 else 4)
    charts('Parch_bin', data[(data.Survived>=0)])
    

               ____________________________ Plots of Parch_bin per Survived and Dead: ____________________________
    

Family and non-relatives

If you investigate the data, you will notice that total family members It can be obtained by the sum of Parch and SibSp plus 1 (1 for the person of respective record). So, let’s create the Family and see what we get.

In [22]:
data['family'] = data.SibSp + data.Parch + 1
    charts('family', data[data.Survived>=0])
    

               ____________________________ Plots of family per Survived and Dead: ____________________________
    

As we can see, family groups of up to 4 people were more likely to survive than people without relatives on board.
However from 5 family members we see a drastic fall and the leveling of the 7-member cases with the unfamiliar ones.
You may be led to think that this distortion clearly has some relation to the social condition. Better see the right data!

In [23]:
charts('Pclass', data[(data.family>4) & (data.Survived>=0)])
    

               ____________________________ Plots of Pclass per Survived and Dead: ____________________________
    


Yes, we have more cases in the third class, but on the other hand, what we see is that the numbers of cases with more than 4 relatives were rarer. n a more careful look, you will see that from 6 family members we only have third class (25 in training, 10 in test). So we confirmed that a large number of family members made a difference, yes, if you were from upper classes

You must have a feeling of déjà vu, and yes, this metric is very similar to the one we have already created, the amount of passengers with the same ticket.

So what’s the difference. At first you have only the amount of people aboard with family kinship plus herself, in the previous you have people reportedly grouped, family members or not. So, in cases where relatives bought tickets separately we see the family considering them, but the ticket separating them. On the other hand, as a family we do not consider travelers with their non-family companions, employees or friends, while in the other yes.

With this, we can now obtain the number of fellows or companions per passenger. This is the number of non-relatives who traveled with the passenger

In [24]:
data['non_relatives'] = data.qtd_same_ticket - data.family
    charts('non_relatives', data[data.Survived>=0])
    

               ____________________________ Plots of non_relatives per Survived and Dead: ____________________________
    

Here you see negative numbers because there are groups of travelers with the number of unrelated members larger than those with kinship.

Sex

As everybody knows, in that case women has more significant survival rate than men.

In [25]:
charts('Sex', data[(data.Survived>=0)])
    

               ____________________________ Plots of Sex per Survived and Dead: ____________________________
    

Embarked

First, we check the 2 embarked null cases to find the most likely pattern to considerate to fill with the respective mode.

In sequence, we take a look at the Embarked data. As we can see, the passengers that embarked from Cherbourg had best survival rates and most of the passengers embarked from Southampton and had the worst survival rate.

In [26]:
display(data[data.Embarked.isnull()])
    data.loc[data.Embarked=='NA', 'Embarked'] = data[(data.Cabin.str.match('B2')>0) & (data.Pclass==1)].Embarked.mode()[0]
    charts('Embarked', data[(data.Survived>=0)])
    

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked qtd_same_ticket qtd_same_ticket_bin same_tckt distinction_in_tikect passenger_fare SibSp_bin Parch_bin family non_relatives
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 NaN 2 2 1 Others 40.0 0 0 1 1
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 NaN 2 2 1 Others 40.0 0 0 1 1

2 rows × 21 columns

               ____________________________ Plots of Embarked per Survived and Dead: ____________________________
    

Name


Name feature has too much variance and is not significant, but has some value information to extracts and checks, like:

  • Personal Titles
  • Existence of nicknames
  • Existence of references to another person
  • Family names

In [27]:
def Personal_Titles(df):
        df['Personal_Titles'] = df.Name.str.findall('Mrs\.|Mr\.|Miss\.|Maste[r]|Dr\.|Lady\.|Countess\.|'
                                                    +'Sir\.|Rev\.|Don\.|Major\.|Col\.|Jonkheer\.|'
                                                    + 'Capt\.|Ms\.|Mme\.|Mlle\.').apply(lambda x: ''.join(map(str, x)).strip('[]'))

        df.Personal_Titles[df.Personal_Titles=='Mrs.'] = 'Mrs'
        df.Personal_Titles[df.Personal_Titles=='Mr.'] = 'Mr'
        df.Personal_Titles[df.Personal_Titles=='Miss.'] = 'Miss'
        df.Personal_Titles[df.Personal_Titles==''] = df[df.Personal_Titles==''].Sex.apply(lambda x: 'Mr' if (x=='male') else 'Mrs')
        df.Personal_Titles[df.Personal_Titles=='Mme.'] = 'Mrs' 
        df.Personal_Titles[df.Personal_Titles=='Ms.'] = 'Mrs'
        df.Personal_Titles[df.Personal_Titles=='Lady.'] = 'Royalty'
        df.Personal_Titles[df.Personal_Titles=='Mlle.'] = 'Miss'
        df.Personal_Titles[(df.Personal_Titles=='Miss.') & (df.Age>-1) & (df.Age<15)] = 'Kid' 
        df.Personal_Titles[df.Personal_Titles=='Master'] = 'Kid'
        df.Personal_Titles[df.Personal_Titles=='Don.'] = 'Royalty'
        df.Personal_Titles[df.Personal_Titles=='Jonkheer.'] = 'Royalty'
        df.Personal_Titles[df.Personal_Titles=='Capt.'] = 'Technical'
        df.Personal_Titles[df.Personal_Titles=='Rev.'] = 'Technical'
        df.Personal_Titles[df.Personal_Titles=='Sir.'] = 'Royalty'
        df.Personal_Titles[df.Personal_Titles=='Countess.'] = 'Royalty'
        df.Personal_Titles[df.Personal_Titles=='Major.'] = 'Technical'
        df.Personal_Titles[df.Personal_Titles=='Col.'] = 'Technical'
        df.Personal_Titles[df.Personal_Titles=='Dr.'] = 'Technical'

    Personal_Titles(data)
    display(pd.pivot_table(data[['Personal_Titles', 'Survived']], aggfunc=np.count_nonzero, 
                           columns=['Survived'] , index=['Personal_Titles']).T)

    charts('Personal_Titles', data[(data.Survived>=0)])
    

Personal_Titles Kid Miss Mr Mrs Royalty Technical
Survived
-1 42.0 156.0 480.0 148.0 NaN 10.0
0 17.0 55.0 436.0 26.0 2.0 13.0
1 46.0 258.0 162.0 202.0 6.0 10.0

               ____________________________ Plots of Personal_Titles per Survived and Dead: ____________________________
    

As you can see above, I opted to add some titles, but at first keep 2 small sets (Technical and Royalty), Because there are interesting survival rate variations.

Next, we identify the names with mentions to other people or with nicknames and create a boolean feature.

In [28]:
data['distinction_in_name'] =\
       ((data.Name.str.findall('\(').apply(lambda x: ''.join(map(str, x)).strip('[]'))=='(')
        | (data.Name.str.findall(r'"[A-z"]*"').apply(lambda x: ''.join(map(str, x)).strip('""'))!=''))

    data.distinction_in_name = data.distinction_in_name.apply(lambda x: 1 if x else 0)

    charts('distinction_in_name', data[(data.Survived>=0)])
    

               ____________________________ Plots of distinction_in_name per Survived and Dead: ____________________________
    

It is interesting to note that those who have some type of reference or distinction in their names had a higher survival rate.

Next, we find 872 surnames in this dataset. Even adding loners in a single category, we have 229 with more than one member. It’s a huge categorical data to work, and it is to much sparse. The most of then has too few samples to really has significances to almost of algorithms, without risk to occurs overfitting. In addition, there are 18 surnames cases with more than one member exclusively in the test data set.

So, we create this feature with aggregation of unique member into one category and use this at models that could work on it to check if we get better results. Alternatively, we can use dimensionality reduction methods.

In [29]:
print('Total of differents surnames aboard:',
          ((data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()>1).shape[0])
    print('More then one persons aboard with smae surnames:',
          ((data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()>1).sum())

    surnames = (data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()

    data['surname'] = (data.Name.str.findall(r'[A-z]*\,').\
     apply(lambda x: ''.join(map(str, x)).strip(','))).apply(lambda x: x if surnames.get_value(x)>1 else 'Alone')

    test_surnames = set(data.surname[data.Survived>=0].unique().tolist())
    print('Surnames with more than one member aboard that happens only in the test data set:', 
          240-len(test_surnames))

    train_surnames = set(data.surname[data.Survived<0].unique().tolist())
    print('Surnames with more than one member aboard that happens only in the train data set:', 
          240-len(train_surnames))

    both_surnames = test_surnames.intersection(train_surnames)

    data.surname = data.surname.apply(lambda x : x if test_surnames.issuperset(set([x])) else 'Exclude')

    del surnames, both_surnames, test_surnames, train_surnames
    

Total of differents surnames aboard: 872
    More then one persons aboard with smae surnames: 239
    Surnames with more than one member aboard that happens only in the test data set: 18
    Surnames with more than one member aboard that happens only in the train data set: 76
    

Cabin

This information has to many nulls, but when it exist we can know what is the deck of the passenger, and some distinguish passengers from the same class.

Let’s start applying the same cabin to null cases where there are samples with cabins for the same ticket.

In [30]:
CabinByTicket = data.loc[~data.Cabin.isnull(), ['Ticket', 'Cabin']].groupby(by='Ticket').agg(min)
    before = data.Cabin.isnull().sum()
    data.loc[data.Cabin.isnull(), 'Cabin'] = data.loc[data.Cabin.isnull(), 'Ticket'].\
       apply(lambda x: CabinByTicket[CabinByTicket.index==x].min())
    print('Cabin nulls reduced:', (before - data.Cabin.isnull().sum()))
    del CabinByTicket, before
    

Cabin nulls reduced: 16
    

In [31]:
data.Cabin[data.Cabin.isnull()] = 'N999'
    data['Cabin_Letter'] = data.Cabin.str.findall('[^a-z]\d\d*')
    data['Cabin_Number'] = data.apply(lambda x: 0 if len(str(x.Cabin))== 1 else np.int(np.int(x.Cabin_Letter[0][1:])/10), axis=1)
    data.Cabin_Letter = data.apply(lambda x: x.Cabin if len(str(x.Cabin))== 1 else x.Cabin_Letter[0][0], axis=1)

    display(data[['Fare', 'Cabin_Letter']].groupby(['Cabin_Letter']).agg([np.median, np.mean, np.count_nonzero, np.max, np.min]))
    

Fare
median mean count_nonzero amax amin
Cabin_Letter
A 35.07710 41.244314 21.0 81.8583 0.0000
B 84.38335 131.484438 65.0 512.3292 0.0000
C 90.00000 112.161300 104.0 263.0000 25.7000
D 52.55420 53.007339 46.0 113.2750 12.8750
E 51.86250 49.557091 47.0 134.5000 7.2292
F 19.50000 22.303571 14.0 39.0000 7.7500
G 10.46250 11.291667 9.0 16.7000 7.6500
N 10.50000 16.861973 985.0 133.6500 0.0000
T 35.50000 35.500000 1.0 35.5000 35.5000

Doesn’t exist Cabin T in test dataset. This passenger is from first class and his passenger fare is the same from others 5 first class passengers. So, changed to ‘C’ to made same distribution between the six.

In [32]:
display(data[data.Cabin=='T'])

    display(data.Cabin_Letter[data.passenger_fare==35.5].value_counts())

    data.Cabin_Letter[data.Cabin_Letter=='T'] = 'C'
    

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare passenger_fare SibSp_bin Parch_bin family non_relatives Personal_Titles distinction_in_name surname Cabin_Letter Cabin_Number
339 340 0 1 Blackwell, Mr. Stephen Weart male 45.0 0 0 113784 35.5 35.5 0 0 1 0 Mr 0 Alone T 0

1 rows × 26 columns

B    2
    A    2
    C    1
    T    1
    Name: Cabin_Letter, dtype: int64

Fill Cabins letters NAs of third class with most common patterns of the same passenger fare range with one or lessen possible cases.

In [33]:
data.loc[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()

    data.loc[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()

    data.loc[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()

    data.loc[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
      data[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
      data[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
    

Fill Cabins letters NAs of second class with most common patterns of the same passenger fare range with one or lessen possible cases.

In [34]:
data.loc[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
    

Fill Cabins letters NAs of first class with most common patterns of the same passenger fare range with one or lessen possible cases.

In [35]:
data.loc[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]

    data.loc[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
        data[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
    data.loc[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
        data[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
    

See below that we conquered a good results after filling nulls, but we need attention since they have too many nulls originally. In addition, the cabin may actually have made more difference in the deaths caused by the impact and not so much among those who drowned.

In [36]:
charts('Cabin_Letter', data[(data.Survived>=0)])
    

               ____________________________ Plots of Cabin_Letter per Survived and Dead: ____________________________
    

Rescue of family relationships

After some work, we notice that is difficult to understand SibSp and Patch isolated, and is difficult to extract directly families relationships from this data without a closer look.

So, in that configuration we not have clearly families relationships, and this information is primary to use for apply ages to ages with null with better distribution and accuracy.

Let’s start to rescue:

The first treatment, I discovered when check the results and noticed that I didn’t apply any relationship to a one case. Look the details below, we can see that is a case of a family with more than one ticket and the son has no age. So, I just manually applied this one case as a son, since the others member the father and the mother, and the son has the pattern 0 in SibSp and Parch 2

In [37]:
display(data[data.Name.str.findall('Bourke').apply(lambda x: ''.join(map(str, x)).strip('[]'))=='Bourke'])
    family_w_age = data.Ticket[(data.Parch>0) & (data.SibSp>0) & (data.Age==-1)].unique().tolist()
    data[data.Ticket.isin(family_w_age)].sort_values('Ticket')
    

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare passenger_fare SibSp_bin Parch_bin family non_relatives Personal_Titles distinction_in_name surname Cabin_Letter Cabin_Number
188 189 0 3 Bourke, Mr. John male 40.0 1 1 364849 15.50 7.75 1 1 3 -1 Mr 0 Bourke F 3
593 594 0 3 Bourke, Miss. Mary female NaN 0 2 364848 7.75 7.75 0 2 3 -2 Miss 0 Bourke F 3
657 658 0 3 Bourke, Mrs. John (Catherine) female 32.0 1 1 364849 15.50 7.75 1 1 3 -1 Mrs 1 Bourke F 3

3 rows × 26 columns

Out[37]:

In [38]:
data['sons'] = data.apply(lambda x : \
                              1 if ((x.Ticket in (['2661', '2668', 'A/5. 851', '4133'])) & (x.SibSp>0)) else 0, axis=1)

    data.sons += data.apply(lambda x : \
                            1 if ((x.Ticket in (['CA. 2343'])) & (x.SibSp>1)) else 0, axis=1)


    data.sons += data.apply(lambda x : \
                            1 if ((x.Ticket in (['W./C. 6607'])) & (x.Personal_Titles not in (['Mr', 'Mrs']))) else 0, axis=1)

    data.sons += data.apply(lambda x: 1 if ((x.Parch>0) & (x.Age>=0) & (x.Age<20)) else 0, axis=1)
    data.sons.loc[data.PassengerId==594] = 1 # Sun with diferente pattern (family with two tickets)
    data.sons.loc[data.PassengerId==1252] = 1 # Case of 'CA. 2343' and last rule
    data.sons.loc[data.PassengerId==1084] = 1 # Case of 'A/5. 851' and last rule
    data.sons.loc[data.PassengerId==1231] = 1 # Case of 'A/5. 851' and last rule

    charts('sons', data[(data.Survived>=0)])
    

               ____________________________ Plots of sons per Survived and Dead: ____________________________
    

We observe that has only 12.1% of sons, and their had better survival rate than others.

Next, we rescue the parents, and check cases where we have both (mother and father), and cases where we have only one aboard.

In [39]:
data['parents'] = data.apply(lambda x : \
                                  1 if ((x.Ticket in (['2661', '2668', 'A/5. 851', '4133'])) & (x.SibSp==0)) else 0, axis=1)

    data.parents += data.apply(lambda x : \
                                  1 if ((x.Ticket in (['CA. 2343'])) & (x.SibSp==1)) else 0, axis=1)

    data.parents += data.apply(lambda x : 1 if ((x.Ticket in (['W./C. 6607'])) & (x.Personal_Titles in (['Mr', 'Mrs']))) \
                                    else 0, axis=1)

    # Identify parents and care nulls ages
    data.parents += data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp>0) & (x.Age>19) & (x.Age<=45) ) else 0, axis=1)
    charts('parents', data[(data.Survived>=0)])
    

               ____________________________ Plots of parents per Survived and Dead: ____________________________
    

In [40]:
data['parent_alone'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp==0) & (x.Age>19) & (x.Age<=45) ) else 0, axis=1)
    charts('parent_alone', data[(data.Survived>=0)])
    

               ____________________________ Plots of parent_alone per Survived and Dead: ____________________________
    

We can notice that the both cases are to similar and it is not significant to has this two information separately.

Before I put them together, as I had learned in assembling the sons, I made a visual inspection and discovered some cases of sons and parents that required different rules for assigning them. As I did visually and this is not a rule for a pipeline, I proceeded with the settings manually.

In [41]:
t_p_alone = data.Ticket[data.parent_alone==1].tolist()

    data[data.Ticket.isin(t_p_alone)].sort_values('Ticket')[96:]

    data.parent_alone.loc[data.PassengerId==141] = 1

    data.parent_alone.loc[data.PassengerId==541] = 0
    data.sons.loc[data.PassengerId==541] = 1

    data.parent_alone.loc[data.PassengerId==1078] = 0
    data.sons.loc[data.PassengerId==1078] = 1

    data.parent_alone.loc[data.PassengerId==98] = 0
    data.sons.loc[data.PassengerId==98] = 1

    data.parent_alone.loc[data.PassengerId==680] = 0
    data.sons.loc[data.PassengerId==680] = 1

    data.parent_alone.loc[data.PassengerId==915] = 0
    data.sons.loc[data.PassengerId==915] = 1

    data.parent_alone.loc[data.PassengerId==333] = 0
    data.sons.loc[data.PassengerId==333] = 1

    data.parent_alone.loc[data.PassengerId==119] = 0
    data.sons[data.PassengerId==119] = 1

    data.parent_alone.loc[data.PassengerId==319] = 0
    data.sons.loc[data.PassengerId==319] = 1

    data.parent_alone.loc[data.PassengerId==103] = 0
    data.sons.loc[data.PassengerId==103] = 1

    data.parents.loc[data.PassengerId==154] = 0
    data.sons.loc[data.PassengerId==1084] = 1

    data.parents.loc[data.PassengerId==581] = 0
    data.sons.loc[data.PassengerId==581] = 1

    data.parent_alone.loc[data.PassengerId==881] = 0
    data.sons.loc[data.PassengerId==881] = 1

    data.parent_alone.loc[data.PassengerId==1294] = 0
    data.sons.loc[data.PassengerId==1294] = 1

    data.parent_alone.loc[data.PassengerId==378] = 0
    data.sons.loc[data.PassengerId==378] = 1

    data.parent_alone.loc[data.PassengerId==167] = 1
    data.parent_alone.loc[data.PassengerId==357] = 0
    data.sons.loc[data.PassengerId==357] = 1

    data.parent_alone.loc[data.PassengerId==918] = 0
    data.sons.loc[data.PassengerId==918] = 1

    data.parent_alone.loc[data.PassengerId==1042] = 0
    data.sons.loc[data.PassengerId==1042] = 1

    data.parent_alone.loc[data.PassengerId==540] = 0
    data.sons.loc[data.PassengerId==540] = 1

    data.parents += data.parent_alone 
    charts('parents', data[(data.Survived>=0)])
    

               ____________________________ Plots of parents per Survived and Dead: ____________________________
    

Next, we rescue the grandparents and grandparents alone. We found the same situations with less cases and decided put all parents and grandparents in one feature and leave to age distinguish them.

In [42]:
data['grandparents'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp>0) & (x.Age>19) & (x.Age>45) ) else 0, axis=1)
    charts('grandparents', data[(data.Survived>=0)])
    

               ____________________________ Plots of grandparents per Survived and Dead: ____________________________
    

In [43]:
data['grandparent_alone'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp==0) & (x.Age>45) ) else 0, axis=1)
    charts('grandparent_alone', data[(data.Survived>=0)])
    

               ____________________________ Plots of grandparent_alone per Survived and Dead: ____________________________
    

In [44]:
data.parents += data.grandparent_alone + data.grandparents
    charts('parents', data[(data.Survived>=0)])
    

               ____________________________ Plots of parents per Survived and Dead: ____________________________
    

Next, we identify the relatives aboard:

In [45]:
data['relatives'] = data.apply(lambda x: 1 if ((x.SibSp>0) & (x.Parch==0)) else 0, axis=1)
    charts('relatives', data[(data.Survived>=0)])
    

               ____________________________ Plots of relatives per Survived and Dead: ____________________________
    

And then, the companions, persons who traveled with a family but do not have family relationship with them.

In [46]:
data['companions'] = data.apply(lambda x: 1 if ((x.SibSp==0) & (x.Parch==0) & (x.same_tckt==1)) else 0, axis=1)
    charts('companions', data[(data.Survived>=0)])
    

               ____________________________ Plots of companions per Survived and Dead: ____________________________
    

Finally, we rescue the passengers that traveled alone.

In [47]:
data['alone'] = data.apply(lambda x: 1 if ((x.SibSp==0) & (x.Parch==0) & (x.same_tckt==0)) else 0, axis=1)
    charts('alone', data[(data.Survived>=0)])
    

               ____________________________ Plots of alone per Survived and Dead: ____________________________
    

As we can see, people with a family relationship, even if only as companions, had better survival rates and very close, than people who traveled alone.

Now we can work on issues of nulls ages and then on own information of age.

Age

We start with the numbers of nulls case by survived to remember that is too high.

Then, we plot the distributions of Ages, to check how is fit into the normal and see the distortions when apply a unique value (zero) to the nulls cases.

Next, we made the scatter plot of Ages and siblings, and see hat age decreases with the increase in the number of siblings, but with a great range

In [48]:
fig = plt.figure(figsize=(20, 10))
    fig1 = fig.add_subplot(221)
    g = sns.distplot(data.Age.fillna(0), fit=norm, label='Nulls as Zero')
    g = sns.distplot(data[~data.Age.isnull()].Age, fit=norm, label='Withou Nulls')
    plt.legend(loc='upper right')
    print('Survived without Age:')
    display(data[data.Age.isnull()].Survived.value_counts())
    fig2 = fig.add_subplot(222)
    g = sns.scatterplot(data = data[(~data.Age.isnull())], y='Age', x='SibSp',  hue='Survived')
    

Survived without Age:
    

 0    125
    -1     86
     1     52
    Name: Survived, dtype: int64

From the tables below, we can see that our enforce to get Personal Titles and rescue family relationships produce better medians to apply on nulls ages.

In [49]:
print('Mean and median ages by siblings:')
    data.loc[data.Age.isnull(), 'Age'] = -1
    display(data.loc[(data.Age>=0), ['SibSp', 'Age']].groupby('SibSp').agg([np.mean, np.median]).T)

    print('\nMedian ages by Personal_Titles:')
    Ages = { 'Age' : {'median'}}
    display(data[data.Age>=0][['Age', 'Personal_Titles', 'parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone']].\
            groupby('Personal_Titles').agg(Ages).T)

    print('\nMedian ages by Personal Titles and Family Relationships:')
    display(pd.pivot_table(data[data.Age>=0][['Age', 'Personal_Titles', 'parents', 'grandparents', 
                                              'sons', 'relatives', 'companions','alone']],
                           aggfunc=np.median, 
                           index=['parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone'] , 
                           columns=['Personal_Titles']))

    print('\nNulls ages by Personal Titles and Family Relationships:')
    display(data[data.Age<0][['Personal_Titles', 'parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone']].\
            groupby('Personal_Titles').agg([sum]))
    

Mean and median ages by siblings:
    

SibSp 0 1 2 3 4 5 8
Age mean 30.921766 31.058071 23.569444 16.3125 8.772727 10.166667 14.5
median 28.000000 30.000000 21.500000 14.5000 7.000000 10.500000 14.5

    Median ages by Personal_Titles:
    

Personal_Titles Kid Miss Mr Mrs Royalty Technical
Age median 4.0 22.0 29.0 35.0 40.0 49.5

    Median ages by Personal Titles and Family Relationships:
    

Age
Personal_Titles Kid Miss Mr Mrs Royalty Technical
parents grandparents sons relatives companions alone
0 0 0 0 0 1 NaN 23.0 28.75 36.0 39.0 50.0
1 0 NaN 30.0 28.00 40.0 33.0 NaN
1 0 0 12.0 23.0 30.00 33.0 48.5 48.5
1 0 0 0 4.0 8.5 18.00 19.0 NaN NaN
1 0 0 0 0 0 NaN 24.0 38.00 35.5 NaN 29.0
1 0 0 0 0 NaN NaN 56.00 50.0 NaN 61.5

    Nulls ages by Personal Titles and Family Relationships:
    

parents grandparents sons relatives companions alone
sum sum sum sum sum sum
Personal_Titles
Kid 0 0 8 0 0 1
Miss 0 0 10 7 3 30
Mr 2 0 3 16 20 135
Mrs 7 0 0 11 3 7
Technical 0 0 0 0 0 1

So, we apply to the nulls ages the respectively median of same personal title and same family relationship, but before, we create a binary feature to maintain the information of the presence of nulls.

In [50]:
data['Without_Age'] = data.Age.apply(lambda x: 0 if x>0 else 1)

    data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Miss')] = \
       data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Miss')].median()

    data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Mr')] = \
       data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Mr')].median()

    data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Mrs')] = \
       data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Mrs')].median()

    data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Kid')] = \
       data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Kid')].median()

    data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Technical')] = \
       data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Technical')].median()

    data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Miss')] = \
       data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Miss')].median()

    data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Mr')] = \
       data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Mr')].median()

    data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Mrs')] = \
       data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Mrs')].median()

    data.Age.loc[(data.Age<0) & (data.parents==1) & (data.Personal_Titles=='Mr')] = \
       data.Age[(data.Age>=0) & (data.parents==1) & (data.Personal_Titles=='Mr')].median()

    data.Age.loc[(data.Age<0) & (data.parents==1) & (data.Personal_Titles=='Mrs')] = \
       data.Age[(data.Age>=0) & (data.parents==1) & (data.Personal_Titles=='Mrs')].median()

    data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Kid')] = \
       data.Age[(data.Age>=0) & (data.Personal_Titles=='Kid')].median()
    data.Age.loc[(data.Age.isnull()) & (data.sons==1) & (data.Personal_Titles=='Kid')] = \
       data.Age[(data.Age>=0) & (data.Personal_Titles=='Kid')].median()

    data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Miss')] = \
       data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Miss')].median()

    data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Mr')] = \
       data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Mr')].median()

    data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Mrs')] = \
       data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Mrs')].median()

    data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Miss')] = \
       data.Age[(data.Age>=0) & (data.relatives==1) & (data.Personal_Titles=='Miss')].median()

    data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Mr')] = \
       data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Mr')].median()

    data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Mrs')] = \
       data.Age[(data.Age>=0) & (data.relatives==1) & (data.Personal_Titles=='Mrs')].median()
    

Finally, we check how age distribution lines after fill the nulls.

In [51]:
print('Age correlation with survived:',data.corr()['Survived'].Age)
    g = sns.distplot(data.Age, fit=norm, label='With nulls filled')
    plt.legend(loc='upper right')
    plt.show()
    

Age correlation with survived: -0.04046792444732172
    

To have better understanding of age, its proportion and its relation to survival ration, we binning it as follow

In [52]:
def binningAge(df):
        # Binning Age based on custom ranges
        bin_ranges = [0, 1.7, 8, 15, 18, 25, 55, 65, 100] 
        bin_names = [0, 1, 2, 3, 4, 5, 6, 7]
        df['Age_bin_custom_range'] = pd.cut(np.array(df.Age), bins=bin_ranges)
        df['Age_bin_custom_label'] = pd.cut(np.array(df.Age), bins=bin_ranges, labels=bin_names)
        return df

    data = binningAge(data)
    display(data[['Age', 'Age_bin_custom_range', 'Age_bin_custom_label']].sample(5))
    display(pd.pivot_table(data[['Age_bin_custom_range', 'Survived']], aggfunc=np.count_nonzero, 
                           index=['Survived'] , columns=['Age_bin_custom_range']))
    

Age Age_bin_custom_range Age_bin_custom_label
485 8.50 (8.0, 15.0] 2
1077 21.00 (18.0, 25.0] 4
754 48.00 (25.0, 55.0] 5
1282 51.00 (25.0, 55.0] 5
1275 28.75 (25.0, 55.0] 5

Age_bin_custom_range (0.0, 1.7] (1.7, 8.0] (8.0, 15.0] (15.0, 18.0] (18.0, 25.0] (25.0, 55.0] (55.0, 65.0] (65.0, 100.0]
Survived
-1 16 28 30 58 200 466 34 4
0 2 18 24 46 114 317 21 7
1 24 52 28 44 146 366 22 2

In [53]:
charts('Age_bin_custom_label', data[(data.Survived>=0)])
    

               ____________________________ Plots of Age_bin_custom_label per Survived and Dead: ____________________________
    

One hot encode and drop provisory and useless features

One hot encode categorical and non ordinal data and drop useless features.

In [54]:
data['genre'] = data.Sex.apply(lambda x: 1 if x=='male' else 0)
    data.drop(['Name', 'Cabin', 'Ticket', 'Sex', 'same_tckt', 'qtd_same_ticket', 'parent_alone', 'grandparents', 
               'grandparent_alone', 'Age_bin_custom_range'], axis=1, inplace=True) # , 'Age', 'Parch', 'SibSp',
    data = pd.get_dummies(data, columns = ['Cabin_Letter', 'Personal_Titles', 'Embarked', 'distinction_in_tikect'])

    data = pd.get_dummies(data, columns = ['surname']) # 'Age_bin_custom_label'
    data.drop(['surname_Exclude'], axis=1, inplace=True)
    

Scipy‘s pearson method computes both the correlation and p-value for the correlation, roughly showing the probability of an uncorrelated system creating a correlation value of this magnitude.
import numpy as np
from scipy.stats import pearson
pearson(data.loc[:, ‘Pclass’], data.Survived)

Select Features

All of the features we find in the dataset might not be useful in building a machine learning model to make the necessary prediction. Using some of the features might even make the predictions worse.

Often in data science we have hundreds or even millions of features and we want a way to create a model that only includes the most important features. This has three benefits.

  1. It reduces the variance of the model, and therefore overfitting.
  2. It reduces the complexity of a model and makes it easier to interpret.
  3. It improves the accuracy of a model if the right subset is chosen.
  4. Finally, it reduces the computational cost (and time) of training a model.

So, an alternative way to reduce the complexity of the model and avoid overfitting is dimensionality reduction via feature selection, which is especially useful for unregularized models. There are two main categories of dimensionality reduction techniques: feature selection and feature extraction. Using feature selection, we select a subset of the original features. In feature extraction, we derive information from the feature set to construct a new feature subspace.

Exist various methodologies and techniques that you can use to subset your feature space and help your models perform better and efficiently. So, let’s get started.

First check for any correlations between features

Correlation is a statistical term which in common usage refers to how close two features are to having a linear relationship with each other. The Pearson’s correlation which measures linear correlation between two features, the resulting value lies in [-1;1], with -1 meaning perfect negative correlation (as one feature increases, the other decreases), +1 meaning perfect positive correlation and 0 meaning no linear correlation between the two features.

Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of the two features.

There are five assumptions that are made with respect to Pearson’s correlation:

  • The feature must be either interval or ratio measurements.
  • The variables must be approximately normally distributed.
  • There is a linear relationship between the two variables.
  • Outliers are either kept to a minimum or are removed entirely
  • There is homoscedasticity of the data. Homoscedasticity basically means that the variances along the line of best fit remain similar as you move along the line.

One obvious drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to a linear relationship. If the relation is non-linear, Pearson correlation can be close to zero even if there is a 1-1 correspondence between the two variables. For example, correlation between x and x2 is zero, when x is centered on 0.

Furthermore, relying only on the correlation value on interpreting the relationship of two variables can be highly misleading, so it is always worth plotting the data as we did on the EDA phase.

The following guidelines interpreting Pearson’s correlation coefficient (r):

Strength of Association r Positive r Negative
Small .1 to .3 -0.1 to -0.3
Medium .3 to .5 -0.3 to -0.5
Large .5 to 1.0 -0.5 to -1.0

The correlation matrix is identical to a covariance matrix computed from standardized data. The correlation matrix is a square matrix that contains the Pearson product-moment correlation coefficients (often abbreviated as Pearson’s r), which measure the linear dependence between pairs of features. Pearson’s correlation coefficient can simply be calculated as the covariance between two features x and y (numerator) divided by the product of their standard deviations (denominator):

The covariance between standardized features is in fact equal to their linear correlation coefficient.

Let’s check what are the highest correlations with survived, I will now create a correlation matrix to quantify the linear relationship between the features. To do this I use NumPy’s corrcoef and seaborn’s heatmap function to plot the correlation matrix array as a heat map.

In [55]:
corr = data.loc[:, 'Survived':].corr()
    top_corr_cols = corr[abs(corr.Survived)>=0.06].Survived.sort_values(ascending=False).keys()
    top_corr = corr.loc[top_corr_cols, top_corr_cols]
    dropSelf = np.zeros_like(top_corr)
    dropSelf[np.triu_indices_from(dropSelf)] = True
    plt.figure(figsize=(15, 15))
    sns.heatmap(top_corr, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=True, fmt=".2f", mask=dropSelf)
    sns.set(font_scale=0.8)
    plt.show()
    

Let’s see we have more surnames between 0.05 and 0.06 of correlation wit Survived.

In [56]:
display(corr[(abs(corr.Survived)>=0.05) & (abs(corr.Survived)<0.06)].Survived.sort_values(ascending=False).keys())
    del corr, dropSelf, top_corr
    

Index(['surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
           'surname_Moor', 'surname_Chambers', 'surname_Hamalainen',
           'surname_Dick', 'surname_Taylor', 'surname_Doling', 'surname_Gordon',
           'surname_Beane', 'surname_Hippach', 'surname_Bishop',
           'surname_Mellinger', 'surname_Yarred', 'Personal_Titles_Royalty'],
          dtype='object')

Drop the features with highest correlations to other Features:

Colinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. And as you see above, it is easy to find highest collinearities (Personal_Titles_Mrs, Personal_Titles_Mr and Fare.

You should always be concerned about the collinearity, regardless of the model/method being linear or not, or the main task being prediction or classification.

Assume a number of linearly correlated covariates/features present in the data set and Random Forest as the method. Obviously, random selection per node may pick only (or mostly) collinear features which may/will result in a poor split, and this can happen repeatedly, thus negatively affecting the performance.

Now, the collinear features may be less informative of the outcome than the other (non-collinear) features and as such they should be considered for elimination from the feature set anyway. However, assume that the features are ranked high in the ‘feature importance’ list produced by RF. As such they would be kept in the data set unnecessarily increasing the dimensionality. So, in practice, I’d always, as an exploratory step (out of many related) check the pairwise association of the features, including linear correlation.

Identify and treat multicollinearity:

Multicollinearity is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model, leading to unreliable and unstable estimates of regression coefficients. To make matters worst multicollinearity can emerge even when isolated pairs of variables are not collinear.

To identify, we need start with the coefficient of determination, r2, is the square of the Pearson correlation coefficient r. The coefficient of determination, with respect to correlation, is the proportion of the variance that is shared by both variables. It gives a measure of the amount of variation that can be explained by the model (the correlation is the model). It is sometimes expressed as a percentage (e.g., 36% instead of 0.36) when we discuss the proportion of variance explained by the correlation. However, you should not write r2 = 36%, or any other percentage. You should write it as a proportion (e.g., r2 = 0.36).

Already the Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. It is may be calculated for each predictor by doing a linear regression of that predictor on all the other predictors, and then obtaining the R2 from that regression. It is calculated by taking the the ratio of the variance of all a given model’s betas divide by the variance of a single beta if it were fit alone [1/(1-R2)]. Thus, a VIF of 1.8 tells us that the variance (the square of the standard error) of a particular coefficient is 80% larger than it would be if that predictor was completely uncorrelated with all the other predictors. The VIF has a lower bound of 1 but no upper bound. Authorities differ on how high the VIF has to be to constitute a problem (e.g.: 2.50 (R2 equal to 0.6), sometimes 5 (R2 equal to .8), or greater than 10 (R2 equal to 0.9) and so on).

But there are several situations in which multicollinearity can be safely ignored:

  • Interaction terms and higher-order terms (e.g., squared and cubed predictors) are correlated with main effect terms because they include the main effects terms. Ops! Sometimes we use polynomials to solve problems, indeed! But keep calm, in these cases, standardizing the predictors can removed the multicollinearity.
  • Indicator, like dummy or one-hot-encode, that represent a categorical variable with three or more categories. If the proportion of cases in the reference category is small, the indicator will necessarily have high VIF’s, even if the categorical is not associated with other variables in the regression model. But, you need check if some dummy is collinear or has multicollinearity with other features outside of their dummies.
  • *Control feature if the feature of interest do not have high VIF’s. Here’s the thing about multicollinearity: it’s only a problem for the features that are collinear. It increases the standard errors of their coefficients, and it may make those coefficients unstable in several ways. But so long as the collinear feature are only used as control feature, and they are not collinear with your feature of interest, there’s no problem. The coefficients of the features of interest are not affected, and the performance of the control feature as controls is not impaired.

So, generally, we could run the same model twice, once with severe multicollinearity and once with moderate multicollinearity. This provides a great head-to-head comparison and it reveals the classic effects of multicollinearity. However, when standardizing your predictors doesn’t work, you can try other solutions such as:

  • removing highly correlated predictors
  • linearly combining predictors, such as adding them together
  • running entirely different analyses, such as partial least squares regression or principal components analysis

When considering a solution, keep in mind that all remedies have potential drawbacks. If you can live with less precise coefficient estimates, or a model that has a high R-squared but few significant predictors, doing nothing can be the correct decision because it won’t impact the fit.

Given the potential for correlation among the predictors, we’ll have display the variance inflation factors (VIF), which indicate the extent to which multicollinearity is present in a regression analysis. Hence such variables need to be removed from the model. Deleting one variable at a time and then again checking the VIF for the model is the best way to do this.

So, I start the analysis removed the 3 features with he highest collinearities and the surnames different from my control surname_Alone and correlation with survived less then 0.05, and run VIF.

In [57]:
#Step 1: Remove the higest correlations and run a multiple regression
    cols = [ 'family',
             'non_relatives',
             'surname_Alone',
             'surname_Baclini',
             'surname_Carter',
             'surname_Richards',
             'surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
             'surname_Moor', 'surname_Chambers', 'surname_Hamalainen',
             'surname_Dick', 'surname_Taylor', 'surname_Doling', 'surname_Gordon',
             'surname_Beane', 'surname_Hippach', 'surname_Bishop',
             'surname_Mellinger', 'surname_Yarred', 
             'Pclass',
             'Age',
             'SibSp',
             'Parch',
             #'Fare',
             'qtd_same_ticket_bin',
             'passenger_fare',
             #'SibSp_bin',
             #'Parch_bin',
             'distinction_in_name',
             'Cabin_Number',
             'sons',
             'parents',
             'relatives',
             'companions',
             'alone',
             'Without_Age',
             'Age_bin_custom_label',
             'genre',
             'Cabin_Letter_A',
             'Cabin_Letter_B',
             'Cabin_Letter_C',
             'Cabin_Letter_D',
             'Cabin_Letter_E',
             'Cabin_Letter_F',
             'Cabin_Letter_G',
             'Personal_Titles_Kid',
             'Personal_Titles_Miss',
             #'Personal_Titles_Mr',
             #'Personal_Titles_Mrs',
             'Personal_Titles_Royalty',
             'Personal_Titles_Technical',
             'Embarked_C',
             'Embarked_Q',
             'Embarked_S',
             'distinction_in_tikect_High',
             'distinction_in_tikect_Low',
             'distinction_in_tikect_Others',
             'distinction_in_tikect_PC'
    ]

    y_train = data.Survived[data.Survived>=0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    features = "+".join(cols)
    df2 = pd.concat([y_train, df], axis=1)

    # get y and X dataframes based on this regression:
    y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')

    #Step 2: Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    #Step 3: Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    

VIF Factor features
0 1.000000 Intercept
17 1.027080 surname_Beane
20 1.039424 surname_Mellinger
15 1.048763 surname_Doling
12 1.056317 surname_Hamalainen
14 1.061651 surname_Taylor
11 1.063774 surname_Chambers
9 1.071730 surname_Goldenberg
10 1.072810 surname_Moor
8 1.081285 surname_Beckwith
13 1.082148 surname_Dick
18 1.083090 surname_Hippach
19 1.091573 surname_Bishop
6 1.091714 surname_Richards
7 1.097315 surname_Harper
21 1.101053 surname_Yarred
5 1.107265 surname_Carter
48 1.121313 Personal_Titles_Technical
4 1.133671 surname_Baclini
35 1.382636 Without_Age
29 1.594563 Cabin_Number
47 1.768952 Personal_Titles_Royalty
16 1.865907 surname_Gordon
45 2.156407 Personal_Titles_Kid
3 2.201757 surname_Alone
28 2.954703 distinction_in_name
27 3.750101 passenger_fare
22 4.436090 Pclass
23 4.742809 Age
46 5.008117 Personal_Titles_Miss
37 6.403387 genre
36 7.591511 Age_bin_custom_label
2 7.644149 non_relatives
26 20.666651 qtd_same_ticket_bin
50 39.868704 Embarked_Q
49 74.967792 Embarked_C
51 97.344161 Embarked_S
54 inf distinction_in_tikect_Others
53 inf distinction_in_tikect_Low
52 inf distinction_in_tikect_High
1 inf family
44 inf Cabin_Letter_G
43 inf Cabin_Letter_F
41 inf Cabin_Letter_D
40 inf Cabin_Letter_C
39 inf Cabin_Letter_B
38 inf Cabin_Letter_A
34 inf alone
33 inf companions
32 inf relatives
31 inf parents
30 inf sons
24 inf SibSp
25 inf Parch
42 inf Cabin_Letter_E
55 inf distinction_in_tikect_PC

From the results, I conclude that can safe maintain the dummies of Embarked, but need work in the remaining features where’s the VIF stated as inf. You can see that surname Alone has a VIF of 2.2. We’re going to treat it as our baseline and exclude it from our fit. This is done to prevent multicollinearity, or the dummy variable trap caused by including a dummy variable for every single category. let’s try remove the dummy alone, that is pretty similar, and check if it solves the other dummies from its category:

In [58]:
#Step 1: Remove one feature with VIF on Inf from the same category and run a multiple regression
    cols.remove('alone')

    y_train = data.Survived[data.Survived>=0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    features = "+".join(cols)
    df2 = pd.concat([y_train, df], axis=1)

    # get y and X dataframes based on this regression:
    y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')

    #Step 2: Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    #Step 3: Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    

VIF Factor features
0 1.000000 Intercept
17 1.027080 surname_Beane
20 1.039424 surname_Mellinger
15 1.048763 surname_Doling
12 1.056317 surname_Hamalainen
14 1.061651 surname_Taylor
11 1.063774 surname_Chambers
9 1.071730 surname_Goldenberg
10 1.072810 surname_Moor
8 1.081285 surname_Beckwith
13 1.082148 surname_Dick
18 1.083090 surname_Hippach
19 1.091573 surname_Bishop
6 1.091714 surname_Richards
7 1.097315 surname_Harper
21 1.101053 surname_Yarred
5 1.107265 surname_Carter
47 1.121313 Personal_Titles_Technical
4 1.133671 surname_Baclini
34 1.382636 Without_Age
29 1.594563 Cabin_Number
46 1.768952 Personal_Titles_Royalty
16 1.865907 surname_Gordon
44 2.156407 Personal_Titles_Kid
33 2.185638 companions
3 2.201757 surname_Alone
28 2.954703 distinction_in_name
32 3.021953 relatives
27 3.750101 passenger_fare
22 4.436090 Pclass
31 4.540579 parents
23 4.742809 Age
45 5.008117 Personal_Titles_Miss
30 6.128749 sons
36 6.403387 genre
35 7.591511 Age_bin_custom_label
2 7.644149 non_relatives
26 20.666651 qtd_same_ticket_bin
49 39.868704 Embarked_Q
48 74.967792 Embarked_C
50 97.344161 Embarked_S
52 inf distinction_in_tikect_Low
51 inf distinction_in_tikect_High
1 inf family
41 inf Cabin_Letter_E
42 inf Cabin_Letter_F
53 inf distinction_in_tikect_Others
40 inf Cabin_Letter_D
39 inf Cabin_Letter_C
38 inf Cabin_Letter_B
37 inf Cabin_Letter_A
24 inf SibSp
25 inf Parch
43 inf Cabin_Letter_G
54 inf distinction_in_tikect_PC

To solve Cabin Letter, we can try remove only the lowest frequency ‘A’, and see if we can accept the VIF’s of others Cabins:

In [59]:
#Step 1: Remove one feature with VIF on Inf from the same category and run a multiple regression
    cols.remove('Cabin_Letter_A')

    y_train = data.Survived[data.Survived>=0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    features = "+".join(cols)
    df2 = pd.concat([y_train, df], axis=1)

    # get y and X dataframes based on this regression:
    y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')

    #Step 2: Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    #Step 3: Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    

VIF Factor features
0 1.000000 Intercept
17 1.027080 surname_Beane
20 1.039424 surname_Mellinger
15 1.048763 surname_Doling
12 1.056317 surname_Hamalainen
14 1.061651 surname_Taylor
11 1.063774 surname_Chambers
9 1.071730 surname_Goldenberg
10 1.072810 surname_Moor
8 1.081285 surname_Beckwith
13 1.082148 surname_Dick
18 1.083090 surname_Hippach
19 1.091573 surname_Bishop
6 1.091714 surname_Richards
7 1.097315 surname_Harper
21 1.101053 surname_Yarred
5 1.107265 surname_Carter
46 1.121313 Personal_Titles_Technical
4 1.133671 surname_Baclini
34 1.382636 Without_Age
29 1.594563 Cabin_Number
45 1.768952 Personal_Titles_Royalty
16 1.865907 surname_Gordon
43 2.156407 Personal_Titles_Kid
33 2.185638 companions
3 2.201757 surname_Alone
28 2.954703 distinction_in_name
32 3.021953 relatives
27 3.750101 passenger_fare
22 4.436090 Pclass
39 4.520122 Cabin_Letter_D
31 4.540579 parents
23 4.742809 Age
37 4.871771 Cabin_Letter_B
44 5.008117 Personal_Titles_Miss
38 5.836347 Cabin_Letter_C
30 6.128749 sons
36 6.403387 genre
35 7.591511 Age_bin_custom_label
2 7.644149 non_relatives
42 9.446441 Cabin_Letter_G
41 16.107676 Cabin_Letter_F
40 17.938593 Cabin_Letter_E
26 20.666651 qtd_same_ticket_bin
48 39.868704 Embarked_Q
47 74.967792 Embarked_C
49 97.344161 Embarked_S
25 inf Parch
24 inf SibSp
1 inf family
50 inf distinction_in_tikect_High
51 inf distinction_in_tikect_Low
52 inf distinction_in_tikect_Others
53 inf distinction_in_tikect_PC

Now our focus is on distinct in name, since “High” has less observations, let’s try dropped it and drop the bins of Parch and SibSp.

In [60]:
cols.remove('distinction_in_tikect_High')

    y_train = data.Survived[data.Survived>=0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    features = "+".join(cols)
    df2 = pd.concat([y_train, df], axis=1)

    # get y and X dataframes based on this regression:
    y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')

    #Step 2: Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    #Step 3: Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    

VIF Factor features
0 1.000000 Intercept
17 1.027080 surname_Beane
20 1.039424 surname_Mellinger
15 1.048763 surname_Doling
12 1.056317 surname_Hamalainen
14 1.061651 surname_Taylor
11 1.063774 surname_Chambers
9 1.071730 surname_Goldenberg
10 1.072810 surname_Moor
8 1.081285 surname_Beckwith
13 1.082148 surname_Dick
18 1.083090 surname_Hippach
19 1.091573 surname_Bishop
6 1.091714 surname_Richards
7 1.097315 surname_Harper
21 1.101053 surname_Yarred
5 1.107265 surname_Carter
46 1.121313 Personal_Titles_Technical
4 1.133671 surname_Baclini
34 1.382636 Without_Age
29 1.594563 Cabin_Number
45 1.768952 Personal_Titles_Royalty
16 1.865907 surname_Gordon
43 2.156407 Personal_Titles_Kid
33 2.185638 companions
3 2.201757 surname_Alone
28 2.954703 distinction_in_name
32 3.021953 relatives
50 3.271151 distinction_in_tikect_Low
27 3.750101 passenger_fare
52 4.229762 distinction_in_tikect_PC
22 4.436090 Pclass
39 4.520122 Cabin_Letter_D
31 4.540579 parents
23 4.742809 Age
37 4.871771 Cabin_Letter_B
44 5.008117 Personal_Titles_Miss
51 5.273404 distinction_in_tikect_Others
38 5.836347 Cabin_Letter_C
30 6.128749 sons
36 6.403387 genre
35 7.591511 Age_bin_custom_label
2 7.644149 non_relatives
42 9.446441 Cabin_Letter_G
41 16.107676 Cabin_Letter_F
40 17.938593 Cabin_Letter_E
26 20.666651 qtd_same_ticket_bin
48 39.868704 Embarked_Q
47 74.967792 Embarked_C
49 97.344161 Embarked_S
24 inf SibSp
25 inf Parch
1 inf family

As we can see, we now have to remove one between family, parch and SibSp. Note that non_relatives qtd_same_ticket_bin are already with relatively acceptable VIF’s. The first is directly calculated from the family and the second is very close as we have seen. So let’s discard the family.

In [61]:
cols.remove('family')

    y_train = data.Survived[data.Survived>=0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    features = "+".join(cols)
    df2 = pd.concat([y_train, df], axis=1)

    # get y and X dataframes based on this regression:
    y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')

    #Step 2: Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    #Step 3: Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    

VIF Factor features
0 1.000000 Intercept
16 1.027080 surname_Beane
19 1.039424 surname_Mellinger
14 1.048763 surname_Doling
11 1.056317 surname_Hamalainen
13 1.061651 surname_Taylor
10 1.063774 surname_Chambers
8 1.071730 surname_Goldenberg
9 1.072810 surname_Moor
7 1.081285 surname_Beckwith
12 1.082148 surname_Dick
17 1.083090 surname_Hippach
18 1.091573 surname_Bishop
5 1.091714 surname_Richards
6 1.097315 surname_Harper
20 1.101053 surname_Yarred
4 1.107265 surname_Carter
45 1.121313 Personal_Titles_Technical
3 1.133671 surname_Baclini
33 1.382636 Without_Age
28 1.594563 Cabin_Number
44 1.768952 Personal_Titles_Royalty
15 1.865907 surname_Gordon
42 2.156407 Personal_Titles_Kid
32 2.185638 companions
2 2.201757 surname_Alone
27 2.954703 distinction_in_name
31 3.021953 relatives
49 3.271151 distinction_in_tikect_Low
26 3.750101 passenger_fare
51 4.229762 distinction_in_tikect_PC
21 4.436090 Pclass
38 4.520122 Cabin_Letter_D
30 4.540579 parents
22 4.742809 Age
36 4.871771 Cabin_Letter_B
23 4.913654 SibSp
43 5.008117 Personal_Titles_Miss
50 5.273404 distinction_in_tikect_Others
37 5.836347 Cabin_Letter_C
29 6.128749 sons
35 6.403387 genre
24 6.585030 Parch
34 7.591511 Age_bin_custom_label
1 7.644149 non_relatives
41 9.446441 Cabin_Letter_G
40 16.107676 Cabin_Letter_F
39 17.938593 Cabin_Letter_E
25 20.666651 qtd_same_ticket_bin
47 39.868704 Embarked_Q
46 74.967792 Embarked_C
48 97.344161 Embarked_S

Yea, we can accept, and we can proceed to the next step.

Feature Selection by Filter Methods

Filter methods use statistical methods for evaluation of a subset of features, they are generally used as a preprocessing step. These methods are also known as univariate feature selection, they examines each feature individually to determine the strength of the relationship of the feature with the dependent variable. These methods are simple to run and understand and are in general particularly good for gaining a better understanding of data, but not necessarily for optimizing the feature set for better generalization.

So, the features are selected on the basis of their scores in various statistical tests for their correlation with the outcome variable. The correlation is a subjective term here. For basic guidance, you can refer to the following table for defining correlation co-efficients.

Feature/Response Continuous Categorical
Continuous Pearson’s Correlation LDA
Categorical Anova Chi-Square

One thing that should be kept in mind is that filter methods do not remove multicollinearity. So, you must deal with multicollinearity of features as well before training models for your data.

There are lot of different options for univariate selection. Some examples are:

  • Model Based Ranking
  • Mutual information and maximal information coefficient (MIC).

I did not approach the latter, because there has been some critique about MIC’s statistical power, i.e. the ability to reject the null hypothesis when the null hypothesis is false. This may or may not be a concern, depending on the particular dataset and its noisiness. If you have interest on this, in python, MIC is available in the minepy library.

Feature Selection by Model based ranking

We can use an arbitrary machine learning method to build a predictive model for the response variable using each individual feature, and measure the performance of each model.

In fact, this is already put to use with Pearson’s correlation coefficient, since it is equivalent to standardized regression coefficient that is used for prediction in linear regression. But this method it is not good to select features with non-linear relation to dependent variable. For this, there are a number of alternatives, for example tree based methods (decision trees, random forest), linear model with basis expansion etc. Tree based methods are probably among the easiest to apply, since they can model non-linear relations well and don’t require much tuning. The main thing to avoid is overfitting, so the depth of tree(s) should be relatively small, and cross-validation should be applied.

In [62]:
rf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=101)
    scores = []
    for i in range(df.shape[1]):
         score = cross_val_score(rf, df.iloc[:, i:i+1], y_train, scoring="accuracy", cv=10)
         scores.append((round(np.mean(score), 3), cols[i]))
    MBR = pd.DataFrame(sorted(scores, reverse=True), columns=['Score', 'Feature'])
    g = MBR.iloc[:15, :].plot(x='Feature', kind='barh', figsize=(20,10), fontsize=12, grid=True)
    plt.show()
    MBR = MBR.iloc[:15, 1]
    

Feature Selection by SelectKBest:

On scikit-learn we find variety of implementation oriented to classifications tasks to select features according to the k highest scores, see below some of that:

  • f_classif compute the ANOVA F-value for the provided sample.
  • chi2 compute chi-squared stats between each non-negative feature and class.
  • mutual_info_classif estimate mutual information for a discrete target variable.

The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.

Other important point is if you use sparse data, for example if we continue consider hot-encode of surnames, chi2 and mutual_info_classif will deal with the data without making it dense.

Let’s see the SelectKBest of f_classif and chi2 for our data:

In [63]:
cols = pd.Index(cols)

    skb = SelectKBest(score_func=f_classif, k=10)
    skb.fit(df, y_train)

    select_features_kbest = skb.get_support()
    feature_f_clas = cols[select_features_kbest]
    feature_f_clas_scores = [(item, score) for item, score in zip(cols, skb.scores_)]
    print('Total features slected by f_classif Statistical Methods',len(feature_f_clas))
    fig = plt.figure(figsize=(20,7))
    f1 = fig.add_subplot(121)
    g = pd.DataFrame(sorted(feature_f_clas_scores, key=lambda x: -x[1])[:len(feature_f_clas)], columns=['Feature','F-Calss Score']).\
    plot(x='Feature', kind='barh', title= 'F Class Score', fontsize=18, ax=f1, grid=True)

    scale = MinMaxScaler()
    df2 = scale.fit_transform(data.loc[data.Survived>=0, cols])
    skb = SelectKBest(score_func=chi2, k=10)
    skb.fit(df2, y_train)
    select_features_kbest = skb.get_support()
    feature_chi2 = cols[select_features_kbest]
    feature_chi2_scores = [(item, score) for item, score in zip(cols, skb.scores_)]
    print('Total features slected by chi2 Statistical Methods',len(feature_chi2))
    f2 = fig.add_subplot(122)
    g = pd.DataFrame(sorted(feature_chi2_scores, key=lambda x: -x[1])[:len(feature_chi2)], columns=['Feature','Chi2 Score']).\
    plot(x='Feature', kind='barh',  title= 'Chi2 Score', fontsize=18, ax=f2, grid=True)

    SMcols = set(feature_f_clas).union(set(feature_chi2))
    print("Extra features select by f_class:\n", set(feature_f_clas).difference(set(feature_chi2)), '\n')
    print("Extra features select by chi2:\n", set(feature_chi2).difference(set(feature_f_clas)), '\n')
    print("Intersection features select by f_class and chi2:\n",set(feature_f_clas).intersection(set(feature_chi2)), '\n')
    print('Total number of features selected:', len(SMcols))
    print(SMcols)

    plt.tight_layout(); plt.show()
    

Total features slected by f_classif Statistical Methods 10
    Total features slected by chi2 Statistical Methods 10
    Extra features select by f_class:
     {'passenger_fare', 'Embarked_S'} 

    Extra features select by chi2:
     {'distinction_in_tikect_PC', 'Cabin_Letter_C'} 

    Intersection features select by f_class and chi2:
     {'Embarked_C', 'Cabin_Letter_E', 'Cabin_Letter_B', 'genre', 'distinction_in_tikect_Low', 'Pclass', 'distinction_in_name', 'Personal_Titles_Miss'} 

    Total number of features selected: 12
    {'passenger_fare', 'Embarked_C', 'Cabin_Letter_E', 'Cabin_Letter_B', 'genre', 'distinction_in_tikect_Low', 'Pclass', 'distinction_in_name', 'Embarked_S', 'distinction_in_tikect_PC', 'Cabin_Letter_C', 'Personal_Titles_Miss'}
    

Wrapper Methods

In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from your subset. The problem is essentially reduced to a search problem.

The two main disadvantages of these methods are :

  • The increasing overfitting risk when the number of observations is insufficient.
  • These methods are usually computationally very expensive.

Backward Elimination

In backward elimination, we start with all the features and removes the least significant feature at each iteration which improves the performance of the model. We repeat this until no improvement is observed on removal of features.

We will see below row implementation of backward elimination, one to select by P-values and other based on the accuracy of a model the we submitted to it.

Backward Elimination By P-values

The P-value, or probability value, or asymptotic significance, is a probability value for a given statistical model that, if the null hypothesis is true, a set of statistical observations more commonly known as the statistical summary is greater than or equal in magnitude to the observed results.

The null hypothesis is a general statement that there is no relationship between two measured phenomena.

For example, if the correlation is very small and furthermore, the p-value is high meaning that it is very likely to observe such correlation on a dataset of this size purely by chance.

But you need to be careful how you interpret the statistical significance of a correlation. If your correlation coefficient has been determined to be statistically significant this does not mean that you have a strong association. It simply tests the null hypothesis that there is no relationship. By rejecting the null hypothesis, you accept the alternative hypothesis that states that there is a relationship, but with no information about the strength of the relationship or its importance.

Since removal of different features from the dataset will have different effects on the p-value for the dataset, we can remove different features and measure the p-value in each case. These measured p-values can be used to decide whether to keep a feature or not.

Next we make the test of a logit regression to check the result and select features based on its the P-value:

In [64]:
logit_model=sm.Logit(y_train,df)
    result=logit_model.fit(method='bfgs', maxiter=2000)
    print(result.summary())
    

Optimization terminated successfully.
             Current function value: 0.363678
             Iterations: 285
             Function evaluations: 289
             Gradient evaluations: 289
                               Logit Regression Results                           
    ==============================================================================
    Dep. Variable:               Survived   No. Observations:                  891
    Model:                          Logit   Df Residuals:                      840
    Method:                           MLE   Df Model:                           50
    Date:                Sat, 22 Sep 2018   Pseudo R-squ.:                  0.4539
    Time:                        18:38:46   Log-Likelihood:                -324.04
    converged:                       True   LL-Null:                       -593.33
                                            LLR p-value:                 4.183e-83
    ================================================================================================
                                       coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------------------------
    non_relatives                    0.6521      0.248      2.626      0.009       0.165       1.139
    surname_Alone                    0.2531      0.317      0.797      0.425      -0.369       0.875
    surname_Baclini                 35.0286    5.2e+05   6.74e-05      1.000   -1.02e+06    1.02e+06
    surname_Carter                   0.3247      1.478      0.220      0.826      -2.572       3.221
    surname_Richards                24.5135    1.8e+05      0.000      1.000   -3.53e+05    3.53e+05
    surname_Harper                   1.6077      1.493      1.077      0.282      -1.319       4.534
    surname_Beckwith                25.2653   1.48e+05      0.000      1.000    -2.9e+05     2.9e+05
    surname_Goldenberg              18.1282   6212.320      0.003      0.998   -1.22e+04    1.22e+04
    surname_Moor                    12.1274    708.078      0.017      0.986   -1375.680    1399.934
    surname_Chambers                20.4606    1.8e+04      0.001      0.999   -3.53e+04    3.53e+04
    surname_Hamalainen              12.1778    769.487      0.016      0.987   -1495.989    1520.345
    surname_Dick                    23.5492   5.64e+04      0.000      1.000    -1.1e+05    1.11e+05
    surname_Taylor                  26.5937   2.03e+05      0.000      1.000   -3.97e+05    3.97e+05
    surname_Doling                  11.1800    568.824      0.020      0.984   -1103.695    1126.054
    surname_Gordon                  21.6263   1.37e+04      0.002      0.999   -2.67e+04    2.68e+04
    surname_Beane                   29.3099   9.09e+05   3.23e-05      1.000   -1.78e+06    1.78e+06
    surname_Hippach                  9.4228    329.299      0.029      0.977    -635.992     654.838
    surname_Bishop                  13.9600    991.219      0.014      0.989   -1928.793    1956.713
    surname_Mellinger               12.1126    747.967      0.016      0.987   -1453.876    1478.101
    surname_Yarred                  17.7443   6003.748      0.003      0.998   -1.17e+04    1.18e+04
    Pclass                          -1.0271      0.258     -3.982      0.000      -1.533      -0.522
    Age                             -0.0297      0.016     -1.848      0.065      -0.061       0.002
    SibSp                           -0.1467      0.216     -0.679      0.497      -0.570       0.277
    Parch                            0.1022      0.302      0.338      0.736      -0.491       0.695
    qtd_same_ticket_bin             -0.6804      0.361     -1.885      0.059      -1.388       0.027
    passenger_fare                   0.0444      0.019      2.320      0.020       0.007       0.082
    distinction_in_name              1.2501      0.419      2.987      0.003       0.430       2.070
    Cabin_Number                     0.0699      0.037      1.876      0.061      -0.003       0.143
    sons                             0.6661      0.697      0.956      0.339      -0.700       2.032
    parents                          0.3338      0.712      0.469      0.639      -1.061       1.729
    relatives                        0.2784      0.485      0.574      0.566      -0.672       1.229
    companions                      -0.2517      0.508     -0.496      0.620      -1.247       0.744
    Without_Age                     -0.2149      0.292     -0.737      0.461      -0.787       0.357
    Age_bin_custom_label             0.0610      0.209      0.291      0.771      -0.349       0.471
    genre                           -3.0048      0.526     -5.715      0.000      -4.035      -1.974
    Cabin_Letter_B                  -0.2343      0.786     -0.298      0.766      -1.774       1.306
    Cabin_Letter_C                  -0.3873      0.727     -0.533      0.594      -1.812       1.037
    Cabin_Letter_D                   0.6481      0.738      0.878      0.380      -0.799       2.095
    Cabin_Letter_E                   0.6878      0.752      0.914      0.361      -0.787       2.162
    Cabin_Letter_F                   1.1108      0.756      1.468      0.142      -0.372       2.593
    Cabin_Letter_G                   0.8240      0.896      0.919      0.358      -0.933       2.581
    Personal_Titles_Kid              3.7469      0.723      5.179      0.000       2.329       5.165
    Personal_Titles_Miss             0.0229      0.551      0.042      0.967      -1.056       1.102
    Personal_Titles_Royalty         -1.6139      2.035     -0.793      0.428      -5.602       2.374
    Personal_Titles_Technical        0.0623      0.663      0.094      0.925      -1.236       1.361
    Embarked_C                      -3.1700     15.027     -0.211      0.833     -32.623      26.283
    Embarked_Q                      -3.3883     15.032     -0.225      0.822     -32.851      26.074
    Embarked_S                      -3.6679     15.026     -0.244      0.807     -33.119      25.783
    distinction_in_tikect_Low       -1.9666      0.786     -2.501      0.012      -3.508      -0.426
    distinction_in_tikect_Others    -1.1537      0.546     -2.114      0.035      -2.223      -0.084
    distinction_in_tikect_PC        -1.8562      0.726     -2.558      0.011      -3.278      -0.434
    ================================================================================================
    

As expect, P-values of dummies is high. Like before, I excluded one by one of the features with the highest P-value and run again until get only P-values below to 0.1, but here I use a backward elimination process.

In [65]:
pv_cols = cols.values

    def backwardElimination(x, Y, sl, columns):
        numVars = x.shape[1]
        for i in range(0, numVars):
            regressor = sm.Logit(Y, x).fit(method='bfgs', maxiter=2000, disp=False)
            maxVar = max(regressor.pvalues) #.astype(float)
            if maxVar > sl:
                for j in range(0, numVars - i):
                    if (regressor.pvalues[j].astype(float) == maxVar):
                        columns = np.delete(columns, j)
                        x = x.loc[:, columns]
                        
        print(regressor.summary())
        print('\nSelect {:d} features from {:d} by best p-values.'.format(len(columns), len(pv_cols)))
        print('The max p-value from the features selecte is {:.3f}.'.format(maxVar))
        
        # odds ratios and 95% CI
        conf = np.exp(regressor.conf_int())
        conf['Odds Ratios'] = np.exp(regressor.params)
        conf.columns = ['2.5%', '97.5%', 'Odds Ratios']
        display(conf)
        
        return columns, regressor

    SL = 0.1
    df2 = scale.fit_transform(data.loc[data.Survived>=0, pv_cols])
    df2 = pd.DataFrame(df2, columns = pv_cols)

    pv_cols, Logit = backwardElimination(df2, y_train, SL, pv_cols)
    

                           Logit Regression Results                           
    ==============================================================================
    Dep. Variable:               Survived   No. Observations:                  891
    Model:                          Logit   Df Residuals:                      876
    Method:                           MLE   Df Model:                           14
    Date:                Sat, 22 Sep 2018   Pseudo R-squ.:                  0.4146
    Time:                        18:38:51   Log-Likelihood:                -347.36
    converged:                       True   LL-Null:                       -593.33
                                            LLR p-value:                 4.732e-96
    ================================================================================================
                                       coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------------------------
    non_relatives                    9.8711      1.312      7.525      0.000       7.300      12.442
    Pclass                          -2.0428      0.354     -5.777      0.000      -2.736      -1.350
    Age                             -2.2167      0.782     -2.834      0.005      -3.750      -0.683
    qtd_same_ticket_bin             -3.7108      0.575     -6.451      0.000      -4.838      -2.583
    passenger_fare                   3.0945      1.683      1.839      0.066      -0.204       6.393
    distinction_in_name              1.1604      0.307      3.783      0.000       0.559       1.762
    sons                             1.0338      0.443      2.332      0.020       0.165       1.903
    parents                          1.1074      0.457      2.425      0.015       0.212       2.002
    relatives                        0.5993      0.305      1.963      0.050       0.001       1.198
    genre                           -2.9115      0.252    -11.566      0.000      -3.405      -2.418
    Personal_Titles_Kid              3.4269      0.576      5.948      0.000       2.298       4.556
    Embarked_S                      -0.3645      0.221     -1.649      0.099      -0.798       0.069
    distinction_in_tikect_Low       -2.1724      0.739     -2.941      0.003      -3.620      -0.724
    distinction_in_tikect_Others    -1.3182      0.475     -2.777      0.005      -2.249      -0.388
    distinction_in_tikect_PC        -2.0141      0.654     -3.078      0.002      -3.297      -0.732
    ================================================================================================

    Select 15 features from 51 by best p-values.
    The max p-value from the features selecte is 0.099.
    

2.5% 97.5% Odds Ratios
non_relatives 1480.171472 253274.069118 19362.051846
Pclass 0.064837 0.259310 0.129665
Age 0.023516 0.504884 0.108964
qtd_same_ticket_bin 0.007920 0.075521 0.024457
passenger_fare 0.815520 597.572483 22.075600
distinction_in_name 1.749180 5.821927 3.191175
sons 1.179098 6.704660 2.811664
parents 1.236503 7.407207 3.026389
relatives 1.000811 3.312990 1.820900
genre 0.033212 0.089092 0.054396
Personal_Titles_Kid 9.951006 95.205077 30.779641
Embarked_S 0.450403 1.071116 0.694574
distinction_in_tikect_Low 0.026771 0.484601 0.113901
distinction_in_tikect_Others 0.105529 0.678613 0.267607
distinction_in_tikect_PC 0.037006 0.481179 0.133440

From the results, we can highlight:

  • we’re very confident about some relationship between the probability of being survived:

    • there is an inverse relationship with class, Age and genre of the passenger.
    • there is an positive relationship, from greater to low, with non relatives, Kids, passenger fare and distinction in name of the passenger.
  • From the coefficient:

    • Non relatives, we confirm that fellows or companions of families had better chances to survived.
    • As we saw in the EDA, passengers who embarked on S had a higher fatality rate.
    • The model is only able to translate a single straight line by variable, obviously pondered by the others. Then he points out that the greater the number of passengers with the same ticket, the greater the likelihood of death. As seen through the ticket EDA, this is partly true, however, this coefficient does not adequately capture the fact that the chances of survival increase from 1 to 3 passengers with the same ticket. As you can see above, the model balances this through the variables of family relationships. However, this is a good example of why you should not ignore the EDA phase, and above all do not rely on a conclusion based on a single fact or point of view. Remember, models are not full truths, and possess precision, confident and accuracy!

Take the exponential of each of the coefficients to generate the odds ratios. This tells you how a 1 unit increase or decrease in a variable affects the odds of being survived. For example, we can expect the odds of being survived to decrease by about 69.5% if the passenger embarked on S. Go back to the embarked EDA and see that this hits the stacked bar chart.

You must be wondering, after all, this model has how much of accuracy?

Although we did not do cross validation or even split, let’s take a look, redeem the probabilities generated by it and take advantage to see how we can plot the results of models that return probabilities, so we can refine our perception much more than simply evaluate p-value, coefficients, accuracy, etc.

In [66]:
pred = Logit.predict(df2[pv_cols])
    train = data[data.Survived>=0]
    train['proba'] = pred
    train['Survived'] = y_train
    y_pred = pred.apply(lambda x: 1 if x > 0.5 else 0)
    print('Accurancy: {0:2.2%}'.format(accuracy_score(y_true=y_train, y_pred=y_pred)))

    def plot_proba(continous, predict, discret, data):
        grouped = pd.pivot_table(data, values=[predict], index=[continous, discret], aggfunc=np.mean)
        colors = 'rbgyrbgy'
        for col in data[discret].unique():
            plt_data = grouped.ix[grouped.index.get_level_values(1)==col]
            plt.plot(plt_data.index.get_level_values(0), plt_data[predict], color=colors[int(col)])
        plt.xlabel(continous)
        plt.ylabel("Probabilities")
        plt.legend(np.sort(data[discret].unique()), loc='upper left', title=discret)
        plt.title("Probabilities with " + continous + " and " + discret)

    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(231)
    plot_proba('non_relatives', 'Survived', 'Pclass', train)
    ax = fig.add_subplot(232)
    plot_proba('non_relatives', 'Survived', 'genre', train)
    ax = fig.add_subplot(233)
    plot_proba('non_relatives', 'Survived', 'qtd_same_ticket_bin', train)
    ax = fig.add_subplot(234)
    plot_proba('qtd_same_ticket_bin', 'Survived', 'distinction_in_name', train)
    ax = fig.add_subplot(235)
    plot_proba('qtd_same_ticket_bin', 'Survived', 'Embarked_S', train)
    ax = fig.add_subplot(235)
    plot_proba('qtd_same_ticket_bin', 'Survived', 'Embarked_S', train)
    ax = fig.add_subplot(236)
    plot_proba('qtd_same_ticket_bin', 'Survived', 'parents', train)
    plt.show()
    

Accurancy: 83.73%
    

Backward Elimination By Accuracy – A Sequential Backward Selection

Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The motivation behind feature selection algorithms is to automatically select a subset of features that are most relevant to the problem to improve computational efficiency or reduce the generalization error of the model by removing irrelevant features or noise, which can be useful for algorithms that don’t
support regularization
.

Greedy algorithms make locally optimal choices at each stage of a combinatorial search problem and generally yield a suboptimal solution to the problem in contrast to exhaustive search algorithms, which evaluate all possible combinations and are guaranteed to find the optimal solution. However, in practice, an exhaustive search is often computationally not feasible, whereas greedy algorithms allow for a less complex, computationally more efficient solution.

SBS aims to reduce the dimensionality of the initial feature subspace with a minimum decay in performance of the classifier to improve upon computational efficiency. In certain cases, SBS can even improve the predictive power of the model if a model suffers from overfitting.

SBS sequentially removes features from the full feature subset until the new feature subspace contains the desired number of features. In order to determine which feature is to be removed at each stage, we need to define criterion function J that we want to minimize. The criterion calculated by the criterion function can simply be the difference in performance of the classifier after and before the removal of a particular feature. Then the feature to be removed at each stage can simply be defined as the feature that maximizes this criterion.

From interactive executions, I already know that I need remove the surnames with less than 0.06 of correlation.

So, let’s see a example of SBS in our data,

In [67]:
class SBS():
        def __init__(self, estimator, k_features, scoring=accuracy_score, test_size=0.25, random_state=101):
            self.scoring = scoring
            self.estimator = clone(estimator)
            self.k_features = k_features
            self.test_size = test_size
            self.random_state = random_state

        def fit(self, X, y):
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.test_size, random_state=self.random_state)
            dim = X_train.shape[1]
            self.indices_ = list(range(dim))
            self.subsets_ = [self.indices_]
            score = self._calc_score(X_train, y_train, X_test, y_test, self.indices_)
            self.scores_ = [score]
            
            while dim > self.k_features:
                scores = []
                subsets = []
                for p in combinations(self.indices_, r=dim-1):
                    score = self._calc_score(X_train, y_train, X_test, y_test, list(p))
                    scores.append(score)
                    subsets.append(list(p))
                    
                best = np.argmax(scores)
                self.indices_ = subsets[best]
                self.subsets_.append(self.indices_)
                dim -= 1
                self.scores_.append(scores[best])
                
            self.k_score_ = self.scores_[-1]
            return self

        def transform(self, X):
            return X.iloc[:, self.indices_]
        
        def _calc_score(self, X_train, y_train, X_test, y_test, indices):
            self.estimator.fit(X_train.iloc[:, indices], y_train)
            y_pred = self.estimator.predict(X_test.iloc[:, indices])
            score = self.scoring(y_test, y_pred)
            return score
        
    knn = KNeighborsClassifier(n_neighbors=3)
    sbs = SBS(knn, k_features=1)
    df2 = df.drop(['surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
                    'surname_Moor', 'surname_Chambers', 'surname_Hamalainen',
                    'surname_Dick', 'surname_Taylor', 'surname_Doling', 'surname_Gordon',
                    'surname_Beane', 'surname_Hippach', 'surname_Bishop',
                    'surname_Mellinger', 'surname_Yarred'], axis = 1)

    sbs.fit(df2, y_train)

    print('Best Score:',max(sbs.scores_))

    k_feat = [len(k) for k in sbs.subsets_]
    fig = plt.figure(figsize=(10,5))
    plt.plot(k_feat, sbs.scores_, marker='o')
    #plt.ylim([0.7, max(sbs.scores_)+0.01])
    plt.xlim([1, len(sbs.subsets_)])
    plt.xticks(np.arange(1, len(sbs.subsets_)+1))
    plt.ylabel('Accuracy')
    plt.xlabel('Number of features')
    plt.grid(b=1)
    plt.show()

    print('First best accuracy with:\n',list(df.columns[sbs.subsets_[np.argmax(sbs.scores_)]]))

    SBS = list(df.columns[list(sbs.subsets_[max(np.arange(0, len(sbs.scores_))[(sbs.scores_==max(sbs.scores_))])])])

    print('\nBest accuracy with {0:2d} features:\n{1:}'.format(len(SBS), SBS))
    

Best Score: 0.8340807174887892
    

First best accuracy with:
     ['non_relatives', 'surname_Goldenberg', 'surname_Chambers', 'surname_Hamalainen', 'surname_Dick', 'surname_Beane', 'surname_Mellinger', 'surname_Yarred', 'Age', 'SibSp', 'qtd_same_ticket_bin', 'Cabin_Number', 'relatives', 'Age_bin_custom_label']

    Best accuracy with 14 features:
    ['non_relatives', 'surname_Goldenberg', 'surname_Chambers', 'surname_Hamalainen', 'surname_Dick', 'surname_Beane', 'surname_Mellinger', 'surname_Yarred', 'Age', 'SibSp', 'qtd_same_ticket_bin', 'Cabin_Number', 'relatives', 'Age_bin_custom_label']
    

Select Features by Recursive Feature Elimination

The goal of Recursive Feature Elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.

RFE is based on the idea to repeatedly construct a model and choose either the best or worst performing feature, setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted.

Other option is sequential Feature Selector (SFS) from mlxtend, a separate Python library that is designed to work well with scikit-learn, also provides a S that works a bit differently.

RFE is computationally less complex using the feature’s weight coefficients (e.g., linear models) or feature importances (tree-based algorithms) to eliminate features recursively, whereas SFSs eliminate (or add) features based on a user-defined classifier/regression performance metric.

In [68]:
from sklearn.feature_selection import RFE

    lr = LogisticRegression()
    rfe = RFE(estimator=lr,  step=1)
    rfe.fit(df, y_train)

    FRFE = cols[rfe.ranking_==1]
    print('\nFeatures selected:\n',FRFE)
    print('\n Total Features selected:',len(FRFE))
    

    Features selected:
     Index(['non_relatives', 'surname_Baclini', 'surname_Richards',
           'surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
           'surname_Moor', 'surname_Chambers', 'surname_Dick', 'surname_Taylor',
           'surname_Beane', 'surname_Bishop', 'surname_Yarred', 'Pclass',
           'qtd_same_ticket_bin', 'distinction_in_name', 'sons', 'parents',
           'relatives', 'genre', 'Personal_Titles_Kid', 'Embarked_S',
           'distinction_in_tikect_Low', 'distinction_in_tikect_Others',
           'distinction_in_tikect_PC'],
          dtype='object')

     Total Features selected: 25
    

Select Features by Embedded Methods

In addition to the return of the performance itself, some models has in their internal process some step to features select that best fit their proposal, and returns the features importance too. Thus, they provide two straightforward methods for feature selection and combine the qualities’ of filter and wrapper methods.

Some of the most popular examples of these methods are LASSO, RIDGE, SVM, Regularized trees, Memetic algorithm, and Random multinomial logit.

In the case of Random Forest, some other models base on trees, we have two basic approaches implemented in the packages:

  1. Gini/Entropy Importance or Mean Decrease in Impurity (MDI)
  2. Permutation Importance or Mean Decrease in Accuracy
  3. Permutation with Shadow Features
  4. Gradient Boosting

Others models has concerns om multicollinearity problem and adding additional constraints or penalty to regularize. When there are multiple correlated features, as is the case with very many real life datasets, the model becomes unstable, meaning that small changes in the data can cause large changes in the model, making model interpretation very difficult on the regularization terms.

This applies to regression models like LASSO and RIDGE. In classifier cases, you can use SGDClassifier where you can set the loss parameter to ‘log’ for Logistic Regression or ‘hinge’ for SVM. In SGDClassifier you can set the penalty to either of ‘l1’, ‘l2’ or ‘elasticnet’ which is a combination of both.

Let’s start with more details and examples:

Feature Selection by Mean Decrease Impurity

There are a two things to keep in mind when using the impurity based ranking:

  • Feature selection based on impurity reduction is biased towards preferring variables with more categories.
  • It can lead to the incorrect conclusion that one of the features is a strong predictor while the others in the same group are unimportant, while actually they are very close in terms of their relationship with the independent variable.

The second one refers to when the dataset has two or more correlated features, then from the point of view of the model, any of these correlated features can be used as the predictor, with no concrete preference of one over the others. But once one of them is used, the importance of others is significantly reduced since effectively the impurity they can remove is already removed by the first feature. As a consequence, they will have a lower reported importance. This is not an issue when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. But when interpreting the data. The effect of this phenomenon is somewhat reduced thanks to random selection of features at each node creation, but in general the effect is not removed completely.

The Random Forests is one of them, the reason is because the tree-based strategies used by random forests naturally ranks by how well they improve the purity of the node. This mean decrease in impurity over all trees, wheres for classification, it is typically either Gini impurity or information gain/entropy and for *regression it is variance. Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure.

Random forests are a popular method for feature ranking, since they are so easy to apply: in general they require very little feature engineering and parameter tuning and mean decrease impurity is exposed in most random forest libraries. But they come with their own gotchas, especially when data interpretation is concerned. With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories. As long as the gotchas are kept in mind, there really is no reason not to try them out on your data.

Them, we run a quick Random Forest to select the features most importants:

In [69]:
rfc = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=101)
    rfc.fit(df, y_train)

    feature_importances = [(feature, score) for feature, score in zip(cols, rfc.feature_importances_)]

    MDI = cols[rfc.feature_importances_>0.010]
    print('Total features slected by Random Forest:',len(MDI))

    g = pd.DataFrame(sorted(feature_importances, key=lambda x: -x[1])[:len(MDI)], columns=['Feature','Importance']).\
    plot(x='Feature', kind='barh', figsize=(20,7), fontsize=18, grid=True)
    plt.show()
    

Total features slected by Random Forest: 15
    

Feature Selection by Mean Decrease Accuracy

The general idea is to permute the values of each feature and measure how much the permutation decreases the accuracy of the model. Clearly, for unimportant feature, the permutation should have little to no effect on model accuracy, while permuting important feature should significantly decrease it.

This method is not directly exposed in sklearn, but it is straightforward to implement it. Start record a baseline accuracy (classifier) or R2 score (regressor) by passing a validation set or the out-of-bag (OOB) samples through the random forest. Permute the column values of a single predictor feature and then pass all test samples back through the random forest and recompute the accuracy or R2. The importance of that feature is the difference between the baseline and the drop in overall accuracy or R2 caused by permuting the column. The permutation mechanism is much more computationally expensive than the mean decrease in impurity mechanism, but the results are more reliable.

The rfpimp package in the src dir provided it For Random Forest, let’s see:

In [70]:
X_train, X_test, y, y_test = train_test_split(df, y_train , test_size=0.20,  random_state=101)

    # Add column of random numbers
    X_train['random'] = np.random.random(size=len(X_train))
    X_test['random'] = np.random.random(size=len(X_test))

    rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, n_jobs=-1, oob_score=True, random_state=101)
    rf.fit(X_train, y)

    imp = importances(rf, X_test, y_test, n_samples=-1) # permutation
    MDA = imp[imp!=0].dropna().index
    if 'random' in MDA:
       MDA =  MDA.drop('random')
    print('%d features are selected.' % len(MDA))
    plot_importances(imp[imp!=0].dropna(), figsize=(20,7))
    

16 features are selected.
    

Feature Selection by Permutation with Shadow Features

Boruta randomly permutes variables like Permutation Importance does, but performs on all variables at the same time and concatenates the shuffled features with the original ones. The concatenated result is used to fit the model.

Daniel Homola, who also wrote the Python version of Boruta, BorutaPy, gave an wonderful overview of the Boruta algorithm in his blog post
“The shuffled features (a.k.a. shadow features) are basically noises with identical marginal distribution w.r.t the original feature. We count the times a variable performs better than the ‘best’ noise and calculate the confidence towards it being better than noise (the p-value) or not. Features which are confidently better are marked ‘confirmed’, and those which are confidently on par with noises are marked ‘rejected’. Then we remove those marked features and repeat the process until all features are marked or a certain number of iteration is reached.”

Although Boruta is a feature selection algorithm, we can use the order of confirmation/rejection as a way to rank the importance of features.

In [71]:
# NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
    X = df.values
    y = y_train.values.ravel()

    # define random forest classifier, with utilising all cores and
    # sampling in proportion to y labels
    #rf = RandomForestClassifier(n_estimators=10, min_samples_leaf=5, n_jobs=-1, oob_score=True, random_state=101)
    rf = ExtraTreesClassifier(n_estimators=100, max_depth=4, n_jobs=-1, oob_score=True, bootstrap=True, random_state=101)

    # define Boruta feature selection method
    feat_selector = BorutaPy(rf, n_estimators='auto', verbose=0, random_state=101)

    # find all relevant features - 5 features should be selected
    feat_selector.fit(X, y)

    shadow = cols[feat_selector.support_]
    # check selected features - first 5 features are selected
    print('Features selected:',shadow)

    # call transform() on X to filter it down to selected features
    print('Data transformaded has %d features' % feat_selector.n_features_) #feat_selector.transform(X).shape[1])
    print('Check the selector ranking:')
    display(pd.concat([pd.DataFrame(cols, columns=['Columns']), 
               pd.DataFrame(feat_selector.ranking_, columns=['Rank'])], axis=1).sort_values(by=['Rank']))
    

Features selected: Index(['surname_Alone', 'Pclass', 'Age', 'SibSp', 'Parch',
           'qtd_same_ticket_bin', 'passenger_fare', 'distinction_in_name', 'sons',
           'relatives', 'Age_bin_custom_label', 'genre', 'Cabin_Letter_B',
           'Cabin_Letter_C', 'Cabin_Letter_E', 'Cabin_Letter_G',
           'Personal_Titles_Kid', 'Personal_Titles_Miss', 'Embarked_C',
           'Embarked_S', 'distinction_in_tikect_Low', 'distinction_in_tikect_PC'],
          dtype='object')
    Data transformaded has 22 features
    Check the selector ranking:
    

Columns Rank
25 passenger_fare 1
22 SibSp 1
23 Parch 1
24 qtd_same_ticket_bin 1
26 distinction_in_name 1
28 sons 1
30 relatives 1
33 Age_bin_custom_label 1
34 genre 1
35 Cabin_Letter_B 1
36 Cabin_Letter_C 1
38 Cabin_Letter_E 1
40 Cabin_Letter_G 1
41 Personal_Titles_Kid 1
42 Personal_Titles_Miss 1
45 Embarked_C 1
47 Embarked_S 1
48 distinction_in_tikect_Low 1
21 Age 1
20 Pclass 1
50 distinction_in_tikect_PC 1
1 surname_Alone 1
0 non_relatives 2
32 Without_Age 2
37 Cabin_Letter_D 2
27 Cabin_Number 3
29 parents 3
31 companions 3
39 Cabin_Letter_F 5
49 distinction_in_tikect_Others 6
46 Embarked_Q 7
2 surname_Baclini 8
4 surname_Richards 9
3 surname_Carter 10
9 surname_Chambers 11
44 Personal_Titles_Technical 12
13 surname_Doling 13
19 surname_Yarred 14
10 surname_Hamalainen 15
6 surname_Beckwith 16
11 surname_Dick 17
15 surname_Beane 18
12 surname_Taylor 19
8 surname_Moor 20
18 surname_Mellinger 20
7 surname_Goldenberg 22
16 surname_Hippach 22
17 surname_Bishop 24
14 surname_Gordon 25
43 Personal_Titles_Royalty 26
5 surname_Harper 27

Feature Selection by Gradient Boosting

The LightGBM model the importance is calculated from, if ‘split’, result contains numbers of times the feature is used in a model, if ‘gain’, result contains total gains of splits which use the feature.

On the XGBoost model the importance is calculated by:

  • ‘weight’: the number of times a feature is used to split the data across all trees.
  • ‘gain’: the average gain across all splits the feature is used in.
  • ‘cover’: the average coverage across all splits the feature is used in.
  • ‘total_gain’: the total gain across all splits the feature is used in.
  • ‘total_cover’: the total coverage across all splits the feature is used in.

First measure is split-based and is very similar with the one given by for Gini Importance. But it doesn’t take the number of samples into account.

The second measure is gain-based. It’s basically the same as the Gini Importance implemented in R packages and in scikit-learn with Gini impurity replaced by the objective used by the gradient boosting model.

The cover, implemented exclusively in XGBoost, is counting the number of samples affected by the splits based on a feature.

get_score(fmap=”, importance_type=’weight’)
Get feature importance of each feature. Importance type can be defined as:

The default measure of both XGBoost and LightGBM is the split-based one. I think this measure will be problematic if there are one or two feature with strong signals and a few features with weak signals. The model will exploit the strong features in the first few trees and use the rest of the features to improve on the residuals. The strong features will look not as important as they actually are. While setting lower learning rate and early stopping should alleviate the problem, also checking gain-based measure may be a good idea.

Note that these measures are purely calculated using training data, so there’s a chance that a split creates no improvement on the objective in the holdout set. This problem is more severe than in the random forest since gradient boosting models are more prone to over-fitting.

Feature importance scores can be used for feature selection in scikit-learn.

This is done using the SelectFromModel class that takes a model and can transform a dataset into a subset with selected features.

This class can take a previous trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. This threshold is used when you call the transform() method on the SelectFromModel instance to consistently select the same features on the training dataset and the test dataset.

In [72]:
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

    # split data into train and test sets
    X_train, X_test, y, y_test = train_test_split(df, y_train, test_size=0.30, random_state=101)

    # fit model on all training data
    model = XGBClassifier(importance_type='gain', scale_pos_weight=((len(y)-y.sum())/y.sum()))
    model.fit(X_train, y)
    fig=plt.figure(figsize=(20,5))
    ax = fig.add_subplot(121)
    g = plot_importance(model, height=0.5, ax=ax)

    # Using each unique importance as a threshold
    thresholds = np.sort(np.unique(model.feature_importances_)) #np.sort(model.feature_importances_[model.feature_importances_>0])
    best = 0
    colsbest = 31
    my_model = model
    threshold = 0

    for thresh in thresholds:
        # select features using threshold
        selection = SelectFromModel(model, threshold=thresh, prefit=True)
        select_X_train = selection.transform(X_train)
        # train model
        selection_model = XGBClassifier(importance_type='gain', scale_pos_weight=((len(y)-y.sum())/y.sum()))
        selection_model.fit(select_X_train, y)
        # eval model
        select_X_test = selection.transform(X_test)
        y_pred = selection_model.predict(select_X_test)
        predictions = [round(value) for value in y_pred]
        accuracy = accuracy_score(y_test, predictions)
        print("Thresh={:1.3f}, n={:d}, Accuracy: {:2.2%}".format(thresh, select_X_train.shape[1], accuracy))
        if (best <= accuracy):
            best = accuracy
            colsbest = select_X_train.shape[1]
            my_model = selection_model
            threshold = thresh
            
    ax = fig.add_subplot(122)
    g = plot_importance(my_model,height=0.5, ax=ax, 
                        title='The best accuracy: {:2.2%} with {:d} features'.\
                        format(best, colsbest))

    feature_importances = [(score, feature) for score, feature in zip(model.feature_importances_, cols)]
    XGBest = pd.DataFrame(sorted(sorted(feature_importances, reverse=True)[:colsbest]), columns=['Score', 'Feature'])
    g = XGBest.plot(x='Feature', kind='barh', figsize=(20,7), fontsize=14, grid= True,
         title='Original feature importance from selected features')
    plt.tight_layout(); plt.show()
    XGBestCols = XGBest.iloc[:, 1].tolist()
    

Thresh=0.000, n=51, Accuracy: 84.33%
    Thresh=0.002, n=26, Accuracy: 84.33%
    Thresh=0.003, n=24, Accuracy: 84.70%
    Thresh=0.005, n=23, Accuracy: 84.70%
    Thresh=0.007, n=21, Accuracy: 84.70%
    Thresh=0.008, n=19, Accuracy: 84.33%
    Thresh=0.010, n=17, Accuracy: 84.33%
    Thresh=0.017, n=15, Accuracy: 84.33%
    Thresh=0.018, n=13, Accuracy: 83.58%
    Thresh=0.020, n=12, Accuracy: 83.96%
    Thresh=0.025, n=11, Accuracy: 84.33%
    Thresh=0.029, n=10, Accuracy: 82.09%
    Thresh=0.042, n=9, Accuracy: 82.09%
    Thresh=0.045, n=8, Accuracy: 83.21%
    Thresh=0.052, n=7, Accuracy: 83.21%
    Thresh=0.059, n=6, Accuracy: 82.84%
    Thresh=0.062, n=5, Accuracy: 82.84%
    Thresh=0.065, n=4, Accuracy: 70.52%
    Thresh=0.067, n=3, Accuracy: 69.40%
    Thresh=0.186, n=2, Accuracy: 69.03%
    Thresh=0.228, n=1, Accuracy: 63.06%
    

Feature Selection by Regularized Models

Regularization is a method for adding additional constraints or penalty to a model, with the goal of preventing overfitting and improving generalization. Instead of minimizing a loss function E(X,Y), the loss function to minimize becomes E(X,Y)+α∥w∥, where w is the vector of model coefficients, ∥⋅∥ is typically L1 or L2 norm and α is a tunable free parameter, specifying the amount of regularization (so α=0 implies an unregularized model). The two widely used regularization methods are L1 and L2 regularization, also called lasso and ridge.

Regularized models are a powerful set of tool for feature interpretation and selection. Lasso produces sparse solutions and as such is very useful selecting a strong subset of features for improving model performance. Ridge on the other hand can be used for data interpretation due to its stability and the fact that useful features tend to have non-zero coefficients. Since the relationship between the response variable and features in often non-linear, basis expansion can be used to convert features into a more suitable space, while keeping the simple linear models fully applicable.

Let’s see the SGDClassifier that have this concept implemented for classifications cases, and to address the skewness of our dataset in terms of labels we use the class_weight parameter of SGDCassifier and set it to “balanced”:

In [73]:
X_train, X_test, y, y_test = train_test_split(df, y_train , test_size=0.20,  random_state=101)

    # Add column of random numbers
    X_train['random'] = np.random.random(size=len(X_train))
    X_test['random'] = np.random.random(size=len(X_test))

    svm = SGDClassifier(penalty='elasticnet', class_weight='balanced', n_jobs = - 1, random_state=101)
    svm.fit(X_train, y)

    imp = importances(svm, X_test, y_test, n_samples=-1) # permutation
    RM = imp[imp!=0].dropna().index
    if 'random' in RM:
        RM = RM.drop('random')
        
    print('%d features are selected.' % len(RM))
    plot_importances(imp[imp!=0].dropna(), figsize=(20,7))
    

21 features are selected.
    

Combine Features Selection Methods

As each machine learning model benefits from one or another set of features selected, depending on its own method, and our dataset does in fact present few features, since we first removed the collinear and multilinear with the highest correlation and the highest degree of FIV that represents risk for our model, now we can union all selected features in a unique set, and check what features are elected exclusively by a unique method.

In [74]:
bcols = set(pv_cols).union(set(FRFE)).union(set(MDI)).union(set(MDA)).union(set(MBR)).union(set(SMcols)).union(set(RM)).\
            union(set(XGBestCols)).union(set(SBS))
    print("Extra features select by RFE:", set(FRFE).difference(set(pv_cols).union(set(MDI)).union(set(MDA)).union(set(MBR)).union(set(RM)).\
                                                                union(set(SMcols)).union(set(XGBestCols)).union(set(SBS))), '\n')
    print("Extra features select by pv_cols:", set(pv_cols).difference(set(FRFE).union(set(MDI)).union(set(MDA)).union(set(MBR)).union(set(SMcols)).\
                                                  union(set(RM)).union(set(XGBestCols)).union(set(SBS))), '\n')
    print("Extra features select by Statistical Methods:", set(SMcols).difference(set(pv_cols).union(set(FRFE)).union(set(MDI)).\
                                                             union(set(MDA)).union(set(MBR)).union(set(RM)).\
                                                            union(set(XGBestCols)).union(set(SBS))), '\n')
    print("Extra features select by MDI:", set(MDI).difference(set(pv_cols).union(set(FRFE)).union(set(MDA)).union(set(MBR)).\
                                              union(set(SMcols)).union(set(RM)).union(set(XGBestCols)).union(set(SBS))), '\n')
    print("Extra features select by MDA:", set(MDA).difference(set(pv_cols).union(set(FRFE)).union(set(MDI)).union(set(MBR)).\
                                              union(set(SMcols)).union(set(RM)).union(set(XGBestCols)).union(set(SBS))), '\n')
    print("Extra features select by MBR:", set(MBR).difference(set(pv_cols).union(set(FRFE)).union(set(MDI)).union(set(MDA)).\
                                              union(set(SMcols)).union(set(RM)).union(set(XGBestCols)).union(set(SBS))), '\n')
    print("Extra features select by RM:", set(RM).difference(set(pv_cols).union(set(FRFE)).union(set(MDI)).union(set(MDA)).\
                                              union(set(SMcols)).union(set(MBR)).union(set(XGBestCols)).union(set(SBS))), '\n')
    print("Extra features select by XGBestCols:", set(XGBestCols).difference(set(pv_cols).union(set(FRFE)).union(set(MDI)).union(set(MDA)).\
                                              union(set(SMcols)).union(set(MBR)).union(set(RM)).union(set(SBS))), '\n')
    print("Extra features select by SBS:", set(SBS).difference(set(pv_cols).union(set(FRFE)).union(set(MDI)).union(set(MDA)).\
                                              union(set(SMcols)).union(set(MBR)).union(set(RM)).union(set(XGBestCols))), '\n')
    print("Intersection features:",set(MDI).intersection(set(SMcols)).intersection(set(FRFE)).intersection(set(pv_cols)).\
                                      intersection(set(RM)).intersection(set(MDA)).intersection(set(MBR)).\
                                      intersection(set(XGBestCols)).intersection(set(SBS)), '\n')
    print('Total number of features selected:', len(bcols))
    print(bcols)
    print('\n{0:2d} features removed if use the union of selections:\n{1:}'.format(len(cols.difference(bcols)), cols.difference(bcols)))
    

Extra features select by RFE: {'surname_Baclini', 'surname_Bishop', 'surname_Taylor', 'surname_Harper', 'surname_Richards', 'surname_Moor', 'surname_Beckwith'} 

    Extra features select by pv_cols: set() 

    Extra features select by Statistical Methods: set() 

    Extra features select by MDI: set() 

    Extra features select by MDA: {'Cabin_Letter_G'} 

    Extra features select by MBR: set() 

    Extra features select by RM: {'Cabin_Letter_D', 'Cabin_Letter_F'} 

    Extra features select by XGBestCols: {'surname_Alone'} 

    Extra features select by SBS: {'surname_Mellinger', 'surname_Hamalainen'} 

    Intersection features: set() 

    Total number of features selected: 44
    {'Without_Age', 'distinction_in_tikect_Others', 'surname_Bishop', 'surname_Dick', 'surname_Harper', 'SibSp', 'sons', 'distinction_in_tikect_PC', 'passenger_fare', 'non_relatives', 'surname_Baclini', 'Cabin_Letter_E', 'Age', 'relatives', 'distinction_in_tikect_Low', 'surname_Beckwith', 'surname_Goldenberg', 'surname_Alone', 'Cabin_Letter_B', 'surname_Beane', 'surname_Hamalainen', 'Pclass', 'surname_Richards', 'Cabin_Letter_C', 'Personal_Titles_Miss', 'Cabin_Letter_F', 'surname_Yarred', 'Age_bin_custom_label', 'Embarked_C', 'Parch', 'Cabin_Letter_D', 'surname_Chambers', 'Cabin_Letter_G', 'surname_Taylor', 'companions', 'parents', 'genre', 'qtd_same_ticket_bin', 'distinction_in_name', 'Embarked_S', 'surname_Moor', 'Cabin_Number', 'surname_Mellinger', 'Personal_Titles_Kid'}

     7 features removed if use the union of selections:
    Index(['Embarked_Q', 'Personal_Titles_Royalty', 'Personal_Titles_Technical',
           'surname_Carter', 'surname_Doling', 'surname_Gordon',
           'surname_Hippach'],
          dtype='object')
    

Chose The Features From The Selection Methods

As you can saw the methods chose some different features that,if we consider the intersection, there is nothing left, and if the union only removes eight features.As expected, we can’t made a unique strategy, the right chose depends on your proposal and the respective model that you will run.

Since, we have few features, by the reason of don’t consider all results of hot encode of surnames, we can try some models with different sets and check the influence in the results, like we did on the methods that selecting based on accuracy of the model. In the other hand, we can make the feature selection as part of the pipeline of the model and select features based on the best for the respective model. This is allows different strategies, including the use of methods for sparse data and thus evaluate the surnames, or methods that apply regularization in data to submit to models that don’t have regularization terms.

For the submission, I already run multiple times and made choices to submit. Here for simplicity, I chose the initial result with only drop collinearity and multicollinearity, and make the selection into the pipelines, but we need first check what we can get from dimension reduction techniques.

Additional Feature Engineering: Feature transformation

Feature transformation (FT) refers to family of algorithms that create new features using the existing features. These new features may not have the same interpretation as the original features, but they may have more discriminatory power in a different space than the original space. This can also be used for feature reduction. FT may happen in many ways, by simple/linear combinations of original features or using non-linear functions. Some common techniques for FT are:

  • Scaling or normalizing features (e.g.: StandardScaler, RobustSacaler and MinMaxScaler)
  • Principle Component Analysis
  • Random Projection
  • Neural Networks
  • SVM also transforms features internally.
  • Transforming categorical features to numerical.

Polynomial Features – Create Degree 3 of some Features

Often it’s useful to add complexity to the model by considering nonlinear features of the input data. A simple and common method to use is polynomial features, which can get features’ high-order and interaction terms.

In [75]:
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
    res = pf.fit_transform(data[['Pclass', 'passenger_fare']])

    display(pd.DataFrame(pf.powers_, columns=['Pclass', 'passenger_fare']))
    del res 

    # We can contact the new res with data, but we need treat the items without interactions and power, 
    # or if is few features it can generate and incorporate to data manually.
    data['Pclass^2'] = data.Pclass**2
    data['Plcass_X_p_fare'] = data.Pclass * data.passenger_fare
    data['p_fare^2'] = data.passenger_fare**2

    cols = cols.insert(33, 'Pclass^2')
    cols = cols.insert(34, 'Plcass_X_p_fare')
    cols = cols.insert(35, 'p_fare^2')

    bcols.add('Pclass^2')
    bcols.add('Plcass_X_p_fare')
    bcols.add('p_fare^2')

    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
    

Pclass passenger_fare
0 1 0
1 0 1
2 2 0
3 1 1
4 0 2

Defining Categorical Data as Category

In order that the models do not make inappropriate use of features transformed into numbers and apply only calculations relevant to categorical, we have to transform their type into category type

In [76]:
data.Pclass = data.Pclass.astype('category')
    data.genre = data.genre.astype('category')
    data.distinction_in_tikect_Low = data.distinction_in_tikect_Low.astype('category')
    data.distinction_in_tikect_PC = data.distinction_in_tikect_PC.astype('category')
    data.distinction_in_tikect_Others = data.distinction_in_tikect_Others.astype('category')
    data.Cabin_Letter_B = data.Cabin_Letter_B.astype('category')
    data.Cabin_Letter_C = data.Cabin_Letter_C.astype('category')
    data.Cabin_Letter_D = data.Cabin_Letter_D.astype('category')
    data.Cabin_Letter_E = data.Cabin_Letter_E.astype('category')
    data.Cabin_Letter_F = data.Cabin_Letter_F.astype('category')
    data.Cabin_Letter_G = data.Cabin_Letter_G.astype('category')
    data.Embarked_C = data.Embarked_C.astype('category')
    data.Embarked_S = data.Embarked_S.astype('category')
    data.Embarked_Q = data.Embarked_Q.astype('category')
    data.Personal_Titles_Royalty = data.Personal_Titles_Royalty.astype('category')
    data.Personal_Titles_Technical = data.Personal_Titles_Technical.astype('category')
    data.Personal_Titles_Kid = data.Personal_Titles_Kid.astype('category')
    data.Personal_Titles_Mrs = data.Personal_Titles_Mrs.astype('category')
    data.Personal_Titles_Mr = data.Personal_Titles_Mr.astype('category')
    data.Personal_Titles_Miss = data.Personal_Titles_Miss.astype('category')
    data.Without_Age = data.Without_Age.astype('category')
    data.distinction_in_name = data.distinction_in_name.astype('category')
    data.parents = data.parents.astype('category')
    data.relatives = data.relatives.astype('category')
    data.sons = data.sons.astype('category')
    data.companions = data.companions.astype('category')
    data.surname_Alone = data.surname_Alone.astype('category')
    data.surname_Baclini = data.surname_Baclini.astype('category')
    data.surname_Carter = data.surname_Carter.astype('category')
    data.surname_Richards = data.surname_Richards.astype('category')
    data.surname_Harper = data.surname_Harper.astype('category')
    data.surname_Beckwith = data.surname_Beckwith.astype('category')
    data.surname_Goldenberg = data.surname_Goldenberg.astype('category')
    data.surname_Moor = data.surname_Moor.astype('category')
    data.surname_Chambers = data.surname_Chambers.astype('category')
    data.surname_Hamalainen = data.surname_Hamalainen.astype('category')
    data.surname_Dick = data.surname_Dick.astype('category')
    data.surname_Taylor = data.surname_Taylor.astype('category')
    data.surname_Doling = data.surname_Doling.astype('category')
    data.surname_Gordon = data.surname_Gordon.astype('category')
    data.surname_Beane = data.surname_Beane.astype('category')
    data.surname_Hippach = data.surname_Hippach.astype('category')
    data.surname_Bishop = data.surname_Bishop.astype('category')
    data.surname_Mellinger = data.surname_Mellinger.astype('category')
    data.surname_Yarred = data.surname_Yarred.astype('category')
    

Box cox transformation of highly skewed features

A Box Cox transformation is a way to transform non-normal dependent variables into a normal shape.

Why does this matter?

  • Model bias and spurious interactions: If you are performing a regression or any statistical modeling, this asymmetrical behavior may lead to a bias in the model. If a factor has a significant effect on the average, because the variability is much larger, many factors will seem to have a stronger effect when the mean is larger. This is not due, however, to a true factor effect but rather to an increased amount of variability that affects all factor effect estimates when the mean gets larger. This will probably generate spurious interactions due to a non-constant variation, resulting in a very complex model with many spurious and unrealistic interactions.
  • Normality is an important assumption for many statistical techniques: such as individuals control charts, Cp/Cpk analysis, t-tests and analysis of variance (ANOVA). A substantial departure from normality will bias your capability estimates.

One solution to this is to transform your data into normality using a Box-Cox transformation means that you are able to run a broader number of tests.

At the core of the Box Cox transformation is an exponent, lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; The ‘optimal value’ is the one which results in the best approximation of a normal distribution curve. The transformation of Y has the form:

The scipy implementation proceeded with this formula, then you need before take care of negatives values if you have. A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a) where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. For the latter choice, you can show that a = b – min(Y), where b is either a small number or is 1.

This test only works for positive data. However, Box and Cox did propose a second formula that can be used for negative y-values, not implemented in scipy:

The formula are deceptively simple. Testing all possible values by hand is unnecessarily labor intensive.

Common Box-Cox Transformations

Lambda value (λ) Transformed data (Y’)
-3 Y**-3 = 1/Y**3
-2 Y**-2 = 1/Y**2
-1 Y**-1 = 1/Y
-0.5 Y**-0.5 = 1/(√(Y))
0 log(Y)(*)
0.5 Y0.5 = √(Y)
1 Y**1 = Y
2 Y**2
3 Y**3

(*)Note: the transformation for zero is log(0), otherwise all data would transform to Y**0 = 1.
The transformation doesn’t always work well, so make sure you check your data after the transformation with a normal probability plot or if the skew are reduced, tending to zero.

In [77]:
numeric_features = list(data.loc[:, cols].dtypes[data.dtypes != "category"].index)

    # non_relative is skwed and have negatives values, so we need adding 6 as a shift parameter.
    data['non_relatives_shift'] = data.non_relatives + 6
    numeric_features.remove('non_relatives')
    numeric_features.append('non_relatives_shift')

    skewed_features = data[numeric_features].apply(lambda x : skew (x.dropna())).sort_values(ascending=False)

    #compute skewness
    skewness = pd.DataFrame({'Skew' :skewed_features})   

    # Get only higest skewed features
    skewness = skewness[abs(skewness) > 0.7]
    skewness = skewness.dropna()
    print ("There are {} higest skewed numerical features to box cox transform".format(skewness.shape[0]))

    l_opt = {}

    #df = pd.DataFrame()    
    for feat in skewness.index:
        #df[feat] = boxcox1p(data[feat], l_opt[feat])
        #data[feat] = boxcox1p(data[feat], l_opt[feat])
        data[feat], l_opt[feat] = boxcox((data[feat]+1))

    #skewed_features2 = df.apply(lambda x : skew (x.dropna())).sort_values(ascending=False)
    skewed_features2 = data[skewness.index].apply(lambda x : skew (x.dropna())).sort_values(ascending=False)

    #compute skewness
    skewness2 = pd.DataFrame({'New Skew' :skewed_features2})   
    display(pd.concat([skewness, skewness2], axis=1).sort_values(by=['Skew'], ascending=False))
    

There are 8 higest skewed numerical features to box cox transform
    

Skew New Skew
p_fare^2 10.198854 -0.042275
Plcass_X_p_fare 4.275970 0.799022
SibSp 3.839814 0.795656
Parch 3.664872 1.255973
passenger_fare 3.041171 -0.028864
non_relatives_shift 2.721651 0.464157
qtd_same_ticket_bin 1.378407 0.390559
Cabin_Number 1.253198 -0.013094

As you can see, we were able at first to bring the numerical values closer to normal. Let’s take a look at the QQ test of these features.

In [78]:
def QQ_plot(data, measure):
        fig = plt.figure(figsize=(12,4))

        #Get the fitted parameters used by the function
        (mu, sigma) = norm.fit(data)

        #Kernel Density plot
        fig1 = fig.add_subplot(121)
        sns.distplot(data, fit=norm)
        fig1.set_title(measure + ' Distribution ( mu = {:.2f} and sigma = {:.2f} )'.format(mu, sigma), loc='center')
        fig1.set_xlabel(measure)
        fig1.set_ylabel('Frequency')

        #QQ plot
        fig2 = fig.add_subplot(122)
        res = probplot(data, plot=fig2)
        fig2.set_title(measure + ' Probability Plot (skewness: {:.6f} and kurtosis: {:.6f} )'.\
                       format(data.skew(), data.kurt()), loc='center')

        plt.tight_layout()
        plt.show()
        
    for feat in skewness.index:
        QQ_plot(data[feat], ('Boxcox1p of {}'.format(feat)))
    

Compressing Data via Dimensionality Reduction

PCA

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. If there are n observations with p variables, then the number of distinct principal components is min(n-1,p). This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.

Let’s see how PCA can reduce the dimensionality of our dataset with minimum of lose information:

In [79]:
pca_all = PCA(random_state=101, whiten=True).fit(df)

    my_color=y_train.astype('category').cat.codes

    # Store results of PCA in a data frame
    result=pd.DataFrame(pca_all.transform(df), columns=['PCA%i' % i for i in range(df.shape[1])], index=df.index)

    # Plot initialisation
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(result['PCA0'], result['PCA1'], result['PCA2'], c=my_color, cmap="Set2_r", s=60)
     
    # make simple, bare axis lines through space:
    xAxisLine = ((min(result['PCA0']), max(result['PCA0'])), (0, 0), (0,0))
    ax.plot(xAxisLine[0], xAxisLine[1], xAxisLine[2], 'r')
    yAxisLine = ((0, 0), (min(result['PCA1']), max(result['PCA1'])), (0,0))
    ax.plot(yAxisLine[0], yAxisLine[1], yAxisLine[2], 'r')
    zAxisLine = ((0, 0), (0,0), (min(result['PCA2']), max(result['PCA2'])))
    ax.plot(zAxisLine[0], zAxisLine[1], zAxisLine[2], 'r')
     
    # label the axes
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
    ax.set_title("PCA on the Titanic data set")
    plt.show()

    X_train , X_test, y, y_test = train_test_split(df , y_train, test_size=0.3, random_state=0)

    lr = LogisticRegression(class_weight='balanced', random_state=101)
    lr = lr.fit(X_train, y)
    print('LR Training Accuracy: {:2.2%}'.format(accuracy_score(y, lr.predict(X_train))))
    y_pred = lr.predict(X_test)
    print('LR Accuracy: {:2.2%}'.format(accuracy_score(y_test, y_pred)))

    print('_' * 40)
    print('\nApply PCA:\n')
    AccPca = pd.DataFrame(columns=['Components', 'Var_ratio', 'Train_Acc', 'Test_Acc'])

    for componets in np.arange(1, df.shape[1]):
        variance_ratio = sum(pca_all.explained_variance_ratio_[:componets])*100
        pca = PCA(n_components=componets, random_state=101, whiten=True)
        X_train_pca = pca.fit_transform(X_train)
        Components = X_train_pca.shape[1]
        lr = LogisticRegression(class_weight='balanced', random_state=101)
        lr = lr.fit(X_train_pca, y)
        Training_Accuracy = accuracy_score(y, lr.predict(X_train_pca))
        X_test_pca = pca.transform(X_test)
        y_pred = lr.predict(X_test_pca)
        Test_Accuracy = accuracy_score(y_test, y_pred)
        AccPca = AccPca.append(pd.DataFrame([(Components, variance_ratio, Training_Accuracy, Test_Accuracy)],
                                            columns=['Components', 'Var_ratio', 'Train_Acc', 'Test_Acc']))#], axis=0)

    AccPca.set_index('Components', inplace=True)
    display(AccPca.sort_values(by='Test_Acc', ascending=False))
    

LR Training Accuracy: 84.27%
    LR Accuracy: 81.72%
    ________________________________________

    Apply PCA:

    

Var_ratio Train_Acc Test_Acc
Components
22 99.999974 0.812199 0.832090
25 99.999985 0.813804 0.824627
23 99.999978 0.815409 0.824627
24 99.999982 0.813804 0.820896
18 99.999953 0.810594 0.820896
31 99.999995 0.839486 0.817164
28 99.999991 0.844302 0.817164
19 99.999959 0.813804 0.817164
17 99.999947 0.799358 0.813433
16 99.999937 0.794543 0.813433
30 99.999994 0.841091 0.813433
29 99.999993 0.841091 0.813433
21 99.999970 0.810594 0.813433
20 99.999965 0.813804 0.813433
53 100.000000 0.861958 0.809701
36 99.999997 0.841091 0.805970
48 99.999999 0.861958 0.805970
52 100.000000 0.861958 0.805970
51 100.000000 0.861958 0.805970
50 100.000000 0.861958 0.805970
49 99.999999 0.861958 0.805970
34 99.999997 0.841091 0.805970
35 99.999997 0.844302 0.805970
47 99.999999 0.861958 0.805970
33 99.999996 0.842697 0.805970
11 99.999852 0.781701 0.805970
37 99.999997 0.842697 0.805970
14 99.999913 0.796148 0.805970
32 99.999996 0.844302 0.805970
46 99.999999 0.861958 0.802239
45 99.999999 0.861958 0.802239
44 99.999999 0.861958 0.802239
27 99.999990 0.823435 0.798507
26 99.999988 0.826645 0.798507
15 99.999926 0.797753 0.798507
13 99.999896 0.794543 0.798507
38 99.999998 0.849117 0.794776
43 99.999998 0.855538 0.794776
41 99.999998 0.847512 0.791045
42 99.999998 0.847512 0.791045
39 99.999998 0.841091 0.787313
40 99.999998 0.837881 0.787313
12 99.999875 0.783307 0.783582
2 99.992725 0.654896 0.746269
4 99.998387 0.683788 0.731343
9 99.999792 0.703050 0.723881
5 99.999328 0.658106 0.720149
10 99.999824 0.720706 0.708955
1 99.978042 0.672552 0.708955
6 99.999525 0.701445 0.705224
3 99.996828 0.677368 0.705224
8 99.999747 0.691814 0.690299
7 99.999643 0.688604 0.690299

In that case, although we can see some separation in planes, you can see that we lose too much information if we consider only the 3 first PCA’s, and we can get the best test accuracy from 22 components.

Linear Discriminant Analysis (LDA)

As a supervised dimensionality reduction technique for maximizing class separability. LDA can be used as a technique for feature extraction to increase the computational efficiency and reduce the degree of over-fitting due to the curse of dimensionality in nonregularized models.

So, the goal is to find the feature subspace that optimizes class separability.

However, even if one or more of those assumptions are slightly violated, LDA for dimensionality reduction can still work reasonably well.

Some Important Parameters:
solver : string, optional
Solver to use, possible values:

  - svd: Singular value decomposition (default).
        Does not compute the covariance matrix, therefore this solver is
        recommended for data with a large number of features.
      - eigen: Eigenvalue decomposition, can be combined with shrinkage.

    

shrinkage : string or float, optional
Shrinkage parameter, possible values:

  - None: no shrinkage (default).
      - auto: automatic shrinkage using the Ledoit-Wolf lemma.
      - float between 0 and 1: fixed shrinkage parameter.

    Note that shrinkage works only with 'lsqr' and 'eigen' solvers.

In [80]:
X_train , X_test, y, y_test = train_test_split(df , y_train, test_size=0.3, random_state=0)

    lr = LogisticRegression(class_weight='balanced', random_state=101)
    lr = lr.fit(X_train, y)
    print('LR Training Accuracy: {:2.2%}'.format(accuracy_score(y, lr.predict(X_train))))
    y_pred = lr.predict(X_test)
    print('LR Accuracy: {:2.2%}'.format(accuracy_score(y_test, y_pred)))
    print('_' * 40)
    print('\nApply LDA:\n')
    lda = LDA(store_covariance=True)
    X_train_lda = lda.fit_transform(X_train, y)
    #X_train_lda = pd.DataFrame(X_train_lda)

    print('Number of features after LDA:',X_train_lda.shape[1])
    lr = LogisticRegression(class_weight='balanced', random_state=101)
    lr = lr.fit(X_train_lda, y)
    print('LR Training Accuracy With LDA: {:2.2%}'.format(accuracy_score(y, lr.predict(X_train_lda))))
    X_test_lda = lda.transform(X_test)
    y_pred = lr.predict(X_test_lda)
    print('LR Test Accuracy With LDA: {:2.2%}'.format(accuracy_score(y_test, y_pred)))

    fig = plt.figure(figsize=(20,5))
    fig.add_subplot(121)
    plt.scatter(X_train_lda[y==0, 0], np.zeros((len(X_train_lda[y==0, 0]),1)), color='red', alpha=0.1)
    plt.scatter(X_train_lda[y==1, 0], np.zeros((len(X_train_lda[y==1, 0]),1)), color='blue', alpha=0.1)
    plt.title('LDA on Training Data Set')
    plt.xlabel('LDA')
    fig.add_subplot(122)
    plt.scatter(X_test_lda[y_test==0, 0], np.zeros((len(X_test_lda[y_test==0, 0]),1)), color='red', alpha=0.1)
    plt.scatter(X_test_lda[y_test==1, 0], np.zeros((len(X_test_lda[y_test==1, 0]),1)), color='blue', alpha=0.1)
    plt.title('LDA on Test Data Set')
    plt.xlabel('LDA')

    plt.show()
    

LR Training Accuracy: 84.27%
    LR Accuracy: 81.72%
    ________________________________________

    Apply LDA:

    Number of features after LDA: 1
    LR Training Accuracy With LDA: 85.55%
    LR Test Accuracy With LDA: 80.60%
    

As you can saw, we have better accuracy on training after LDA, but with overfitting, sure without cross validation.

If I don’t consider the surnames below to 0.06 of correlation with survived, we can get basically the same results after LDA, with lose only 0.48% at training and 0.38% at test, but reduce the difference in 0.10%.

The LDA returns a total of components equal to the number of class minus 1, or less if you define the n_components less than the number of classes. Since our case is binary classification, we only have one column after applying the LDA.

For we can have another visualization, a small trick to having two components as a return is to fit some rows to X with a not common in their training observations, in that case with -0.1 for example, and the same number of rows with -1 to y. Let’s see it:

In [81]:
X_train , X_test, y, y_test = train_test_split(df , y_train, test_size=0.3, random_state=0)

    X_train = X_train.append(pd.DataFrame(-np.ones((20,len(cols)))/10, columns = X_train.columns), ignore_index=True)
    y = y.append(pd.Series(-np.ones((20))), ignore_index=True)

    lr = LogisticRegression(class_weight='balanced', random_state=101)
    lr = lr.fit(X_train, y)

    print('Artficial training %d observations' % X_train.Age[y==-1].count())
    print('LR Training Accuracy: {:2.2%}'.format(accuracy_score(y, lr.predict(X_train))))
    y_pred = lr.predict(X_test)
    print('LR Accuracy: {:2.2%}'.format(accuracy_score(y_test, y_pred)))

    print('_' * 40)
    print('\nApply LDA:\n')
    lda = LDA(store_covariance=True)
    X_train_lda = lda.fit_transform(X_train, y)

    print('Number of features after LDA:',X_train_lda.shape[1])
    print('Number test observations predit as -1:', len(X_test_lda[y_test==-1, :]))
    lr = LogisticRegression(class_weight='balanced', random_state=101)
    lr = lr.fit(X_train_lda, y)
    print('LR Training Accuracy With LDA: {:2.2%}'.format(accuracy_score(y, lr.predict(X_train_lda))))
    X_test_lda = lda.transform(X_test)
    y_pred = lr.predict(X_test_lda)
    print('LR Test Accuracy With LDA: {:2.2%}'.format(accuracy_score(y_test, y_pred)))

    fig = plt.figure(figsize=(20,5))
    fig.add_subplot(121)
    plt.scatter(x=X_train_lda[y==0, 0], y=X_train_lda[y==0, 1], color='red', alpha=0.1)
    plt.scatter(x=X_train_lda[y==1, 0], y=X_train_lda[y==1, 1], color='blue', alpha=0.1)
    plt.title('LDA on Training Data Set')
    plt.xlabel('LDA 1')
    plt.ylabel('LDA 2')

    fig.add_subplot(122)
    plt.scatter(x=X_test_lda[y_test==0, 0], y=X_test_lda[y_test==0, 1], color='red', alpha=0.1)
    plt.scatter(x=X_test_lda[y_test==1, 0], y=X_test_lda[y_test==1, 1], color='blue', alpha=0.1)
    plt.title('LDA on Test Data Set')
    plt.xlabel('LDA 1')
    plt.ylabel('LDA 2')

    plt.show()
    

Artficial training 20 observations
    LR Training Accuracy: 85.07%
    LR Accuracy: 81.72%
    ________________________________________

    Apply LDA:

    Number of features after LDA: 2
    Number test observations predit as -1: 0
    LR Training Accuracy With LDA: 85.69%
    LR Test Accuracy With LDA: 80.60%
    

Nonlinear dimensionality reduction via kernel principal component analysis

Many machine learning algorithms make assumptions about the linear separability of the input data. If we are dealing with nonlinear problems, which is more common in real cases, linear transformation techniques for dimensionality reduction like PCA and LDA, may not be the best choice. Using kernel PCA to transform nonlinear data onto a new, lower-dimensional subspace that is suitable for linear classifiers.

In what way, with kernel PCA we perform a nonlinear mapping that transforms the data onto a higher-dimensional space and use standard PCA in this higher-dimensional space to project the data back onto a lower-dimensional space where the samples can be separated by a linear classifier. However, one downside of this approach is that it is computationally very expensive.

Using the kernel trick, we can compute the similarity between two high-dimension feature vectors in the original feature space. In other words, what we obtain after kernel PCA are the samples already projected onto the respective components.

The most commonly used kernels

  • The polynomial kernel.
  • The hyperbolic tangent (sigmoid) kernel.
  • The Radial Basis Function (RBF) or Gaussian kernel.

Scikit-learn implements a kernel PCA class and also implements manifold, a class with advanced techniques for nonlinear dimensionality reduction.

Some important parameters:

  • n_components : int, default=None. Number of components. If None, all non-zero components are kept.

  • eigen_solver : string [‘auto’|’dense’|’arpack’], default=’auto'</p>
    Select eigensolver to use. If n_components is much less than the number of training samples, arpack may be more efficient than the dense eigensolver.

  • kernel : “linear” | “poly” | “rbf” | “sigmoid” | “cosine” | “precomputed”. Kernel. Default=”linear”.

  • gamma : float, default=1/n_features. Kernel coefficient for rbf, poly and sigmoid kernels. Ignored by other kernels.

  • degree : int, default=3. Degree for poly kernels. Ignored by other kernels.

  • coef0 : float, default=1. Independent term in poly and sigmoid kernels. Ignored by other kernels.

Let’s start and see if kernel PCA can help with our data:

In [82]:
n_components = 3
    kernel = 'linear' 
    degree = 3
    gamma = 1/df.shape[0]

    kpca = KernelPCA(n_components = n_components, degree = degree, random_state = 101, #gamma = gamma,
                    kernel = kernel, eigen_solver='arpack')
    X_kpca = kpca.fit_transform(df)

    # Plot first two KPCA components
    fig = plt.figure(figsize=(20,6))
    ax  = fig.add_subplot(121)
    plt.scatter(x = X_kpca[y_train==0, 0], y = X_kpca[y_train==0, 1], color='red', marker='^', alpha=0.5)
    plt.scatter(x = X_kpca[y_train==1, 0], y = X_kpca[y_train==1, 1], color='blue', marker='o', alpha=0.5)
    ax.set_xlabel("KPCA_0")
    ax.set_ylabel("KPCA_1")
    ax.set_title("Plot of first 2 KPCA Components on the Titanic data set")

    my_color=y_train.astype('category').cat.codes

    # Store results of PCA in a data frame
    result=pd.DataFrame(X_kpca, columns=['KPCA%i' % i for i in range(n_components)], index=df.index)

    # Plot initialisation
    ax = fig.add_subplot(122, projection='3d')
    ax.scatter(result['KPCA0'], result['KPCA1'], result['KPCA2'], c=my_color, cmap="Set2_r", s=60)
     
    # make simple, bare axis lines through space:
    xAxisLine = ((min(result['KPCA0']), max(result['KPCA0'])), (0, 0), (0,0))
    ax.plot(xAxisLine[0], xAxisLine[1], xAxisLine[2], 'r')
    yAxisLine = ((0, 0), (min(result['KPCA1']), max(result['KPCA1'])), (0,0))
    ax.plot(yAxisLine[0], yAxisLine[1], yAxisLine[2], 'r')
    zAxisLine = ((0, 0), (0,0), (min(result['KPCA2']), max(result['KPCA2'])))
    ax.plot(zAxisLine[0], zAxisLine[1], zAxisLine[2], 'r')
     
    # label the axes
    ax.set_xlabel("KPCA_0")
    ax.set_ylabel("KPCA_1")
    ax.set_zlabel("KPCA_2")
    ax.set_title("KPCA of 3 Components on the Titanic data set")
    plt.tight_layout(); plt.show()

    X_train , X_test, y, y_test = train_test_split(df , y_train, test_size=0.3, random_state=0)

    lr = LogisticRegression(class_weight='balanced', random_state=101)
    lr = lr.fit(X_train, y)
    print('\nLogistic Regression over data without transformation:\n' + '_' * 53 + '\n')
    print('LR Training Accuracy: {:2.2%}'.format(accuracy_score(y, lr.predict(X_train))))
    y_pred = lr.predict(X_test)
    print('LR Test Accuracy: {:2.2%}'.format(accuracy_score(y_test, y_pred)))


    print('\nApply KPCA:\n' + '_' * 53)
    kpca = KernelPCA(kernel = kernel, random_state = 101, degree = degree, eigen_solver='arpack', n_components = 23)
    X_train_kpca = kpca.fit_transform(X_train)
    print('Number of features after KPCA:', X_train_kpca.shape[1])
    lr = LogisticRegression(class_weight='balanced', random_state=101)
    lr = lr.fit(X_train_kpca, y)
    print('LR Training Accuracy: {:2.2%}'.format(accuracy_score(y, lr.predict(X_train_kpca))))
    X_test_kpca = kpca.transform(X_test)
    y_pred = lr.predict(X_test_kpca)
    print('LR Test Accuracy: {:2.2%}'.format(accuracy_score(y_test, y_pred)))
    

    Logistic Regression over data without transformation:
    _____________________________________________________

    LR Training Accuracy: 84.27%
    LR Test Accuracy: 81.72%

    Apply KPCA:
    _____________________________________________________
    Number of features after KPCA: 23
    LR Training Accuracy: 81.06%
    LR Test Accuracy: 83.21%
    

Although the algorithm is admittedly exhaustive, as we have few data it runs very well, even using a single core.

So, instead of proceeding with a hyper parameterization via grid search, I chose to run manually with some variations to see the graphs and results on accuracy. I leave the best result I got, but if you want you can proceed with play it and check for yourself.

My conclusions are:

  • The liner solver not exceeding PCA and we don’t see a great difference in accuracy, if we define the number of features to 23, and get the same with 45.
  • The poly solver degenerate our accuracy.
  • With rbf solver we have overfitting.
  • With sigmoid or cosine the accuracy is the worst.

So applying nonlinear transformations to all of these data may not be the best, and it’s important checked it against your model’s performance. Also, as you may notice these transformations are subject to hyper parameterization, then, you should not ignore this if your case is computationally costs

Feature Selection into the Pipeline

Since we have a very different selection of features selection methods, from the results it may be interesting keeping only the removal of collinear, multicollinear and most of one-hot encode results from surnames, and apply PCA around 22 or LDA to linear models, and we can still improve the results through hyper parameterization and cross-validation.

In [83]:
class select_fetaures(object): # BaseEstimator, TransformerMixin, 
        def __init__(self, select_cols):
            self.select_cols_ = select_cols
        
        def fit(self, X, Y ):
            print('Recive {0:2d} features...'.format(X.shape[1]))
            return self

        def transform(self, X):
            print('Select {0:2d} features'.format(X.loc[:, self.select_cols_].shape[1]))
            return X.loc[:, self.select_cols_]    

        def fit_transform(self, X, Y):
            self.fit(X, Y)
            df = self.transform(X)
            return df 
            #X.loc[:, self.select_cols_]    

        def __getitem__(self, x):
            return self.X[x], self.Y[x]
            
    

In [84]:
Test_ID = data.PassengerId[data.Survived<0]
    y_train = data.Survived[data.Survived>=0]
    train = data.loc[data.Survived>=0, list(cols)]
    test = data.loc[data.Survived<0, list(cols)]
    

Modeling – Hyper Parametrization

First, we start to looking at different approaches to implement classifiers models, and use hyper parametrization, cross validation and compare the results between different errors measures.

The standard error of the coefficient (std err) indicates the precision of the coefficient estimates. Smaller values represent more reliable estimates.

When you run two models to check the effects of multicollinearity, ever compare the Summary of Model statistics between the two models and you’ll notice that Pseudo R-squ. and the others are all identical, if the effects is None or minimal. In that case multicollinearity doesn’t affect how well the model fits. In fact, if you want to use the model to make predictions, both models produce identical results for fitted values and prediction intervals!

Simplify Get Results

Let’s build a function to standardize the capture and exposure of the results of our models.

In [85]:
def get_results(model, name, results=None, data=train, reasume=False):

        modelo = model.fit(data, y_train)
        print('Mean Best Accuracy: {:2.2%}'.format(gs.best_score_))
        print(gs.best_params_,'\n')
        best = gs.best_estimator_
        param_grid = best
        y_pred = model.predict(data)
        meu.display_model_performance_metrics(true_labels=y_train, predicted_labels=y_pred)

        print('\n\n              ROC AUC Score: {:2.2%}'.format(roc_auc_score(y_true=y_train, y_score=y_pred)))
        if hasattr(param_grid, 'predict_proba'):
                prob = model.predict_proba(data)
                score_roc = prob[:, prob.shape[1]-1] 
                prob = True
        elif hasattr(param_grid, 'decision_function'):
                score_roc = model.decision_function(data)
                prob = False
        else:
                raise AttributeError("Estimator doesn't have a probability or confidence scoring system!")
        fpr, tpr, thresholds = roc_curve(y_true=y_train, y_score=score_roc)
        roc_auc = auc(fpr, tpr)
        plt.title('Receiver Operating Characteristic (ROC) Curve')
        plt.plot(fpr, tpr, 'b', label='AUC = {:2.2%}'.format(roc_auc))
        plt.legend(loc='lower right')
        plt.ylabel('True Positive Rate')
        plt.xlabel('False Positive Rate')
        plt.plot([0, 1], [0, 1], 'k--')
        plt.show()

        r1 = pd.DataFrame([(prob, gs.best_score_, np.round(accuracy_score(y_train, y_pred), 4), 
                             roc_auc_score(y_true=y_train, y_score=y_pred), roc_auc)], index = [name],
                             columns = ['Prob', 'CV Accuracy', 'Acc All', 'ROC AUC Score', 'ROC Area'])
        if reasume:
            results = r1
        elif (name in results.index):        
            results.loc[[name], :] = r1
        else: 
            results = results.append(r1)
            
        return results, modelo
    

Logistic Regression

This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers. It can handle both dense and sparse input.

Additional Parameters

  • class_weight : dict or ‘balanced’, default: None
    The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

    For how class_weight works: It penalizes mistakes in samples of class[i] with class_weight[i] instead of 1. So higher class-weight means you want to put more emphasis on a class. For example, our class 0 is 1.24 times more frequent than class 1. So you should increase the class_weight of class 1 relative to class 0, say {1: 0.6, 0: 0.4}. If the class_weight doesn’t sum to 1, it will basically change the regularization parameter.

    “balanced” basically means replicating the smaller class until you have as many samples as in the larger one, but in an implicit way.

  • warm_start : bool, default: False. Useless for liblinear solver.
  • 'clf__multi_class' : ['ovr', 'multinomial'] for 'clf__solver': ['newton-cg', 'sag', 'lbfgs']

Attributes:

  • coef_ : array, shape (1, n_features) or (n_classes, n_features)
  • intercept_ : array, shape (1,) or (n_classes,)
  • niter : array, shape (n_classes,) or (1, )

See also:

  • SGDClassifier : incrementally trained logistic regression (when given the parameter loss="log").
  • sklearn.svm.LinearSVC : learns SVM models using the same algorithm.

    See the best results below, wheres get with PCA 21 but take more time then LDA.

In [86]:
clf = Pipeline([
            #('pca', PCA(random_state = 101)),
            ('clf', LogisticRegression(random_state=101))])  

    # a list of dictionaries to specify the parameters that we'd want to tune
    n_components= [25, 22, 31, 54]
    whiten = [True, False]
    C =  [0.008, 0.007, 0.009, 0.01]#, 0.1, 1.0, 10.0, 100.0, 1000.0]
    tol = [0.001, 0.003, 0.002, 0.005] # [1e-06, 5e-07, 1e-05, 1e-04, 1e-03, 1e-02, 1e-01]

    param_grid =\
        [{'clf__C': C
         ,'clf__solver': ['liblinear', 'saga'] 
         ,'clf__penalty': ['l1', 'l2']
         ,'clf__tol' : tol 
         ,'clf__class_weight': ['balanced']
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
    },
        {'clf__C': C
         ,'clf__max_iter': [3, 9, 2, 7, 4]
         ,'clf__solver': ['newton-cg', 'sag', 'lbfgs']
         ,'clf__penalty': ['l2']
         ,'clf__tol' : tol 
         ,'clf__class_weight': ['balanced'] 
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
    }]

    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=3)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(shadow))),
            ('scl', StandardScaler()),
            ('lda', LDA(store_covariance=True)),
            ('gs', gs)
     ])  


    results, lr = get_results(main_pip, 'LogisticRegression', reasume=True)
    

Recive 54 features...
    Select 22 features
    Fitting 5 folds for each of 304 candidates, totalling 1520 fits
    

[Parallel(n_jobs=3)]: Done 785 tasks      | elapsed:    4.1s
    [Parallel(n_jobs=3)]: Done 1520 out of 1520 | elapsed:    5.9s finished
    

Mean Best Accuracy: 82.27%
    {'clf__C': 0.008, 'clf__class_weight': 'balanced', 'clf__max_iter': 4, 'clf__penalty': 'l2', 'clf__solver': 'sag', 'clf__tol': 0.001} 

    Select 22 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  82.83% 
    Precision: 82.72% 
    Recall:    82.83% 
    F1 Score:  82.74% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.79      0.75      0.77       342
              0       0.85      0.88      0.86       549

    avg / total       0.83      0.83      0.83       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        257   85
            0         68  481


                  ROC AUC Score: 81.38%
    Select 22 features
    

SGDClassifier

This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). SGD allows minibatch (online/out-of-core) learning, see the partial_fit method. For best results using the default learning rate schedule, the data should have zero mean and unit variance.

This implementation works with data represented as dense or sparse arrays of floating point values for the features. The model it fits can be controlled with the loss parameter; by default, it fits a linear support vector machine (SVM).

The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve on-line feature selection.

Parameters:

loss:

  • Classifier: hinge, log, modified_huber, squared_hinge, perceptron
    • Defaults to ‘hinge’, which gives a linear SVM.
    • The ‘log’ loss gives logistic regression, a probabilistic classifier.
    • ‘modified_huber’ is another smooth loss that brings tolerance to outliers as well as probability estimates.
    • ‘squared_hinge’ is like hinge but is quadratically penalized.
    • ‘perceptron’ is the linear loss used by the perceptron algorithm.
  • regression: squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive

penalty: The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’.

alpha: Constant that multiplies the regularization term. Defaults to 0.0001 Also used to compute learning_rate when set to ‘optimal’.

l1_ratio: The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.

tol: The stopping criterion. If it is not None, the iterations will stop when (loss > previous_loss – tol).Defaults to 1e-3 from 0.21.

learning_rate:

  • ‘constant’: eta = eta0
  • ‘optimal’: eta = 1.0 / (alpha * (t + t0)) [default]
  • ‘invscaling’: eta = eta0 / pow(t, power_t)
    where t0 is chosen by a heuristic proposed by Leon Bottou.

eta0: The initial learning rate for the constant or invscaling schedules. The default value is 0.0 as eta0 is not used by the default schedule ‘optimal’.

power_t: The exponent for inverse scaling learning rate [default 0.5].

class_weight: dict, {class_label: weight} or “balanced” or None, optional

  • The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

In [87]:
clf = Pipeline([
            #('pca', PCA(random_state = 101)),
            ('clf', SGDClassifier(random_state=101))])

    # a list of dictionaries to specify the parameters that we'd want to tune
    n_components= [30, 22, 21, 50]
    whiten = [True, False]
    alpha = [4e-03, 5e-03, 6e-03, 1e-03]
    tol = [1e-08, 1e-07, 5e-09]

    param_grid =\
        [{'clf__loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron']
         ,'clf__tol': tol
         ,'clf__alpha': alpha
         ,'clf__penalty': ['l2', 'l1']
         ,'clf__class_weight' : ['balanced'] 
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
         },
        {'clf__loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron']
         ,'clf__tol': tol
         ,'clf__alpha': alpha
         ,'clf__penalty': ['elasticnet']
         ,'clf__l1_ratio' : [0.3, 0.5, 0.1]
         ,'clf__class_weight' : ['balanced'] 
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
         }]


    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=3)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(FRFE))),
            ('scl', StandardScaler()),
            ('lda', LDA(store_covariance=True)),
            ('gs', gs)
     ])  

    results, svm = get_results(main_pip, 'SGDClassifier', results)
    

Recive 54 features...
    Select 25 features
    Fitting 5 folds for each of 300 candidates, totalling 1500 fits
    

[Parallel(n_jobs=3)]: Done 815 tasks      | elapsed:    3.8s
    [Parallel(n_jobs=3)]: Done 1500 out of 1500 | elapsed:    4.7s finished
    

Mean Best Accuracy: 83.95%
    {'clf__alpha': 0.001, 'clf__class_weight': 'balanced', 'clf__l1_ratio': 0.5, 'clf__loss': 'hinge', 'clf__penalty': 'elasticnet', 'clf__tol': 1e-08} 

    Select 25 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  83.05% 
    Precision: 83.23% 
    Recall:    83.05% 
    F1 Score:  83.12% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.77      0.80      0.78       342
              0       0.87      0.85      0.86       549

    avg / total       0.83      0.83      0.83       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        274   68
            0         83  466


                  ROC AUC Score: 82.50%
    Select 25 features
    

Linear Support Vector Classification

Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

This class supports both dense and sparse input and the multiclass support is handled according to a one-vs-the-rest scheme.

The combination of penalty=’l1′ and loss=’hinge’ is not supported, and penalty=’l2′ and loss=’hinge’ needs dual=True.

In [88]:
clf = Pipeline([
            #('pca', PCA(random_state = 101)),
            ('clf', LinearSVC(random_state=101))])

    # a list of dictionaries to specify the parameters that we'd want to tune
    n_components= [25, 22, 31, 54]
    whiten = [True, False]
    C =  [0.5, 0.3, 0.05, 0.1] #, 1.0, 10.0, 100.0, 1000.0]
    tol = [1e-06, 3e-06, 5e-07]
    max_iter = [9, 15, 7]

    param_grid =\
        [{'clf__loss': ['hinge']
         ,'clf__tol': tol
         ,'clf__C': C
         ,'clf__penalty': ['l2']
         ,'clf__class_weight' : ['balanced'] 
         ,'clf__max_iter' : max_iter
         ,'clf__dual' : [True]
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
         }
        ,{'clf__loss': ['squared_hinge']
         ,'clf__tol': tol
         ,'clf__C': C
         ,'clf__penalty': ['l2', 'l1']
         ,'clf__class_weight' : ['balanced'] 
         ,'clf__max_iter' : max_iter
         ,'clf__dual' : [False]
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
         }]

    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=3)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(FRFE))),
            ('scl', StandardScaler()),
            ('lda', LDA(store_covariance=True)),
            ('gs', gs)
     ])  

    results, lsvc = get_results(main_pip, 'LinearSVC', results)
    

Recive 54 features...
    Select 25 features
    Fitting 5 folds for each of 108 candidates, totalling 540 fits
    

[Parallel(n_jobs=3)]: Done 540 out of 540 | elapsed:    3.1s finished
    

Mean Best Accuracy: 82.83%
    {'clf__C': 0.5, 'clf__class_weight': 'balanced', 'clf__dual': False, 'clf__loss': 'squared_hinge', 'clf__max_iter': 9, 'clf__penalty': 'l2', 'clf__tol': 1e-06} 

    Select 25 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  83.05% 
    Precision: 83.23% 
    Recall:    83.05% 
    F1 Score:  83.12% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.77      0.80      0.78       342
              0       0.87      0.85      0.86       549

    avg / total       0.83      0.83      0.83       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        274   68
            0         83  466


                  ROC AUC Score: 82.50%
    Select 25 features
    

Gaussian Process Classifier (GPC)

Internally, the Laplace approximation is used for approximating the non-Gaussian posterior by a Gaussian.

Currently, the implementation is restricted to using the logistic link function. For multi-class classification, several binary one-versus rest classifiers are fitted. Note that this class thus does not implement a true multi-class Laplace approximation.

In [89]:
clf = Pipeline([
            #('pca', PCA(random_state = 101)),
            ('clf', GaussianProcessClassifier(1.0 * RBF(1.0), random_state=101))
    ])

    # n_restarts_optimizer=5
    # a list of dictionaries to specify the parameters that we'd want to tune
    n_components= [25, 22, 31, 54]
    whiten = [True, False]
    max_iter_predict = [5, 10, 15, 20]

    param_grid =\
        [{'clf__max_iter_predict':  max_iter_predict
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
         }]

    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=3)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(bcols))),
            ('scl', StandardScaler()),
            ('lda', LDA(store_covariance=True)),
            ('gs', gs)
     ])  

    results, gpc = get_results(main_pip, 'GaussianProcessClassifier', results)
    

Recive 54 features...
    Select 47 features
    Fitting 5 folds for each of 4 candidates, totalling 20 fits
    

[Parallel(n_jobs=3)]: Done  20 out of  20 | elapsed:   48.2s finished
    

Mean Best Accuracy: 84.40%
    {'clf__max_iter_predict': 10} 

    Select 47 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  84.40% 
    Precision: 84.32% 
    Recall:    84.40% 
    F1 Score:  84.22% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.83      0.75      0.79       342
              0       0.85      0.91      0.88       549

    avg / total       0.84      0.84      0.84       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        255   87
            0         52  497


                  ROC AUC Score: 82.54%
    Select 47 features
    

Random Forest Classifier

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

In [90]:
clf = Pipeline([
            #('pca', PCA(random_state = 101)),
            ('clf', RandomForestClassifier(random_state=101))])

    # a list of dictionaries to specify the parameters that we'd want to tune
    n_components= [25, 22, 31, 54]
    whiten = [True, False]
    param_grid =\
        [{'clf__n_estimators' : [500, 3000]
          ,'clf__criterion': ['gini', 'entropy']
          ,'clf__min_samples_split': [4, 3, 5]
          ,'clf__min_impurity_split': [0.05, 0.03, 0.07]
          #,'clf__max_depth': [5, 10]
          #,'clf__min_impurity_decrease': [0.0003]
          #,'clf__min_samples_leaf': [1,2,3,4]
          ,'clf__class_weight': ['balanced']
          #,'clf__bootstrap': [True, False]
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
         }]

    sele = bcols
    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=3)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(sele))),
            ('scl', StandardScaler()),
            ('gs', gs)
     ])  

    results, rfc = get_results(main_pip, 'RandomForestClassifier', results)
    

Recive 54 features...
    Select 47 features
    Fitting 5 folds for each of 36 candidates, totalling 180 fits
    

[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:  1.2min
    [Parallel(n_jobs=3)]: Done 180 out of 180 | elapsed:  5.3min finished
    

Mean Best Accuracy: 83.84%
    {'clf__class_weight': 'balanced', 'clf__criterion': 'gini', 'clf__min_impurity_split': 0.07, 'clf__min_samples_split': 4, 'clf__n_estimators': 500} 

    Select 47 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  95.85% 
    Precision: 95.84% 
    Recall:    95.85% 
    F1 Score:  95.84% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.96      0.94      0.95       342
              0       0.96      0.97      0.97       549

    avg / total       0.96      0.96      0.96       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        320   22
            0         15  534


                  ROC AUC Score: 95.42%
    Select 47 features
    

AdaBoost classifier

Is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

This class implements the algorithm known as AdaBoost-SAMME.

Parameters:

  • n_estimators: The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.

  • learning_rate: Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators.

  • algorithm: {‘SAMME’, ‘SAMME.R’}. If ‘SAMME.R’ then use the SAMME.R real boosting algorithm. base_estimator must support calculation of class probabilities. If ‘SAMME’ then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.

In [91]:
clf = Pipeline([
            #('pca', PCA(random_state = 101)),
            ('clf', AdaBoostClassifier(random_state=101))])
    # , max_iter_predict=500, n_restarts_optimizer=5

    # a list of dictionaries to specify the parameters that we'd want to tune
    n_components= [25, 22, 31, 54]
    whiten = [True, False]

    param_grid =\
        [{'clf__learning_rate': [3e-03, 15e-02, 5e-02]
         ,'clf__n_estimators': [300, 350, 400, 500] # np.arange(96,115)
         ,'clf__algorithm' : ['SAMME', 'SAMME.R']
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
         }]

    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=3)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(FRFE))),
            ('scl', StandardScaler()),
            ('lda', LDA(store_covariance=True)),
            ('gs', gs)
     ])  

    results, AdaB = get_results(main_pip, 'AdaBoostClassifier', results)
    

Recive 54 features...
    Select 25 features
    Fitting 5 folds for each of 24 candidates, totalling 120 fits
    

[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:   12.3s
    [Parallel(n_jobs=3)]: Done 120 out of 120 | elapsed:   33.4s finished
    

Mean Best Accuracy: 84.06%
    {'clf__algorithm': 'SAMME', 'clf__learning_rate': 0.003, 'clf__n_estimators': 300} 

    Select 25 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  84.40% 
    Precision: 85.03% 
    Recall:    84.40% 
    F1 Score:  83.87% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.89      0.68      0.77       342
              0       0.82      0.95      0.88       549

    avg / total       0.85      0.84      0.84       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        231  111
            0         28  521


                  ROC AUC Score: 81.22%
    Select 25 features
    

In [92]:
clf = Pipeline([
            #('pca', PCA(random_state = 101)),
            ('clf', KNeighborsClassifier())])

    #max_iter_predict=500, n_restarts_optimizer=5
    # a list of dictionaries to specify the parameters that we'd want to tune
    n_components= [25, 22, 31, 54]
    whiten = [True, False]
    param_grid =\
        [{'clf__n_neighbors': [3, 7, 8, 9] #
         ,'clf__weights': ['uniform', 'distance'] 
         ,'clf__algorithm' : ['ball_tree', 'kd_tree'] # ['auto', 'ball_tree', 'kd_tree', 'brute']
         ,'clf__leaf_size': [12, 15, 16, 20]
         ,'clf__p': [1, 2] 
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
         }]

    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=3)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(FRFE))),
            ('scl', StandardScaler()),
            ('lda', LDA(store_covariance=True)),
            ('gs', gs)
     ])  

    results, KNNC = get_results(main_pip, 'KNeighborsClassifier', results)
    

Recive 54 features...
    Select 25 features
    Fitting 5 folds for each of 128 candidates, totalling 640 fits
    

[Parallel(n_jobs=3)]: Done 640 out of 640 | elapsed:    3.4s finished
    

Mean Best Accuracy: 82.72%
    {'clf__algorithm': 'ball_tree', 'clf__leaf_size': 12, 'clf__n_neighbors': 9, 'clf__p': 1, 'clf__weights': 'uniform'} 

    Select 25 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  85.19% 
    Precision: 85.11% 
    Recall:    85.19% 
    F1 Score:  85.12% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.82      0.79      0.80       342
              0       0.87      0.89      0.88       549

    avg / total       0.85      0.85      0.85       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        269   73
            0         59  490


                  ROC AUC Score: 83.95%
    Select 25 features
    

Multi-layer Perceptron classifier

This model optimizes the log-loss function using LBFGS or stochastic gradient descent.

MLPClassifier trains iteratively since at each time step the partial derivatives of the loss function with respect to the model parameters are computed to update the parameters.

It can also have a regularization term added to the loss function that shrinks model parameters to prevent overfitting.

This implementation works with data represented as dense numpy arrays or sparse scipy arrays of floating point values.

In [93]:
clf = Pipeline([
            #('pca', PCA(random_state = 101)),
            ('clf', MLPClassifier(random_state=101))])

    # a list of dictionaries to specify the parameters that we'd want to tune
    n_components= [25, 22, 31, 54]
    whiten = [True, False]
    param_grid =\
        [{#'clf__activation': ['identity', 'logistic', 'tanh', 'relu'],
          'clf__solver': ['adam'] # , 'lbfgs', 'sgd'
         ,'clf__tol': [5e-04] #, 3e-04, 7e-04]
         #,'clf__max_iter': [200, 1000]
         ,'clf__alpha': [1e-06] #, 1e-07, 1e-08] 
         ,'clf__learning_rate_init': [3e-04]
         ,'clf__hidden_layer_sizes': [(512, 256, 128, 64, )]#, (1024, 512, 256, 128, 64, )]
         ,'clf__batch_size': [64]
         ,'clf__epsilon': [1e-08] 
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
         },
         {'clf__solver': ['sgd'] 
         ,'clf__tol': [5e-04]
         ,'clf__learning_rate_init': [3e-04]
         ,'clf__learning_rate': ['constant', 'adaptive']
         ,'clf__alpha': [1e-06] #, 1e-07, 1e-08] #, 1e-03, 1e-02, 1e-01]
         ,'clf__hidden_layer_sizes': [(512, 256, 128, 64, )]#, (1024, 512, 256, 128, 64, )]
         ,'clf__batch_size': [64]
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
        },
         {'clf__solver': ['sgd'] 
         ,'clf__tol': [5e-04]
         ,'clf__learning_rate_init': [3e-04]
         ,'clf__learning_rate': ['invscaling']
         ,'clf__power_t' : [ 0.25, 0.5]
         ,'clf__alpha': [1e-06]
         ,'clf__hidden_layer_sizes': [(256, 128, 64, 32, )]
         ,'clf__batch_size': [64]
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
        }]
        
    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=3)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(cols))),
            ('scl', StandardScaler()),
            ('lda', LDA(store_covariance=True)),
            ('gs', gs)
     ])  

    results, mlpc = get_results(main_pip, 'MLPClassifier', results)
    

Recive 54 features...
    Select 54 features
    Fitting 5 folds for each of 5 candidates, totalling 25 fits
    

[Parallel(n_jobs=3)]: Done  25 out of  25 | elapsed:  1.7min finished
    

Mean Best Accuracy: 84.74%
    {'clf__alpha': 1e-06, 'clf__batch_size': 64, 'clf__epsilon': 1e-08, 'clf__hidden_layer_sizes': (512, 256, 128, 64), 'clf__learning_rate_init': 0.0003, 'clf__solver': 'adam', 'clf__tol': 0.0005} 

    Select 54 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  84.51% 
    Precision: 84.46% 
    Recall:    84.51% 
    F1 Score:  84.30% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.84      0.74      0.79       342
              0       0.85      0.91      0.88       549

    avg / total       0.84      0.85      0.84       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        253   89
            0         49  500


                  ROC AUC Score: 82.53%
    Select 54 features
    

Gradient Boosting for Classification

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage nclasses regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.

  • loss: loss function to be optimized. ‘deviance’ refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss ‘exponential’ gradient boosting recovers the AdaBoost algorithm.

In [94]:
clf = Pipeline([
            #('pca', PCA(random_state = 101)),
            ('clf', GradientBoostingClassifier(random_state=101))])  

    # a list of dictionaries to specify the parameters that we'd want to tune
    #cv=None, dual=False,  scoring=None, refit=True,  multi_class='ovr'
    n_components= [25, 22, 31, 54]
    whiten = [True, False]
    learning_rate =  [1e-02] #, 5e-03, 2e-02]
    n_estimators= [140, 150, 160, 145]
    max_depth = [2, 3, 5]

    param_grid =\
        [{'clf__learning_rate': learning_rate
         ,'clf__max_depth': max_depth
         ,'clf__n_estimators' : n_estimators 
         #,'pca__n_components' : n_components
         #,'pca__whiten' : whiten
    }]

    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=3)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(FRFE))),
            ('scl', StandardScaler()),
            ('lda', LDA(store_covariance=True)),
            ('gs', gs)
     ])  

    results, GBC = get_results(main_pip, 'GradientBoostingClassifier', results)
    

Recive 54 features...
    Select 25 features
    Fitting 5 folds for each of 12 candidates, totalling 60 fits
    

[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    4.1s
    [Parallel(n_jobs=3)]: Done  60 out of  60 | elapsed:    5.3s finished
    

Mean Best Accuracy: 84.06%
    {'clf__learning_rate': 0.01, 'clf__max_depth': 2, 'clf__n_estimators': 140} 

    Select 25 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  84.40% 
    Precision: 85.03% 
    Recall:    84.40% 
    F1 Score:  83.87% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.89      0.68      0.77       342
              0       0.82      0.95      0.88       549

    avg / total       0.85      0.84      0.84       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        231  111
            0         28  521


                  ROC AUC Score: 81.22%
    Select 25 features
    

XGBoost (eXtreme Gradient Boosting)

XGBoost is an advanced implementation of gradient boosting algorithm. It’s a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities of data.

  • Standard GBM implementation has no regularization like XGBoost, therefore it also helps to reduce overfitting.
  • XGBoost implements parallel processing to making a tree using all cores and is blazingly faster as compared to GBM.
  • XGBoost also supports implementation on Hadoop.
  • High flexibility, it allow users to define custom optimization objectives and evaluation criteria.
  • XGBoost has an in-built routine to handle missing values.
  • It make splits up to the max_depth specified and then start pruning the tree backwards and remove splits beyond which there is no positive gain.
  • Sometimes a split of negative loss say -2 may be followed by a split of positive loss +10. GBM would stop as it encounters -2. But XGBoost will go deeper and it will see a combined effect of +8 of the split and keep both.
  • XGBoost allows user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run. You don’t need use grid search.
  • User can start training an XGBoost model from its last iteration of previous run.

The overall parameters have been divided into 3 categories by XGBoost authors, let’s see the most importants:

  1. General Parameters: Guide the overall functioning:
    • booster: default is gbtree fom [‘gbtree’, ‘gblinear’]

  2. Booster Parameters: Guide the individual booster (tree/regression) at each step:
    • learning_rate (eta): default is 0.3. Makes the model more robust by shrinking the weights on each step. Typical final values to be used: 0.01-0.2
    • min_child_weight: default is 1. Defines the minimum sum of weights of all observations required in a child. Used to control over-fitting. Higher values prevent a over-fitting, but too high values can lead to under-fitting hence, it should be tuned using CV.
    • max_depth: default is 6. The maximum depth of a tree used to control over-fitting and should be tuned using CV. Typical values: 3-10
    • max_leaf_nodes: The maximum number of terminal nodes or leaves in a tree. Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves. If this is defined, GBM will ignore max_depth.
    • gamma: default is 0. A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split. Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
    • max_delta_step: default is 0. In maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help making the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced.
    • subsample: default is 1. Denotes the fraction of observations to be randomly samples for each tree. Lower values make the algorithm more conservative and prevents overfitting but too small values might lead to under-fitting. Typical values: 0.5-1
    • colsample_bytree: default is 1. Denotes the fraction of columns to be randomly samples for each tree. Typical values: 0.5-1
    • colsample_bylevel: default is 1. Denotes the subsample ratio of columns for each split, in each level.
    • reg_lambda (lambda): default is 1. L2 regularization term on weights, analogous to Ridge regression, it should be explored to reduce overfitting.
    • reg_alpha (alpha): default is 0. L1 regularization term on weight, analogous to Lasso regression, Can be used in case of very high dimensionality so that the algorithm runs faster when implemented.
    • scale_pos_weight: default is 1. A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence. To balance use sum(negative cases)/sum(positive cases) and Use AUC for evaluation.

  3. Learning Task Parameters: These parameters are used to define the optimization objective the metric to be calculated at each step:
    • objective: default is reg:linear and binary:logistic for XGBClassifier. This defines the loss function to be minimized. Mostly used values are:
      • binary:logistic –logistic regression for binary classification, returns predicted probability (not class)
      • multi:softmax –multiclass classification using the softmax objective, returns predicted class (not probabilities). You also need to set an additional num_class (number of classes) parameter defining the number of unique classes
      • multi:softprob –same as softmax, but returns predicted probability of each data point belonging to each class.
    • eval_metric: The default values are rmse for regression and error for classification. The metric to be used for validation data. Typical values are:
      • rmse – root mean square error
      • mae – mean absolute error
      • logloss – negative log-likelihood
      • error – Binary classification error rate (0.5 threshold)
      • merror – Multiclass classification error rate
      • mlogloss – Multiclass logloss
      • auc: Area under the curve
    • seed: The random number seed. Can be used for generating reproducible results and also for parameter tuning.

Before proceeding further, since cgb don’t accept categorical let’s change it to boolean or integer.

In [95]:
def categorical_change_back(df):
        categorical_features = list(df.dtypes[df.dtypes == "category"].index)
        for feat in categorical_features:
            if len(df[feat].unique())==2:
                df[feat] = df[feat].astype(bool)
            else:
                df[feat] = df[feat].astype(int)
        return df

    trainXGB = data.loc[data.Survived>=0, cols].copy()
    trainXGB = categorical_change_back(trainXGB)
    testXGB = data.loc[data.Survived<0, cols].copy()
    testXGB = categorical_change_back(testXGB)
    

General Approach for Parameter Tuning:

  • Choose a relatively high learning rate. Generally a learning rate of 0.1 works but somewhere between 0.05 to 0.3 should work for different problems. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called as ‘cv’ which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.
  • Tune tree-specific parameters ( max_depth, min_child_weight, gamma, subsample, colsample_bytree) for decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I’ll take up an example here.
  • Tune regularization parameters (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.
  • Lower the learning rate and decide the optimal parameters .
  • max_depth: This should be between 3-10. 4-6 can be good starting points.
  • min_child_weight: Define a smaller value if your data set is a highly imbalanced class problem and leaf nodes can have smaller size groups.
  • gamma: apply L1 regularization, a smaller value like 0.1-0.2 can also be chosen for starting.
  • subsample, colsample_bytree: 0.8 is a commonly used used start value. Typical values range between 0.5-0.9.
  • scale_pos_weight: use to balance, for binary class do ((len(y_train)-y_train.sum())/y_train.sum()).
  • reg_lambda: apply L2 regularization to reduce overfitting. Though many people don’t use this parameters much as gamma provides a substantial way of controlling complexity. But we should always try it.

In [146]:
clf = Pipeline([
            #('pca', PCA(random_state = 101)),
            ('clf', XGBClassifier(learning_rate = 0.1, n_estimators=200, max_depth=5,
                     min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, importance_type='gain',
                     objective= 'binary:logistic', n_jobs=4, scale_pos_weight=scale, seed=101, random_state=101))])  

    # a list of dictionaries to specify the parameters that we'd want to tune
    n_components= [5, 10, 15, 19, 21] # [25, 22, 31, 54]
    whiten = [True, False]
    scale = ((len(y_train)-y_train.sum())/y_train.sum())
    sample = np.arange(5,10)/10

    param_grid = \
        [{
         'clf__max_depth': [3, 4],
         'clf__min_child_weight': range(1,5),
         'clf__gamma': np.arange(0,11)/10,
         'clf__reg_alpha': np.arange(0,11)/10,
         'clf__subsample' : sample,
         'clf__colsample_bytree' :sample
         #,'pca__n_components' : n_components
         #,'pca__whiten' : [True] # whiten
    }]

    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=3, verbose=1, n_jobs=4)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(pv_cols))),
            #('scl', StandardScaler()),
            #('lda', LDA(store_covariance=True)),
            ('gs', gs)
     ])  

    results, xgb1 = get_results(main_pip, 'XGBClassifier', results, data = trainXGB)
    

Recive 54 features...
    Select 15 features
    Fitting 3 folds for each of 24200 candidates, totalling 72600 fits
    

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    4.5s
    [Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   10.7s
    [Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   21.7s
    [Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:   38.1s
    [Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:  1.0min
    [Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed:  1.5min
    [Parallel(n_jobs=4)]: Done 2442 tasks      | elapsed:  2.1min
    [Parallel(n_jobs=4)]: Done 3192 tasks      | elapsed:  2.7min
    [Parallel(n_jobs=4)]: Done 4042 tasks      | elapsed:  3.5min
    [Parallel(n_jobs=4)]: Done 4992 tasks      | elapsed:  4.4min
    [Parallel(n_jobs=4)]: Done 6042 tasks      | elapsed:  5.3min
    [Parallel(n_jobs=4)]: Done 7192 tasks      | elapsed:  6.2min
    [Parallel(n_jobs=4)]: Done 8442 tasks      | elapsed:  7.3min
    [Parallel(n_jobs=4)]: Done 9792 tasks      | elapsed:  8.4min
    [Parallel(n_jobs=4)]: Done 11242 tasks      | elapsed:  9.7min
    [Parallel(n_jobs=4)]: Done 12792 tasks      | elapsed: 11.1min
    [Parallel(n_jobs=4)]: Done 14442 tasks      | elapsed: 12.5min
    [Parallel(n_jobs=4)]: Done 16192 tasks      | elapsed: 14.0min
    [Parallel(n_jobs=4)]: Done 18042 tasks      | elapsed: 15.5min
    [Parallel(n_jobs=4)]: Done 19992 tasks      | elapsed: 17.1min
    [Parallel(n_jobs=4)]: Done 22042 tasks      | elapsed: 18.8min
    [Parallel(n_jobs=4)]: Done 24192 tasks      | elapsed: 20.5min
    [Parallel(n_jobs=4)]: Done 26442 tasks      | elapsed: 22.3min
    [Parallel(n_jobs=4)]: Done 28792 tasks      | elapsed: 24.2min
    [Parallel(n_jobs=4)]: Done 31242 tasks      | elapsed: 26.3min
    [Parallel(n_jobs=4)]: Done 33792 tasks      | elapsed: 28.4min
    [Parallel(n_jobs=4)]: Done 36442 tasks      | elapsed: 31.0min
    [Parallel(n_jobs=4)]: Done 39192 tasks      | elapsed: 33.7min
    [Parallel(n_jobs=4)]: Done 42042 tasks      | elapsed: 36.3min
    [Parallel(n_jobs=4)]: Done 44992 tasks      | elapsed: 39.0min
    [Parallel(n_jobs=4)]: Done 48042 tasks      | elapsed: 41.7min
    [Parallel(n_jobs=4)]: Done 51192 tasks      | elapsed: 44.6min
    [Parallel(n_jobs=4)]: Done 54442 tasks      | elapsed: 47.6min
    [Parallel(n_jobs=4)]: Done 57792 tasks      | elapsed: 51.1min
    [Parallel(n_jobs=4)]: Done 61242 tasks      | elapsed: 54.7min
    [Parallel(n_jobs=4)]: Done 64792 tasks      | elapsed: 58.3min
    [Parallel(n_jobs=4)]: Done 68442 tasks      | elapsed: 61.7min
    [Parallel(n_jobs=4)]: Done 72192 tasks      | elapsed: 65.2min
    [Parallel(n_jobs=4)]: Done 72600 out of 72600 | elapsed: 65.6min finished
    

Mean Best Accuracy: 84.06%
    {'clf__colsample_bytree': 0.7, 'clf__gamma': 0.0, 'clf__max_depth': 3, 'clf__min_child_weight': 2, 'clf__reg_alpha': 0.9, 'clf__subsample': 0.9} 

    Select 15 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  89.23% 
    Precision: 89.25% 
    Recall:    89.23% 
    F1 Score:  89.24% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.86      0.87      0.86       342
              0       0.92      0.91      0.91       549

    avg / total       0.89      0.89      0.89       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        296   46
            0         50  499


                  ROC AUC Score: 88.72%
    Select 15 features
    

You may have noticed that the number of estimators is actually a parameter that defines the maximum of interactions, whether or not your final model has this number of estimators.

In fact, the ideal is that it has a smaller number, because this means that its model actually found a local minimum. In fact, this local minimum may not be great, so making a choice by a high number when you are running a grid search may be the best.

However, you may experience performance issues, so you will be tempted to perform some low number interactions to try to find the best possible values for your other parameters and then proceed with an additional round with a high number of estimators. Although this is a valid attitude, beware of the fact that by doing this you may be discarding some value from another parameter, which would actually be used with a number of estimators to be greater than you have defined.

My recommendation is that after you have tried a bit, and have understood a little more how your model is performing in front of the data, make a run with the number of larger estimators and keep the others in some range, however small, or that it has the value found and two opposite ends.

Since you are verbose in 1, you will be able to estimate the total execution time right after some tasks have been completed, and then evaluate whether you prefer to abort and reduce the number of estimators or the range of some of the other parameters.

Once you have reached a set of parameters that satisfied you, it is interesting that you run it once again, and check if a lower learning rate and higher estimators can produce better results. Let’s do it and use our normal 5 CV to get the final model:

In [96]:
# a list of dictionaries to specify the parameters that we'd want to tune
    scale = ((len(y_train)-y_train.sum())/y_train.sum())
    param_grid = \
        [{
         'clf__learning_rate': [0.1, 0.09, 0.03, 0.01, 0.001],
         'clf__n_estimators': [200, 3000]
    }]

    clf = Pipeline([
            ('clf', XGBClassifier(learning_rate =0.1, n_estimators=2000, max_depth=3, min_child_weight=2, gamma=0.0, 
                                  subsample=0.9, colsample_bytree=0.7, objective= 'binary:logistic', importance_type='gain', 
                                  reg_alpha = 0.9, n_jobs=4, scale_pos_weight=scale, seed=101, random_state=101))])  

    gs = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=3, verbose=1, n_jobs=4)
    main_pip = Pipeline([
            ('sel', select_fetaures(select_cols=list(pv_cols))),
            ('gs', gs)
     ])  

    results, xgbF = get_results(main_pip, 'XGBClassifier Final', results, data = trainXGB)
    

Recive 54 features...
    Select 15 features
    Fitting 3 folds for each of 10 candidates, totalling 30 fits
    

[Parallel(n_jobs=4)]: Done  30 out of  30 | elapsed:   12.8s finished
    

Mean Best Accuracy: 84.06%
    {'clf__learning_rate': 0.1, 'clf__n_estimators': 200} 

    Select 15 features
    Model Performance metrics:
    ------------------------------
    Accuracy:  89.23% 
    Precision: 89.25% 
    Recall:    89.23% 
    F1 Score:  89.24% 

    Model Classification report:
    ------------------------------
                 precision    recall  f1-score   support

              1       0.86      0.87      0.86       342
              0       0.92      0.91      0.91       549

    avg / total       0.89      0.89      0.89       891


    Prediction Confusion Matrix:
    ------------------------------
              Predicted:     
                       1    0
    Actual: 1        296   46
            0         50  499


                  ROC AUC Score: 88.72%
    Select 15 features
    

Finalize The Model: Stacking the Models

Check the best results from the models hyper parametrization

In [101]:
display(results.sort_values(by='ROC Area', ascending=False))
    

Prob CV Accuracy Acc All ROC AUC Score ROC Area
RandomForestClassifier True 0.838384 0.9585 0.954175 0.993113
XGBClassifier Final True 0.840629 0.8923 0.887211 0.953539
KNeighborsClassifier True 0.827160 0.8519 0.839541 0.905762
MLPClassifier True 0.847363 0.8451 0.825256 0.898074
GaussianProcessClassifier True 0.843996 0.8440 0.825448 0.896995
GradientBoostingClassifier True 0.840629 0.8440 0.812218 0.894321
SGDClassifier False 0.839506 0.8305 0.824993 0.888644
LinearSVC False 0.828283 0.8305 0.824993 0.888644
LogisticRegression True 0.822671 0.8283 0.813800 0.877323
AdaBoostClassifier True 0.840629 0.8440 0.812218 0.875292

Validation Function

In [98]:
n_folds = 10

    def cvscore(model):
        kf = KFold(n_folds, shuffle=True, random_state=101).get_n_splits(train.values)
        score= cross_val_score(estimator=model, X=train.values, y=y_train, scoring="accuracy", verbose=1, n_jobs=3, cv = kf)
        return(score)
    

Make Staked Classifier

Create an ensemble model by staked models with mean probabilities.

In [118]:
models = ( xgbF, rfc, GBC, AdaB, mlpc, gpc, lr, KNNC )

    trained_models = []
    for model in models:
        #model.fit(train, targets) models all ready fited
        trained_models.append(model)

    predictions = []
    for i, model in enumerate(trained_models):
        if i < 1:
             predictions.append(model.predict_proba(testXGB)[:, 1])
        else:
            predictions.append(model.predict_proba(test)[:, 1])

    # Preper Submission File of Probabilities Classifier
    predictions_df = pd.DataFrame(predictions).T

    ensemble = predictions_df.mean(axis=1).map(lambda s: 1 if s >= 0.5 else 0)
    submit = pd.DataFrame()
    submit['PassengerId'] = Test_ID.values
    submit['Survived'] = ensemble

    # ----------------------------- Create File to Submit --------------------------------
    submit.to_csv('Titanic_Probabilities_submission.csv', index = False)
    print('Sample of Probabilities Submit:')
    display(submit.head())
    

Select 15 features
    Select 47 features
    Select 25 features
    Select 25 features
    Select 54 features
    Select 47 features
    Select 22 features
    Select 25 features
    Sample of Probabilities Submit:
    

PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1

Conclusion

As you saw from the results of the models, their accuracy from cross validation vary from 82.27% to 84.74%. In the other hand, you see a more large variance on the ROC accuracy, from 87.93% to 99.31%. Since this accuracy is taken by the probabilities, not by the results class, it may suggest to use a different sets of models to assembly.

In fact, if you submit this stacked classifier as it stands, you’ll hit the 0.78947 score, which is exactly the same score I got with just Random Forest in R on my first submission. But, if you play around little you can get best score, if you pay attention to not get into overfitting.

Finally, as you saw, this is a interesting case to training techniques, because it is not easy to obtain a high score without risk of some overfitting, since it really have some cases wheres too specific. If you try submit to the competition, you discover that you can find some models that have high score, and you use cross validation and take other actions to dealing with overfitting, but your score is worse than other with minor score that you had submitted. Don’t give up!

For next steps, you can try:

  • Try other models like Tensorflow
  • Try others configurations of stacked models (Voting, Mean Probabilities with other set of models)
  • Not apply boxcox1p on discrete data, or apply on all numeric but before the selection features
  • Create others features and make other transformations