Kaggle challenges us to learn data analysis and machine learning from the data of the Titanic shipwreck, to try to predict survival, and to get familiar with ML basics.
This material is intended to cover most of the data analysis and ML techniques in Python rather than to properly compete on Kaggle. That is why it follows the natural flow of an ML project and contains many notes and links about the techniques, to make consultation and reference easy and so it can be extended over time.
In this way the material can be used for reference, and its methods can be applied to other similar classification problems; but to apply it in the competition, or even to a real case, it will be necessary to make some choices and changes.
Competition Description:
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, Kaggle asks you to complete the analysis of what sorts of people were likely to survive. In particular, they ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Table of Contents
- 1 Preparing environment and uploading data
- 2 Exploratory Data Analysis (EDA) & Feature Engineering
- 3 Select Features
- 4 Additional Feature Engineering: Feature transformation
- 5 Modeling – Hyper Parametrization
- 5.1 Simplify Get Results
- 5.2 Logistic Regression
- 5.3 SGDClassifier
- 5.4 Linear Support Vector Classification
- 5.5 Gaussian Process Classifier (GPC)
- 5.6 Random Forest Classifier
- 5.7 AdaBoost classifier
- 5.8 K-Nearest Neighbors
- 5.9 Multi-layer Perceptron classifier
- 5.10 Gradient Boosting for Classification
- 5.11 XGBoost (eXtreme Gradient Boosting)
- 6 Finalize The Model: Stacking the Models
- 7 Conclusion
Preparing environment and uploading data¶
You can download this Python notebook and the data from my GitHub repository. The data can also be downloaded from Kaggle here.
Import Packages¶
import os
import warnings
warnings.simplefilter(action = 'ignore', category=FutureWarning)
warnings.filterwarnings('ignore')
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn # ignore annoying warnings (from sklearn and seaborn)
import numpy as np
import pandas as pd
import pylab
import seaborn as sns
sns.set(style="ticks", color_codes=True, font_scale=1.5)
from matplotlib import pyplot as plt
from matplotlib.ticker import FormatStrFormatter
from matplotlib.colors import ListedColormap
%matplotlib inline
import mpl_toolkits
from mpl_toolkits.mplot3d import Axes3D
import model_evaluation_utils as meu
from scipy.stats import skew, norm, probplot, boxcox
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import f_classif, chi2, SelectKBest, SelectFromModel
from boruta import BorutaPy
from rfpimp import *
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, cross_val_predict, train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
#from sklearn.base import BaseEstimator, TransformerMixin, clone, ClassifierMixin
from sklearn.ensemble import VotingClassifier
from itertools import combinations
Load Datasets¶
I start by loading the datasets with pandas and concatenating them.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
Test_ID = test.PassengerId
test.insert(loc=1, column='Survived', value=-1)
data = pd.concat([train, test], ignore_index=True)
Exploratory Data Analysis (EDA) & Feature Engineering¶
Take a First Look of our Data:¶
I created the function below to simplify the analysis of the general characteristics of the data. Inspired by R's str
function, it returns the types, counts, distinct values, null counts, missing ratio and unique values of each field/feature.
If the study involves supervised learning, the function can also return the correlation with the target; for this we just need to provide the dependent variable to the pred
parameter.
Also, if the return is stored in a variable, you can evaluate it in more detail, inspect a specific field, or sort it from different perspectives.
def rstr(df, pred=None):
    obs = df.shape[0]
    types = df.dtypes
    counts = df.apply(lambda x: x.count())
    uniques = df.apply(lambda x: [x.unique()])
    nulls = df.apply(lambda x: x.isnull().sum())
    distincts = df.apply(lambda x: x.unique().shape[0])
    missing_ratio = (df.isnull().sum() / obs) * 100
    skewness = df.skew()
    kurtosis = df.kurt()
    print('Data shape:', df.shape)
    if pred is None:
        cols = ['types', 'counts', 'distincts', 'nulls', 'missing_ratio', 'uniques', 'skewness', 'kurtosis']
        str = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis], axis=1)
    else:
        corr = df.corr()[pred]
        str = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis, corr], axis=1, sort=False)
        corr_col = 'corr ' + pred
        cols = ['types', 'counts', 'distincts', 'nulls', 'missing_ratio', 'uniques', 'skewness', 'kurtosis', corr_col]
    str.columns = cols
    dtypes = str.types.value_counts()
    print('___________________________\nData types:\n', str.types.value_counts())
    print('___________________________')
    return str
details = rstr(data.loc[: ,'Survived' : 'Embarked'], 'Survived')
details.sort_values(by='corr Survived', ascending=False)
Data Dictionary
- Survived: 0 = No, 1 = Yes. I use -1 so I can separate the test data from the training data.
- Fare: The passenger fare
- Parch: # of parents / children aboard the Titanic
- SibSp: # of siblings / spouses aboard the Titanic
- Age: Age in years
- Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- Name: Name of the passenger
- Sex: Sex of the passenger male and female
- Ticket: Ticket number
- Cabin: Cabin number
- Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
The points of attention we have here are:
- Fare: Kaggle states that this is the passenger fare, but by inspecting groups of Tickets we discover that it is the total amount paid for a ticket, and that some tickets cover a group of passengers.
- Parch: The dataset defines family relations in this way…
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children traveled only with a nanny, therefore parch=0 for them.
- SibSP: The dataset defines family relations in this way…
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
- Age: has 20% nulls, so we need a more efficient way of filling them than a single value such as the median.
- Name: categorical data with a high number of distinct values, as expected. The first reaction is to drop this column, but we can use it to try different feature engineering techniques and see if we can extract some valuable data. Besides that, did you notice that there are two pairs of passengers with the same name?
- Ticket: another categorical feature, but in this case it has only 71% distinct values and no nulls. So, it is possible that some passengers traveled in groups and shared the same ticket. Beyond that, we can check whether we can extract other interesting information from it.
- Cabin: a categorical feature with a high number of distinct values (187) and nulls (77.5%).
I use it to try different techniques for extracting information of value and for null imputation, but given the high rate of nulls the recommendation would be to simplify the filling or even to exclude this attribute.
First look at some stats of the Numeric Data¶
For the main statistics of our numeric data we use the describe function (like R's summary):
print('Data is not balanced! Only {:2.2%} survived'.format(train.Survived.describe()[1]))
display(data.loc[: ,'Pclass' : 'Embarked'].describe().transpose())
print('Survived: [1] Survived; [0] Died; [-1] Test Data set:\n',data.Survived.value_counts())
def charts(feature, df):
    print('\n ____________________________ Plots of', feature, 'per Survived and Dead: ____________________________')
    # Pie of all Data
    fig = plt.figure(figsize=(20,5))
    f1 = fig.add_subplot(131)
    cnt = df[feature].value_counts()
    g = plt.pie(cnt, labels=cnt.index, autopct='%1.1f%%', shadow=True, startangle=90)
    # Count Plot By Survived and Dead
    f = fig.add_subplot(132)
    g = sns.countplot(x=feature, hue='Survived', hue_order=[1,0], data=df, ax=f)
    # Percent stacked Plot
    survived = df[df['Survived']==1][feature].value_counts()
    dead = df[df['Survived']==0][feature].value_counts()
    df2 = pd.DataFrame([survived, dead])
    df2.index = ['Survived', 'Dead']
    df2 = df2.T
    df2 = df2.fillna(0)
    df2['Total'] = df2.Survived + df2.Dead
    df2.Survived = df2.Survived / df2.Total
    df2.Dead = df2.Dead / df2.Total
    df2.drop(['Total'], axis=1, inplace=True)
    f = fig.add_subplot(133)
    df2.plot(kind='bar', stacked=True, ax=f)
    del df2, g, f, cnt, dead, fig
Ticket¶
Since Ticket is transactional and categorical data, the first thought is to drop this feature, but note that it carries some hidden information. At first look, save for a few cases, we could state that:
- families and groups of people who traveled together bought the same ticket.
- people with alphanumeric tickets had some special treatment (crew family, employees, VIPs, free tickets, etc.)
So, we start by creating a new feature that quantifies the number of passengers per ticket, and join this
quantity to each passenger with the same ticket.
same_ticket = data.Ticket.value_counts()
data['qtd_same_ticket'] = data.Ticket.apply(lambda x: same_ticket[x])
del same_ticket
charts('qtd_same_ticket', data[data.Survived>=0])
As we can see above:
- the majority (54%) bought only one ticket per passenger, and have a lower survival rate than passengers who bought tickets for 2, 3, 4, 5 and 8 people.
- the survival rate grows between 1 and 4 and drops a lot at 5. From the bar chart we can see that after 5 the number of samples is too low (84 out of 891, 9.4%, a quarter of which are 5), and the data is skewed with a long right tail. We can reduce this tail by binning everything above 4 into the same ordinal; that is better to prevent overfitting, but we lose some other interesting cases (see the next bullet). As an alternative we can apply a Box-Cox transform to this measure.
- the case of 11 people with the same ticket is probably one huge family, all of whose members in the training data died. Let’s check this below.
data[(data.qtd_same_ticket==11)]
We confirm our hypothesis, and we notice that Fare is not the price paid per passenger but the total amount of the ticket. Since our data is per passenger, this information carries some distortion: a single passenger who bought a ticket alone for 69.55 pounds is different from 11 passengers who bought a special group ticket at 6.32 pounds per passenger. This suggests creating a new feature that represents the real fare per passenger.
Back to the number of people with the same ticket: if we keep this and the model captures the pattern, it will probably predict that the respective test samples also died! However, even if that is true, it can be a sign of overfitting, because we only have 1.2% of these cases in the training samples.
In order to increase representativeness while losing the minimum of information, since we have only 44 (4.9%) training samples that bought tickets for 4 people and 101 (11.3%) for 3, we bin the quantities 3 and 4 together as 3 (16.3%) and everything over 4 as 5 (84 samples). Let’s see the results below.
data['qtd_same_ticket_bin'] = data.qtd_same_ticket.apply(lambda x: 3 if (x>2 and x<5) else (5 if x>4 else x))
charts('qtd_same_ticket_bin', data[data.Survived>=0])
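As mentioned in the bullets above, an alternative to binning would be a Box-Cox transform to reduce the right skew of qtd_same_ticket. A minimal sketch, not used in the rest of this notebook (boxcox was imported from scipy.stats above and requires strictly positive values, which this count already satisfies):
# Sketch only: Box-Cox as an alternative to binning the ticket counts.
qtd_bc, bc_lambda = boxcox(data.qtd_same_ticket)
print('Box-Cox lambda: {:.3f} | skew before: {:.2f} | after: {:.2f}'.format(
      bc_lambda, data.qtd_same_ticket.skew(), pd.Series(qtd_bc).skew()))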
Another option is to create a binary feature that indicates whether the passenger shared a ticket or not.
print('Percent. survived from unique ticket: {:3.2%}'.\
format(data.Survived[(data.qtd_same_ticket==1) & (data.Survived>=0)].sum()/
data.Survived[(data.qtd_same_ticket==1) & (data.Survived>=0)].count()))
print('Percent. survived from same tickets: {:3.2%}'.\
format(data.Survived[(data.qtd_same_ticket>1) & (data.Survived>=0)].sum()/
data.Survived[(data.qtd_same_ticket>1) & (data.Survived>=0)].count()))
data['same_tckt'] = data.qtd_same_ticket.apply(lambda x: 1 if (x> 1) else 0)
charts('same_tckt', data[data.Survived>=0])
In this case we lose the information that the chances of survival increase from 1 to 4 and fall from 5. In addition, cases 1 and 0 of the two measures are exactly the same. So we will not use this option, and move on to work on Fare.
Finally, we have one more piece of information to extract directly from Ticket, to check the possible special treatment!
data.Ticket.str.findall('[A-z]').apply(lambda x: ''.join(map(str, x))).value_counts().head(7)
data['distinction_in_tikect'] =\
(data.Ticket.str.findall('[A-z]').apply(lambda x: ''.join(map(str, x)).strip('[]')))
data.distinction_in_tikect = data.distinction_in_tikect.\
apply(lambda y: 'Without' if y=='' else y if (y in ['PC', 'CA', 'A', 'SOTONOQ', 'STONO', 'WC', 'SCPARIS']) else 'Others')
charts('distinction_in_tikect', data[(data.Survived>=0)])
By the results, passengers with the PC distinction in their tickets had the best survival rate. Without, Others and CA are very close and can be grouped into one category; we can do the same between STONO and SCPARIS, and between A, SOTONOQ and WC.
data.distinction_in_tikect = data.distinction_in_tikect.\
apply(lambda y: 'Others' if (y in ['Without', 'Others', 'CA']) else\
'Low' if (y in ['A', 'SOTONOQ', 'WC']) else\
'High' if (y in ['STONO', 'SCPARIS']) else y)
charts('distinction_in_tikect', data[(data.Survived>=0)])
Fare¶
First, we treat the single null Fare case, then we take a look at the distribution of Fare (remember that it is the total fare of the Ticket).
# Fill null with median of most likely type passenger
data.loc[data.Fare.isnull(), 'Fare'] = data.Fare[(data.Pclass==3) & (data.qtd_same_ticket==1) & (data.Age>60)].median()
fig = plt.figure(figsize=(20,5))
f = fig.add_subplot(121)
g = sns.distplot(data[(data.Survived>=0)].Fare)
f = fig.add_subplot(122)
g = sns.boxplot(y='Fare', x='Survived', data=data[data.Survived>=0], notch = True)
Let’s take a look at how the fare per passenger behaves and how much it differs from the total:
data['passenger_fare'] = data.Fare / data.qtd_same_ticket
fig = plt.figure(figsize=(20,6))
a = fig.add_subplot(141)
g = sns.distplot(data[(data.Survived>=0)].passenger_fare)
a = fig.add_subplot(142)
g = sns.boxplot(y='passenger_fare', x='Survived', data=data[data.Survived>=0], notch = True)
a = fig.add_subplot(143)
g = pd.qcut(data.Fare[(data.Survived==0)], q=[.0, .25, .50, .75, 1.00]).value_counts().plot(kind='bar', ax=a, title='Died')
a = fig.add_subplot(144)
g = pd.qcut(data.Fare[(data.Survived>0)], q=[.0, .25, .50, .75, 1.00]).value_counts().plot(kind='bar', ax=a, title='Survived')
plt.tight_layout(); plt.show()
From the comparison, we can see that:
- the distributions are not exactly the same, with two peaks slightly apart in the passenger fare.
- Class and how much was paid per passenger make a difference!
Although the number of survivors among the quartiles is approximately the same, as expected, when we look at passenger fares it becomes more apparent that the mortality rate is higher at the lower fares, since the top of Q4 of the dead is at the same height as the median plus a confidence interval of the fare paid by survivors.
- the number of outliers is lower in the fare per passenger, especially among survivors.
We cannot rule out these outliers, because there are cases of the same type in the test data set. In addition, these differences in values are probably due to first class tickets with additional fees for certain exclusive services and cargo.
Below, you can see that for the largest outlier everyone in the train data set survived, and there is one case to predict (Passenger Id 1235,
the matriarch of one son and two companions). Among all outlier cases of survivors, we see that all are first class; different from the largest outlier, 27% actually died, and we have 18 cases to predict.
print('Passengers with highest passenger fare:')
display(data[data.passenger_fare>120])
print('\nSurivived of passenger fare more than 50:\n',
pd.pivot_table(data.loc[data.passenger_fare>50, ['Pclass', 'Survived']], aggfunc=np.count_nonzero,
columns=['Survived'] , index=['Pclass']))
Note that if we leave it this way and the model succeeds in capturing this pattern of the largest outlier, we are again looking at a model at risk of overfitting (0.03% of cases).
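One way to hedge against that risk (not done in this notebook) would be to cap the extreme passenger fares, for example at the 99th percentile, so the model cannot key on a handful of samples. A minimal sketch, storing the result in a hypothetical capped column:
# Sketch only: winsorize passenger_fare at the 99th percentile (hypothetical column).
fare_cap = data.passenger_fare.quantile(0.99)
data['passenger_fare_capped'] = data.passenger_fare.clip(upper=fare_cap)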
Pclass¶
Notwithstanding the fact that class 3 has the greatest magnitude, as we saw with fare per passenger, we notice that the survival rate is greater with a greater fare per passenger. This suggests some socioeconomic discrimination. It is confirmed when we look at the data distribution over the classes and see in the percent bar chart a clearly aggressive decrease in survival rate from the first to the third class.
charts('Pclass', data[(data.Survived>=0)])
SibSp¶
charts('SibSp', data[(data.Survived>=0)])
Since more than 2 siblings has too few cases and the lowest survival rate, we can aggregate all these cases into a single bin in order to increase representativeness and lose the minimum of information.
data['SibSp_bin'] = data.SibSp.apply(lambda x: 6 if x > 2 else x)
charts('SibSp_bin', data[(data.Survived>=0)])
Parch¶
charts('Parch', data[data.Survived>=0])
As we did with siblings, we aggregate the Parch cases with 3 or more, even though 3 Parch has the highest survival rate.
data['Parch_bin'] = data.Parch.apply(lambda x: x if x< 3 else 4)
charts('Parch_bin', data[(data.Survived>=0)])
Family and non-relatives¶
If you investigate the data, you will notice that the total number of family members can be obtained by the sum of Parch and SibSp plus 1 (1 for the person of the respective record). So, let’s create the family feature and see what we get.
data['family'] = data.SibSp + data.Parch + 1
charts('family', data[data.Survived>=0])
As we can see, family groups of up to 4 people were more likely to survive than people without relatives on board.
However, from 5 family members we see a drastic fall, and the 7-member cases level with the passengers without family.
You may be led to think that this distortion clearly has some relation to social condition. Better to look at the right data!
charts('Pclass', data[(data.family>4) & (data.Survived>=0)])
Yes, we have more cases in the third class, but on the other hand what we see is that cases with more than 4 relatives were rarer. On a more careful look, you will see that from 6 family members on we only have third class (25 in training, 10 in test). So we confirmed that a large number of family members made a difference, yes, if you were from the upper classes.
You may have a feeling of déjà vu, and yes, this metric is very similar to one we have already created: the number of passengers with the same ticket.
So what’s the difference? Here we have only the number of people aboard with a family relationship plus the passenger; in the previous one we have people reportedly grouped, family members or not. So, in cases where relatives bought tickets separately, the family feature groups them while the ticket separates them. On the other hand, the family feature does not consider travelers with their non-family companions, employees or friends, while the other does.
With this, we can now obtain the number of fellows or companions per passenger: the number of non-relatives who traveled with the passenger.
data['non_relatives'] = data.qtd_same_ticket - data.family
charts('non_relatives', data[data.Survived>=0])
Here you see negative numbers: they occur when the family is larger than the ticket group, that is, when relatives traveled on separate tickets.
Sex¶
As everybody knows, in this case women had a significantly higher survival rate than men.
charts('Sex', data[(data.Survived>=0)])
Embarked¶
First, we check the 2 null Embarked cases to find the most likely pattern to consider when filling them with the respective mode.
Next, we take a look at the Embarked data. As we can see, the passengers that embarked at Cherbourg had the best survival rate, while most of the passengers embarked at Southampton and had the worst survival rate.
display(data[data.Embarked.isnull()])
data.loc[data.Embarked.isnull(), 'Embarked'] = data[(data.Cabin.str.match('B2')>0) & (data.Pclass==1)].Embarked.mode()[0]
charts('Embarked', data[(data.Survived>=0)])
Name¶
The Name feature has too much variance and is not significant by itself, but it has some valuable information to extract and check, such as:
- Personal Titles
- Existence of nicknames
- Existence of references to another person
- Family names
def Personal_Titles(df):
    df['Personal_Titles'] = df.Name.str.findall('Mrs\.|Mr\.|Miss\.|Maste[r]|Dr\.|Lady\.|Countess\.|'
                                                + 'Sir\.|Rev\.|Don\.|Major\.|Col\.|Jonkheer\.|'
                                                + 'Capt\.|Ms\.|Mme\.|Mlle\.').apply(lambda x: ''.join(map(str, x)).strip('[]'))
    df.Personal_Titles[df.Personal_Titles=='Mrs.'] = 'Mrs'
    df.Personal_Titles[df.Personal_Titles=='Mr.'] = 'Mr'
    df.Personal_Titles[df.Personal_Titles=='Miss.'] = 'Miss'
    df.Personal_Titles[df.Personal_Titles==''] = df[df.Personal_Titles==''].Sex.apply(lambda x: 'Mr' if (x=='male') else 'Mrs')
    df.Personal_Titles[df.Personal_Titles=='Mme.'] = 'Mrs'
    df.Personal_Titles[df.Personal_Titles=='Ms.'] = 'Mrs'
    df.Personal_Titles[df.Personal_Titles=='Lady.'] = 'Royalty'
    df.Personal_Titles[df.Personal_Titles=='Mlle.'] = 'Miss'
    df.Personal_Titles[(df.Personal_Titles=='Miss.') & (df.Age>-1) & (df.Age<15)] = 'Kid'
    df.Personal_Titles[df.Personal_Titles=='Master'] = 'Kid'
    df.Personal_Titles[df.Personal_Titles=='Don.'] = 'Royalty'
    df.Personal_Titles[df.Personal_Titles=='Jonkheer.'] = 'Royalty'
    df.Personal_Titles[df.Personal_Titles=='Capt.'] = 'Technical'
    df.Personal_Titles[df.Personal_Titles=='Rev.'] = 'Technical'
    df.Personal_Titles[df.Personal_Titles=='Sir.'] = 'Royalty'
    df.Personal_Titles[df.Personal_Titles=='Countess.'] = 'Royalty'
    df.Personal_Titles[df.Personal_Titles=='Major.'] = 'Technical'
    df.Personal_Titles[df.Personal_Titles=='Col.'] = 'Technical'
    df.Personal_Titles[df.Personal_Titles=='Dr.'] = 'Technical'
Personal_Titles(data)
display(pd.pivot_table(data[['Personal_Titles', 'Survived']], aggfunc=np.count_nonzero,
columns=['Survived'] , index=['Personal_Titles']).T)
charts('Personal_Titles', data[(data.Survived>=0)])
As you can see above, I opted to merge some titles, but for now I keep two small sets (Technical and Royalty), because there are interesting survival rate variations.
Next, we identify the names with mentions to other people or with nicknames and create a boolean feature.
data['distinction_in_name'] =\
((data.Name.str.findall('\(').apply(lambda x: ''.join(map(str, x)).strip('[]'))=='(')
| (data.Name.str.findall(r'"[A-z"]*"').apply(lambda x: ''.join(map(str, x)).strip('""'))!=''))
data.distinction_in_name = data.distinction_in_name.apply(lambda x: 1 if x else 0)
charts('distinction_in_name', data[(data.Survived>=0)])
It is interesting to note that those who have some type of reference or distinction in their names had a higher survival rate.
Next, we find 872 surnames in this dataset. Even grouping the single occurrences into one category, we have 229 surnames with more than one member. It is a huge and very sparse categorical feature. Most of the values have too few samples to be significant for most algorithms without risk of overfitting. In addition, there are 18 surname cases with more than one member that occur exclusively in the test data set.
So, we create this feature with the single-member surnames aggregated into one category, and use it with models that can handle it to check whether we get better results. Alternatively, we can use dimensionality reduction methods.
print('Total of different surnames aboard:',
      ((data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()>1).shape[0])
print('More than one person aboard with the same surname:',
      ((data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()>1).sum())
surnames = (data.Name.str.findall(r'[A-z]*\,').apply(lambda x: ''.join(map(str, x)).strip(','))).value_counts()
data['surname'] = (data.Name.str.findall(r'[A-z]*\,').\
    apply(lambda x: ''.join(map(str, x)).strip(','))).apply(lambda x: x if surnames[x]>1 else 'Alone')
test_surnames = set(data.surname[data.Survived>=0].unique().tolist())
print('Surnames with more than one member aboard that happens only in the test data set:',
240-len(test_surnames))
train_surnames = set(data.surname[data.Survived<0].unique().tolist())
print('Surnames with more than one member aboard that happens only in the train data set:',
240-len(train_surnames))
both_surnames = test_surnames.intersection(train_surnames)
data.surname = data.surname.apply(lambda x : x if test_surnames.issuperset(set([x])) else 'Exclude')
del surnames, both_surnames, test_surnames, train_surnames
CabinByTicket = data.loc[~data.Cabin.isnull(), ['Ticket', 'Cabin']].groupby(by='Ticket').agg(min)
before = data.Cabin.isnull().sum()
data.loc[data.Cabin.isnull(), 'Cabin'] = data.loc[data.Cabin.isnull(), 'Ticket'].\
apply(lambda x: CabinByTicket[CabinByTicket.index==x].min())
print('Cabin nulls reduced:', (before - data.Cabin.isnull().sum()))
del CabinByTicket, before
data.Cabin[data.Cabin.isnull()] = 'N999'
data['Cabin_Letter'] = data.Cabin.str.findall('[^a-z]\d\d*')
data['Cabin_Number'] = data.apply(lambda x: 0 if len(str(x.Cabin))== 1 else int(int(x.Cabin_Letter[0][1:])/10), axis=1)
data.Cabin_Letter = data.apply(lambda x: x.Cabin if len(str(x.Cabin))== 1 else x.Cabin_Letter[0][0], axis=1)
display(data[['Fare', 'Cabin_Letter']].groupby(['Cabin_Letter']).agg([np.median, np.mean, np.count_nonzero, np.max, np.min]))
Cabin T does not exist in the test dataset. This passenger is from first class and his passenger fare is the same as that of 5 other first class passengers. So, we change it to ‘C’ to keep the same distribution among the six.
display(data[data.Cabin=='T'])
display(data.Cabin_Letter[data.passenger_fare==35.5].value_counts())
data.Cabin_Letter[data.Cabin_Letter=='T'] = 'C'
Fill the Cabin letter nulls of the third class with the most common pattern of the same passenger fare range, choosing ranges narrow enough to leave one or as few candidate patterns as possible.
data.loc[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare<6.237) & (data.passenger_fare>=0.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare<7.225) & (data.passenger_fare>=6.237) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare<7.65) & (data.passenger_fare>=7.225) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()
data.loc[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare<7.75) & (data.passenger_fare>=7.65) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()
data.loc[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare<8.0) & (data.passenger_fare>=7.75) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.min()
data.loc[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=8.0) & (data.Pclass==3) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
Fill the Cabin letter nulls of the second class with the most common pattern of the same passenger fare range, following the same approach.
data.loc[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=0) & (data.passenger_fare<8.59) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=8.59) & (data.passenger_fare<10.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=10.5) & (data.passenger_fare<10.501) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=10.501) & (data.passenger_fare<12.5) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=12.5) & (data.passenger_fare<13.) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=13.) & (data.passenger_fare<13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>=13.1) & (data.Pclass==2) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
Fill the Cabin letter nulls of the first class with the most common pattern of the same passenger fare range, following the same approach.
data.loc[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare==0) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>0) & (data.passenger_fare<=19.69) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>19.69) & (data.passenger_fare<=23.374) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>23.374) & (data.passenger_fare<=25.25) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>25.69) & (data.passenger_fare<=25.929) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>25.99) & (data.passenger_fare<=26.) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>26.549) & (data.passenger_fare<=26.55) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>27.4) & (data.passenger_fare<=27.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>27.7207) & (data.passenger_fare<=27.7208) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>29.69) & (data.passenger_fare<=29.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>30.49) & (data.passenger_fare<=30.5) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>30.6) & (data.passenger_fare<=30.7) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>31.67) & (data.passenger_fare<=31.684) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>39.599) & (data.passenger_fare<=39.6) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>41) & (data.passenger_fare<=41.2) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>45.49) & (data.passenger_fare<=45.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>49.5) & (data.passenger_fare<=49.51) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
data.loc[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Letter'] =\
data[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Letter.mode()[0]
data.loc[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin=='N999'), 'Cabin_Number'] =\
data[(data.passenger_fare>65) & (data.passenger_fare<=70) & (data.Pclass==1) & (data.Cabin!='N999')].Cabin_Number.mode()[0]
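The repeated fill pattern above could be factored into a small helper. This is only a sketch of how the same logic might be parameterized (the fare ranges, and the choice of mode vs. min for the cabin number in each range, would still have to be supplied by hand); it is not used below:
# Sketch only: parameterize the repeated Cabin fill logic used above.
def fill_cabin_by_fare(df, pclass, fare_min, fare_max, number_agg='mode'):
    known = ((df.passenger_fare >= fare_min) & (df.passenger_fare < fare_max) &
             (df.Pclass == pclass) & (df.Cabin != 'N999'))
    unknown = ((df.passenger_fare >= fare_min) & (df.passenger_fare < fare_max) &
               (df.Pclass == pclass) & (df.Cabin == 'N999'))
    if known.sum() == 0 or unknown.sum() == 0:
        return df  # nothing to learn from or nothing to fill in this range
    df.loc[unknown, 'Cabin_Letter'] = df.loc[known, 'Cabin_Letter'].mode()[0]
    if number_agg == 'mode':
        df.loc[unknown, 'Cabin_Number'] = df.loc[known, 'Cabin_Number'].mode()[0]
    else:
        df.loc[unknown, 'Cabin_Number'] = df.loc[known, 'Cabin_Number'].min()
    return df
# Example usage, equivalent to one of the second class fills above:
# data = fill_cabin_by_fare(data, pclass=2, fare_min=8.59, fare_max=10.5)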
As you can see below, we achieved a good result after filling the nulls, but we need to be careful since the feature originally had many nulls. In addition, the cabin may actually have made more difference in the deaths caused by the impact than among those who drowned.
charts('Cabin_Letter', data[(data.Survived>=0)])
Rescue of family relationships¶
After some work, we notice that it is difficult to understand SibSp and Parch in isolation, and it is difficult to extract family relationships directly from this data without a closer look.
So, in this configuration we do not have clear family relationships, and this information is essential for imputing the null ages with a better distribution and accuracy.
Let’s start to rescue:
For the first treatment, I discovered when checking the results that I had not applied any relationship to one case. Looking at the details below, we can see that it is a family with more than one ticket whose son has no age. So, I manually set this one case as a son, since the other members are the father and the mother, and the son has the pattern SibSp 0 and Parch 2.
display(data[data.Name.str.findall('Bourke').apply(lambda x: ''.join(map(str, x)).strip('[]'))=='Bourke'])
family_w_age = data.Ticket[(data.Parch>0) & (data.SibSp>0) & (data.Age.isnull())].unique().tolist()
data[data.Ticket.isin(family_w_age)].sort_values('Ticket')
data['sons'] = data.apply(lambda x : \
1 if ((x.Ticket in (['2661', '2668', 'A/5. 851', '4133'])) & (x.SibSp>0)) else 0, axis=1)
data.sons += data.apply(lambda x : \
1 if ((x.Ticket in (['CA. 2343'])) & (x.SibSp>1)) else 0, axis=1)
data.sons += data.apply(lambda x : \
1 if ((x.Ticket in (['W./C. 6607'])) & (x.Personal_Titles not in (['Mr', 'Mrs']))) else 0, axis=1)
data.sons += data.apply(lambda x: 1 if ((x.Parch>0) & (x.Age>=0) & (x.Age<20)) else 0, axis=1)
data.sons.loc[data.PassengerId==594] = 1 # Son with a different pattern (family with two tickets)
data.sons.loc[data.PassengerId==1252] = 1 # Case of 'CA. 2343' and last rule
data.sons.loc[data.PassengerId==1084] = 1 # Case of 'A/5. 851' and last rule
data.sons.loc[data.PassengerId==1231] = 1 # Case of 'A/5. 851' and last rule
charts('sons', data[(data.Survived>=0)])
We observe that only 12.1% are sons, and they had a better survival rate than the others.
Next, we rescue the parents, and check cases where we have both (mother and father), and cases where we have only one aboard.
data['parents'] = data.apply(lambda x : \
1 if ((x.Ticket in (['2661', '2668', 'A/5. 851', '4133'])) & (x.SibSp==0)) else 0, axis=1)
data.parents += data.apply(lambda x : \
1 if ((x.Ticket in (['CA. 2343'])) & (x.SibSp==1)) else 0, axis=1)
data.parents += data.apply(lambda x : 1 if ((x.Ticket in (['W./C. 6607'])) & (x.Personal_Titles in (['Mr', 'Mrs']))) \
else 0, axis=1)
# Identify parents and care nulls ages
data.parents += data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp>0) & (x.Age>19) & (x.Age<=45) ) else 0, axis=1)
charts('parents', data[(data.Survived>=0)])
data['parent_alone'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp==0) & (x.Age>19) & (x.Age<=45) ) else 0, axis=1)
charts('parent_alone', data[(data.Survived>=0)])
We can see that the two cases are very similar, and it is not significant to keep these two pieces of information separately.
Before putting them together, as I had learned when assembling the sons, I made a visual inspection and discovered some cases of sons and parents that required different assignment rules. Since I did this visually and it is not a rule for a pipeline, I proceeded with the settings manually.
t_p_alone = data.Ticket[data.parent_alone==1].tolist()
data[data.Ticket.isin(t_p_alone)].sort_values('Ticket')[96:]
data.parent_alone.loc[data.PassengerId==141] = 1
data.parent_alone.loc[data.PassengerId==541] = 0
data.sons.loc[data.PassengerId==541] = 1
data.parent_alone.loc[data.PassengerId==1078] = 0
data.sons.loc[data.PassengerId==1078] = 1
data.parent_alone.loc[data.PassengerId==98] = 0
data.sons.loc[data.PassengerId==98] = 1
data.parent_alone.loc[data.PassengerId==680] = 0
data.sons.loc[data.PassengerId==680] = 1
data.parent_alone.loc[data.PassengerId==915] = 0
data.sons.loc[data.PassengerId==915] = 1
data.parent_alone.loc[data.PassengerId==333] = 0
data.sons.loc[data.PassengerId==333] = 1
data.parent_alone.loc[data.PassengerId==119] = 0
data.sons[data.PassengerId==119] = 1
data.parent_alone.loc[data.PassengerId==319] = 0
data.sons.loc[data.PassengerId==319] = 1
data.parent_alone.loc[data.PassengerId==103] = 0
data.sons.loc[data.PassengerId==103] = 1
data.parents.loc[data.PassengerId==154] = 0
data.sons.loc[data.PassengerId==1084] = 1
data.parents.loc[data.PassengerId==581] = 0
data.sons.loc[data.PassengerId==581] = 1
data.parent_alone.loc[data.PassengerId==881] = 0
data.sons.loc[data.PassengerId==881] = 1
data.parent_alone.loc[data.PassengerId==1294] = 0
data.sons.loc[data.PassengerId==1294] = 1
data.parent_alone.loc[data.PassengerId==378] = 0
data.sons.loc[data.PassengerId==378] = 1
data.parent_alone.loc[data.PassengerId==167] = 1
data.parent_alone.loc[data.PassengerId==357] = 0
data.sons.loc[data.PassengerId==357] = 1
data.parent_alone.loc[data.PassengerId==918] = 0
data.sons.loc[data.PassengerId==918] = 1
data.parent_alone.loc[data.PassengerId==1042] = 0
data.sons.loc[data.PassengerId==1042] = 1
data.parent_alone.loc[data.PassengerId==540] = 0
data.sons.loc[data.PassengerId==540] = 1
data.parents += data.parent_alone
charts('parents', data[(data.Survived>=0)])
Next, we rescue the grandparents and the grandparents alone. We found the same situation with fewer cases, and decided to put all parents and grandparents into one feature and leave it to age to distinguish them.
data['grandparents'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp>0) & (x.Age>19) & (x.Age>45) ) else 0, axis=1)
charts('grandparents', data[(data.Survived>=0)])
data['grandparent_alone'] = data.apply(lambda x: 1 if ((x.Parch>0) & (x.SibSp==0) & (x.Age>45) ) else 0, axis=1)
charts('grandparent_alone', data[(data.Survived>=0)])
data.parents += data.grandparent_alone + data.grandparents
charts('parents', data[(data.Survived>=0)])
Next, we identify the relatives aboard:
data['relatives'] = data.apply(lambda x: 1 if ((x.SibSp>0) & (x.Parch==0)) else 0, axis=1)
charts('relatives', data[(data.Survived>=0)])
And then the companions: people who traveled with a family but have no family relationship with them.
data['companions'] = data.apply(lambda x: 1 if ((x.SibSp==0) & (x.Parch==0) & (x.same_tckt==1)) else 0, axis=1)
charts('companions', data[(data.Survived>=0)])
Finally, we rescue the passengers that traveled alone.
data['alone'] = data.apply(lambda x: 1 if ((x.SibSp==0) & (x.Parch==0) & (x.same_tckt==0)) else 0, axis=1)
charts('alone', data[(data.Survived>=0)])
As we can see, people with a family relationship, or even just companions, had better and very similar survival rates compared with people who traveled alone.
Now we can work on the null age issues and then on the age information itself.
Age¶
We start with the number of null cases by Survived, to remember that it is too high.
Then we plot the distributions of Age, to check how well it fits a normal distribution and to see the distortion introduced when a single value (zero) is applied to the null cases.
Next, we make a scatter plot of Age and siblings, and see that age decreases as the number of siblings increases, but with a great range.
fig = plt.figure(figsize=(20, 10))
fig1 = fig.add_subplot(221)
g = sns.distplot(data.Age.fillna(0), fit=norm, label='Nulls as Zero')
g = sns.distplot(data[~data.Age.isnull()].Age, fit=norm, label='Without Nulls')
plt.legend(loc='upper right')
print('Survived without Age:')
display(data[data.Age.isnull()].Survived.value_counts())
fig2 = fig.add_subplot(222)
g = sns.scatterplot(data = data[(~data.Age.isnull())], y='Age', x='SibSp', hue='Survived')
From the tables below, we can see that our effort to extract Personal Titles and rescue family relationships produces better medians to apply to the null ages.
print('Mean and median ages by siblings:')
data.loc[data.Age.isnull(), 'Age'] = -1
display(data.loc[(data.Age>=0), ['SibSp', 'Age']].groupby('SibSp').agg([np.mean, np.median]).T)
print('\nMedian ages by Personal_Titles:')
Ages = { 'Age' : {'median'}}
display(data[data.Age>=0][['Age', 'Personal_Titles', 'parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone']].\
groupby('Personal_Titles').agg(Ages).T)
print('\nMedian ages by Personal Titles and Family Relationships:')
display(pd.pivot_table(data[data.Age>=0][['Age', 'Personal_Titles', 'parents', 'grandparents',
'sons', 'relatives', 'companions','alone']],
aggfunc=np.median,
index=['parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone'] ,
columns=['Personal_Titles']))
print('\nNulls ages by Personal Titles and Family Relationships:')
display(data[data.Age<0][['Personal_Titles', 'parents', 'grandparents', 'sons', 'relatives', 'companions', 'alone']].\
groupby('Personal_Titles').agg([sum]))
So, we apply to the null ages the respective median of the same personal title and same family relationship; but before that, we create a binary feature to keep the information about the presence of nulls.
data['Without_Age'] = data.Age.apply(lambda x: 0 if x>0 else 1)
data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Miss')] = \
data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Miss')].median()
data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Mr')] = \
data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Mr')].median()
data.Age.loc[(data.Age<0) & (data.companions==1) & (data.Personal_Titles=='Mrs')] = \
data.Age[(data.Age>=0) & (data.companions==1) & (data.Personal_Titles=='Mrs')].median()
data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Kid')] = \
data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Kid')].median()
data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Technical')] = \
data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Technical')].median()
data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Miss')] = \
data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Miss')].median()
data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Mr')] = \
data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Mr')].median()
data.Age.loc[(data.Age<0) & (data.alone==1) & (data.Personal_Titles=='Mrs')] = \
data.Age[(data.Age>=0) & (data.alone==1) & (data.Personal_Titles=='Mrs')].median()
data.Age.loc[(data.Age<0) & (data.parents==1) & (data.Personal_Titles=='Mr')] = \
data.Age[(data.Age>=0) & (data.parents==1) & (data.Personal_Titles=='Mr')].median()
data.Age.loc[(data.Age<0) & (data.parents==1) & (data.Personal_Titles=='Mrs')] = \
data.Age[(data.Age>=0) & (data.parents==1) & (data.Personal_Titles=='Mrs')].median()
data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Kid')] = \
data.Age[(data.Age>=0) & (data.Personal_Titles=='Kid')].median()
data.Age.loc[(data.Age.isnull()) & (data.sons==1) & (data.Personal_Titles=='Kid')] = \
data.Age[(data.Age>=0) & (data.Personal_Titles=='Kid')].median()
data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Miss')] = \
data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Miss')].median()
data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Mr')] = \
data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Mr')].median()
data.Age.loc[(data.Age<0) & (data.sons==1) & (data.Personal_Titles=='Mrs')] = \
data.Age[(data.Age>=0) & (data.sons==1) & (data.Personal_Titles=='Mrs')].median()
data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Miss')] = \
data.Age[(data.Age>=0) & (data.relatives==1) & (data.Personal_Titles=='Miss')].median()
data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Mr')] = \
data.Age[(data.Age>=0) & (data.relatives==1) & (data.Personal_Titles=='Mr')].median()
data.Age.loc[(data.Age<0) & (data.relatives==1) & (data.Personal_Titles=='Mrs')] = \
data.Age[(data.Age>=0) & (data.relatives==1) & (data.Personal_Titles=='Mrs')].median()
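For reference, the manual fills above could be expressed more compactly with a groupby/transform over the personal title and relationship flags. A sketch of that alternative (stored in a hypothetical Age_alt_fill column so it does not overwrite the fills above; run before the manual fills it would replace them, and the medians may differ slightly):
# Sketch only: median age per Personal_Titles and relationship-flag group.
group_keys = ['Personal_Titles', 'parents', 'sons', 'relatives', 'companions', 'alone']
age_tmp = data.Age.where(data.Age >= 0)                      # treat the -1 marker as missing
group_median = age_tmp.groupby([data[k] for k in group_keys]).transform('median')
data['Age_alt_fill'] = age_tmp.fillna(group_median)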
Finally, we check how the age distribution looks after filling the nulls.
print('Age correlation with survived:',data.corr()['Survived'].Age)
g = sns.distplot(data.Age, fit=norm, label='With nulls filled')
plt.legend(loc='upper right')
plt.show()
To have a better understanding of age, its proportions and its relation to the survival rate, we bin it as follows:
def binningAge(df):
    # Binning Age based on custom ranges
    bin_ranges = [0, 1.7, 8, 15, 18, 25, 55, 65, 100]
    bin_names = [0, 1, 2, 3, 4, 5, 6, 7]
    df['Age_bin_custom_range'] = pd.cut(np.array(df.Age), bins=bin_ranges)
    df['Age_bin_custom_label'] = pd.cut(np.array(df.Age), bins=bin_ranges, labels=bin_names)
    return df
data = binningAge(data)
display(data[['Age', 'Age_bin_custom_range', 'Age_bin_custom_label']].sample(5))
display(pd.pivot_table(data[['Age_bin_custom_range', 'Survived']], aggfunc=np.count_nonzero,
index=['Survived'] , columns=['Age_bin_custom_range']))
charts('Age_bin_custom_label', data[(data.Survived>=0)])
One hot encode and drop provisional and useless features¶
One hot encode the categorical, non-ordinal data and drop the useless features.
data['genre'] = data.Sex.apply(lambda x: 1 if x=='male' else 0)
data.drop(['Name', 'Cabin', 'Ticket', 'Sex', 'same_tckt', 'qtd_same_ticket', 'parent_alone', 'grandparents',
'grandparent_alone', 'Age_bin_custom_range'], axis=1, inplace=True) # , 'Age', 'Parch', 'SibSp',
data = pd.get_dummies(data, columns = ['Cabin_Letter', 'Personal_Titles', 'Embarked', 'distinction_in_tikect'])
data = pd.get_dummies(data, columns = ['surname']) # 'Age_bin_custom_label'
data.drop(['surname_Exclude'], axis=1, inplace=True)
SciPy's pearsonr method computes both the correlation and the p-value for the correlation, roughly showing the probability of an uncorrelated system creating a correlation value of this magnitude.
from scipy.stats import pearsonr
pearsonr(data.loc[:, 'Pclass'], data.Survived)
Select Features¶
All of the features we find in the dataset might not be useful in building a machine learning model to make the necessary prediction. Using some of the features might even make the predictions worse.
Often in data science we have hundreds or even millions of features and we want a way to create a model that only includes the most important features. This has several benefits:
- It reduces the variance of the model, and therefore overfitting.
- It reduces the complexity of a model and makes it easier to interpret.
- It improves the accuracy of a model if the right subset is chosen.
- Finally, it reduces the computational cost (and time) of training a model.
So, an alternative way to reduce the complexity of the model and avoid overfitting is dimensionality reduction via feature selection, which is especially useful for unregularized models. There are two main categories of dimensionality reduction techniques: feature selection and feature extraction. Using feature selection, we select a subset of the original features. In feature extraction, we derive information from the feature set to construct a new feature subspace.
There are various methodologies and techniques that you can use to subset your feature space and help your models perform better and more efficiently. So, let's get started.
First check for any correlations between features¶
Correlation is a statistical term which in common usage refers to how close two features are to having a linear relationship with each other. Pearson's correlation measures the linear correlation between two features; the resulting value lies in [-1, 1], with -1 meaning perfect negative correlation (as one feature increases, the other decreases), +1 meaning perfect positive correlation and 0 meaning no linear correlation between the two features.
Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of the two features.
There are five assumptions that are made with respect to Pearson’s correlation:
- The feature must be either interval or ratio measurements.
- The variables must be approximately normally distributed.
- There is a linear relationship between the two variables.
- Outliers are either kept to a minimum or are removed entirely
- There is homoscedasticity of the data. Homoscedasticity basically means that the variances along the line of best fit remain similar as you move along the line.
One obvious drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to a linear relationship. If the relation is non-linear, Pearson correlation can be close to zero even if there is a one-to-one correspondence between the two variables. For example, the correlation between x and x² is zero when x is centered on 0.
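A quick numerical check of this claim (a minimal sketch using SciPy's pearsonr):
import numpy as np
from scipy.stats import pearsonr

x = np.linspace(-1, 1, 201)      # values centered on 0
r, p = pearsonr(x, x ** 2)       # perfect non-linear dependence, no linear one
print(round(r, 6))               # ~0.0: Pearson sees no linear relationship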
Furthermore, relying only on the correlation value on interpreting the relationship of two variables can be highly misleading, so it is always worth plotting the data as we did on the EDA phase.
The following guidelines help when interpreting Pearson's correlation coefficient (r):
Strength of Association | r (Positive) | r (Negative) |
---|---|---|
Small | 0.1 to 0.3 | -0.1 to -0.3 |
Medium | 0.3 to 0.5 | -0.3 to -0.5 |
Large | 0.5 to 1.0 | -0.5 to -1.0 |
The correlation matrix is identical to a covariance matrix computed from standardized data. The correlation matrix is a square matrix that contains the Pearson product-moment correlation coefficients (often abbreviated as Pearson’s r), which measure the linear dependence between pairs of features. Pearson’s correlation coefficient can simply be calculated as the covariance between two features x and y (numerator) divided by the product of their standard deviations (denominator):
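In symbols (a reconstruction of the formula this sentence refers to):
$$ r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x\,\sigma_y} = \frac{\sum_{i}(x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i}(x_i - \mu_x)^2}\,\sqrt{\sum_{i}(y_i - \mu_y)^2}} $$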
The covariance between standardized features is in fact equal to their linear correlation coefficient.
Let's check the highest correlations with Survived. I will now create a correlation matrix to quantify the linear relationships between the features. To do this I use pandas' corr and seaborn's heatmap function to plot the correlation matrix as a heat map.
corr = data.loc[:, 'Survived':].corr()
top_corr_cols = corr[abs(corr.Survived)>=0.06].Survived.sort_values(ascending=False).keys()
top_corr = corr.loc[top_corr_cols, top_corr_cols]
dropSelf = np.zeros_like(top_corr)
dropSelf[np.triu_indices_from(dropSelf)] = True
plt.figure(figsize=(15, 15))
sns.heatmap(top_corr, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=True, fmt=".2f", mask=dropSelf)
sns.set(font_scale=0.8)
plt.show()
Let's see whether we have more surnames with a correlation to Survived between 0.05 and 0.06.
display(corr[(abs(corr.Survived)>=0.05) & (abs(corr.Survived)<0.06)].Survived.sort_values(ascending=False).keys())
del corr, dropSelf, top_corr
Drop the features with highest correlations to other Features:¶
Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. And as you can see above, it is easy to find the highest collinearities (Personal_Titles_Mrs, Personal_Titles_Mr and Fare).
You should always be concerned about the collinearity, regardless of the model/method being linear or not, or the main task being prediction or classification.
Assume a number of linearly correlated covariates/features present in the data set and Random Forest as the method. Obviously, random selection per node may pick only (or mostly) collinear features which may/will result in a poor split, and this can happen repeatedly, thus negatively affecting the performance.
Now, the collinear features may be less informative of the outcome than the other (non-collinear) features and as such they should be considered for elimination from the feature set anyway. However, assume that the features are ranked high in the ‘feature importance’ list produced by RF. As such they would be kept in the data set unnecessarily increasing the dimensionality. So, in practice, I’d always, as an exploratory step (out of many related) check the pairwise association of the features, including linear correlation.
Identify and treat multicollinearity:¶
Multicollinearity is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model, leading to unreliable and unstable estimates of regression coefficients. To make matters worse, multicollinearity can emerge even when isolated pairs of variables are not collinear.
To identify it, we start with the coefficient of determination, r², which is the square of the Pearson correlation coefficient r. The coefficient of determination, with respect to correlation, is the proportion of the variance that is shared by both variables. It gives a measure of the amount of variation that can be explained by the model (the correlation is the model). It is sometimes expressed as a percentage (e.g., 36% instead of 0.36) when we discuss the proportion of variance explained by the correlation. However, you should not write r² = 36%, or any other percentage; you should write it as a proportion (e.g., r² = 0.36).
The Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. It may be calculated for each predictor by regressing that predictor on all the other predictors and obtaining the R² of that regression; the VIF is then 1/(1 - R²), i.e. the ratio of the variance of a given coefficient in the full model to the variance it would have if that predictor were completely uncorrelated with the others. Thus, a VIF of 1.8 tells us that the variance (the square of the standard error) of a particular coefficient is 80% larger than it would be if that predictor were completely uncorrelated with all the other predictors. The VIF has a lower bound of 1 but no upper bound. Authorities differ on how high the VIF has to be to constitute a problem (e.g. 2.50 (R² equal to 0.6), sometimes 5 (R² equal to 0.8), or greater than 10 (R² equal to 0.9), and so on).
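In symbols, for predictor j regressed on all the other predictors (with R_j² the R-squared of that auxiliary regression):
$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} $$
So, for example, R_j² ≈ 0.44 gives the VIF of about 1.8 mentioned above.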
But there are several situations in which multicollinearity can be safely ignored:
- Interaction terms and higher-order terms (e.g., squared and cubed predictors) are correlated with main effect terms because they include the main effect terms. Oops! Sometimes we do use polynomials to solve problems, indeed! But keep calm: in these cases, standardizing the predictors removes the multicollinearity.
- Indicator variables, like dummies or one-hot encodings, that represent a categorical variable with three or more categories. If the proportion of cases in the reference category is small, the indicators will necessarily have high VIFs, even if the categorical variable is not associated with other variables in the regression model. But you still need to check whether some dummy is collinear or multicollinear with features outside of its own dummy group.
- Control features, if the features of interest do not have high VIFs. Here's the thing about multicollinearity: it's only a problem for the features that are collinear. It increases the standard errors of their coefficients, and it may make those coefficients unstable in several ways. But as long as the collinear features are only used as control features, and they are not collinear with your features of interest, there's no problem: the coefficients of the features of interest are not affected, and the performance of the control features as controls is not impaired.
So, generally, we could run the same model twice, once with severe multicollinearity and once with moderate multicollinearity. This provides a great head-to-head comparison and it reveals the classic effects of multicollinearity. However, when standardizing your predictors doesn’t work, you can try other solutions such as:
- removing highly correlated predictors
- linearly combining predictors, such as adding them together
- running entirely different analyses, such as partial least squares regression or principal components analysis
When considering a solution, keep in mind that all remedies have potential drawbacks. If you can live with less precise coefficient estimates, or a model that has a high R-squared but few significant predictors, doing nothing can be the correct decision because it won’t impact the fit.
Given the potential for correlation among the predictors, we'll display the variance inflation factors (VIF), which indicate the extent to which multicollinearity is present in a regression analysis. Variables with a high VIF need to be removed from the model, and deleting one variable at a time and then re-checking the VIF of the model is the best way to do this.
So, I start the analysis by removing the 3 features with the highest collinearities, plus the surnames other than my control surname_Alone whose correlation with Survived is below 0.05, and then run the VIF.
#Step 1: Remove the higest correlations and run a multiple regression
cols = [ 'family',
'non_relatives',
'surname_Alone',
'surname_Baclini',
'surname_Carter',
'surname_Richards',
'surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
'surname_Moor', 'surname_Chambers', 'surname_Hamalainen',
'surname_Dick', 'surname_Taylor', 'surname_Doling', 'surname_Gordon',
'surname_Beane', 'surname_Hippach', 'surname_Bishop',
'surname_Mellinger', 'surname_Yarred',
'Pclass',
'Age',
'SibSp',
'Parch',
#'Fare',
'qtd_same_ticket_bin',
'passenger_fare',
#'SibSp_bin',
#'Parch_bin',
'distinction_in_name',
'Cabin_Number',
'sons',
'parents',
'relatives',
'companions',
'alone',
'Without_Age',
'Age_bin_custom_label',
'genre',
'Cabin_Letter_A',
'Cabin_Letter_B',
'Cabin_Letter_C',
'Cabin_Letter_D',
'Cabin_Letter_E',
'Cabin_Letter_F',
'Cabin_Letter_G',
'Personal_Titles_Kid',
'Personal_Titles_Miss',
#'Personal_Titles_Mr',
#'Personal_Titles_Mrs',
'Personal_Titles_Royalty',
'Personal_Titles_Technical',
'Embarked_C',
'Embarked_Q',
'Embarked_S',
'distinction_in_tikect_High',
'distinction_in_tikect_Low',
'distinction_in_tikect_Others',
'distinction_in_tikect_PC'
]
y_train = data.Survived[data.Survived>=0]
scale = StandardScaler(with_std=False)
df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
features = "+".join(cols)
df2 = pd.concat([y_train, df], axis=1)
# get y and X dataframes based on this regression:
y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')
#Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
#Step 3: Inspect VIF Factors
display(vif.sort_values('VIF Factor'))
From the results, I conclude that we can safely keep the Embarked dummies, but we need to work on the remaining features whose VIF is reported as inf. You can see that surname_Alone has a VIF of 2.2; we're going to treat it as our baseline and exclude it from our fit. This is done to prevent multicollinearity, or the dummy variable trap, caused by including a dummy variable for every single category. Let's try removing the dummy alone, which is pretty similar, and check whether this solves the other dummies from its category:
#Step 1: Remove one feature with VIF on Inf from the same category and run a multiple regression
cols.remove('alone')
y_train = data.Survived[data.Survived>=0]
scale = StandardScaler(with_std=False)
df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
features = "+".join(cols)
df2 = pd.concat([y_train, df], axis=1)
# get y and X dataframes based on this regression:
y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')
#Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
#Step 3: Inspect VIF Factors
display(vif.sort_values('VIF Factor'))
To solve Cabin_Letter, we can try removing only the lowest-frequency letter 'A' and see whether we can accept the VIFs of the other cabin letters:
#Step 1: Remove one feature with VIF on Inf from the same category and run a multiple regression
cols.remove('Cabin_Letter_A')
y_train = data.Survived[data.Survived>=0]
scale = StandardScaler(with_std=False)
df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
features = "+".join(cols)
df2 = pd.concat([y_train, df], axis=1)
# get y and X dataframes based on this regression:
y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')
#Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
#Step 3: Inspect VIF Factors
display(vif.sort_values('VIF Factor'))
Now our focus is on distinction_in_tikect: since 'High' has fewer observations, let's try dropping it (the Parch and SibSp bins were already left out of the feature list above).
cols.remove('distinction_in_tikect_High')
y_train = data.Survived[data.Survived>=0]
scale = StandardScaler(with_std=False)
df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
features = "+".join(cols)
df2 = pd.concat([y_train, df], axis=1)
# get y and X dataframes based on this regression:
y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')
#Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
#Step 3: Inspect VIF Factors
display(vif.sort_values('VIF Factor'))
As we can see, we now have to remove one of family, Parch and SibSp. Note that non_relatives and qtd_same_ticket_bin already have relatively acceptable VIFs; the first is directly calculated from family and the second is very close to it, as we have seen. So let's discard family.
cols.remove('family')
y_train = data.Survived[data.Survived>=0]
scale = StandardScaler(with_std=False)
df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived>=0, cols]), columns= cols)
features = "+".join(cols)
df2 = pd.concat([y_train, df], axis=1)
# get y and X dataframes based on this regression:
y, X = dmatrices('Survived ~' + features, data = df2, return_type='dataframe')
#Step 2: Calculate VIF Factors
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
#Step 3: Inspect VIF Factors
display(vif.sort_values('VIF Factor'))
Yes, these VIFs are acceptable, and we can proceed to the next step.
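As a side note, since we ran the same three steps four times, a small helper function keeps this iterative removal tidy (a sketch, reusing the data, cols, StandardScaler, dmatrices and variance_inflation_factor already in scope above):
def run_vif(cols):
    # Center the training rows, rebuild the design matrix and return the sorted VIFs
    y_train = data.Survived[data.Survived >= 0]
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data.loc[data.Survived >= 0, cols]), columns=cols)
    df2 = pd.concat([y_train, df], axis=1)
    y, X = dmatrices('Survived ~' + "+".join(cols), data=df2, return_type='dataframe')
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    return vif.sort_values('VIF Factor')

# e.g.: cols.remove('family'); display(run_vif(cols))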
Feature Selection by Filter Methods¶
Filter methods use statistical measures to evaluate a subset of features and are generally used as a preprocessing step. These methods are also known as univariate feature selection: they examine each feature individually to determine the strength of its relationship with the dependent variable. They are simple to run and understand and are in general particularly good for gaining a better understanding of the data, but not necessarily for optimizing the feature set for better generalization.
So, the features are selected on the basis of their scores in various statistical tests of their relationship with the outcome variable. 'Correlation' is a loose term here; for basic guidance, you can refer to the following table for choosing the appropriate test.
Feature/Response | Continuous | Categorical |
---|---|---|
Continuous | Pearson’s Correlation | LDA |
Categorical | Anova | Chi-Square |
One thing that should be kept in mind is that filter methods do not remove multicollinearity. So, you must deal with multicollinearity of features as well before training models for your data.
There are a lot of different options for univariate selection. Some examples are:
- Model Based Ranking
- Mutual information and maximal information coefficient (MIC).
I did not approach the latter, because there has been some critique about MIC’s statistical power, i.e. the ability to reject the null hypothesis when the null hypothesis is false. This may or may not be a concern, depending on the particular dataset and its noisiness. If you have interest on this, in python, MIC is available in the minepy library.
Feature Selection by Model based ranking¶
We can use an arbitrary machine learning method to build a predictive model for the response variable using each individual feature, and measure the performance of each model.
In fact, this is already put to use with Pearson's correlation coefficient, since it is equivalent to the standardized regression coefficient used for prediction in linear regression. But this method is not good for selecting features with a non-linear relation to the dependent variable. For this there are a number of alternatives, for example tree-based methods (decision trees, random forests), linear models with basis expansion, etc. Tree-based methods are probably among the easiest to apply, since they can model non-linear relations well and don't require much tuning. The main thing to avoid is overfitting, so the depth of the tree(s) should be relatively small, and cross-validation should be applied.
rf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=101)
scores = []
for i in range(df.shape[1]):
    score = cross_val_score(rf, df.iloc[:, i:i+1], y_train, scoring="accuracy", cv=10)
    scores.append((round(np.mean(score), 3), cols[i]))
MBR = pd.DataFrame(sorted(scores, reverse=True), columns=['Score', 'Feature'])
g = MBR.iloc[:15, :].plot(x='Feature', kind='barh', figsize=(20,10), fontsize=12, grid=True)
plt.show()
MBR = MBR.iloc[:15, 1]
Feature Selection by SelectKBest:¶
In scikit-learn we find a variety of implementations oriented to classification tasks that select features according to the k highest scores; see some of them below:
- f_classif computes the ANOVA F-value for the provided sample.
- chi2 computes chi-squared stats between each non-negative feature and class.
- mutual_info_classif estimates mutual information for a discrete target variable.
The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
Another important point: if you use sparse data (for example, if we keep the one-hot encoding of surnames), chi2 and mutual_info_classif will deal with the data without making it dense.
Let’s see the SelectKBest of f_classif and chi2 for our data:
cols = pd.Index(cols)
skb = SelectKBest(score_func=f_classif, k=10)
skb.fit(df, y_train)
select_features_kbest = skb.get_support()
feature_f_clas = cols[select_features_kbest]
feature_f_clas_scores = [(item, score) for item, score in zip(cols, skb.scores_)]
print('Total features selected by f_classif Statistical Methods',len(feature_f_clas))
fig = plt.figure(figsize=(20,7))
f1 = fig.add_subplot(121)
g = pd.DataFrame(sorted(feature_f_clas_scores, key=lambda x: -x[1])[:len(feature_f_clas)], columns=['Feature','F-Class Score']).\
plot(x='Feature', kind='barh', title= 'F Class Score', fontsize=18, ax=f1, grid=True)
scale = MinMaxScaler()
df2 = scale.fit_transform(data.loc[data.Survived>=0, cols])
skb = SelectKBest(score_func=chi2, k=10)
skb.fit(df2, y_train)
select_features_kbest = skb.get_support()
feature_chi2 = cols[select_features_kbest]
feature_chi2_scores = [(item, score) for item, score in zip(cols, skb.scores_)]
print('Total features selected by chi2 Statistical Methods',len(feature_chi2))
f2 = fig.add_subplot(122)
g = pd.DataFrame(sorted(feature_chi2_scores, key=lambda x: -x[1])[:len(feature_chi2)], columns=['Feature','Chi2 Score']).\
plot(x='Feature', kind='barh', title= 'Chi2 Score', fontsize=18, ax=f2, grid=True)
SMcols = set(feature_f_clas).union(set(feature_chi2))
print("Extra features select by f_class:\n", set(feature_f_clas).difference(set(feature_chi2)), '\n')
print("Extra features select by chi2:\n", set(feature_chi2).difference(set(feature_f_clas)), '\n')
print("Intersection features select by f_class and chi2:\n",set(feature_f_clas).intersection(set(feature_chi2)), '\n')
print('Total number of features selected:', len(SMcols))
print(SMcols)
plt.tight_layout(); plt.show()
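For completeness, mutual_info_classif plugs into SelectKBest in exactly the same way (a quick sketch reusing df, cols and y_train from above):
from sklearn.feature_selection import SelectKBest, mutual_info_classif

skb_mi = SelectKBest(score_func=mutual_info_classif, k=10)
skb_mi.fit(df, y_train)
feature_mi = cols[skb_mi.get_support()]
print('Total features selected by mutual information:', len(feature_mi))
print(list(feature_mi))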
Wrapper Methods¶
In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem.
The two main disadvantages of these methods are :
- The increasing overfitting risk when the number of observations is insufficient.
- These methods are usually computationally very expensive.
Backward Elimination¶
In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no improvement is observed on removing a feature.
Below we will see two raw implementations of backward elimination: one that selects by p-values and another based on the accuracy of a model we submit to it.
Backward Elimination By P-values¶
The p-value, or probability value, or asymptotic significance, is the probability, for a given statistical model and assuming the null hypothesis is true, of obtaining a statistical summary (such as a test statistic) greater than or equal in magnitude to the observed results.
The null hypothesis is a general statement that there is no relationship between two measured phenomena.
For example, if the correlation is very small and, furthermore, the p-value is high, it is very likely that such a correlation would be observed in a dataset of this size purely by chance.
But you need to be careful how you interpret the statistical significance of a correlation. If your correlation coefficient has been determined to be statistically significant this does not mean that you have a strong association. It simply tests the null hypothesis that there is no relationship. By rejecting the null hypothesis, you accept the alternative hypothesis that states that there is a relationship, but with no information about the strength of the relationship or its importance.
Since removal of different features from the dataset will have different effects on the p-value for the dataset, we can remove different features and measure the p-value in each case. These measured p-values can be used to decide whether to keep a feature or not.
Next we fit a logit regression to check the result and select features based on their p-values:
logit_model=sm.Logit(y_train,df)
result=logit_model.fit(method='bfgs', maxiter=2000)
print(result.summary())
As expected, the p-values of the dummies are high. As before, I exclude the features with the highest p-values one by one and re-run until only p-values below 0.1 remain, but here I use a backward elimination process.
pv_cols = cols.values
def backwardElimination(x, Y, sl, columns):
    numVars = x.shape[1]
    for i in range(0, numVars):
        regressor = sm.Logit(Y, x).fit(method='bfgs', maxiter=2000, disp=False)
        maxVar = max(regressor.pvalues)  #.astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor.pvalues[j].astype(float) == maxVar):
                    columns = np.delete(columns, j)
                    x = x.loc[:, columns]

    print(regressor.summary())
    print('\nSelect {:d} features from {:d} by best p-values.'.format(len(columns), len(pv_cols)))
    print('The max p-value from the features selected is {:.3f}.'.format(maxVar))

    # odds ratios and 95% CI
    conf = np.exp(regressor.conf_int())
    conf['Odds Ratios'] = np.exp(regressor.params)
    conf.columns = ['2.5%', '97.5%', 'Odds Ratios']
    display(conf)

    return columns, regressor
SL = 0.1
df2 = scale.fit_transform(data.loc[data.Survived>=0, pv_cols])
df2 = pd.DataFrame(df2, columns = pv_cols)
pv_cols, Logit = backwardElimination(df2, y_train, SL, pv_cols)
From the results, we can highlight:
- We are fairly confident about some relationships with the probability of survival:
  - there is an inverse relationship with the class, age and gender (genre) of the passenger;
  - there is a positive relationship, from stronger to weaker, with non_relatives, kids, passenger fare and distinction in name.
- From the coefficients:
  - from non_relatives, we confirm that fellows or companions of families had better chances of survival;
  - as we saw in the EDA, passengers who embarked on S had a higher fatality rate;
  - the model can only express a single straight-line effect per variable, weighted by the others, so it indicates that the greater the number of passengers with the same ticket, the greater the likelihood of death. As seen in the ticket EDA this is only partly true: this coefficient does not capture the fact that the chances of survival actually increase from 1 to 3 passengers on the same ticket. As you can see above, the model balances this through the family-relationship variables. This is a good example of why you should not skip the EDA phase, and above all should not rely on a conclusion based on a single fact or point of view. Remember, models are not full truths; they have limited precision, confidence and accuracy!
Take the exponential of each coefficient to generate the odds ratios. This tells you how a one-unit increase or decrease in a variable affects the odds of surviving. For example, we can expect the odds of surviving to decrease by about 69.5% if the passenger embarked on S. Go back to the embarked EDA and see that this matches the stacked bar chart.
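As a quick illustration of that arithmetic (the coefficient value below is hypothetical, standing in for the Embarked_S coefficient in the summary above):
import numpy as np

coef_embarked_s = -1.19                               # hypothetical logit coefficient for Embarked_S
odds_ratio = np.exp(coef_embarked_s)                  # ~0.304
print('Odds change: {:.1%}'.format(odds_ratio - 1))   # ~ -69.6%: the odds of surviving drop by about 70%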
You must be wondering: after all, how accurate is this model?
Although we did not do cross-validation or even a train/validation split, let's take a look: we retrieve the probabilities generated by the model and take the opportunity to see how we can plot the results of models that return probabilities, so we can refine our perception far beyond simply evaluating p-values, coefficients, accuracy, etc.
pred = Logit.predict(df2[pv_cols])
train = data[data.Survived>=0]
train['proba'] = pred
train['Survived'] = y_train
y_pred = pred.apply(lambda x: 1 if x > 0.5 else 0)
print('Accuracy: {0:2.2%}'.format(accuracy_score(y_true=y_train, y_pred=y_pred)))
def plot_proba(continous, predict, discret, data):
    grouped = pd.pivot_table(data, values=[predict], index=[continous, discret], aggfunc=np.mean)
    colors = 'rbgyrbgy'
    for col in data[discret].unique():
        plt_data = grouped.loc[grouped.index.get_level_values(1)==col]
        plt.plot(plt_data.index.get_level_values(0), plt_data[predict], color=colors[int(col)])
    plt.xlabel(continous)
    plt.ylabel("Probabilities")
    plt.legend(np.sort(data[discret].unique()), loc='upper left', title=discret)
    plt.title("Probabilities with " + continous + " and " + discret)
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(231)
plot_proba('non_relatives', 'Survived', 'Pclass', train)
ax = fig.add_subplot(232)
plot_proba('non_relatives', 'Survived', 'genre', train)
ax = fig.add_subplot(233)
plot_proba('non_relatives', 'Survived', 'qtd_same_ticket_bin', train)
ax = fig.add_subplot(234)
plot_proba('qtd_same_ticket_bin', 'Survived', 'distinction_in_name', train)
ax = fig.add_subplot(235)
plot_proba('qtd_same_ticket_bin', 'Survived', 'Embarked_S', train)
ax = fig.add_subplot(236)
plot_proba('qtd_same_ticket_bin', 'Survived', 'parents', train)
plt.show()
Backward Elimination By Accuracy – A Sequential Backward Selection¶
Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The motivation behind feature selection algorithms is to automatically select a subset of features that are most relevant to the problem, to improve computational efficiency or to reduce the generalization error of the model by removing irrelevant features or noise, which can be useful for algorithms that don't support regularization.
Greedy algorithms make locally optimal choices at each stage of a combinatorial search problem and generally yield a suboptimal solution to the problem in contrast to exhaustive search algorithms, which evaluate all possible combinations and are guaranteed to find the optimal solution. However, in practice, an exhaustive search is often computationally not feasible, whereas greedy algorithms allow for a less complex, computationally more efficient solution.
SBS aims to reduce the dimensionality of the initial feature subspace with a minimum decay in performance of the classifier to improve upon computational efficiency. In certain cases, SBS can even improve the predictive power of the model if a model suffers from overfitting.
SBS sequentially removes features from the full feature subset until the new feature subspace contains the desired number of features. In order to determine which feature is to be removed at each stage, we need to define criterion function J that we want to minimize. The criterion calculated by the criterion function can simply be the difference in performance of the classifier after and before the removal of a particular feature. Then the feature to be removed at each stage can simply be defined as the feature that maximizes this criterion.
From interactive executions, I already know that I need to remove the surnames with less than 0.06 correlation.
So, let's see an example of SBS on our data:
class SBS():
    def __init__(self, estimator, k_features, scoring=accuracy_score, test_size=0.25, random_state=101):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.test_size, random_state=self.random_state)
        dim = X_train.shape[1]
        self.indices_ = list(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, X_test, y_test, self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores = []
            subsets = []
            for p in combinations(self.indices_, r=dim-1):
                score = self._calc_score(X_train, y_train, X_test, y_test, list(p))
                scores.append(score)
                subsets.append(list(p))

            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1
            self.scores_.append(scores[best])

        self.k_score_ = self.scores_[-1]
        return self

    def transform(self, X):
        return X.iloc[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train.iloc[:, indices], y_train)
        y_pred = self.estimator.predict(X_test.iloc[:, indices])
        score = self.scoring(y_test, y_pred)
        return score
knn = KNeighborsClassifier(n_neighbors=3)
sbs = SBS(knn, k_features=1)
df2 = df.drop(['surname_Harper', 'surname_Beckwith', 'surname_Goldenberg',
'surname_Moor', 'surname_Chambers', 'surname_Hamalainen',
'surname_Dick', 'surname_Taylor', 'surname_Doling', 'surname_Gordon',
'surname_Beane', 'surname_Hippach', 'surname_Bishop',
'surname_Mellinger', 'surname_Yarred'], axis = 1)
sbs.fit(df2, y_train)
print('Best Score:',max(sbs.scores_))
k_feat = [len(k) for k in sbs.subsets_]
fig = plt.figure(figsize=(10,5))
plt.plot(k_feat, sbs.scores_, marker='o')
#plt.ylim([0.7, max(sbs.scores_)+0.01])
plt.xlim([1, len(sbs.subsets_)])
plt.xticks(np.arange(1, len(sbs.subsets_)+1))
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid(b=1)
plt.show()
print('First best accuracy with:\n', list(df2.columns[sbs.subsets_[np.argmax(sbs.scores_)]]))
best_idx = max(np.arange(len(sbs.scores_))[np.array(sbs.scores_) == max(sbs.scores_)])
SBS = list(df2.columns[sbs.subsets_[best_idx]])
print('\nBest accuracy with {0:2d} features:\n{1:}'.format(len(SBS), SBS))
Select Features by Recursive Feature Elimination¶
The goal of Recursive Feature Elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.
RFE is based on the idea to repeatedly construct a model and choose either the best or worst performing feature, setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted.
Another option is the Sequential Feature Selector (SFS) from mlxtend, a separate Python library designed to work well with scikit-learn, which provides a selector that works a bit differently (a quick sketch is shown below).
RFE is computationally less complex, using the feature's weight coefficients (e.g., linear models) or feature importances (tree-based algorithms) to eliminate features recursively, whereas SFS eliminates (or adds) features based on a user-defined classifier/regression performance metric.
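For reference, a backward SFS with mlxtend would look roughly like this (a sketch, assuming the mlxtend package is installed; the cell after it runs scikit-learn's RFE instead):
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier

sfs = SFS(KNeighborsClassifier(n_neighbors=3), k_features=10, forward=False,
          floating=False, scoring='accuracy', cv=5, n_jobs=-1)
sfs = sfs.fit(df.values, y_train.values)
print('Selected features:', list(df.columns[list(sfs.k_feature_idx_)]))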
from sklearn.feature_selection import RFE
lr = LogisticRegression()
rfe = RFE(estimator=lr, step=1)
rfe.fit(df, y_train)
FRFE = cols[rfe.ranking_==1]
print('\nFeatures selected:\n',FRFE)
print('\n Total Features selected:',len(FRFE))
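scikit-learn also offers RFECV, which combines RFE with cross-validation to choose the number of features automatically (a sketch reusing df, cols and y_train from above):
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

rfecv = RFECV(estimator=LogisticRegression(), step=1, cv=10, scoring='accuracy')
rfecv.fit(df, y_train)
print('Optimal number of features:', rfecv.n_features_)
print('Features selected:\n', list(cols[rfecv.support_]))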
Select Features by Embedded Methods¶
In addition to returning performance itself, some models have, as part of their internal process, a step that selects the features that best fit their objective, and they return feature importances as well. Thus, they provide a straightforward path to feature selection and combine the qualities of filter and wrapper methods.
Some of the most popular examples of these methods are LASSO, RIDGE, SVM, Regularized trees, Memetic algorithm, and Random multinomial logit.
In the case of Random Forest and other tree-based models, we have these basic approaches implemented in the packages:
- Gini/Entropy Importance or Mean Decrease in Impurity (MDI)
- Permutation Importance or Mean Decrease in Accuracy
- Permutation with Shadow Features
- Gradient Boosting
Other models address the multicollinearity problem by adding constraints or a penalty term to regularize the fit. When there are multiple correlated features, as is the case with very many real-life datasets, the model becomes unstable, meaning that small changes in the data can cause large changes in the model coefficients, making model interpretation very difficult; the regularization terms mitigate this.
This applies to regression models like LASSO and RIDGE. In classifier cases, you can use SGDClassifier where you can set the loss parameter to ‘log’ for Logistic Regression or ‘hinge’ for SVM. In SGDClassifier you can set the penalty to either of ‘l1’, ‘l2’ or ‘elasticnet’ which is a combination of both.
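A minimal sketch of that idea, using SGDClassifier with an elastic-net penalty and letting SelectFromModel keep the features whose coefficients survive the regularization (the parameter values here are illustrative, not tuned):
from sklearn.linear_model import SGDClassifier
from sklearn.feature_selection import SelectFromModel

sgd = SGDClassifier(loss='log', penalty='elasticnet', alpha=1e-3, l1_ratio=0.7,
                    max_iter=1000, random_state=101)
sfm = SelectFromModel(sgd)
sfm.fit(df, y_train)
print('Features kept by the regularized model:\n', list(cols[sfm.get_support()]))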
Let’s start with more details and examples:
Feature Selection by Mean Decrease Impurity¶
There are two things to keep in mind when using the impurity-based ranking:
- Feature selection based on impurity reduction is biased towards preferring variables with more categories.
- It can lead to the incorrect conclusion that one of the features is a strong predictor while the others in the same group are unimportant, while actually they are very close in terms of their relationship with the independent variable.
The second one refers to the case when the dataset has two or more correlated features: from the point of view of the model, any of these correlated features can be used as the predictor, with no concrete preference of one over the others. But once one of them is used, the importance of the others is significantly reduced, since effectively the impurity they could remove has already been removed by the first feature. As a consequence, they will have a lower reported importance. This is not an issue when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features, but it can mislead us when interpreting the data. The effect of this phenomenon is somewhat reduced thanks to the random selection of features at each node creation, but in general the effect is not removed completely.
Random Forest is one of them; the reason is that the tree-based strategies used by random forests naturally rank features by how well they improve the purity of the node. This is the mean decrease in impurity over all trees, where for classification the impurity is typically either Gini impurity or information gain/entropy, and for regression it is variance. Thus, when training a tree, it can be computed how much each feature decreases the weighted impurity; for a forest, the impurity decrease from each feature can be averaged and the features ranked according to this measure.
Random forests are a popular method for feature ranking, since they are so easy to apply: in general they require very little feature engineering and parameter tuning and mean decrease impurity is exposed in most random forest libraries. But they come with their own gotchas, especially when data interpretation is concerned. With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories. As long as the gotchas are kept in mind, there really is no reason not to try them out on your data.
Then, we run a quick Random Forest to select the most important features:
rfc = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=101)
rfc.fit(df, y_train)
feature_importances = [(feature, score) for feature, score in zip(cols, rfc.feature_importances_)]
MDI = cols[rfc.feature_importances_>0.010]
print('Total features selected by Random Forest:',len(MDI))
g = pd.DataFrame(sorted(feature_importances, key=lambda x: -x[1])[:len(MDI)], columns=['Feature','Importance']).\
plot(x='Feature', kind='barh', figsize=(20,7), fontsize=18, grid=True)
plt.show()
Feature Selection by Mean Decrease Accuracy¶
The general idea is to permute the values of each feature and measure how much the permutation decreases the accuracy of the model. Clearly, for an unimportant feature the permutation should have little to no effect on model accuracy, while permuting an important feature should significantly decrease it.
This method is not directly exposed in sklearn, but it is straightforward to implement. Start by recording a baseline accuracy (classifier) or R² score (regressor) by passing a validation set or the out-of-bag (OOB) samples through the Random Forest. Then permute the column values of a single predictor feature, pass all test samples back through the Random Forest, and recompute the accuracy or R². The importance of that feature is the difference between the baseline and the drop in overall accuracy or R² caused by permuting the column. The permutation mechanism is much more computationally expensive than the mean decrease in impurity mechanism, but the results are more reliable.
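A bare-bones version of that permutation scheme could look like this (a sketch; permutation_importance here is our own helper, not a library function, and it assumes an already fitted classifier):
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

def permutation_importance(model, X_valid, y_valid, metric=accuracy_score):
    # Importance = baseline score minus the score after shuffling one column at a time
    baseline = metric(y_valid, model.predict(X_valid))
    importances = {}
    for col in X_valid.columns:
        saved = X_valid[col].copy()
        X_valid[col] = np.random.permutation(X_valid[col].values)
        importances[col] = baseline - metric(y_valid, model.predict(X_valid))
        X_valid[col] = saved          # restore the original column
    return pd.Series(importances).sort_values(ascending=False)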
The rfpimp package (in the src dir) provides it for Random Forest; let's see:
X_train, X_test, y, y_test = train_test_split(df, y_train , test_size=0.20, random_state=101)
# Add column of random numbers
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, n_jobs=-1, oob_score=True, random_state=101)
rf.fit(X_train, y)
imp = importances(rf, X_test, y_test, n_samples=-1) # permutation
MDA = imp[imp!=0].dropna().index
if 'random' in MDA:
    MDA = MDA.drop('random')
print('%d features are selected.' % len(MDA))
plot_importances(imp[imp!=0].dropna(), figsize=(20,7))
Feature Selection by Permutation with Shadow Features¶
Boruta randomly permutes variables like Permutation Importance does, but it permutes all variables at the same time and concatenates the shuffled features with the original ones. The concatenated result is used to fit the model.
Daniel Homola, who also wrote the Python version of Boruta, BorutaPy, gave a wonderful overview of the Boruta algorithm in his blog post:
“The shuffled features (a.k.a. shadow features) are basically noises with identical marginal distribution w.r.t the original feature. We count the times a variable performs better than the ‘best’ noise and calculate the confidence towards it being better than noise (the p-value) or not. Features which are confidently better are marked ‘confirmed’, and those which are confidently on par with noises are marked ‘rejected’. Then we remove those marked features and repeat the process until all features are marked or a certain number of iteration is reached.”
Although Boruta is a feature selection algorithm, we can use the order of confirmation/rejection as a way to rank the importance of features.
# NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
X = df.values
y = y_train.values.ravel()
# define random forest classifier, with utilising all cores and
# sampling in proportion to y labels
#rf = RandomForestClassifier(n_estimators=10, min_samples_leaf=5, n_jobs=-1, oob_score=True, random_state=101)
rf = ExtraTreesClassifier(n_estimators=100, max_depth=4, n_jobs=-1, oob_score=True, bootstrap=True, random_state=101)
# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=0, random_state=101)
# find all relevant features
feat_selector.fit(X, y)
shadow = cols[feat_selector.support_]
# check selected features
print('Features selected:',shadow)
# call transform() on X to filter it down to selected features
print('Data transformed has %d features' % feat_selector.n_features_) #feat_selector.transform(X).shape[1])
print('Check the selector ranking:')
display(pd.concat([pd.DataFrame(cols, columns=['Columns']),
pd.DataFrame(feat_selector.ranking_, columns=['Rank'])], axis=1).sort_values(by=['Rank']))
Feature Selection by Gradient Boosting¶
For the LightGBM model, the importance can be calculated in two ways: if 'split', the result contains the number of times the feature is used in the model; if 'gain', the result contains the total gain of the splits which use the feature.
On the XGBoost model the importance is calculated by:
- ‘weight’: the number of times a feature is used to split the data across all trees.
- ‘gain’: the average gain across all splits the feature is used in.
- ‘cover’: the average coverage across all splits the feature is used in.
- ‘total_gain’: the total gain across all splits the feature is used in.
- ‘total_cover’: the total coverage across all splits the feature is used in.
The first measure is split-based and is very similar to the one given by Gini importance, but it doesn't take the number of samples into account.
The second measure is gain-based. It’s basically the same as the Gini Importance implemented in R packages and in scikit-learn with Gini impurity replaced by the objective used by the gradient boosting model.
The cover, implemented exclusively in XGBoost, is counting the number of samples affected by the splits based on a feature.
In the XGBoost API these measures are exposed through Booster.get_score(fmap='', importance_type='weight'), which returns the importance of each feature under the chosen importance type.
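For instance, once an XGBClassifier has been fitted (like the model trained in the cell further below), both measures can be pulled from the underlying booster:
booster = model.get_booster()                               # assumes the fitted XGBClassifier `model`
imp_split = booster.get_score(importance_type='weight')     # split-based counts
imp_gain = booster.get_score(importance_type='gain')        # gain-based importances
print(sorted(imp_gain.items(), key=lambda kv: -kv[1])[:10]) # top 10 by gain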
The default measure of both XGBoost and LightGBM is the split-based one. I think this measure will be problematic if there are one or two features with strong signals and a few features with weak signals. The model will exploit the strong features in the first few trees and use the rest of the features to improve on the residuals, so the strong features will not look as important as they actually are. While setting a lower learning rate and using early stopping should alleviate the problem, also checking the gain-based measure may be a good idea.
Note that these measures are purely calculated using training data, so there’s a chance that a split creates no improvement on the objective in the holdout set. This problem is more severe than in the random forest since gradient boosting models are more prone to over-fitting.
Feature importance scores can be used for feature selection in scikit-learn.
This is done using the SelectFromModel class that takes a model and can transform a dataset into a subset with selected features.
This class can take a previously trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. This threshold is used when you call the transform() method on the SelectFromModel instance to consistently select the same features on the training dataset and the test dataset.
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
# split data into train and test sets
X_train, X_test, y, y_test = train_test_split(df, y_train, test_size=0.30, random_state=101)
# fit model on all training data
model = XGBClassifier(importance_type='gain', scale_pos_weight=((len(y)-y.sum())/y.sum()))
model.fit(X_train, y)
fig=plt.figure(figsize=(20,5))
ax = fig.add_subplot(121)
g = plot_importance(model, height=0.5, ax=ax)
# Using each unique importance as a threshold
thresholds = np.sort(np.unique(model.feature_importances_)) #np.sort(model.feature_importances_[model.feature_importances_>0])
best = 0
colsbest = 31
my_model = model
threshold = 0
for thresh in thresholds:
# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier(importance_type='gain', scale_pos_weight=((len(y)-y.sum())/y.sum()))
selection_model.fit(select_X_train, y)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, prediction