
About the Data¶

The analysis is based on the historical dataset of passengers of the RMS Titanic, which sank on April 15, 1912, after colliding with an iceberg. The dataset includes demographic and travel information such as age, sex, ticket class, family relationships on board, ticket price, and port of embarkation. The key variable is whether the passenger survived the disaster. In total, the Titanic carried over 2,200 people, of whom more than 1,500 perished, making this tragedy one of the most serious in maritime history.

Exploratory data analysis on the Titanic disaster¶

1. Preliminary Data Analysis¶

1. Basic information about the data¶

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1310 entries, 0 to 1309
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   float64
 1   survived   1309 non-null   float64
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   float64
 6   parch      1309 non-null   float64
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(7), object(7)
memory usage: 143.4+ KB

2. Let's take a look at the summary of all numeric columns.¶

            pclass     survived          age        sibsp        parch         fare        body
count  1309.000000  1309.000000  1046.000000  1309.000000  1309.000000  1308.000000  121.000000
mean      2.294882     0.381971    29.881135     0.498854     0.385027    33.295479  160.809917
std       0.837836     0.486055    14.413500     1.041658     0.865560    51.758668   97.696922
min       1.000000     0.000000     0.166700     0.000000     0.000000     0.000000    1.000000
25%       2.000000     0.000000    21.000000     0.000000     0.000000     7.895800   72.000000
50%       3.000000     0.000000    28.000000     0.000000     0.000000    14.454200  155.000000
75%       3.000000     1.000000    39.000000     1.000000     0.000000    31.275000  256.000000
max       3.000000     1.000000    80.000000     8.000000     9.000000   512.329200  328.000000
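  • We can observe that the average ticket price was about 33.3 (the fare column carries no currency symbol).

  • The summary statistics for the survived column are of limited use, because it is a binary flag: 1 means the person survived, 0 means they died, and NaN means unknown. Its mean of 0.38 is simply the share of passengers who survived.

  • We also see an average of 0.5 siblings/spouses (sibsp) and 0.39 parents/children (parch) per passenger; the median of both columns is 0, so most passengers travelled without family members.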

2. We are looking for duplicates and missing values¶

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64

We can see that there is a lack of information about:

  • the age of passengers,

  • cabin numbers,

  • body numbers, since many victims' bodies were never recovered and survivors have no body number,

  • lifeboats, because many passengers never boarded a lifeboat and did not survive,

  • final destinations.

2. We are looking for duplicates¶

0

No duplicates
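The two checks above, missing values per column and fully duplicated rows, can be reproduced with a few pandas calls. The following is a minimal sketch on a made-up toy frame, not the notebook's actual code, and the variable names are illustrative. The output above shows at least 1 missing value in every column, which is consistent with a single completely empty row (1310 index entries against at most 1309 non-null values), so the sketch counts such rows as well.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic table (values are made up for illustration).
df = pd.DataFrame({
    "name": ["A", "B", "C", np.nan],
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": [np.nan, "C85", np.nan, np.nan],
})

missing_per_column = df.isna().sum()            # missing values in each column
duplicate_rows = int(df.duplicated().sum())     # rows identical to an earlier row
empty_rows = int(df.isna().all(axis=1).sum())   # rows with no data at all

print(missing_per_column)
print("duplicates:", duplicate_rows, "empty rows:", empty_rows)
```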

3. We will fix the columns with missing values¶

1. We create a DataFrame where the missing age values are filled with the mean age of the passenger's class, and the missing fare values with the median fare of that class.¶

2. We then create a DataFrame by removing the rows where sex is missing.¶
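One possible way to carry out these two steps is a group-wise fill with `groupby(...).transform`, followed by `dropna`. This is a sketch under assumptions: the toy data is invented, and the names `filled` and `cleaned` are hypothetical, since the notebook's own code is not shown.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two gaps in age, one in fare, one row without sex.
df = pd.DataFrame({
    "pclass": [1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 2.0],
    "sex": ["female", "male", "male", "male", "female", "male", np.nan],
    "age": [30.0, 50.0, np.nan, 20.0, np.nan, 24.0, 40.0],
    "fare": [80.0, 60.0, 100.0, 8.0, np.nan, 7.0, 13.0],
})

filled = df.copy()
# Replace each missing age with the mean age of the passenger's class.
filled["age"] = filled["age"].fillna(
    filled.groupby("pclass")["age"].transform("mean")
)
# Replace each missing fare with the median fare of the passenger's class.
filled["fare"] = filled["fare"].fillna(
    filled.groupby("pclass")["fare"].transform("median")
)
# Drop rows where sex is unknown.
cleaned = filled.dropna(subset=["sex"])
print(cleaned)
```

Using `transform` keeps the result aligned with the original rows, so it can be passed straight to `fillna` without a merge.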

4. We analyze individual columns¶

1. Let's analyze how many people were on board, distinguishing them by gender.¶

No description has been provided for this image

We see that there were many more men on board than women: men made up 64.4% of the passengers, while women accounted for 35.6%.
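These percentages can be checked from the group sizes that appear later in the class-by-sex survival table (179 + 171 + 493 men and 144 + 106 + 216 women). A short sketch, with the counts entered by hand rather than read from the dataset:

```python
import pandas as pd

# Group sizes summed over the three classes (taken from the survived_count
# column of the class-by-sex table later in this analysis).
counts = pd.Series({"male": 179 + 171 + 493, "female": 144 + 106 + 216})

share = (counts / counts.sum() * 100).round(1)
print(share)
```

On the full DataFrame, `df["sex"].value_counts(normalize=True)` would give the same shares directly.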

2. We check how many women and how many men were in each class.¶

No description has been provided for this image

We can observe that:

  • in class 1 the numbers of men and women were closest: 179 men and 144 women, according to the group sizes reported later in the survival tables

  • in class 2 the majority were men: 171, against 106 women

  • in class 3 the vast majority were men: 493, against 216 women.
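Counts like the ones behind this chart can be obtained with a cross-tabulation. The sketch below uses a small invented sample; only the column names `pclass` and `sex` come from the dataset.

```python
import pandas as pd

# Toy sample (made up) with the same column names as the dataset.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "sex": ["female", "male", "male", "male", "male", "female"],
})

# Rows are ticket classes, columns are sexes, cells are passenger counts.
counts = pd.crosstab(df["pclass"], df["sex"])
print(counts)
```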

3. Let's check the data regarding ticket prices in the various classes¶

             fare
             mean   median       max  min
pclass
1.0     87.508992  60.0000  512.3292  0.0
2.0     21.179196  15.0458   73.5000  0.0
3.0     13.295480   8.0500   69.5500  0.0

1. Bar chart of ticket price data¶

No description has been provided for this image

We can notice that:

  • the maximum ticket price in 1st class, the most expensive class, was almost 6 times the average price in that class; in 2nd and 3rd class the maximum was also far above the average (about 3.5 and 5 times, respectively)

  • in every class the median is lower than the mean (60.0 against 87.5 in 1st class), so more than half of the tickets cost less than the average and a few expensive tickets pull the mean up

  • the minimum fare of 0 in every class shows that some passengers did not pay for the voyage.
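The relationship between the maximum, mean, and median fares can be checked directly from the per-class fare table above. The values below are copied from that table; the column names `max_to_mean` and `mean_above_median` are introduced here only for illustration.

```python
import pandas as pd

# Values copied from the fare table above (all tickets, including fare 0).
fares = pd.DataFrame(
    {
        "mean": [87.508992, 21.179196, 13.295480],
        "median": [60.0000, 15.0458, 8.0500],
        "max": [512.3292, 73.5000, 69.5500],
    },
    index=pd.Index([1, 2, 3], name="pclass"),
)

# How many times larger than the class mean is the most expensive ticket?
fares["max_to_mean"] = (fares["max"] / fares["mean"]).round(1)
# A mean above the median points to a right-skewed distribution.
fares["mean_above_median"] = fares["mean"] > fares["median"]
print(fares)
```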

2. Let's prepare the data that will show us only the tickets that were paid for.¶

             fare
             mean   median       max     min
pclass
1.0     89.447482  61.3792  512.3292  5.0000
2.0     21.648108  15.0500   73.5000  9.6875
3.0     13.370915   8.0500   69.5500  3.1708
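A table like this can be produced by filtering out zero fares before aggregating. The following is a sketch on invented data, assuming "paid" means `fare > 0`; the notebook's actual filter is not shown.

```python
import pandas as pd

# Toy data (made up): one free ticket in each class.
df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "fare": [0.0, 60.0, 100.0, 0.0, 8.0, 7.0],
})

stats = ["mean", "median", "max", "min"]
all_stats = df.groupby("pclass")["fare"].agg(stats)

# Keep only tickets with a positive fare, then aggregate again.
paid = df[df["fare"] > 0]
paid_stats = paid.groupby("pclass")["fare"].agg(stats)
print(paid_stats)
```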

3. The bar chart presents the prices of paid tickets only¶

No description has been provided for this image

We see that in each class the cheapest paid ticket cost far less than both the average ticket and the most expensive ticket.

4. Summary of information about paid tickets¶

        count       mean        std     min        25%      50%       75%       max
pclass
1.0     316.0  89.447482  80.259713  5.0000  31.682275  61.3792  108.9000  512.3292
2.0     271.0  21.648108  13.382064  9.6875  13.000000  15.0500   26.0000   73.5000
3.0     705.0  13.370915  11.476600  3.1708   7.750000   8.0500   15.2458   69.5500

In each of the 3 classes, the mean lies between the median and the 75th percentile, so more than half (but no more than three quarters) of the paid fares were at or below the average, and only a few tickets were significantly more expensive.

5. Histogram of ticket prices in classes¶

No description has been provided for this image

We can see that the three histograms have a similar shape: in each class, most of the tickets sold were among the cheaper ones in that class's price range.

4. Let's check the information about the ages of passengers from different classes.¶

1. Age histograms¶

No description has been provided for this image

Based on each of the 3 histograms, we can observe that in each class, the largest number of passengers were people aged around 20 to 30 years, while the smallest group consisted of seniors aged 65 to 70 years.

2. Detailed information regarding age¶

No description has been provided for this image

We can notice that:

  • the oldest passengers on board were 80 years old

  • in each class, the mean age is close to the median, so roughly half of the passengers in each class were no older than the class average.

5. Let's analyze the column with the survivors¶

1. We create a graph of how many men and women survived.¶

sex sum_survived
0 female 339.0
1 male 161.0
No description has been provided for this image

The chart shows that more women survived than men: 339 women and 161 men.

2. Analysis of the chances of survival of passengers in different classes¶

  1. Table of overall survival odds in classes
  pclass survived_sum survived_count percentage_survived
0 1.000000 200.000000 323 61.919505
1 2.000000 119.000000 277 42.960289
2 3.000000 181.000000 709 25.528914

We see that:

  • passengers in 1st class had a 62% chance of survival

  • passengers in 2nd class had a 43% chance of survival

  • passengers in 3rd class had a 26% chance of survival
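The percentage column is simply survivors divided by group size. As a quick check, the figures can be recomputed from the `survived_sum` and `survived_count` values in the table above; the variable names are illustrative.

```python
import pandas as pd

# Values copied from the per-class survival table above.
by_class = pd.DataFrame({
    "pclass": [1, 2, 3],
    "survived_sum": [200, 119, 181],
    "survived_count": [323, 277, 709],
})

by_class["percentage_survived"] = (
    by_class["survived_sum"] / by_class["survived_count"] * 100
)
# Overall survival rate across all three classes.
overall = by_class["survived_sum"].sum() / by_class["survived_count"].sum() * 100
print(by_class.round(1))
print(round(overall, 1))
```

The overall rate agrees with the mean of the survived column in the summary statistics (0.381971).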

  1. Survival odds chart in the individual classes
No description has been provided for this image

We see that the passengers in 1st class had the highest percentage chance of survival, while those in 3rd class had the lowest.

3. Chart of survivors in different classes differentiated by gender¶

  1. Auxiliary table
  pclass sex survived_sum survived_count percentage_survived
0 1.000000 female 139.000000 144 96.527778
1 1.000000 male 61.000000 179 34.078212
2 2.000000 female 94.000000 106 88.679245
3 2.000000 male 25.000000 171 14.619883
4 3.000000 female 106.000000 216 49.074074
5 3.000000 male 75.000000 493 15.212982

Based on the table, we see that about 97% of the first-class women survived the disaster.

  1. A chart showing the survival chances of women and men
No description has been provided for this image

We can observe that:

  • women had a much greater chance of surviving the disaster, especially in classes 1 and 2, where around or over 90% of women survived

  • men, regardless of class, had a small chance of survival: less than 35%.
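Tables with the columns `survived_sum`, `survived_count`, and `percentage_survived` can be built with one grouped aggregation. The helper below is a hypothetical sketch (the name `survival_table` does not come from the notebook), shown on invented data.

```python
import pandas as pd

def survival_table(df, by):
    """Survivors, group size, and survival percentage per group (hypothetical helper)."""
    table = (
        df.groupby(by)["survived"]
        .agg(survived_sum="sum", survived_count="count")
        .reset_index()
    )
    table["percentage_survived"] = (
        table["survived_sum"] / table["survived_count"] * 100
    )
    return table

# Toy data (made up) with the dataset's column names.
df = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "sex": ["female", "female", "male", "male", "female", "female", "male", "male"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

by_class_sex = survival_table(df, ["pclass", "sex"])
print(by_class_sex)
```

Passing a different list of columns as `by` gives the same kind of table for any other grouping.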

5. Finding patterns and correlations in data¶

1. Study of the correlation for passengers regarding their age¶

  1. A table illustrating the survival chances of all passengers, taking age into account.
  Age_group Age_group_number pclass survived_sum survived_count percentage_survived
0 0-18 1 1.000000 18.000000 21 85.714286
1 0-18 1 2.000000 31.000000 43 72.093023
2 0-18 1 3.000000 46.000000 132 34.848485
3 18-29 2 1.000000 40.000000 57 70.175439
4 18-29 2 2.000000 46.000000 114 40.350877
5 18-29 2 3.000000 106.000000 426 24.882629
6 30-44 3 1.000000 82.000000 132 62.121212
7 30-44 3 2.000000 31.000000 85 36.470588
8 30-44 3 3.000000 24.000000 123 19.512195
9 45-59 4 1.000000 50.000000 87 57.471264
10 45-59 4 2.000000 10.000000 27 37.037037
11 45-59 4 3.000000 4.000000 22 18.181818
12 60+ 5 1.000000 10.000000 26 38.461538
13 60+ 5 2.000000 1.000000 8 12.500000
14 60+ 5 3.000000 1.000000 6 16.666667
  1. A chart illustrating the survival chances of all passengers, taking age into account.

With this many observations, we can say that in class 3 very few passengers survived the disaster: only about one in five of those aged 30-44.

No description has been provided for this image

On the chart, we observe that:

  • passengers aged 0-18 in classes 1 and 2 had high survival chances of 86% and 72% respectively

  • in contrast, passengers aged 60+ had the lowest chances of surviving the disaster of all age categories within the same class.
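Age groups such as those in the table can be created with `pd.cut`. The bin edges below are an assumption: the labels "0-18" and "18-29" overlap as written, so this sketch treats the intervals as left-closed (18 falls into "18-29"). The ages are invented.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.5, 17, 18, 29, 30, 44.5, 59, 60, 80], name="age")

# Assumed edges: [0, 18), [18, 30), [30, 45), [45, 60), [60, inf).
bins = [0, 18, 30, 45, 60, np.inf]
labels = ["0-18", "18-29", "30-44", "45-59", "60+"]
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Numeric group index starting at 1, like the Age_group_number column.
age_group_number = age_group.cat.codes + 1
print(pd.DataFrame({"age": ages, "Age_group": age_group,
                    "Age_group_number": age_group_number}))
```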

2. Let's now check how the correlation looks when we split the passengers by sex¶

1. Survival chances among male passengers¶
  1. Survival odds table for men
  pclass Age_group survived_sum survived_count percentage_survived
0 1.000000 0-18 6.000000 8 75.000000
1 1.000000 18-29 11.000000 27 40.740741
2 1.000000 30-44 18.000000 47 38.297872
3 1.000000 45-59 16.000000 52 30.769231
4 1.000000 60+ 2.000000 17 11.764706
5 2.000000 0-18 11.000000 22 50.000000
6 2.000000 18-29 6.000000 57 10.526316
7 2.000000 30-44 5.000000 56 8.928571
8 2.000000 45-59 0.000000 16 0.000000
9 2.000000 60+ 1.000000 7 14.285714
10 3.000000 0-18 15.000000 73 20.547945
11 3.000000 18-29 30.000000 164 18.292683
12 3.000000 30-44 13.000000 93 13.978495
13 3.000000 45-59 1.000000 14 7.142857
14 3.000000 60+ 0.000000 5 0.000000

Based on the table, we need to consider whether groups of only 8, 7, and 5 passengers (class 1 aged 0-18, and classes 2 and 3 aged 60+) are large enough to draw firm conclusions.

  1. A chart showing the survival chances of men
No description has been provided for this image

We see that:

  • men aged 0-18 in class 1 had about a 75% chance of surviving the disaster

  • of the men aged 0-18 in class 2, only about 50% survived

  • male passengers from classes 2 and 3 who were 18 or older had a survival chance below 20%.

2. Analysis for the female part of the passengers¶
  1. Table of survival chances for women
  pclass Age_group survived_sum survived_count percentage_survived
0 1.000000 0-18 12.000000 13 92.307692
1 1.000000 18-29 29.000000 30 96.666667
2 1.000000 30-44 45.000000 46 97.826087
3 1.000000 45-59 34.000000 35 97.142857
4 1.000000 60+ 8.000000 9 88.888889
5 2.000000 0-18 20.000000 21 95.238095
6 2.000000 18-29 36.000000 41 87.804878
7 2.000000 30-44 26.000000 29 89.655172
8 2.000000 45-59 10.000000 11 90.909091
9 2.000000 60+ 0.000000 1 0.000000
10 3.000000 0-18 31.000000 59 52.542373
11 3.000000 18-29 26.000000 54 48.148148
12 3.000000 30-44 11.000000 30 36.666667
13 3.000000 45-59 3.000000 8 37.500000
14 3.000000 60+ 1.000000 1 100.000000

With only one passenger to observe, we should not pay much attention to the fact that in class 3, 100% of female passengers aged 60+ survived.

  1. The chart of women's survival chances
No description has been provided for this image

Looking at the chart, we can state that:

  • women in 1st and 2nd class had a very high chance of survival, over 85% in every age group; the only exception is the 60+ group in class 2, which consisted of a single passenger who did not survive

  • meanwhile, in class 3, women had a less than 55% chance of survival (again setting aside the single passenger aged 60+).

3. Let's check the correlations between ticket prices and the survival of passengers.¶

1. We check the dependence of ticket prices on the chances of survival¶

  1. A table of survival chances looking at ticket prices
  fare_groups pclass survived_sum survived_count percentage_survived
0 0-19 1.000000 1.000000 8 12.500000
1 0-19 2.000000 52.000000 152 34.210526
2 0-19 3.000000 156.000000 593 26.306914
3 100-310 1.000000 56.000000 80 70.000000
4 20-39 1.000000 48.000000 94 51.063830
5 20-39 2.000000 61.000000 109 55.963303
6 20-39 3.000000 19.000000 89 21.348315
7 40-79 1.000000 64.000000 104 61.538462
8 40-79 2.000000 6.000000 16 37.500000
9 40-79 3.000000 6.000000 27 22.222222
10 80-99 1.000000 27.000000 33 81.818182
11 >310 1.000000 4.000000 4 100.000000

The sample is too small (only 4 passengers) to claim that a passenger who paid more than 310 had a very high chance of survival.

  1. Chart of survival chances by ticket price
No description has been provided for this image

From the charts we can conclude that:

  • among first-class passengers who bought tickets for 80-99, 82% survived

  • first-class passengers who bought cheaper tickets, in the 0-39 price range, had lower survival chances than others in that class: 12.5% for fares of 0-19 (only 8 passengers) and 51% for fares of 20-39.
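The fare groups in the table appear in text order ("100-310" directly after "0-19"). One way to get them in numeric order is to bin with `pd.cut`, which yields an ordered categorical. This is a sketch on invented data; the exact bin edges are assumed from the group labels.

```python
import numpy as np
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({
    "fare": [7.25, 26.0, 75.0, 90.0, 150.0, 512.33, 8.05],
    "survived": [0, 1, 0, 1, 1, 1, 1],
})

# Assumed left-closed edges matching the labels used in the table.
bins = [0, 20, 40, 80, 100, 310, np.inf]
labels = ["0-19", "20-39", "40-79", "80-99", "100-310", ">310"]
df["fare_groups"] = pd.cut(df["fare"], bins=bins, labels=labels, right=False)

# Groups come out in category order, not alphabetical order.
rates = df.groupby("fare_groups", observed=True)["survived"].mean().mul(100)
print(rates)
```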

4. Let's analyze the size of the passenger's family: did the number of siblings or spouses on board affect the chances of survival?¶

1. Survival chances table considering the number of spouses/siblings¶

  sibsp pclass survived_sum survived_count percentage_survived
0 0.000000 1.000000 111.000000 198 56.060606
1 0.000000 2.000000 69.000000 182 37.912088
2 0.000000 3.000000 129.000000 511 25.244618
3 1.000000 1.000000 79.000000 113 69.911504
4 1.000000 2.000000 43.000000 82 52.439024
5 1.000000 3.000000 41.000000 124 33.064516
6 2.000000 1.000000 7.000000 8 87.500000
7 2.000000 2.000000 6.000000 12 50.000000
8 2.000000 3.000000 6.000000 22 27.272727
9 3.000000 1.000000 3.000000 4 75.000000
10 3.000000 2.000000 1.000000 1 100.000000
11 3.000000 3.000000 2.000000 15 13.333333
12 4.000000 3.000000 3.000000 22 13.636364
13 5.000000 3.000000 0.000000 6 0.000000
14 8.000000 3.000000 0.000000 9 0.000000

Looking at the table, we should not read too much into the 100% survival chance for class 2 passengers with 3 spouses/siblings on board, since there was only 1 such passenger to observe.

2. The survival chance chart considering the number of spouses/siblings¶

No description has been provided for this image

We can notice here that among the first-class passengers who had 2 siblings/spouses on board, 88% survived, although there were only 8 such passengers.

5. The impact of the number of parents/children on the survival chances of a passenger¶

1. Survival chance table considering the number of parents/children¶

  parch pclass survived_sum survived_count percentage_survived
0 0.000000 1.000000 141.000000 242 58.264463
1 0.000000 2.000000 64.000000 206 31.067961
2 0.000000 3.000000 131.000000 554 23.646209
3 1.000000 1.000000 36.000000 50 72.000000
4 1.000000 2.000000 31.000000 43 72.093023
5 1.000000 3.000000 33.000000 77 42.857143
6 2.000000 1.000000 21.000000 27 77.777778
7 2.000000 2.000000 21.000000 25 84.000000
8 2.000000 3.000000 15.000000 61 24.590164
9 3.000000 1.000000 1.000000 2 50.000000
10 3.000000 2.000000 3.000000 3 100.000000
11 3.000000 3.000000 1.000000 3 33.333333
12 4.000000 1.000000 1.000000 2 50.000000
13 4.000000 3.000000 0.000000 4 0.000000
14 5.000000 3.000000 1.000000 6 16.666667
15 6.000000 3.000000 0.000000 2 0.000000
16 9.000000 3.000000 0.000000 2 0.000000

We should not take the percentage survival rates literally for passengers who had between 3 and 9 parents/children on board, as there were very few such passengers.
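One way to make this caution explicit is to flag groups that fall below a minimum size. The sketch below uses a few rows copied from the parch table above; the cut-off of 10 passengers and the column name `enough_data` are assumptions chosen only for illustration.

```python
import pandas as pd

# A few rows copied from the parch table above.
table = pd.DataFrame({
    "parch": [0, 2, 3, 9],
    "pclass": [1, 2, 2, 3],
    "survived_sum": [141, 21, 3, 0],
    "survived_count": [242, 25, 3, 2],
})

MIN_GROUP_SIZE = 10  # assumed cut-off, not taken from the analysis

table["percentage_survived"] = (
    table["survived_sum"] / table["survived_count"] * 100
)
# Mark groups large enough for the percentage to be worth interpreting.
table["enough_data"] = table["survived_count"] >= MIN_GROUP_SIZE
print(table)
```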

2. The survival rate chart considering the number of parents/children¶

No description has been provided for this image

We can notice here that among the second-class passengers who had 2 children/parents on board, 84% of them survived.

6. We are looking for outlier values¶

1. Box plot for all passengers¶

No description has been provided for this image

We see that:

  • there are many outlier values in the age column, which means that there were also elderly people on the ship,

  • there are outlier values in the sibsp and parch columns, which means that there were large families on the ship,

  • we see many outlier values for the fare column, which means that many tickets for the ship were significantly more expensive than normal tickets.
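The points a box plot draws as outliers are conventionally those lying more than 1.5 times the interquartile range beyond the quartiles. Assuming the plots here use that common default (the plotting code is not shown), the same values can be listed numerically. The helper name `iqr_outliers` and the sample fares are made up.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Toy fares (made up): one ticket far more expensive than the rest.
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 31.0, 512.0], name="fare")
outliers = iqr_outliers(fares)
print(outliers)
```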

2. Box plot for third class passengers¶

No description has been provided for this image

We can notice that the outliers for third class passengers are similar to the outliers for all passengers, but we can observe that in this class there was no ticket that deviated significantly in price from the others.

3. Box plot for Class 2 passengers¶

No description has been provided for this image

In class 2, we observe that there were very small children on board, and there were not many prices deviating from the normal ticket prices.

4. Box plot for first class passengers¶

No description has been provided for this image

Some ticket prices in first class were significantly higher than the other ticket prices in that class.

7. Summary¶

In summary, we found in the data that:

  • there were more men than women on board the ship

  • 500 of the 1,309 passengers in the dataset survived the disaster

  • the survival rate was highest in 1st class (62%) and lowest in 3rd class (26%)

  • more women survived than men

  • women in 1st and 2nd class had very high chances of survival, around 90%, regardless of age

  • in 3rd class, the chances of survival for women were significantly lower, ranging from about 53% down to 37% depending on age

  • men had low chances of survival in every class, below 35%

  • men from 1st class aged 0-18 survived at a rate of 75%, but there were only 8 men of that age in 1st class

  • a fairly large proportion of 1st and 2nd class passengers who had 1 or 2 parents/children on board survived: roughly 72% to 84%.