About the Data¶
The analysis is based on the historical passenger dataset for RMS Titanic, which sank on April 15, 1912, after colliding with an iceberg. The dataset includes demographic and travel information such as age, sex, ticket class, family relationships on board, fare, and port of embarkation. The key variable is whether the passenger survived the disaster. In total, Titanic carried over 2,200 people, of whom more than 1,500 perished, making this tragedy one of the deadliest in maritime history.
Exploratory data analysis on the Titanic disaster¶
1. Preliminary Data Analysis¶
1. Basic information about the data¶
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1310 entries, 0 to 1309
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   pclass     1309 non-null   float64
 1   survived   1309 non-null   float64
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   float64
 6   parch      1309 non-null   float64
 7   ticket     1309 non-null   object
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object
 10  embarked   1307 non-null   object
 11  boat       486 non-null    object
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object
dtypes: float64(7), object(7)
memory usage: 143.4+ KB
2. Let's take a look at the summary of all numeric columns.¶
| | pclass | survived | age | sibsp | parch | fare | body |
---|---|---|---|---|---|---|---|
count | 1309.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 | 121.000000 |
mean | 2.294882 | 0.381971 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
std | 0.837836 | 0.486055 | 14.413500 | 1.041658 | 0.865560 | 51.758668 | 97.696922 |
min | 1.000000 | 0.000000 | 0.166700 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 2.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 | 72.000000 |
50% | 3.000000 | 0.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 155.000000 |
75% | 3.000000 | 1.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 | 256.000000 |
max | 3.000000 | 1.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 | 328.000000 |
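We can observe that the average fare was about £33.3 (fares in this dataset are recorded in pounds sterling, not dollars), while the median was only 14.45, so a small number of expensive tickets pulls the mean up.
Most summary statistics for the survived column are not meaningful here, because it is a binary flag: 1 means the person survived, 0 means they died, and NaN means unknown. The exception is the mean, 0.382, which is the share of passengers who survived (38.2%).
We also see an average of 0.5 siblings/spouses (sibsp) and 0.39 parents/children (parch) per passenger. The median of both columns is 0, meaning that most passengers travelled without family members.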
2. We are looking for duplicates and missing values¶
pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
We can see that information is missing mainly for:
the age of passengers (264 values),
cabin numbers (1015),
body numbers (1189), because many victims' bodies were never recovered and survivors have no body number,
lifeboats (824), because most passengers never boarded a lifeboat and did not survive,
final destinations (565).
2. We are looking for duplicates¶
0
No duplicates
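The two checks above might be produced along these lines. This is a minimal sketch on a small made-up frame: the `sample` data is hypothetical and only the column names come from the dataset, since the notebook's own code is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the column names come from the dataset.
sample = pd.DataFrame({
    "pclass": [1.0, 3.0, 3.0, np.nan],
    "sex": ["female", "male", "male", np.nan],
    "age": [29.0, np.nan, 22.0, np.nan],
    "cabin": ["B5", np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)
```

A fully empty row, like the last one in this sample, adds one missing value to every column, which would explain why even pclass and sex show a count of 1 in the output above.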
3. We will fix the columns with missing values¶
1. We create a DataFrame in which missing ages are filled with the mean age of the passenger's class, and missing fares with the median fare of that class.¶
2. We then create a DataFrame without the rows in which sex is missing.¶
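The cleaning steps described in this section could be sketched as follows. The `sample` frame is hypothetical, and the order of the steps (dropping rows first, then filling) is an assumption; the notebook's actual code may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the column names come from the dataset.
sample = pd.DataFrame({
    "pclass": [1.0, 1.0, 1.0, 3.0, 3.0, 3.0, np.nan],
    "sex": ["female", "male", "male", "male", "female", "male", np.nan],
    "age": [30.0, 50.0, np.nan, 20.0, np.nan, 24.0, np.nan],
    "fare": [80.0, 60.0, 100.0, 8.0, 7.0, np.nan, np.nan],
})

# Drop rows with no recorded sex (here, the fully empty last row).
cleaned = sample.dropna(subset=["sex"]).copy()

# Fill missing ages with the mean age of the passenger's class.
cleaned["age"] = cleaned["age"].fillna(
    cleaned.groupby("pclass")["age"].transform("mean")
)

# Fill missing fares with the median fare of the passenger's class.
cleaned["fare"] = cleaned["fare"].fillna(
    cleaned.groupby("pclass")["fare"].transform("median")
)
print(cleaned)
```

Using `transform` returns a value for every row, aligned with the original index, so `fillna` only touches the rows where age or fare was missing.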
4. We analyze individual columns¶
1. Let's analyze how many people were on board, distinguishing them by gender.¶
We see that there were many more men on board than women: men made up 64.4% of the passengers, while women accounted for 35.6%.
2. We check how many women and how many men were in each class.¶
We can observe that:
class 1 was the most balanced, with about 180 men and about 145 women,
in class 2 the majority were men: about 170, compared with around 105 women,
in class 3 the vast majority were men: about 495, compared with about 215 women.
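Counts and shares like the ones above can be computed directly rather than read off a chart. A minimal sketch, assuming a frame with `pclass` and `sex` columns; the `sample` rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; only the column names come from the dataset.
sample = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "sex": ["female", "male", "female", "male", "male",
            "female", "male", "male", "male", "male"],
})

# Passenger counts per class (rows) and sex (columns).
counts = pd.crosstab(sample["pclass"], sample["sex"])
print(counts)

# Share of each sex across the whole sample, in percent.
shares = sample["sex"].value_counts(normalize=True) * 100
print(shares.round(1))
```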
3. Let's check the data regarding ticket prices in the various classes¶
pclass | fare mean | fare median | fare max | fare min |
---|---|---|---|---|
1.0 | 87.508992 | 60.0000 | 512.3292 | 0.0 |
2.0 | 21.179196 | 15.0458 | 73.5000 | 0.0 |
3.0 | 13.295480 | 8.0500 | 69.5500 | 0.0 |
1. Bar chart of ticket price data¶
We can notice that:
the maximum fare in 1st class, the best class, was very high: almost 6 times the average fare in this class; in 2nd class the maximum was also well above the average (about 3.5 times),
the median in every class is below the mean (in 1st class 60.0 against 87.5), which shows that most tickets cost less than the average and that a few expensive tickets pull the mean up,
the minimum fare is 0 in every class, so some passengers on the ship did not pay for the cruise.
2. Let's prepare the data that will show us only the tickets that were paid for.¶
pclass | fare mean | fare median | fare max | fare min |
---|---|---|---|---|
1.0 | 89.447482 | 61.3792 | 512.3292 | 5.0000 |
2.0 | 21.648108 | 15.0500 | 73.5000 | 9.6875 |
3.0 | 13.370915 | 8.0500 | 69.5500 | 3.1708 |
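A table of this kind might be built by filtering out zero fares before aggregating. This is a sketch under assumptions: the `sample` data is invented, and "paid" is taken to mean `fare > 0`.

```python
import pandas as pd

# Hypothetical sample; only the column names come from the dataset.
sample = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "fare": [0.0, 60.0, 500.0, 13.0, 26.0, 0.0, 8.05, 7.75],
})

# Keep only tickets with a positive fare (assumed meaning of "paid").
paid = sample[sample["fare"] > 0]

# Per-class summary of the paid fares.
summary = paid.groupby("pclass")["fare"].agg(["mean", "median", "max", "min"])
print(summary)
```

Replacing the `agg` call with `.describe()` would give a fuller per-class summary with counts and quartiles.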
3. The bar chart presents the prices of paid tickets only¶
We see that, even among paid tickets, the minimum fare in each class was far below both the average and the maximum fare of that class.
4. Summary of information about paid tickets¶
pclass | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
1.0 | 316.0 | 89.447482 | 80.259713 | 5.0000 | 31.682275 | 61.3792 | 108.9000 | 512.3292 |
2.0 | 271.0 | 21.648108 | 13.382064 | 9.6875 | 13.000000 | 15.0500 | 26.0000 | 73.5000 |
3.0 | 705.0 | 13.370915 | 11.476600 | 3.1708 | 7.750000 | 8.0500 | 15.2458 | 69.5500 |
In each of the 3 classes, most fares were at or below roughly the class average: even the 75th percentile is only slightly above the mean (for example 108.9 against 89.4 in 1st class), so only a few tickets were significantly more expensive.
5. Histogram of ticket prices in classes¶
We can see that the three histograms have a similar shape: in each class, the cheapest tickets in that class's price range were sold most often.
4. Let's check the information about the ages of passengers from different classes.¶
1. Age histograms¶
Based on the 3 histograms, we can observe that in each class the largest group of passengers was aged around 20 to 30 years, while the smallest group consisted of seniors aged 65 to 70 years.
2. Detailed information regarding age¶
We can notice that:
there were passengers as old as 80 on board,
in each class, the average age of passengers is similar to the median, meaning that about 50% of passengers in each class were at or below the average age.
5. Let's analyze the column with the survivors¶
1. We create a graph of how many men and women survived.¶
| | sex | sum_survived |
---|---|---|
0 | female | 339.0 |
1 | male | 161.0 |
The chart shows that more women survived than men: 339 women and 161 men survived.
2. Analysis of the chances of survival of passengers in different classes¶
- Table of overall survival odds in classes
| | pclass | survived_sum | survived_count | percentage_survived |
---|---|---|---|---|
0 | 1.000000 | 200.000000 | 323 | 61.919505 |
1 | 2.000000 | 119.000000 | 277 | 42.960289 |
2 | 3.000000 | 181.000000 | 709 | 25.528914 |
We see that:
passengers in 1st class had about a 62% chance of survival,
passengers in 2nd class had about a 43% chance of survival,
passengers in 3rd class had only about a 26% chance of survival.
- Survival odds chart in the individual classes
We see that the passengers in 1st class had the highest percentage chance of survival, while those in 3rd class had the lowest.
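The per-class survival table has three derived columns: the number of survivors, the group size, and their ratio in percent. One way to build it is sketched below on invented data; the output column names mirror the table, but the code itself is an assumption about how it was produced.

```python
import pandas as pd

# Hypothetical sample; only the column names come from the dataset.
sample = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Survivors and group size per class.
by_class = (
    sample.groupby("pclass")["survived"]
    .agg(survived_sum="sum", survived_count="count")
    .reset_index()
)

# Survival rate in percent.
by_class["percentage_survived"] = (
    100 * by_class["survived_sum"] / by_class["survived_count"]
)
print(by_class)
```

Grouping by a list such as `["pclass", "sex"]` instead of `"pclass"` gives the same three columns for every class and sex combination.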
3. Chart of survivors in different classes differentiated by gender¶
- Auxiliary table
| | pclass | sex | survived_sum | survived_count | percentage_survived |
---|---|---|---|---|---|
0 | 1.000000 | female | 139.000000 | 144 | 96.527778 |
1 | 1.000000 | male | 61.000000 | 179 | 34.078212 |
2 | 2.000000 | female | 94.000000 | 106 | 88.679245 |
3 | 2.000000 | male | 25.000000 | 171 | 14.619883 |
4 | 3.000000 | female | 106.000000 | 216 | 49.074074 |
5 | 3.000000 | male | 75.000000 | 493 | 15.212982 |
Based on the table, we see that about 97% of the women in first class survived the disaster.
- A chart showing the survival chances of women and men
We can observe that:
women had a much greater chance of surviving the disaster, especially in classes 1 and 2, where about 97% and 89% of women survived,
men, regardless of class, had a small chance of survival: it was below 35% in every class.
5. Finding patterns and correlations in data¶
1. Study of the correlation for passengers regarding their age¶
- A table illustrating the survival chances of all passengers, taking age into account.
| | Age_group | Age_group_number | pclass | survived_sum | survived_count | percentage_survived |
---|---|---|---|---|---|---|
0 | 0-18 | 1 | 1.000000 | 18.000000 | 21 | 85.714286 |
1 | 0-18 | 1 | 2.000000 | 31.000000 | 43 | 72.093023 |
2 | 0-18 | 1 | 3.000000 | 46.000000 | 132 | 34.848485 |
3 | 18-29 | 2 | 1.000000 | 40.000000 | 57 | 70.175439 |
4 | 18-29 | 2 | 2.000000 | 46.000000 | 114 | 40.350877 |
5 | 18-29 | 2 | 3.000000 | 106.000000 | 426 | 24.882629 |
6 | 30-44 | 3 | 1.000000 | 82.000000 | 132 | 62.121212 |
7 | 30-44 | 3 | 2.000000 | 31.000000 | 85 | 36.470588 |
8 | 30-44 | 3 | 3.000000 | 24.000000 | 123 | 19.512195 |
9 | 45-59 | 4 | 1.000000 | 50.000000 | 87 | 57.471264 |
10 | 45-59 | 4 | 2.000000 | 10.000000 | 27 | 37.037037 |
11 | 45-59 | 4 | 3.000000 | 4.000000 | 22 | 18.181818 |
12 | 60+ | 5 | 1.000000 | 10.000000 | 26 | 38.461538 |
13 | 60+ | 5 | 2.000000 | 1.000000 | 8 | 12.500000 |
14 | 60+ | 5 | 3.000000 | 1.000000 | 6 | 16.666667 |
- A chart illustrating the survival chances of all passengers, taking age into account.
With 123 passengers in the group, we can say with some confidence that in class 3 only about 1/5 of passengers aged 30-44 survived the disaster.
On the chart, we observe that passengers aged 0-18 in classes 1 and 2 had high survival chances of 86% and 72% respectively.
In contrast, passengers aged 60+ had the lowest chances of surviving the disaster when compared to other age categories within the same class.
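Age groups like those in the table can be created with `pd.cut`. The sketch below reuses the table's labels, but the exact bin edges and the choice of left-closed intervals are assumptions, and the `sample` data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the column names come from the dataset.
sample = pd.DataFrame({
    "age": [0.17, 17.0, 18.0, 29.0, 30.0, 44.5, 59.0, 60.0, 80.0],
    "survived": [1, 1, 0, 1, 0, 1, 0, 0, 1],
})

# Assumed bin edges; intervals are left-closed, so age 18 falls into "18-29".
edges = [0, 18, 30, 45, 60, np.inf]
labels = ["0-18", "18-29", "30-44", "45-59", "60+"]
sample["Age_group"] = pd.cut(
    sample["age"], bins=edges, labels=labels, right=False
)

# Survival rate in percent for each age group.
rates = sample.groupby("Age_group", observed=False)["survived"].mean() * 100
print(rates)
```

Because the labels "0-18" and "18-29" share the value 18, it matters which side of each interval is closed; `right=False` is one possible reading.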
2. Let's now check what the correlation looks like when we split the passengers by sex¶
1. Survival chances among male passengers¶
- Survival odds table for men
| | pclass | Age_group | survived_sum | survived_count | percentage_survived |
---|---|---|---|---|---|
0 | 1.000000 | 0-18 | 6.000000 | 8 | 75.000000 |
1 | 1.000000 | 18-29 | 11.000000 | 27 | 40.740741 |
2 | 1.000000 | 30-44 | 18.000000 | 47 | 38.297872 |
3 | 1.000000 | 45-59 | 16.000000 | 52 | 30.769231 |
4 | 1.000000 | 60+ | 2.000000 | 17 | 11.764706 |
5 | 2.000000 | 0-18 | 11.000000 | 22 | 50.000000 |
6 | 2.000000 | 18-29 | 6.000000 | 57 | 10.526316 |
7 | 2.000000 | 30-44 | 5.000000 | 56 | 8.928571 |
8 | 2.000000 | 45-59 | 0.000000 | 16 | 0.000000 |
9 | 2.000000 | 60+ | 1.000000 | 7 | 14.285714 |
10 | 3.000000 | 0-18 | 15.000000 | 73 | 20.547945 |
11 | 3.000000 | 18-29 | 30.000000 | 164 | 18.292683 |
12 | 3.000000 | 30-44 | 13.000000 | 93 | 13.978495 |
13 | 3.000000 | 45-59 | 1.000000 | 14 | 7.142857 |
14 | 3.000000 | 60+ | 0.000000 | 5 | 0.000000 |
Based on the table, we need to consider whether groups of 8, 7, and 5 passengers (class 1 aged 0-18, class 2 aged 60+, and class 3 aged 60+) are large enough to draw firm conclusions.
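One way to make the sample-size concern concrete is to attach an uncertainty range to each percentage. The sketch below uses the Wilson score interval, which is not part of the original analysis and is offered only as an illustration; the function name `wilson_interval` is hypothetical, and the counts are taken from the table above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to avoid tiny floating-point overshoots.
    return max(0.0, centre - half), min(1.0, centre + half)


# (label, survivors, group size) taken from the table for men.
groups = [
    ("class 1, 0-18", 6, 8),
    ("class 2, 60+", 1, 7),
    ("class 3, 60+", 0, 5),
    ("class 3, 18-29", 30, 164),
]
for label, survivors, size in groups:
    low, high = wilson_interval(survivors, size)
    print(f"{label}: {survivors}/{size} -> {100 * low:.0f}% to {100 * high:.0f}%")
```

For the 8 first-class boys the interval spans roughly 41% to 93%, so the 75% figure is compatible with very different underlying survival chances, whereas the interval for the 164 third-class men aged 18-29 is much narrower.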
- A chart showing survival chances for men
We see that:
men aged 0-18 in class 1 had about a 75% chance of surviving the disaster,
in class 2, only about 50% of men aged 0-18 survived,
male passengers from classes 2 and 3 who were older than 18 had a survival chance below 19%.
2. Analysis for the female part of the passengers¶
- Table of survival chances for women
| | pclass | Age_group | survived_sum | survived_count | percentage_survived |
---|---|---|---|---|---|
0 | 1.000000 | 0-18 | 12.000000 | 13 | 92.307692 |
1 | 1.000000 | 18-29 | 29.000000 | 30 | 96.666667 |
2 | 1.000000 | 30-44 | 45.000000 | 46 | 97.826087 |
3 | 1.000000 | 45-59 | 34.000000 | 35 | 97.142857 |
4 | 1.000000 | 60+ | 8.000000 | 9 | 88.888889 |
5 | 2.000000 | 0-18 | 20.000000 | 21 | 95.238095 |
6 | 2.000000 | 18-29 | 36.000000 | 41 | 87.804878 |
7 | 2.000000 | 30-44 | 26.000000 | 29 | 89.655172 |
8 | 2.000000 | 45-59 | 10.000000 | 11 | 90.909091 |
9 | 2.000000 | 60+ | 0.000000 | 1 | 0.000000 |
10 | 3.000000 | 0-18 | 31.000000 | 59 | 52.542373 |
11 | 3.000000 | 18-29 | 26.000000 | 54 | 48.148148 |
12 | 3.000000 | 30-44 | 11.000000 | 30 | 36.666667 |
13 | 3.000000 | 45-59 | 3.000000 | 8 | 37.500000 |
14 | 3.000000 | 60+ | 1.000000 | 1 | 100.000000 |
With only one passenger to observe, we should not pay much attention to the fact that in class 3, 100% of female passengers aged 60+ survived.
- The chart of women's survival chances
Looking at the chart, we can state that:
women in 1st and 2nd class, regardless of age, had a very high chance of survival, over 85%, not counting the 60+ group in class 2, which consisted of 1 person who did not survive,
in class 3, women had a less than 55% chance of survival, not counting the single passenger aged 60+.
3. Let's check the correlations between ticket prices and the survival of passengers.¶
1. We check the dependence of ticket prices on the chances of survival¶
- A table of survival chances looking at ticket prices
| | fare_groups | pclass | survived_sum | survived_count | percentage_survived |
---|---|---|---|---|---|
0 | 0-19 | 1.000000 | 1.000000 | 8 | 12.500000 |
1 | 0-19 | 2.000000 | 52.000000 | 152 | 34.210526 |
2 | 0-19 | 3.000000 | 156.000000 | 593 | 26.306914 |
3 | 100-310 | 1.000000 | 56.000000 | 80 | 70.000000 |
4 | 20-39 | 1.000000 | 48.000000 | 94 | 51.063830 |
5 | 20-39 | 2.000000 | 61.000000 | 109 | 55.963303 |
6 | 20-39 | 3.000000 | 19.000000 | 89 | 21.348315 |
7 | 40-79 | 1.000000 | 64.000000 | 104 | 61.538462 |
8 | 40-79 | 2.000000 | 6.000000 | 16 | 37.500000 |
9 | 40-79 | 3.000000 | 6.000000 | 27 | 22.222222 |
10 | 80-99 | 1.000000 | 27.000000 | 33 | 81.818182 |
11 | >310 | 1.000000 | 4.000000 | 4 | 100.000000 |
The sample is too small (only 4 passengers) to conclude that a passenger who paid more than 310 had a very high chance of survival.
- Chart of survival chances by ticket price
From the charts we can conclude that:
- among first-class passengers who bought tickets for 80-99, 82% survived
- first-class passengers who bought cheaper tickets, in the price range 0-39, had lower survival chances than others in that class: fewer than 50% of them survived
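In the fare table above, the group "100-310" is listed right after "0-19", which is what happens when group labels are sorted as plain strings. A possible way to get the groups in numeric order is to build them with `pd.cut`, which yields an ordered categorical. The bin edges below are assumptions inferred from the labels, and the `sample` data is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the column names come from the dataset.
sample = pd.DataFrame({
    "fare": [7.25, 26.0, 52.0, 83.0, 151.55, 512.33, 13.0, 30.0],
    "survived": [0, 1, 0, 1, 1, 1, 0, 1],
})

# Assumed edges; left-closed intervals, so a fare of 20 falls into "20-39".
edges = [0, 20, 40, 80, 100, 310, np.inf]
labels = ["0-19", "20-39", "40-79", "80-99", "100-310", ">310"]
sample["fare_groups"] = pd.cut(
    sample["fare"], bins=edges, labels=labels, right=False
)

# Groups come out in bin order because the categorical is ordered.
table = (
    sample.groupby("fare_groups", observed=False)["survived"]
    .agg(survived_sum="sum", survived_count="count")
)
print(table)
```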
4. Let's analyze the size of the passenger's family: did the number of siblings or spouses on board affect the chances of survival?¶
1. Survival chances table considering the number of spouses/siblings¶
2. The survival chance chart considering the number of spouses/siblings¶
| | sibsp | pclass | survived_sum | survived_count | percentage_survived |
---|---|---|---|---|---|
0 | 0.000000 | 1.000000 | 111.000000 | 198 | 56.060606 |
1 | 0.000000 | 2.000000 | 69.000000 | 182 | 37.912088 |
2 | 0.000000 | 3.000000 | 129.000000 | 511 | 25.244618 |
3 | 1.000000 | 1.000000 | 79.000000 | 113 | 69.911504 |
4 | 1.000000 | 2.000000 | 43.000000 | 82 | 52.439024 |
5 | 1.000000 | 3.000000 | 41.000000 | 124 | 33.064516 |
6 | 2.000000 | 1.000000 | 7.000000 | 8 | 87.500000 |
7 | 2.000000 | 2.000000 | 6.000000 | 12 | 50.000000 |
8 | 2.000000 | 3.000000 | 6.000000 | 22 | 27.272727 |
9 | 3.000000 | 1.000000 | 3.000000 | 4 | 75.000000 |
10 | 3.000000 | 2.000000 | 1.000000 | 1 | 100.000000 |
11 | 3.000000 | 3.000000 | 2.000000 | 15 | 13.333333 |
12 | 4.000000 | 3.000000 | 3.000000 | 22 | 13.636364 |
13 | 5.000000 | 3.000000 | 0.000000 | 6 | 0.000000 |
14 | 8.000000 | 3.000000 | 0.000000 | 9 | 0.000000 |
Looking at the table, we should not focus too much on the fact that the survival chance for a passenger in class 2 who had 3 siblings/spouses on board was 100%, since we had only 1 such passenger to observe.
We can notice here that among the first-class passengers who had 2 siblings/spouses on board, 88% survived, although there were only 8 such passengers.
5. The impact of the number of parents/children on the survival chances of a passenger¶
1. Survival chance table considering the number of parents/children¶
| | parch | pclass | survived_sum | survived_count | percentage_survived |
---|---|---|---|---|---|
0 | 0.000000 | 1.000000 | 141.000000 | 242 | 58.264463 |
1 | 0.000000 | 2.000000 | 64.000000 | 206 | 31.067961 |
2 | 0.000000 | 3.000000 | 131.000000 | 554 | 23.646209 |
3 | 1.000000 | 1.000000 | 36.000000 | 50 | 72.000000 |
4 | 1.000000 | 2.000000 | 31.000000 | 43 | 72.093023 |
5 | 1.000000 | 3.000000 | 33.000000 | 77 | 42.857143 |
6 | 2.000000 | 1.000000 | 21.000000 | 27 | 77.777778 |
7 | 2.000000 | 2.000000 | 21.000000 | 25 | 84.000000 |
8 | 2.000000 | 3.000000 | 15.000000 | 61 | 24.590164 |
9 | 3.000000 | 1.000000 | 1.000000 | 2 | 50.000000 |
10 | 3.000000 | 2.000000 | 3.000000 | 3 | 100.000000 |
11 | 3.000000 | 3.000000 | 1.000000 | 3 | 33.333333 |
12 | 4.000000 | 1.000000 | 1.000000 | 2 | 50.000000 |
13 | 4.000000 | 3.000000 | 0.000000 | 4 | 0.000000 |
14 | 5.000000 | 3.000000 | 1.000000 | 6 | 16.666667 |
15 | 6.000000 | 3.000000 | 0.000000 | 2 | 0.000000 |
16 | 9.000000 | 3.000000 | 0.000000 | 2 | 0.000000 |
We should not take the survival percentages literally for passengers who had 3 or more parents/children on board, as there were very few such passengers (at most 6 per group).
2. The survival rate chart considering the number of parents/children¶
We can notice here that among the second-class passengers who had 2 parents/children on board, 84% survived.
6. We are looking for outlier values¶
1. Box plot for all passengers¶
We see that:
there are many outlier values in the age column, which means that there were also elderly people on the ship,
there are outlier values in the sibsp and parch columns, which means that there were large families on the ship,
we see many outlier values for the fare column, which means that many tickets for the ship were significantly more expensive than normal tickets.
2. Box plot for third class passengers¶
We can notice that the outliers for third class passengers are similar to the outliers for all passengers, but we can observe that in this class there was no ticket that deviated significantly in price from the others.
3. Box plot for Class 2 passengers¶
In class 2, we observe that there were very small children on board, and there were not many prices deviating from the normal ticket prices.
4. Box plot for first class passengers¶
Some ticket prices in first class were significantly inflated compared to other ticket prices in that class.
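The points drawn as outliers in a box plot are usually those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. Assuming the plots above use that common default, the outliers can also be listed numerically, as in this sketch on made-up fares.

```python
import pandas as pd

# Hypothetical fares; the values are invented for illustration.
fares = pd.Series(
    [7.75, 8.05, 13.0, 15.0, 26.0, 31.0, 60.0, 512.33], name="fare"
)

# Quartiles and interquartile range.
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR rule (an assumed plotting default).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the whiskers are the points a box plot marks as outliers.
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

Applying the same rule separately to each class, for example inside a `groupby("pclass")`, would correspond to the per-class box plots.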
7. Summary¶
In summary, we found in the data that:
there were more men than women on board the ship
500 of the 1,309 passengers in the dataset survived the disaster
the highest share of passengers survived in 1st class, and the lowest in 3rd
more women survived than men
women in 1st and 2nd class had very high chances of survival, around 90%, regardless of age
in 3rd class, the chances of survival for women were significantly lower, ranging from about 53% down to about 37% depending on age
men in every class had low chances of survival, below 35%
men from 1st class aged 0-18 survived at a rate of 75%, but there were only 8 men of that age in 1st class
a fairly large proportion of 1st and 2nd class passengers who had 1 or 2 parents/children on board survived, around 75%.