Python for AI & ML-Day 19: Mini-project: Analyze a Dataset (e.g., Titanic Dataset)

Feb 26, 2025

This mini-project focuses on applying data manipulation, aggregation, and visualization skills to analyze the Titanic dataset. You’ll use Pandas, Matplotlib, and Seaborn to uncover patterns in passenger survival, demographics, and other factors. Let’s dive in!

1. Project Objectives

Load and explore a real-world dataset.
Clean and preprocess data.
Use grouping and aggregation (Day 18 skills) to analyze survival rates.
Visualize insights to answer questions like:
- Did survival rates differ by gender, class, or age?
- Which factors most influenced survival?

2. Dataset Overview

The Titanic dataset contains information about 891 passengers, including:

Survived: 0 (No), 1 (Yes)
Pclass: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
Sex: Passenger sex (male or female)
Age: Passenger age
SibSp: Number of siblings/spouses aboard
Parch: Number of parents/children aboard
Fare: Ticket price
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

3. Step-by-Step Analysis

Step 1: Load the Data

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (ensure the CSV file is in your working directory)
titanic = pd.read_csv("titanic.csv")

# Display the first 5 rows
print(titanic.head())
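
Copies of the Titanic CSV from different sources do not always use the same column names, so it can help to check that the columns used later are present right after loading. The sketch below is only an illustration: it reads a hypothetical two-row sample from memory instead of titanic.csv, and the `expected` set is an assumed subset of the columns this project uses.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for titanic.csv
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
)
df = pd.read_csv(sample_csv)

# Columns the later steps rely on (a subset, chosen for illustration)
expected = {"Survived", "Pclass", "Sex", "Age"}
missing_cols = expected - set(df.columns)

print(df.shape)
print("Missing columns:", sorted(missing_cols))
```

With the real file, replace `sample_csv` with the path to your CSV and extend `expected` to cover every column you plan to use.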

Step 2: Data Exploration

Check for missing values:

print(titanic.isnull().sum())

- Age: 177 missing values.
- Cabin: 687 missing values (mostly missing, so we drop this column in Step 3).
- Embarked: 2 missing values.

Basic statistics:

print(titanic.describe())
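
Raw counts are easier to judge next to the share of rows they represent. As a minimal sketch, assuming a small hypothetical frame in place of the real data, the missing counts and percentages can be put side by side like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
})

# Keep only columns with at least one missing value, worst first
missing = missing[missing["missing"] > 0].sort_values("percent", ascending=False)
print(missing)
```

A column that is mostly empty, like Cabin here, is usually a candidate for dropping, while one with only a few gaps can be filled.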

Step 3: Data Cleaning

Drop irrelevant columns:

titanic = titanic.drop(columns=["Cabin", "PassengerId", "Name", "Ticket"])

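Handle missing values:

Fill missing Age values with the median age:

# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and which stops working under copy-on-write
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

Fill missing Embarked values with the mode (most frequent port):

titanic["Embarked"] = titanic["Embarked"].fillna(titanic["Embarked"].mode()[0])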
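
To see what median and mode imputation do, it can help to run them on a frame small enough to check by hand. The sketch below uses a hypothetical four-row frame. It passes a dictionary to `DataFrame.fillna`, which fills both columns in one call; that is a stylistic alternative for illustration, not a required change.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

age_median = toy["Age"].median()       # median of 20, 30, 40
port_mode = toy["Embarked"].mode()[0]  # most frequent non-missing port

# Fill each column with its own replacement value
toy = toy.fillna({"Age": age_median, "Embarked": port_mode})

print(toy)
print("Remaining missing values:", int(toy.isnull().sum().sum()))
```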

Step 4: Feature Engineering

Create a FamilySize column (the number of relatives aboard, not counting the passenger):

titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

Categorize age groups:

# By default the bins are right-inclusive: (0, 18], (18, 30], (30, 50], (50, 100]
titanic["AgeGroup"] = pd.cut(titanic["Age"], bins=[0, 18, 30, 50, 100], labels=["Child", "Young Adult", "Adult", "Senior"])
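
Where an age that falls exactly on a bin edge ends up depends on the `right` argument of `pd.cut`. The sketch below uses a few hypothetical ages to show the default behavior (right edge included) and the alternative with `right=False`.

```python
import pandas as pd

bins = [0, 18, 30, 50, 100]
labels = ["Child", "Young Adult", "Adult", "Senior"]

# Hypothetical ages, including values that sit exactly on bin edges
ages = pd.Series([0.42, 18, 18.5, 30, 50, 80])

# Default: intervals are (0, 18], (18, 30], (30, 50], (50, 100]
right_inclusive = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 18), [18, 30), [30, 50), [50, 100)
left_inclusive = pd.cut(ages, bins=bins, labels=labels, right=False)

print(right_inclusive.tolist())
print(left_inclusive.tolist())
```

With the default setting an 18-year-old is labeled Child; with `right=False` the same passenger is labeled Young Adult. The choice shifts group sizes and can change the survival rates computed per group.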

Step 5: Data Aggregation and Grouping

Use groupby() to analyze survival rates by categories:

Survival Rate by Class:

survival_by_class = titanic.groupby("Pclass")["Survived"].mean() * 100
print(survival_by_class)

Output:

Pclass
1    62.96
2    47.28
3    24.24
Name: Survived, dtype: float64

Insight: 1st-class passengers had the highest survival rate (63%).

Survival Rate by Gender:

survival_by_gender = titanic.groupby("Sex")["Survived"].mean() * 100
print(survival_by_gender)

Output:

Sex
female    74.20
male      18.89
Name: Survived, dtype: float64

Insight: 74% of females survived vs. 19% of males.

Survival Rate by Age Group:

survival_by_age = titanic.groupby("AgeGroup", observed=True)["Survived"].mean() * 100
print(survival_by_age)

Output (exact values depend on the bin edges and on how missing ages were filled):

AgeGroup
Child          53.98
Young Adult    36.92
Adult          43.55
Senior         34.62
Name: Survived, dtype: float64

Insight: Children had the highest survival rate (54%).
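
A survival rate is more trustworthy when you can also see how many passengers it is based on. As a minimal sketch on hypothetical toy data, named aggregation returns the group size, the survivor count, and the rate in one table:

```python
import pandas as pd

# Hypothetical toy data: nine passengers across three classes
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# One row per class: group size, number of survivors, and mean survival
summary = toy.groupby("Pclass")["Survived"].agg(
    passengers="count",
    survivors="sum",
    rate="mean",
)

# Express the rate as a percentage, rounded for display
summary["rate"] = (summary["rate"] * 100).round(2)
print(summary)
```

The same pattern works for Sex or AgeGroup by changing the column passed to `groupby()`.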

Step 6: Data Visualization

Visualize insights using Seaborn and Matplotlib.

Survival Rate by Class and Gender:

sns.barplot(x="Pclass", y="Survived", hue="Sex", data=titanic)
plt.title("Survival Rate by Class and Gender")
plt.ylabel("Survival Rate")  # bar height is the mean of Survived, a fraction between 0 and 1
plt.show()

Observation: Women in 1st and 2nd class had survival rates above 90%, far higher than men in any class.
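
The bar heights in a plot like this are group means, so the same numbers can be read from a table. The sketch below, on hypothetical toy data, uses `pivot_table` to cross class with gender:

```python
import pandas as pd

# Hypothetical toy data: two classes, both genders
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are classes, columns are genders, cells are survival rates in percent
table = toy.pivot_table(
    index="Pclass", columns="Sex", values="Survived", aggfunc="mean"
) * 100
print(table)
```

Each cell corresponds to one bar in the grouped bar plot, scaled to a percentage.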

Age Distribution of Survivors vs. Non-Survivors:

sns.histplot(data=titanic, x="Age", hue="Survived", kde=True, bins=20)
plt.title("Age Distribution of Survivors vs. Non-Survivors")
plt.show()

Observation: Children under 10 had higher survival rates than older passengers.

Survival Rate by Family Size:

sns.barplot(x="FamilySize", y="Survived", data=titanic)
plt.title("Survival Rate by Family Size")
plt.show()

Observation: Passengers with 1-3 family members aboard had better survival odds than those traveling alone or with larger families.
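
Large family sizes are rare, so the bars for them rest on few passengers. One option is to group family sizes into bands before comparing rates. The sketch below uses hypothetical toy data, and the band edges (alone, 1-3, 4 or more) are an assumption chosen to match the observation above.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "FamilySize": [0, 0, 0, 1, 2, 3, 4, 6],
    "Survived":   [0, 0, 1, 1, 1, 0, 0, 0],
})

# Bands: 0 relatives, 1 to 3 relatives, 4 or more relatives
toy["FamilyBand"] = pd.cut(
    toy["FamilySize"], bins=[-1, 0, 3, 20], labels=["Alone", "1-3", "4+"]
)

# Survival rate per band, in percent
band_rates = toy.groupby("FamilyBand", observed=True)["Survived"].mean() * 100
print(band_rates.round(2))
```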

4. Key Findings

Class and Gender Bias:
- 1st-class passengers and women survived at much higher rates, consistent with their being prioritized during evacuation.
Age Mattered:
- Children had higher survival rates, in line with the "women and children first" protocol.
Family Size:
- Moderate family sizes (1-3) correlated with better survival odds.

5. Further Analysis Ideas

Explore survival rates by Embarked port.
Investigate interactions between Fare and survival.
Use a heatmap to visualize correlations between variables:

plt.figure(figsize=(10, 6))
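# numeric_only=True skips text and categorical columns such as Sex, Embarked, and AgeGroup
sns.heatmap(titanic.corr(numeric_only=True), annot=True, cmap="coolwarm")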
plt.title("Correlation Heatmap")
plt.show()
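
The first two ideas in the list above can be sketched with the same `groupby()` pattern used in Step 5. The example below runs on hypothetical toy data. Splitting Fare into quartiles with `pd.qcut` is one possible way to look at fares and is an assumption made here for illustration, as is the `FareBand` column name.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Fare":     [7.0, 8.0, 10.0, 13.0, 26.0, 30.0, 52.0, 80.0],
    "Survived": [0, 0, 0, 1, 1, 0, 1, 1],
})

# Survival rate by port of embarkation, in percent
by_port = toy.groupby("Embarked")["Survived"].mean() * 100

# Split fares into four equally sized bands, cheapest (Q1) to most expensive (Q4)
toy["FareBand"] = pd.qcut(toy["Fare"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
by_fare = toy.groupby("FareBand", observed=True)["Survived"].mean() * 100

print(by_port)
print(by_fare)
```

On the real dataset, keep in mind that fare is closely tied to ticket class, so a fare effect may partly reflect the class effect already seen.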

6. Summary

Data Cleaning: Handled missing values and irrelevant columns.
Aggregation: Used groupby() to analyze survival rates by class, gender, and age.
Visualization: Created bar plots, histograms, and heatmaps to highlight patterns.
Insights: Class, gender, and age were critical factors in survival.

Real-World Applications

Business: Analyze customer demographics to improve services.
Healthcare: Study patient outcomes based on treatment and demographics.
Policy-Making: Use data to prioritize resources during emergencies.

Practice: Try this analysis on other datasets (e.g., Iris, Housing Prices)!
