Python for AI & ML-Day 19: Mini-project: Analyze a Dataset (e.g., Titanic Dataset)
This mini-project focuses on applying data manipulation, aggregation, and visualization skills to analyze the Titanic dataset. You’ll use Pandas, Matplotlib, and Seaborn to uncover patterns in passenger survival, demographics, and other factors. Let’s dive in!
1. Project Objectives
Load and explore a real-world dataset.
Clean and preprocess data.
Use grouping and aggregation (Day 18 skills) to analyze survival rates.
Visualize insights to answer questions like:
Did survival rates differ by gender, class, or age?
Which factors most influenced survival?
2. Dataset Overview
The Titanic dataset contains information about 891 passengers, including:
Survived: 0 (No), 1 (Yes)
Pclass: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
Sex: Passenger sex (stored as male or female)
Age: Passenger age
SibSp: Number of siblings/spouses aboard
Parch: Number of parents/children aboard
Fare: Ticket price
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
3. Step-by-Step Analysis
Step 1: Load the Data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset (ensure the CSV file is in your working directory)
titanic = pd.read_csv("titanic.csv")
# Display the first 5 rows
print(titanic.head())
Step 2: Data Exploration
Check for missing values:
print(titanic.isnull().sum())
Age: 177 missing values.
Cabin: 687 missing values (mostly missing; we may drop this column).
Embarked: 2 missing values.
Basic statistics:
print(titanic.describe())
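Raw counts of missing values are easier to judge next to the share of rows they represent. The sketch below is one way to do that on a small made-up DataFrame; the helper name missing_summary and the toy values are illustrative assumptions, not part of the Titanic data. On the real data you would pass titanic instead of toy.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "percent": (df.isnull().mean() * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# Toy data standing in for the real CSV (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, np.nan, np.nan, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

print(missing_summary(toy))
```

A column that is mostly empty, like Cabin here, is a candidate for dropping, while a column with a smaller share of gaps is usually worth imputing.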
Step 3: Data Cleaning
Drop irrelevant columns:
titanic = titanic.drop(columns=["Cabin", "PassengerId", "Name", "Ticket"])
Handle missing values:
Fill missing Age values with the median age:
# Assign the result back instead of using inplace=True on a selected column,
# which may not update the DataFrame under pandas copy-on-write
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
Fill missing Embarked values with the mode (most frequent port):
titanic["Embarked"] = titanic["Embarked"].fillna(titanic["Embarked"].mode()[0])
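As a minimal sketch of the same imputation idea, the example below fills both columns in a single call by passing a dict to fillna, and leaves the original frame untouched. The toy DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic columns (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Compute the fill values once so they can be inspected or reused
age_median = toy["Age"].median()       # median of 22, 26, 38
port_mode = toy["Embarked"].mode()[0]  # most frequent port

# fillna accepts a dict mapping column name -> fill value
cleaned = toy.fillna({"Age": age_median, "Embarked": port_mode})

print(cleaned)
print(cleaned.isnull().sum())
```

Returning a new DataFrame rather than modifying in place makes it easy to compare the data before and after cleaning.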
Step 4: Feature Engineering
Create a FamilySize column:
# Relatives aboard, not counting the passenger
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]
Categorize age groups:
# Bins are right-inclusive: (0, 18], (18, 30], (30, 50], (50, 100]
titanic["AgeGroup"] = pd.cut(titanic["Age"], bins=[0, 18, 30, 50, 100], labels=["Child", "Young Adult", "Adult", "Senior"])
Step 5: Data Aggregation and Grouping
Use groupby() to analyze survival rates by categories:
Survival Rate by Class:
survival_by_class = titanic.groupby("Pclass")["Survived"].mean() * 100
print(survival_by_class)
Output:
Pclass
1 62.96
2 47.28
3 24.24
Name: Survived, dtype: float64
Insight: 1st-class passengers had the highest survival rate (63%).
Survival Rate by Gender:
survival_by_gender = titanic.groupby("Sex")["Survived"].mean() * 100
print(survival_by_gender)
Output:
Sex
female 74.20
male 18.89
Name: Survived, dtype: float64
Insight: 74% of females survived vs. 19% of males.
Survival Rate by Age Group:
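# AgeGroup is categorical; passing observed explicitly avoids a warning in recent pandas
survival_by_age = titanic.groupby("AgeGroup", observed=True)["Survived"].mean() * 100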
print(survival_by_age)
Output:
AgeGroup
Child 53.98
Young Adult 36.92
Adult 43.55
Senior 34.62
Name: Survived, dtype: float64
Insight: Children had the highest survival rate (54%).
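A survival rate is easier to trust when the group size is shown next to it, and age-group figures can shift depending on which bin an edge age such as 18 falls into. The following sketch uses a made-up toy DataFrame (not the Titanic data) to report both numbers with agg, and to show how the right argument of pd.cut moves an 18-year-old between groups.

```python
import pandas as pd

# Toy data (made up); note the two passengers aged exactly 18
toy = pd.DataFrame({
    "Age": [5, 18, 18, 25, 40, 60],
    "Survived": [1, 1, 0, 0, 1, 0],
})

bins = [0, 18, 30, 50, 100]
labels = ["Child", "Young Adult", "Adult", "Senior"]

# Default right=True: intervals are (0, 18], (18, 30], ... so age 18 is a Child
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# right=False: intervals are [0, 18), [18, 30), ... so age 18 is a Young Adult
toy["AgeGroupLeft"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# Report group size and survival rate together
summary = toy.groupby("AgeGroup", observed=True)["Survived"].agg(
    passengers="count",
    survival_rate="mean",
)
summary["survival_rate"] = (summary["survival_rate"] * 100).round(2)

print(summary)
print(toy[["Age", "AgeGroup", "AgeGroupLeft"]])
```

If your own figures differ slightly from a published table, the bin-edge convention and the imputation method are two likely places to check first.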
Step 6: Data Visualization
Visualize insights using Seaborn and Matplotlib.
Survival Rate by Class and Gender:
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=titanic)
plt.title("Survival Rate by Class and Gender")
plt.ylabel("Survival Rate")  # bar heights are proportions from 0 to 1, not percentages
plt.show()
Observation: Females in 1st and 2nd class had survival rates above 90%.
Age Distribution of Survivors vs. Non-Survivors:
sns.histplot(data=titanic, x="Age", hue="Survived", kde=True, bins=20)
plt.title("Age Distribution of Survivors vs. Non-Survivors")
plt.show()
Observation: Children under 10 had higher survival rates.
Survival Rate by Family Size:
sns.barplot(x="FamilySize", y="Survived", data=titanic)
plt.title("Survival Rate by Family Size")
plt.show()
Observation: Passengers with 1-3 family members had better survival odds.
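By default, sns.barplot draws the mean of the y column for each group, so the numbers behind a bar chart like the class-and-gender one can be reproduced as a table. The sketch below does this with pivot_table and crosstab on a made-up toy DataFrame that only borrows the Titanic column names; the values are illustrative, not real results.

```python
import pandas as pd

# Toy data (made up) with the same column names as the Titanic dataset
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex": ["female", "female", "male", "female", "male", "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the survival proportion; scale to percent for reading
rates = toy.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="mean") * 100

# Group sizes, to show how many passengers each rate is based on
counts = pd.crosstab(toy["Pclass"], toy["Sex"])

print(rates.round(1))
print(counts)
```

Reading the table alongside the plot helps confirm an observation and flags bars that rest on only a handful of passengers.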
4. Key Findings
Class and Gender Bias:
1st-class passengers and females survived at much higher rates, which is consistent with their being prioritized during evacuation.
Age Mattered:
Children had higher survival rates, which is consistent with the "women and children first" protocol.
Family Size:
Passengers with 1-3 family members aboard had better survival odds than those traveling alone or with larger families.
5. Further Analysis Ideas
Explore survival rates by Embarked port.
Investigate interactions between Fare and survival.
Use a heatmap to visualize correlations between variables:
plt.figure(figsize=(10, 6))
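# numeric_only=True skips text and categorical columns such as Sex, Embarked, and AgeGroup
sns.heatmap(titanic.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
6. Summary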
Data Cleaning: Handled missing values and irrelevant columns.
Aggregation: Used groupby() to analyze survival rates by class, gender, and age.
Visualization: Created bar plots, histograms, and heatmaps to highlight patterns.
Insights: Class, gender, and age were critical factors in survival.
Real-World Applications
Business: Analyze customer demographics to improve services.
Healthcare: Study patient outcomes based on treatment and demographics.
Policy-Making: Use data to prioritize resources during emergencies.
Practice: Try this analysis on other datasets (e.g., Iris, Housing Prices)!


