Python for AI & ML - Day 20: Recap and Q&A Session

Feb 27, 2025

Goal: Reinforce key concepts from Days 11–20 (Data Manipulation, Visualization, and Mini-Projects) and address common questions to solidify foundational skills for Phase 2.

Part 1: Recap of Key Concepts (Days 11–20)

1. Data Manipulation with NumPy & Pandas

Key Topics:

NumPy Arrays:
- Purpose: Efficient numerical operations (e.g., np.zeros(), arr1 + arr2).
- Example:

import numpy as np
arr = np.array([[1, 2], [3, 4]])  # 2D array
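
To make the two operations named above concrete, here is a minimal sketch of np.zeros() and element-wise arithmetic. The arrays arr1 and arr2 are illustrative values chosen for this example, not data from the course.

```python
import numpy as np

# Illustrative arrays (values are assumptions for this sketch)
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.ones((2, 2), dtype=int)

zeros = np.zeros((2, 2))   # 2x2 array of 0.0 (float64 by default)
total = arr1 + arr2        # element-wise addition, no Python loop
doubled = arr1 * 2         # the scalar is broadcast to every element

print(total)
print(arr1.shape, arr1.dtype)
```

Because the arithmetic runs on whole arrays at once, the same code works unchanged for two elements or two million.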

Pandas DataFrames:
- Structure: Tabular data with labeled rows/columns.
- Operations:
  - Filtering: df[df["Age"] > 30]
  - Aggregation: df.groupby("Pclass").mean(numeric_only=True)
  - Merging: pd.merge(df1, df2, on="key")
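
The three operations above can be tried on a tiny table. This is a sketch under stated assumptions: the passenger rows and the classes lookup table are hypothetical, made up so the results are easy to check by hand.

```python
import pandas as pd

# Tiny hypothetical passenger table (not the real Titanic data)
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [1, 3, 1, 3],
    "Age": [38.0, 22.0, 54.0, 35.0],
    "Survived": [1, 0, 0, 1],
})
# Hypothetical lookup table that shares the "Pclass" column
classes = pd.DataFrame({"Pclass": [1, 3], "ClassName": ["First", "Third"]})

over_30 = df[df["Age"] > 30]                    # boolean-mask filtering
mean_age = df.groupby("Pclass")["Age"].mean()   # one mean per class
merged = pd.merge(df, classes, on="Pclass")     # inner join by default

print(over_30)
print(mean_age)
print(merged)
```

Selecting the column before calling mean() keeps the aggregation to the one value you care about and avoids errors from text columns.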

Why It Matters: Clean, structured data is the backbone of ML. Without proper manipulation, models produce unreliable results.

2. Data Cleaning

Key Techniques:

Handling Missing Data:
- Drop rows: df = df.dropna()
- Fill gaps: df["Age"] = df["Age"].fillna(df["Age"].median())
Outlier Detection:

# IQR rule: keep fares within 1.5 * IQR of the quartiles
Q1 = df["Fare"].quantile(0.25)
Q3 = df["Fare"].quantile(0.75)
IQR = Q3 - Q1
df = df[(df["Fare"] > Q1 - 1.5 * IQR) & (df["Fare"] < Q3 + 1.5 * IQR)]
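
Both cleaning steps can be checked end to end on a small sample. The sketch below assumes a hypothetical five-row table with one missing age and one extreme fare. It uses Series.between(), which includes the boundary values, as a compact alternative to the two comparisons.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing age and one extreme fare
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0, 28.0],
    "Fare": [7.25, 8.05, 13.0, 512.33, 10.5],
})

# Fill the gap with the median and assign the result back
df["Age"] = df["Age"].fillna(df["Age"].median())

# IQR rule: keep fares within 1.5 * IQR of the quartiles
q1, q3 = df["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["Fare"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[in_range]

print(cleaned)
```

Note that fillna() returns a new Series, so the result only sticks if it is assigned back to the column.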

Why It Matters: Garbage in, garbage out! Dirty data leads to biased models.

3. Data Visualization

Key Libraries:

Matplotlib: Basic plots (line, bar, scatter).

rates = df.groupby("Pclass")["Survived"].mean()  # one survival rate per class
plt.bar(rates.index, rates.values)

Seaborn: Advanced statistical visualizations.

sns.histplot(data=df, x="Age", hue="Survived", kde=True)
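
For a bar chart of survival by class, one approach is to aggregate first so each class gets exactly one bar. The sketch below uses Matplotlib only and a hypothetical eight-row table. The Agg backend line is an assumption that lets it run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical passengers (not the real Titanic data)
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column is the survival rate for each class
rates = df.groupby("Pclass")["Survived"].mean()

fig, ax = plt.subplots()
ax.bar(rates.index.astype(str), rates.values)  # string labels give evenly spaced bars
ax.set_xlabel("Passenger class")
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by class (hypothetical data)")
# In a notebook the figure renders here; in a script, call fig.savefig(...)
```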

Why It Matters: Visuals reveal patterns (e.g., survival bias in the Titanic dataset) that raw numbers hide.

4. Mini-Project: Titanic Dataset Analysis

Key Takeaways:

Aggregation: Survival rates by class/gender using groupby().
Visualization: Bar plots for class/gender trends, histograms for age distributions.
Insights: Women, children, and passengers in higher classes had survival advantages.

Example Code:

# Survival rate by class and gender
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=titanic)
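
The numbers behind a plot like this come from a groupby() on two keys. Here is a minimal sketch, assuming a hypothetical six-row stand-in for the titanic DataFrame. Reporting the group size next to the rate helps you judge how much weight each bar deserves.

```python
import pandas as pd

# Hypothetical rows standing in for the `titanic` DataFrame
titanic = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "male", "female", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# "mean" of a 0/1 column is the survival rate; "count" is the group size
summary = titanic.groupby(["Pclass", "Sex"])["Survived"].agg(["mean", "count"])

print(summary)
```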

Part 2: Q&A Session

Common Questions & Answers

Q1: When should I use NumPy vs. Pandas?
A:

NumPy: Math-heavy tasks (e.g., matrix operations, linear algebra).
Pandas: Tabular data manipulation (filtering, grouping, merging).
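
In practice the two libraries work together: Pandas for selecting and filtering, then NumPy for the math. A short sketch with a hypothetical three-row feature table:

```python
import numpy as np
import pandas as pd

# Hypothetical feature table
df = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 30.0]})

# Pandas: label-based filtering
adults = df[df["Age"] >= 30]

# NumPy: hand the numeric values over for linear algebra
X = df.to_numpy()   # shape (3, 2), dtype float64
gram = X.T @ X      # 2x2 matrix product

print(adults)
print(gram)
```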

Q2: How do I handle large datasets that crash my notebook?
A:

Downcast column dtypes to reduce memory (e.g., df["Age"] = df["Age"].astype("float32")).
Process data in chunks: pd.read_csv("data.csv", chunksize=1000) returns an iterator of DataFrames, so loop over it and aggregate as you go.
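
Both tips can be sketched in a few lines. To keep the example self-contained, it assumes an in-memory CSV string (csv_text) in place of a real "data.csv". The column names and values are made up.

```python
import io
import pandas as pd

# In-memory stand-in for "data.csv" so the sketch runs anywhere
csv_text = "Age,Fare\n" + "\n".join(
    f"{20 + i % 50},{5.0 + i}" for i in range(2500)
)

# 1) Downcasting: float32 stores each value in 4 bytes instead of 8
df = pd.read_csv(io.StringIO(csv_text))
before = df["Fare"].memory_usage(index=False)
df["Fare"] = df["Fare"].astype("float32")
after = df["Fare"].memory_usage(index=False)
print(before, after)

# 2) Chunking: with chunksize, read_csv yields one DataFrame per chunk
total_rows = 0
fare_sum = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1000):
    total_rows += len(chunk)
    fare_sum += chunk["Fare"].sum()

mean_fare = fare_sum / total_rows
print(total_rows, mean_fare)
```

Keep in mind that float32 trades precision for memory, so it suits features like age or fare better than identifiers or money totals that must be exact.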

Q3: Why did my Seaborn plot not display labels correctly?
A: Add plt.xlabel(), plt.ylabel(), or plt.title() after plotting. Example:

sns.histplot(...)
plt.title("Age Distribution")
plt.show()

Q4: How do I export cleaned data for ML models?
A: Save to CSV: df.to_csv("cleaned_data.csv", index=False).
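
A quick round trip shows what index=False buys you. This sketch assumes a hypothetical two-row table and writes to a temporary folder so it leaves no files behind.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned table
df = pd.DataFrame({"Age": [22.0, 38.0], "Survived": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_data.csv")
    df.to_csv(path, index=False)   # index=False: no extra unnamed index column
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
```

Without index=False, the row index is written as an extra first column, which then shows up as "Unnamed: 0" when the file is read back.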

Q5: What’s the best way to visualize correlations?
A: Use a heatmap:

sns.heatmap(df.corr(numeric_only=True), annot=True)
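
One thing to watch: if the DataFrame contains text columns, recent pandas versions raise an error from corr() unless you pass numeric_only=True. The sketch below assumes a hypothetical five-row table with a "Sex" text column mixed in. The Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical data with one text column mixed in
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Survived": [0, 1, 1, 1, 0],
    "Sex": ["male", "female", "female", "female", "male"],
})

# numeric_only=True leaves the "Sex" column out of the correlation matrix
corr = df.corr(numeric_only=True)

# Fix the color scale to [-1, 1] so 0 sits in the middle of the palette
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Correlation matrix (hypothetical data)")
```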

Part 3: Common Pitfalls & Fixes

Overplotting:
- Issue: Cluttered visuals (e.g., too many categories in a bar plot).
- Fix: Use sns.FacetGrid or limit categories.
Misleading Aggregations:
- Issue: mean() skewed by outliers.
- Fix: Use median() or remove outliers first.
Ignoring Data Types:
- Issue: Treating categorical data (e.g., "Embarked") as numerical.
- Fix: Convert to categorical: df["Embarked"] = df["Embarked"].astype("category").
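
Two of these pitfalls are easy to see on toy data. This sketch assumes a hypothetical fare series with one extreme value and a hypothetical column of port codes.

```python
import pandas as pd

# Hypothetical fares with one extreme value
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 500.0])
print(fares.mean())    # pulled upward by the 500.0 outlier
print(fares.median())  # middle value, unaffected by the outlier

# Hypothetical port codes stored as plain strings
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})
df["Embarked"] = df["Embarked"].astype("category")

categories = list(df["Embarked"].cat.categories)  # sorted unique labels
codes = df["Embarked"].cat.codes.tolist()         # integer code per row
print(categories, codes)
```

The integer codes are labels, not quantities: "S" is not larger than "C" in any meaningful sense, so they should not be fed to a model as if they were measurements.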

Part 4: Preparing for Phase 2 (Core ML Concepts)

Next Steps:

Supervised Learning: Decision trees, SVMs, KNN (Days 31–40).
Unsupervised Learning: Clustering, PCA (Days 41–50).

Action Items:

Master Pandas operations (e.g., merge, pivot_table).
Practice visualizing relationships between variables (e.g., scatter plots for correlation).
Revisit the Titanic project and try predicting survival with LogisticRegression.
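
As a starting point for the pivot_table practice, here is a minimal sketch on hypothetical rows (not the real Titanic data). It puts class on the rows, sex on the columns, and the survival rate in the cells.

```python
import pandas as pd

# Hypothetical rows, not the real Titanic data
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Survived": [1, 1, 1, 0, 1, 0],
})

# Rows = class, columns = sex, cells = mean of Survived (the survival rate)
table = titanic.pivot_table(
    index="Pclass", columns="Sex", values="Survived", aggfunc="mean"
)

print(table)
```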

Final Tips

Documentation is Your Friend: Bookmark Pandas docs and Seaborn examples.
Learn Shortcuts: Use df.describe() or df.info() for quick data summaries.
Stay Curious: Ask why patterns exist (e.g., Why did 1st-class passengers survive more?).

Quote of the Day:
“Without data, you’re just another person with an opinion.”
Phase 1 equipped you with data skills. Phase 2 will turn you into a storyteller with models. 🚀