Python for AI & ML - Day 20: Recap and Q&A Session
Goal: Reinforce key concepts from Days 11–20 (Data Manipulation, Visualization, and Mini-Projects) and address common questions to solidify foundational skills for Phase 2.
Part 1: Recap of Key Concepts (Days 11–20)
1. Data Manipulation with NumPy & Pandas
Key Topics:
NumPy Arrays:
Purpose: Efficient numerical operations (e.g., np.zeros(), arr1 + arr2).
Example:
import numpy as np
arr = np.array([[1, 2], [3, 4]])  # 2D array
Pandas DataFrames:
Structure: Tabular data with labeled rows/columns.
Operations:
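Filtering: df[df["Age"] > 30]
Aggregation: df.groupby("Pclass").mean(numeric_only=True) (numeric_only=True skips text columns such as "Name", which would otherwise raise an error in recent Pandas versions)
Merging: pd.merge(df1, df2, on="key")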
Why It Matters: Clean, structured data is the backbone of ML. Without proper manipulation, models produce unreliable results.
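To make the operations above concrete, here is a minimal sketch on a tiny made-up table. The column values and the names df1, df2, older, merged, and mean_fare are illustrative assumptions, not part of the course dataset.

```python
import numpy as np
import pandas as pd

# NumPy: element-wise arithmetic on two arrays of the same shape
arr1 = np.zeros((2, 2))
arr2 = np.array([[1, 2], [3, 4]])
total = arr1 + arr2  # adds values position by position

# Hypothetical passenger tables (made-up values, Titanic-style columns)
df1 = pd.DataFrame({
    "key": [1, 2, 3, 4],
    "Age": [22, 38, 45, 31],
    "Pclass": [3, 1, 1, 2],
})
df2 = pd.DataFrame({
    "key": [1, 2, 3, 4],
    "Fare": [7.25, 71.28, 53.10, 13.00],
})

# Filtering: keep rows where Age is above 30
older = df1[df1["Age"] > 30]

# Merging: inner join of the two tables on the shared "key" column
merged = pd.merge(df1, df2, on="key")

# Aggregation: mean fare per passenger class
mean_fare = merged.groupby("Pclass")["Fare"].mean()
```

Selecting the "Fare" column before calling mean() keeps the aggregation limited to the column of interest, which also avoids problems with non-numeric columns.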
2. Data Cleaning
Key Techniques:
Handling Missing Data:
Drop rows: df = df.dropna()
Fill gaps: df["Age"] = df["Age"].fillna(df["Age"].median())
Both methods return a new object, so assign the result back to keep the change.
Outlier Detection:
Q1 = df["Fare"].quantile(0.25)
Q3 = df["Fare"].quantile(0.75)
df = df[(df["Fare"] > Q1 - 1.5*(Q3-Q1)) & (df["Fare"] < Q3 + 1.5*(Q3-Q1))]
Why It Matters: Garbage in, garbage out! Dirty data leads to biased models.
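The two cleaning steps can be tried end to end on a small frame. This is a sketch with made-up values (one missing Age, one extreme Fare); the names iqr and cleaned are assumptions introduced for readability.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration: one missing Age and one extreme Fare
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0, 26.0],
    "Fare": [7.0, 8.0, 9.0, 10.0, 500.0],
})

# Fill the missing Age with the median of the observed ages
df["Age"] = df["Age"].fillna(df["Age"].median())

# IQR rule: keep fares within 1.5 * IQR of the first and third quartiles
q1 = df["Fare"].quantile(0.25)
q3 = df["Fare"].quantile(0.75)
iqr = q3 - q1
cleaned = df[(df["Fare"] > q1 - 1.5 * iqr) & (df["Fare"] < q3 + 1.5 * iqr)]
```

Here the fare of 500 falls outside the IQR bounds and is dropped, while the missing age is replaced rather than removed, so only one row is lost.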
3. Data Visualization
Key Libraries:
Matplotlib: Basic plots (line, bar, scatter).
rates = df.groupby("Pclass")["Survived"].mean()  # survival rate per class
plt.bar(rates.index, rates.values)
Seaborn: Advanced statistical visualizations.
sns.histplot(data=df, x="Age", hue="Survived", kde=True)
Why It Matters: Visuals reveal patterns (e.g., survival bias in the Titanic dataset) that raw numbers hide.
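As a rough illustration of the Matplotlib workflow, the sketch below aggregates first and then plots one bar per class. The data is made up, and the "Agg" backend is an assumption chosen so the code runs without a display; in a notebook you would normally drop that line and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up passenger records (not the real Titanic data)
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Aggregate first: one survival rate per class
rates = df.groupby("Pclass")["Survived"].mean()

# Then plot: one bar per class, with labels added explicitly
fig, ax = plt.subplots()
bars = ax.bar(rates.index.astype(str), rates.values)
ax.set_xlabel("Passenger class")
ax.set_ylabel("Survival rate")
ax.set_title("Survival rate by class (toy data)")
```

Plotting the aggregated rates, rather than the raw 0/1 column, gives each class a single bar whose height is meaningful.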
4. Mini-Project: Titanic Dataset Analysis
Key Takeaways:
Aggregation: Survival rates by class/gender using groupby().
Visualization: Bar plots for class/gender trends, histograms for age distributions.
Insights: Women and children in higher classes had survival advantages.
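The aggregation takeaway can be sketched as a two-key groupby. The rows below are invented stand-ins for the Titanic data, so the resulting rates are illustrative only; the name survival is an assumption.

```python
import pandas as pd

# Invented stand-in rows (not the real Titanic records)
titanic = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female",
                 "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Mean of the 0/1 Survived column = survival rate for each (class, sex) group
survival = (
    titanic.groupby(["Pclass", "Sex"])["Survived"]
    .mean()
    .unstack("Sex")  # move Sex into columns for a compact class-by-sex table
)
```

The resulting table has one row per class and one column per sex, which is the same breakdown a class/gender bar plot shows.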
Example Code:
# Survival rate by class and gender
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=titanic)
Part 2: Q&A Session
Common Questions & Answers
Q1: When should I use NumPy vs. Pandas?
A:
NumPy: Math-heavy tasks (e.g., matrix operations, linear algebra).
Pandas: Tabular data manipulation (filtering, grouping, merging).
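A small side-by-side sketch of that split, using made-up numbers: NumPy handles the matrix math, Pandas handles the labeled table, and to_numpy() is one way to hand data from one to the other. The names A, B, by_sex, and X are assumptions.

```python
import numpy as np
import pandas as pd

# NumPy: matrix product of two small matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
product = A @ B  # multiplying by B swaps the columns of A

# Pandas: labeled, tabular data with grouping
df = pd.DataFrame({
    "Sex":  ["male", "female", "female", "male"],
    "Fare": [10.0, 30.0, 50.0, 20.0],
})
by_sex = df.groupby("Sex")["Fare"].mean()

# Hand-off: turn selected columns into a plain NumPy array for numeric work
X = df[["Fare"]].to_numpy()
```

In practice the two are used together: clean and reshape with Pandas, then pass arrays to numerical code.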
Q2: How do I handle large datasets that crash my notebook?
A:
Use smaller dtypes to reduce memory (e.g., df["Age"] = df["Age"].astype("float32")).
Process data in chunks: pd.read_csv("data.csv", chunksize=1000).
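Both ideas can be tried without a real large file. The sketch below builds a CSV in memory as a stand-in for "data.csv" (the generated values are made up), computes a mean chunk by chunk, and compares memory use before and after downcasting.

```python
import io
import pandas as pd

# In-memory stand-in for a large file; a path such as "data.csv" works the same way
csv_text = "Age,Fare\n" + "\n".join(
    f"{20 + i % 50},{10.0 + i}" for i in range(5000)
)

# Chunked reading: only 1000 rows are held in memory at a time
total_fare = 0.0
row_count = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1000):
    total_fare += chunk["Fare"].sum()
    row_count += len(chunk)
mean_fare = total_fare / row_count

# Downcasting: a 32-bit column uses about half the memory of a 64-bit one
df = pd.read_csv(io.StringIO(csv_text))
before = df["Age"].memory_usage(deep=True)
df["Age"] = df["Age"].astype("float32")
after = df["Age"].memory_usage(deep=True)
```

Chunking suits statistics that can be accumulated, such as sums and counts. Downcasting trades precision for memory, so it is safest on columns like ages that do not need many significant digits.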
Q3: Why did my Seaborn plot not display labels correctly?
A: Add plt.xlabel(), plt.ylabel(), or plt.title() after plotting. Example:
sns.histplot(...)
plt.title("Age Distribution")
plt.show()
Q4: How do I export cleaned data for ML models?
A: Save to CSV: df.to_csv("cleaned_data.csv", index=False).
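A quick round trip shows what index=False does. This sketch writes to an in-memory buffer instead of "cleaned_data.csv" so it leaves no file behind; the data and the names buffer and restored are assumptions.

```python
import io
import pandas as pd

# Made-up cleaned data
df = pd.DataFrame({"Age": [22.0, 38.0], "Survived": [1, 0]})

# Write without the row index, so no extra unnamed column appears on reload
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read it back to confirm the columns survived the round trip
buffer.seek(0)
restored = pd.read_csv(buffer)
```

Without index=False, the row index is written as an extra first column, which then shows up as "Unnamed: 0" when the file is read back.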
Q5: What’s the best way to visualize correlations?
A: Use a heatmap:
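sns.heatmap(df.corr(numeric_only=True), annot=True)
(numeric_only=True restricts the correlation to numeric columns; recent Pandas versions raise an error on text columns otherwise.)
Part 3: Common Pitfalls & Fixes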
Overplotting:
Issue: Cluttered visuals (e.g., too many categories in a bar plot).
Fix: Use sns.FacetGrid or limit categories.
Misleading Aggregations:
Issue: mean() skewed by outliers.
Fix: Use median() or remove outliers first.
Ignoring Data Types:
Issue: Treating categorical data (e.g., "Embarked") as numerical.
Fix: Convert to categorical:
df["Embarked"] = df["Embarked"].astype("category").
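Two of these pitfalls are easy to see on toy data. The sketch below uses made-up fares with one extreme value, plus a short "Embarked" column; the names fares, codes, and categories are assumptions.

```python
import pandas as pd

# Misleading aggregation: a single extreme fare drags the mean upward
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 500.0])
mean_fare = fares.mean()      # pulled toward the outlier
median_fare = fares.median()  # stays at the middle value

# Data types: store port codes as a categorical column, not as numbers
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})
df["Embarked"] = df["Embarked"].astype("category")
categories = list(df["Embarked"].cat.categories)  # distinct labels, sorted
codes = df["Embarked"].cat.codes.tolist()         # integer code for each row
```

The integer codes are only identifiers for the labels. They carry no order or magnitude, which is why treating such a column as numerical misleads a model.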
Part 4: Preparing for Phase 2 (Core ML Concepts)
Next Steps:
Supervised Learning: Decision trees, SVMs, KNN (Days 31–40).
Unsupervised Learning: Clustering, PCA (Days 41–50).
Action Items:
Master Pandas operations (e.g., merge, pivot_table).
Practice visualizing relationships between variables (e.g., scatter plots for correlation).
Revisit the Titanic project and try predicting survival with LogisticRegression.
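For practice with pivot_table, here is a minimal sketch that summarizes mean fare by class and sex. The rows are invented for illustration, and the name table is an assumption.

```python
import pandas as pd

# Invented rows for illustration (not the real Titanic records)
titanic = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    ["female", "female", "male", "female", "male", "male"],
    "Fare":   [80.0, 100.0, 60.0, 8.0, 7.0, 9.0],
})

# One row per class, one column per sex, each cell holding the mean fare
table = titanic.pivot_table(
    values="Fare", index="Pclass", columns="Sex", aggfunc="mean"
)
```

pivot_table gives the same kind of result as a two-key groupby followed by unstack, in a single call.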
Final Tips
Documentation is Your Friend: Bookmark Pandas docs and Seaborn examples.
Learn Shortcuts: Use df.describe() or df.info() for quick data summaries.
Stay Curious: Ask why patterns exist (e.g., Why did 1st-class passengers survive more?).
Quote of the Day:
“Without data, you’re just another person with an opinion.”
Phase 1 equipped you with data skills. Phase 2 will turn you into a storyteller with models. 🚀


