Goal: Master data cleaning techniques using Pandas to transform raw, messy datasets into analysis-ready structured data. Learn to handle missing values, duplicates, outliers, and inconsistencies for robust AI/ML pipelines.
1. What is Data Cleaning?
Data cleaning (or data preprocessing) is the process of fixing or removing:
Missing values
Duplicate records
Inconsistent formatting (e.g., dates, categorical labels)
Outliers
Irrelevant columns
Why It Matters in AI/ML:
Garbage in = Garbage out. Clean data is critical for model accuracy.
Most real-world datasets are messy; practitioner surveys commonly report that well over half of data science time goes into cleaning and preparing data.
2. Common Data Issues & Solutions
1. Missing Values
Causes: Sensor errors, manual entry omissions, extraction failures.
Solutions:
Drop rows/columns:
df = df.dropna(axis=0, how="any")   # Drop rows with any missing values
df = df.dropna(axis=1, how="all")   # Drop columns where all values are missing
Impute values:
# Fill with mean/median/mode
df["Age"] = df["Age"].fillna(df["Age"].median())
# Forward/backward fill (time-series data)
df["Revenue"] = df["Revenue"].ffill()
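Before choosing between dropping and imputing, it helps to quantify how much of each column is missing. A minimal sketch, assuming the same DataFrame df used above:
# Count and share of missing values per column
missing_report = pd.DataFrame({
    "missing_count": df.isnull().sum(),
    "missing_pct": df.isnull().mean() * 100,
})
print(missing_report.sort_values("missing_pct", ascending=False))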
2. Duplicates
Causes: Data entry errors, merging datasets.
Solution:
df.drop_duplicates(subset=["Email"], keep="first", inplace=True)
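Before dropping anything, it is often worth inspecting what would be removed. A short sketch, assuming the same "Email" column as above:
# Flag every row involved in a duplicate group and review it
dup_mask = df.duplicated(subset=["Email"], keep=False)
print(f"Rows in duplicate groups: {dup_mask.sum()}")
print(df[dup_mask].sort_values("Email"))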
3. Inconsistent Data
Examples: Mixed date formats (YYYY-MM-DD vs DD/MM/YYYY), typos in categories.
Solutions:
Standardize formats:
# Convert to datetime
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
# Fix categorical typos
df["Category"] = df["Category"].str.lower().replace("electrnic", "electronic")
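When several different typos map to the same label, a replacement dictionary keeps the fix in one place. A sketch with hypothetical category values:
# Normalize case and whitespace, then map known typos to canonical labels (example values are hypothetical)
typo_map = {"electrnic": "electronic", "electroncs": "electronic", "funiture": "furniture"}
df["Category"] = df["Category"].str.strip().str.lower().replace(typo_map)
# Review the remaining labels to catch anything the map missed
print(df["Category"].value_counts())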
4. Outliers
Causes: Sensor errors, rare events.
Detection:
Z-Score: Values beyond ±3 standard deviations (a code sketch follows the IQR example below).
IQR Method: Values outside Q1 − 1.5×IQR or Q3 + 1.5×IQR.
Handling:
# Calculate IQR
Q1 = df["Income"].quantile(0.25)
Q3 = df["Income"].quantile(0.75)
IQR = Q3 - Q1
# Filter outliers
df_clean = df[(df["Income"] >= Q1 - 1.5*IQR) & (df["Income"] <= Q3 + 1.5*IQR)]
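The Z-score rule listed under Detection can be applied the same way; a minimal sketch on the same "Income" column, assuming it is roughly normally distributed:
# Z-score filtering: keep values within 3 standard deviations of the mean
mean, std = df["Income"].mean(), df["Income"].std()
z_scores = (df["Income"] - mean) / std
df_clean_z = df[z_scores.abs() <= 3]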
5. Irrelevant Data
Examples: Unique IDs, free-text columns.
Solution:
df.drop(["User_ID", "Comments"], axis=1, inplace=True)
3. Advanced Cleaning Techniques
1. Text Data Cleaning
Remove whitespace:
df["Name"] = df["Name"].str.strip()
Extract patterns (regex):
df["Phone"] = df["Contact"].str.extract(r"(\d{3}-\d{3}-\d{4})")
2. Categorical Data Encoding
Convert text categories to numbers for ML models:
# One-Hot Encoding
df = pd.get_dummies(df, columns=["City"], prefix="City")
# Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Category"] = le.fit_transform(df["Category"])
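A Pandas-only alternative to LabelEncoder is the category dtype, whose integer codes serve the same purpose; note the codes are assigned alphabetically and carry no real ordering. A sketch:
# Label encoding via the category dtype (codes are arbitrary integers, not an ordering)
df["Category_code"] = df["Category"].astype("category").cat.codes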
3. Scaling/Normalization
Ensure features are on similar scales:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[["Income", "Age"]] = scaler.fit_transform(df[["Income", "Age"]])
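In a real ML pipeline, the scaler should be fitted on the training split only and then applied to the test split, otherwise test information leaks into the features. A sketch using the same "Income" and "Age" columns:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then fit the scaler on the training data only
X_train, X_test = train_test_split(df[["Income", "Age"]], test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse training statistics, do not refit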
4. Real-World Example: Titanic Dataset
Steps:
Load Data:
df = pd.read_csv("titanic.csv")
Handle Missing Values:
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
Drop Irrelevant Columns:
df.drop(["PassengerId", "Cabin", "Ticket"], axis=1, inplace=True)
Fix Inconsistencies:
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
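A quick sanity check after these steps confirms the cleaning worked; a short sketch using the standard Titanic column names:
# Verify the cleaning results before modeling
print(df.isnull().sum())   # Age and Embarked should now show zero missing values
print(df["Sex"].unique())  # expect only 0 and 1 after the mapping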
5. Best Practices
Document Changes: Track cleaning steps in Jupyter notebooks.
Validate Data: Use assertions to check assumptions:
assert df["Age"].isnull().sum() == 0, "Missing values in Age!"
Automate Repetitive Tasks: Create reusable functions for common cleaning workflows.
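The last point can be made concrete with a small helper; a sketch whose function name and defaults are illustrative rather than any standard API:
import pandas as pd

def basic_clean(df, drop_cols=None, fill_numeric="median"):
    """Illustrative helper: drop columns, fill numeric NaNs, remove duplicates."""
    out = df.copy()
    if drop_cols:
        out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    numeric_cols = out.select_dtypes(include="number").columns
    if fill_numeric == "median":
        out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    elif fill_numeric == "mean":
        out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    return out.drop_duplicates()

# Usage (hypothetical column names):
# df = basic_clean(df, drop_cols=["User_ID", "Comments"])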
6. Practice Exercise
Load the House Prices dataset (pd.read_csv("house_prices.csv")).
Drop columns with >30% missing values.
Fill missing values in numeric columns with the median.
Convert categorical columns to lowercase and remove duplicates.
Remove outliers in "Price" using the IQR method.
Solution:
import numpy as np
import pandas as pd

# 1. Load data
df = pd.read_csv("house_prices.csv")
# 2. Drop columns with >30% missing: keep only columns with at least 70% non-missing values
threshold = len(df) * 0.7
df.dropna(thresh=threshold, axis=1, inplace=True)
# 3. Fill numeric missing values
numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
# 4. Clean categorical data
categorical_cols = df.select_dtypes(include="object").columns
df[categorical_cols] = df[categorical_cols].apply(lambda x: x.str.lower())
df.drop_duplicates(inplace=True)
# 5. Remove price outliers
Q1 = df["Price"].quantile(0.25)
Q3 = df["Price"].quantile(0.75)
IQR = Q3 - Q1
df = df[(df["Price"] >= Q1 - 1.5*IQR) & (df["Price"] <= Q3 + 1.5*IQR)]
Key Takeaways
Data Cleaning is the foundation of reliable AI/ML models.
Use Pandas’ fillna(), dropna(), drop_duplicates(), and astype() for basic cleaning.
Handle outliers with domain knowledge or statistical methods like IQR.