Python AI & ML - Day 17: Working with CSV and Excel Files
Goal: Master reading, writing, and manipulating CSV and Excel files using Pandas. Learn to handle common challenges like large datasets, encoding issues, multi-sheet Excel files, and data formatting for robust AI/ML workflows.
1. CSV Files
CSV (Comma-Separated Values) is a plain-text format for tabular data. Each line represents a row, with values separated by a delimiter (commas, tabs, etc.).
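For example, a small CSV file (contents made up for illustration) might look like this:
Name,Age,City
Alice,30,Berlin
Bob,25,Madrid
The same table could just as well use semicolons or tabs as the delimiter instead of commas.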
Reading CSV Files
Use pd.read_csv():
import pandas as pd
# Basic read
df = pd.read_csv("data.csv")
# Custom parameters
df = pd.read_csv(
    "data.csv",
    sep=";",                 # Delimiter (default: ",")
    header=0,                # Row number for headers (default: 0)
    encoding="utf-8",        # Handle special characters
    dtype={"Age": int},      # Specify column data types
    parse_dates=["Date"],    # Convert columns to datetime
    na_values=["?", "NA"],   # Treat specific strings as NaN
    skiprows=2,              # Skip rows at the start
    usecols=["Name", "Age"]  # Read specific columns
)
Writing CSV Files
Use df.to_csv():
df.to_csv(
    "output.csv",
    index=False,        # Exclude row indices
    sep="\t",           # Use tabs as delimiters
    encoding="latin1",
    na_rep="N/A"        # Replace NaN with "N/A"
)
2. Excel Files
Excel (.xlsx/.xls) files support multiple sheets, formulas, and formatting.
Reading Excel Files
Use pd.read_excel():
# Read first sheet
df = pd.read_excel("data.xlsx")
# Read specific sheet by name or index
df = pd.read_excel("data.xlsx", sheet_name="Sheet2")
df = pd.read_excel("data.xlsx", sheet_name=1) # Second sheet
# Handle headers and data
df = pd.read_excel(
    "data.xlsx",
    skiprows=3,         # Skip the first 3 rows
    usecols="A:C",      # Read columns A to C
    engine="openpyxl"   # Required for .xlsx files
)
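If you need every sheet at once, passing sheet_name=None returns a dict of DataFrames keyed by sheet name. A minimal sketch (the file and sheet contents are assumed for illustration):
# Read all sheets into a dict: {sheet name -> DataFrame}
all_sheets = pd.read_excel("data.xlsx", sheet_name=None)

for name, sheet_df in all_sheets.items():
    print(name, sheet_df.shape)  # Inspect each sheet's dimensions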
Writing Excel Files
Use df.to_excel() and pd.ExcelWriter for multi-sheet files:
# Single sheet
df.to_excel("output.xlsx", sheet_name="Data", index=False)
# Multiple sheets
with pd.ExcelWriter("output.xlsx", engine="openpyxl") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")
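You can also add sheets to a workbook that already exists. A sketch, assuming "output.xlsx" was created earlier and a DataFrame df3 to append (if_sheet_exists requires pandas 1.3 or newer):
# Append a sheet to an existing workbook instead of overwriting the file
with pd.ExcelWriter(
    "output.xlsx",
    engine="openpyxl",
    mode="a",                   # Append to the existing file
    if_sheet_exists="replace",  # Replace the sheet if the name already exists
) as writer:
    df3.to_excel(writer, sheet_name="Sheet3", index=False)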
3. Handling Large Files
Chunking (CSV)
Read large files in manageable chunks:
chunk_size = 10_000
chunks = pd.read_csv("large_data.csv", chunksize=chunk_size)
for chunk in chunks:
    process(chunk)  # Process each chunk
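Here process() is just a placeholder. A minimal sketch of one common pattern, filtering each chunk and combining the results at the end (the file path and column name are assumptions for illustration):
import pandas as pd

filtered_chunks = []
for chunk in pd.read_csv("large_data.csv", chunksize=10_000):
    # Keep only the rows of interest so memory stays bounded
    filtered_chunks.append(chunk[chunk["Quantity"] > 10])

# Combine the per-chunk results into a single DataFrame
result = pd.concat(filtered_chunks, ignore_index=True)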
Optimizing Excel
Use openpyxl for .xlsx (modern) and xlrd for .xls (legacy). Avoid formulas or formatting for faster read/write.
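For large .xlsx files it can also help to open the workbook with openpyxl in read-only mode and take cached cell values instead of formulas. A sketch, assuming a file named "big.xlsx" whose first row holds the column headers:
import pandas as pd
from openpyxl import load_workbook

# read_only streams rows instead of loading the whole sheet into memory;
# data_only returns the last cached value of formula cells
wb = load_workbook("big.xlsx", read_only=True, data_only=True)
ws = wb.active

rows = ws.values           # Generator of row tuples
columns = next(rows)       # Use the first row as the header
df = pd.DataFrame(rows, columns=columns)
wb.close()                 # Read-only workbooks should be closed explicitly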
4. Common Challenges & Solutions
1. Encoding Issues
Fix: Specify encoding (e.g., "utf-8", "latin1").
df = pd.read_csv("data.csv", encoding="latin1")
2. Missing Headers
Fix: Assign headers manually.
df = pd.read_csv("data.csv", header=None)
df.columns = ["Name", "Age", "City"]
3. Merged Cells (Excel)
Fix: When reading merged cells, Pandas keeps the value in the first cell and leaves the rest as NaN. Forward-fill after reading:
df["Category"] = df["Category"].ffill()
4. Date Parsing
Fix: Use parse_dates or convert manually.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
5. Advanced Techniques
1. Dynamic File Handling
Loop through multiple files:
import glob
all_files = glob.glob("data/*.csv")
dfs = [pd.read_csv(f) for f in all_files]
combined_df = pd.concat(dfs)
2. Data Validation
Check data integrity after reading:
assert df["Age"].between(0, 100).all(), "Invalid age values!" 3. Excel Formatting
3. Excel Formatting
Use openpyxl or xlsxwriter for styling:
with pd.ExcelWriter("styled.xlsx", engine="xlsxwriter") as writer:
df.to_excel(writer, sheet_name="Sheet1")
workbook = writer.book
worksheet = writer.sheets["Sheet1"]
format = workbook.add_format({"num_format": "$#,##0.00"})
worksheet.set_column("B:B", 15, format) # Format column B 6. Real-World Example
Task: Process sales data from CSV and generate an Excel report.
# Read CSV
sales = pd.read_csv("sales.csv", parse_dates=["Date"])
# Clean data
sales.dropna(subset=["CustomerID"], inplace=True)
sales["Total"] = sales["Quantity"] * sales["Price"]
# Write to Excel with summary
with pd.ExcelWriter("sales_report.xlsx", engine="openpyxl") as writer:
    sales.to_excel(writer, sheet_name="Raw Data", index=False)
    summary = sales.groupby("Region")["Total"].sum().reset_index()
    summary.to_excel(writer, sheet_name="Summary", index=False)
7. Best Practices
Validate Data: Check for missing values, duplicates, and outliers after reading.
Specify Data Types: Use dtype to avoid incorrect type inference (see the sketch after this list).
Use Context Managers: Ensure files are properly closed with with statements.
Document Workflow: Track file sources, cleaning steps, and transformations.
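As a minimal sketch of the dtype point (the file path and column names are assumed), specifying narrower types up front can noticeably cut memory use:
# Downcast numeric columns and use category for low-cardinality strings
df = pd.read_csv(
    "data.csv",
    dtype={"Age": "int16", "Region": "category"},
)

# Check the effect on the memory footprint
print(df.memory_usage(deep=True))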
8. Practice Exercise
Download this sales data CSV.
Read the CSV, filtering rows where Quantity > 10.
Calculate a Total column (Quantity * Price).
Save the filtered data to an Excel file with two sheets:
Raw Data: Original filtered data.
Summary: Total sales per product.
Solution:
# Read and filter
sales = pd.read_csv("sales.csv")
filtered = sales[sales["Quantity"] > 10].copy()  # copy() avoids SettingWithCopyWarning
filtered["Total"] = filtered["Quantity"] * filtered["Price"]
# Write to Excel
with pd.ExcelWriter("filtered_sales.xlsx", engine="openpyxl") as writer:
    filtered.to_excel(writer, sheet_name="Raw Data", index=False)
    summary = filtered.groupby("Product")["Total"].sum().reset_index()
    summary.to_excel(writer, sheet_name="Summary", index=False)
Key Takeaways
Use pd.read_csv()/pd.read_excel() and df.to_csv()/df.to_excel() for I/O operations.
Handle large files with chunking and optimize memory with dtype.
Validate data and use advanced libraries for Excel formatting.


