Python for AI & ML - Day 18: Data Aggregation and Grouping

Feb 25, 2025

Data aggregation and grouping are fundamental techniques in data analysis, enabling you to summarize and derive insights from datasets. This lesson focuses on using Pandas, a powerful Python library, to group data and apply aggregation functions like sum, mean, and count. Let’s break down the concepts and examples step by step.

1. What is Data Aggregation and Grouping?

Grouping: Splitting data into subsets (groups) based on one or more columns (e.g., grouping sales data by region).
Aggregation: Applying a summary function (e.g., sum, mean) to each group to condense the data into meaningful statistics.
Use Cases:
- Calculate total sales per region.
- Find average student scores per class.
- Analyze monthly website traffic.

2. The Split-Apply-Combine Paradigm

Split: Divide the dataset into groups based on a key (e.g., Region).
Apply: Compute a summary statistic (e.g., sum, mean) for each group.
Combine: Merge the results into a new DataFrame.

3. Grouping Data with `groupby`

The groupby() method in Pandas creates a GroupBy object, which can then be aggregated.

Syntax:

grouped = df.groupby("column_name")

Example:

import pandas as pd

# Sample DataFrame
data = {
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 50, 300, 75],
    'Quantity': [10, 15, 20, 5, 30, 7]
}
df = pd.DataFrame(data)

# Group by 'Region'
grouped = df.groupby("Region")

# Calculate total sales per region
total_sales = grouped["Sales"].sum()
print(total_sales)

Output:

Region
East    600
West    275
Name: Sales, dtype: int64

4. Aggregation Functions

Use agg() to apply multiple or custom aggregation functions.

Common Aggregations:

sum(), mean(), count(), min(), max(), std().

Example 1: Multiple Aggregations:

# Group by 'Region' and aggregate Sales and Quantity
result = df.groupby("Region").agg({
    "Sales": ["sum", "mean"],  # Total and average sales
    "Quantity": "count"        # Number of transactions
})
print(result)

Output:

       Sales         Quantity
        sum   mean     count
Region                      
East    600  200.0         3
West    275   91.666667    3

Example 2: Custom Aggregation:

# Define a custom function (e.g., sales range)
def sales_range(x):
    return x.max() - x.min()

# Apply custom aggregation
grouped_custom = df.groupby("Region")["Sales"].agg(sales_range)
print(grouped_custom)

Output:

Region
East    200
West    100
Name: Sales, dtype: int64

5. Grouping by Multiple Columns

Group by multiple keys to analyze interactions between variables.

Example:

# Group by 'Region' and 'Product'
multi_group = df.groupby(["Region", "Product"]).agg({
    "Sales": "sum",
    "Quantity": "mean"
})

print(multi_group)

Output:

                Sales  Quantity
Region Product                
East   A         600      20.0
West   B         275       9.0

6. Resetting the Index

Convert grouped indices (e.g., Region) into columns for readability:

multi_group_reset = multi_group.reset_index()
print(multi_group_reset)

Output:

  Region Product  Sales  Quantity
0   East       A    600      20.0
1   West       B    275       9.0

7. Best Practices and Pitfalls

Handling Missing Data: Use dropna() or fillna() before grouping.
Non-Numeric Columns: Aggregation functions like mean automatically exclude non-numeric columns.
Named Aggregations (Pandas ≥ 0.25):

grouped_named = df.groupby("Region").agg(
    total_sales=("Sales", "sum"),
    avg_quantity=("Quantity", "mean")
)
print(grouped_named)

Output:

      total_sales  avg_quantity
Region                        
East          600          20.0
West          275           9.0

8. Real-World Applications

Business: Analyze sales performance by region/product.
Healthcare: Compare average patient recovery times by treatment.
Education: Calculate average test scores by school district.

Summary

Grouping: Split data into subsets using groupby().
Aggregation: Summarize groups with functions like sum(), mean(), or custom logic.
Multi-Column Grouping: Use lists in groupby() (e.g., ["Region", "Product"]).
Output Control: Use reset_index() or agg() with named aggregations for clarity.

Practice with real datasets (e.g., Titanic) to master these techniques!

Discussion about this post

Ready for more?

Python for AI & ML - Day 18: Data Aggregation and Grouping

1. What is Data Aggregation and Grouping?

2. The Split-Apply-Combine Paradigm

3. Grouping Data with groupby

Syntax:

Example:

4. Aggregation Functions

Common Aggregations:

Example 1: Multiple Aggregations:

Example 2: Custom Aggregation:

5. Grouping by Multiple Columns

Example:

6. Resetting the Index

7. Best Practices and Pitfalls

8. Real-World Applications

Summary

Discussion about this post

Ready for more?

3. Grouping Data with `groupby`