Data aggregation and grouping are fundamental techniques in data analysis, enabling you to summarize and derive insights from datasets. This lesson focuses on using Pandas, a powerful Python library, to group data and apply aggregation functions like sum, mean, and count. Let’s break down the concepts and examples step by step.
1. What is Data Aggregation and Grouping?
Grouping: Splitting data into subsets (groups) based on one or more columns (e.g., grouping sales data by region).
Aggregation: Applying a summary function (e.g., sum, mean) to each group to condense the data into meaningful statistics.
Use Cases:
Calculate total sales per region.
Find average student scores per class.
Analyze monthly website traffic.
2. The Split-Apply-Combine Paradigm
Split: Divide the dataset into groups based on a key (e.g.,
Region
).Apply: Compute a summary statistic (e.g.,
sum
,mean
) for each group.Combine: Merge the results into a new DataFrame.
3. Grouping Data with groupby
The groupby()
method in Pandas creates a GroupBy
object, which can then be aggregated.
Syntax:
grouped = df.groupby("column_name")
Example:
import pandas as pd
# Sample DataFrame
data = {
'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
'Sales': [100, 150, 200, 50, 300, 75],
'Quantity': [10, 15, 20, 5, 30, 7]
}
df = pd.DataFrame(data)
# Group by 'Region'
grouped = df.groupby("Region")
# Calculate total sales per region
total_sales = grouped["Sales"].sum()
print(total_sales)
Output:
Region
East 600
West 275
Name: Sales, dtype: int64
4. Aggregation Functions
Use agg()
to apply multiple or custom aggregation functions.
Common Aggregations:
sum()
,mean()
,count()
,min()
,max()
,std()
.
Example 1: Multiple Aggregations:
# Group by 'Region' and aggregate Sales and Quantity
result = df.groupby("Region").agg({
"Sales": ["sum", "mean"], # Total and average sales
"Quantity": "count" # Number of transactions
})
print(result)
Output:
Sales Quantity
sum mean count
Region
East 600 200.0 3
West 275 91.666667 3
Example 2: Custom Aggregation:
# Define a custom function (e.g., sales range)
def sales_range(x):
return x.max() - x.min()
# Apply custom aggregation
grouped_custom = df.groupby("Region")["Sales"].agg(sales_range)
print(grouped_custom)
Output:
Region
East 200
West 100
Name: Sales, dtype: int64
5. Grouping by Multiple Columns
Group by multiple keys to analyze interactions between variables.
Example:
# Group by 'Region' and 'Product'
multi_group = df.groupby(["Region", "Product"]).agg({
"Sales": "sum",
"Quantity": "mean"
})
print(multi_group)
Output:
Sales Quantity
Region Product
East A 600 20.0
West B 275 9.0
6. Resetting the Index
Convert grouped indices (e.g., Region
) into columns for readability:
multi_group_reset = multi_group.reset_index()
print(multi_group_reset)
Output:
Region Product Sales Quantity
0 East A 600 20.0
1 West B 275 9.0
7. Best Practices and Pitfalls
Handling Missing Data: Use
dropna()
orfillna()
before grouping.Non-Numeric Columns: Aggregation functions like
mean
automatically exclude non-numeric columns.Named Aggregations (Pandas ≥ 0.25):
grouped_named = df.groupby("Region").agg(
total_sales=("Sales", "sum"),
avg_quantity=("Quantity", "mean")
)
print(grouped_named)
Output:
total_sales avg_quantity
Region
East 600 20.0
West 275 9.0
8. Real-World Applications
Business: Analyze sales performance by region/product.
Healthcare: Compare average patient recovery times by treatment.
Education: Calculate average test scores by school district.
Summary
Grouping: Split data into subsets using
groupby()
.Aggregation: Summarize groups with functions like
sum()
,mean()
, or custom logic.Multi-Column Grouping: Use lists in
groupby()
(e.g.,["Region", "Product"]
).Output Control: Use
reset_index()
oragg()
with named aggregations for clarity.
Practice with real datasets (e.g., Titanic) to master these techniques!