35.4 Grouping and Aggregating: groupby, accumulate
How groupby Works: The Sorting Requirement
The groupby function groups consecutive elements from an iterable that share a common key. It is crucial to understand that groupby starts a new group every time the key value changes. It does not retrospectively collect identical items scattered throughout the iterable; it only works on consecutive runs. For this reason, if you want exactly one group per key, the input iterable must be sorted with the same key function that you plan to use for grouping. If the data is not sorted, items with the same key will be split into separate groups, which usually leads to incorrect results.
The function returns an iterator that yields tuples. Each tuple contains two elements:
- The key for the group (the result of the key function).
- An iterator that yields all the items in that group.
You must consume the group iterator before moving to the next group. Since the group iterator shares the underlying iterator with groupby, advancing the main groupby iterator will invalidate the previous group’s iterator.
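The shared-iterator behavior described above can be observed directly. The following is a minimal sketch (the variable names are illustrative): it first collects the (key, group) pairs without reading the groups, then tries to read them afterwards, and contrasts that with materializing each group while it is still current.

```python
from itertools import groupby

words = ["apple", "avocado", "banana"]

# Anti-pattern: list() advances groupby past every group before any group is read.
pairs = list(groupby(words, key=lambda w: w[0]))
stale = [(key, list(group)) for key, group in pairs]

# Safe pattern: materialize each group while it is still the current one.
safe = [(key, list(group)) for key, group in groupby(words, key=lambda w: w[0])]

print(stale[0])  # ('a', []): the 'a' group was skipped over and can no longer be read
print(safe)      # [('a', ['apple', 'avocado']), ('b', ['banana'])]
```

The keys survive in the stale version, but the group contents are lost, which is why this mistake can go unnoticed for a while.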
from itertools import groupby
# Example 1: Incorrect usage without sorting
data = ['apple', 'avocado', 'banana', 'blueberry', 'cherry', 'apricot']
for key, group in groupby(data, key=lambda x: x[0]):
    print(f"Key: {key}, Group: {list(group)}")
# Output:
# Key: a, Group: ['apple', 'avocado']
# Key: b, Group: ['banana', 'blueberry']
# Key: c, Group: ['cherry']
# Key: a, Group: ['apricot'] # 'a' appears again in a separate group!
# Example 2: Correct usage with sorting
sorted_data = sorted(data, key=lambda x: x[0]) # Sort by the first letter
for key, group in groupby(sorted_data, key=lambda x: x[0]):
    print(f"Key: {key}, Group: {list(group)}")
# Output:
# Key: a, Group: ['apple', 'avocado', 'apricot'] # All 'a' items are together
# Key: b, Group: ['banana', 'blueberry']
# Key: c, Group: ['cherry']
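Because the sort and the grouping must use the same key, it can help to bundle both steps so they cannot drift apart. The helper below is a hypothetical convenience function (group_sorted is not part of itertools), sketched under the assumption that the data fits in memory:

```python
from itertools import groupby

def group_sorted(items, key):
    """Sort items by key, then group them; returns {key: [items, ...]}."""
    ordered = sorted(items, key=key)  # sorted() is stable, so ties keep input order
    return {k: list(g) for k, g in groupby(ordered, key=key)}

data = ['apple', 'avocado', 'banana', 'blueberry', 'cherry', 'apricot']
by_letter = group_sorted(data, key=lambda x: x[0])
print(by_letter)
# {'a': ['apple', 'avocado', 'apricot'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```

Passing the key once to the helper guarantees that sorted and groupby agree on what "the same group" means.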
Practical Applications of groupby
groupby is exceptionally powerful for processing and aggregating data that has natural runs or is already sorted. A common use case is processing data from a database query that has been ordered by a specific field.
# Simulating data from a database sorted by department
employees = [
    {'name': 'Alice', 'dept': 'Engineering'},
    {'name': 'Bob', 'dept': 'Engineering'},
    {'name': 'Charlie', 'dept': 'Marketing'},
    {'name': 'Diana', 'dept': 'Marketing'},
    {'name': 'Evan', 'dept': 'Marketing'},
    {'name': 'Faith', 'dept': 'Sales'},
]
# Group by department and perform an aggregation (count employees per dept)
for dept, emp_iter in groupby(employees, key=lambda x: x['dept']):
    emp_list = list(emp_iter)
    print(f"Department: {dept}, Count: {len(emp_list)}")
    # You could also calculate averages, sums, etc., here.
# Output:
# Department: Engineering, Count: 2
# Department: Marketing, Count: 3
# Department: Sales, Count: 1
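Counting is only one possible aggregation. As a rough sketch of sums and averages per group, the example below adds a hypothetical salary field (the values are made up for illustration) and uses operator.itemgetter as the key function in place of a lambda:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records, already ordered by 'dept'; the 'salary' values are invented.
employees = [
    {'name': 'Alice', 'dept': 'Engineering', 'salary': 120},
    {'name': 'Bob', 'dept': 'Engineering', 'salary': 100},
    {'name': 'Charlie', 'dept': 'Marketing', 'salary': 90},
    {'name': 'Faith', 'dept': 'Sales', 'salary': 80},
]

summary = {}
for dept, emp_iter in groupby(employees, key=itemgetter('dept')):
    # Materialize the group once, then reuse it for several aggregates.
    salaries = [emp['salary'] for emp in emp_iter]
    summary[dept] = {'total': sum(salaries), 'average': sum(salaries) / len(salaries)}

print(summary['Engineering'])  # {'total': 220, 'average': 110.0}
```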
Understanding accumulate
The accumulate function returns an iterator that yields accumulated sums (or the accumulated results of any other two-argument function) from an iterable. Each element it yields is the result of applying the function to the previous result and the next item from the iterable. If no function is provided, it defaults to addition.
Conceptually, for an iterable [a, b, c, d, ...], accumulate yields:
- a
- func(a, b)
- func(func(a, b), c)
- func(func(func(a, b), c), d)
- ...
This makes it perfect for calculating running totals, cumulative products, or any other state that builds upon itself.
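The sequence above can be written out as a small generator. This is a simplified stand-in for illustration only (my_accumulate is a hypothetical name, and it omits features of the real itertools.accumulate), but it follows the same left-to-right pattern:

```python
from itertools import accumulate
import operator

def my_accumulate(iterable, func=operator.add):
    """Simplified stand-in for itertools.accumulate (no initial value support)."""
    it = iter(iterable)
    try:
        total = next(it)  # the first output is the first item, unchanged
    except StopIteration:
        return  # empty input yields nothing
    yield total
    for item in it:
        total = func(total, item)  # previous result on the left, new item on the right
        yield total

print(list(my_accumulate([1, 2, 3, 4])))                # [1, 3, 6, 10]
print(list(my_accumulate([1, 2, 3, 4], operator.mul)))  # [1, 2, 6, 24]
```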
from itertools import accumulate
# Default behavior: running total
numbers = [1, 2, 3, 4, 5]
running_total = list(accumulate(numbers))
print(running_total) # Output: [1, 3, 6, 10, 15]
# Using a different function: running product or maximum
running_product = list(accumulate(numbers, func=lambda x, y: x * y))
print(running_product) # Output: [1, 2, 6, 24, 120]
running_max = list(accumulate([3, 1, 4, 2, 5], func=max))
print(running_max) # Output: [3, 3, 4, 4, 5]
Combining accumulate with Other Functions
The real power of accumulate emerges when it’s combined with other iterable tools. Classic examples are computing a running average and pairing accumulate with zip to align the accumulated values with the original data.
# Calculating a running average
data = [10, 20, 30, 40]
running_total = accumulate(data)
counts = accumulate(1 for _ in data) # A running count of items seen
running_avg = [total / count for total, count in zip(running_total, counts)]
print(running_avg) # Output: [10.0, 15.0, 20.0, 25.0]
# A more efficient one-liner using enumerate
running_avg_2 = [total / n for n, total in enumerate(accumulate(data), start=1)]
print(running_avg_2) # Output: [10.0, 15.0, 20.0, 25.0]
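Pairing accumulate with zip to line up each original value with its running total might look like the following sketch (the variable names are illustrative):

```python
from itertools import accumulate

data = [10, 20, 30, 40]

# Each tuple pairs an original value with the running total up to and including it.
rows = list(zip(data, accumulate(data)))
print(rows)  # [(10, 10), (20, 30), (30, 60), (40, 100)]

for value, total in rows:
    print(f"value={value:>3}  running total={total:>4}")
```

Since data here is a list, it can safely be traversed twice. With a one-shot iterator, the values would need to be stored first or duplicated with itertools.tee.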
Pitfalls and Best Practices
- Memory for Infinite Iterators: accumulate is lazy, so it can be applied to potentially infinite iterators (e.g., count()), but never materialize the result with list(); bound it first, for example with islice or takewhile. Also keep in mind that the accumulated value itself can grow very large very quickly, which can cause memory or computation issues.
- Non-Associative Functions: accumulate always performs a left fold: it computes func(func(a, b), c), never func(a, func(b, c)). For associative functions such as addition, the grouping makes no difference. For non-associative functions (e.g., subtraction or exponentiation) the result depends on this order, so make sure a left-to-right accumulation is what you intend.
- Initial Value: By default, the first value yielded is the first element of the iterable. Since Python 3.8, accumulate accepts a keyword-only initial argument that sets a different starting point. On older versions, you can prefix your iterable using chain.
from itertools import chain
numbers = [1, 2, 3]
# Start accumulation with an initial value of 10
result = list(accumulate(chain([10], numbers)))
print(result) # Output: [10, 11, 13, 16]
- Consuming group Iterators: Always convert the group iterator from groupby to a list or tuple within the loop if you need to use the data more than once or after the loop has moved on. The iterator is consumed when you advance to the next group.
data = [1, 1, 2, 2, 2]
grouped = groupby(data)
for key, group_iter in grouped:
    group_list = list(group_iter) # Convert to list NOW
    print(f"Key: {key}")
    print(f"First use: {group_list}")
    print(f"Second use: {group_list}") # This works
    # print(f"This would be empty: {list(group_iter)}") # DON'T do this
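The accumulate-related points above can be checked with a few short experiments. This sketch assumes Python 3.8 or later, where accumulate accepts a keyword-only initial argument; the variable names are illustrative.

```python
from itertools import accumulate, count, islice
import operator

# Infinite input: accumulate stays lazy, so take a bounded slice instead of list().
first_five = list(islice(accumulate(count(1)), 5))
print(first_five)  # [1, 3, 6, 10, 15]

# Left fold with a non-associative function: ((10 - 1) - 2) - 3
left_fold = list(accumulate([10, 1, 2, 3], operator.sub))
print(left_fold)  # [10, 9, 7, 4]

# Python 3.8+: keyword-only initial value; the output has one extra element.
with_initial = list(accumulate([1, 2, 3], initial=10))
print(with_initial)  # [10, 11, 13, 16]
```

Note that with initial the output contains one more element than the input, which matters when zipping the result against the original data.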