75.5 Indexing: Basic, Advanced, and Boolean Mask Indexing
Alright, let’s talk about indexing. This is where NumPy goes from being a mildly interesting spreadsheet to a superpower. You’re about to learn how to grab, slice, dice, and update your data with a precision that would make a neurosurgeon jealous. Forget clumsy loops; this is data manipulation at the speed of thought.
The Basics: It’s Just Like a List (Until It’s Not)
If you’ve used Python lists, you already know the basics. Zero-based indexing, negative indices to count from the end, and the trusty colon (:) for slicing. NumPy arrays play along nicely.
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
print(arr[0]) # 10
print(arr[-1]) # 50
print(arr[1:4]) # [20 30 40] - remember, slices are [start:stop), so stop is exclusive.
The first real difference from lists shows up with a 2D array. You might instinctively think [row, column], and you’d be right. Where a nested Python list needs lst[1][2], NumPy takes comma-separated indices within a single set of brackets. It’s elegant and, frankly, correct. MATLAB users, look upon our works and despair.
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr_2d[1, 2]) # 6 (Second row, third column)
print(arr_2d[0, :]) # [1 2 3] (First row, all columns)
print(arr_2d[:, 1]) # [2 5 8] (All rows, second column)
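Slices combine freely across axes, so the same syntax can pull out sub-blocks or strided selections. The following is a small sketch of that idea; the variable names (top_right, strided, row_1d, row_2d) are made up for illustration.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Rows 0 and 1, columns 1 onward: a 2x2 block from the top-right corner
top_right = arr_2d[0:2, 1:]
print(top_right)      # rows [2 3] and [5 6]

# Every other row, with the columns reversed
strided = arr_2d[::2, ::-1]
print(strided)        # rows [3 2 1] and [9 8 7]

# An integer index drops its axis; a length-1 slice keeps it
row_1d = arr_2d[1, :]     # shape (3,)
row_2d = arr_2d[1:2, :]   # shape (1, 3)
print(row_1d.shape, row_2d.shape)
```

The last two lines are worth remembering: whether you index with an integer or a one-element slice decides how many dimensions the result has.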
Crucially, slicing an array returns a view, not a copy. This is a performance masterstroke. NumPy doesn’t waste time and memory creating a whole new array; it just gives you a new way to look at the same underlying data. Change the view, and you change the original. This is a feature, not a bug, but it will bite you if you forget it.
view_of_arr = arr[1:4]
view_of_arr[0] = 999 # Modify the first element of the view
print(arr) # [ 10 999 30 40 50] The original is changed!
# If you need a copy, be explicit: arr[1:4].copy()
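If you’re ever unsure whether you’re holding a view or a copy, you can ask NumPy directly. Here is a minimal sketch using np.shares_memory and the .base attribute; the names view and copy are just illustrative.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

view = arr[1:4]           # basic slice: a view onto arr's data
copy = arr[1:4].copy()    # explicit copy: independent data

print(np.shares_memory(arr, view))   # True
print(np.shares_memory(arr, copy))   # False
print(view.base is arr)              # True: the view points back at arr

copy[0] = -1              # modifying the copy leaves arr alone
print(arr[1])             # 20
```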
Fancy Indexing: Because Sometimes You Need a Specific Shopping List
What if you don’t want a neat, contiguous slice? What if you want rows 1 and 3, and columns 0 and 2? Enter “fancy indexing” (I didn’t name it, but it fits). You pass in a list (or array) of indices. This is where the magic starts.
arr = np.arange(16).reshape(4, 4) # A 4x4 array for clarity
print(arr)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]
#  [12 13 14 15]]
# Grab rows at index 1 and 3
print(arr[[1, 3]])
# [[ 4  5  6  7]
#  [12 13 14 15]]
# Grab rows 1 and 3, and from those rows, grab columns 0 and 2
print(arr[[1, 3]][:, [0, 2]])
# [[ 4  6]
#  [12 14]]
# The same block in a single indexing step: arr[np.ix_([1, 3], [0, 2])]
# Careful: arr[[1, 3], [0, 2]] pairs the indices up and returns [ 4 14]
Here’s the critical insight: fancy indexing always returns a copy. The result isn’t a view onto the original array because the selected data isn’t arranged in a regular, strided pattern in memory. NumPy has to assemble it fresh. Remember this distinction; it’s a cornerstone of understanding performance.
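Two details of fancy indexing are easy to trip over: index lists passed together are paired element by element, and the result really is detached from the original. The sketch below illustrates both on the same 4x4 array; np.ix_ is a real NumPy helper, while the names paired and block are just for this example.

```python
import numpy as np

arr = np.arange(16).reshape(4, 4)

# Two index lists in one bracket are paired element-wise:
# this picks elements (1, 0) and (3, 2), not a 2x2 block.
paired = arr[[1, 3], [0, 2]]
print(paired)    # [ 4 14]

# np.ix_ builds an open mesh, selecting every row/column combination.
block = arr[np.ix_([1, 3], [0, 2])]
print(block)     # rows [4 6] and [12 14]

# The result is a copy: writing to it does not touch arr.
block[0, 0] = -1
print(arr[1, 0])                      # 4
print(np.shares_memory(arr, block))   # False
```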
Boolean Mask Indexing: The Real Superpower
This is, without exaggeration, one of the best features in all of scientific computing. Instead of providing integer indices, you provide a boolean array (a “mask”) of the same shape as your array. True means “keep this value,” and False means “skip it.”
You usually create the mask by performing a logical operation on the array itself.
arr = np.array([1, 5, 2, 10, 3, 8])
mask = arr > 4
print(mask) # [False True False True False True]
print(arr[mask]) # [ 5 10 8]
# This is equivalent to the gloriously concise:
print(arr[arr > 4]) # [ 5 10 8]
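Masks can also be combined. One thing to watch: NumPy uses the element-wise operators &, | and ~ here rather than Python’s and, or and not, and each comparison needs its own parentheses because of operator precedence. A short sketch, with illustrative variable names:

```python
import numpy as np

arr = np.array([1, 5, 2, 10, 3, 8])

# AND: strictly between 2 and 10
in_range = arr[(arr > 2) & (arr < 10)]
print(in_range)    # [5 3 8]

# OR: below 2 or above 8
extremes = arr[(arr < 2) | (arr > 8)]
print(extremes)    # [ 1 10]

# NOT: everything that is not greater than 4
not_big = arr[~(arr > 4)]
print(not_big)     # [1 2 3]
```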
It gets even better with 2D arrays. Let’s say you have a dataset and want to pull out all values above a certain threshold.
data = np.random.randn(5, 5) # 5x5 of random numbers
print(data)
# [[ 0.928 1.234 -0.456 0.112 -1.402]
# [-0.129 0.555 2.789 -0.891 0.761]
# [ 1.003 -0.432 -0.567 0.099 1.876]
# [ 0.888 -2.101 0.345 -0.654 0.123]
# [ 1.999 0.001 -1.111 0.888 -0.444]]
# Find all values greater than 1.5
high_values = data[data > 1.5]
print(high_values) # A 1D array, in row-major order (for the sample above): [2.789 1.876 1.999]
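This is brutally efficient and readable. The alternative (nested for loops with if statements) is typically far slower on large arrays, because the looping happens in the Python interpreter rather than in NumPy’s compiled code, and it’s an eyesore. This is the NumPy way.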
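To make the comparison concrete, here is a sketch that runs the mask and the nested loops side by side and checks that they agree. The seed, the 0.5 threshold, and the variable names are arbitrary choices for this example, and np.argwhere is included as one way to recover where the selected values came from.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen arbitrarily, for repeatability
data = rng.standard_normal((5, 5))

mask = data > 0.5
selected = data[mask]                 # always 1D, values in row-major order

# The same selection with explicit loops, for comparison
looped = []
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        if data[i, j] > 0.5:
            looped.append(data[i, j])

print(np.array_equal(selected, np.array(looped)))   # True

# The mask itself remembers the positions: one (row, col) pair per hit
positions = np.argwhere(mask)
print(positions.shape)
```

Both versions walk the array row by row, which is why the flattened results come out in the same order.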
Assignment: The Power to Change Reality
All these indexing methods work flawlessly on the left-hand side of an assignment operator. This is how you perform targeted, bulk updates.
arr = np.arange(10)
arr[arr % 2 == 1] = -1 # Set all odd numbers to -1
print(arr) # [ 0 -1 2 -1 4 -1 6 -1 8 -1]
arr_2d = np.zeros((4, 4))
arr_2d[[0, -1], :] = 1 # Set first and last row to all 1s
arr_2d[:, [0, -1]] = 2 # Set first and last column to all 2s
print(arr_2d)
# [[2. 1. 1. 2.]
# [2. 0. 0. 2.]
# [2. 0. 0. 2.]
# [2. 1. 1. 2.]]
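One caveat follows from the view/copy distinction: assignment only sticks when the whole selection happens in a single indexing expression. Chain two fancy indexing steps on the left-hand side and you end up writing into a temporary copy. A minimal sketch of the difference:

```python
import numpy as np

arr = np.arange(16).reshape(4, 4)

# Chained: arr[[1, 3]] produces a temporary copy, so this write is lost.
arr[[1, 3]][:, [0, 2]] = -1
print(arr[1, 0])    # still 4

# Single indexing expression: the write lands in arr itself.
arr[np.ix_([1, 3], [0, 2])] = -1
print(arr[1])       # [-1  5 -1  7]
print(arr[3])       # [-1 13 -1 15]
```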
The one major “gotcha” here is assigning values whose data type doesn’t match the array’s. This applies to every form of indexed assignment, not just boolean masks. If you have an integer array and assign a float into it, NumPy will silently truncate the fractional part. It’s not wrong, per se (it’s following the rules of the array’s dtype), but it has ruined many a scientist’s afternoon. Always know your dtypes.
int_arr = np.arange(5)
mask = np.array([True, False, False, True, False])
int_arr[mask] = 3.14 # The 3.14 gets truncated to 3
print(int_arr) # [3 1 2 3 4] # Wait, where's my .14?!
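One way out, sketched below, is to convert to a float dtype before assigning. The astype call returns a new array, so the integer original is left as it was. The np.can_cast check is an optional extra that flags the mismatch up front.

```python
import numpy as np

int_arr = np.arange(5)
mask = np.array([True, False, False, True, False])

# Convert first, then assign: the fractional part survives.
float_arr = int_arr.astype(np.float64)
float_arr[mask] = 3.14
print(float_arr)    # [3.14 1.   2.   3.14 4.  ]
print(int_arr)      # [0 1 2 3 4] (astype made a copy)

# Optional sanity check: float -> int is not a "same kind" cast.
print(np.can_cast(np.float64, int_arr.dtype, casting="same_kind"))   # False
```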
Master these three forms of indexing—basic, fancy, and boolean—and you’ve effectively learned the language NumPy uses to talk about data. Everything else is just vocabulary.