75.5 Indexing: Basic, Advanced, and Boolean Mask Indexing

Alright, let’s talk about indexing. This is where NumPy goes from being a mildly interesting spreadsheet to a superpower. You’re about to learn how to grab, slice, dice, and overwrite your data with a precision that would make a neurosurgeon jealous. Forget clumsy loops; this is data manipulation at the speed of thought.

The Basics: It’s Just Like a List (Until It’s Not)

If you’ve used Python lists, you already know the basics. Zero-based indexing, negative indices to count from the end, and the trusty colon (:) for slicing. NumPy arrays play along nicely.

import numpy as np

arr = np.array([10, 20, 30, 40, 50])
print(arr[0])   # 10
print(arr[-1])  # 50
print(arr[1:4]) # [20 30 40] - remember, slices are [start:stop), so stop is exclusive.

The first “gotcha” comes with a 2D array. With nested Python lists you would chain brackets, as in lst[1][2]. You might instinctively think [row, column] instead, and you’d be right. NumPy uses this comma-separated syntax within a single set of brackets. It’s elegant and, frankly, correct. MATLAB users will recognize the row-then-column order, minus the one-based counting.

arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr_2d[1, 2])  # 6 (Second row, third column)
print(arr_2d[0, :])  # [1 2 3] (First row, all columns)
print(arr_2d[:, 1])  # [2 5 8] (All rows, second column)
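The same comma syntax takes a slice on each axis, so you can cut out sub-blocks, step through rows, or reverse columns in one expression. Here is a small sketch (the name m is just for illustration) that also shows a detail worth knowing: an integer index drops a dimension, while a length-one slice keeps it.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# First two rows, columns from index 1 onward
print(m[:2, 1:])
# [[2 3]
#  [5 6]]

# Every other row, with the columns reversed
print(m[::2, ::-1])
# [[3 2 1]
#  [9 8 7]]

# An integer index removes an axis; a slice of length one keeps it
print(m[1].shape)    # (3,)
print(m[1:2].shape)  # (1, 3)
```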

Crucially, slicing an array returns a view, not a copy. This is a performance masterstroke. NumPy doesn’t waste time and memory creating a whole new array; it just gives you a new way to look at the same underlying data. Change the view, and you change the original. This is a feature, not a bug, but it will bite you if you forget it.

view_of_arr = arr[1:4]
view_of_arr[0] = 999  # Modify the first element of the view
print(arr)  # [ 10 999  30  40  50]  The original is changed!
# If you need a copy, be explicit: arr[1:4].copy()
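If you are ever unsure whether you are holding a view or a copy, you can ask NumPy directly. The sketch below uses np.shares_memory and the .base attribute; the variable names are made up for the example.

```python
import numpy as np

base = np.array([10, 20, 30, 40, 50])

view = base[1:4]          # basic slice: a view onto base
copy = base[1:4].copy()   # explicit copy: independent data

print(np.shares_memory(base, view))  # True
print(np.shares_memory(base, copy))  # False
print(view.base is base)             # True: the view points back at its source

view[0] = 999   # writes through to base
copy[1] = -1    # touches only the copy
print(base)     # [ 10 999  30  40  50]
```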

Fancy Indexing: Because Sometimes You Need a Specific Shopping List

What if you don’t want a neat, contiguous slice? What if you want rows 1 and 3, and columns 0 and 2? Enter “fancy indexing” (I didn’t name it, but it fits; the NumPy documentation calls it advanced indexing). You pass in a list (or array) of indices. This is where the magic starts.

arr = np.arange(16).reshape(4, 4)  # A 4x4 array for clarity
print(arr)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]
#  [12 13 14 15]]

# Grab rows at index 1 and 3
print(arr[[1, 3]])
# [[ 4  5  6  7]
#  [12 13 14 15]]

# Grab rows 1 and 3, and from those rows, grab columns 0 and 2
print(arr[[1, 3]][:, [0, 2]])
# Alternatively, in a single indexing step: arr[np.ix_([1, 3], [0, 2])]
# [[ 4  6]
#  [12 14]]

Here’s the critical insight: fancy indexing always returns a copy. The result isn’t a view onto the original array because the selected data isn’t arranged in a regular, strided pattern in memory. NumPy has to assemble it fresh. Remember this distinction; it’s a cornerstone of understanding performance.
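Two details are easy to get wrong here, so a short sketch may help (grid, pairs, and block are illustrative names). Passing two index lists in one bracket pairs them up element by element instead of crossing rows with columns; np.ix_ builds the cross product. And because the result is a copy, writing to it leaves the original alone.

```python
import numpy as np

grid = np.arange(16).reshape(4, 4)

# Two index lists are paired up: elements (1, 0) and (3, 2)
pairs = grid[[1, 3], [0, 2]]
print(pairs)  # [ 4 14]

# np.ix_ crosses rows 1 and 3 with columns 0 and 2
block = grid[np.ix_([1, 3], [0, 2])]
print(block)
# [[ 4  6]
#  [12 14]]

# The result is a copy, so this write does not reach grid
block[0, 0] = -1
print(grid[1, 0])                     # 4
print(np.shares_memory(grid, block))  # False
```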

Boolean Mask Indexing: The Real Superpower

This is, without exaggeration, one of the best features in all of scientific computing. Instead of providing integer indices, you provide a boolean array (a “mask”) of the same shape as your array. True means “keep this value,” and False means “skip it.”

You usually create the mask by performing a logical operation on the array itself.

arr = np.array([1, 5, 2, 10, 3, 8])
mask = arr > 4
print(mask)  # [False  True False  True False  True]

print(arr[mask])  # [ 5 10  8]
# This is equivalent to the gloriously concise:
print(arr[arr > 4])  # [ 5 10  8]

It gets even better with 2D arrays. Let’s say you have a dataset and want to pull out all values above a certain threshold.

data = np.random.randn(5, 5)  # 5x5 of random numbers
print(data)
# [[ 0.928  1.234 -0.456  0.112 -1.402]
#  [-0.129  0.555  2.789 -0.891  0.761]
#  [ 1.003 -0.432 -0.567  0.099  1.876]
#  [ 0.888 -2.101  0.345 -0.654  0.123]
#  [ 1.999  0.001 -1.111  0.888 -0.444]]

# Find all values greater than 1.5
high_values = data[data > 1.5]
print(high_values)  # A 1D array, in row-major order: [2.789 1.876 1.999]

This is brutally efficient and readable. The alternative, nested for loops with if statements, is typically far slower on arrays of any real size, and an eyesore. This is the NumPy way.
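Masks also combine. The sketch below (with an invented scores array) uses the element-wise operators & and ~; the parentheses around each comparison are required because & binds more tightly than the comparison operators, and Python’s plain and/or do not work element-wise on arrays.

```python
import numpy as np

scores = np.array([[55, 82, 91],
                   [67, 48, 99]])

# Element-wise AND of two conditions; note the parentheses
passing = (scores >= 60) & (scores < 95)
print(scores[passing])   # [82 91 67]

# ~ inverts a mask
print(scores[~passing])  # [55 48 99]

# A mask is just an array of booleans, so you can count the hits
print(np.count_nonzero(passing))  # 3
```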

Assignment: The Power to Change Reality

All these indexing methods work flawlessly on the left-hand side of an assignment operator. This is how you perform targeted, bulk updates.

arr = np.arange(10)
arr[arr % 2 == 1] = -1  # Set all odd numbers to -1
print(arr)  # [ 0 -1  2 -1  4 -1  6 -1  8 -1]

arr_2d = np.zeros((4, 4))
arr_2d[[0, -1], :] = 1  # Set first and last row to all 1s
arr_2d[:, [0, -1]] = 2  # Set first and last column to all 2s
print(arr_2d)
# [[2. 1. 1. 2.]
#  [2. 0. 0. 2.]
#  [2. 0. 0. 2.]
#  [2. 1. 1. 2.]]
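The right-hand side doesn’t have to be a single number. As a rough sketch (values and table are names invented for this example), you can assign one value per selected element through a mask, or hand over a row that NumPy broadcasts across every selected row.

```python
import numpy as np

values = np.arange(6)
evens = values % 2 == 0

# Three True entries in the mask, so three values on the right
values[evens] = [100, 200, 300]
print(values)  # [100   1 200   3 300   5]

table = np.zeros((3, 4), dtype=int)

# One row of four values, broadcast into both selected rows
table[[0, 2], :] = [1, 2, 3, 4]
print(table)
# [[1 2 3 4]
#  [0 0 0 0]
#  [1 2 3 4]]
```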

The one major “gotcha” here is assigning values whose type doesn’t match the array’s dtype, and it applies to every form of indexed assignment, not just boolean masks. If you have an integer array and assign a float into it, NumPy will silently truncate the fractional part. It’s not wrong, per se (it’s following the rules of the array’s dtype), but it has ruined many a scientist’s afternoon. Always know your dtypes.

int_arr = np.arange(5)
mask = np.array([True, False, False, True, False])
int_arr[mask] = 3.14  # The 3.14 gets truncated to 3
print(int_arr)  # [3 1 2 3 4]  # Wait, where's my .14?!
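One way out, sketched below, is to convert to a float dtype before assigning. The astype call returns a new array, so the original integer array stays as it was. np.can_cast is used here only to show that float-to-integer is not a “safe” cast.

```python
import numpy as np

int_arr = np.arange(5)
mask = np.array([True, False, False, True, False])

# float64 -> integer is not a safe cast, which is why data gets lost
print(np.can_cast(np.float64, int_arr.dtype, casting="safe"))  # False

# Convert first, then assign; astype returns a new float64 array
float_arr = int_arr.astype(float)
float_arr[mask] = 3.14
print(float_arr)  # [3.14 1.   2.   3.14 4.  ]
print(int_arr)    # [0 1 2 3 4]
```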

Master these three forms of indexing—basic, fancy, and boolean—and you’ve effectively learned the language NumPy uses to talk about data. Everything else is just vocabulary.