80.8 GPU Acceleration: .to(device) and CUDA

Right, let’s talk about making your models go brrrrr. You’ve built this beautiful neural network, you hit ‘train’, and then… you go make a cup of coffee. And then lunch. Maybe you take a nap. This is the universe telling you that your model is probably still running on your laptop’s CPU, which for deep learning is about as effective as using a bicycle to tow a freight train.

The solution is to move your model and its data onto a Graphics Processing Unit (GPU). These things are basically massive, parallel number-crunching factories, and they are a big part of why modern deep learning is practical at all. Now, the way you do this in code is deceptively simple, but the devil, as always, is in the details. Let’s get you out of the bicycle business.

The First Rule of GPU Club: Always Check if a GPU is Available

Never, ever assume a GPU is present. Your code might run on your beefy desktop, but then it’ll fall flat on its face when your collaborator tries to run it on their Raspberry Pi. You must ask politely.

In PyTorch:

import torch

# This is the incantation. Learn it.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

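In TensorFlow/Keras: TensorFlow is a bit more… forward. By default it reserves nearly all the memory on every GPU it can see, not at import time but the first time it actually initializes the GPUs, which is both arrogant and pragmatic. (If that bothers you, tf.config.experimental.set_memory_growth lets it allocate on demand instead.) You can check what it sees.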

import tensorflow as tf

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
# TensorFlow will automatically use the GPU if it finds one.

Explicit is Better Than Implicit: Moving Your Stuff with .to(device)

Both frameworks share the same core rule: your model and your data need to be on the same device. You can’t bake a cake if the bowl is in the kitchen (CPU) and the oven is in the garage (GPU). Break the rule in PyTorch and you get the most common bug people run into: RuntimeError: Expected all tensors to be on the same device...

PyTorch Code Example: PyTorch requires you to be explicit. You are the boss. You tell everything where to go.

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 1. Define a model
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
# 2. Move the entire model to the device (this moves all its parameters)
model = model.to(device)

# 3. Now, your data must follow!
dummy_input = torch.randn(32, 100)  # This is on CPU by default
dummy_input = dummy_input.to(device) # Move it to the same device as the model

# Now you can perform a forward pass
output = model(dummy_input)
print(f"Output is on: {output.device}")
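Real batches are rarely a single tensor; they tend to be dicts or tuples of tensors plus the odd string or ID. The following is a minimal sketch of a helper that walks such a structure and moves every tensor it finds. The name to_device and the example batch layout are illustrative assumptions, not part of PyTorch.

```python
import torch


def to_device(batch, device):
    """Recursively move tensors inside dicts, lists, and tuples to `device`.

    Hypothetical helper for illustration; plain lists/tuples only
    (namedtuples would need special handling).
    """
    if isinstance(batch, torch.Tensor):
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(to_device(item, device) for item in batch)
    return batch  # non-tensor values (ints, strings, ...) are left untouched


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# An assumed batch layout: two tensors plus a list of string IDs.
batch = {
    "features": torch.randn(4, 100),
    "labels": torch.tensor([0, 1, 2, 3]),
    "ids": ["a", "b", "c", "d"],
}
batch = to_device(batch, device)
print({k: v.device for k, v in batch.items() if isinstance(v, torch.Tensor)})
```

Calling this once at the top of each training step keeps the "everything on one device" rule in a single place instead of scattering .to(device) calls through the loop.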

TensorFlow/Keras Code Example: TensorFlow, bless its heart, tries to be simpler. It generally handles device placement for you, but this magic can backfire. For full control, you can wrap your code in a tf.device() scope.

import tensorflow as tf

# Typically, TensorFlow places ops on the first GPU by default if one exists.
# To be explicit without hardcoding a GPU that may not be there:
device_name = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'
with tf.device(device_name):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(100,)),
        tf.keras.layers.Dense(50, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    dummy_input = tf.random.normal((32, 100))
    output = model(dummy_input)

print(f"Output is on: {output.device}")

CUDA Warm-Up: Why the First Batch is Always Slow

You run your training loop and notice the first epoch takes 10 seconds while the others take 0.5. You didn’t break anything. This is normal. The first time you touch the GPU, PyTorch/TensorFlow has to do a pile of one-time setup: create the CUDA context, load the kernels it needs onto the device, grab memory from the driver for its allocator, and in some configurations try out several algorithms (cuDNN autotuning) or compile a graph (tf.function). People sometimes call this “kernel launch overhead,” but that name really belongs to the tiny per-call cost of launching any kernel, which never goes away; what you are seeing here is warm-up. After that, the pipeline is primed and everything runs at full tilt. So always benchmark your code after a warm-up run, and because GPU calls return before the work is finished, synchronize before you stop the clock.
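A minimal sketch of benchmarking after a warm-up, assuming PyTorch. The function name time_forward and its warmup/iters defaults are made up for illustration. The one non-obvious detail is torch.cuda.synchronize(): CUDA calls are asynchronous, so without it the timer can stop before the GPU has finished.

```python
import time

import torch
import torch.nn as nn


def time_forward(model, inputs, warmup=3, iters=10):
    """Return average seconds per forward pass, excluding warm-up runs."""
    on_gpu = inputs.is_cuda
    with torch.no_grad():
        for _ in range(warmup):       # pay the one-time setup cost here
            model(inputs)
        if on_gpu:
            torch.cuda.synchronize()  # make sure warm-up work has finished
        start = time.perf_counter()
        for _ in range(iters):
            model(inputs)
        if on_gpu:
            torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        elapsed = time.perf_counter() - start
    return elapsed / iters


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10)).to(device)
inputs = torch.randn(32, 100, device=device)

print(f"{time_forward(model, inputs) * 1e3:.3f} ms per forward pass")
```

On a CPU-only machine the synchronize calls are skipped and the function still works, so the same benchmark script runs everywhere.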

The Memory Wall: You’re Not Just Moving Compute, You’re Moving Data

Here’s the part everyone forgets: your data lives in the CPU’s RAM. Before the GPU can work on a batch, that data has to be copied over the PCIe bus to the GPU’s VRAM. This takes time. If your data preprocessing pipeline is slow on the CPU, your mighty GPU will sit there idle, twiddling its thousands of cores, waiting for its next packet of data. This is called being “input-bound” (or, more loosely, “CPU-bound”).

The solution? Make your data loading asynchronous. Use torch.utils.data.DataLoader with num_workers > 0 to parallelize data loading and preprocessing across worker processes. In TensorFlow, use the tf.data.Dataset API and end the pipeline with .prefetch(tf.data.AUTOTUNE). Either way, the CPU prepares batch N+1 while the GPU is crunching on batch N.
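Here is a rough sketch of the PyTorch side, under a few assumptions: the helper names make_loader and run_epoch are invented for this example, the dataset is random noise, and two workers is an arbitrary starting point to tune for your machine. It also uses pin_memory and non_blocking=True, two standard options that let host-to-GPU copies overlap with compute.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=32, num_workers=2):
    """Build a DataLoader that prepares batches in background worker processes."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,               # > 0 means load in parallel
        pin_memory=torch.cuda.is_available(),  # page-locked RAM speeds up GPU copies
    )


def run_epoch(loader, device):
    """Move each batch to `device` and return the number of samples seen."""
    seen = 0
    for features, labels in loader:
        # non_blocking=True lets the copy overlap with GPU work when memory is pinned
        features = features.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        seen += features.shape[0]  # a real loop would run forward/backward here
    return seen


if __name__ == "__main__":  # guard needed on platforms that start workers via spawn
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    dataset = TensorDataset(torch.randn(256, 100), torch.randint(0, 10, (256,)))
    print(run_epoch(make_loader(dataset), device))
```

More workers is not automatically better; past the point where the GPU is kept busy, extra processes just cost RAM and startup time.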

The Nuclear Option: .cuda() and Why You Shouldn’t Use It

You’ll see old PyTorch code using .cuda() directly. This is the legacy way.

# DON'T DO THIS (most of the time)
model = model.cuda()
tensor = tensor.cuda()

Why is this bad? It’s hardcoded. It will fail with a nasty error if no GPU is present. .to(device) is safer and more readable because the decision about which device to use lives in one place, and the rest of your code just follows it. The same goes for TensorFlow’s with tf.device('/GPU:0'): it’s better to write logic that checks for availability first.
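One way to keep helper functions device-agnostic is to ask the model where it lives instead of hardcoding anything. This is a small sketch of that idiom; the function name predict is an assumption for illustration, and it presumes the model has at least one parameter and that all of them sit on the same device.

```python
import torch
import torch.nn as nn


def predict(model, inputs):
    """Run a forward pass on whatever device the model already lives on."""
    device = next(model.parameters()).device  # assumes a single-device model
    with torch.no_grad():
        return model(inputs.to(device))


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(100, 10).to(device)

# The input is created on the CPU; predict() moves it to match the model.
output = predict(model, torch.randn(32, 100))
print(output.shape, output.device)
```

No .cuda() anywhere, and the same function works whether the model was moved to a GPU or left on the CPU.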

The Multi-GPU Mindset

If you’re lucky enough to have more than one GPU, the game changes. A bare 'cuda' just means the current default GPU (index 0 unless you change it), so now you have to say which one you want: device = torch.device('cuda:0') for the first GPU, 'cuda:1' for the second, and so on. For true parallel training across multiple GPUs, you’ll need to graduate to torch.nn.DataParallel or the more efficient torch.nn.parallel.DistributedDataParallel. But that, my friend, is a topic for another section. For now, master the single GPU. It’s the workhorse that will carry you 95% of the way.
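As a small sketch of picking a specific GPU safely, the helper below (pick_device is a hypothetical name) returns cuda:<index> only if that GPU actually exists and otherwise falls back to the CPU. It relies on torch.cuda.device_count(), which reports 0 when CUDA is unavailable.

```python
import torch


def pick_device(index=0):
    """Return cuda:<index> if that GPU exists, otherwise fall back to the CPU."""
    if torch.cuda.is_available() and 0 <= index < torch.cuda.device_count():
        return torch.device(f"cuda:{index}")
    return torch.device("cpu")


# List whatever GPUs this machine has (prints nothing on a CPU-only box).
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

print(pick_device(0), pick_device(1))
```

On a single-GPU machine pick_device(1) quietly returns the CPU; if you would rather fail loudly on a bad index, raise an error in the fallback branch instead.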