81.3 Hugging Face Transformers: Loading Pretrained Models

Right, let’s get our hands dirty. You’ve heard the hype, you’ve seen the demos, and now you want to actually use one of these so-called “transformers.” Welcome to the main event. Hugging Face’s transformers library is the reason a lot of us can actually do this without needing a PhD and a bank loan for compute time. It’s a brilliantly engineered abstraction layer over a frankly absurd number of pretrained models. Our first job is to stop staring at the menu and actually get a model into your Python runtime.

The workhorses here are the AutoModel and AutoTokenizer classes. Think of them as your savvy, multilingual friend who can figure out exactly what you need from a model’s name: they look up the checkpoint’s configuration and pick the matching concrete class. You don’t need to know if it’s a BertModel or a RobertaForSequenceClassification; the Auto class handles that tedious mapping for you. This is a killer feature because it means your code doesn’t instantly break when you want to try a different model architecture.
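To see that mapping in action without downloading anything, here is a minimal sketch. It assumes transformers and PyTorch are installed, and it uses AutoModel.from_config with deliberately tiny, made-up configuration values, so the weights are random. The class lookup is the same idea from_pretrained relies on when it reads a checkpoint's configuration.

```python
from transformers import AutoModel, BertConfig, GPT2Config

# Tiny, made-up sizes so both models build instantly.
# The weights are random: this illustrates class selection, not a usable model.
tiny_bert = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                       num_attention_heads=2, intermediate_size=64)
tiny_gpt2 = GPT2Config(vocab_size=100, n_positions=32, n_embd=32,
                       n_layer=2, n_head=2)

# Same call both times; the Auto class picks the concrete architecture.
bert_like = AutoModel.from_config(tiny_bert)
gpt2_like = AutoModel.from_config(tiny_gpt2)

print(type(bert_like).__name__)  # BertModel
print(type(gpt2_like).__name__)  # GPT2Model
```

The calling code never names BertModel or GPT2Model, which is why swapping architectures usually means changing one string rather than rewriting imports.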

The Dynamic Duo: Model and Tokenizer

You will always need two things: the model itself and its corresponding tokenizer. They’re a matched set. The tokenizer’s job is to take your human-readable text, chop it up into tokens (words or pieces of words), and convert those into numbers (input IDs) that the model can actually understand. It also creates the attention mask (which tells the model which tokens to pay attention to and which to ignore, like padding) and, for models that use them, token type IDs (which mark which segment a token belongs to, such as the question versus the context in question answering). The model then takes these prepared numbers and does its magical math.
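If the input IDs and attention mask feel abstract, this toy sketch in plain Python may help. It is not how a real tokenizer works internally: the vocabulary, the special tokens, and the toy_encode helper are all invented for illustration, and real tokenizers use learned subword vocabularies rather than whitespace splitting. The shape of the output is the point.

```python
# Toy illustration only: an invented vocabulary with a few special tokens.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "this": 4, "is": 5, "a": 6, "sample": 7, "input": 8}


def toy_encode(texts, vocab=TOY_VOCAB):
    """Split on whitespace, map tokens to IDs, and pad to a common length."""
    encoded = []
    for text in texts:
        tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
        # Unknown words fall back to the [UNK] ID.
        encoded.append([vocab.get(tok, vocab["[UNK]"]) for tok in tokens])

    max_len = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        pad = max_len - len(ids)
        input_ids.append(ids + [vocab["[PAD]"]] * pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_encode(["This is a sample input", "This is new"])
print(batch["input_ids"])       # [[2, 4, 5, 6, 7, 8, 3], [2, 4, 5, 1, 3, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0, 0]]
```

Notice that the IDs only mean something relative to this particular vocabulary, which is exactly why a tokenizer and a model have to come from the same checkpoint.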

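Trying to use a GPT-2 tokenizer with a BERT model is like trying to use a French dictionary to parse a Japanese sentence. If you’re lucky, it fails loudly with an index error because the two vocabularies differ in size. If you’re unlucky, it runs without complaint and quietly produces nonsense, because the same ID means different tokens to each side. Always load them as a pair from the same checkpoint.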

from transformers import AutoModel, AutoTokenizer

# Pick a model. This is the tiny, friendly version of BERT.
model_name = "prajjwal1/bert-tiny"

# This is the incantation. It downloads the model and tokenizer from Hugging Face's hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

And just like that, you’ve got the power of a transformer (albeit a tiny one) on your machine. The first time you run this, it’ll download the model files and cache them (usually in ~/.cache/huggingface/hub), so the next time it’s lightning fast.

Picking Your Poison: Model Classes Matter

Here’s where the designers were both clever and, let’s be honest, a bit confusing. AutoModel is great, but it just gives you the base “raw” transformer architecture—the body without a head. It outputs hidden states, which are incredibly useful, but not what you want for a specific task like classification.

This is why you have task-specific AutoClasses. This is a critical choice and a common pitfall. Using the wrong one is like expecting a car’s engine block to drive you home; you’re missing the steering wheel, seats, and, well, the whole car.

# Need to do sequence classification? (e.g., sentiment analysis) Use this:
from transformers import AutoModelForSequenceClassification
model_for_classification = AutoModelForSequenceClassification.from_pretrained(model_name)

# How about question answering? Use this:
from transformers import AutoModelForQuestionAnswering
model_for_qa = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Or maybe token classification? (e.g., Named Entity Recognition)
from transformers import AutoModelForTokenClassification
model_for_ner = AutoModelForTokenClassification.from_pretrained(model_name)

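Each of these bolts a task-specific head on top of the base model. One caveat: the head is only pretrained if the checkpoint itself was fine-tuned for that task. Load a plain base checkpoint into AutoModelForSequenceClassification and the library will warn you that the head’s weights are newly initialized, which means meaningless predictions until you fine-tune. The AutoModel class is the foundation; these are the houses built on it, and some of them still need finishing.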
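The difference between the body and a task head is easiest to see in the output shapes. The sketch below assumes transformers and PyTorch are installed and builds both models from a tiny, made-up BertConfig instead of a real checkpoint, so nothing is downloaded and every weight (head included) is random.

```python
import torch
from transformers import AutoModel, AutoModelForSequenceClassification, BertConfig

# Made-up sizes for illustration; num_labels=3 is an arbitrary choice.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64, num_labels=3)

body = AutoModel.from_config(config).eval()
classifier = AutoModelForSequenceClassification.from_config(config).eval()

# A fake batch: 1 sequence of 8 random token IDs.
input_ids = torch.randint(0, config.vocab_size, (1, 8))

with torch.no_grad():
    hidden = body(input_ids=input_ids).last_hidden_state
    logits = classifier(input_ids=input_ids).logits

print(hidden.shape)  # torch.Size([1, 8, 32]): one hidden vector per token
print(logits.shape)  # torch.Size([1, 3]): one score per label
```

The base model hands back a vector for every token, while the classification model reduces the sequence to one score per label. With random weights those scores mean nothing yet, which is the situation you are in whenever the head was not part of the checkpoint.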

The Device Dilemma: CPU, GPU, and MPS

By default, these models load onto your CPU. For anything beyond a toy example, this is a recipe for watching paint dry. You need to explicitly tell the model to move to your GPU (or Apple’s Metal Performance Shaders, MPS, on modern Macs).

import torch

# Prefer a CUDA GPU, then Apple's MPS backend (Apple Silicon Macs), then the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")

# Move the model to the device
model.to(device)

A crucial “gotcha”: when you process your inputs with the tokenizer, the resulting dictionary-like object (with input_ids, attention_mask, etc.) is still on the CPU. You need to send those tensors to the same device as your model before you feed them in, or you’ll get a frustrating runtime error complaining that tensors live on different devices.

# Tokenize some sample text
inputs = tokenizer("This is a sample input.", return_tensors="pt")

# This line is VITALLY important. Move the entire inputs dict to the same device.
inputs = {k: v.to(device) for k, v in inputs.items()}

# Now you can run the model
outputs = model(**inputs)
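Here is a self-contained variation on the same pattern that runs offline. It assumes transformers and PyTorch are installed, uses a tiny randomly initialized model built from a made-up BertConfig, and fakes the tokenizer output with a hand-built BatchEncoding (the type the tokenizer returns). It also shows two optional refinements: BatchEncoding has its own .to(device) method, and torch.no_grad() skips gradient tracking when you only want predictions.

```python
import torch
from transformers import AutoModel, BatchEncoding, BertConfig

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tiny made-up config; weights are random, so the outputs are meaningless.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
tiny_model = AutoModel.from_config(config).to(device).eval()

# Stand-in for tokenizer(..., return_tensors="pt"): 1 sequence of 6 tokens.
inputs = BatchEncoding({
    "input_ids": torch.randint(0, config.vocab_size, (1, 6)),
    "attention_mask": torch.ones(1, 6, dtype=torch.long),
})

# Moves every tensor in the batch in one call.
inputs = inputs.to(device)

# No gradients needed for inference, which saves memory.
with torch.no_grad():
    outputs = tiny_model(**inputs)

print(outputs.last_hidden_state.shape)   # torch.Size([1, 6, 32])
print(outputs.last_hidden_state.device)  # same device as the model
```

The dictionary comprehension and the .to(device) call do the same job; the comprehension just leaves you with a plain dict afterwards.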

Saving and Loading: Your Local Cache

You’re not limited to downloading from the hub. You can save your model and tokenizer to a local directory, either after fine-tuning or just because you want a local backup. save_pretrained writes the same directory layout that from_pretrained expects, so the round trip just works.

# Save the model and tokenizer to a directory
model.save_pretrained("./my_local_model_directory")
tokenizer.save_pretrained("./my_local_model_directory")

# Later, or in another script, load them back just by pointing to the path
local_model = AutoModel.from_pretrained("./my_local_model_directory")
local_tokenizer = AutoTokenizer.from_pretrained("./my_local_model_directory")
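If you want to convince yourself the round trip is lossless, a quick check along these lines works. This is a sketch under a few assumptions: transformers and PyTorch are installed, the model is a tiny randomly initialized one built from a made-up BertConfig (so nothing is downloaded), and the save location is a throwaway temporary directory.

```python
import os
import tempfile

import torch
from transformers import AutoModel, BertConfig

# Tiny made-up config; random weights are fine for a round-trip check.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
original = AutoModel.from_config(config)

with tempfile.TemporaryDirectory() as save_dir:
    original.save_pretrained(save_dir)
    saved_files = sorted(os.listdir(save_dir))

    # Loading from a path works exactly like loading from a hub name.
    reloaded = AutoModel.from_pretrained(save_dir)

    # Compare every tensor by name to confirm nothing changed on disk.
    reloaded_state = reloaded.state_dict()
    weights_match = all(
        torch.equal(tensor, reloaded_state[name])
        for name, tensor in original.state_dict().items()
    )

print(saved_files)     # includes config.json plus a weights file
print(weights_match)
```

The config.json file in that directory is what lets the Auto classes work out which architecture to rebuild when you load from the path.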

This is the foundation. You now know how to pull a model off the shelf, choose the right type for your task, get it running on your hardware, and save your work. Now the real fun begins: actually feeding it data.