18.7 BERT: Bidirectional Encoder Pre-Training

Right, so you’ve heard of Transformers. You’ve seen the diagrams with all the “Attention” arrows pointing everywhere like a conspiracy theorist’s bulletin board. But BERT? BERT is the one that actually read the manual. While GPT-style language models were busy reading strictly left-to-right like it was a particularly dull novel, BERT had a brilliant, simple idea: maybe words are defined by the words on both sides of them. You know, like in every human conversation ever.

That’s the core of it. Bidirectionality. BERT’s pre-training is a masterclass in clever, self-supervised learning. It doesn’t need you to sit there and label a million examples. It just needs a big pile of text (hello, Wikipedia) and it plays two games: Masked Language Model (MLM) and Next Sentence Prediction (NSP). We’ll get to the latter, which is… let’s say contentious.

The Masked Language Model (MLM) - The Core Innovation

This is the star of the show. Instead of predicting the next word in a sequence (which only ever lets the model look to the left), BERT randomly selects 15% of the tokens in its input and hides most of them. Its job is to fill in the blanks. This forces the model to integrate contextual information from both the left and the right to make its prediction. It’s the difference between a detective who only interviews people who were at the crime scene before it happened, and one who interviews everyone.

Here’s the quirky part of that 15%: it’s not just masked tokens. To make the model more robust, the authors did something a bit weird and wonderful:

80% of the time: Replace the token with [MASK]. Example: The quick brown [MASK] jumps.
10% of the time: Replace it with a random token. Example: The quick brown banana jumps. (This prevents the model from getting too cozy with the [MASK] token and forgetting how to handle real words during fine-tuning.)
10% of the time: Leave it unchanged. Example: The quick brown fox jumps. (This biases the representation towards the word that was actually observed.)

In all three cases the model still has to predict the original token at that position.

This is a small but crucial implementation detail. It’s why BERT works so well when you fine-tune it on actual downstream tasks where there are no [MASK] tokens.
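
To make the 80/10/10 recipe concrete, here is a minimal pure-Python sketch. It is not the original BERT data pipeline: the function name mask_tokens, the toy vocabulary, and the choice to roll the dice independently for each token are all assumptions made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) following the 80/10/10 recipe.

    labels[i] holds the original token at every selected position and
    None everywhere else, so a loss would only be computed where a
    prediction is actually required.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)

    for i, token in enumerate(tokens):
        # Never select special tokens, and skip roughly 85% of ordinary ones.
        if token in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue

        labels[i] = token  # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # Remaining 10%: leave the token unchanged.

    return corrupted, labels


tokens = ["[CLS]", "the", "quick", "brown", "fox", "jumps", "[SEP]"]
toy_vocab = ["banana", "dog", "lazy", "over"]  # hypothetical vocabulary
corrupted, labels = mask_tokens(tokens, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

With a sentence this short, most runs select zero or one token. The proportions only become visible over a large corpus.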

Next Sentence Prediction (NSP) - The Controversial Sibling

BERT was also trained to understand the relationship between two sentences. During pre-training, it’s fed pairs of sentences (A and B). 50% of the time, sentence B is the actual next sentence from the corpus. The other 50% of the time, it’s a random sentence plucked from somewhere else. The model’s job is to predict: “Is this next sentence plausible or not?”
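
The 50/50 pairing described above can be sketched in a few lines. This is an illustrative construction, not the original pre-training script: make_nsp_pairs is a hypothetical helper, and it assumes the corpus is a list of at least two non-empty documents, each already split into sentences.

```python
import random


def make_nsp_pairs(documents, rng=None):
    """Build (sentence_a, sentence_b, is_next) training examples.

    documents: list of documents, each a list of sentences in order.
    """
    if len(documents) < 2:
        raise ValueError("need at least two documents to draw negatives from")
    rng = rng or random.Random()
    examples = []

    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B really follows A in the corpus.
                examples.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: B is drawn from a different document.
                others = [j for j in range(len(documents)) if j != doc_index]
                other_doc = documents[rng.choice(others)]
                examples.append((doc[i], rng.choice(other_doc), False))

    return examples


docs = [
    ["The fox jumped.", "The dog barked.", "Everyone went home."],
    ["Stocks fell on Monday.", "Analysts blamed the weather."],
]
for a, b, is_next in make_nsp_pairs(docs, rng=random.Random(0)):
    print(is_next, "|", a, "|", b)
```

Notice that the negatives come from a different document, which usually means a different topic. That is one way to see why the task can be solved with little more than topic matching.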

Now, let’s be direct: most follow-up research suggests NSP is a bit of a dud. The task is too easy; the model seems to learn topic detection more than deep, logical sentence cohesion. Later models like RoBERTa famously ditched it entirely and reported downstream results that were as good or slightly better without it. But for the original BERT, it was part of the recipe, and you need to understand it because it defines the input format.

The Input Soup: How to Feed BERT

Because of these two tasks, a BERT input is a special concoction. It’s not just a string of words. It’s a carefully formatted sequence of tokens:

[CLS] The quick brown fox [SEP] jumps over the lazy dog [SEP]

[CLS]: The Classification token. We stick this at the beginning of every input sequence. The hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks (e.g., sentiment analysis). You can think of it as the model’s “summary” of the entire input.
[SEP]: The Separator token. It’s used to separate the two sentences in tasks like NSP or Question Answering. Even if you’re only using one sentence, you often still need to terminate it with a [SEP] due to how the model was trained.

This structure is non-negotiable. If you just throw a raw sentence at a BERT model without these special tokens, you’re asking for subpar results.
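
As a rough sketch of what that formatting involves, the helper below assembles the token sequence and the matching segment IDs (the token_type_ids that tell the model which sentence each token belongs to). The name build_bert_input is hypothetical, and the whitespace split stands in for real WordPiece tokenization. In practice the tokenizer does all of this for you.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Wrap one or two token lists in [CLS]/[SEP] and build segment IDs."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # segment 0 covers [CLS], A, and the first [SEP]

    if tokens_b is not None:
        second = list(tokens_b) + ["[SEP]"]
        tokens += second
        segment_ids += [1] * len(second)  # segment 1 covers B and the final [SEP]

    return tokens, segment_ids


tokens, segment_ids = build_bert_input(
    "the quick brown fox".split(),
    "jumps over the lazy dog".split(),
)
print(tokens)
# ['[CLS]', 'the', 'quick', 'brown', 'fox', '[SEP]', 'jumps', 'over', 'the', 'lazy', 'dog', '[SEP]']
print(segment_ids)
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```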

Putting It Into Practice: A Hugging Face Example

Enough theory. Let’s see how you actually use this thing. The transformers library from Hugging Face is the canonical way to do this. First, you need a tokenizer. Remember, BERT uses WordPiece tokenization, so words get broken down into subwords.

from transformers import BertTokenizer, BertForSequenceClassification
import torch

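# Load the pre-trained tokenizer and model.
# Note: 'bert-base-uncased' ships only the pre-trained encoder. The classification
# head on top is randomly initialized, so predictions mean nothing until you fine-tune.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.eval()  # disable dropout for inference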

# Prepare your input text. Let's do sentiment analysis.
inputs = tokenizer("I loved that movie, the acting was incredible!", return_tensors="pt")

# Let's see what the tokenizer actually did. This is crucial for debugging.
print(inputs)
# A dict-like object holding three tensors, each of shape (1, 12):
#   input_ids      - token IDs, starting with 101 ([CLS]) and ending with 102 ([SEP])
#   token_type_ids - all 0s, because there is only one segment
#   attention_mask - all 1s, because nothing is padded

# The input_ids are the numerical tokens. Decode them back to see the special tokens.
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))
# ['[CLS]', 'i', 'loved', 'that', 'movie', ',', 'the', 'acting', 'was', 'incredible', '!', '[SEP]']
# See? The [CLS] and [SEP] are added automatically.

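# Pass the inputs through the model (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# The logits for our classification (e.g., positive/negative sentiment)
logits = outputs.logits
print(logits)
# A tensor of shape (1, 2), one raw score per class. The exact values change from run
# to run until the model is fine-tuned, because the classification head starts out random.
# After fine-tuning, apply a softmax to get probabilities and take the argmax as the label.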
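
The softmax step mentioned in the last comment is short enough to write by hand. Below is a dependency-free sketch; the logits are made-up numbers chosen for illustration, not real model output. With PyTorch tensors you would normally call torch.softmax(logits, dim=-1) instead.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-0.5, 2.0]  # hypothetical scores for (negative, positive)
probs = softmax(example_logits)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```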

Common Pitfalls and Best Practices

You Forgot the Attention Mask: In the example above there was only one sentence, so there was nothing to pad. In the real world, you batch sentences of different lengths. You pad the shorter ones with [PAD] tokens to make them equal length. The attention_mask tensor (all 1s above) tells the model which tokens are real (1) and which are padding (0). The tokenizer builds it for you when you call it with padding=True. You must pass it to the model or it will try to attend to meaningless padding tokens and ruin your results.
Max Sequence Length is 512: This is a hard limit. BERT’s positional embeddings are only pre-trained for up to 512 tokens. If your document is longer, you must truncate it or chunk it. This is its biggest architectural weakness. Don’t try to finagle longer sequences; it won’t work well.
Fine-Tuning is Everything: The pre-trained BERT is a knowledge-rich but task-agnostic genius. You must fine-tune it on your specific dataset (e.g., your own set of movie reviews for sentiment). Out of the box it is no good at a specific task, because a freshly attached classification head starts from random weights. The power comes from a few epochs of focused training on your data, which adapts those contextualized representations to your problem.
The [CLS] Token is a Workhorse: For sentence-level classification tasks, you will almost always take the final hidden state of the [CLS] token, run it through a small neural network classifier, and use that for your prediction. It’s what it’s there for.
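
The first two pitfalls are easier to see in code. The sketch below pads a batch while building the attention mask, then splits an over-long document into overlapping windows that each fit the 512-token limit. Both helpers (pad_batch, chunk_ids) and the stride of 128 are assumptions for illustration; the IDs 0, 101 and 102 are the [PAD], [CLS] and [SEP] IDs in bert-base-uncased. In practice the Hugging Face tokenizer handles padding and truncation for you via padding=True and truncation=True.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # bert-base-uncased special token IDs


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real, 0 = padding
    return input_ids, attention_mask


def chunk_ids(ids, max_len=512, stride=128):
    """Split content IDs (no special tokens) into overlapping windows.

    Each window is wrapped in [CLS] ... [SEP] and is at most max_len long.
    Consecutive windows share `stride` tokens of context.
    """
    body = max_len - 2  # leave room for [CLS] and [SEP]
    if not 0 <= stride < body:
        raise ValueError("stride must be in [0, max_len - 2)")
    step = body - stride

    chunks, start = [], 0
    while True:
        window = ids[start:start + body]
        chunks.append([CLS_ID] + list(window) + [SEP_ID])
        if start + body >= len(ids):
            break  # this window reached the end of the document
        start += step
    return chunks


padded, mask = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(padded)
print(mask)

long_document = list(range(1000, 2200))  # 1,200 stand-in token IDs
print([len(chunk) for chunk in chunk_ids(long_document)])
```

Overlapping windows are one common workaround, not the only one. Whether you then average, max-pool, or vote over the per-chunk predictions is a design choice that depends on your task.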

BERT isn’t the newest model anymore, but its bidirectional insight is forever. It changed the game from “process the sequence” to “understand the context.” Everyone who came after stood on its shoulders, even if they eventually kicked the NSP toy off the playground.