21.8 Token Counting and Cost Estimation

Right, let’s talk about money. Not the abstract “compute” kind, but the very real, “oh god, my API bill” kind. When you’re working with Large Language Models, every token is a tiny little digital penny flying out of your wallet. Understanding how to count them and estimate cost isn’t just academic; it’s a survival skill. I’ve seen more than one developer’s jaw drop when they got their first monthly invoice after a careless for loop. Let’s make sure that’s not you.

The first and most important rule: Never, ever trust your human intuition for counting tokens. You might look at the string “Hello, world!” and think “that’s 13 characters, maybe 2 or 3 tokens.” The model’s tokenizer, however, might see it as ['Hello', ',', ' world', '!'] — 4 tokens. Or it might be ['Hel', 'lo', ',', ' world', '!'] — 5. The discrepancy is where the financial pain lives.

Why Token ≠ Word (or even Character)

This is the core of the issue. Subword tokenization schemes like BPE, WordPiece, and Unigram (BPE and Unigram are both available in the SentencePiece library) are designed to cover arbitrary text with a fixed-size vocabulary. They do this by breaking words into smaller, reusable chunks. The word “indivisibility” might be tokenized as ['ind', 'iv', 'is', 'ibility']. This is brilliant for the model’s vocabulary size, but a nightmare for our accounting. The token count is not a fixed multiple of the character or word count, and the split is different for every single tokenizer. This is why you must use the tokenizer itself to do the counting. There is no shortcut.
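To make the chunking concrete, here is a toy sketch. It uses greedy longest-match lookup (the style of lookup WordPiece uses at inference time) over a tiny, made-up vocabulary. TOY_VOCAB and toy_tokenize are hypothetical names for illustration only; no real tokenizer has this vocabulary, and real BPE works by applying learned merges rather than this lookup.

```python
# Toy illustration only: a made-up vocabulary and a greedy longest-match
# splitter. Real tokenizers learn tens of thousands of pieces from data.
TOY_VOCAB = {"ind", "iv", "is", "ibility", "token", "s"}


def toy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(toy_tokenize("indivisibility", TOY_VOCAB))
print(toy_tokenize("tokens", TOY_VOCAB))
```

One 14-character word, four tokens. Swap in a different vocabulary and the same word splits differently, which is exactly why counts from one tokenizer tell you nothing reliable about another.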

How to Actually Count Tokens (The Right Way)

The only way to get an accurate count is to use the exact same tokenizer your target model uses. If you’re using OpenAI’s API, you must use tiktoken. If you’re using an open-weight model like Llama 3, you must use its Hugging Face tokenizer. Let’s get our hands dirty with code.

For OpenAI’s Models (using tiktoken):

First, install it: pip install tiktoken. Now, let’s count.

import tiktoken

# You MUST specify the encoding for the model you're using.
# Using the wrong encoding will give you wildly wrong counts.
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Your elaborate text goes here. 密码？ 🤯"

# Encode the text to get the token integers
tokens = enc.encode(text)
print(f"Tokens (integers): {tokens}")

# Decode them to see what they look like (optional, but enlightening)
print(f"Tokens (text): [{' | '.join(enc.decode_single_token_bytes(t).decode('utf-8', errors='replace') for t in tokens)}]")

num_tokens = len(tokens)
print(f"Number of tokens: {num_tokens}")

Running this will show you that the emoji and the Chinese character are almost certainly multiple tokens each. This is the reality you need to confront.

For Hugging Face Models (using transformers):

Install: pip install transformers. This is the standard for most open-source models.

from transformers import AutoTokenizer

# Load the tokenizer for your specific model
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # Example (gated repo: accept the license on the Hub first)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "Your elaborate text goes here. 密码？ 🤯"

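# Tokenize the text. By default this adds special tokens (e.g. a
# beginning-of-sequence token), so the count can exceed len(tokenize(text)).
inputs = tokenizer(text)
tokens = inputs["input_ids"]  # This is a list of token IDs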

# To see the actual token strings, use tokenize()
token_strings = tokenizer.tokenize(text)
print(f"Token strings: {token_strings}")

num_tokens = len(tokens)
print(f"Number of tokens: {num_tokens}")

Estimating Cost and Context Windows

Now for the math. Let’s say you’re using GPT-4o. As of my writing, the input cost is $5.00 per 1 million tokens (prices change, so check the provider’s current pricing page). Your 1,000-token prompt will cost you (1,000 / 1,000,000) * $5.00 = $0.005. That’s half a cent. Seems cheap, right? Now imagine you’re processing 10,000 customer reviews that are 500 tokens each. That’s 5 million tokens, or 5 * $5.00 = $25. And that’s just for the input. If you’re asking the model to generate a 100-token summary for each, that’s another 1 million output tokens, billed at their own rate, which is typically higher than the input rate. It scales terrifyingly fast.
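The arithmetic above fits in a few lines. This is a minimal sketch, not a billing tool: estimate_cost is a hypothetical helper, the $5.00 input rate is the figure quoted in the text, and the output rate below is a made-up placeholder that you should replace with your provider’s actual price.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Return the estimated cost in dollars for one batch of requests."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


INPUT_PRICE = 5.00    # dollars per 1M input tokens (figure from the text)
OUTPUT_PRICE = 15.00  # ASSUMPTION: placeholder output rate, not a quoted price

reviews = 10_000
input_tokens = reviews * 500   # 500-token review each
output_tokens = reviews * 100  # 100-token summary each

input_only = estimate_cost(input_tokens, 0, INPUT_PRICE, OUTPUT_PRICE)
total = estimate_cost(input_tokens, output_tokens, INPUT_PRICE, OUTPUT_PRICE)
print(f"Input only: ${input_only:.2f}")
print(f"Input + output: ${total:.2f}")
```

Keeping the prices as parameters rather than constants buried in the function means a price change is a one-line edit, and you can run the same estimate against several models before committing to one.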

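This is also how you check whether your prompt fits in the model’s context window. Llama 3 8B has a context window of 8,192 tokens. If your prompt plus your desired max output length exceeds that, something has to give: depending on the serving stack, the request is rejected, the input is truncated, or the output is cut short. In a chat application that trims old turns to make room, the model effectively forgets the beginning of your conversation. It’s like trying to stuff a mattress into a suitcase; it just won’t fit, and you’ll lose things.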
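A pre-flight check for this is only a subtraction. The sketch below assumes you already have a token count for the prompt from the right tokenizer; fits_in_context and remaining_budget are hypothetical helper names, and 8,192 is the Llama 3 8B figure from the text.

```python
LLAMA3_8B_CONTEXT = 8_192  # context window from the text


def remaining_budget(
    prompt_tokens: int,
    max_output_tokens: int,
    context_window: int = LLAMA3_8B_CONTEXT,
) -> int:
    """Tokens left over after the prompt and the reserved output (negative = overflow)."""
    return context_window - prompt_tokens - max_output_tokens


def fits_in_context(
    prompt_tokens: int,
    max_output_tokens: int,
    context_window: int = LLAMA3_8B_CONTEXT,
) -> bool:
    """True if the prompt plus the reserved output fits in the window."""
    return remaining_budget(prompt_tokens, max_output_tokens, context_window) >= 0


for prompt_tokens in (7_000, 7_500):
    ok = fits_in_context(prompt_tokens, max_output_tokens=1_000)
    left = remaining_budget(prompt_tokens, max_output_tokens=1_000)
    print(f"prompt={prompt_tokens}: fits={ok}, remaining={left}")
```

Run the check before sending the request, so that you decide what gets trimmed instead of leaving it to the serving stack.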

Common Pitfalls and How to Avoid Them

Assuming Whitespace Doesn’t Count: It does. A space at the beginning of a string can be a token. A newline (\n) is almost always a token. Clean your inputs, but count after cleaning.
Ignoring Non-English Text: This is the biggest trap. A single Chinese character, a Japanese kanji, or an emoji can be 2, 3, or even more tokens. Most tokenizer vocabularies are trained on predominantly English text, so other scripts get a raw deal. Always test with representative data.
Forgetting System Prompts and Messages: In a chat API, you’re not just sending the user’s message. You’re sending the entire conversation history, including all the {"role": "system", "content": "..."} and {"role": "assistant", "content": "..."} messages. You must tokenize the entire payload structure, not just the latest user input. Note that tiktoken itself does not ship a helper for this; the OpenAI Cookbook publishes an example function, num_tokens_from_messages, that you can copy and adapt. For Hugging Face models, tokenizer.apply_chat_template renders the conversation in the model’s chat format so you can count the result.
Using the Wrong Tokenizer: Using a GPT-2 tokenizer to estimate for a Claude model is a recipe for financial disaster. The counts will be completely different. Match the tokenizer to the model, exactly.
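To illustrate the system-prompt pitfall, here is a rough sketch of counting a whole conversation rather than just the last message. Everything named here is hypothetical: count_chat_tokens is not a library function, and the per-message and reply overheads are assumptions, because the formatting tokens a chat API wraps around each message vary by model. The whitespace counter at the bottom is a stand-in so the example runs without downloads; in real use, pass a function backed by your model’s actual tokenizer.

```python
from typing import Callable


def count_chat_tokens(
    messages: list[dict[str, str]],
    count_tokens: Callable[[str], int],
    per_message_overhead: int = 0,  # ASSUMPTION: model-specific, look it up
    reply_overhead: int = 0,        # ASSUMPTION: model-specific, look it up
) -> int:
    """Sum tokens over every message in the payload, plus formatting overhead."""
    total = reply_overhead
    for message in messages:
        total += per_message_overhead
        total += count_tokens(message["role"])
        total += count_tokens(message["content"])
    return total


def placeholder_count(text: str) -> int:
    """Stand-in counter for the demo only. NOT a real tokenizer."""
    return len(text.split())


history = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Summarize this review for me."},
    {"role": "assistant", "content": "Great product, slow shipping."},
    {"role": "user", "content": "Shorter."},
]

latest_only = placeholder_count(history[-1]["content"])
whole_payload = count_chat_tokens(history, placeholder_count, per_message_overhead=3)
print(f"Latest message only: {latest_only}")
print(f"Whole payload: {whole_payload}")
```

Even with a fake counter, the gap between the two numbers makes the point: the history is what you pay for on every turn, and it grows with each exchange.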

The best practice is to build token counting into your application’s monitoring. Log the token counts of your requests. Set up alerts for unexpectedly high counts. This isn’t just about cost; a spike in tokens might mean you’re accidentally feeding the model an entire PDF instead of a paragraph. Consider it a canary in the coal mine for your application’s sanity.
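A bare-bones version of that monitoring might look like the sketch below. It uses only the standard logging module; record_request, the logger name, and the 4,000-token threshold are all hypothetical choices for illustration, and a production setup would more likely send these numbers to whatever metrics system you already run.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("token_usage")

ALERT_THRESHOLD = 4_000  # ASSUMPTION: pick a value that suits your workload


def record_request(
    request_id: str,
    prompt_tokens: int,
    alert_threshold: int = ALERT_THRESHOLD,
) -> bool:
    """Log the token count for a request; return True if it trips the alert."""
    logger.info("request=%s prompt_tokens=%d", request_id, prompt_tokens)
    if prompt_tokens > alert_threshold:
        logger.warning(
            "request=%s exceeded threshold: %d > %d tokens",
            request_id,
            prompt_tokens,
            alert_threshold,
        )
        return True
    return False


record_request("req-001", 350)     # an ordinary paragraph-sized prompt
record_request("req-002", 52_000)  # somebody fed it the whole PDF
```

Returning a boolean keeps the function easy to wire into other things, for example refusing to send the request at all when the count is absurd.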