41.7 Bedrock Fine-Tuning and Continued Pre-Training

Alright, let’s talk about making these foundation models actually yours. Because let’s be honest, out-of-the-box models are impressive, but they’re like a brilliant intern who’s read every book in the library yet has no clue about your specific business, your internal jargon, or your weirdly named projects from 2014. That’s where fine-tuning and continued pre-training come in. Think of it as giving that intern an intensive, hyper-focused crash course in your world.

The key thing to understand here is the distinction between the two main approaches, because Bedrock, in its infinite wisdom, uses the terms in a way that might make a purist’s eye twitch. I’ll clear it up for you.

Fine-Tuning: The Specialist Surgeon

Fine-tuning is what you probably think of first. You take a massive, general-purpose model and you give it a small, highly curated dataset of examples specific to your task. Not every model in the Bedrock catalog can be customized, so check the console for the current list in your region; Amazon Titan Text is one family that can. This isn’t about teaching it new facts; it’s about teaching it a new style or format. You’re tuning its behavior, hence the name.

Common use cases are classic supervised learning problems:

Text Classification: Is this customer support email “Urgent,” “Normal,” or “Low Priority”?
Named Entity Recognition (NER): Pulling out project codenames, internal product IDs, or custom terminology from documents.
Specific Text Generation Styles: Making a model write all its responses in the style of a 19th-century sea captain, or generating product descriptions that perfectly match your existing e-commerce catalog’s tone.

Your dataset for this is typically a single JSONL (JSON Lines) file containing a few hundred to a few thousand records. Each line is one complete example of an input and the desired output.

{"prompt": "Review the following email and classify its sentiment: 'I am absolutely furious, this package never arrived!'", "completion": "Negative"}
{"prompt": "Review the following email and classify its sentiment: 'The customer service rep was a delight, solved my issue instantly.'", "completion": "Positive"}
{"prompt": "Review the following email and classify its sentiment: 'I received the invoice, reference number INV-78901.'", "completion": "Neutral"}
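Since a single malformed line can sink a job after you’ve already waited on it, it’s worth linting the file before upload. Here’s a minimal sketch, assuming the prompt/completion field names from the examples above; the function name validate_jsonl is my own invention, not part of any AWS tooling, and other base models may expect a different record shape.

```python
import json

# Field names taken from the examples above; adjust if your base model differs.
REQUIRED_KEYS = ("prompt", "completion")


def validate_jsonl(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems


if __name__ == "__main__":
    sample = [
        '{"prompt": "Classify: Package never arrived!", "completion": "Negative"}',
        '{"prompt": "Classify: Great service."}',
        "not json at all",
    ]
    for number, problem in validate_jsonl(sample):
        print(f"line {number}: {problem}")
```

Run it over open("training-data.jsonl") before you copy the file to S3; it only checks structure, so label quality is still on you.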

You kick off a fine-tuning job with the AWS CLI or SDK. Here’s the gist of it, using the CLI:

# --custom-model-name is required, and the base model must be one that supports fine-tuning
# (Titan Text Express here). Hyperparameter names and ranges vary by base model.
aws bedrock create-model-customization-job \
  --job-name my-classification-job \
  --custom-model-name my-classification-model \
  --customization-type FINE_TUNING \
  --role-arn arn:aws:iam::123456789012:role/BedrockFineTuningRole \
  --base-model-identifier amazon.titan-text-express-v1 \
  --training-data-config '{"s3Uri": "s3://my-bucket/training-data.jsonl"}' \
  --output-data-config '{"s3Uri": "s3://my-bucket/output/"}' \
  --hyper-parameters '{
      "epochCount": "3",
      "batchSize": "1",
      "learningRate": "0.00001"
  }'
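If you’d rather go the SDK route, the same job can be sketched with boto3. Treat this as a rough outline under assumptions: the helper names build_customization_request and submit are mine, the default region is a placeholder, and the parameter names are boto3’s bedrock client arguments as I understand them. The builder is kept separate from the network call so you can inspect the request before anything gets billed.

```python
VALID_TYPES = ("FINE_TUNING", "CONTINUED_PRE_TRAINING")


def build_customization_request(job_name, model_name, role_arn, base_model,
                                train_uri, output_uri,
                                customization_type="FINE_TUNING",
                                hyper_parameters=None):
    """Assemble keyword arguments for the bedrock client's create_model_customization_job."""
    if customization_type not in VALID_TYPES:
        raise ValueError(f"customization_type must be one of {VALID_TYPES}")
    for uri in (train_uri, output_uri):
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return {
        "jobName": job_name,
        "customModelName": model_name,
        "roleArn": role_arn,
        "baseModelIdentifier": base_model,
        "customizationType": customization_type,
        "trainingDataConfig": {"s3Uri": train_uri},
        "outputDataConfig": {"s3Uri": output_uri},
        # Hyperparameter values are sent as strings; pass them as strings
        # yourself if you care about the exact formatting (e.g. "0.00001").
        "hyperParameters": {k: str(v) for k, v in (hyper_parameters or {}).items()},
    }


def submit(request, region="us-east-1"):
    """Send the request to Bedrock. Needs boto3 and AWS credentials; this starts a billable job."""
    import boto3  # imported lazily so the builder works without the SDK installed

    client = boto3.client("bedrock", region_name=region)
    return client.create_model_customization_job(**request)["jobArn"]
```

A typical call would be build_customization_request("my-classification-job", "my-classification-model", role_arn, "amazon.titan-text-express-v1", "s3://my-bucket/training-data.jsonl", "s3://my-bucket/output/", hyper_parameters={"epochCount": "3", "batchSize": "1", "learningRate": "0.00001"}), followed by submit(...) once you’ve eyeballed the dictionary.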

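Why it works: The model already knows language and the world, so a small dataset only has to nudge its behavior rather than teach it from scratch. A useful mental model is that fine-tuning adjusts a small slice of what the network has learned while leaving the bulk of its knowledge alone (Bedrock doesn’t expose the exact mechanics, so treat that as an analogy, not a spec). It’s a surgical strike. It’s efficient, requires less data, and is less prone to the dreaded “catastrophic forgetting” (where the model forgets how to speak English because you taught it too much about sea captains).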

Continued Pre-Training: The College Semester

Now, here’s where Bedrock’s terminology gets… creative. They call this “Continued Pre-Training,” but in the broader AI community, we’d often call this “Domain Adaptation” or “Further Pre-Training.” This isn’t about a specific task; it’s about knowledge infusion. You’re giving the model a textbook on a new subject so its underlying understanding of the world expands.

You use this when you need the model to understand a domain, not just perform a task.

Legal Documents: Feeding it thousands of SEC filings, legal contracts, and patent applications.
Medical Research: Training it on a corpus of medical journals, clinical trial reports, and patient outcome summaries (with proper anonymization, please).
Your Company’s Confluence Wiki: Making it the ultimate expert on every meeting note and project doc from the last decade.

The dataset here is much larger, often tens of thousands of unstructured documents. No prompts or completions, just raw text. Bedrock still wants it packaged as JSONL, though: one record per line, with the text in an "input" field.

{"input": "The quantum flux capacitor, a proprietary component of our FOO-3000 series, operates at a nominal resonance frequency of 1.21 Gigawatts. The operating manual (DOC-789) clearly states that recalibration is required after every 100,000 milliwatts of output."}
{"input": "Project Barista's success metrics (KPI-BR-01 through 04) are tied primarily to user engagement, not raw revenue. See the Q3 2023 all-hands presentation for more details."}
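Getting a pile of documents into a shape Bedrock will accept is mostly plumbing. Here’s a rough sketch, assuming the continued pre-training format is JSONL with one {"input": ...} object per line (which is how I read the Bedrock docs). The function name to_pretraining_jsonl and the max_chars limit are made up for illustration; the real per-record limit is measured in tokens and depends on the base model.

```python
import json


def to_pretraining_jsonl(documents, max_chars=4000):
    """Turn raw text documents into JSONL lines of the form {"input": "<text>"}.

    Each document is split on blank lines into paragraphs, whitespace is
    collapsed, empty paragraphs are dropped, and anything longer than
    max_chars (a hypothetical limit) is cut into fixed-size pieces.
    """
    lines = []
    for doc in documents:
        for paragraph in doc.split("\n\n"):
            text = " ".join(paragraph.split())
            # range() over an empty string yields nothing, so blanks are skipped.
            for start in range(0, len(text), max_chars):
                piece = text[start:start + max_chars]
                lines.append(json.dumps({"input": piece}, ensure_ascii=False))
    return lines


if __name__ == "__main__":
    docs = ["Project Barista's KPIs are tied to engagement.\n\nSee the Q3 all-hands deck."]
    print("\n".join(to_pretraining_jsonl(docs)))
```

Cutting on a character count can split a word in half; for real data you’d want to break on sentence boundaries, but this is enough to get a test run through the pipeline.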

The CLI command looks similar, but the customization-type and the nature of the data change everything.

# Continued pre-training is only offered for certain base models (Titan Text at the time of
# writing). The training file is JSONL with one {"input": "..."} record per line.
aws bedrock create-model-customization-job \
  --job-name my-domain-adaptation-job \
  --custom-model-name my-domain-adapted-model \
  --customization-type CONTINUED_PRE_TRAINING \
  --role-arn arn:aws:iam::123456789012:role/BedrockFineTuningRole \
  --base-model-identifier amazon.titan-text-express-v1 \
  --training-data-config '{"s3Uri": "s3://my-bucket/raw-domain-data.jsonl"}' \
  --output-data-config '{"s3Uri": "s3://my-bucket/output-2/"}' \
  --hyper-parameters '{
      "epochCount": "1",
      "batchSize": "1",
      "learningRate": "0.00002"
  }'

Why it works (and why it’s dangerous): Here, you’re running the same next-token training the model was built with, just on your corpus, so you’re shifting what the model knows rather than how it formats an answer. This is far more powerful but also far more perilous. Use too high a learning rate or too many epochs, and you’ll not only get that catastrophic forgetting I mentioned, but you might just lobotomize your multi-billion-parameter model, turning it into a babbling idiot that can only recite your company’s HR policy. Start with one epoch and a low learning rate. Always, always keep an untouched base model to compare against.

The Gotchas: Where This All Goes Pear-Shaped

Data Quality is Everything: Garbage in, gospel out. If your training data has biases, errors, or inconsistencies, the model will learn them perfectly and then confidently spew them back at your customers. Scrub your data like a surgeon scrubs in.
The Bill: These jobs aren’t free. You’re renting time on some of the most expensive hardware on the planet. A job can cost hundreds to thousands of dollars. Test on a small subset of your data first. Monitor your job in the Bedrock console like a hawk.
Evaluation is Non-Negotiable: You can’t just deploy a fine-tuned model because the job finished. You must have a rigorous, automated evaluation dataset that you did not use for training. Test its performance against the base model. Does it perform better on your task? Does it still perform acceptably on general knowledge? If it fails, you just saved yourself a production incident and a huge bill.
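Idle Capacity Will Eat You Alive: That custom model you just trained? Storing it in Bedrock costs a small monthly fee. The real money goes to the Provisioned Throughput you typically have to buy to run inference against it, which is billed by the hour whether or not anyone is calling it. The moment you’re done testing, delete the Provisioned Throughput, not the model. Don’t count on rebuilding the model from your S3 output location: as far as I can tell, Bedrock writes training and validation metrics there, not weights, so deleting the custom model means paying for the training job again. This one tip will save you a staggering amount of money.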
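The evaluation gotcha above is the one most worth automating. Here’s a minimal sketch, assuming a classification-style task where exact-match accuracy makes sense. Everything here is hypothetical scaffolding: base_predict and tuned_predict stand in for whatever function you write to call each model, and the 2% tolerance for general-knowledge regression is a number I picked, not a recommendation.

```python
import random


def split_holdout(records, holdout_fraction=0.2, seed=0):
    """Shuffle deterministically and set aside a slice the training job never sees."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    holdout_size = int(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - holdout_size
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, examples):
    """Fraction of examples where the model output matches the expected completion."""
    if not examples:
        return 0.0
    correct = sum(
        1 for ex in examples
        if predict(ex["prompt"]).strip().lower() == ex["completion"].strip().lower()
    )
    return correct / len(examples)


def should_promote(base_predict, tuned_predict, task_examples, general_examples,
                   max_general_drop=0.02):
    """Promote only if the tuned model wins on the task without regressing elsewhere."""
    task_gain = accuracy(tuned_predict, task_examples) - accuracy(base_predict, task_examples)
    general_drop = accuracy(base_predict, general_examples) - accuracy(tuned_predict, general_examples)
    return task_gain > 0 and general_drop <= max_general_drop
```

Split before you upload anything, train only on the first half of the tuple, and wire should_promote into whatever gate sits in front of production. For free-form generation you’d swap exact match for a scoring method that suits the task.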

This is the real work of generative AI—not just prompting, but teaching. It’s powerful, expensive, and fraught with pitfalls. But when you get it right, and that model starts answering questions with your company’s specific voice and knowledge, it feels like goddamn magic.