41.3 Bedrock Knowledge Bases: RAG with S3 and Vector Stores

Right, so you’ve got a big pile of documents in S3—PDFs, text files, maybe some Word docs from that one colleague who refuses to join the 21st century. You want to query them intelligently with a Large Language Model (LLM), but we all know the problem: LLMs are brilliant idiots. They have vast knowledge but are utterly clueless about your specific data. That’s where Bedrock’s Knowledge Bases come in. Think of it as giving your model a pair of glasses and a very, very good filing system. It’s Retrieval Augmented Generation (RAG) without you having to build the entire plumbing system from scratch.

Here’s the core idea: you point a Knowledge Base at an S3 bucket. Bedrock automatically chunks your documents, generates embeddings (dense numerical representations of meaning) for those chunks, and sticks them in a vector store you choose. When you ask a question, it finds the most relevant chunks from your documents and hands them, along with your question, to the model. The model then answers using that context, not just its own dated and generalized knowledge. It’s like having a hyper-competent intern who’s actually read every single one of your company reports.

The Nitty-Gritty: How It Actually Works

Don’t just think of it as magic. Under the hood, it’s a well-orchestrated, multi-step process. You need to understand this to debug it when, not if, it acts weird.

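Ingestion: You sync the Knowledge Base. Bedrock reads the files in your S3 bucket, or only those under the inclusion prefixes if you set any. Later syncs are incremental, so only files added, changed, or deleted since the last sync are processed.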
Chunking: It breaks down each document into smaller, overlapping pieces. The default chunking strategy is actually pretty sensible, but we’ll get to customizing it later because, of course, the defaults won’t work for everything.
Embedding: Each text chunk is passed to a chosen Embeddings Model (e.g., the Titan Embeddings model). This model converts the text into a vector—a long list of numbers that represents its semantic meaning. The genius here is that sentences with similar meanings will have vectors that are mathematically close together in space.
Storage: These vectors, along with their metadata and the original text, are stored in the vector database you configured (e.g., Pinecone, Redis Enterprise Cloud, or AWS’s own OpenSearch Serverless).
Retrieval: You query the Knowledge Base. It takes your question, converts it into an embedding using the same model, and performs a similarity search in the vector store to find the k most relevant chunks (you can configure k).
Generation: These relevant chunks are packaged up into the prompt as context and sent to the chosen Foundation Model (e.g., Claude 3 Sonnet) to generate a final, sourced answer.
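
To make the six steps above less abstract, here is a minimal sketch of the chunk, embed, store, and retrieve loop in plain Python. It is not how Bedrock implements any of this: the word-count "embedding" is a toy stand-in for a real embeddings model, the in-memory list stands in for the vector store, and the function names (chunk_words, embed, cosine, top_k) and the sample document are invented for illustration.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=8, overlap=2):
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding (assumption): a sparse word-count vector, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question, store, k=2):
    """Return the k stored chunks whose vectors are closest to the question's vector."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical sample document
doc = ("New AWS accounts are requested through the platform team portal. "
       "Approval usually takes two business days. "
       "Expense reports are filed monthly in the finance system.")

store = [(c, embed(c)) for c in chunk_words(doc)]  # chunking + embedding + storage
question = "How do I request an AWS account?"
context = top_k(question, store)                   # retrieval
# Generation: this prompt is what would be handed to the foundation model
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
```

A real embeddings model would also match "request" against "requested"; the word-count toy only matches exact tokens, which is exactly the gap dense embeddings are meant to close.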

Setting Up a Knowledge Base: The Code You Actually Need

Talking is cheap. Let’s look at the boto3 code to create one. This is the part where you’ll spend 80% of your time. Notice the moving parts: the S3 bucket, the embedding model, the vector store config, and the permissions. IAM is, as always, the silent killer of good ideas.

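import boto3

# Control-plane client: creates knowledge bases, data sources, and ingestion jobs
bedrock = boto3.client('bedrock-agent')
# Query-time client. retrieve_and_generate lives on 'bedrock-agent-runtime',
# not on 'bedrock-runtime' (that one is for invoking models directly).
bedrock_runtime = boto3.client('bedrock-agent-runtime')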

# Create the Knowledge Base
response = bedrock.create_knowledge_base(
    name="MyTechnicalDocsKB",
    description="A KB for our internal technical documentation",
    # This role needs perms to read S3, invoke the embedding model, and write to the vector store!
    roleArn="arn:aws:iam::123456789012:role/MyBedrockKnowledgeBaseRole",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1"
        }
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-east-1:123456789012:collection/my-vector-collection",
            "vectorIndexName": "my-technical-docs-index",
            "fieldMapping": {
                "vectorField": "bedrock-knowledge-base-default-vector",
                "textField": "AMAZON_BEDROCK_TEXT_CHUNK",
                "metadataField": "AMAZON_BEDROCK_METADATA"
            }
        }
    }
)

knowledge_base_id = response['knowledgeBase']['knowledgeBaseId']

# Attach an S3 data source to the new Knowledge Base
data_source_id = bedrock.create_data_source(
    knowledgeBaseId=knowledge_base_id,
    name="MyS3DataSource",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn": "arn:aws:s3:::my-technical-docs-bucket",
            # Pro tip: Start with a prefix to test. Ingesting a 10TB bucket is a costly mistake.
            "inclusionPrefixes": ["onboarding/"]
        }
    },
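    # Optional: encrypt with a customer-managed KMS key. Replace the placeholder ARN, or drop this argument entirely.
    serverSideEncryptionConfiguration={
        "kmsKeyArn": "arn:aws:kms:us-east-1:123456789012:key/your-key-id"
    }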
)['dataSource']['dataSourceId']

# Start the ingestion job. This is async and can take a while.
ingestion_job_id = bedrock.start_ingestion_job(
    knowledgeBaseId=knowledge_base_id,
    dataSourceId=data_source_id
)['ingestionJob']['ingestionJobId']
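
Because the ingestion job is asynchronous, you usually want to wait for it before querying. Below is a minimal polling sketch built on the bedrock-agent client's get_ingestion_job call. The helper name wait_for_ingestion, the 15-second poll interval, and the one-hour timeout are assumptions for illustration; tune them to your dataset.

```python
import time


def wait_for_ingestion(client, knowledge_base_id, data_source_id, ingestion_job_id,
                       poll_seconds=15, timeout_seconds=3600):
    """Poll get_ingestion_job until the job finishes; return the final job dict."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=ingestion_job_id,
        )['ingestionJob']
        status = job['status']
        if status == 'COMPLETE':
            return job
        if status in ('FAILED', 'STOPPED'):
            # failureReasons is where IAM and parsing problems tend to surface
            raise RuntimeError(f"Ingestion {status}: {job.get('failureReasons', [])}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Ingestion still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Usage with the variables from the block above (requires AWS credentials):
# job = wait_for_ingestion(bedrock, knowledge_base_id, data_source_id, ingestion_job_id)
# print(job['statistics'])  # counts of documents scanned, indexed, failed, ...
```

Note that a job can report COMPLETE while individual documents failed, so it is worth reading the statistics rather than trusting the status alone.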

Querying It: Getting Answers Out

Now for the payoff. Querying is deceptively simple. The real art is in crafting the retrieval configuration.

# Query the Knowledge Base
response = bedrock_runtime.retrieve_and_generate(
    input={'text': "What's the process for requesting a new AWS account?"},
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': knowledge_base_id,
            'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0',
            'retrievalConfiguration': {
                'vectorSearchConfiguration': {
                    'numberOfResults': 5  # This is 'k'. Tune this! Too few = missing context, too many = noise and cost.
                }
            }
        }
    }
)

# The answer is cleanly presented...
print(response['output']['text'])
# ...but the REAL gold is in the citations. Always check these to verify the model isn't hallucinating!
for citation in response['citations']:
    # A citation can carry zero or several references, so don't index [0] blindly
    for ref in citation['retrievedReferences']:
        print(f"Source: {ref['location']['s3Location']['uri']}")
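
When you are tuning k or chasing a bad answer, it helps to look at retrieval on its own, with no generation step in the way. The sketch below uses the retrieve call on the bedrock-agent-runtime client for that. The helper name retrieved_chunks and the '<unknown source>' fallback are made up for illustration, and it assumes an S3-backed data source as in this section.

```python
def retrieved_chunks(client, knowledge_base_id, question, k=5):
    """Return (score, source_uri, text) for the top-k chunks; no model generation involved."""
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={'text': question},
        retrievalConfiguration={
            'vectorSearchConfiguration': {'numberOfResults': k}
        },
    )
    rows = []
    for result in response['retrievalResults']:
        # Fall back to a placeholder if the result has no S3 location
        uri = result.get('location', {}).get('s3Location', {}).get('uri', '<unknown source>')
        rows.append((result.get('score'), uri, result['content']['text']))
    return rows


# Usage (requires AWS credentials and a synced Knowledge Base):
# import boto3
# client = boto3.client('bedrock-agent-runtime')
# for score, uri, text in retrieved_chunks(client, knowledge_base_id, "How do I request an account?"):
#     print(score, uri, text[:80])
```

If the right chunk never shows up here, no amount of prompt tweaking will fix the final answer; the problem is in chunking, the source documents, or k.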

Common Pitfalls and How to Avoid Them (The Trench Wisdom)

This is where the manual won’t help you. I’ve learned this the hard way.

Garbage In, Garbage Out is Law: If your source documents are messy scanned PDFs where the text is a jumbled mess, the embeddings will be a jumbled mess. The model can’t reason with data it can’t read. Pre-process your docs.
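Chunking is Everything: The default chunk size might be wrong for you. Legal contracts need big chunks for context; code documentation might need smaller ones. You can set a fixed chunk size and overlap on the data source, or turn chunking off (the NONE strategy treats each file as a single chunk), split the documents yourself, and have Bedrock just embed them, but it’s more work. Decide early: the chunking configuration can’t be changed after the data source is created.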
IAM Will Hate You: The execution role for the Knowledge Base needs precise permissions to read from the S3 bucket, invoke the embedding model, and write to the vector store. Most of the “ingestion failed” errors I’ve seen come down to IAM. Be paranoid about this.
Check Your Citations, You Fool: Never, ever blindly trust the answer. The model is still generating text. It can misinterpret the context it’s given. Always look at the retrieved citations and ask yourself, “Does the source material actually support this answer?” This is your single most important debugging tool.
Cost Awareness: Ingestion isn’t free. You’re paying for the embedding model invocations and the vector store operations. Doing a full sync on a massive bucket can have a surprising cost. Test with a small dataset first.
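
Two of the pitfalls above are easier to see as data. The sketch below builds (a) a fixed-size chunking configuration for create_data_source and (b) a permissions policy document for the Knowledge Base role. The helper names, the 512-token and 20-percent values, and the bounds check are illustrative assumptions; the policy covers the S3, embedding model, and OpenSearch Serverless access described in this section and is a starting point, not a complete or audited policy (a KMS-encrypted bucket, for example, needs more).

```python
import json


def fixed_size_chunking(max_tokens, overlap_percentage):
    """Build a vectorIngestionConfiguration dict that requests fixed-size chunks."""
    # Sanity bounds are an assumption here; check the API reference for exact limits
    if max_tokens < 1 or not 0 < overlap_percentage < 100:
        raise ValueError("max_tokens must be >= 1 and overlap_percentage between 1 and 99")
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,
                "overlapPercentage": overlap_percentage,
            },
        }
    }


def knowledge_base_role_policy(bucket_arn, embedding_model_arn, collection_arn):
    """Build an IAM policy document: read S3, invoke the embedding model, use the collection."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
            {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": [f"{bucket_arn}/*"]},
            {"Effect": "Allow", "Action": ["bedrock:InvokeModel"], "Resource": [embedding_model_arn]},
            {"Effect": "Allow", "Action": ["aoss:APIAccessAll"], "Resource": [collection_arn]},
        ],
    }


# Illustrative values only
chunking = fixed_size_chunking(max_tokens=512, overlap_percentage=20)
policy = knowledge_base_role_policy(
    "arn:aws:s3:::my-technical-docs-bucket",
    "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1",
    "arn:aws:aoss:us-east-1:123456789012:collection/my-vector-collection",
)
print(json.dumps(policy, indent=2))

# Usage: pass the chunking dict when creating the data source, e.g.
# bedrock.create_data_source(..., vectorIngestionConfiguration=chunking)
```

The role also needs a trust policy that lets bedrock.amazonaws.com assume it, and OpenSearch Serverless has its own data access policies on top of IAM, so a correct IAM policy alone may still not be enough.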

Bedrock Knowledge Bases takes the immense pain out of building a RAG system, but it doesn’t absolve you of thinking. You’re still the architect. You have to choose the right chunks, the right models, and, most importantly, supply the right data. Now go point it at your S3 bucket and finally find out what that one cryptic PDF from 2018 actually means.