Overview
Context engineering is about managing what the AI knows and when. A common mistake is jumping straight to complex RAG pipelines when simpler approaches would suffice. This module teaches you to start simple and add complexity only when needed.
The progression:
- Simple context injection - Put the document directly in the prompt
- In-memory document stores - Store and retrieve full documents by ID
- RAG pipelines - Semantic search over large document collections
Each step adds capability but also complexity. Understanding these trade-offs is essential for building production AI systems.
Codebase Reference
The bootcamp application demonstrates simple context injection - the first approach you should consider:
- Document Store - In-memory document storage
- Text Extraction - PDF to text conversion
- Thread Memory Provider - Manages conversation history per thread
- User Preferences Provider - Stores long-term user facts
Topics Covered
1. Start Simple: In-Memory Document Processing
Before reaching for vector databases and embedding models, consider whether you actually need them. The bootcamp app demonstrates a simple context injection pattern that works well for many use cases.
How It Works
# From ai-bootcamp-app/backend/app/documents/store.py
class DocumentStore:
    """Simple in-memory document store - no embeddings, no vectors."""

    def __init__(self):
        self._documents: dict[str, Document] = {}

    def add(self, document: Document) -> None:
        self._documents[document.id] = document

    def get(self, document_id: str) -> Document | None:
        return self._documents.get(document_id)
See implementation: store.py
The Flow
1. User uploads PDF
   → PyMuPDF extracts text
   → Store full text in memory (dict keyed by doc ID)
   → Return doc ID to client
2. User asks a question about the document
   → Retrieve full document text by ID
   → Inject text directly into system prompt
   → LLM processes question with full context
3. No embeddings. No vector search. No chunking.
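Pieced together, the whole pipeline is a handful of lines. A hedged sketch of how the store, extraction, and injection connect (handler and helper names here are illustrative; the real route code lives in the bootcamp app):

# Sketch: how upload and Q&A connect (simplified; names illustrative)
from uuid import uuid4

async def upload_document(file_bytes: bytes) -> str:
    text, page_count = extract_text_from_pdf(file_bytes)  # PyMuPDF (shown below)
    document = Document(id=str(uuid4()), text=text)       # assumed Document fields
    store.add(document)                                   # the DocumentStore above
    return document.id

async def answer_question(document_id: str, question: str) -> str:
    document = store.get(document_id)
    instructions = f"Answer using this document:\n\n{document.text}"
    return await run_agent_with(instructions, question)   # stand-in for the agent call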
Text Extraction
# From ai-bootcamp-app/backend/app/documents/extraction.py
def extract_text_from_pdf(pdf_bytes: bytes) -> tuple[str, int]:
    """Extract text from PDF bytes using PyMuPDF."""
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    text_parts = []
    for page in doc:
        text_parts.append(page.get_text())
    return "\n\n".join(text_parts), len(doc)
See implementation: extraction.py
Context Injection Pattern
The magic happens when the agent runs. Document context is concatenated directly into the system instructions:
# From ai-bootcamp-app/backend/app/api/routes/agents.py (simplified)
async def run_agent(request: AgentRunRequest):
    # Collect context from memory providers
    memory_context_parts = []
    for provider in context_providers:
        ctx = await provider.invoking()
        if ctx:
            memory_context_parts.append(ctx)

    # Inject context into system instructions
    enhanced_instructions = request.instructions
    if memory_context_parts:
        memory_context = "\n\n".join(memory_context_parts)
        enhanced_instructions = f"{request.instructions}\n\n{memory_context}"

    # Create agent with context-enhanced instructions
    agent = create_agent(instructions=enhanced_instructions, ...)
Why This Works
| Benefit | Explanation |
|---|---|
| No embedding costs | No API calls to create embeddings |
| No vector store | Just a Python dictionary |
| Full context preserved | No lossy chunking or summarization |
| Explicit control | You know exactly what context the model sees |
| Simple debugging | Print the prompt, see what the model gets |
Limitations
| Limitation | Impact |
|---|---|
| Context window limits | Can't exceed ~100k tokens per document |
| No semantic search | Must select documents explicitly |
| Memory usage | All documents in RAM |
| No relevance ranking | All context weighted equally |
2. When Simple Isn't Enough: Enter RAG
The simple approach breaks down when:
- Documents exceed context limits - A 500-page manual won't fit in a prompt
- You have many documents - Can't inject 1000 documents at once
- Semantic search is needed - "Find documents about X" requires embeddings
- Relevance matters - Not all context is equally important
RAG (Retrieval-Augmented Generation) solves these problems by:
- Chunking documents into smaller pieces
- Creating embeddings (vector representations) of each chunk
- Storing embeddings in a vector database
- Retrieving only relevant chunks at query time
3. RAG Complexity Trade-offs
Before adopting RAG, understand what you're signing up for:
New Components Required
| Component | Purpose | Complexity |
|---|---|---|
| Embedding model | Convert text to vectors | API costs, model selection |
| Vector store | Store and search embeddings | Infrastructure, scaling |
| Chunking strategy | Split documents | Affects retrieval quality |
| Retrieval strategy | Find relevant chunks | Top-k, MMR, re-ranking |
New Failure Modes
| Problem | Description |
|---|---|
| Bad chunks | Poor chunking loses context across boundaries |
| Embedding drift | Different embedding models = incompatible vectors |
| Retrieval misses | Semantically similar ≠ actually relevant |
| Context fragmentation | Retrieved chunks may lack necessary context |
| Latency | Embedding + search + retrieval adds overhead |
Decision Framework
START HERE:
│
├─ Documents fit in context window?
│    ├─ YES → Use simple context injection
│    └─ NO → Continue...
│
├─ Need to search across documents?
│    ├─ NO → Use explicit document selection
│    └─ YES → Continue...
│
├─ Need semantic similarity search?
│    ├─ NO → Consider keyword search (BM25, full-text)
│    └─ YES → RAG is appropriate
│
└─ RAG REQUIRED: Accept the complexity
4. Traditional RAG Pipelines
When you do need RAG, understanding its architecture helps you implement it effectively.
How RAG Works
RAG follows a two-phase approach: indexing (offline) and retrieval + generation (at query time).
INDEXING PHASE (Offline - done once per document)
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Documents  │ -> │  Chunking   │ -> │  Embedding  │ -> │Vector Store │
│  (PDFs,     │    │  (Split     │    │  (Convert   │    │ (Store      │
│   text)     │    │   text)     │    │   to vecs)  │    │  & index)   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

QUERY PHASE (Online - every user query)

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Query     │ -> │   Embed     │ -> │   Search    │ -> │  Retrieve   │
│  "What is   │    │   Query     │    │   Vector    │    │   Top-K     │
│   X?"       │    │             │    │   Store     │    │   Chunks    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                               │
                                                               v
┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐
│   Answer    │ <- │     LLM     │ <- │  Query + Retrieved Context      │
│             │    │  Generate   │    │  "Given this context: [chunks]  │
│             │    │             │    │   Answer: What is X?"           │
└─────────────┘    └─────────────┘    └─────────────────────────────────┘
Core Components
1. Chunking - Splitting documents into retrievable pieces
| Strategy | Description | Best For |
|---|---|---|
| Fixed size | Split every N characters/tokens | Simple, predictable |
| Semantic | Split at paragraph/section boundaries | Preserves meaning |
| Recursive | Try multiple separators (paragraph → sentence → word) | General purpose |
| Overlap | Include N characters from previous chunk | Preserves context across boundaries |
# Example: Recursive character splitting with overlap
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Target chunk size in characters
    chunk_overlap=200,  # Overlap between chunks
    separators=["\n\n", "\n", ". ", " "]  # Try these in order
)
chunks = splitter.split_text(document_text)
2. Embeddings - Converting text to vectors for similarity search
Embeddings map text to high-dimensional vectors where similar meanings are close together.
| Model | Dimensions | Notes |
|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | Good balance of cost/quality |
| text-embedding-3-large (OpenAI) | 3072 | Higher quality, higher cost |
| all-MiniLM-L6-v2 (Sentence Transformers) | 384 | Free, runs locally |
| voyage-large-2 (Voyage AI) | 1024 | Strong retrieval performance |
# Example: Creating embeddings
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Your chunk of text here"]
)
vector = response.data[0].embedding  # List of 1536 floats
3. Vector Stores - Storing and searching embeddings
| Store | Type | Best For |
|---|---|---|
| Chroma | In-memory / SQLite | Development, small datasets |
| Pinecone | Managed cloud | Production, serverless |
| Weaviate | Self-hosted / cloud | Hybrid search, filtering |
| pgvector | PostgreSQL extension | When you already use Postgres |
| FAISS | In-memory library | High performance, local |
# Example: Using Chroma for local development
import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

# Add documents (Chroma handles embedding automatically)
collection.add(
    documents=["chunk 1 text", "chunk 2 text"],
    ids=["chunk_1", "chunk_2"]
)

# Query
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=5
)
4. Retrieval Strategies - Finding the most relevant chunks
| Strategy | How It Works | Trade-off |
|---|---|---|
| Top-K | Return K most similar chunks | Simple, may have redundancy |
| MMR (Maximal Marginal Relevance) | Balance relevance with diversity | Reduces redundant results |
| Hybrid | Combine vector search + keyword search (BM25) | Better for exact matches |
| Re-ranking | Use a cross-encoder to re-score top results | Higher quality, more latency |
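MMR from the table above is simple enough to implement directly. A minimal sketch, assuming query and chunk embeddings are already computed as L2-normalized numpy arrays (the lambda_mult name follows common convention and is not from any specific library):

# Minimal MMR sketch: trade off query relevance against diversity
import numpy as np

def mmr(query_vec: np.ndarray, chunk_vecs: np.ndarray,
        k: int = 5, lambda_mult: float = 0.7) -> list[int]:
    """Return indices of k chunks balancing relevance and diversity."""
    relevance = chunk_vecs @ query_vec  # cosine similarity (normalized vectors)
    candidates = list(range(len(chunk_vecs)))
    selected: list[int] = []
    while candidates and len(selected) < k:
        if not selected:
            best = candidates[int(np.argmax(relevance[candidates]))]
        else:
            # Score = relevance minus similarity to already-selected chunks
            scores = [
                lambda_mult * relevance[c]
                - (1 - lambda_mult) * max(chunk_vecs[c] @ chunk_vecs[s] for s in selected)
                for c in candidates
            ]
            best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected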
Scaling Considerations
- S3-backed vector stores - For corpora too large to fit in memory
- Metadata filtering - Filter by date, source, category before vector search
- Hierarchical indexing - Summarize documents, search summaries first, then drill down
- Caching - Cache frequent queries to reduce latency and cost
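Metadata filtering, from the list above, is often the cheapest scaling win. Continuing the earlier Chroma example, a where clause narrows the search space before vector comparison:

# Attach metadata at indexing time, filter at query time
collection.add(
    documents=["chunk 1 text", "chunk 2 text"],
    ids=["chunk_1", "chunk_2"],
    metadatas=[{"source": "manual"}, {"source": "faq"}]
)

results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=5,
    where={"source": "manual"}  # Only consider chunks from the manual
)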
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Chunks too small | Retrieved context lacks necessary detail | Increase chunk size, add overlap |
| Chunks too large | Irrelevant content dilutes the answer | Decrease chunk size, use semantic splitting |
| Wrong embedding model | Poor retrieval despite good chunks | Evaluate models on your domain |
| No metadata | Can't filter by time/source | Add metadata during indexing |
| Stale index | New documents not searchable | Implement incremental indexing |
5. Progressive Context
Loading all context upfront wastes tokens and can overwhelm the model. Progressive context strategies load information as needed.
Techniques:
- On-demand retrieval - Fetch context only when relevant
- Tool flooding prevention - Limit available tools based on task
- Lazy loading - Start minimal, expand as conversation develops
| Approach | When to Use |
|---|---|
| Simple injection | Small documents, explicit selection |
| Pre-built RAG | Known document corpus, predictable queries |
| On-the-fly RAG | Dynamic sources, exploratory queries |
| Progressive | Long conversations, multiple topics |
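A hypothetical sketch of on-demand retrieval: rather than injecting documents upfront, expose retrieval as a tool so the model fetches context only when a question requires it (Agent and the search_docs tool are illustrative stand-ins for your framework's API, and store.all() is an assumed accessor on the document store):

# Hypothetical: context loaded lazily via a tool instead of upfront injection
def search_docs(query: str) -> str:
    """Return the best-matching stored document text for a query."""
    matches = [d for d in store.all() if query.lower() in d.text.lower()]
    return matches[0].text[:4000] if matches else "No matching documents."

agent = Agent(
    instructions="Answer questions. Call search_docs when you need a document.",
    tools=[search_docs],  # the model decides when to pull context in
)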
6. Structured Output
LLMs naturally produce free-form text, but applications need predictable data structures. Structured output techniques constrain model responses to match expected schemas, enabling reliable parsing and type safety.
The Problem with Free-Form Output
# Without structured output - fragile string parsing
response = llm.complete("Extract the person's name and age from: 'John is 25 years old'")
# Response might be:
# "Name: John, Age: 25"
# "The person is John and they are 25."
# "John (25)"
# Good luck parsing all these variations reliably!
The Solution: Schema-Constrained Generation
# With structured output - guaranteed schema
from pydantic import BaseModel
class Person(BaseModel):
    name: str
    age: int

response = llm.complete_structured(
    "Extract the person's name and age from: 'John is 25 years old'",
    response_model=Person
)
# response.name = "John"
# response.age = 25 (already an int!)
How It Works
| Approach | Mechanism | Trade-off |
|---|---|---|
| JSON mode | Model trained to output valid JSON | No schema enforcement |
| Function calling | Model fills function parameters | Provider-specific |
| Constrained decoding | Grammar restricts token generation | Guaranteed valid, slower |
| Retry with validation | Parse output, retry on failure | Works with any model |
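The retry-with-validation row needs no special provider support. A minimal sketch, where llm_complete is a stand-in for your provider call:

# Retry with validation: parse output, feed errors back, retry on failure
from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str
    age: int

def extract_person(text: str, max_retries: int = 3) -> Person:
    prompt = f"Return JSON with 'name' and 'age' for: {text}"
    for _ in range(max_retries):
        raw = llm_complete(prompt)  # stand-in for your provider call
        try:
            return Person.model_validate_json(raw)
        except ValidationError as e:
            # Show the model its own mistake so it can self-correct
            prompt = f"{prompt}\nPrevious output was invalid: {e}. Return only valid JSON."
    raise ValueError("No valid structured output after retries")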
Libraries
Instructor - Works with OpenAI, Anthropic, and other providers
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class UserInfo(BaseModel):
    name: str
    age: int

user = client.chat.completions.create(
    model="gpt-4",
    response_model=UserInfo,
    messages=[{"role": "user", "content": "John Doe is 30 years old"}]
)
# user.name = "John Doe", user.age = 30
Pydantic AI - Native structured output support
from pydantic_ai import Agent
from pydantic import BaseModel

class CityInfo(BaseModel):
    name: str
    country: str
    population: int

agent = Agent('openai:gpt-4', result_type=CityInfo)
result = agent.run_sync("Tell me about Paris")
# result.data.name = "Paris"
# result.data.country = "France"
Native Provider Support - OpenAI and Anthropic have built-in structured output
# OpenAI structured outputs
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"]
            }
        }
    },
    messages=[{"role": "user", "content": "Extract: John is 25"}]
)
When to Use Structured Output
| Use Case | Why |
|---|---|
| API responses | Downstream systems need predictable formats |
| Data extraction | Pull structured data from unstructured text |
| Tool/function calling | Arguments must match expected types |
| Multi-step workflows | Output of one step feeds into next |
| Database insertion | Data must match schema |
Best Practices
- Start with Pydantic models - Define your schema as code, get validation free
- Use optional fields - Optional[str] for data that might not exist
- Add descriptions - Help the model understand field meaning with Field(description="...")
- Validate beyond types - Use Pydantic validators for business logic (see the sketch after this list)
- Handle failures - Always have a fallback for when parsing fails
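Putting those practices together, an illustrative model (the Invoice schema is a made-up example, not from the bootcamp app):

from typing import Optional
from pydantic import BaseModel, Field, field_validator

class Invoice(BaseModel):
    vendor: str = Field(description="Company that issued the invoice")
    total: float = Field(description="Grand total including tax")
    po_number: Optional[str] = Field(default=None, description="Purchase order, if present")

    @field_validator("total")
    @classmethod
    def total_is_positive(cls, v: float) -> float:
        if v <= 0:  # business rule the type system can't express
            raise ValueError("total must be positive")
        return v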
7. Memory Management
Memory gives agents the ability to maintain context across interactions, transforming stateless request-response systems into contextual assistants.
Why Memory Matters
Without memory, every agent interaction starts fresh:
- Users must repeat context ("As I mentioned earlier...")
- Agents cannot learn user preferences
- Multi-turn conversations lose coherence
- Personalization is impossible
Hybrid Dual-Layer Architecture
The bootcamp app uses a hybrid memory architecture combining fast access with permanent storage:
| Layer | Storage | Purpose | Latency | Retention |
|---|---|---|---|---|
| L1 Cache | Redis | Fast access for active conversations | ~1ms | 24 hours TTL |
| L2 Storage | PostgreSQL | Permanent conversation history | ~10ms | Permanent |
How they connect: The thread_id links both layers. Redis stores messages at thread:{thread_id}:messages, and PostgreSQL has a conversations table with a unique thread_id column that references the same identifier.
Redis Key                          PostgreSQL
─────────────────────────────      ─────────────────────────────
thread:{thread_id}:messages  ────  conversations.thread_id (UNIQUE)
                                               │
                                               ▼
                                   messages.conversation_id
This enables:
- Fast retrieval: Check Redis first for active conversations
- Hydration: If Redis TTL expired, reload from PostgreSQL and repopulate cache
- Permanent history: Conversations are never lost after Redis expiration
Memory vs RAG: When to Use Each
| Use Case | Solution | Why |
|---|---|---|
| Search documents/knowledge base | RAG | Large corpus, semantic search |
| Remember conversation history | Memory | Sequential, time-ordered |
| User preferences/facts | Memory | Personal, long-lived |
| Real-time external data | Tools/APIs | Dynamic, external source |
Key insight: RAG retrieves relevant information; memory retrieves recent or personal information.
Types of Memory
Thread Memory (Short-term)
Scoped to a single conversation, stores message history in sequence. The ThreadMemoryProvider uses a hybrid retrieval strategy:
- First checks Redis for fast access
- If Redis is empty (TTL expired), hydrates from PostgreSQL
- Falls back to PostgreSQL-only if Redis is unavailable
# From ai-bootcamp-app/backend/app/memory/thread_provider.py
class ThreadMemoryProvider:
    """Provides conversation history with hybrid Redis/PostgreSQL retrieval."""

    def __init__(self, redis: Redis | None, db_session: Session, settings: MemorySettings, thread_id: str):
        self._redis = redis
        self._db_session = db_session
        self._settings = settings
        self._key = f"thread:{thread_id}:messages"
        self._thread_id = thread_id

    async def invoking(self) -> str:
        """Get context - Redis first, hydrate from PostgreSQL if needed."""
        messages = []
        if self._redis:
            raw = await self._redis.lrange(self._key, 0, -1)
            if raw:
                messages = [json.loads(m) for m in raw]
            elif self._db_session:
                # Redis empty - hydrate from PostgreSQL
                messages = await self._hydrate_from_db()
        elif self._db_session:
            messages = await self._get_messages_from_db()
        # Format for context injection...
- Short TTL in Redis (24 hours default)
- Permanent storage in PostgreSQL
- Automatic sync via ConversationSyncService
See implementation: thread_provider.py
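The hydration step can also repopulate the cache so the next read hits Redis again. A hypothetical sketch of what _hydrate_from_db might look like (the thread_ttl_hours setting name is assumed; the bootcamp app's actual code may differ):

# Hypothetical sketch: reload from PostgreSQL and warm the Redis cache
async def _hydrate_from_db(self) -> list[dict]:
    messages = await self._get_messages_from_db()  # ordered oldest-first
    if messages and self._redis:
        # Write history back to Redis and restore the TTL
        await self._redis.rpush(self._key, *[json.dumps(m) for m in messages])
        await self._redis.expire(self._key, self._settings.thread_ttl_hours * 3600)
    return messages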
User Preferences (Long-term)
Scoped to a user across all conversations, stores facts and preferences.
# From ai-bootcamp-app/backend/app/memory/user_provider.py
class UserPreferencesProvider:
    """Provides user preferences across all threads."""

    def __init__(self, redis: Redis | None, settings: MemorySettings, user_id: str):
        self._redis = redis
        self._settings = settings
        self._key = f"user:{user_id}:preferences"
- Long TTL (30 days default)
- Enables "Remember I prefer Python" personalization
See implementation: user_provider.py
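Writing a preference is one hash operation plus a TTL refresh. A hypothetical sketch of a method such a provider might expose (remember and user_ttl_days are assumed names, not from the bootcamp app):

# Hypothetical: persist a user fact under a long TTL
async def remember(self, fact_key: str, fact_value: str) -> None:
    if self._redis:
        await self._redis.hset(self._key, fact_key, fact_value)
        await self._redis.expire(self._key, self._settings.user_ttl_days * 86400)

# Usage: await provider.remember("language", "Python")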
Memory Decay and Prioritization
Not all memories are equal. Strategies include:
| Strategy | How It Works |
|---|---|
| Recency weighting | Recent messages matter more |
| Sliding window | Keep only last N messages |
| Summarization | Compress old context into summaries |
| TTL expiration | Automatic cleanup of stale data |
| Importance scoring | Retain high-value information longer |
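The sliding window strategy from the table above maps directly onto Redis list commands. A minimal sketch, assuming messages are appended with RPUSH inside an async handler:

# Sliding window: append, then trim so only the newest N messages survive
MAX_THREAD_MESSAGES = 50  # mirrors MEMORY__MAX_THREAD_MESSAGES below

await redis.rpush(key, json.dumps(message))
await redis.ltrim(key, -MAX_THREAD_MESSAGES, -1)  # keep only the last 50 entries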
Storage Options
The bootcamp app uses a hybrid approach combining Redis and PostgreSQL:
| Layer | Tool | Role | Notes |
|---|---|---|---|
| L1 Cache | Redis | Fast access, TTL-based expiration | Sub-millisecond reads for active conversations |
| L2 Storage | PostgreSQL | Permanent history | Conversations and messages tables |
| Sync | ConversationSyncService | Async persistence | Fire-and-forget sync to PostgreSQL |
Why hybrid? Redis provides speed for active conversations, while PostgreSQL ensures history is never lost. The ConversationSyncService bridges them asynchronously without blocking user interactions.
Alternative options:
| Tool | Best For | Notes |
|---|---|---|
| Mem0 | Intelligent memory layer | Auto-extracts facts, manages decay |
| pgvector | Relational + vector | When you need SQL + embeddings |
| In-memory | Development/testing | No persistence, simple |
Configuration
Use environment variables to configure memory behavior:
# From ai-bootcamp-app/backend/.env.example
# Redis Settings
REDIS__HOST=localhost
REDIS__PORT=6379
REDIS__PASSWORD=
REDIS__DB=0
# Memory Settings
MEMORY__THREAD_TTL_HOURS=24
MEMORY__USER_TTL_DAYS=30
MEMORY__MAX_THREAD_MESSAGES=50
See configuration: config.py
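The double-underscore names map naturally onto nested settings models. A hedged sketch of how they might be parsed with pydantic-settings (field names mirror the variables above; the real models live in config.py):

# Hypothetical settings sketch for the REDIS__* and MEMORY__* variables
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class RedisSettings(BaseModel):
    host: str = "localhost"
    port: int = 6379
    password: str = ""
    db: int = 0

class MemorySettings(BaseModel):
    thread_ttl_hours: int = 24
    user_ttl_days: int = 30
    max_thread_messages: int = 50

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_nested_delimiter="__")
    redis: RedisSettings = RedisSettings()
    memory: MemorySettings = MemorySettings()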
Deep Dive: See Course 2, Module 3: Memory Management - Hybrid Redis + PostgreSQL Architecture for production implementation patterns including the Context Provider pattern, ConversationSyncService, database schemas, and graceful degradation strategies.
8. PDF Tooling
Extract text from documents for context injection or RAG pipelines. The right tool depends on your document complexity.
PyMuPDF (fitz) - Simple and Fast
Best for straightforward text extraction from standard PDFs.
# Simple extraction with PyMuPDF (from bootcamp app)
import fitz

def extract_text(pdf_bytes: bytes) -> str:
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    return "\n\n".join(page.get_text() for page in doc)
Pros: Fast, lightweight, no external dependencies
Cons: Struggles with complex layouts, tables, and images
Docling - Enterprise Document Understanding
Docling is IBM's open-source library for parsing complex documents. It goes beyond simple text extraction to understand document structure.
What Docling handles that PyMuPDF doesn't:
| Feature | PyMuPDF | Docling |
|---|---|---|
| Basic text extraction | ✓ | ✓ |
| Table extraction | ✗ | ✓ (as structured data) |
| Image extraction | Basic | ✓ (with captions) |
| Layout understanding | ✗ | ✓ (headers, sections, lists) |
| Multi-format support | PDF only | PDF, DOCX, PPTX, HTML, images |
| OCR for scanned docs | ✗ | ✓ |
| Markdown output | ✗ | ✓ |
Installation:
pip install docling
Basic Usage:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Get markdown representation (great for LLM context)
markdown = result.document.export_to_markdown()

# Access structured elements
for table in result.document.tables:
    print(table.export_to_dataframe())  # Tables as pandas DataFrames
Why Markdown output matters:
LLMs understand markdown structure. Docling preserves:
- Heading hierarchy (#, ##, ###)
- Tables in markdown format
- Lists and bullet points
- Code blocks
- Image references with captions
# Example: Converting a technical PDF for RAG
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("technical_manual.pdf")

# This markdown preserves structure the LLM can reason about
context = result.document.export_to_markdown()

# Now inject into your prompt
prompt = f"""Based on this documentation:

{context}

Answer: How do I configure the system?"""
When to Use Each
| Scenario | Recommended Tool |
|---|---|
| Simple text PDFs (articles, reports) | PyMuPDF |
| PDFs with tables you need to query | Docling |
| Scanned documents | Docling (with OCR) |
| Mixed document types (PDF, DOCX, PPTX) | Docling |
| Maximum speed, minimal dependencies | PyMuPDF |
| Complex layouts (multi-column, figures) | Docling |
Best Practices for Document Processing
- Preserve structure - Use markdown output to maintain headings and sections
- Handle tables specially - Extract tables as structured data, not flattened text
- Add metadata - Track source document, page numbers, section titles
- Chunk intelligently - Split at section boundaries, not arbitrary character counts
- Test with real documents - Your production documents may have quirks
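"Chunk intelligently" from the list above can start as simply as splitting markdown at heading boundaries. A minimal regex-based sketch, reusing the context variable from the Docling example earlier:

# Split markdown into chunks at heading boundaries, not character counts
import re

def split_by_headings(markdown: str) -> list[str]:
    """Each chunk is a heading plus the body text under it."""
    parts = re.split(r"\n(?=#{1,3} )", markdown)  # split before #, ##, ### lines
    return [p.strip() for p in parts if p.strip()]

chunks = split_by_headings(context)  # context from the Docling example above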
Hands-On Exercise
Part 1: Test Simple Document Processing
- Start the bootcamp app with docker-compose up
- Open the Chat UI at http://localhost:3000
- Click the Upload Document button in the chat interface
- Select a PDF file from your computer
- The document text is extracted and stored in memory - no embeddings, no vector store
- Ask the agent questions about your uploaded document
- Notice how the full document context is injected into the conversation
Part 2: Test Thread Memory
- Open the Chat UI at http://localhost:3000
- Test conversation continuity:
- Send: "My name is Alice and I am learning Python"
- Send: "What is my name and what am I learning?"
- Verify the agent remembers your context
- Test memory isolation:
- Open a new incognito tab
- Ask: βWhat is my name?β
- Verify the agent doesn't know (new thread = fresh context)
Part 3: Inspect Storage
- Access Redis Commander at http://localhost:8001
- Find thread:{id}:messages keys
- View the JSON conversation history
- Note the TTL countdown (24 hours default)
Part 4 (Optional): Handling Non-Text PDFs
Challenge: The bootcamp app's current PDF extraction only handles text-based PDFs. Extend it to support documents where simple text extraction fails.
The Problem
The current PyMuPDF implementation returns empty or garbage text for:
- Scanned documents - Images of text, not actual text
- PDFs with diagrams/charts - Visual information the model can't see
- Mixed content - Text plus images that need interpretation
Your Task
Extend the bootcamp app's document processing to handle these cases. Your solution should:
- Detect when a PDF has insufficient extractable text
- Extract meaningful content from non-text PDFs
- Integrate with the existing upload flow
Possible Approaches
- OCR - Use optical character recognition for scanned documents (Docling, Tesseract, cloud OCR APIs)
- Multimodal models - Render pages as images and use vision-capable LLMs (Claude, GPT-4o) to interpret content
- Hybrid - Combine text extraction with OCR/vision as fallback
- Document AI services - Azure Document Intelligence, AWS Textract, Google Document AI
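A minimal sketch of the hybrid approach from the list above (the threshold heuristic and helper names are illustrative, not a prescribed solution):

# Hybrid fallback sketch: detect thin text, render pages for a vision model
import fitz

MIN_CHARS_PER_PAGE = 50  # heuristic - tune against your real documents

def needs_vision_fallback(pdf_bytes: bytes) -> bool:
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    text = "".join(page.get_text() for page in doc)
    return len(text.strip()) < MIN_CHARS_PER_PAGE * len(doc)

def render_pages_as_png(pdf_bytes: bytes, dpi: int = 150) -> list[bytes]:
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    # Feed these images to a vision-capable LLM (Claude, GPT-4o) to describe
    return [page.get_pixmap(dpi=dpi).tobytes("png") for page in doc]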
Success Criteria
- Upload a scanned PDF and ask the agent questions about its content
- Upload a PDF with charts/diagrams and verify the agent can describe the visual elements
- The solution should gracefully fall back to simple text extraction when appropriate
Considerations
Think about trade-offs between cost, speed, accuracy, and complexity. Not every document needs expensive vision processing.
Key Takeaway
The bootcamp app demonstrates that simple works. No vector database, no embedding API calls - just a Python dictionary and direct context injection. Start here. Add RAG only when you outgrow these limits.