Propel AI Bootcamp
Course 1 - Module 3

Context Engineering & RAG

Understand when simple context injection is enough and when you need RAG - with practical trade-offs

Learning Objectives

  • Implement simple in-memory document processing for small-scale use cases
  • Understand when and why to adopt RAG over simple context injection
  • Evaluate the complexity trade-offs RAG introduces
  • Build production-ready RAG pipelines when scale demands it
  • Understand memory types and when to use them vs RAG

Overview

Context engineering is about managing what the AI knows and when. A common mistake is jumping straight to complex RAG pipelines when simpler approaches would suffice. This module teaches you to start simple and add complexity only when needed.

The progression:

  1. Simple context injection - Put the document directly in the prompt
  2. In-memory document stores - Store and retrieve full documents by ID
  3. RAG pipelines - Semantic search over large document collections

Each step adds capability but also complexity. Understanding these trade-offs is essential for building production AI systems.

Codebase Reference

The bootcamp application demonstrates simple context injection - the first approach you should consider.

Topics Covered

1. Start Simple: In-Memory Document Processing

Before reaching for vector databases and embedding models, consider whether you actually need them. The bootcamp app demonstrates a simple context injection pattern that works well for many use cases.

How It Works

# From ai-bootcamp-app/backend/app/documents/store.py
class DocumentStore:
    """Simple in-memory document store - no embeddings, no vectors."""

    def __init__(self):
        self._documents: dict[str, Document] = {}

    def add(self, document: Document) -> None:
        self._documents[document.id] = document

    def get(self, document_id: str) -> Document | None:
        return self._documents.get(document_id)

πŸ“ See implementation: store.py

The Flow

1. User uploads PDF
   → PyMuPDF extracts text
   → Store full text in memory (dict keyed by doc ID)
   → Return doc ID to client

2. User asks a question about the document
   → Retrieve full document text by ID
   → Inject text directly into system prompt
   → LLM processes question with full context

3. No embeddings. No vector search. No chunking.

Text Extraction

# From ai-bootcamp-app/backend/app/documents/extraction.py
def extract_text_from_pdf(pdf_bytes: bytes) -> tuple[str, int]:
    """Extract text from PDF bytes using PyMuPDF."""
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    text_parts = []
    for page in doc:
        text_parts.append(page.get_text())
    return "\n\n".join(text_parts), len(doc)

πŸ“ See implementation: extraction.py

Context Injection Pattern

The magic happens when the agent runs. Document context is concatenated directly into the system instructions:

# From ai-bootcamp-app/backend/app/api/routes/agents.py (simplified)
async def run_agent(request: AgentRunRequest):
    # Collect context from memory providers
    memory_context_parts = []
    for provider in context_providers:
        ctx = await provider.invoking()
        if ctx:
            memory_context_parts.append(ctx)

    # Inject context into system instructions
    enhanced_instructions = request.instructions
    if memory_context_parts:
        memory_context = "\n\n".join(memory_context_parts)
        enhanced_instructions = f"{request.instructions}\n\n{memory_context}"

    # Create agent with context-enhanced instructions
    agent = create_agent(instructions=enhanced_instructions, ...)

Why This Works

| Benefit | Explanation |
| --- | --- |
| No embedding costs | No API calls to create embeddings |
| No vector store | Just a Python dictionary |
| Full context preserved | No lossy chunking or summarization |
| Explicit control | You know exactly what context the model sees |
| Simple debugging | Print the prompt, see what the model gets |

Limitations

| Limitation | Impact |
| --- | --- |
| Context window limits | Can't exceed ~100k tokens per document |
| No semantic search | Must select documents explicitly |
| Memory usage | All documents in RAM |
| No relevance ranking | All context weighted equally |
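A cheap pre-flight check for the first limitation is a character-count heuristic (~4 characters per token is a rough rule of thumb; for real budgeting, count with the model's actual tokenizer, e.g. tiktoken):

```python
def fits_in_context(text: str, budget_tokens: int = 100_000) -> bool:
    """Rough check using the common ~4 characters/token heuristic.
    This over/under-estimates for code and non-English text; use a real
    tokenizer before relying on it in production."""
    estimated_tokens = len(text) // 4
    return estimated_tokens <= budget_tokens

print(fits_in_context("word " * 1000))       # small doc -> True
print(fits_in_context("word " * 200_000))    # ~250k estimated tokens -> False
```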

2. When Simple Isn't Enough: Enter RAG

The simple approach breaks down when:

  • Documents exceed context limits - A 500-page manual won't fit in a prompt
  • You have many documents - Can't inject 1000 documents at once
  • Semantic search is needed - "Find documents about X" requires embeddings
  • Relevance matters - Not all context is equally important

RAG (Retrieval-Augmented Generation) solves these problems by:

  1. Chunking documents into smaller pieces
  2. Creating embeddings (vector representations) of each chunk
  3. Storing embeddings in a vector database
  4. Retrieving only relevant chunks at query time
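The four steps can be sketched end to end with a toy "embedding" - bag-of-words counts stand in for a real embedding model, so this illustrates the shape of the pipeline, not its retrieval quality:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words term counts
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Chunk: here each sentence is already one chunk
chunks = [
    "Cats sleep up to sixteen hours a day.",
    "The Eiffel Tower is in Paris.",
    "Python is a popular programming language.",
]

# 2-3. Embed each chunk and store it in an 'index' (a list is our vector store)
index = [(chunk, embed(chunk)) for chunk in chunks]

# 4. At query time: embed the query, return the top-k most similar chunks
def retrieve(query: str, k: int = 1) -> list[str]:
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("Where is the Eiffel Tower?"))   # -> ['The Eiffel Tower is in Paris.']
```

Swap `embed` for a real embedding API and `index` for a vector database and this becomes the standard RAG query path.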

3. RAG Complexity Trade-offs

Before adopting RAG, understand what you're signing up for:

New Components Required

| Component | Purpose | Complexity |
| --- | --- | --- |
| Embedding model | Convert text to vectors | API costs, model selection |
| Vector store | Store and search embeddings | Infrastructure, scaling |
| Chunking strategy | Split documents | Affects retrieval quality |
| Retrieval strategy | Find relevant chunks | Top-k, MMR, re-ranking |

New Failure Modes

| Problem | Description |
| --- | --- |
| Bad chunks | Poor chunking loses context across boundaries |
| Embedding drift | Different embedding models = incompatible vectors |
| Retrieval misses | Semantically similar ≠ actually relevant |
| Context fragmentation | Retrieved chunks may lack necessary context |
| Latency | Embedding + search + retrieval adds overhead |

Decision Framework

START HERE:
│
├─ Documents fit in context window?
│   ├─ YES → Use simple context injection
│   └─ NO → Continue...
│
├─ Need to search across documents?
│   ├─ NO → Use explicit document selection
│   └─ YES → Continue...
│
├─ Need semantic similarity search?
│   ├─ NO → Consider keyword search (BM25, full-text)
│   └─ YES → RAG is appropriate
│
└─ RAG REQUIRED: Accept the complexity
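The keyword-search branch need not mean extra infrastructure. Here is a sketch of the standard BM25 scoring formula - in practice you would reach for a library like rank-bm25 or your database's full-text search, but the idea fits in one function:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document for the query using the standard BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(toks) for toks in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "install the database server",
    "bake a chocolate cake",
    "restart the server daemon",
]
print(bm25_scores("server restart", docs))   # highest score for the third document
```

No embeddings, no vector store, and exact term matches (error codes, identifiers) are handled better than by semantic search.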

4. Traditional RAG Pipelines

When you do need RAG, understanding its architecture helps you implement it effectively.

How RAG Works

RAG follows a two-phase approach: indexing (offline) and retrieval + generation (at query time).

INDEXING PHASE (Offline - done once per document)
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Documents  │ -> │   Chunking  │ -> │  Embedding  │ -> │Vector Store │
│  (PDFs,     │    │  (Split     │    │  (Convert   │    │  (Store     │
│   text)     │    │   text)     │    │   to vecs)  │    │   & index)  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

QUERY PHASE (Online - every user query)
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Query    │ -> │   Embed     │ -> │   Search    │ -> │  Retrieve   │
│  "What is   │    │   Query     │    │   Vector    │    │   Top-K     │
│   X?"       │    │             │    │   Store     │    │   Chunks    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                                │
                                                                v
┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐
│   Answer    │ <- │     LLM     │ <- │  Query + Retrieved Context      │
│             │    │  Generate   │    │  "Given this context: [chunks]  │
│             │    │             │    │   Answer: What is X?"           │
└─────────────┘    └─────────────┘    └─────────────────────────────────┘

Core Components

1. Chunking - Splitting documents into retrievable pieces

| Strategy | Description | Best For |
| --- | --- | --- |
| Fixed size | Split every N characters/tokens | Simple, predictable |
| Semantic | Split at paragraph/section boundaries | Preserves meaning |
| Recursive | Try multiple separators (paragraph → sentence → word) | General purpose |
| Overlap | Include N characters from previous chunk | Preserves context across boundaries |

# Example: Recursive character splitting with overlap
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Target chunk size in characters
    chunk_overlap=200,    # Overlap between chunks
    separators=["\n\n", "\n", ". ", " "]  # Try these in order
)
chunks = splitter.split_text(document_text)

2. Embeddings - Converting text to vectors for similarity search

Embeddings map text to high-dimensional vectors where similar meanings are close together.

| Model | Dimensions | Notes |
| --- | --- | --- |
| text-embedding-3-small (OpenAI) | 1536 | Good balance of cost/quality |
| text-embedding-3-large (OpenAI) | 3072 | Higher quality, higher cost |
| all-MiniLM-L6-v2 (Sentence Transformers) | 384 | Free, runs locally |
| voyage-large-2 (Voyage AI) | 1024 | Strong retrieval performance |

# Example: Creating embeddings
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Your chunk of text here"]
)
vector = response.data[0].embedding  # List of 1536 floats

3. Vector Stores - Storing and searching embeddings

| Store | Type | Best For |
| --- | --- | --- |
| Chroma | In-memory / SQLite | Development, small datasets |
| Pinecone | Managed cloud | Production, serverless |
| Weaviate | Self-hosted / cloud | Hybrid search, filtering |
| pgvector | PostgreSQL extension | When you already use Postgres |
| FAISS | In-memory library | High performance, local |

# Example: Using Chroma for local development
import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

# Add documents (Chroma handles embedding automatically)
collection.add(
    documents=["chunk 1 text", "chunk 2 text"],
    ids=["chunk_1", "chunk_2"]
)

# Query
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=5
)

4. Retrieval Strategies - Finding the most relevant chunks

| Strategy | How It Works | Trade-off |
| --- | --- | --- |
| Top-K | Return K most similar chunks | Simple, may have redundancy |
| MMR (Maximal Marginal Relevance) | Balance relevance with diversity | Reduces redundant results |
| Hybrid | Combine vector search + keyword search (BM25) | Better for exact matches |
| Re-ranking | Use a cross-encoder to re-score top results | Higher quality, more latency |
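MMR can be sketched in a few lines; the vectors and cosine similarity here are toy stand-ins for real embeddings:

```python
import math

def cosine(a: tuple, b: tuple) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mmr(query_vec, candidates, k=2, lam=0.5):
    """Pick k chunks, trading relevance to the query (weight lam)
    against similarity to chunks already selected."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            relevance = cosine(query_vec, item[1])
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _ in selected]

chunks = [
    ("intro",      (0.9, 0.436, 0.0)),
    ("intro_copy", (0.9, 0.436, 0.0)),   # exact duplicate of "intro"
    ("details",    (0.7, 0.0, 0.714)),
]
print(mmr((1.0, 0.0, 0.0), chunks))   # -> ['intro', 'details']: the duplicate is skipped
```

Plain top-2 would have returned the near-identical pair; the redundancy penalty is what makes room for the second topic.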

Scaling Considerations

  • S3-backed vector stores - For corpora too large to fit in memory
  • Metadata filtering - Filter by date, source, category before vector search
  • Hierarchical indexing - Summarize documents, search summaries first, then drill down
  • Caching - Cache frequent queries to reduce latency and cost

Common Pitfalls

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Chunks too small | Retrieved context lacks necessary detail | Increase chunk size, add overlap |
| Chunks too large | Irrelevant content dilutes the answer | Decrease chunk size, use semantic splitting |
| Wrong embedding model | Poor retrieval despite good chunks | Evaluate models on your domain |
| No metadata | Can't filter by time/source | Add metadata during indexing |
| Stale index | New documents not searchable | Implement incremental indexing |

5. Progressive Context

Loading all context upfront wastes tokens and can overwhelm the model. Progressive context strategies load information as needed.

Techniques:

  • On-demand retrieval - Fetch context only when relevant
  • Tool flooding prevention - Limit available tools based on task
  • Lazy loading - Start minimal, expand as conversation develops

| Approach | When to Use |
| --- | --- |
| Simple injection | Small documents, explicit selection |
| Pre-built RAG | Known document corpus, predictable queries |
| On-the-fly RAG | Dynamic sources, exploratory queries |
| Progressive | Long conversations, multiple topics |

6. Structured Output

LLMs naturally produce free-form text, but applications need predictable data structures. Structured output techniques constrain model responses to match expected schemas, enabling reliable parsing and type safety.

The Problem with Free-Form Output

# Without structured output - fragile string parsing
response = llm.complete("Extract the person's name and age from: 'John is 25 years old'")
# Response might be:
#   "Name: John, Age: 25"
#   "The person is John and they are 25."
#   "John (25)"
# Good luck parsing all these variations reliably!

The Solution: Schema-Constrained Generation

# With structured output - guaranteed schema
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

response = llm.complete_structured(
    "Extract the person's name and age from: 'John is 25 years old'",
    response_model=Person
)
# response.name = "John"
# response.age = 25  (already an int!)

How It Works

| Approach | Mechanism | Trade-off |
| --- | --- | --- |
| JSON mode | Model trained to output valid JSON | No schema enforcement |
| Function calling | Model fills function parameters | Provider-specific |
| Constrained decoding | Grammar restricts token generation | Guaranteed valid, slower |
| Retry with validation | Parse output, retry on failure | Works with any model |
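The retry-with-validation approach works with any model. A minimal sketch with hand-rolled validation and a stand-in `llm` callable - in practice Pydantic would replace `validate_person`, and `llm` would be a real completion call:

```python
import json

def validate_person(data: dict) -> dict:
    # Minimal hand-rolled validation (Pydantic would do this for you)
    if not isinstance(data.get("name"), str) or not isinstance(data.get("age"), int):
        raise ValueError("expected {'name': str, 'age': int}")
    return data

def extract_with_retry(llm, prompt: str, max_attempts: int = 3) -> dict:
    last_error = ""
    for _ in range(max_attempts):
        feedback = f"\nPrevious output was invalid: {last_error}" if last_error else ""
        raw = llm(prompt + feedback)
        try:
            return validate_person(json.loads(raw))
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = str(exc)   # feed the error back into the next attempt
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")

# Stand-in model: fails once with free-form text, then returns valid JSON
replies = iter(['Name: John, Age: 25', '{"name": "John", "age": 25}'])
person = extract_with_retry(lambda p: next(replies), "Extract name and age: 'John is 25'")
print(person)   # -> {'name': 'John', 'age': 25}
```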

Libraries

Instructor - Works with OpenAI, Anthropic, and other providers

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class UserInfo(BaseModel):
    name: str
    age: int

user = client.chat.completions.create(
    model="gpt-4",
    response_model=UserInfo,
    messages=[{"role": "user", "content": "John Doe is 30 years old"}]
)
# user.name = "John Doe", user.age = 30

Pydantic AI - Native structured output support

from pydantic_ai import Agent
from pydantic import BaseModel

class CityInfo(BaseModel):
    name: str
    country: str
    population: int

agent = Agent('openai:gpt-4', result_type=CityInfo)
result = agent.run_sync("Tell me about Paris")
# result.data.name = "Paris"
# result.data.country = "France"

Native Provider Support - OpenAI and Anthropic have built-in structured output

# OpenAI structured outputs
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"]
            }
        }
    },
    messages=[{"role": "user", "content": "Extract: John is 25"}]
)

When to Use Structured Output

| Use Case | Why |
| --- | --- |
| API responses | Downstream systems need predictable formats |
| Data extraction | Pull structured data from unstructured text |
| Tool/function calling | Arguments must match expected types |
| Multi-step workflows | Output of one step feeds into next |
| Database insertion | Data must match schema |

Best Practices

  • Start with Pydantic models - Define your schema as code, get validation free
  • Use optional fields - Optional[str] for data that might not exist
  • Add descriptions - Help the model understand field meaning with Field(description="...")
  • Validate beyond types - Use Pydantic validators for business logic
  • Handle failures - Always have a fallback for when parsing fails

7. Memory Management

Memory gives agents the ability to maintain context across interactions, transforming stateless request-response systems into contextual assistants.

Why Memory Matters

Without memory, every agent interaction starts fresh:

  • Users must repeat context ("As I mentioned earlier…")
  • Agents cannot learn user preferences
  • Multi-turn conversations lose coherence
  • Personalization is impossible

Hybrid Dual-Layer Architecture

The bootcamp app uses a hybrid memory architecture combining fast access with permanent storage:

| Layer | Storage | Purpose | Latency | Retention |
| --- | --- | --- | --- | --- |
| L1 Cache | Redis | Fast access for active conversations | ~1ms | 24 hours TTL |
| L2 Storage | PostgreSQL | Permanent conversation history | ~10ms | Permanent |

How they connect: The thread_id links both layers. Redis stores messages at thread:{thread_id}:messages, and PostgreSQL has a conversations table with a unique thread_id column that references the same identifier.

Redis Key                         PostgreSQL
─────────────────────            ──────────────────────────────
thread:{thread_id}:messages  ←→  conversations.thread_id (UNIQUE)
                                       │
                                       ↓
                                 messages.conversation_id

This enables:

  • Fast retrieval: Check Redis first for active conversations
  • Hydration: If Redis TTL expired, reload from PostgreSQL and repopulate cache
  • Permanent history: Conversations are never lost after Redis expiration
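A hypothetical sketch of this cache-aside/hydration flow, with plain dicts standing in for Redis and PostgreSQL:

```python
# Plain dicts stand in for Redis (volatile cache) and PostgreSQL (permanent)
redis_cache: dict[str, list[str]] = {}
postgres: dict[str, list[str]] = {"thread-42": ["Hi!", "Hello, how can I help?"]}

def get_messages(thread_id: str) -> list[str]:
    key = f"thread:{thread_id}:messages"
    # 1. Fast path: check the cache first
    if key in redis_cache:
        return redis_cache[key]
    # 2. Cache miss (e.g. TTL expired): hydrate from permanent storage
    messages = postgres.get(thread_id, [])
    redis_cache[key] = messages   # repopulate the cache for the next read
    return messages

print(get_messages("thread-42"))                    # loaded from 'PostgreSQL'
print("thread:thread-42:messages" in redis_cache)   # -> True: cache rehydrated
```

The real implementation adds TTLs and async writes, but the control flow - cache read, fall through, repopulate - is exactly this.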

Memory vs RAG: When to Use Each

| Use Case | Solution | Why |
| --- | --- | --- |
| Search documents/knowledge base | RAG | Large corpus, semantic search |
| Remember conversation history | Memory | Sequential, time-ordered |
| User preferences/facts | Memory | Personal, long-lived |
| Real-time external data | Tools/APIs | Dynamic, external source |

Key insight: RAG retrieves relevant information; memory retrieves recent or personal information.

Types of Memory

Thread Memory (Short-term)

Scoped to a single conversation, stores message history in sequence. The ThreadMemoryProvider uses a hybrid retrieval strategy:

  1. First checks Redis for fast access
  2. If Redis is empty (TTL expired), hydrates from PostgreSQL
  3. Falls back to PostgreSQL-only if Redis is unavailable
# From ai-bootcamp-app/backend/app/memory/thread_provider.py
class ThreadMemoryProvider:
    """Provides conversation history with hybrid Redis/PostgreSQL retrieval."""

    def __init__(self, redis: Redis | None, db_session: Session, settings: MemorySettings, thread_id: str):
        self._redis = redis
        self._db_session = db_session
        self._settings = settings
        self._key = f"thread:{thread_id}:messages"
        self._thread_id = thread_id

    async def invoking(self) -> str:
        """Get context - Redis first, hydrate from PostgreSQL if needed."""
        messages = []

        if self._redis:
            raw = await self._redis.lrange(self._key, 0, -1)
            if raw:
                messages = [json.loads(m) for m in raw]
            elif self._db_session:
                # Redis empty - hydrate from PostgreSQL
                messages = await self._hydrate_from_db()
        elif self._db_session:
            messages = await self._get_messages_from_db()

        # Format for context injection...

  • Short TTL in Redis (24 hours default)
  • Permanent storage in PostgreSQL
  • Automatic sync via ConversationSyncService

πŸ“ See implementation: thread_provider.py

User Preferences (Long-term)

Scoped to a user across all conversations, stores facts and preferences.

# From ai-bootcamp-app/backend/app/memory/user_provider.py
class UserPreferencesProvider:
    """Provides user preferences across all threads."""

    def __init__(self, redis: Redis | None, settings: MemorySettings, user_id: str):
        self._redis = redis
        self._settings = settings
        self._key = f"user:{user_id}:preferences"

  • Long TTL (30 days default)
  • Enables: "Remember I prefer Python" personalization

πŸ“ See implementation: user_provider.py

Memory Decay and Prioritization

Not all memories are equal. Strategies include:

| Strategy | How It Works |
| --- | --- |
| Recency weighting | Recent messages matter more |
| Sliding window | Keep only last N messages |
| Summarization | Compress old context into summaries |
| TTL expiration | Automatic cleanup of stale data |
| Importance scoring | Retain high-value information longer |
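Sliding window and summarization compose naturally. A sketch - the summary is stubbed here; a real implementation would call an LLM to write it:

```python
def prune_context(messages: list[str], max_messages: int = 4) -> list[str]:
    """Sliding window with summarization: keep the last N messages verbatim
    and collapse everything older into a one-line summary (stubbed)."""
    if len(messages) <= max_messages:
        return messages
    older, recent = messages[:-max_messages], messages[-max_messages:]
    summary = f"[Summary of {len(older)} earlier messages]"  # real code: LLM call
    return [summary] + recent

history = [f"message {i}" for i in range(1, 8)]
print(prune_context(history))
# -> ['[Summary of 3 earlier messages]', 'message 4', 'message 5', 'message 6', 'message 7']
```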

Storage Options

The bootcamp app uses a hybrid approach combining Redis and PostgreSQL:

| Layer | Tool | Role | Notes |
| --- | --- | --- | --- |
| L1 Cache | Redis | Fast access, TTL-based expiration | Sub-millisecond reads for active conversations |
| L2 Storage | PostgreSQL | Permanent history | Conversations and messages tables |
| Sync | ConversationSyncService | Async persistence | Fire-and-forget sync to PostgreSQL |

Why hybrid? Redis provides speed for active conversations, while PostgreSQL ensures history is never lost. The ConversationSyncService bridges them asynchronously without blocking user interactions.

Alternative options:

| Tool | Best For | Notes |
| --- | --- | --- |
| Mem0 | Intelligent memory layer | Auto-extracts facts, manages decay |
| pgvector | Relational + vector | When you need SQL + embeddings |
| In-memory | Development/testing | No persistence, simple |

Configuration

Use environment variables to configure memory behavior:

# From ai-bootcamp-app/backend/.env.example

# Redis Settings
REDIS__HOST=localhost
REDIS__PORT=6379
REDIS__PASSWORD=
REDIS__DB=0

# Memory Settings
MEMORY__THREAD_TTL_HOURS=24
MEMORY__USER_TTL_DAYS=30
MEMORY__MAX_THREAD_MESSAGES=50

πŸ“ See configuration: config.py

Deep Dive: See Course 2, Module 3: Memory Management - Hybrid Redis + PostgreSQL Architecture for production implementation patterns including the Context Provider pattern, ConversationSyncService, database schemas, and graceful degradation strategies.

8. PDF Tooling

Extract text from documents for context injection or RAG pipelines. The right tool depends on your document complexity.

PyMuPDF (fitz) - Simple and Fast

Best for straightforward text extraction from standard PDFs.

# Simple extraction with PyMuPDF (from bootcamp app)
import fitz

def extract_text(pdf_bytes: bytes) -> str:
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    return "\n\n".join(page.get_text() for page in doc)

Pros: Fast, lightweight, no external dependencies
Cons: Struggles with complex layouts, tables, and images

Docling - Enterprise Document Understanding

Docling is IBM's open-source library for parsing complex documents. It goes beyond simple text extraction to understand document structure.

What Docling handles that PyMuPDF doesn't:

| Feature | PyMuPDF | Docling |
| --- | --- | --- |
| Basic text extraction | ✅ | ✅ |
| Table extraction | ❌ | ✅ (as structured data) |
| Image extraction | Basic | ✅ (with captions) |
| Layout understanding | ❌ | ✅ (headers, sections, lists) |
| Multi-format support | PDF only | PDF, DOCX, PPTX, HTML, images |
| OCR for scanned docs | ❌ | ✅ |
| Markdown output | ❌ | ✅ |

Installation:

pip install docling

Basic Usage:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Get markdown representation (great for LLM context)
markdown = result.document.export_to_markdown()

# Access structured elements
for table in result.document.tables:
    print(table.export_to_dataframe())  # Tables as pandas DataFrames

Why Markdown output matters:

LLMs understand markdown structure. Docling preserves:

  • Headings hierarchy (#, ##, ###)
  • Tables in markdown format
  • Lists and bullet points
  • Code blocks
  • Image references with captions

# Example: Converting a technical PDF for RAG
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("technical_manual.pdf")

# This markdown preserves structure the LLM can reason about
context = result.document.export_to_markdown()

# Now inject into your prompt
prompt = f"""Based on this documentation:

{context}

Answer: How do I configure the system?"""

When to Use Each

| Scenario | Recommended Tool |
| --- | --- |
| Simple text PDFs (articles, reports) | PyMuPDF |
| PDFs with tables you need to query | Docling |
| Scanned documents | Docling (with OCR) |
| Mixed document types (PDF, DOCX, PPTX) | Docling |
| Maximum speed, minimal dependencies | PyMuPDF |
| Complex layouts (multi-column, figures) | Docling |

Best Practices for Document Processing

  1. Preserve structure - Use markdown output to maintain headings and sections
  2. Handle tables specially - Extract tables as structured data, not flattened text
  3. Add metadata - Track source document, page numbers, section titles
  4. Chunk intelligently - Split at section boundaries, not arbitrary character counts
  5. Test with real documents - Your production documents may have quirks
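Point 4 can be as simple as splitting on markdown headings, which pairs well with Docling's markdown output. A minimal sketch:

```python
import re

def split_at_headings(markdown: str) -> list[str]:
    """Chunk a markdown document at section boundaries (lines starting
    with '#') instead of arbitrary character counts."""
    sections: list[list[str]] = []
    for line in markdown.splitlines():
        # A heading (or the very first line) starts a new section
        if re.match(r"#{1,6} ", line) or not sections:
            sections.append([])
        sections[-1].append(line)
    return ["\n".join(s).strip() for s in sections if "\n".join(s).strip()]

doc = "# Setup\nInstall the package.\n\n## Configuration\nSet the API key.\n"
for chunk in split_at_headings(doc):
    print(repr(chunk))
# -> '# Setup\nInstall the package.'
# -> '## Configuration\nSet the API key.'
```

Each chunk keeps its heading, so downstream retrieval can show the model where the text came from.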

Hands-On Exercise

Part 1: Test Simple Document Processing

  1. Start the bootcamp app with docker-compose up
  2. Open the Chat UI at http://localhost:3000
  3. Click the Upload Document button in the chat interface
  4. Select a PDF file from your computer
  5. The document text is extracted and stored in memory - no embeddings, no vector store
  6. Ask the agent questions about your uploaded document
  7. Notice how the full document context is injected into the conversation

Part 2: Test Thread Memory

  1. Open the Chat UI at http://localhost:3000
  2. Test conversation continuity:
    • Send: "My name is Alice and I am learning Python"
    • Send: "What is my name and what am I learning?"
    • Verify the agent remembers your context
  3. Test memory isolation:
    • Open a new incognito tab
    • Ask: "What is my name?"
    • Verify the agent doesn't know (new thread = fresh context)

Part 3: Inspect Storage

  1. Access Redis Commander at http://localhost:8001
  2. Find thread:{id}:messages keys
  3. View the JSON conversation history
  4. Note the TTL countdown (24 hours default)

Part 4 (Optional): Handling Non-Text PDFs

Challenge: The bootcamp app's current PDF extraction only handles text-based PDFs. Extend it to support documents where simple text extraction fails.

The Problem

The current PyMuPDF implementation returns empty or garbage text for:

  • Scanned documents - Images of text, not actual text
  • PDFs with diagrams/charts - Visual information the model can't see
  • Mixed content - Text plus images that need interpretation

Your Task

Extend the bootcamp app's document processing to handle these cases. Your solution should:

  1. Detect when a PDF has insufficient extractable text
  2. Extract meaningful content from non-text PDFs
  3. Integrate with the existing upload flow

Possible Approaches

  • OCR - Use optical character recognition for scanned documents (Docling, Tesseract, cloud OCR APIs)
  • Multimodal models - Render pages as images and use vision-capable LLMs (Claude, GPT-4o) to interpret content
  • Hybrid - Combine text extraction with OCR/vision as fallback
  • Document AI services - Azure Document Intelligence, AWS Textract, Google Document AI

Success Criteria

  • Upload a scanned PDF and ask the agent questions about its content
  • Upload a PDF with charts/diagrams and verify the agent can describe the visual elements
  • The solution should gracefully fall back to simple text extraction when appropriate

Considerations

Think about trade-offs between cost, speed, accuracy, and complexity. Not every document needs expensive vision processing.

Key Takeaway

The bootcamp app demonstrates that simple works. No vector database, no embedding API calls - just a Python dictionary and direct context injection. Start here. Add RAG only when you outgrow these limits.

Resources

Check Your Understanding

Question 1 of 5

What is the main advantage of simple context injection over RAG?