Layer 2: RAG — giving the model facts it doesn’t have
Retrieval-Augmented Generation is the single highest-leverage technique you can learn as an AI engineer. By my rough estimate, most “AI products” in production are RAG with a UI on top.
The flow:
user query → embed → search vector DB → top-K chunks → stuff into prompt → LLM answer
But each step has a knob, and bad knobs ruin the system.
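To make the knobs concrete, here is a minimal sketch of that flow in plain Python. The bag-of-words embed function, the in-memory list standing in for the vector DB, the sample chunks, and the names retrieve and build_prompt are all illustrative assumptions, and the final LLM call is left out. The only point is that top_k and the prompt template are explicit parameters you can tune.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Embed the query, score every chunk, keep the top_k best matches."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Stuff the retrieved chunks into a prompt template."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

# Made-up chunks for illustration.
chunks = [
    "pricing was set to 499 per month",
    "the offsite is planned for march",
    "we agreed pricing changes need a review",
]
top_chunks = retrieve("what did we decide about pricing", chunks, top_k=2)
prompt = build_prompt("what did we decide about pricing", top_chunks)
```

Changing top_k or the template changes what the model sees, which is why each step deserves its own tuning pass.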
2.1 Chunking
You can’t shove a 400-page PDF into a prompt. You split it into chunks, embed each chunk, and retrieve the relevant ones at query time. How you chunk decides whether your RAG is good or garbage.
Fixed-size chunking (e.g. 512 tokens with 50-token overlap): fast, dumb, fine for prose.
Semantic chunking (split on paragraph/section boundaries): better for technical docs.
Structure-aware chunking (split code by function, markdown by heading): best for source code, API docs, legal contracts.
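from langchain_text_splitters import RecursiveCharacterTextSplitter  # older releases: langchain.text_splitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # measured in characters by default, not tokens
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " "],  # prefer paragraph breaks
)
chunks = splitter.split_text(long_document)  # long_document: the full text as one string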
Real use case — local JARVIS over personal notes: For the on-device personal assistant I’m building (Ollama + FastAPI + ChromaDB + Raspberry Pi voice terminal), naive chunking of my Obsidian notes was useless — a meeting note got split mid-sentence and retrieval returned half-thoughts. Switching to a markdown-aware splitter that respects heading hierarchy made answers go from “technically related” to “actually correct.”
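A heading-aware splitter can be sketched in a few lines. This is a minimal illustration in plain Python, not the splitter from the JARVIS project; the function name split_markdown_by_heading, the breadcrumb format, and the sample note are assumptions. Each chunk carries its heading path, so a retrieved chunk still says where it came from. The sketch ignores edge cases such as # characters inside fenced code.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_heading(text: str) -> list[dict]:
    """Split markdown into one chunk per heading section, keeping the heading path."""
    chunks, path, body = [], [], []

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"path": " > ".join(path), "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                      # close the previous section
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks

# Made-up sample note for illustration.
note = """# Meeting 2025-11
## Pricing
We agreed on a single paid tier.
Review again next quarter.
## Hiring
One backend role opens in January.
"""
sections = split_markdown_by_heading(note)
```

Here the two sentences under the Pricing heading stay together in one chunk instead of being cut at an arbitrary character count.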
2.2 Vector Database
This is just a database optimized for “find the K nearest vectors to this query vector” using ANN (Approximate Nearest Neighbor) algorithms like HNSW.
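For intuition, the exact version of that query is a brute-force scan: score every stored vector against the query and keep the best K. The sketch below does this in plain Python with made-up 3-dimensional vectors; top_k_cosine is a hypothetical helper name. ANN indexes such as HNSW exist to avoid this full scan on large collections, trading a little recall for speed.

```python
import heapq
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_cosine(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Exact nearest neighbours: one similarity computation per stored vector."""
    scored = ((vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Made-up vectors for illustration.
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
nearest = top_k_cosine([1.0, 0.0, 0.0], vectors, k=2)
```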
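import chromadb

client = chromadb.PersistentClient(path="./jarvis_memory")
collection = client.get_or_create_collection("notes")

# `embeddings` holds one precomputed vector per chunk.
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    metadatas=[{"source": "meeting_notes_2025_11"} for _ in chunks],
)

# query_texts is embedded with the collection's embedding function, so it must
# match the model that produced `embeddings`; otherwise pass query_embeddings.
results = collection.query(query_texts=["what did we decide about pricing?"], n_results=5)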
Real use case — HopeConnect face matching: Same primitive (vector similarity), different domain. I store ArcFace embeddings of missing-person photos in pgvector and run cosine similarity against newly uploaded “found” photos. Vector DBs aren’t just for text.
2.3 Reranking
Here’s the dirty secret of RAG: vector search retrieves similar chunks, not necessarily correct ones. A query about “Python list comprehension performance” might retrieve a chunk about “JavaScript array performance” — same shape, wrong language.
The fix is a two-stage retrieval:
Cheap retrieval: vector DB pulls top-50 candidates
Expensive rerank: a cross-encoder (or a small LLM) re-scores those 50 and picks the top-5
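from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, candidate) for candidate in top_50_chunks]
scores = reranker.predict(pairs)  # one relevance score per (query, chunk) pair
# Sort by score only, so tied scores never fall back to comparing chunk text.
ranked = sorted(zip(scores, top_50_chunks), key=lambda pair: pair[0], reverse=True)
top_5 = [chunk for _, chunk in ranked[:5]]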
Adding a reranker is usually the single biggest quality jump you’ll see in a RAG pipeline after you’ve gotten the basics right. Worth a week of your time.
Layer 3: Agents — when retrieval isn’t enough
RAG answers questions over a static corpus. Agents take actions: call APIs, run code, read files, browse, write to a DB. This is where AI engineering stops feeling like “smarter search” and starts feeling like building software.
3.1 Tools (function calling)
A “tool” is just a function whose schema you describe to the LLM, and which the model can decide to call. The runtime executes the function and feeds the result back to the model.
tools = [{
    "name": "get_stock_price",
    "description": "Get current price for an NSE/BSE ticker",
    "input_schema": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}]
Real use case — algo trading agent: My paper trading system exposes tools like get_technical_indicators(ticker), compute_position_size(capital, risk_pct), and place_paper_order(...). The LLM doesn't know what RELIANCE is trading at — it calls the tool, gets the answer, then reasons over it. This is the right division of labor: deterministic code for facts and execution, LLM for reasoning over outputs.
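The runtime half of that loop can be sketched as a small dispatcher. Everything here is illustrative: the TOOL_REGISTRY name, the stubbed get_stock_price with a hard-coded price, and the shape of the tool_call dict (a name plus an input object) are assumptions, since each provider formats tool calls slightly differently. What matters is that the model only picks a name and arguments, and deterministic code does the rest.

```python
import json

def get_stock_price(ticker: str) -> dict:
    """Stub: a real implementation would query a market data API."""
    fake_prices = {"RELIANCE": 1234.5}   # made-up number for illustration
    if ticker not in fake_prices:
        raise ValueError(f"unknown ticker: {ticker}")
    return {"ticker": ticker, "price": fake_prices[ticker]}

TOOL_REGISTRY = {"get_stock_price": get_stock_price}

def run_tool_call(tool_call: dict) -> str:
    """Execute one model-requested tool call and return a JSON string for the model."""
    name = tool_call.get("name")
    args = tool_call.get("input", {})
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps({"result": func(**args)})
    except (TypeError, ValueError) as exc:
        # Bad or missing arguments go back to the model as data, not as a crash.
        return json.dumps({"error": str(exc)})

ok = run_tool_call({"name": "get_stock_price", "input": {"ticker": "RELIANCE"}})
bad = run_tool_call({"name": "get_stock_price", "input": {}})
```

Returning errors as structured results gives the model a chance to correct its own call on the next turn.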
3.2 MCP (Model Context Protocol)
MCP is an open protocol, originally introduced by Anthropic, for exposing tools, resources, and prompts to LLMs in a standardized way. Think of it as “USB-C for AI tools”: instead of every app inventing its own function-calling wrapper, you build an MCP server once and any MCP-compatible client (Claude Desktop, Cursor, your own app) can use it.
For an AI engineer in 2026, learning MCP isn’t optional anymore. The ecosystem is consolidating fast.
Real use case — CognitoAITesting: I’m building a test-generation agent that watches SAP Fiori user flows and emits Playwright factory-pattern test code. Instead of welding it tightly to one client, the heavy lifting (DOM extraction, network capture, code generation) sits behind an MCP server. That means the same tool surface works from a Chrome extension, a Cursor command, or a CLI — without rewriting the integration three times.
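Under the hood, MCP messages are JSON-RPC 2.0. The sketch below shows roughly what a tools/list and tools/call exchange looks like, using a toy in-process handler. It is a simplified illustration of the message shapes, not a compliant server: it skips initialization, transports, and most error handling, and the extract_dom tool and handle_request function are hypothetical. In practice you would use an MCP SDK rather than hand-rolling this.

```python
# Hypothetical tool table: one stubbed tool with a JSON Schema for its input.
TOOLS = {
    "extract_dom": {
        "description": "Return a summary of the DOM for a URL (stubbed).",
        "inputSchema": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
        "handler": lambda args: f"stub DOM summary for {args['url']}",
    }
}

def handle_request(request: dict) -> dict:
    """Answer a tools/list or tools/call request with a JSON-RPC style response."""
    method, params = request["method"], request.get("params", {})
    if method == "tools/list":
        result = {"tools": [
            {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
            for name, t in TOOLS.items()
        ]}
    elif method == "tools/call":
        tool = TOOLS[params["name"]]
        text = tool["handler"](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

listing = handle_request({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
call = handle_request({
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {"name": "extract_dom", "arguments": {"url": "https://example.com"}},
})
```

Because every client speaks the same two methods, the tool table above is the only part that changes from project to project.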
3.3 Memory
Conversational memory is the hardest part of agent engineering, and it’s where most “AI assistants” fall apart in week two. There are three tiers, and you almost always need all three:
Short-term (working memory): the rolling message history within the context window. Manage it with summarization once you cross ~70% of the limit.
Long-term (episodic): past conversations and outcomes, retrieved via — surprise — a vector DB. Yes, memory is just RAG over your own history.
Structured (semantic): facts you’ve explicitly committed about the user. “Sachin lives in India, builds DevUtil.dev, prefers FastAPI.” Store these as key-value or graph data, not embeddings, so they’re deterministic to retrieve.
async def respond(user_msg: str, user_id: str):
    facts = await structured_memory.get(user_id)           # who they are
    past = await vector_memory.search(user_id, user_msg)   # what's happened before
    recent = working_memory.last_n(user_id, n=10)          # current thread

    prompt = build_prompt(facts, past, recent, user_msg)
    reply = await llm(prompt)

    working_memory.append(user_id, user_msg, reply)
    if is_factual_update(user_msg):
        await structured_memory.upsert(user_id, extract_fact(user_msg))
    return reply
Real use case — JARVIS again: A voice assistant that forgets every conversation is a demo, not a product. The version on my Raspberry Pi keeps structured facts (“wife’s birthday is X”, “I take metformin”), vector-indexed conversation history, and a rolling working buffer. The same triple-tier pattern shows up in every production agent worth using.
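The rule of summarizing working memory at roughly 70% of the limit can be sketched like this. The WorkingMemory class, the whitespace-based token estimate, and the summarize callback are all assumptions for illustration; a real system would count tokens with the model's tokenizer and call an LLM to write the summary.

```python
from typing import Callable

class WorkingMemory:
    def __init__(self, token_limit: int, summarize: Callable[[list[str]], str],
                 threshold: float = 0.7, keep_recent: int = 2):
        self.token_limit = token_limit
        self.summarize = summarize
        self.threshold = threshold
        self.keep_recent = keep_recent    # assumes keep_recent >= 1
        self.messages: list[str] = []

    @staticmethod
    def estimate_tokens(text: str) -> int:
        return len(text.split())          # crude proxy: one token per word

    def total_tokens(self) -> int:
        return sum(self.estimate_tokens(m) for m in self.messages)

    def append(self, message: str) -> None:
        self.messages.append(message)
        over_budget = self.total_tokens() > self.threshold * self.token_limit
        if over_budget and len(self.messages) > self.keep_recent:
            old = self.messages[:-self.keep_recent]
            recent = self.messages[-self.keep_recent:]
            # Replace older turns with one summary; keep the latest turns verbatim.
            self.messages = [self.summarize(old)] + recent

# Placeholder summarizer; in practice this would be an LLM call.
memory = WorkingMemory(token_limit=20, summarize=lambda msgs: f"[summary of {len(msgs)} messages]")
for turn in ["one two three four five",
             "six seven eight nine ten",
             "eleven twelve thirteen fourteen fifteen"]:
    memory.append(turn)
```

After the third turn the buffer crosses 70% of its 20-token budget, so the oldest turn is collapsed into a summary while the two most recent turns stay intact.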
The actual learning path — week by week
If you’re starting Monday, here’s the sequence I’d follow:
[Image: week-by-week learning path]
Two principles will save you months:
Build the pipeline before you optimize any piece of it. A working ugly RAG beats a beautiful chunking strategy with no retrieval wired up.
Evaluate or perish. Keep a tiny eval set of 20–50 query/answer pairs from day one. Every change gets scored against it. Without this, you’re guessing.
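A tiny eval loop does not need a framework. The sketch below scores a retriever by hit rate: the fraction of queries where any of the top-k results contains the expected answer string. The hit_rate function, the toy keyword retriever, the corpus, and the two eval pairs are made up for illustration; swap in your own pipeline and real query/answer pairs.

```python
from typing import Callable

def hit_rate(eval_set: list[tuple[str, str]],
             retrieve: Callable[[str, int], list[str]],
             k: int = 3) -> float:
    """Fraction of queries whose top-k results contain the expected answer text."""
    if not eval_set:
        return 0.0
    hits = 0
    for query, expected in eval_set:
        results = retrieve(query, k)
        if any(expected.lower() in chunk.lower() for chunk in results):
            hits += 1
    return hits / len(eval_set)

# Made-up corpus for illustration.
corpus = [
    "Pricing: single paid tier at 499 per month.",
    "Hiring: one backend role opens in January.",
]

def keyword_retrieve(query: str, k: int) -> list[str]:
    """Toy retriever: rank chunks by how many query words they share."""
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return ranked[:k]

eval_set = [
    ("what is the pricing tier", "499 per month"),
    ("which role opens in january", "backend role"),
]
score = hit_rate(eval_set, keyword_retrieve, k=1)
```

Run this after every change to chunking, embeddings, or reranking, and keep the score history next to your commits.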
Closing thought
AI engineering in 2026 isn’t about training models. It’s about composing three primitives — transformers, retrieval, agency — into systems that solve real problems for real users. The sequence above is the order in which each layer becomes legible: you can’t reason about RAG until you understand vectors, and you can’t build a reliable agent until you understand why RAG sometimes lies to it.
Start with embeddings. Ship the ugly version. Iterate from there.