A 3-layer roadmap with real code, real use cases, and zero hand-waving.
If you’re a software engineer staring at the AI space right now, you’ve probably noticed the problem: every course either starts with linear algebra and never reaches anything you can ship, or it starts with “build a chatbot in 10 lines of LangChain” and you walk away unable to debug a single thing when it breaks in production.
There’s a saner path. It looks like this:
Transformers → Vectors → Attention → FFNN
RAG → Chunking → Vector DB → Reranking
Agents → Tools → MCP → Memory
Notice what’s not in this list: training models from scratch, GPU optimization, fine-tuning LLaMA on your laptop. As an AI engineer (not an ML researcher), your job is to compose these primitives into products. This post walks the sequence with the use cases I’ve actually built or am building right now.
Layer 1: Transformers — the substrate everything sits on
You don’t need to implement a transformer from scratch. You do need a working mental model of three things, because every weird LLM behavior you’ll ever debug traces back to one of them.
1.1 Vectors (embeddings)
An embedding is just a list of numbers that captures the meaning of a piece of text. Two sentences with similar meaning sit close together in this high-dimensional space; two unrelated sentences sit far apart.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([
    "format this JSON",
    "pretty print my JSON payload",
    "calculate compound interest",
])
# vectors[0] and vectors[1] should point in nearly the same direction
# (high cosine similarity); vectors[2] should score much lower against both.
Real use case — DevUtil.dev search: I run a developer toolkit with 30+ tools (JSON formatter, regex tester, JWT decoder, etc.). Keyword search is brittle — a user typing “make my JSON pretty” won’t match a tool literally named “JSON Formatter” if I only do string matching. Embedding the tool descriptions once and doing cosine similarity at query time turns a clunky search box into something that feels like it actually understands intent.
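The cosine-similarity lookup described above can be sketched in plain Python. This is a minimal sketch, not the actual DevUtil.dev code: the tool names come from the list above, but the 3-dimensional vectors are hand-picked stand-ins (real embeddings would come from `model.encode(...)` and have hundreds of dimensions), and `search` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, tool_vecs, top_k=3):
    """Rank tool names by similarity to the query vector, best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in tool_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]

# Hypothetical toy vectors, embedded once and kept in memory.
tool_vecs = {
    "JSON Formatter": [0.9, 0.1, 0.0],
    "Regex Tester": [0.1, 0.9, 0.1],
    "JWT Decoder": [0.2, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stand-in for the embedding of "make my JSON pretty"
print(search(query_vec, tool_vecs, top_k=2))  # "JSON Formatter" ranks first
```

For a few dozen tools, a linear scan like this is plenty; an index only starts to matter once the collection grows much larger.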
1.2 Attention
Attention is how a transformer decides which previous tokens matter when generating the next token. Take the well-known example “the animal didn’t cross the street because it was too tired”: attention is what lets the model work out that “it” refers to “animal” and not “street”.
You don’t need to derive softmax(QK^T/√d)V to be useful. You need to internalize two consequences:
The context window is finite, and attention cost grows quadratically with its length. Doubling the input roughly quadruples the attention compute. This is why you can’t just dump your entire codebase into the prompt and pray.
The model attends to the whole context every step. This is why prompt injection works, why “ignore previous instructions” is a real attack, and why putting the most important instructions at the end of long prompts often helps.
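To make the quadratic claim concrete, here is a minimal pure-Python sketch of scaled dot-product attention that also counts how many query-key scores it computes. It assumes a single head and no causal mask, and `toy_tokens` is a hypothetical helper that just produces placeholder vectors; production implementations are batched, masked, and run on accelerators.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, without masking."""
    d = len(K[0])
    outputs, n_scores = [], 0
    for q in Q:
        # One score per (query token, key token) pair: n * n scores in total.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        n_scores += len(scores)
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return outputs, n_scores

def toy_tokens(n, d=4):
    """Hypothetical token vectors, just to have something to feed in."""
    return [[((i + 1) * (j + 1)) % 5 / 5 for j in range(d)] for i in range(n)]

_, scores_8 = attention(toy_tokens(8), toy_tokens(8), toy_tokens(8))
_, scores_16 = attention(toy_tokens(16), toy_tokens(16), toy_tokens(16))
print(scores_8, scores_16)  # 64 and 256: twice the tokens, four times the scores
```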
1.3 Feed-Forward Networks (FFNN)
After attention figures out what to look at, the FFNN layer does the actual “thinking” on each token — transforming, recombining, retrieving knowledge baked into the weights. Recent interpretability work (Anthropic, others) increasingly points to FFNN layers as where facts live, while attention is the routing.
Why this matters to you as an engineer: when an LLM “hallucinates” a library function that doesn’t exist, that’s the FFNN confidently retrieving a pattern that statistically fits — but isn’t grounded in reality. The fix isn’t a better prompt. The fix is the next layer: RAG.
Layer 2: RAG — giving the model facts it doesn’t have
Retrieval-Augmented Generation is the single highest-leverage technique you can learn as an AI engineer. Most “AI products” in production are, in practice, RAG with a UI on top.
The flow:
user query → embed → search vector DB → top-K chunks → stuff into prompt → LLM answer
But each step has a knob, and bad knobs ruin the system.
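The “stuff into prompt” step is the one that gets the least attention, so here is a minimal sketch of it. The function name `build_prompt`, the instruction wording, the character budget, and the sample chunks are all assumptions for illustration; a real pipeline would budget in tokens and take the chunks from retrieval.

```python
def build_prompt(query, chunks, max_chars=2000):
    """Pack retrieved chunks into a grounded prompt, stopping at a character budget."""
    context_parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_chars:
            break  # budget exhausted: drop the lower-ranked chunks
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical retrieved chunks, already sorted best-first.
top_k_chunks = [
    "Pricing decision: keep the free tier, add a paid plan.",
    "Action item: draft the pricing page copy.",
]
print(build_prompt("What did we decide about pricing?", top_k_chunks))
```

Numbering the chunks makes it possible to ask the model to cite which one it used, which helps when debugging a wrong answer.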
2.1 Chunking
You can’t shove a 400-page PDF into a prompt. You split it into chunks, embed each chunk, and retrieve the relevant ones at query time. How you chunk decides whether your RAG is good or garbage.
Fixed-size chunking (e.g. 512 tokens with 50-token overlap): fast, dumb, fine for prose.
Semantic chunking (split on paragraph/section boundaries): better for technical docs.
Structure-aware chunking (split code by function, markdown by heading): best for source code, API docs, legal contracts.
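The fixed-size strategy in the first bullet is simple enough to sketch without a library. This version assumes the text is already tokenized and uses whitespace-split words as a stand-in for real tokens; `chunk_fixed` is a hypothetical helper name.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Slide a window of `size` tokens, stepping forward by size - overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks

tokens = "one two three four five six seven eight nine ten".split()
for chunk in chunk_fixed(tokens, size=4, overlap=1):
    print(chunk)
```

Each chunk repeats the last token of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.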
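# Newer LangChain releases ship splitters in their own package:
#   pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # measured in characters by default, not tokens
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " "],  # try paragraph breaks first, then lines, sentences, words
)
chunks = splitter.split_text(long_document)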
Real use case — local JARVIS over personal notes: For the on-device personal assistant I’m building (Ollama + FastAPI + ChromaDB + Raspberry Pi voice terminal), naive chunking of my Obsidian notes was useless — a meeting note got split mid-sentence and retrieval returned half-thoughts. Switching to a markdown-aware splitter that respects heading hierarchy made answers go from “technically related” to “actually correct.”
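A heading-aware splitter of the kind described above can be sketched in a few lines. This is an illustration under assumptions, not the splitter used in the project: `split_markdown_by_heading` is a hypothetical name, the sample note is made up, and the sketch does not handle `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_heading(text):
    """Split markdown into one chunk per section, tagged with its heading path."""
    chunks, stack, body = [], [], []  # stack holds (level, title) of open headings

    def flush():
        content = "\n".join(body).strip()
        if content:
            path = " > ".join(title for _, title in stack)
            chunks.append({"path": path, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that just ended
            level = len(match.group(1))
            # Pop headings at the same or deeper level before descending.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks

notes = """# Pricing sync
Attendees: A, B

## Decisions
Keep the free tier.

## Action items
Draft the pricing page.
"""
for chunk in split_markdown_by_heading(notes):
    print(chunk["path"], "|", chunk["text"])
```

Storing the heading path alongside the text means a retrieved chunk still carries its context ("Pricing sync > Decisions") even though the body alone would read as a half-thought.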
2.2 Vector Database
This is just a database optimized for “find the K nearest vectors to this query vector” using ANN (Approximate Nearest Neighbor) algorithms like HNSW.
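import chromadb
from sentence_transformers import SentenceTransformer

# Embed chunks and queries with the same model, or the distances are meaningless.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks).tolist()

client = chromadb.PersistentClient(path="./jarvis_memory")
collection = client.get_or_create_collection("notes")
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    metadatas=[{"source": "meeting_notes_2025_11"} for _ in chunks],
)
# Passing query_texts instead would embed the query with the collection's
# default embedding function, which may not match the model used above.
query_vec = model.encode(["what did we decide about pricing?"]).tolist()
results = collection.query(query_embeddings=query_vec, n_results=5)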
Real use case — HopeConnect face matching: Same primitive (vector similarity), different domain. I store ArcFace embeddings of missing-person photos in pgvector and run cosine similarity against newly uploaded “found” photos. Vector DBs aren’t just for text.
2.3 Reranking
Here’s the dirty secret of RAG: vector search retrieves similar chunks, not necessarily correct ones. A query about “Python list comprehension performance” might retrieve a chunk about “JavaScript array performance” — same shape, wrong language.
The fix is a two-stage retrieval:
Cheap retrieval: vector DB pulls top-50 candidates
Expensive rerank: a cross-encoder (or a small LLM) re-scores those 50 and picks the top-5
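from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Score every (query, candidate) pair jointly; higher means more relevant.
pairs = [(query, candidate) for candidate in top_50_chunks]
scores = reranker.predict(pairs)
# Sort on the score only, so ties never fall back to comparing the chunks themselves.
ranked = sorted(zip(scores, top_50_chunks), key=lambda pair: pair[0], reverse=True)
top_5 = [chunk for _, chunk in ranked[:5]]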
Adding a reranker is usually the single biggest quality jump you’ll see in a RAG pipeline after you’ve gotten the basics right. Worth a week of your time.
Layer 3: Agents — when retrieval isn’t enough
RAG answers questions over a static corpus. Agents take actions: call APIs, run code, read files, browse, write to a DB. This is where AI engineering stops feeling like “smarter search” and starts feeling like building software.
3.1 Tools (function calling)
A “tool” is just a function whose schema you describe to the LLM, and which the model can decide to call. The runtime executes the function and feeds the result back to the model.
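The runtime half of that loop, executing the call and handing the result back, can be sketched independently of any provider SDK. Everything here is an assumption for illustration: `word_count` is a hypothetical tool, `TOOL_REGISTRY` and `run_tool_call` are made-up names, and the call is assumed to arrive as a dict with a `name` and an `input` mapping.

```python
import json

def word_count(text):
    """Hypothetical tool: count whitespace-separated words."""
    return {"words": len(text.split())}

# Map tool names (as described to the model) to the Python callables behind them.
TOOL_REGISTRY = {"word_count": word_count}

def run_tool_call(call):
    """Execute one model-issued tool call and return a JSON string for the model."""
    func = TOOL_REGISTRY.get(call["name"])
    if func is None:
        result = {"error": f"unknown tool: {call['name']}"}
    else:
        try:
            result = func(**call["input"])
        except TypeError as exc:  # wrong or missing arguments from the model
            result = {"error": str(exc)}
    return json.dumps(result)

# Assumed shape of a tool call: a name plus a dict of arguments.
print(run_tool_call({"name": "word_count", "input": {"text": "format this JSON"}}))
```

Returning errors as data rather than raising lets the model see what went wrong and retry with corrected arguments.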
tools = [{
    "name": "get_stock_price",
    "description": "Get current price for an NSE/BSE ticker",
    "input_schema": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}]