How Software Engineers Should Actually Learn AI Engineering, in the Right Order

A 3-layer roadmap with real code, real use cases, and zero hand-waving.

If you’re a software engineer staring at the AI space right now, you’ve probably noticed the problem: every course either starts with linear algebra and never reaches anything you can ship, or it starts with “build a chatbot in 10 lines of LangChain” and you walk away unable to debug a single thing when it breaks in production.

There’s a saner path. It looks like this:

Transformers → Vectors → Attention → FFNN

RAG → Chunking → Vector DB → Reranking

Agents → Tools → MCP → Memory

Notice what’s not in this list: training models from scratch, GPU optimization, fine-tuning LLaMA on your laptop. As an AI engineer (not an ML researcher), your job is to compose these primitives into products. This post walks the sequence with the use cases I’ve actually built or am building right now.

Layer 1: Transformers — the substrate everything sits on

You don’t need to implement a transformer from scratch. You do need a working mental model of three things, because every weird LLM behavior you’ll ever debug traces back to one of them.

1.1 Vectors (embeddings)

An embedding is just a list of numbers that captures the meaning of a piece of text. Two sentences with similar meaning sit close together in this high-dimensional space; two unrelated sentences sit far apart.

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode([
        "format this JSON",
        "pretty print my JSON payload",
        "calculate compound interest",
    ])
    # vectors[0] and vectors[1] are nearly identical.
    # vectors[2] is far from both.

Real use case — DevUtil.dev search: I run a developer toolkit with 30+ tools (JSON formatter, regex tester, JWT decoder, etc.). Keyword search is brittle — a user typing “make my JSON pretty” won’t match a tool literally named “JSON Formatter” if I only do string matching. Embedding the tool descriptions once and doing cosine similarity at query time turns a clunky search box into something that feels like it actually understands intent.
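To make the "cosine similarity at query time" step concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the `search` helper are hypothetical stand-ins: real embeddings from a model like the one above have hundreds of dimensions, and this is not the actual DevUtil.dev code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, tool_vecs, top_k=2):
    """Rank tools by similarity to the query, best match first."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in tool_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Hypothetical 3-dimensional vectors standing in for real embeddings.
TOOL_VECS = {
    "JSON Formatter": [0.9, 0.1, 0.0],
    "Regex Tester": [0.1, 0.9, 0.1],
    "JWT Decoder": [0.2, 0.1, 0.9],
}

# Stand-in for the embedding of "make my JSON pretty".
query_vec = [0.8, 0.2, 0.1]
results = search(query_vec, TOOL_VECS)
```

The tool vectors are computed once and stored; only the query is embedded per search, which is why this stays cheap at 30+ tools.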

1.2 Attention

Attention is how a transformer decides which previous tokens matter when generating the next token. In the famous example “the animal didn’t cross the street because it was too tired,” attention is what lets the model figure out that “it” refers to “animal” and not “street.”

You don’t need to derive softmax(QK^T/√d)V to be useful. You need to internalize two consequences:

The context window is finite, and attention cost grows quadratically with its length. Doubling the input roughly quadruples the attention compute. This is why you can’t just dump your entire codebase into the prompt and pray.

The model attends to the whole context every step. This is why prompt injection works, why “ignore previous instructions” is a real attack, and why putting the most important instructions at the end of long prompts often helps.
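If you want to see where the quadratic cost comes from, a toy version of the formula fits in a few lines. This sketch assumes plain unmasked self-attention with no learned projections and made-up 2-dimensional token vectors, so it illustrates the shape of the computation rather than a real model.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    # One score per (query token, key token) pair: an n x n matrix.
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
              for q in Q]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of the value vectors.
    out = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
           for row in weights]
    return out, weights

def score_count(n_tokens):
    """Number of pairwise scores attention computes for n tokens."""
    return n_tokens * n_tokens

# Three toy tokens with 2-dimensional vectors (made-up numbers).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)
```

The `scores` matrix has one entry per token pair, so `score_count(2000)` is four times `score_count(1000)`. That is the "double the input, quadruple the compute" rule in code form.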

1.3 Feed-Forward Networks (FFNN)

After attention figures out what to look at, the FFNN layer does the actual “thinking” on each token — transforming, recombining, retrieving knowledge baked into the weights. Recent interpretability work (Anthropic, others) increasingly points to FFNN layers as where facts live, while attention is the routing.
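As a rough picture of what "on each token" means, here is a sketch of a position-wise feed-forward layer. The weights are hand-picked toy numbers and the activation is ReLU; real models learn the weights, often use other activations, and wrap this in residual connections and normalization.

```python
def relu(x):
    return max(0.0, x)

def feed_forward(token, W1, b1, W2, b2):
    """Two-layer MLP applied to a single token vector."""
    # Expand to a wider hidden layer, apply the nonlinearity...
    hidden = [relu(sum(w * x for w, x in zip(row, token)) + b)
              for row, b in zip(W1, b1)]
    # ...then project back down to the original width.
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# Toy weights: 2 -> 4 -> 2 (made-up numbers, not learned).
W1 = [[1, 0], [0, 1], [1, 1], [-1, -1]]
b1 = [0, 0, 0, 0]
W2 = [[1, 0, 0.5, 0], [0, 1, 0.5, 0]]
b2 = [0, 0]

tokens = [[1.0, 2.0], [0.0, 1.0]]
# The same weights are applied to every token independently.
outputs = [feed_forward(t, W1, b1, W2, b2) for t in tokens]
```

Note that no token looks at any other token here. Mixing information across positions is attention's job; the feed-forward layer only transforms what each position already holds.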

Why this matters to you as an engineer: when an LLM “hallucinates” a library function that doesn’t exist, that’s the FFNN confidently retrieving a pattern that statistically fits — but isn’t grounded in reality. The fix isn’t a better prompt. The fix is the next layer: RAG.

Layer 2: RAG — giving the model facts it doesn’t have

Retrieval-Augmented Generation is the single highest-leverage technique you can learn as an AI engineer. By my rough estimate, most “AI products” in production are RAG with a UI on top.

user query → embed → search vector DB → top-K chunks → stuff into prompt → LLM answer

But each step has a knob, and bad knobs ruin the system.
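The whole pipeline can be sketched end to end in plain Python. Everything here is a stand-in chosen so the example runs without dependencies: `embed` is a toy bag-of-words counter rather than a real embedding model, the "vector DB" is a list, and the chunk texts are invented. The final prompt is what you would hand to whichever LLM you use.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_chunks):
    """Stuff the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Invented chunks for illustration.
CHUNKS = [
    "The pricing decision was to keep the free tier",
    "The regex tester supports named groups",
    "Deployment runs on a Raspberry Pi",
]

query = "what was the pricing decision"
prompt = build_prompt(query, retrieve(query, CHUNKS, top_k=1))
```

Each function maps to one arrow in the diagram, which is also where each knob lives: the embedding model, `top_k`, and the prompt template.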

2.1 Chunking

You can’t shove a 400-page PDF into a prompt. You split it into chunks, embed each chunk, and retrieve the relevant ones at query time. How you chunk decides whether your RAG is good or garbage.

Fixed-size chunking (e.g. 512 tokens with 50-token overlap): fast, dumb, fine for prose.

Semantic chunking (split on paragraph/section boundaries): better for technical docs.

Structure-aware chunking (split code by function, markdown by heading): best for source code, API docs, legal contracts.

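    # Newer LangChain releases ship this class in the langchain_text_splitters package.
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,      # measured in characters by default, not tokens
        chunk_overlap=100,
        separators=["\n\n", "\n", ". ", " "],  # prefer paragraph breaks
    )
    chunks = splitter.split_text(long_document)  # long_document: the full text as a str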
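For intuition about what the simplest strategy does under the hood, fixed-size chunking with overlap is just a sliding window. This sketch assumes whitespace-separated words as "tokens", which is a stand-in for a real tokenizer, and `chunk_fixed` is a hypothetical helper rather than a library function.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

tokens = "one two three four five six seven eight nine ten".split()
chunks = chunk_fixed(tokens, chunk_size=4, overlap=1)
# chunks[0] ends with "four" and chunks[1] starts with "four":
# the overlap keeps a sentence that straddles a boundary visible in both chunks.
```

The overlap costs you some duplicated storage, and in exchange a thought that crosses a chunk boundary is retrievable from either side.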

Real use case — local JARVIS over personal notes: For the on-device personal assistant I’m building (Ollama + FastAPI + ChromaDB + Raspberry Pi voice terminal), naive chunking of my Obsidian notes was useless — a meeting note got split mid-sentence and retrieval returned half-thoughts. Switching to a markdown-aware splitter that respects heading hierarchy made answers go from “technically related” to “actually correct.”
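A heading-aware splitter can be sketched roughly like this. It is a simplified illustration, not the splitter used in the project above: `split_markdown` is a hypothetical helper, it only recognizes `#`-style headings, and it would misread `#` comment lines inside fenced code blocks as headings.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown(text):
    """Split markdown into sections, each tagged with its heading path."""
    sections = []
    path = []  # current heading hierarchy, e.g. ["Meeting", "Pricing"]
    body = []  # lines collected since the last heading

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"path": " > ".join(path), "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level, then add this one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections

# Invented note for illustration.
note = "# Meeting\nIntro line\n## Pricing\nKeep the free tier.\n## Hiring\nOpen two roles.\n"
sections = split_markdown(note)
```

Storing the heading path alongside each chunk (for example as metadata, or prepended to the text before embedding) gives retrieval the context that a bare paragraph lacks.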

2.2 Vector DB

A vector DB is just a database optimized for “find the K nearest vectors to this query vector,” using ANN (Approximate Nearest Neighbor) algorithms like HNSW.

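    import chromadb

    # chunks: list[str] from the splitter; embeddings: one vector per chunk,
    # computed beforehand with your embedding model.
    client = chromadb.PersistentClient(path="./jarvis_memory")
    collection = client.get_or_create_collection("notes")
    collection.add(
        documents=chunks,
        embeddings=embeddings,
        ids=[f"chunk_{i}" for i in range(len(chunks))],
        metadatas=[{"source": "meeting_notes_2025_11"} for _ in chunks],
    )
    # query_texts is embedded by the collection's own embedding function, so it
    # must match the model that produced `embeddings`; otherwise embed the query
    # yourself and pass query_embeddings instead.
    results = collection.query(
        query_texts=["what did we decide about pricing?"],
        n_results=5,
    )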

Real use case — HopeConnect face matching: Same primitive (vector similarity), different domain. I store ArcFace embeddings of missing-person photos in pgvector and run cosine similarity against newly uploaded “found” photos. Vector DBs aren’t just for text.

2.3 Reranking

Here’s the dirty secret of RAG: vector search retrieves similar chunks, not necessarily correct ones. A query about “Python list comprehension performance” might retrieve a chunk about “JavaScript array performance”: same shape, wrong language.

The fix is a two-stage retrieval:

Cheap retrieval: vector DB pulls top-50 candidates

Expensive rerank: a cross-encoder (or a small LLM) re-scores those 50 and picks the top-5

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [(query, candidate) for candidate in top_50_chunks]
    scores = reranker.predict(pairs)
    top_5 = [c for _, c in sorted(zip(scores, top_50_chunks), reverse=True)[:5]]

Adding a reranker is usually the single biggest quality jump you’ll see in a RAG pipeline after you’ve gotten the basics right. Worth a week of your time.

Layer 3: Agents — when retrieval isn’t enough

RAG answers questions over a static corpus. Agents take actions: call APIs, run code, read files, browse, write to a DB. This is where AI engineering stops feeling like “smarter search” and starts feeling like building software.

3.1 Tools (function calling)

A “tool” is just a function whose schema you describe to the LLM, and which the model can decide to call. The runtime executes the function and feeds the result back to the model.

    tools = [{
        "name": "get_stock_price",
        "description": "Get current price for an NSE/BSE ticker",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    }]
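The "runtime executes the function and feeds the result back" half can be sketched as a small dispatcher. The shapes of `tool_call` and the returned dict are assumptions made for this example, not any provider's actual wire format, and `get_stock_price` returns a made-up number instead of calling a market-data API.

```python
import json

def get_stock_price(ticker):
    """Hypothetical implementation; a real one would call a market-data API."""
    fake_prices = {"INFY": 1500.0}  # made-up number for illustration
    return {"ticker": ticker, "price": fake_prices.get(ticker)}

# Maps tool names from the schema to the Python functions that implement them.
TOOL_REGISTRY = {"get_stock_price": get_stock_price}

def run_tool_call(tool_call):
    """Execute one model-requested tool call and package the result for the model."""
    name = tool_call["name"]
    if name not in TOOL_REGISTRY:
        return {"tool": name, "error": f"unknown tool: {name}"}
    try:
        result = TOOL_REGISTRY[name](**tool_call["input"])
    except Exception as exc:  # surface failures to the model instead of crashing
        return {"tool": name, "error": str(exc)}
    return {"tool": name, "content": json.dumps(result)}

# Shape of a request the model might emit (assumed for this sketch).
tool_call = {"name": "get_stock_price", "input": {"ticker": "INFY"}}
tool_result = run_tool_call(tool_call)
```

Returning errors as data rather than raising is a deliberate choice here: the model gets a chance to retry with different arguments, and a bad tool call never takes down the loop.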
