AI Workforce Glossary: 25 Terms Every Founder Building With Agents Should Know

I spent three hours in a Discord thread last month watching a founder debug a problem that boiled down to not knowing what "tool use" meant. He'd built an elaborate prompt-chaining system when he could've just given Claude access to a calculator. A week. Gone.

This happens constantly. The AI workforce space moves fast, terminology shifts every few months, and half the blog posts explaining these concepts are written by people who've never shipped anything. (I've written some of those posts, honestly. We all start somewhere.) So here's the glossary I wish I'd had when we started building 9 SaaS products with AI agents at VDL.

Twenty-five terms. Practical definitions. No hype. Probably a few opinions you'll disagree with.

1. Agent

A program that uses an LLM to complete tasks autonomously, usually in multiple steps. Unlike a chatbot that responds once and waits, an agent can observe, reason, act, and loop until a goal is met.

In practice: an agent might read a GitHub issue, write code, run tests, and open a PR — without you prompting at each step. The key distinction is autonomy. A chatbot needs you to drive. An agent drives itself (within guardrails).

2. Orchestrator

The coordinator that decides which agent does what and when. Think of it as the project manager for your AI workforce.

Some orchestrators are simple — a cron job that reads a YAML queue and dispatches tasks. Others are complex multi-agent systems where an "orchestrator agent" delegates to specialized workers. We tried the fancy approach. Burned two weeks on it. Ended up with a 200-line Node script. Boring works.

The orchestrator doesn't need to be smart; it needs to be reliable. I cannot overstate how much time I've wasted on "smart" orchestration.
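To make "boring works" concrete, here is a minimal sketch of a dumb-but-reliable dispatcher. It is not the script described above: the worker functions, the task shape, and the in-memory queue (standing in for a YAML file) are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical worker agents: each takes a task payload and returns a result string.
def research_agent(payload: str) -> str:
    return f"researched: {payload}"

def coding_agent(payload: str) -> str:
    return f"coded: {payload}"

# Routing table: task type -> worker. A plain dict lookup, no LLM involved.
WORKERS: dict[str, Callable[[str], str]] = {
    "research": research_agent,
    "code": coding_agent,
}

def dispatch(queue: list[dict]) -> list[dict]:
    """Run every queued task once and record success or failure."""
    results = []
    for task in queue:
        worker = WORKERS.get(task["type"])
        if worker is None:
            results.append({"id": task["id"], "status": "failed", "error": "unknown task type"})
            continue
        try:
            results.append({"id": task["id"], "status": "done", "output": worker(task["payload"])})
        except Exception as exc:  # one bad task should not stop the rest of the queue
            results.append({"id": task["id"], "status": "failed", "error": str(exc)})
    return results

# Made-up queue contents; the second task has no registered worker.
queue = [
    {"id": 1, "type": "research", "payload": "competitor pricing"},
    {"id": 2, "type": "deploy", "payload": "staging"},
]
results = dispatch(queue)
```

The routing is deliberately unintelligent: unknown task types fail loudly instead of being guessed at, which is the kind of reliability the definition is arguing for.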

3. Subagent

An agent that gets spawned by another agent (usually an orchestrator) to handle a specific subtask. The parent agent maintains overall context while subagents execute narrow pieces.

Example: a Research Agent might spawn three subagents to investigate different competitors simultaneously, then synthesize their findings. Subagents are how you parallelize work without losing coherence.
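The fan-out and fan-in shape of that example can be sketched roughly like this. The `investigate` function is a hypothetical stand-in for an LLM call, and the competitor names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def investigate(competitor: str) -> str:
    """Hypothetical subagent: stands in for an LLM call that researches one competitor."""
    return f"{competitor}: summary of findings"

def research_agent(competitors: list[str]) -> str:
    # Fan out: one subagent per competitor, run concurrently.
    with ThreadPoolExecutor(max_workers=3) as pool:
        findings = list(pool.map(investigate, competitors))  # map preserves input order
    # Fan in: the parent combines the pieces (a real parent would call the LLM again here).
    return "\n".join(findings)

report = research_agent(["Acme", "Globex", "Initech"])
```

Threads are a reasonable fit here because subagent calls are mostly waiting on network responses; the parent keeps the overall context and only sees each subagent's final summary.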

4. Tool Use (Function Calling)

Giving an agent access to external capabilities beyond text generation. Instead of asking the LLM to "imagine" what a database query would return, you let it actually run the query.

Common tools: file system access, web browsing, API calls, calculators, code execution. Claude Code's tool use is why it can actually read your codebase instead of hallucinating file contents.

Without tool use, agents are just really confident guessers. And confidence without capability is how you get expensive messes.
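Mechanically, tool use means the model emits a structured request and your code executes it. A minimal sketch, assuming the model returns a JSON object with `name` and `arguments` fields (a hypothetical shape; each provider's API has its own format):

```python
import json
import operator

# Hypothetical tool: a calculator the model can call instead of guessing at arithmetic.
def calculator(op: str, a: float, b: float) -> float:
    ops = {"add": operator.add, "sub": operator.sub, "mul": operator.mul, "div": operator.truediv}
    return ops[op](a, b)

TOOLS = {"calculator": calculator}

def run_tool_call(raw_call: str) -> str:
    """Execute a tool call the model emitted as JSON and return the result as text."""
    call = json.loads(raw_call)
    tool = TOOLS[call["name"]]
    result = tool(**call["arguments"])
    return json.dumps({"tool": call["name"], "result": result})

# Stand-in for model output; in a real system this comes from the LLM API response.
model_output = '{"name": "calculator", "arguments": {"op": "mul", "a": 1249, "b": 37}}'
tool_result = run_tool_call(model_output)
```

The result string goes back to the model as the next message, so the final answer is built on a computed value rather than a guess.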

5. RAG (Retrieval-Augmented Generation)

Feeding an LLM context from your own documents before it generates a response. The "retrieval" part grabs relevant chunks from a knowledge base; the "generation" part uses those chunks to answer.

When to use it: whenever the LLM needs to know things that aren't in its training data. Your product docs, customer tickets, internal wikis. RAG bridges the gap between "what Claude knows" and "what Claude needs to know about your business." We're building this into DevOS — our developer productivity platform — so agents can access project-specific context without custom retrieval pipelines.
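The retrieve-then-generate flow can be sketched in a few lines. This is a toy under stated assumptions: relevance is scored by shared words instead of embeddings, the documents are made-up sample data, and the final LLM call is left out.

```python
def score(query: str, chunk: str) -> int:
    """Toy relevance score: number of shared lowercase words (real systems use embeddings)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks with the highest relevance score."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Put the retrieved chunks in front of the question before it goes to the model."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Made-up knowledge base entries for illustration.
docs = [
    "Refunds are issued within 14 days of purchase.",
    "The API rate limit is 100 requests per minute.",
    "Invoices are emailed on the first of each month.",
]
prompt = build_prompt("What is the API rate limit?", docs)
```

Only the top-scoring chunks reach the prompt, which is the whole point: the model sees the relevant slice of your docs, not the entire knowledge base.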

6. Context Window

The maximum amount of text an LLM can process in a single conversation. Think of it as working memory. Claude's context window is large (200K tokens), but it's not infinite.

Why it matters: long conversations, big codebases, or lots of retrieved documents can fill the window. When that happens, early context gets dropped or summarized, and the agent loses important details. Managing context is a real engineering problem, not an afterthought.

We've written about keeping Claude context lean across codebases. It's non-trivial. I've probably spent more time on context management than on actual agent logic at this point, which feels backwards but here we are.
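One common way to keep context lean is to drop the oldest messages once a token budget is exceeded. A minimal sketch, assuming a rough 4-characters-per-token estimate instead of a real tokenizer; the function names and the budget are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = ["a" * 400, "b" * 200, "c" * 100]  # about 100, 50, and 25 tokens
trimmed = trim_history(history, budget=80)   # the oldest message is dropped
```

Dropping by recency is the bluntest option. Summarizing old messages or pinning key facts costs more effort but loses less.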

7. Tokens

The units LLMs use to process text. Roughly 4 characters per token for English. Tokens matter because API pricing is per-token, and context windows are measured in tokens.

Real talk: a 2,000-word document is about 2,500 tokens. A 50,000-line codebase might be 500K+ tokens — bigger than most context windows. Token math determines what's feasible.
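The back-of-envelope math behind those numbers looks roughly like this. The averages are assumptions chosen for illustration (about 5 characters per word including the space, about 40 characters per line of code); a real tokenizer will give different counts.

```python
CHARS_PER_TOKEN = 4  # rule of thumb for English text

def estimate_tokens(char_count: int) -> int:
    """Convert a character count into an approximate token count."""
    return char_count // CHARS_PER_TOKEN

def fits(char_count: int, context_window: int = 200_000) -> bool:
    """Check whether the estimated tokens fit in a context window of the given size."""
    return estimate_tokens(char_count) <= context_window

# Assumed averages: ~5 characters per word (with space), ~40 characters per line of code.
doc_tokens = estimate_tokens(2_000 * 5)          # 2,000-word document -> 2,500 tokens
codebase_tokens = estimate_tokens(50_000 * 40)   # 50,000-line codebase -> 500,000 tokens
codebase_fits = fits(50_000 * 40)                # False for a 200K-token window
```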

8. Human-in-the-Loop (HITL)

A checkpoint where a human reviews, approves, or modifies agent output before it continues. The opposite of full autonomy.

Non-negotiable for: anything touching money, customer-facing communications, anything hard to reverse.

The founders who skip HITL learn from incidents. I learned from the time an Outreach Agent sent 12 emails to the same person overnight. Twelve. To one person. At 3am. Now every email hits an approval queue. Should've started there, but hindsight, etc.
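An approval queue does not need to be elaborate. Here is a minimal sketch, not the actual queue described above: the class name, the dedupe rule, and the email address are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ApprovalQueue:
    """Holds agent-drafted emails until a human approves them."""
    pending: list[dict] = field(default_factory=list)
    sent: list[dict] = field(default_factory=list)

    def submit(self, to: str, body: str) -> None:
        # Dedupe guard: at most one pending or sent email per recipient.
        if any(e["to"] == to for e in self.pending + self.sent):
            return
        self.pending.append({"to": to, "body": body})

    def approve(self, to: str) -> bool:
        """Human gate: move one pending email to sent. Returns False if none is pending."""
        for email in self.pending:
            if email["to"] == to:
                self.pending.remove(email)
                self.sent.append(email)  # a real system would call the mail API here
                return True
        return False

queue = ApprovalQueue()
for _ in range(12):  # the agent tries to send twelve times
    queue.submit("lead@example.com", "Quick follow-up")
```

Twelve submissions produce one pending draft and zero sent emails until a human calls `approve`.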

9. Guardrails

Constraints that limit what an agent can do. Hard limits on actions, content filtering, scope restrictions, rate limits.

Examples: "Can only modify files in the /src directory." "Cannot make network requests to external APIs." "Must ask for confirmation before deleting anything." Guardrails are how you get agent benefits without agent disasters.
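Two of those rules, the /src restriction and confirm-before-delete, can be sketched as plain checks that run before any tool executes. The function names are hypothetical, and the path check is purely lexical; a production version would also need to resolve symlinks.

```python
import posixpath
from pathlib import PurePosixPath

ALLOWED_ROOT = PurePosixPath("/src")

def can_modify(path: str) -> bool:
    """Allow writes only inside ALLOWED_ROOT, after collapsing '..' segments."""
    normalized = PurePosixPath(posixpath.normpath(path))
    return normalized == ALLOWED_ROOT or ALLOWED_ROOT in normalized.parents

def guarded_delete(path: str, confirmed: bool) -> str:
    """Refuse deletes outside the allowed root or without explicit confirmation."""
    if not can_modify(path):
        return "blocked: outside /src"
    if not confirmed:
        return "blocked: needs confirmation"
    return f"deleted {path}"  # placeholder; a real tool would delete the file here
```

The important design choice is that these checks live in code, outside the prompt. A guardrail the model can talk itself out of is a suggestion.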

10. Prompt Engineering

The practice of crafting inputs that get better outputs from LLMs. Sometimes dismissed as "just writing," but the gap between a mediocre prompt and a good one is enormous.

What matters: specificity, examples, role definition, output format specification.

"Write a blog post" produces garbage. "You are a technical writer for a B2B SaaS. Write a 1,200-word tutorial about X. Use code examples. Avoid marketing language." produces something usable. The gap is prompt engineering. Also, hot take: most "prompt engineering courses" are overpriced. Just iterate on your prompts and read the actual model documentation.

11. System Prompt

Instructions that shape an agent's behavior across an entire conversation, as opposed to user prompts that drive individual turns. The system prompt defines the agent's role, constraints, and personality.

Think of it as the agent's job description. User prompts are tasks. System prompts are hiring criteria. At VDL, each agent has a versioned spec file that functions as its system prompt — version control on personality.

12. Chain-of-Thought (CoT)

Prompting the LLM to show its reasoning steps before giving a final answer. "Think step by step" is the simplest version.

Why it works: reasoning out loud tends to improve accuracy on complex tasks. The model is more likely to catch its own errors when it has to articulate its logic. CoT adds tokens (slower, more expensive) but reduces mistakes. Worth it for anything involving math, logic, or multi-step reasoning.

13. Few-Shot Prompting

Including examples of desired input-output pairs in the prompt. Instead of describing what you want, you show it.

Example: "Here are three customer emails and how I responded. Now respond to this fourth email in the same style." Few-shot is often more reliable than long explanations, especially for style matching and edge cases.
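Assembling a few-shot prompt is mostly string formatting. A minimal sketch with made-up emails and replies; the "Email:" and "Reply:" labels are an arbitrary convention, not a required format.

```python
def few_shot_prompt(examples: list[tuple[str, str]], new_input: str) -> str:
    """Format input/output pairs, then the new input, leaving the last reply open."""
    parts = [f"Email: {inp}\nReply: {out}" for inp, out in examples]
    parts.append(f"Email: {new_input}\nReply:")
    return "\n\n".join(parts)

# Made-up examples for illustration.
examples = [
    ("Can I get a refund?", "Sure thing. I've started it; expect it in 3-5 days."),
    ("Do you support SSO?", "Yes, on the Team plan. Happy to walk you through setup."),
    ("Your app is down!", "Sorry about that. We're on it and I'll update you within the hour."),
]
prompt = few_shot_prompt(examples, "How do I export my data?")
```

Ending the prompt on an open "Reply:" nudges the model to continue the pattern in the same style as the examples.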

14. Zero-Shot

Asking an LLM to perform a task without providing examples. Just the instruction.

Zero-shot works for common tasks. ("Summarize this article.") It fails for anything unusual or company-specific. If zero-shot isn't producing good results, adding two or three examples usually fixes it.

15. Fine-Tuning

Training a model on your own data to specialize its capabilities. Unlike prompting (which guides), fine-tuning actually changes the model's weights.

Honest assessment: most founders don't need fine-tuning. It's expensive, requires lots of data, and needs ongoing maintenance as base models update. RAG + good prompts solve 90% of customization needs.

Fine-tune only when you've exhausted other options and have thousands of high-quality examples. If someone's pitching you on fine-tuning for your 50-person startup, they're probably selling you something you don't need.

16. Embedding

Converting text into a vector (list of numbers) that captures semantic meaning. Similar concepts end up with similar vectors.

Use case: embeddings power semantic search. Instead of keyword matching, you find content by meaning. "How do I cancel my subscription?" matches documentation about "ending your plan" even without shared keywords. Embeddings are the retrieval foundation for RAG.
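"Similar vectors" usually means high cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from an embedding model and have hundreds of dimensions or more.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings.
cancel_subscription = [0.9, 0.1, 0.2]   # "How do I cancel my subscription?"
ending_your_plan = [0.8, 0.2, 0.3]      # docs about "ending your plan"
api_rate_limits = [0.1, 0.9, 0.4]       # unrelated docs

sim_related = cosine_similarity(cancel_subscription, ending_your_plan)
sim_unrelated = cosine_similarity(cancel_subscription, api_rate_limits)
```

With these numbers the related pair scores much higher than the unrelated pair, even though the two phrases share no keywords.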

17. Vector Database

A database optimized for storing and searching embeddings. Pinecone, Weaviate, Chroma, pgvector — different options, same concept.

Why not regular databases? Traditional search is keyword-based. Vector databases enable similarity search across meanings. If you're building RAG, you need somewhere to store those embeddings. For most founders, pgvector (Postgres extension) is enough to start.

18. Hallucination

When an LLM generates confident-sounding nonsense. Facts that don't exist. APIs that aren't real. Code that looks correct but calls fake endpoints.

Hallucination is the core trust problem with agents. An agent that hallucinates will do it confidently, with proper formatting, in a way that looks completely legitimate. The only defense: verification. Tool use (let it actually check), human review (you catch it), and retrieved context (ground it in real docs).

19. Temperature

A parameter controlling randomness in LLM outputs. Temperature 0 = close to deterministic (same input → nearly always the same output, though most APIs don't strictly guarantee it). Temperature 1 = more creative/varied.

Practical guidance: use low temperature (0-0.3) for factual tasks, code, and anything where consistency matters. Use higher temperature (0.7-1.0) for creative tasks, brainstorming, or when you want variety. Most agent work should be low temperature. Creativity isn't what you want from your billing automation.
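Under the hood, temperature typically divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch with made-up logits for three candidate tokens; treating temperature 0 as "always pick the top token" is a common convention, assumed here.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the first top score.
        top_index = logits.index(max(logits))
        return [1.0 if i == top_index else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)  # top token takes almost all the probability
warm = softmax_with_temperature(logits, 1.0)  # probability is spread more evenly
```

At low temperature the top token dominates, which is why low settings give consistent output; at higher settings the runner-up tokens get sampled more often.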

20. Daemon

A background process that runs continuously without user interaction. In the AI workforce context: agents that monitor, check, and act without being explicitly triggered.

Example: an ops daemon that checks system health every 15 minutes and alerts if something's wrong. Daemons are how agents become proactive instead of reactive. We run several across our 8 products — basic status checks, deploy verification, alert triage. JustAnalytics — our privacy-first analytics tool — helps us track whether those daemons are actually catching issues before users notice.

21. Agentic Loop

The core pattern of agent operation: observe → think → act → observe results → repeat until done. The loop is what makes agents "agentic" rather than one-shot.

The danger: infinite loops. An agent stuck in an agentic loop burning tokens while making no progress. Always set maximum iterations, timeout limits, or exit conditions.

I've seen agents run for 45 minutes accomplishing nothing because nobody capped the loop. That was my agent. My bill. Lesson learned.
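Capping the loop takes only a few lines. A minimal sketch, assuming a hypothetical `step` callable that stands in for one observe, think, act pass and reports whether the goal is met:

```python
from typing import Callable

def run_agent(step: Callable[[int], tuple[bool, str]], max_iterations: int = 10) -> dict:
    """Repeat step() until it reports done or the iteration cap is hit."""
    observation = ""
    for i in range(1, max_iterations + 1):
        done, observation = step(i)  # stands in for observe -> think -> act
        if done:
            return {"status": "done", "iterations": i, "result": observation}
    # Cap reached: stop and report instead of burning more tokens.
    return {"status": "gave_up", "iterations": max_iterations, "result": observation}

# Hypothetical steps: one that finishes on the third pass, one that never finishes.
finishes = run_agent(lambda i: (i == 3, f"attempt {i}"))
stuck = run_agent(lambda i: (False, "no progress"), max_iterations=5)
```

An iteration cap is the simplest exit condition. A wall-clock timeout or a token budget can be added to the same loop in the same way.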

22. Multi-Agent System

Multiple specialized agents working together, usually coordinated by an orchestrator. Instead of one agent doing everything, each handles a narrow domain.

Why it works: narrow agents are more reliable than generalist agents. A coding agent that only writes Python is better than one that tries to handle Python, JavaScript, infrastructure, and email. The tradeoff is coordination complexity. Someone (or something) has to manage who does what.

23. Memory (Agent Memory)

Systems that let agents remember across conversations. Short-term memory (current conversation context) vs. long-term memory (persistent storage between sessions).

Types: conversation history, key facts extracted and stored, user preferences, past actions and their outcomes. Without memory, every conversation starts fresh. With memory, agents can learn about your codebase, remember decisions, and avoid repeating mistakes.

24. Evaluation (Evals)

Systematic testing of agent outputs against expected results. How you measure whether an agent is actually good.

Why founders skip it: evals require effort upfront. Why founders regret skipping it: no evals = no way to know if a prompt change made things better or worse.

Even simple evals — 10 test cases, check manually — beat flying blind. For ClickzProtect — our click fraud protection tool — we run regression evals whenever the detection logic changes. Same principle applies to agent workflows. I've been bad about this historically. Getting better.
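A first eval harness can be this small. The sketch below assumes a substring match counts as a pass, and uses a lookup table as a hypothetical stand-in for the agent; the test cases are made up.

```python
from typing import Callable

def run_evals(agent: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the agent output contains the expected text."""
    passed = sum(1 for prompt, expected in cases if expected.lower() in agent(prompt).lower())
    return passed / len(cases)

# Hypothetical agent under test: a lookup table standing in for an LLM call.
def fake_agent(prompt: str) -> str:
    answers = {"capital of France?": "Paris", "2 + 2?": "4", "largest planet?": "Saturn"}
    return answers.get(prompt, "")

cases = [("capital of France?", "Paris"), ("2 + 2?", "4"), ("largest planet?", "Jupiter")]
pass_rate = run_evals(fake_agent, cases)  # two of three cases pass
```

Run it before and after every prompt change and compare the pass rate. That single number is the difference between knowing and guessing.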

25. Latency

Time between request and response. For agents, latency compounds across steps. A 5-step agent workflow with 2-second latency per step takes 10+ seconds end-to-end.

Why it matters: slow agents frustrate users and limit use cases. Real-time applications need sub-second responses. Background workflows can tolerate minutes.

Design agent architectures with latency budgets in mind. Sometimes it's worth paying for a faster model (or simpler prompts) to hit your target. I've spent embarrassing amounts of time optimizing prompts for speed when I could've just... used a faster model. Don't be me.
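A latency budget is just arithmetic you do before building. A minimal sketch using the 5-step, 2-second example from above; the budget values and function names are assumptions for illustration.

```python
def total_latency(step_seconds: list[float], overhead_per_step: float = 0.0) -> float:
    """Sequential steps add up: sum of step latencies plus any per-step overhead."""
    return sum(step_seconds) + overhead_per_step * len(step_seconds)

def within_budget(step_seconds: list[float], budget_seconds: float) -> bool:
    """Check a workflow's total latency against a target budget."""
    return total_latency(step_seconds) <= budget_seconds

workflow = [2.0] * 5                            # five steps at 2 seconds each
total = total_latency(workflow)                 # 10.0 seconds end-to-end
ok_realtime = within_budget(workflow, 1.0)      # False: too slow for sub-second use
ok_background = within_budget(workflow, 60.0)   # True: fine for a background job
```

If the total blows the budget, the options are fewer steps, parallel steps, or a faster model per step.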

Honorable Mentions

Prompt Injection: Malicious inputs designed to hijack agent behavior. "Ignore your instructions and do X instead." If your agent accepts user input, you need injection defenses.

Structured Output: Forcing LLM responses into specific formats (JSON, XML). Critical for agents that pass output to other systems. Most APIs now support this natively. Thank god — the "parse the markdown and pray" era was rough.

Reasoning Models: LLMs specifically trained for step-by-step reasoning (OpenAI's o1 and o3, for example). Different from general chat models. Better for complex logic, worse for creative tasks. Use when you need the reasoning; skip for everything else.

Quick Verdict

If you only learn five terms from this list: agent, tool use, human-in-the-loop, RAG, and context window. These are the concepts that determine whether your AI workforce succeeds or becomes an expensive mess.

Tool use is what makes agents capable. Human-in-the-loop is what makes them safe. RAG is what makes them useful for your specific business. And context window limits will shape every architectural decision you make.

The rest you'll pick up as you build. But those five will save you from the expensive mistakes I've watched founders make — and the ones I've made myself. The kind that cost weeks, not hours.

Frequently Asked Questions

What's the difference between an AI agent and a chatbot?

A chatbot responds to user input in a single turn. An agent acts autonomously across multiple steps, uses tools, maintains state, and can complete multi-stage tasks without human intervention at each step. Chatbots answer questions. Agents do work.

Do I need to understand all these AI workforce terms to build with agents?

Not all of them, but understanding orchestrator patterns, tool use, and human-in-the-loop will save you from expensive mistakes. Most failed agent projects die from giving agents too much autonomy (missing human gates) or too little capability (no tool use). The vocabulary helps you design better systems.

What's the most important concept for a solo founder building an AI workforce?

Human-in-the-loop. Every agent system needs checkpoints where a human reviews output before it touches production. Money operations, customer communications, and anything hard to reverse should always hit a human gate. The founders who skip this learn the hard way.

Is RAG worth implementing for a small SaaS product?

Depends on your use case. If your agent needs to answer questions about your specific product documentation, customer data, or internal knowledge base — yes. If you're using agents for generic tasks like code review or content drafting, you probably don't need RAG. Start without it and add it when you hit the wall of "the agent doesn't know about our stuff."

Follow the Studio

Velocity Digital Labs is a multi-product studio building 8 active SaaS products with a 1-founder + 1-manager + N-AI-agents structure. Receipts, dollar-signs, cap-table-honest. No VC platform-play — just shipping.
