Running AI Agents as Employees Inside a SaaS Studio: The Workflow That Actually Stuck

The agent roster, the human gates, the workflows we killed. Real operating model from an 8-product studio.

VDL Platform Team
June 12, 2026
The agent that broke my trust was the Outreach Agent. Third week of running it. I'd given it access to my email templates and a list of podcast hosts. It sent 47 pitches overnight. Twelve of them went to the same person: different subject lines, same ask, same signature. That person posted the thread on Twitter. Brutal.

That was attempt number two.

Attempt one died when a Code Agent pushed a Stripe webhook handler that silently dropped refund events. I didn't catch it for three days. Three days of failed refunds piling up in a Stripe dashboard I wasn't checking because, you know, I trusted the agent. Attempt three collapsed when I tried to orchestrate everything through a single "manager agent" that couldn't keep context straight across products.

Attempt four? Still running. Seven months now. Across 8 active products. First version I'd actually recommend to someone who isn't me.

The Problem: One Founder, Eight Products, Zero Clone

Velocity Digital Labs runs 8 active SaaS products with a 1-founder + 1-manager structure. That's ClickzProtect, VeloCards, JustAnalytics, VeloCalls, JustBrowser, JustEmails, DevOS, and the parent VDL site.

The math doesn't work without automation. Not "nice to have" automation. "Literally impossible otherwise" automation.

I needed something that could:

  • Write code across multiple codebases without context-switching overhead
  • Handle content production I kept putting off (and off, and off)
  • Do the boring ops work — status checks, alert triage, deploy verification
  • Research competitors I couldn't track manually
  • Draft outreach without me typing every email

A cofounder would've helped. But cofounders come with equity, disagreements, and a roughly 50% failure rate on the relationship itself. I'd watched three founder friends go through cofounder breakups. Two lost their companies. The third spent eight months in legal limbo. Hard pass.

So I built the AI version. Then rebuilt it three more times until something stuck. We've written about the broader lessons from building 9 SaaS products — this post focuses specifically on the AI agent operating model.

What We Tried First (And Why It Failed)

Attempt 1: Claude as a general assistant. I threw everything at one model. Research, code, content, ops. Same conversation thread for a week straight. By day four, the context window was stuffed with irrelevant history, outputs got inconsistent, and I spent more time prompting than just doing the work myself. Classic mistake.

Attempt 2: Specialized agents, no gates. Split into seven agents (Research, Content, Outreach, Code, QA, Ops, Orchestrator). Gave them autonomy. The Outreach Agent incident happened. The Code Agent pushed directly to staging without review. The Content Agent published a post with competitor pricing from 2024. Fast, but catastrophically unreliable. I should've seen this coming. I didn't.

Attempt 3: One orchestrator to rule them all. Built a "Manager Agent" that coordinated the others. It decided what ran when, reviewed outputs, and approved actions. Sounded elegant. In practice? The Manager Agent couldn't maintain context across products. It would approve a ClickzProtect PR while thinking it was reviewing JustAnalytics code. The abstraction layer added confusion without adding reliability. Elegant on paper, disaster in prod.

Each attempt took two to three weeks to build and a few days to fail. Expensive.

The Architecture That Finally Worked

The current system — seven months running — has three principles I didn't understand until the failures taught me.

Principle 1: Agents own tasks, not decisions. Every agent has a narrow scope. The Code Agent writes code. The Content Agent drafts posts. The Ops Agent monitors systems. None of them decide what gets shipped. That's my job. I'm the bottleneck on purpose.

Principle 2: Human checkpoints are non-negotiable. Every agent output goes somewhere I can review before it touches production. Code Agent outputs go to GitHub PRs. Content Agent drafts go to Notion. Outreach drafts go to a queue I approve manually. Ops Agent is the only one with limited autonomy — it can restart crashed services automatically, but anything else waits for me. Yes, this slows things down. That's the point.

Principle 3: Specs beat prompts. Each agent has a markdown spec file that defines role, constraints, output format, and examples. The spec is versioned in git. When an agent starts producing weird output, I diff against the last known-good version. Ad-hoc prompting works for one-off tasks. Repeatable workflows need specs. Learned this the hard way when I couldn't figure out why the Content Agent's tone shifted — turned out I'd edited the prompt in a chat window and never saved it back.
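
A minimal sketch of a completeness check for a spec file. It assumes, hypothetically, that each spec is a markdown file with one "## " heading per section; the heading names and the missingSpecSections helper are illustrative, not the actual VDL spec layout.

```javascript
// Hypothetical section names. The post only says a spec defines
// role, constraints, output format, and examples.
const REQUIRED_SECTIONS = ["role", "constraints", "output format", "examples"];

// Returns the required sections that have no matching "## Heading" line.
function missingSpecSections(specMarkdown) {
  const headings = specMarkdown
    .split("\n")
    .map((line) => line.match(/^##\s+(.+?)\s*$/))
    .filter((m) => m !== null)
    .map((m) => m[1].toLowerCase());
  return REQUIRED_SECTIONS.filter((name) => !headings.includes(name));
}

// Made-up spec text for illustration.
const exampleSpec = [
  "# Content Agent",
  "## Role",
  "Drafts blog posts, docs, and changelogs.",
  "## Constraints",
  "Never publishes. Drafts go to review.",
  "## Output Format",
  "Markdown draft with a title and a summary.",
].join("\n");

console.log(missingSpecSections(exampleSpec)); // logs [ 'examples' ]
```

A check like this could run before an agent is dispatched, so a spec that lost a section in an edit gets flagged before it produces odd output.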

Here's the roster as of June 2026:

| Agent | Owns | Human Gate | Autonomy Level |
|---|---|---|---|
| Code Agent | Feature implementation, bug fixes, refactors | GitHub PR review | None (waits for merge) |
| Content Agent | Blog posts, docs, changelogs | Notion draft review | None (waits for approval) |
| Outreach Agent | Partnership emails, pitches | Manual send queue | None (every email approved) |
| Research Agent | Competitor analysis, market sizing | Report review | None (output only) |
| QA Agent | Test generation, regression catching | PR integration | Automated tests run, failures block merge |
| Ops Agent | Status reports, alert triage, deploys | Telegram notifications | Can restart crashed services only |
| Orchestrator | Task dispatch, scheduling | Task queue review | Decides order, not approval |
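
The autonomy column of the roster can be sketched as a default-deny policy: anything not explicitly listed waits for a human. The agent keys and action names below are hypothetical labels chosen for illustration.

```javascript
// Actions an agent may take without approval, transcribed from the roster.
// Action names ("restart_crashed_service", "run_tests") are made up for this sketch.
const AUTONOMOUS_ACTIONS = new Map([
  ["ops", ["restart_crashed_service"]],
  ["qa", ["run_tests"]],
]);

// True when a human has to approve before the action takes effect.
// Unknown agents and unknown actions fall through to "needs approval".
function needsHumanApproval(agent, action) {
  const allowed = AUTONOMOUS_ACTIONS.get(agent) || [];
  return !allowed.includes(action);
}

console.log(needsHumanApproval("ops", "restart_crashed_service")); // false
console.log(needsHumanApproval("ops", "deploy")); // true
console.log(needsHumanApproval("outreach", "send_email")); // true
```

Keeping the allow-list short and explicit means a new agent or a new action starts out gated by default.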

The Orchestrator is just a Node.js script that reads a YAML task queue every 15 minutes and dispatches to the right agent. Under 200 lines. No framework. I tried LangChain for about a week. Debugging it at 2am while trying to figure out which abstraction layer was swallowing my errors — never again. Plain Node. Boring works.

Implementation: The Daily Operating Model

Here's exactly how a day runs.

6:30am — Ops Agent report. Automated. Summarizes all 8 products: uptime, error rates, anything weird overnight. Lands in Telegram. Takes 30 seconds to read.

7:00am — Review queue. I check what's waiting for approval:

  • Content drafts from overnight (usually 1-2 posts)
  • Outreach drafts (usually 5-10 emails)
  • Code PRs ready for review

Approving content and outreach takes 10-15 minutes. Most drafts are 80-90% good. Some are genuinely better than what I'd write. Some are garbage that makes me question the whole system. I edit the rough spots, approve, move on.

7:30am — Code review. This takes longer. 30-60 minutes depending on what shipped. The Code Agent produces PR-ready branches with commit messages and test coverage. My job is verifying it actually does what the spec asked, doesn't introduce security issues, and matches existing patterns.

The failure mode I watch for: code that looks plausible but calls APIs incorrectly. Claude Code once generated a Stripe handler using an endpoint that doesn't exist. Looked perfect. Crashed at runtime. Now I grep every AI-generated PR for new external dependencies and verify them manually.
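
A rough sketch of that grep step, assuming the PR is available as unified diff text. It lists every external host and package name that appears on an added line. Deciding whether a reference is actually new, and whether the endpoint exists, stays a manual check. The function name and the sample diff are illustrative.

```javascript
// Collects external hosts and package names found on added lines of a unified diff.
function externalReferences(diffText) {
  const added = diffText
    .split("\n")
    .filter((line) => line.startsWith("+") && !line.startsWith("+++"));
  const hosts = new Set();
  const packages = new Set();
  for (const line of added) {
    // Any http(s) URL: keep only the host part.
    for (const m of line.matchAll(/https?:\/\/([a-z0-9.-]+)/gi)) {
      hosts.add(m[1].toLowerCase());
    }
    // require("pkg") or from "pkg"; relative paths (starting with . or /) are skipped.
    for (const m of line.matchAll(/(?:require\(|from\s+)["']([^"'./][^"']*)["']/g)) {
      packages.add(m[1]);
    }
  }
  return { hosts: [...hosts].sort(), packages: [...packages].sort() };
}

// Made-up diff for illustration.
const sampleDiff = [
  "+++ b/src/billing.js",
  '+const stripe = require("stripe");',
  '+import helper from "./helper.js";',
  '+const res = await fetch("https://api.example.com/v1/refunds");',
  '-const old = require("legacy-lib");',
  ' const app = require("express");',
].join("\n");

console.log(externalReferences(sampleDiff));
// hosts: [ 'api.example.com' ], packages: [ 'stripe' ]
```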

8:30am — Task queue. I add new tasks for the day. Each task is a YAML entry:

- type: content
  priority: high
  payload:
    topic: "JustEmails DMARC migration guide"
    template: tutorial
    deadline: 2026-06-13

- type: code
  priority: medium
  payload:
    repo: clickzprotect
    spec: "Add rate limiting to /api/verify endpoint"
    branch: feature/rate-limit-verify

The Orchestrator picks these up and dispatches to the right agent. By the time I'm in meetings, the agents are working.
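
A minimal sketch of what that dispatch step might look like, not the actual VDL script. It assumes the YAML has already been parsed into plain objects, and the priority ordering, handler map, and function names are assumptions made for illustration.

```javascript
// Assumed ordering; the queue examples above only show "high" and "medium".
const PRIORITY_RANK = new Map([["high", 0], ["medium", 1], ["low", 2]]);

// Unknown priorities sort after all known ones.
function rank(priority) {
  return PRIORITY_RANK.has(priority) ? PRIORITY_RANK.get(priority) : PRIORITY_RANK.size;
}

// Sorts tasks by priority and hands each payload to the handler for its type.
// Tasks with no registered handler are returned so a human can look at them.
function dispatch(tasks, handlers) {
  const ordered = [...tasks].sort((a, b) => rank(a.priority) - rank(b.priority));
  const dispatched = [];
  const unrouted = [];
  for (const task of ordered) {
    const handler = handlers.get(task.type);
    if (handler) {
      handler(task.payload);
      dispatched.push(task);
    } else {
      unrouted.push(task);
    }
  }
  return { dispatched, unrouted };
}

// The same two entries as the YAML above, deliberately listed out of priority order.
const queue = [
  { type: "code", priority: "medium", payload: { repo: "clickzprotect", spec: "Add rate limiting to /api/verify endpoint", branch: "feature/rate-limit-verify" } },
  { type: "content", priority: "high", payload: { topic: "JustEmails DMARC migration guide", template: "tutorial", deadline: "2026-06-13" } },
];

// Stand-in handlers that only record what they were given.
const log = [];
const handlers = new Map([
  ["content", (p) => log.push("content: " + p.topic)],
  ["code", (p) => log.push("code: " + p.repo)],
]);

const result = dispatch(queue, handlers);
console.log(log); // content task first (high), then the code task
// A real loop would call this on a timer, e.g. setInterval(run, 15 * 60 * 1000),
// where run is a hypothetical function that reads the queue and calls dispatch.
```

The dispatcher only decides order and routing. Approval still happens at each agent's human gate.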

Throughout the day — Telegram pings when something needs attention. Ops Agent alerts (rare, usually false positives that make me paranoid anyway). Content drafts ready for review. PR status updates.

Evening — Quick check on what finished. Approve any stragglers. Add tomorrow's tasks. Most days this takes five minutes. Some days I stare at a PR for an hour because the Code Agent did something clever that I don't fully understand, and I have to decide if clever-I-don't-understand is genius or a time bomb.

For tracking whether any of this actually moves metrics, we use JustAnalytics, our cookieless analytics product, which runs without a cookie consent popup. That lets me see whether content is getting traction without GA4's consent overhead. For click fraud protection across our paid campaigns, ClickzProtect handles automated blocking, so the Ops Agent doesn't need to monitor ad spend anomalies.

The Workflows We Killed

Not everything made the cut. Here's what we retired.

Killed: Automated social posting. The Content Agent could draft tweets and LinkedIn posts. But social requires real-time context — responding to trends, engaging with replies, reading the room. The drafts were fine. The timing was always wrong. Human-operated social still beats automated posting for anything that matters.

Killed: Customer support agent. Tried building an agent that handled first-line support tickets. It worked for simple questions ("How do I reset my password?"). Failed catastrophically for anything requiring empathy. A customer upset about a billing issue doesn't want to talk to something that can't actually feel sorry. One response started with "I understand your frustration" in that hollow way that makes people angrier. Killed it after a week.

Killed: Automated pricing experiments. The Research Agent could analyze competitor pricing and suggest changes. But pricing affects revenue immediately and is hard to reverse. The failure cost was too high. Pricing decisions stay human-only.

Killed: Multi-agent debate. I read a paper about having agents debate each other to improve output quality. Built a version where the Code Agent and QA Agent would argue about implementation choices. In theory, the back-and-forth would surface better solutions. In practice? They'd agree on something mediocre within two rounds and call it done. Wasted tokens, no quality improvement. Academic papers and production reality are different planets.

The pattern: AI agents work for tasks with clear right answers and low reversal cost. They fail for tasks requiring judgment, empathy, or real-time context. The sharper that line gets in my head, the fewer messes I clean up.

Results: What Changed After Seven Months

I won't give you fabricated productivity multipliers. What I can say:

Content velocity tripled. Before the Content Agent, I published maybe one post per product per month. Now it's three to four. The drafts need editing, but 80% of the work is done when they hit my queue.

Code review replaced code writing. My morning code work shifted from "write features" to "review features." The Code Agent handles implementation. I handle architecture decisions and catch mistakes. Different skill, same time investment, more output.

Ops became invisible. I used to spend 30-60 minutes daily checking dashboards, verifying deploys, triaging alerts. The Ops Agent compressed that to a 30-second Telegram report. Most days, nothing needs attention.

Outreach happened at all. Before: I'd queue partnership emails, procrastinate, never send them. Now: Outreach Agent drafts, I approve in batches, emails actually go out. Quality isn't perfect. But sent beats perfect-and-unsent every time.

What didn't change: the hard decisions. Pricing, product direction, customer conversations, strategic bets. Those still take the same time and same energy. AI handles execution. Judgment is still mine. Honestly, I thought AI would eventually handle judgment too. Seven months in, I'm less convinced. Maybe that's cope. Maybe it's experience.

What I'd Do Differently

Looking back at seven months of iteration:

Start with fewer agents. I built seven from the start. Dumb. Should've started with two (Code and Content) and added others only when I hit clear bottlenecks. Each agent is maintenance overhead. Start minimal.

Write specs before building anything. The first month was ad-hoc prompting. Inconsistent outputs. No way to debug regressions. If I'd written specs from day one, I'd have saved three weeks of thrashing. I knew this. I ignored it anyway because "move fast." Moving fast into walls is still hitting walls.

Shadow mode everything. Before trusting any agent, run it in shadow mode — it produces output, but you compare against what you would've done manually. Catches failure modes before they cost you. We're building DevOS partly to make this kind of agent evaluation workflow easier for other teams.
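
One way to score a shadow run, as a sketch. It assumes the agent's output is a categorical decision (an alert triage label here) that can be compared by exact match. Drafts and code would need a different comparison. The record shape and label names are hypothetical.

```javascript
// Each record pairs what the agent proposed with what the human actually did.
function shadowReport(records) {
  const disagreements = records.filter((r) => r.agentOutput !== r.humanOutput);
  const agreementRate =
    records.length === 0
      ? 0
      : (records.length - disagreements.length) / records.length;
  return { agreementRate, disagreements };
}

// Made-up triage log for illustration.
const triageLog = [
  { id: "alert-1", agentOutput: "ignore", humanOutput: "ignore" },
  { id: "alert-2", agentOutput: "restart", humanOutput: "restart" },
  { id: "alert-3", agentOutput: "ignore", humanOutput: "escalate" },
  { id: "alert-4", agentOutput: "escalate", humanOutput: "escalate" },
];

const shadow = shadowReport(triageLog);
console.log(shadow.agreementRate); // 0.75
console.log(shadow.disagreements.map((r) => r.id)); // [ 'alert-3' ]
```

The disagreement list is the useful part: each entry is a concrete failure mode to read before the agent gets any real autonomy.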

Budget for AI spend explicitly. Token costs add up. I hit my first surprise bill in month two — not catastrophic, but enough to make me pay attention. Now AI spend is a line item I track weekly, same as hosting. We use VeloCards virtual cards to separate AI API costs from other operational spend — makes reconciliation trivial.
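
A sketch of a weekly spend check, assuming usage is logged per call with token counts. The prices are placeholders in dollars per million tokens, not real provider rates, and the record shape and budget figure are made up for illustration.

```javascript
// Placeholder prices in dollars per million tokens; not real provider rates.
const PRICING = { inputPerMillion: 2, outputPerMillion: 10 };

// Adds up the cost of each usage record and groups the totals by agent.
function spendByAgent(usage, pricing) {
  const totals = new Map();
  for (const u of usage) {
    const cost =
      (u.inputTokens * pricing.inputPerMillion +
        u.outputTokens * pricing.outputPerMillion) / 1e6;
    totals.set(u.agent, (totals.get(u.agent) || 0) + cost);
  }
  return totals;
}

// Compares the week's total against a budget set by hand.
function weeklyCheck(usage, pricing, weeklyBudget) {
  const byAgent = spendByAgent(usage, pricing);
  const total = [...byAgent.values()].reduce((sum, c) => sum + c, 0);
  return { byAgent, total, overBudget: total > weeklyBudget };
}

// Made-up usage for one week.
const weekUsage = [
  { agent: "code", inputTokens: 1000000, outputTokens: 200000 },
  { agent: "content", inputTokens: 500000, outputTokens: 100000 },
];

const report = weeklyCheck(weekUsage, PRICING, 5);
console.log(report.total, report.overBudget); // 6 true
```

Grouping by agent shows which workflow drives the bill, which is the first question a surprise invoice raises.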

Frequently Asked Questions

How do you structure AI agents as employees in a SaaS business?

Each agent gets a narrow job description, explicit constraints, required output format, and a defined human checkpoint. The structure mirrors how you'd onboard a junior employee — clear scope, examples of good work, explicit boundaries on what they can't do without approval. The difference: agents never learn from feedback unless you update their spec. Every improvement requires editing the system prompt or examples.

What's the biggest mistake founders make when building an AI workforce?

Giving agents too much scope too fast. The temptation is to build one "super agent" that handles everything. It fails. Context windows fill up, outputs get inconsistent, and you can't debug which part went wrong. Narrow agents with single responsibilities — one for code, one for content, one for ops — produce better results and fail in ways you can actually fix.

Which tasks should humans always gate when using AI agents?

Three categories: anything touching money (payments, refunds, pricing changes), anything customer-facing that requires reading emotional context (support escalations, churn conversations), and anything where being wrong costs more than a weekend to fix. AI drafts and implements. Humans approve and ship.

How long does it take to trust an AI agent workflow in production?

About two weeks to get the basics running, two to three months to actually trust it. The first month is catching the edge cases: the confident wrong answers, the misunderstood requirements, the moments you realize a human checkpoint was missing. Don't expect instant productivity. Expect a learning curve that eventually compounds. And honestly? I still don't fully trust it. I just trust it enough.

For email automation specifically, we route agent-generated campaigns through JustEmails to maintain deliverability while the Outreach Agent handles drafts.


Follow the Studio

Velocity Digital Labs is a multi-product studio building 8 active SaaS products with a 1-founder + 1-manager + N-AI-agents structure. Receipts, dollar-signs, cap-table-honest. No VC platform-play — just shipping.
