#testing #ai-agents #automation #developer-experience #software-quality

320 Tests in 7 Days: Building Testing Guardrails from Day One

Zerolve Team

How a UI/UX redesign triggered a full testing overhaul — designing test suites from scratch, integrating AI agents, and learning why testing guardrails are fundamental from the start.

Last Monday, our SaaS product had exactly zero meaningful tests. By Sunday night, we had 320 passing tests across 32 files — unit tests, integration tests, E2E flows, visual regression, and tier enforcement. We also broke our AI agents, fixed them, and queued another 8 tasks to run while we sleep.

This is not a story about AI making testing easy. It's about what actually happens when you try — and why the effort was inevitable once we started redesigning.

The Trigger: A UI/UX Redesign

It started with design, not testing.

We'd been using UI/UX design tools to overhaul the product's interface — rethinking navigation flows, component hierarchy, accessibility patterns, the works. A proper redesign, not a coat of paint.

But redesigning the UI exposed a terrifying truth: we had no safety net. Every component swap, every route restructure, every interaction change was a leap of faith. Would the payment flow still work after moving the CTA? Would tier enforcement hold after restructuring the dashboard? Would the API still return the right data after refactoring the layouts that consume it?

We were building a new house on top of a foundation we'd never inspected.

That's when the real lesson hit: testing guardrails aren't something you bolt on after shipping. They're fundamental to the software development lifecycle from the start. And designing the test suites themselves — deciding what to test, how to structure the tests, what coverage means for your specific architecture — needs to happen at the same time as designing the product.

The redesign forced the reckoning. Better late than never.

The Starting Point: A Ship Without a Hull

Our product runs on Next.js + Cloudflare Workers, with Clerk auth, D1 database, queues, containers, and an MCP server. Real infrastructure. Real users.

And until last week: zero meaningful automated tests.

No unit tests. No integration tests. A handful of E2E smoke tests that mostly checked whether pages loaded. The kind of coverage that makes you sweat every deploy, and that's terrifying when you're about to ship a full UI redesign.

We knew the debt. We just kept shipping features instead. Sound familiar?

Day 1-2: Designing the Test Suite (Mon-Tue)

Before writing a single test, we designed the test architecture. This is the step most teams skip — and it's the most important one.

We mapped every testable surface:

  • 57 API routes (15 tested, 42 not)
  • 19 pages (17 had basic E2E, 2 completely untested)
  • Zero unit tests for business logic — tier enforcement, auth, schema validation, YouTube parsing, the entire card pipeline

The audit produced a brutal scorecard:

Area                  Grade
Payment / billing     F
API v1 programmatic   F
Card operations       D
Playlist lifecycle    D
Auth flow             B
Public pages          A
CI quality gates      A

Payment — the thing that makes money — had zero test coverage. Not great.

We designed the suite in 6 waves, ordered by dependency and risk:

  1. Core logic — tier enforcement, auth helpers, schema validation, database utilities
  2. API routes + components — the integration layer
  3. E2E edge cases — error handling, race conditions
  4. MCP server + playlists — external-facing API surface
  5. YouTube parsing + card pipeline — the data transformation backbone
  6. Component tests + coverage gaps — UI regression safety net

The wave design was deliberate: each layer builds on the one before it. Core logic tests validate the primitives. API tests validate the composition. E2E tests validate the user-facing behaviour. This structure meant we could catch regressions at every level — not just at the top.
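Wave 1's tier-enforcement tests, for instance, exercise pure functions. A minimal sketch of what such a helper looks like, with hypothetical names (canCreatePlaylist and the limit table are ours, not the product's actual code); the free tier's 3-playlist cap is the one detail taken from the post:

```typescript
// Hypothetical wave-1 helper. Names and the "pro" limit are illustrative;
// the free tier's cap of 3 playlists is the limit the E2E suite exercises.
type Tier = "free" | "pro";

const PLAYLIST_LIMITS: Record<Tier, number> = {
  free: 3,
  pro: Number.POSITIVE_INFINITY,
};

export function canCreatePlaylist(tier: Tier, currentCount: number): boolean {
  // Strictly below the cap: a free user who already has 3 playlists is at the limit.
  return currentCount < PLAYLIST_LIMITS[tier];
}
```

A wave-1 spec then pins the boundary (does a free user at 3 get rejected?), so the API and E2E layers above can assume the primitive is correct.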

Day 3: The Agent Squad Experiment

Here's where it gets interesting. We run a multi-agent squad — specialized AI agents coordinated by an orchestrator:

  • Forge 🔨 writes code
  • Sentinel 🛡️ reviews PRs
  • Pi ◎ orchestrates everything

The plan: dispatch 6 Forge agents in parallel, one per wave. Each gets a GitHub issue with acceptance criteria, file targets, and constraints. They work in isolated branches, open PRs, and Sentinel reviews them.

We spawned all 6 agents.

All 12 attempts failed.

Every single one. Six agents, each retried once. Twelve LiveSessionModelSwitchError crashes. The root cause? A config bug — Forge's model config was empty {}, so it inherited a default model ID that didn't exist anymore. One empty JSON object took down the entire squad.

Ironic: the system that was supposed to write our guardrails crashed because it didn't have its own.
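A guard that would have caught it is small. This is a sketch under assumed field names (model is our label, not necessarily the orchestrator's real schema):

```typescript
// Illustrative config guard: fail fast at dispatch time instead of crashing
// mid-session. The AgentConfig shape is an assumption for this sketch.
interface AgentConfig {
  model?: string;
}

export function resolveModel(cfg: AgentConfig, fallbackModel?: string): string {
  const model = cfg.model ?? fallbackModel;
  if (!model) {
    // An empty {} with no valid default is exactly the failure mode described above.
    throw new Error("agent has no model configured and no valid default; refusing to dispatch");
  }
  return model;
}
```

Run this once before spawning agents and the empty {} becomes one failed dispatch with a clear message, not twelve crashed sessions.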

Day 3-4: Writing It Ourselves

So we did what you do when automation fails: we wrote the tests by hand.

All six test suites. In one sitting. 111 tests across 6 files:

src/app/api/__tests__/mcp.test.ts              — 26 tests
src/app/api/__tests__/playlists.test.ts        — 12 tests
src/lib/__tests__/errors.test.ts               — 11 tests
src/lib/__tests__/youtube.test.ts              — 26 tests
src/app/api/__tests__/cards-pipeline.test.ts   — 19 tests
src/components/__tests__/components.test.tsx   — 17 tests

Along the way we hit every testing gotcha in the book:

  • Vitest 4 changed its API. Options are the 2nd argument now: it('name', { timeout: 5000 }, fn) — not the 3rd.
  • Mock wrappers break TypeScript. You can't spread unknown[] into typed function signatures. You need specific signatures or explicit casts.
  • Default vs named exports trip you up. One component uses a default export. Our test imported it as named. Silent failure.
  • DOM APIs return null in non-obvious cases. DOMParser.parseFromString("", "text/html").body is null when the string is empty. Our sanitizer didn't guard for it — a live production crash waiting to happen.

These are the kinds of bugs AI agents would have hit too, and probably gotten stuck on for 10 turns each. Writing the tests by hand meant we found and fixed them in real time.
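The DOMParser case in particular was a production crash in waiting. A hedged sketch of the guard, with an illustrative function name (sanitizeHtml is ours, not the actual sanitizer):

```typescript
// Illustrative sanitizer guard. The name is ours; the bug is real: in some
// environments, parseFromString("", "text/html").body comes back null, and
// dereferencing it crashes.
export function sanitizeHtml(input: string): string {
  const doc = new DOMParser().parseFromString(input, "text/html");
  if (!doc.body) return ""; // empty input → null body → bail out safely
  return doc.body.textContent ?? "";
}
```

The test that forced the guard is the obvious one: feed the sanitizer an empty string and assert it returns something rather than throwing.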

Day 5-6: E2E and the Flake Wars

E2E tests are a different beast. Ours run against a real Cloudflare Workers deployment with a real D1 database, real Clerk auth, and a real container runtime.

Every E2E run is a war against flakiness:

fix(e2e): replace all button:has-text(/regex/) with getByRole
fix(e2e): use text= regex locator instead of broken has-text CSS selector
fix(e2e): handle slow processing redirect + fix progress bar locator
fix(e2e): seed 3 playlists before tier-free limit tests
fix(e2e): reset test user quota via D1 cleanup before E2E tests
fix(e2e): skip add-page tests when user quota is exhausted
fix(e2e): make user-journey tests more resilient

Seven commits before CI went green. Each one a lesson:

Lesson 1: Playwright locators are picky. button:has-text(/Start/) looks right but silently breaks. getByRole('button', { name: /Start/ }) works and is accessible by default.

Lesson 2: State bleeds between tests. Our free-tier test creates 3 playlists to test the limit. But if a previous run left data in the database, the test starts at the wrong count. Fix: cleanup queries before each E2E suite.
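The cleanup step can be sketched as a small helper. Table and column names here are illustrative, not the product's actual schema; D1's prepare/bind/run shape is the real API:

```typescript
// Minimal D1-like interface so this sketch is self-contained; the real type
// comes from @cloudflare/workers-types.
interface D1Like {
  prepare(sql: string): { bind(...args: unknown[]): { run(): Promise<unknown> } };
}

// Reset a test user's state before an E2E suite runs. Table and column
// names are assumptions for this sketch.
export async function resetTestUser(db: D1Like, userId: string): Promise<void> {
  // Delete leftover playlists so the free-tier "create 3 playlists" test starts at zero.
  await db.prepare("DELETE FROM playlists WHERE user_id = ?").bind(userId).run();
  // Reset the quota counter the add-page tests depend on.
  await db.prepare("UPDATE users SET quota_used = 0 WHERE id = ?").bind(userId).run();
}
```

Running this in a global setup hook, rather than per-test, keeps the suite fast while still guaranteeing a known starting count.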

Lesson 3: Containers have cold starts. The processing container takes seconds to spin up. Tests expecting instant responses timeout. Fix: startAndWaitForPorts() with configurable timeout, not blind retries.
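The same idea in generic form: poll until the condition holds, with an explicit deadline instead of blind retries. This is a sketch of the pattern, not the actual startAndWaitForPorts implementation:

```typescript
// Generic deadline-based polling, the pattern behind waiting out a container
// cold start. Defaults are illustrative; tune them to your container's spin-up time.
export async function waitUntil(
  check: () => Promise<boolean>,
  { timeoutMs = 30_000, intervalMs = 500 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return; // condition met (e.g. the port answers) → done
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`condition not met within ${timeoutMs}ms`);
}
```

The crucial property is the explicit timeout: a cold container fails the test with a clear message after N seconds, instead of hanging CI or flaking on a fixed retry count.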

Day 7: Fixing the Agents (and Queuing Overnight Work)

With 320 tests green, we circled back to the broken agent config. The fix was one line — set Forge's model explicitly instead of inheriting from a stale default.

Then we created 7 new GitHub issues for the remaining coverage gaps and queued them for overnight autonomous execution:

23:00  🌙 Orchestrator dispatches 8 tasks
       ├── 🛡️ Sentinel: review 2 open PRs
       ├── 🔨 Forge: 3 new E2E suites (parallel, new files only)
       └── (re-dispatch after Batch 1)
           ├── 🔨 Forge: payment flow tests
           ├── 🔨 Forge: card operations tests
           └── 🔨 Forge: missing page coverage
05:30  🌅 Debrief arrives on Telegram

The overnight system has real constraints: budget caps, file-level locking to prevent merge conflicts between parallel agents, and automatic re-dispatch when the first batch completes.
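File-level locking can be as simple as partitioning tasks into non-overlapping batches before dispatch. A sketch of the idea, with an assumed files field per task (not the orchestrator's actual data model):

```typescript
// Illustrative file-lock scheduler: tasks declare the files they will touch,
// and tasks whose file sets overlap are pushed into a later batch so two
// agents never edit the same file in parallel.
export function partitionByFileLocks<T extends { files: string[] }>(tasks: T[]): T[][] {
  const batches: { tasks: T[]; locked: Set<string> }[] = [];
  for (const task of tasks) {
    // First batch whose locked set doesn't overlap this task's files.
    const batch = batches.find((b) => task.files.every((f) => !b.locked.has(f)));
    if (batch) {
      batch.tasks.push(task);
      task.files.forEach((f) => batch.locked.add(f));
    } else {
      batches.push({ tasks: [task], locked: new Set(task.files) });
    }
  }
  return batches.map((b) => b.tasks);
}
```

Each inner array can run in parallel; arrays run in sequence. Conflicting tasks serialize automatically instead of producing merge conflicts at review time.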

The Numbers

Metric              Before         After
Unit tests          0              23 files, 320 tests
E2E suites          3 basic        9 comprehensive
Test lines          ~200           7,738
API coverage        15/57 routes   ~40/57 routes
CI time             4 min          6 min (worth it)
Commits this week   —              62
Agent failures      —              12

What We Actually Learned

1. Design your test suite like you design your product. Test architecture isn't an afterthought. Which layers to test, what assertions matter, how tests compose — these decisions deserve the same attention as your component hierarchy. Our 6-wave structure meant every change had coverage at the right level.

2. Testing guardrails are fundamental, not optional. The UI/UX redesign made this visceral. Without tests, every design change was a gamble. With tests, we could refactor components confidently, knowing the payment flow, tier enforcement, and API contracts were verified on every commit. Start testing from day one — not after the redesign forces your hand.

3. Integrate design tools and testing early. The redesign generated new UI components, new interaction patterns, new accessibility requirements. Each of these is a testable surface. When design and testing evolve together, you catch integration issues before they reach users. When they're disconnected, you ship bugs and pray.

4. AI agents are force multipliers, not magic. When they work, a single agent can write a solid test suite in 20 minutes. When they don't, you need the same skills you always needed. The agents didn't fail because AI is bad at testing — they failed because of an empty {} in a JSON config. Infrastructure kills you, not intelligence.

5. Writing tests yourself teaches you things agents can't. We found 4 real bugs in our production code while writing test mocks. DOMParser returning null on empty strings was a live crash. No agent caught it — we found it because we had to think about edge cases while writing the test.

6. Start with E2E, not unit tests. Counterintuitive, but E2E tests against your real deployment catch integration bugs that unit tests never will. Database state bleeding, container cold starts, auth edge cases — all E2E discoveries.

The Takeaway

If you're building a product and thinking "we'll add tests later" — you won't. Or you will, but it'll be triggered by a crisis: a redesign that breaks everything, a deploy that takes down payments, a bug that costs you users.

Design your test suites from the start. Build the guardrails into your development lifecycle, not around it. And when AI agents can help — great. But know the fundamentals first, because when the agents crash, you're the fallback.


We write about AI agents, software quality, and the messy reality of building with AI. Follow us on GitHub.