#ai-agents #multi-agent #self-improvement #production #architecture

How We Built AI Agents That Improve Themselves

Pi Helix & Steffen Henkelmann

Inside our journey from a specialized agent squad to a self-improving system. Real production learnings, honest failures, and the architecture that makes agents better over time.

Three months ago we introduced our seven-agent squad. The squad worked — but Pi was still making the same mistakes sprint after sprint. Missing production checks. Skipping QA handoffs. Deploying fixes that worked technically but broke user experience.

The problem wasn't the squad. It was that the squad didn't learn.

Here's what happened when we gave our agents the ability to improve themselves.

The Breaking Point

May 7, 2026. User reports: "Waitlist signup shows a white screen with raw JSON."

Pi's response: Create a new /api/subscribe route, point the form there. API returns {"success": true}. Ship it.

What Pi missed: The form was now navigating to /api/subscribe instead of staying on the page. API worked perfectly. UX was completely broken.

Root cause of the mistake:

  • Didn't check Sentry for real production errors
  • Tested API endpoint, not the full user flow
  • Didn't hand off to Sentinel for QA validation

Working API ≠ working UX. Pi had made this category of mistake before. Nothing was in place to prevent it from happening again.

Self-Improvement: Agents That Write Their Own Skills

The fix wasn't more instructions. It was giving agents a way to persist what they learn.

Each agent has a skill directory:

squad/
├── sentinel/
│   ├── perf-workflow.md    # Lighthouse CI integration
│   └── two-stage-review.md # Spec compliance + quality review
├── aegis/
│   ├── red-team-playbook.md
│   └── vulnerability-reporting.md
└── forge/
    └── rapid-prototyping.md

When an agent encounters a gap, it writes a skill file. When it spawns next session, it loads that file. The learning persists.
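A minimal sketch of this load-on-spawn pattern, assuming the `squad/` layout shown above; the `load_skills` function and the way skills join the session prompt are illustrative, not our actual runtime:

```python
from pathlib import Path

def load_skills(agent: str, squad_dir: str = "squad") -> str:
    """Concatenate an agent's skill files into one context block.

    Each *.md file under squad/<agent>/ becomes a section the agent
    sees at spawn time, so a skill written last session persists.
    """
    sections = []
    for skill in sorted((Path(squad_dir) / agent).glob("*.md")):
        sections.append(f"## Skill: {skill.name}\n{skill.read_text()}")
    return "\n\n".join(sections)

# At spawn time the skills simply become part of the system prompt:
# prompt = BASE_PROMPT + "\n\n" + load_skills("sentinel")
```

The key property is that the write path (agent saves a markdown file) and the read path (next session loads the directory) need no shared state beyond the filesystem.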

Example: Sentinel's Performance Workflow

Sentinel needed a systematic way to run Lighthouse checks. Instead of being told how, it wrote perf-workflow.md:

## Pass criteria

- Performance: ≥85
- Accessibility: ≥90
- LCP: ≤2.5s

## Workflow

1. Check lighthouserc.yml exists in repo
2. Run Lighthouse on PR preview URL
3. Post results as PR comment with pass/warn/fail badges
4. Block merge if any metric fails

Now Sentinel runs this automatically on every PR. No manual reminder. No re-explaining.
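The pass criteria above reduce to a simple merge gate. The thresholds are the ones Sentinel wrote into perf-workflow.md; the function name and result shape here are illustrative:

```python
def lighthouse_gate(scores: dict) -> dict:
    """Apply Sentinel's pass criteria to a Lighthouse run.

    Category scores are 0-100; LCP is in seconds.
    """
    checks = {
        "performance":   scores["performance"] >= 85,
        "accessibility": scores["accessibility"] >= 90,
        "lcp":           scores["lcp_seconds"] <= 2.5,
    }
    # Merge is blocked if any single metric fails (step 4 of the workflow)
    checks["merge_allowed"] = all(checks.values())
    return checks

result = lighthouse_gate({"performance": 91, "accessibility": 88, "lcp_seconds": 2.1})
# accessibility misses the ≥90 bar, so merge_allowed comes back False
```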

Example: Aegis's Red Team Playbook

After finding 6 security vulnerabilities in Acornonaut, Aegis documented the exact pattern in red-team-playbook.md — an 8-step checklist covering auth bypass, webhook secrets, MCP security, rate limiting, and CORS/CSP. It runs this checklist in every quarterly security sweep.

Real findings from April 2026:

  • Admin MCP tools callable by any authenticated user
  • Webhook secret not deployed (fake subscription upgrade risk)
  • Unauthenticated /api/mcp leaking admin tool names

Each finding is now a permanent checkpoint in future sweeps.
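A sketch of how a playbook like this can drive a sweep. The check names come from the areas listed above; the runner itself and its callback interface are hypothetical:

```python
# Each past finding becomes a named, permanent check in the playbook
RED_TEAM_CHECKS = [
    "auth-bypass",
    "webhook-secrets",
    "mcp-security",
    "rate-limiting",
    "cors-csp",
]

def run_sweep(check_passes) -> list:
    """Run every playbook check; return the names of those that failed.

    `check_passes` is a callable (check name -> bool) standing in for
    whatever the agent actually does to verify each item.
    """
    return [name for name in RED_TEAM_CHECKS if not check_passes(name)]

# Example: a sweep where only the webhook secret check fails
findings = run_sweep(lambda name: name != "webhook-secrets")
# findings == ["webhook-secrets"]
```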

Learning from Failures: The Memory Layer

Skills handle "how to do things." Memory handles "what went wrong."

After the white screen incident, Pi wrote a permanent decision tree:

Production issue reported
  ↓
1. Check Sentry FIRST
  ↓
2. Reproduce the full user flow (not just the API)
  ↓
3. Fix works technically? → Hand off to Sentinel for E2E validation
  ↓
4. Both pass? → Deploy

This lives in memory/2026-05-07.md and gets loaded at session start. Pi can't skip Sentry anymore — the rule is written down.
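The decision tree reads naturally as a deploy gate. The step names mirror the tree above; the function and its boolean flags are an illustration, not the actual memory format:

```python
def can_deploy(sentry_checked: bool,
               flow_reproduced: bool,
               fix_works: bool,
               sentinel_e2e_passed: bool) -> bool:
    """Enforce Pi's production-issue decision tree, in order."""
    if not sentry_checked:     # 1. Check Sentry FIRST
        return False
    if not flow_reproduced:    # 2. Reproduce the full user flow, not just the API
        return False
    # 3-4. A technically working fix still needs Sentinel's E2E pass
    return fix_works and sentinel_e2e_passed
```

Because every branch short-circuits to False, there is no path to deploy that skips an earlier step — which is the whole point of writing the rule down.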

Four GitHub issues also came out of this failure:

  • #81: Restore Sentry monitoring script
  • #82: Set up Playwright E2E testing
  • #83: Wire Sentinel to check Sentry before debugging
  • #84: Improve Pi → Sentinel handoff protocol

One mistake → four process improvements. That's the feedback loop working correctly.

Design System: Consistency Without Reminders

All UI agents (Daedalus, Forge, Muse) now reference DESIGN.md files before any UI work:

<!-- In Forge's rapid-prototyping.md -->

## Before Any UI Work

1. Read DESIGN.md in the target repo
2. Extract design tokens (colors, typography, spacing)
3. Reference component patterns
4. Match existing code style

The design system is also linted (@google/design.md) and CI-enforced. Agents maintain visual consistency without anyone having to remind them — because the reminder is part of their skill.
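Step 2 above (extracting design tokens) might look like the sketch below. The DESIGN.md shape and the parser are assumptions — a simple `name: value` list under headings:

```python
import re

def extract_tokens(design_md: str) -> dict:
    """Pull `token: value` pairs (e.g. `primary: #0ea5e9`) out of DESIGN.md."""
    # Matches optional list bullet, a token name, a colon, then the value
    pattern = re.compile(r"^[-*]?\s*`?([\w-]+)`?\s*:\s*(\S+)", re.MULTILINE)
    return dict(pattern.findall(design_md))

sample = """
## Colors
- primary: #0ea5e9
- surface: #0b1220

## Spacing
- gutter: 24px
"""
tokens = extract_tokens(sample)
# tokens == {"primary": "#0ea5e9", "surface": "#0b1220", "gutter": "24px"}
```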

Real Production Numbers

April 2026

  • 8.48B tokens processed across all agents
  • $487 actual cost ($147 subscriptions + $340 overages)
  • $823 market equivalent — 41% savings vs pay-as-you-go
  • Biggest surprise: Small API calls (heartbeats, sub-agent spawns) burn premium requests. Each costs the same as a full coding session. Pi now batches checks to reduce this.
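The batching Pi uses to avoid that burn is simple in principle: collapse many small checks into one request. This sketch is illustrative (the class, the batch size, and the callback are all assumptions):

```python
class CheckBatcher:
    """Collect small checks and flush them as one API call instead of N."""

    def __init__(self, send, batch_size: int = 10):
        self.send = send              # callable that takes a list of checks
        self.batch_size = batch_size
        self.pending = []
        self.requests_made = 0

    def add(self, check: str):
        self.pending.append(check)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.send(self.pending)   # one premium request covers the whole batch
            self.requests_made += 1
            self.pending = []

# 25 heartbeats with batch_size=10 cost 3 requests instead of 25
```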

Quality (Before vs After Self-Improvement)

  • Bugs per week: 3 → 0.5
  • Pi coordination overhead: 60% → 20% of session time
  • Security vulnerabilities caught pre-production: 0 → 6

What Didn't Work (Honest)

Forge mixes branches. When given concurrent tasks, Forge bleeds changes across branches. Partially solved by spawning isolated instances per task — still flaky.

Pi shortcuts under pressure. Urgency makes Pi skip the process ("I'll check Sentry later"). Fixed with a hard-coded decision tree. Can't skip even if it feels urgent.

Agent assignment is still manual. Pi decides which agent gets which task. Vision: agents self-assign from a shared queue. Blocker: nothing consumes GitHub labels yet.

What's Next

  • Playwright E2E tests — would have caught the white screen bug before users saw it
  • Automated agent assignment — GitHub label agent:forge auto-spawns Forge
  • Agent peer review — before committing a new skill, another agent reviews it
  • Cross-repo learning — share patterns and lessons across all Zerolve projects

The Core Insight

The vision isn't perfect agents. It's agents that improve themselves faster than we could improve them manually.

Skills are cheaper than training. Memory is cheaper than re-explaining. One mistake → one decision tree → that mistake never happens the same way again.

Three months in, we're not there yet. But the gap closes every sprint.


Part of our Agent Squad series. Questions or building something similar? hello@zerolve.io