Product · 6 min read · June 9, 2026

Multi-Agent Recovery: From Incident to Fix in Minutes

When an incident fires at 2am, the bottleneck isn't information — it's the time it takes a human to gather that information, interpret it, and decide what to do. SiteBrief now does all three automatically with a four-agent recovery pipeline that streams live progress as it works.

Here's what incident response looks like without automation: an alert fires at 2:17am. The on-call developer wakes up, opens four tools, manually correlates uptime data with deploy logs, figures out which commit probably caused it, writes a revert, opens a PR, waits for CI, merges, and goes back to sleep — if they're lucky — at 3:45am.

The problem isn't that the developer doesn't know what to do. The problem is that every step requires a human to manually gather context, make a decision, and execute an action. That's exactly the work that a well-designed AI pipeline should handle.

Today we're shipping Multi-Agent Recovery: a four-agent pipeline built into SiteBrief that runs the full incident-to-fix workflow — with live streaming progress so you can watch it work or step in at any point.

Multi-Agent Recovery is available now on all SiteBrief Pro plans. Trigger it from the alert notification, from the Timeline tab, or from the Recovery button on any site's dashboard.

The four agents

The pipeline runs four specialized agents in sequence. Each agent completes its task and hands a structured output to the next — no agent starts until the previous one finishes and passes a confidence check.
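The sequential handoff can be sketched roughly as follows. This is a minimal illustration, not SiteBrief's implementation: `AgentResult`, `run_pipeline`, the stub agents, and the 0.8 gate are all hypothetical, and a failed check here simply halts the run, whereas the product may pause for human approval instead.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentResult:
    agent: str
    output: dict
    confidence: float  # 0.0 to 1.0, reported by the agent itself


# Hypothetical agent signature: receives the previous agent's output (or None).
Agent = Callable[[Optional[dict]], AgentResult]


def run_pipeline(agents: list[Agent], min_confidence: float = 0.8) -> list[AgentResult]:
    """Run agents in order; stop as soon as one fails the confidence check."""
    results: list[AgentResult] = []
    previous: Optional[dict] = None
    for agent in agents:
        result = agent(previous)
        results.append(result)
        if result.confidence < min_confidence:
            break  # later agents never start
        previous = result.output
    return results


# Stub agents standing in for the real ones.
def monitor(_prev):
    return AgentResult("monitor", {"uptime": "DOWN"}, 0.95)


def debugger(prev):
    return AgentResult("debugger", {"cause": "deploy", "seen": prev}, 0.60)


def fixer(prev):
    return AgentResult("fixer", {"action": "rollback"}, 0.90)


results = run_pipeline([monitor, debugger, fixer])
print([r.agent for r in results])  # ['monitor', 'debugger']
```

In this run the stub Debugger reports 0.60, so the stub Fixer is never invoked.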

1. Monitor · ✓ Done

Re-scans the site immediately across every monitored dimension: uptime from 3 locations, PageSpeed (mobile and desktop), security headers, recent deploys, and error rate from logs. It then builds a complete snapshot of the current state and compares it with the last known-good state.

✓ Uptime: DOWN (2 of 3 nodes) · PageSpeed: 34 (was 79) · Last deploy: 14:02 (commit a3f9c12) · New critical issues: 2
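One way to picture the current-versus-last-good comparison is a field-by-field diff of two snapshot records. The sketch below is an assumption about the shape of that data: the `Snapshot` fields, the `diff_snapshots` helper, and the last-good commit SHA `9be41d0` are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Snapshot:
    uptime_nodes_up: int   # out of 3 probe locations
    pagespeed_mobile: int
    last_deploy: str       # short commit SHA
    critical_issues: int


def diff_snapshots(good: Snapshot, current: Snapshot) -> dict:
    """Return {field: (last_good_value, current_value)} for every field that changed."""
    before, after = asdict(good), asdict(current)
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}


last_good = Snapshot(3, 79, "9be41d0", 0)  # hypothetical last known-good state
current = Snapshot(1, 34, "a3f9c12", 2)    # 2 of 3 nodes down, as in the example
changes = diff_snapshots(last_good, current)
print(changes["pagespeed_mobile"])  # (79, 34)
```

A diff like this is what a downstream agent would receive alongside the full snapshot.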

2. Debugger · ✓ Done

Claude receives the full Monitor snapshot and the diff between the current and last-good states. It identifies the most probable root cause by matching patterns across deploy timing, issue type, and PageSpeed signals, then outputs a structured root cause hypothesis with a confidence level.

Root cause (87% confidence): Deploy a3f9c12 introduced a broken CSS import that fails silently in production but not staging — causing layout errors and a render-blocking LCP resource.

3. Fixer · ● Running

Based on the Debugger's root cause, Fixer either generates a code fix and opens a PR, or recommends a 1-Click Rollback to a specific safe commit. For rollbacks, it picks the most recent commit with a PageSpeed score above the regression threshold.
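The rollback selection rule (most recent commit scoring above the regression threshold) might look like the sketch below. The `Commit` type, the threshold of 70, and every SHA other than a3f9c12 are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    sha: str
    pagespeed: int  # score recorded after this commit was deployed


def pick_rollback_target(history: list[Commit], threshold: int) -> Optional[Commit]:
    """history is ordered newest first; return the newest commit scoring above threshold."""
    for commit in history:
        if commit.pagespeed > threshold:
            return commit
    return None  # no safe commit in the window


history = [
    Commit("a3f9c12", 34),  # the regressing deploy
    Commit("c0ffee1", 52),
    Commit("7d2e9a4", 79),
    Commit("1b8f0c3", 81),
]
target = pick_rollback_target(history, threshold=70)
print(target.sha if target else "no safe commit")  # 7d2e9a4
```

Note that the older commit with the higher score (81) is not chosen: the rule prefers recency, so the rollback discards as little work as possible.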

4. Reviewer · ○ Waiting

After the fix merges or the rollback deploys, Reviewer re-runs the full scan and confirms that all key metrics have returned to acceptable levels. It then closes the incident in the timeline and sends a resolution summary to Slack.

Live streaming progress

You don't stare at a spinner and wait. The Recovery panel streams live updates from each agent as they run — every sub-step is visible in real time. When Monitor is running, you see each check completing tick by tick. When Debugger is working, you see the reasoning steps as Claude forms its hypothesis.

This matters for two reasons. First, it builds trust — you can see exactly why the agents reached each conclusion, not just the final output. Second, it lets you intervene at any point. If Monitor's snapshot looks wrong (maybe the site recovered between the alert and the pipeline starting), you can stop the pipeline before Fixer opens an unnecessary PR.
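Streaming with an early stop can be modeled with a lazy generator: each sub-step is emitted as it completes, and cancelling means the remaining steps never run. This is only a sketch of the idea; the check names and the `stream` helper are hypothetical, and the real panel presumably streams over the network rather than in-process.

```python
from typing import Callable, Iterator


def monitor_events() -> Iterator[str]:
    """Yield one progress line per completed check (hypothetical check names)."""
    for check in ("uptime", "pagespeed", "security headers", "deploys", "error rate"):
        yield f"monitor: {check} done"


def stream(events: Iterator[str], should_stop: Callable[[str], bool]) -> list[str]:
    """Forward events as they arrive; stop early when should_stop(event) is true."""
    seen = []
    for event in events:
        seen.append(event)
        if should_stop(event):
            seen.append("pipeline stopped by user")
            break  # the generator is never advanced again
    return seen


# Simulate a user who stops the run after seeing the PageSpeed result.
log = stream(monitor_events(), should_stop=lambda e: "pagespeed" in e)
print(log)
# ['monitor: uptime done', 'monitor: pagespeed done', 'pipeline stopped by user']
```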

Confidence thresholds and human gates

The pipeline has built-in safety gates. Fixer will only open a PR automatically if the Debugger's root cause confidence is ≥80%. Below that threshold, Fixer presents its recommendation for your approval before taking action. You can lower or raise this threshold per site in Recovery Settings.
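The gate itself reduces to a single comparison. In the sketch below, the function name and return values are hypothetical; only the 80% default and the per-site override come from the description above.

```python
def fixer_action(confidence: float, threshold: float = 0.80) -> str:
    """Return 'open_pr' at or above the threshold, otherwise 'await_approval'."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return "open_pr" if confidence >= threshold else "await_approval"


print(fixer_action(0.87))                  # open_pr
print(fixer_action(0.79))                  # await_approval
print(fixer_action(0.87, threshold=0.90))  # await_approval (stricter per-site setting)
```

With the default threshold, the 87% hypothesis from the Debugger example would proceed automatically; raising the site's threshold to 90% would route the same hypothesis to a human first.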

1-Click Rollback

When Fixer determines that a rollback is the safest path, it surfaces the last 5 commits with their associated PageSpeed scores in a simple picker. You click the commit you want to revert to, and SiteBrief creates the revert PR on your GitHub or GitLab repository — with a descriptive title, the Debugger's root cause in the PR body, and a link back to the incident in SiteBrief.

  • The revert PR is created on a dedicated branch (sitebrief/rollback-a3f9c12) — not pushed directly to main
  • You review and merge the PR exactly as you would any other change
  • After merge, Reviewer automatically re-scans and confirms the rollback resolved the incident
  • The full sequence is recorded in the AI Incident Timeline with commit SHAs and timestamps

You can also trigger 1-Click Rollback outside of the recovery pipeline — there's a standalone Rollback button on the Timeline tab that lets you pick a safe commit and create the revert PR without running the full agent pipeline.

Natural Language Q&A

While the agents are running (or after they complete), you can ask questions about the incident in plain English. The Q&A panel is always visible in the Recovery drawer:

Q: Why did conversions drop today?
A: The deploy at 14:02 caused PageSpeed to drop from 79 to 34 on mobile. LCP increased from 1.8s to 7.4s, roughly a 4× increase. Studies show that an LCP increase of 3× or more typically reduces mobile conversion rate by 15–25%. The issue was resolved at 14:16 after a hotfix deploy.
Q: Was this the same issue as last week?
A: No. Last week's incident (2026-06-02) was caused by a missing Strict-Transport-Security header that triggered browser warnings. This incident is a CSS import failure causing render-blocking, which is a different root cause.
Q: What should I tell the client?
A: Suggested message: "We detected a deployment issue at 14:02 today and resolved it by 14:16. The site experienced about 14 minutes of degraded performance and is now fully recovered. We've added a pre-deploy PageSpeed check to prevent similar issues."

The Q&A is powered by Claude with full context of the current incident — it has access to all monitoring data, the Debugger's findings, the deploy history, and any previous incidents for the same site. Questions about "today" or "last week" are answered relative to real data, not generic advice.
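Grounding answers in real data generally means packing the incident context into the request alongside the question. The sketch below shows one plausible way to assemble such a prompt; the function, the field names, and the prompt wording are hypothetical, and no model API is called.

```python
import json


def build_qa_prompt(question: str, incident: dict, previous_incidents: list[dict]) -> str:
    """Combine the question with incident data so answers can cite real values."""
    context = {"current_incident": incident, "previous_incidents": previous_incidents}
    return (
        "Answer using only the incident data below.\n"
        f"DATA:\n{json.dumps(context, indent=2, sort_keys=True)}\n"
        f"QUESTION: {question}"
    )


prompt = build_qa_prompt(
    "Was this the same issue as last week?",
    incident={"deploy": "a3f9c12", "root_cause": "CSS import failure"},
    previous_incidents=[{"date": "2026-06-02", "root_cause": "missing HSTS header"}],
)
print(prompt)
```

Including previous incidents in the context is what lets a question about "last week" resolve to a specific dated record.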

When to use the full pipeline vs. individual tools

Multi-Agent Recovery is designed for active incidents — when you have a live problem and want the fastest path to a fix. But each component also works standalone:

  • AI Incident Timeline — for post-mortem analysis after an incident is resolved
  • 1-Click Rollback — when you already know what commit caused the problem
  • Natural Language Q&A — for answering client questions or building a status update
  • Visual AI Diff — for catching layout regressions before they become incidents

Used together, these features form a complete incident management workflow — from detection through recovery — without leaving SiteBrief.

What's coming next

The next iteration of the pipeline will add a fifth agent: Communicator, which automatically drafts and sends a status page update, a Slack message to your team, and a client-facing email summary — all triggered after Reviewer confirms the resolution. No copy-paste required.

Try Multi-Agent Recovery: trigger it from the Recovery button on any site in your SiteBrief dashboard, or click the recovery action in your next alert notification. Available on all Pro plans.