Here's what incident response looks like without automation: an alert fires at 2:17am. The on-call developer wakes up, opens four tools, manually correlates uptime data with deploy logs, figures out which commit probably caused it, writes a revert, opens a PR, waits for CI, merges, and goes back to sleep — if they're lucky — at 3:45am.
The problem isn't that the developer doesn't know what to do. The problem is that every step requires a human to manually gather context, make a decision, and execute an action. That's exactly the work that a well-designed AI pipeline should handle.
Today we're shipping Multi-Agent Recovery: a four-agent pipeline built into SiteBrief that runs the full incident-to-fix workflow — with live streaming progress so you can watch it work or step in at any point.
The four agents
The pipeline runs four specialized agents in sequence. Each agent completes its task and hands a structured output to the next — no agent starts until the previous one finishes and passes a confidence check.
Monitor re-scans the site immediately across all dimensions: uptime from 3 locations, PageSpeed (mobile + desktop), security headers, recent deploys, and error rate from logs. It builds a complete snapshot of the current state vs. the last known-good state.
Debugger, powered by Claude, receives the full Monitor snapshot and the diff between the current and last known-good state. It identifies the most probable root cause by pattern matching across deploy timing, issue type, and PageSpeed signal, then outputs a structured root cause hypothesis with a confidence level.
Based on the Debugger's root cause, Fixer either generates a code fix and opens a PR, or recommends a 1-Click Rollback to a specific safe commit. For rollbacks, it picks the most recent commit with a PageSpeed score above the regression threshold.
After the fix merges or rollback deploys, Reviewer re-runs the full scan and confirms that all key metrics returned to acceptable levels. It closes the incident in the timeline and sends a resolution summary to Slack.
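To make the sequential handoff concrete, here is a minimal sketch of four agents run in order, each receiving the previous agent's structured output. Everything in it is an assumption for illustration: the `AgentResult` shape, the `run_pipeline` function, the stub agents, and the choice to halt when a result falls below the confidence floor are hypothetical and not SiteBrief's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentResult:
    """Structured output handed from one agent to the next (hypothetical shape)."""
    agent: str
    payload: dict
    confidence: float  # 0.0 to 1.0


# An agent takes the previous result (None for the first agent) and returns its own.
Agent = Callable[[Optional[AgentResult]], AgentResult]


def run_pipeline(agents: list[Agent], min_confidence: float) -> list[AgentResult]:
    """Run agents strictly in order; stop at the first result below min_confidence."""
    results: list[AgentResult] = []
    previous: Optional[AgentResult] = None
    for agent in agents:
        result = agent(previous)
        results.append(result)
        if result.confidence < min_confidence:
            break  # assumed behavior: hand control back to a human
        previous = result
    return results


# Stub agents standing in for Monitor, Debugger, Fixer, and Reviewer.
def monitor(_prev):
    return AgentResult("monitor", {"uptime_ok": False, "pagespeed_mobile": 41}, 1.0)


def debugger(prev):
    return AgentResult("debugger", {"root_cause": "recent deploy", "snapshot": prev.payload}, 0.9)


def fixer(prev):
    return AgentResult("fixer", {"action": "rollback", "because": prev.payload["root_cause"]}, 0.9)


def reviewer(prev):
    return AgentResult("reviewer", {"resolved": True, "verified_action": prev.payload["action"]}, 1.0)


results = run_pipeline([monitor, debugger, fixer, reviewer], min_confidence=0.8)
print([r.agent for r in results])
```

Because each agent only sees the previous agent's result, a low-confidence hypothesis from the Debugger stub would stop the run before the Fixer stub is ever called.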
Live streaming progress
You don't stare at a spinner and wait. The Recovery panel streams live updates from each agent as they run — every sub-step is visible in real time. When Monitor is running, you see each check completing tick by tick. When Debugger is working, you see the reasoning steps as Claude forms its hypothesis.
This matters for two reasons. First, it builds trust — you can see exactly why the agents reached each conclusion, not just the final output. Second, it lets you intervene at any point. If Monitor's snapshot looks wrong (maybe the site recovered between the alert and the pipeline starting), you can stop the pipeline before Fixer opens an unnecessary PR.
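One way to picture streaming with an intervention point is a generator that emits an event per completed sub-step and checks a stop flag before starting the next one. This is only a sketch: the `stream_checks` function, the event dictionary shape, and the check names are hypothetical, and the real Recovery panel may work quite differently.

```python
import threading
from typing import Iterator


def stream_checks(checks: list[str], stop: threading.Event) -> Iterator[dict]:
    """Yield one progress event per check; stop early if the stop flag is set."""
    for name in checks:
        if stop.is_set():
            yield {"agent": "monitor", "step": name, "status": "cancelled"}
            return
        # A real check would run here; this sketch only reports completion.
        yield {"agent": "monitor", "step": name, "status": "done"}


stop = threading.Event()
events = []
for event in stream_checks(["uptime", "pagespeed", "security_headers", "deploys"], stop):
    events.append(event)
    if event["step"] == "pagespeed":
        stop.set()  # simulate the user pressing Stop after the second check

for e in events:
    print(e["step"], e["status"])
```

Checking the flag between sub-steps, rather than only between agents, is what would let a user halt the run before a later agent takes an action such as opening a PR.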
Confidence thresholds and human gates
The pipeline has built-in safety gates. Fixer will only open a PR automatically if the Debugger's root cause confidence is ≥80%. Below that threshold, Fixer presents its recommendation for your approval before taking action. You can lower or raise this threshold per site in Recovery Settings.
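The gate itself amounts to a single comparison. The sketch below assumes confidence is expressed as a fraction between 0 and 1 and that the per-site setting is a single number. The names `SiteRecoverySettings`, `auto_fix_threshold`, and `fixer_mode` are hypothetical.

```python
from dataclasses import dataclass

DEFAULT_THRESHOLD = 0.80  # the 80% threshold described above


@dataclass
class SiteRecoverySettings:
    """Per-site override of the auto-fix threshold (hypothetical settings shape)."""
    auto_fix_threshold: float = DEFAULT_THRESHOLD


def fixer_mode(confidence: float, settings: SiteRecoverySettings) -> str:
    """Decide whether Fixer acts on its own or waits for human approval."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= settings.auto_fix_threshold:
        return "auto_open_pr"
    return "await_approval"


print(fixer_mode(0.85, SiteRecoverySettings()))                         # default threshold
print(fixer_mode(0.85, SiteRecoverySettings(auto_fix_threshold=0.95)))  # stricter site
```

Raising the threshold for a site makes Fixer ask for approval more often. Lowering it lets more fixes proceed without a human in the loop.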
1-Click Rollback
When Fixer determines that a rollback is the safest path, it surfaces the last 5 commits with their associated PageSpeed scores in a simple picker. You click the commit you want to revert to, and SiteBrief creates the revert PR on your GitHub or GitLab repository — with a descriptive title, the Debugger's root cause in the PR body, and a link back to the incident in SiteBrief.
- The revert PR is created on a dedicated branch (sitebrief/rollback-a3f9c12), not pushed directly to main
- You review and merge the PR exactly as you would any other change
- After merge, Reviewer automatically re-scans and confirms the rollback resolved the incident
- The full sequence is recorded in the AI Incident Timeline with commit SHAs and timestamps
You can also trigger 1-Click Rollback outside of the recovery pipeline — there's a standalone Rollback button on the Timeline tab that lets you pick a safe commit and create the revert PR without running the full agent pipeline.
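The rollback target selection can be sketched in a few lines. The code assumes commit history is ordered newest first and that each commit carries one recorded PageSpeed score. It also assumes the suffix in the branch name is the short SHA of the rollback target, which the example branch name suggests but does not confirm. `Commit`, `picker_candidates`, `recommend_rollback`, and `rollback_branch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    sha: str
    pagespeed: int  # score recorded for this commit, 0 to 100


def picker_candidates(history: list[Commit], limit: int = 5) -> list[Commit]:
    """Return the most recent commits to show in the picker (history is newest first)."""
    return history[:limit]


def recommend_rollback(history: list[Commit], threshold: int) -> Optional[Commit]:
    """Return the most recent commit whose score is above the regression threshold."""
    for commit in history:
        if commit.pagespeed > threshold:
            return commit
    return None  # no safe commit found in the available history


def rollback_branch(commit: Commit) -> str:
    """Build a dedicated branch name from the short SHA (assumed naming scheme)."""
    return f"sitebrief/rollback-{commit.sha[:7]}"


history = [
    Commit("f00dbabe1234", 41),     # newest commit, the regression
    Commit("a3f9c12deadbeef", 92),  # last commit above the threshold
    Commit("0123456789ab", 95),
]
target = recommend_rollback(history, threshold=80)
print(target.sha, rollback_branch(target))
```

Returning `None` when no commit clears the threshold keeps the decision with a human rather than guessing at a rollback target.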
Natural Language Q&A
While the agents are running (or after they complete), you can ask questions about the incident in plain English. The Q&A panel is always visible in the Recovery drawer.
The Q&A is powered by Claude with full context of the current incident — it has access to all monitoring data, the Debugger's findings, the deploy history, and any previous incidents for the same site. Questions about "today" or "last week" are answered relative to real data, not generic advice.
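Answering relative to real data implies turning a phrase like "last week" into a concrete date range before looking up records. The sketch below shows one way that could work. It assumes "last week" means the seven days before today, not the previous calendar week, and the `resolve_window` and `incidents_in_window` functions and the incident record shape are hypothetical.

```python
from datetime import date, timedelta


def resolve_window(phrase: str, today: date) -> tuple[date, date]:
    """Return an inclusive (start, end) date range for a relative phrase."""
    if phrase == "today":
        return today, today
    if phrase == "last week":
        # Assumption: the seven days before today, not the previous calendar week.
        return today - timedelta(days=7), today - timedelta(days=1)
    raise ValueError(f"unsupported phrase: {phrase!r}")


def incidents_in_window(incidents: list[dict], window: tuple[date, date]) -> list[dict]:
    """Keep only the incidents whose date falls inside the window."""
    start, end = window
    return [i for i in incidents if start <= i["date"] <= end]


today = date(2025, 1, 15)
incidents = [
    {"id": 1, "date": date(2025, 1, 10)},
    {"id": 2, "date": date(2025, 1, 15)},
]
print(incidents_in_window(incidents, resolve_window("last week", today)))
```

The filtered records, not the raw phrase, are what would then be supplied as context for the answer.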
When to use the full pipeline vs. individual tools
Multi-Agent Recovery is designed for active incidents — when you have a live problem and want the fastest path to a fix. But each component also works standalone:
- AI Incident Timeline — for post-mortem analysis after an incident is resolved
- 1-Click Rollback — when you already know what commit caused the problem
- Natural Language Q&A — for answering client questions or building a status update
- Visual AI Diff — for catching layout regressions before they become incidents
Used together, these features form a complete incident management workflow — from detection through recovery — without leaving SiteBrief.
What's coming next
The next iteration of the pipeline will add a fifth agent: Communicator, which automatically drafts and sends a status page update, a Slack message to your team, and a client-facing email summary — all triggered after Reviewer confirms the resolution. No copy-paste required.