# agent-platform
A 5-agent pipeline that detects production errors, grounds the diagnosis against the actual repo, generates a patch, validates it in a Docker sandbox, and opens a PR — all before anyone gets paged. Designed around three constraints: zero polling load on production, no fabricated symbols in fixes, and human approval for anything HIGH/CRITICAL.
## Pipeline
Every production error flows through this graph. Sandbox failure regenerates the fix up to 3× before any GitHub noise. HIGH/CRITICAL incidents stop at the approval gate.
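A minimal sketch of that control flow, with hypothetical stubs standing in for the real agents, sandbox runner, and approval queue. Only the 3× retry and the approval gate come from the description above; every name, signature, and the P0/P1-to-HIGH/CRITICAL mapping is an assumption.

```python
from dataclasses import dataclass

MAX_SANDBOX_RETRIES = 3

@dataclass
class Triage:
    classification: str  # "real" | "noise" | "duplicate"
    severity: str        # "P0".."P3"

# Hypothetical stubs for the five agents and their surroundings.
def run_triage(event): return Triage("real", "P1")
def run_diagnosis(event, triage): return {"root_cause": "..."}
def generate_fix(diagnosis): return {"patch": "..."}
def sandbox_passes(fix): return True          # Docker sandbox + real test suite
def run_code_review(fix): return {"notes": []}
def needs_human_approval(severity): return severity in ("P0", "P1")  # assumed HIGH/CRITICAL mapping
def await_human_approval(fix): pass
def open_pull_request(fix, review): pass

def handle_incident(event):
    triage = run_triage(event)
    if triage.classification != "real":
        return  # noise and duplicates are dropped before any LLM-heavy work

    diagnosis = run_diagnosis(event, triage)

    for _ in range(MAX_SANDBOX_RETRIES):
        fix = generate_fix(diagnosis)
        if sandbox_passes(fix):
            break
    else:
        return  # a fix that never passes tests never becomes a PR

    review = run_code_review(fix)
    if needs_human_approval(triage.severity):
        await_human_approval(fix)  # HIGH/CRITICAL stops here until a human signs off
    open_pull_request(fix, review)
    # MonitorGenerationAgent runs later, on PR merge, not in this request path.
```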
## Deployment topology
Cloudflare-fronted ALB sits in front of ECS Fargate. Postgres + pgvector is the source of truth for incidents and the RAG index. CI/CD is OIDC — no secrets in the repo.
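For concreteness, a hedged sketch of what the Postgres side might look like with SQLAlchemy Core and pgvector. Table names, columns, and the embedding dimension are assumptions, not the repo's actual schema.

```python
from sqlalchemy import Column, DateTime, Integer, MetaData, Table, Text, func
from pgvector.sqlalchemy import Vector

metadata = MetaData()

# Source of truth for incident state.
incidents = Table(
    "incidents", metadata,
    Column("id", Integer, primary_key=True),
    Column("severity", Text, nullable=False),   # P0–P3
    Column("status", Text, nullable=False),     # triage / diagnosis / fix / ...
    Column("created_at", DateTime, server_default=func.now()),
)

# RAG index: one row per embedded chunk of past-incident / code context.
rag_chunks = Table(
    "rag_chunks", metadata,
    Column("id", Integer, primary_key=True),
    Column("incident_id", Integer),
    Column("content", Text, nullable=False),
    Column("embedding", Vector(1536)),  # dimension is an assumption
)
```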
## The agents
### TriageAgent
ReAct + typed wrapper. Classifies events as real / noise / duplicate. P0–P3 severity. Haiku for cost. (Output contract sketched below.)
### DiagnosisAgent
ReAct + grounding guards. Grounds every claim against the live repo via GitHub Code Search.
### FixGenerationAgent
Direct LLM + sandbox loop. Writes the patch. Runs it in a Docker sandbox. Retries up to 3× on test failure.
### CodeReviewAgent
Direct LLM. Self-critique pass before opening the PR. Catches obvious regressions.
### MonitorGenerationAgent
ReAct + dry-run tools. On PR merge, generates new CloudWatch alarms for new code paths.
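The typed wrapper is easiest to show as a contract. A minimal sketch of what TriageAgent's output schema could look like in Pydantic v2 (which the stack below lists); the class and field names are illustrative, not the repo's actual models.

```python
from enum import Enum
from pydantic import BaseModel, Field

class Classification(str, Enum):
    REAL = "real"
    NOISE = "noise"
    DUPLICATE = "duplicate"

class Severity(str, Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"

class TriageResult(BaseModel):
    classification: Classification
    severity: Severity
    rationale: str = Field(description="One-line justification for downstream agents")

# The wrapper validates raw LLM output against the schema, so malformed JSON
# or out-of-range values never cross an agent handoff.
raw = '{"classification": "real", "severity": "P1", "rationale": "new stack trace, no open incident matches"}'
result = TriageResult.model_validate_json(raw)
```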
## How the metrics are computed
- Time-to-first-PR — wall-clock from CloudWatch alarm webhook receipt to GitHub PR creation. Decomposed in `scripts/measure_mttr.py` into agent-bound (LLM + tool latency) vs human-bound (approval queue) so the agent number is attributable (a sketch of the decomposition follows this list).
- Triage accuracy — TriageAgent run against `app/evals/golden_dataset.jsonl` (100+ labelled cases: real/noise/duplicate × P0–P3). The eval runner isolates the agent from live AWS via stubs.
- False-positive rate — fraction of incoming events classified as real that turned out to be noise on human review. Captured into the golden dataset so the next eval has a higher bar.
- Sample size — production incidents in `agent_platform.db` from the live deploy. Refreshed whenever `measure_mttr.py` is re-run.
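A minimal sketch of the agent-bound vs human-bound split that `scripts/measure_mttr.py` is described as computing, assuming each incident carries these timestamps; the field names are illustrative, not the script's actual schema.

```python
from datetime import datetime, timedelta

def decompose_mttr(alarm_received: datetime,
                   pr_opened: datetime,
                   approval_requested: datetime | None = None,
                   approval_granted: datetime | None = None) -> dict[str, timedelta]:
    """Split wall-clock time-to-first-PR into agent-bound and human-bound parts."""
    total = pr_opened - alarm_received
    # Incidents below the HIGH/CRITICAL threshold never enter the approval
    # queue, so their human-bound component is zero by construction.
    human = (approval_granted - approval_requested) \
        if approval_requested and approval_granted else timedelta(0)
    return {"total": total, "agent_bound": total - human, "human_bound": human}
```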
## Tech stack

| Layer | Choice |
| --- | --- |
| Agent runtime | Anthropic SDK · Claude Opus 4.6 / Sonnet 4.6 / Haiku 4.5 |
| Web framework | FastAPI · WebSocket streaming · Pydantic v2 |
| Storage | Postgres (SQLAlchemy Core) · pgvector for RAG |
| Sandbox | Docker Compose · Jest · mongodb-memory-server |
| Tracing | Langfuse — every LLM call + tool execution as nested spans |
| Resilience | Circuit breakers · schema validation at handoffs · context checkpointing |
| Deploy | AWS ECS Fargate · ALB · Cloudflare · GitHub Actions OIDC |
| Frontend | React + Vite · WebSocket dashboard · served from same container |
## Design decisions worth defending
- Push, not poll. CloudWatch alarms → SNS → HTTPS webhook. Zero polling load on the production server. Detection latency drops from ~7 days (human attention) to single-digit minutes. (Handler sketched after this list.)
- RAG finds candidates. The live store confirms truth. ChromaDB (and now pgvector) holds index-time snapshots; current incident state lives in Postgres. Blocking decisions always re-read the live store — the post explains why.
- Hard blocks are deterministic, soft hints are LLM-shaped. Dedup is a hard block (drop the event); regression context is a soft hint (injected into the prompt). Mixing the two created a four-failure-mode bug — walked through here.
- Ground every symbol against the repo. DiagnosisAgent runs `verify_symbol_in_repo` via GitHub Code Search; a server-side guard re-checks every named function in the parsed output and rejects fabricated camelCase identifiers. (Guard sketched after this list.)
- Sandbox before PR. Every fix runs in a Docker container against the real test suite. If tests fail, regenerate up to 3× before opening any GitHub noise.
- Approval gate for HIGH/CRITICAL. Configurable risk threshold; rejections are logged as RLHF preference pairs.
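Two of these decisions are easier to see in code. First, the push path: a minimal sketch of the SNS webhook handler, assuming FastAPI (which the stack lists). The route and `enqueue_incident` are hypothetical; the envelope fields (`Type`, `Message`, `SubscribeURL`) are standard SNS webhook behavior.

```python
import json
import urllib.request

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhooks/cloudwatch")  # hypothetical route
async def sns_webhook(request: Request):
    envelope = json.loads(await request.body())
    if envelope.get("Type") == "SubscriptionConfirmation":
        # One-time handshake: SNS delivers nothing until the URL is confirmed.
        urllib.request.urlopen(envelope["SubscribeURL"])
        return {"status": "confirmed"}
    alarm = json.loads(envelope["Message"])  # the CloudWatch alarm payload
    enqueue_incident(alarm)                  # hand off to TriageAgent and return fast
    return {"status": "accepted"}

def enqueue_incident(alarm: dict) -> None:
    ...  # stub: the real service kicks off the pipeline asynchronously
```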
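Second, the grounding guard: a hedged sketch of the server-side re-check that rejects fabricated identifiers. The regex and the `symbol_exists` callback are illustrative; in the real pipeline the lookup is `verify_symbol_in_repo` backed by GitHub Code Search.

```python
import re
from typing import Callable

# Matches camelCase identifiers such as getUserById or parseStackTrace.
CAMEL_CASE = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]*)+\b")

def guard_diagnosis(diagnosis_text: str, symbol_exists: Callable[[str], bool]) -> None:
    """Raise if the diagnosis names any function the repo search can't find."""
    for symbol in sorted(set(CAMEL_CASE.findall(diagnosis_text))):
        if not symbol_exists(symbol):
            raise ValueError(f"fabricated symbol in diagnosis: {symbol}")

# Usage: guard_diagnosis(parsed_output, symbol_exists=lambda s: s in repo_index)
```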