# agent-platform
A 5-agent pipeline that detects production errors, grounds the diagnosis against the actual repo, generates a patch, validates it in a Docker sandbox, and opens a PR — all before anyone gets paged. Designed around three constraints: zero polling load on production, no fabricated symbols in fixes, and human approval for anything HIGH/CRITICAL.
## Pipeline
Every production error flows through this graph. Sandbox failure regenerates the fix up to 3× before any GitHub noise. HIGH/CRITICAL incidents stop at the approval gate.
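A minimal sketch of that control flow, with hypothetical stubs standing in for the real agents, sandbox runner, and approval queue. Only the 3× retry and the approval gate come from the description above; every name, signature, and the P0/P1-to-HIGH/CRITICAL mapping is an assumption.

```python
from dataclasses import dataclass

MAX_SANDBOX_RETRIES = 3

@dataclass
class Triage:
    classification: str  # "real" | "noise" | "duplicate"
    severity: str        # "P0".."P3"

# Hypothetical stubs for the five agents and their surroundings.
def run_triage(event): return Triage("real", "P1")
def run_diagnosis(event, triage): return {"root_cause": "..."}
def generate_fix(diagnosis): return {"patch": "..."}
def sandbox_passes(fix): return True          # Docker sandbox + real test suite
def run_code_review(fix): return {"notes": []}
def needs_human_approval(severity): return severity in ("P0", "P1")  # assumed HIGH/CRITICAL mapping
def await_human_approval(fix): pass
def open_pull_request(fix, review): pass

def handle_incident(event):
    triage = run_triage(event)
    if triage.classification != "real":
        return  # noise and duplicates are dropped before any LLM-heavy work

    diagnosis = run_diagnosis(event, triage)

    for _ in range(MAX_SANDBOX_RETRIES):
        fix = generate_fix(diagnosis)
        if sandbox_passes(fix):
            break
    else:
        return  # a fix that never passes tests never becomes a PR

    review = run_code_review(fix)
    if needs_human_approval(triage.severity):
        await_human_approval(fix)  # HIGH/CRITICAL stops here until a human signs off
    open_pull_request(fix, review)
    # MonitorGenerationAgent runs later, on PR merge, not in this request path.
```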
## Deployment topology
Cloudflare-fronted ALB sits in front of ECS Fargate. Postgres + pgvector is the source of truth for incidents and the RAG index. CI/CD is OIDC — no secrets in the repo.
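For concreteness, a hedged sketch of what the Postgres side might look like with SQLAlchemy Core and pgvector. Table names, columns, and the embedding dimension are assumptions, not the repo's actual schema.

```python
from sqlalchemy import Column, DateTime, Integer, MetaData, Table, Text, func
from pgvector.sqlalchemy import Vector

metadata = MetaData()

# Source of truth for incident state.
incidents = Table(
    "incidents", metadata,
    Column("id", Integer, primary_key=True),
    Column("severity", Text, nullable=False),   # P0–P3
    Column("status", Text, nullable=False),     # triage / diagnosis / fix / ...
    Column("created_at", DateTime, server_default=func.now()),
)

# RAG index: one row per embedded chunk of past-incident / code context.
rag_chunks = Table(
    "rag_chunks", metadata,
    Column("id", Integer, primary_key=True),
    Column("incident_id", Integer),
    Column("content", Text, nullable=False),
    Column("embedding", Vector(1536)),  # dimension is an assumption
)
```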
## The agents
### TriageAgent
ReAct + typed wrapper. Classifies events as real / noise / duplicate. P0–P3 severity. Haiku for cost. (Output contract sketched below.)
### DiagnosisAgent
ReAct + grounding guards. Grounds every claim against the live repo via GitHub Code Search.
### FixGenerationAgent
Direct LLM + sandbox loop. Writes the patch. Runs it in a Docker sandbox. Retries up to 3× on test failure.
### CodeReviewAgent
Direct LLM. Self-critique pass before opening the PR. Catches obvious regressions.
### MonitorGenerationAgent
ReAct + dry-run tools. On PR merge, generates new CloudWatch alarms for new code paths.
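The typed wrapper is easiest to show as a contract. A minimal sketch of what TriageAgent's output schema could look like in Pydantic v2 (which the stack below lists); the class and field names are illustrative, not the repo's actual models.

```python
from enum import Enum
from pydantic import BaseModel, Field

class Classification(str, Enum):
    REAL = "real"
    NOISE = "noise"
    DUPLICATE = "duplicate"

class Severity(str, Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"

class TriageResult(BaseModel):
    classification: Classification
    severity: Severity
    rationale: str = Field(description="One-line justification for downstream agents")

# The wrapper validates raw LLM output against the schema, so malformed JSON
# or out-of-range values never cross an agent handoff.
raw = '{"classification": "real", "severity": "P1", "rationale": "new stack trace, no open incident matches"}'
result = TriageResult.model_validate_json(raw)
```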
## How the metrics are computed
- Time-to-first-PR — wall-clock from CloudWatch alarm webhook receipt to GitHub PR creation. Decomposed in `scripts/measure_mttr.py` into agent-bound (LLM + tool latency) vs human-bound (approval queue) so the agent number is attributable (a sketch of the decomposition follows this list).
- Triage accuracy — TriageAgent run against `app/evals/golden_dataset.jsonl` (100+ labelled cases: real/noise/duplicate × P0–P3). The eval runner isolates the agent from live AWS via stubs.
- False-positive rate — fraction of incoming events classified as real that turned out to be noise on human review. Captured into the golden dataset so the next eval has a higher bar.
- Sample size — production incidents in `agent_platform.db` from the live deploy. Refreshed whenever `measure_mttr.py` is re-run.
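A minimal sketch of the agent-bound vs human-bound split that `scripts/measure_mttr.py` is described as computing, assuming each incident carries these timestamps; the field names are illustrative, not the script's actual schema.

```python
from datetime import datetime, timedelta

def decompose_mttr(alarm_received: datetime,
                   pr_opened: datetime,
                   approval_requested: datetime | None = None,
                   approval_granted: datetime | None = None) -> dict[str, timedelta]:
    """Split wall-clock time-to-first-PR into agent-bound and human-bound parts."""
    total = pr_opened - alarm_received
    # Incidents below the HIGH/CRITICAL threshold never enter the approval
    # queue, so their human-bound component is zero by construction.
    human = (approval_granted - approval_requested) \
        if approval_requested and approval_granted else timedelta(0)
    return {"total": total, "agent_bound": total - human, "human_bound": human}
```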
## Tech stack

| Layer | Choice |
| --- | --- |
| Agent runtime | Anthropic SDK · Claude Opus 4.6 / Sonnet 4.6 / Haiku 4.5 |
| Web framework | FastAPI · WebSocket streaming · Pydantic v2 |
| Storage | Postgres (SQLAlchemy Core) · pgvector for RAG |
| Sandbox | Docker Compose · Jest · mongodb-memory-server |
| Tracing | Langfuse — every LLM call + tool execution as nested spans |
| Resilience | Circuit breakers · schema validation at handoffs · context checkpointing |
| Deploy | AWS ECS Fargate · ALB · Cloudflare · GitHub Actions OIDC |
| Frontend | React + Vite · WebSocket dashboard · served from same container |
## Design decisions worth defending
- Push, not poll. CloudWatch alarms → SNS → HTTPS webhook. Zero polling load on the production server. Detection latency drops from ~7 days (human attention) to single-digit minutes. (Handler sketched after this list.)
- RAG finds candidates. The live store confirms truth. ChromaDB (and now pgvector) holds index-time snapshots; current incident state lives in Postgres. Blocking decisions always re-read the live store — the post explains why.
- Hard blocks are deterministic, soft hints are LLM-shaped. Dedup is a hard block (drop the event); regression context is a soft hint (injected into the prompt). Mixing the two created a four-failure-mode bug — walked through here.
- Ground every symbol against the repo. DiagnosisAgent runs `verify_symbol_in_repo` via GitHub Code Search; a server-side guard re-checks every named function in the parsed output and rejects fabricated camelCase identifiers. (Guard sketched after this list.)
- Sandbox before PR. Every fix runs in a Docker container against the real test suite. If tests fail, regenerate up to 3× before opening any GitHub noise.
- Approval gate for HIGH/CRITICAL. Configurable risk threshold; rejections are logged as RLHF preference pairs.
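Two of these decisions are easier to see in code. First, the push path: a minimal sketch of the SNS webhook handler, assuming FastAPI (which the stack lists). The route and `enqueue_incident` are hypothetical; the envelope fields (`Type`, `Message`, `SubscribeURL`) are standard SNS webhook behavior.

```python
import json
import urllib.request

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhooks/cloudwatch")  # hypothetical route
async def sns_webhook(request: Request):
    envelope = json.loads(await request.body())
    if envelope.get("Type") == "SubscriptionConfirmation":
        # One-time handshake: SNS delivers nothing until the URL is confirmed.
        urllib.request.urlopen(envelope["SubscribeURL"])
        return {"status": "confirmed"}
    alarm = json.loads(envelope["Message"])  # the CloudWatch alarm payload
    enqueue_incident(alarm)                  # hand off to TriageAgent and return fast
    return {"status": "accepted"}

def enqueue_incident(alarm: dict) -> None:
    ...  # stub: the real service kicks off the pipeline asynchronously
```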
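Second, the grounding guard: a hedged sketch of the server-side re-check that rejects fabricated identifiers. The regex and the `symbol_exists` callback are illustrative; in the real pipeline the lookup is `verify_symbol_in_repo` backed by GitHub Code Search.

```python
import re
from typing import Callable

# Matches camelCase identifiers such as getUserById or parseStackTrace.
CAMEL_CASE = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]*)+\b")

def guard_diagnosis(diagnosis_text: str, symbol_exists: Callable[[str], bool]) -> None:
    """Raise if the diagnosis names any function the repo search can't find."""
    for symbol in sorted(set(CAMEL_CASE.findall(diagnosis_text))):
        if not symbol_exists(symbol):
            raise ValueError(f"fabricated symbol in diagnosis: {symbol}")

# Usage: guard_diagnosis(parsed_output, symbol_exists=lambda s: s in repo_index)
```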