agent-platform

A 5-agent pipeline that detects production errors, grounds the diagnosis against the actual repo, generates a patch, validates it in a Docker sandbox, and opens a PR — all before anyone gets paged. Designed around three constraints: zero polling load on production, no fabricated symbols in fixes, and human approval for anything HIGH/CRITICAL.

Pipeline

Every production error flows through this graph. Sandbox failure regenerates the fix up to 3× before any GitHub noise. HIGH/CRITICAL incidents stop at the approval gate.

flowchart TB
    A([CloudWatch alarm]) --> B[SNS topic]
    B --> C[/webhooks/cloudwatch-alarm/]
    C --> D{Dedup gate}
    D -- duplicate --> X([drop])
    D -- new --> E[TriageAgent · Haiku]
    E -- noise --> X
    E -- duplicate --> X
    E -- real, P0–P3 --> F[DiagnosisAgent · Sonnet]
    F --> G[FixGenerationAgent]
    G --> H{Sandbox<br/>npm test}
    H -- fail · retry 3× --> G
    H -- pass --> I[CodeReviewAgent<br/>self-critique]
    I --> J[Open GitHub PR]
    J --> K{Risk}
    K -- LOW/MED --> M[Auto-merge]
    K -- HIGH/CRIT --> L[Human approval]
    L --> M
    M --> N[MonitorGenerationAgent<br/>new alarms for changed code]
    classDef event fill:#0b0d10,stroke:#2f343b,color:#e6e8eb
    classDef agent fill:#13161a,stroke:#6ee7b7,color:#e6e8eb
    classDef gate fill:#0b0d10,stroke:#23272d,color:#9aa1a9
    class A,X event
    class E,F,G,I,N agent
    class D,H,K gate
End-to-end pipeline — from CloudWatch alarm to merged fix PR.
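
The risk branch at the end of the graph can be sketched as a pure function. This is a minimal illustration, not the project's code: the `Risk` enum, the `next_step` name, and the returned step labels are all hypothetical, and the default threshold of HIGH simply mirrors the diagram.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Hypothetical risk levels, ordered so they can be compared."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def next_step(risk: Risk, approval_threshold: Risk = Risk.HIGH) -> str:
    """Return the step that follows an opened PR.

    Anything at or above the threshold waits for a human;
    everything below it is merged automatically.
    """
    if risk >= approval_threshold:
        return "human_approval"
    return "auto_merge"
```

Keeping the routing deterministic and separate from any model call means the threshold can be changed in configuration without touching a prompt.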

Deployment topology

A Cloudflare-fronted ALB sits in front of ECS Fargate. Postgres + pgvector is the source of truth for incidents and the RAG index. CI/CD authenticates through GitHub Actions OIDC, so no secrets live in the repo.

flowchart LR
    U([User · Interviewer]) --> CF[Cloudflare<br/>edge]
    CF --> ALB[AWS ALB<br/>:443]
    ALB --> ECS[ECS Fargate task<br/>FastAPI + Uvicorn :8000]
    ECS --> PG[(Postgres<br/>+ pgvector)]
    ECS --> S3[(S3<br/>artefacts)]
    GH[GitHub Actions OIDC] -. on push to main .-> ECR[ECR]
    ECR -. ECS pull .-> ECS
    SNS[SNS topic] --> WH[/webhook handler/]
    CW[CloudWatch alarms] --> SNS
    classDef ext fill:#0b0d10,stroke:#23272d,color:#9aa1a9
    classDef compute fill:#13161a,stroke:#6ee7b7,color:#e6e8eb
    classDef store fill:#13161a,stroke:#2f343b,color:#e6e8eb
    class U,CF,GH ext
    class ALB,ECS,WH compute
    class PG,S3,ECR,SNS,CW store
AWS ECS Fargate · Cloudflare · GitHub Actions OIDC.
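
The SNS-to-webhook edge delivers a JSON envelope whose `Message` field is itself a JSON string holding the CloudWatch alarm payload. The sketch below shows one way the handler might unwrap it. The `AlarmEvent` type and `parse_sns_body` function are hypothetical names; the envelope and alarm field names (`Type`, `Message`, `AlarmName`, `NewStateValue`, `NewStateReason`) are the ones SNS and CloudWatch send. A real handler would also need to verify the SNS signature and confirm the subscription, both of which are omitted here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlarmEvent:
    """Hypothetical internal shape for an alarm that needs triage."""
    alarm_name: str
    state: str
    reason: str


def parse_sns_body(raw: str) -> Optional[AlarmEvent]:
    """Unwrap an SNS HTTP delivery; return None for anything that is not a firing alarm."""
    envelope = json.loads(raw)
    # SubscriptionConfirmation and UnsubscribeConfirmation carry no alarm data.
    if envelope.get("Type") != "Notification":
        return None
    # The CloudWatch payload arrives as a JSON string inside the envelope.
    message = json.loads(envelope["Message"])
    # Ignore transitions back to OK or INSUFFICIENT_DATA.
    if message.get("NewStateValue") != "ALARM":
        return None
    return AlarmEvent(
        alarm_name=message["AlarmName"],
        state=message["NewStateValue"],
        reason=message.get("NewStateReason", ""),
    )
```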

The agents

TriageAgent

ReAct + typed wrapper

Classifies events as real / noise / duplicate. P0–P3 severity. Haiku for cost.
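
A typed wrapper around the model's answer might look like the sketch below. It assumes the model is asked to reply with a JSON object containing `verdict` and `severity` keys; that contract, and the names `Verdict`, `Severity`, `TriageResult`, and `parse_triage`, are illustrative rather than taken from the project. The stack lists Pydantic v2, but the sketch uses only the standard library so it runs on its own.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(str, Enum):
    REAL = "real"
    NOISE = "noise"
    DUPLICATE = "duplicate"


class Severity(str, Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


@dataclass(frozen=True)
class TriageResult:
    verdict: Verdict
    severity: Optional[Severity]  # only set for real incidents


def parse_triage(raw: str) -> TriageResult:
    """Turn raw model output into a typed result, failing loudly on bad values."""
    data = json.loads(raw)
    verdict = Verdict(data["verdict"])  # ValueError on an unknown verdict
    if verdict is Verdict.REAL:
        return TriageResult(verdict, Severity(data["severity"]))
    # Noise and duplicates are dropped downstream, so severity is ignored.
    return TriageResult(verdict, None)
```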

DiagnosisAgent

ReAct + grounding guards

Grounds every claim against the live repo via GitHub Code Search.
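
One possible shape for the grounding guard is sketched here. It pulls camelCase identifiers out of the diagnosis text and reports the ones a lookup cannot find. The regex, the `ungrounded_symbols` name, and the injected `exists_in_repo` callable are assumptions; in the real system that callable would presumably be backed by GitHub Code Search, which is not called in this example.

```python
import re
from typing import Callable, List

# Lowercase start followed by one or more capitalised segments, e.g. handleCheckout.
CAMEL_CASE = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b")


def ungrounded_symbols(
    diagnosis: str,
    exists_in_repo: Callable[[str], bool],
) -> List[str]:
    """Return identifiers named in the diagnosis that the repo lookup cannot confirm."""
    seen: List[str] = []
    for name in CAMEL_CASE.findall(diagnosis):
        if name not in seen:  # keep first-seen order, check each name once
            seen.append(name)
    return [name for name in seen if not exists_in_repo(name)]


# Usage with an in-memory stand-in for the repo index.
known = {"handleCheckout"}
missing = ungrounded_symbols(
    "The bug is in parseOrderTotal, called from handleCheckout.",
    lambda name: name in known,
)
```

A non-empty result would be grounds to reject the diagnosis rather than pass a fabricated name on to fix generation.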

FixGenerationAgent

Direct LLM + sandbox loop

Writes the patch. Runs it in a Docker sandbox. Regenerates up to 3× on test failure.
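
The generate-then-validate loop can be sketched with the model call and the sandbox run injected as callables. Everything named here is hypothetical, and the sketch assumes "3×" means up to three regenerations after the first attempt. It also assumes the failing test output is fed back into the next generation, which the page does not state.

```python
from typing import Callable, Optional, Tuple

MAX_REGENERATIONS = 3  # assumption: retries allowed after the first attempt


def generate_validated_fix(
    generate_fix: Callable[[Optional[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
) -> Optional[str]:
    """Return the first patch that passes the sandbox, or None if none does.

    generate_fix receives the previous test output (None on the first try).
    run_tests returns (passed, captured_output) for a candidate patch.
    """
    feedback: Optional[str] = None
    for _ in range(1 + MAX_REGENERATIONS):
        patch = generate_fix(feedback)
        passed, output = run_tests(patch)
        if passed:
            return patch
        feedback = output  # give the next attempt something to work with
    return None  # caller should stop here without opening a PR
```

Returning `None` instead of the last failing patch keeps the "no GitHub noise" rule enforceable by the caller.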

CodeReviewAgent

Direct LLM

Self-critique pass before opening the PR. Catches obvious regressions.

MonitorGenerationAgent

ReAct + dry-run tools

On PR merge, generates new CloudWatch alarms for new code paths.

Tech stack

Agent runtime: Anthropic SDK · Claude Opus 4.6 / Sonnet 4.6 / Haiku 4.5
Web framework: FastAPI · WebSocket streaming · Pydantic v2
Storage: Postgres (SQLAlchemy Core) · pgvector for RAG
Sandbox: Docker Compose · Jest · mongodb-memory-server
Tracing: Langfuse (every LLM call + tool execution as nested spans)
Resilience: Circuit breakers · schema validation at handoffs · context checkpointing
Deploy: AWS ECS Fargate · ALB · Cloudflare · GitHub Actions OIDC
Frontend: React + Vite · WebSocket dashboard · served from same container

Design decisions worth defending

  1. Push, not poll. CloudWatch alarms → SNS → HTTPS webhook. Zero polling load on the production server. Detection latency drops from ~7d (human attention) to single-digit minutes.
  2. RAG finds candidates. The live store confirms truth. ChromaDB (and now pgvector) holds index-time snapshots; current incident state lives in Postgres. Blocking decisions always re-read the live store — the post explains why.
  3. Hard blocks are deterministic, soft hints are LLM-shaped. Dedup is a hard block (drop the event); regression context is a soft hint (extra context injected into the prompt). Mixing the two created a four-failure-mode bug, walked through here.
  4. Ground every symbol against the repo. DiagnosisAgent runs verify_symbol_in_repo via GitHub Code Search; a server-side guard re-checks every named function in the parsed output and rejects fabricated camelCase identifiers.
  5. Sandbox before PR. Every fix runs in a Docker container against the real test suite. If tests fail, regenerate up to 3× before creating any GitHub noise.
  6. Approval gate for HIGH/CRITICAL. Configurable risk threshold; rejections are logged as RLHF preference pairs.
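
The split between hard blocks and soft hints in decisions 2 and 3 can be illustrated as follows. This is a sketch under stated assumptions: `LiveStore` is an in-memory stand-in for the Postgres incident table, the fingerprint scheme is left to the caller, and `dedup_gate`, `build_prompt`, and `handle_event` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class LiveStore:
    """Stand-in for the live incident table (the source of truth)."""
    open_fingerprints: Set[str] = field(default_factory=set)


def dedup_gate(fingerprint: str, store: LiveStore) -> bool:
    """Hard block: a deterministic check against live state. True means drop."""
    return fingerprint in store.open_fingerprints


def build_prompt(error_text: str, similar_incidents: List[str]) -> str:
    """Soft hint: retrieved incidents only add context; they never decide anything."""
    lines = [f"Error: {error_text}"]
    if similar_incidents:
        lines.append("Possibly related past incidents (unverified):")
        lines.extend(f"- {item}" for item in similar_incidents)
    return "\n".join(lines)


def handle_event(
    fingerprint: str,
    error_text: str,
    store: LiveStore,
    similar_incidents: List[str],
) -> Optional[str]:
    """Return the prompt for a new incident, or None if the event is dropped."""
    if dedup_gate(fingerprint, store):
        return None
    store.open_fingerprints.add(fingerprint)
    return build_prompt(error_text, similar_incidents)
```

Retrieval results from the vector index only ever reach `build_prompt`; the drop decision reads the live store alone, so a stale snapshot cannot suppress a real incident.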