Engineering blog

Notes from building Remediate Labs — agent architecture, retrieval design, debugging stories, and the unglamorous decisions that make production agents actually work.

All agents architecture code-graph cost debugging rag testing

Series

Code Graph in Production

2 parts

Code RAG in Production

11 parts

Agent Cost Engineering

1 part

RAG Learnings

2 parts

Agent Debugging

1 part

All posts

Jun 21, 2026 · 9 min

We Built a Call Graph Because Our Agent Kept Breaking Callers It Never Knew About

The agent found one caller via GitHub search, patched it, and shipped. Two other callers broke in production. The fix was correct. The picture was incomplete.

code-graph agents architecture

Jun 21, 2026 · 10 min

Five Data Structures for a Call Graph — Why We Chose Hash Map of Sets

The adjacency matrix sounds clever. It's wrong for sparse graphs. The edge list is right for serialization and wrong for queries. Here's what each actually costs.

code-graph agents architecture

May 28, 2026 · 8 min

Cross-Encoder Re-Ranking — From Top-3 to Rank 1

The vocabulary-gap query sits at rank 3. Five functions have identical lexical scores, and vector search can't break the tie. Cross-encoder re-ranking scores the query and document together and pushes the correct function to rank 1.

rag architecture debugging

May 26, 2026 · 9 min

HyDE — Querying With Hypothetical Code Instead of the Error Message

HyDE pushed the code-vocabulary queries from 0.77 to 0.84. The incident-vocabulary query got worse: rank 3 → rank 5. The hypothetical went in the wrong direction.

rag architecture debugging

May 26, 2026 · 13 min

Token Cost Engineering in Agent Loops — Prompt Caching and State Pruning

Output tokens cost 5× more than input. In agent loops, both compound. Two targeted fixes — prompt caching and state pruning — cut per-incident cost by ~25%.

agents cost architecture

May 19, 2026 · 14 min

Testing and Observability for Code RAG

Two ways to know your RAG is working: a recall harness you run before you ship, and chunk-level tracing that shows what was actually retrieved in production. Neither replaces the other.

rag architecture debugging testing

May 12, 2026 · 8 min

Alias-Based Deployment — Zero-Downtime Index Rebuilds

Without an alias swap, a full rebuild puts your index in a broken state for its entire duration. Queries return a mix of old and new results that never existed as a coherent snapshot.

rag architecture debugging

May 5, 2026 · 10 min

When to Re-chunk, Re-index, and Re-embed

Three operations that sound similar. They are not. Each has different triggers, different costs, and different consequences if you get them wrong.

rag architecture debugging

May 2, 2026 · 8 min

Why the Same Bug Kept Creating New Incidents (And What That Taught Me About RAG)

Three layers of dedup. Four independent failure modes that all had to fire simultaneously. The compound bug that exposed them, and the principle that keeps it from happening again.

debugging rag agents

May 2, 2026 · 7 min

RAG Finds the Candidate. The Live Store Confirms the Truth.

A search index can't be a source of truth — that's a category error, not just a bug. The general rule for any vector index sitting in front of a live data store.

architecture rag agents

May 2, 2026 · 8 min

Why My AI Agent Kept Adding Null Checks Instead of Fixing the Bug

Five PRs to teach a fix-generation pipeline that the crash site is almost never the fix site. The producer/consumer distinction, RAG's structural blind spot, and what it took to find the actual bug.

debugging agents rag

Apr 28, 2026 · 6 min

Document Registry — Keeping the Index Honest

Upsert is insert-or-update by ID. It doesn't delete old chunks when a function is refactored. Without a registry, stale vectors accumulate silently until your LLM is reading code that was deleted six months ago.

rag architecture debugging

Apr 21, 2026 · 8 min

Hybrid Search — Closing the Vocabulary Gap

The error query went from rank 7 to rank 3. The lexical bonus is 0.24 — larger than the 0.20 vector contribution. For this query, lexical search is doing most of the work.

rag architecture debugging

Apr 14, 2026 · 9 min

Closing the Vocabulary Gap with LLM-Generated Descriptions

The error query went from not found to rank 7. Here's why rank 7 still isn't good enough — and what the score reveals about the trade-off between code vocabulary and incident vocabulary.

rag architecture debugging

Apr 7, 2026 · 8 min

Fixing the Chunking Split with Function-Boundary Chunks

The function name query score jumps from 0.56 to 0.71. The natural language query goes from rank 3 to rank 1. Here's what changes when each function gets its own chunk.

rag architecture debugging

Mar 31, 2026 · 7 min

Search Quality — Two Failure Modes and Why They're Different

One failure is a chunking problem. The other is a vocabulary gap that better chunking can't fix. Here's what the scores reveal about code RAG's limits.

rag architecture debugging

Mar 24, 2026 · 8 min

What Actually Gets Indexed

A 1026-line file produces 26 chunks. Most boundaries fall mid-function, with no awareness of code structure. Here's exactly what gets lost — and why it matters for search quality.

rag architecture debugging