I Built a Code-Analysis Tool to Make LLMs Better at Programming. It Didn’t Work. What I Found Instead Was Better.

TL;DR

I built stitchgraph, a local-first code-intelligence engine, on the thesis that a pre-built structural graph of a codebase would make LLM coding agents faster and better. Benchmarked honestly, it made no difference, with two exceptions that only emerged in the field: codebases too big to read, and type-resolved precision from real language servers. The part that survived untouched is the part that never competed with reading: runtime behavioural analysis, which told me my 2,350-test suite exercises only 27 independent behaviours, caught real defects in stitchgraph’s own source four times, and held up against Home Assistant and Django.

Background

stitchgraph is a local-first code-intelligence engine. Point it at a codebase and it answers questions: what’s dead, what breaks if I change this, which tests should I run. It works across 12 languages and attaches a confidence score to every answer. It’s on PyPI and GitHub, MIT-licensed.

This post is the story of building it, benchmarking it honestly, and what happened when I pointed it at its own source code, and then at code that owes me nothing.

The Null Result

The thesis fits in one sentence: LLM coding agents waste effort rediscovering structure. Give an agent a pre-built graph of the codebase (every definition, every call, every route, every SQL table, queryable in milliseconds) and it should code faster and better.

So I benchmarked it. Agent with the tool versus agent without, on real tasks.

They performed exactly the same.

I want to lead with that, because most tool announcements don’t, and because why it tied turned out to be the most useful thing the project taught me. A capable model can just read. Every answer the static graph gives (who calls this, where is that defined, what’s reachable from here) is information the agent can recover with grep and a few file reads. The tool compresses seconds, not capability.

Worse, in a way that’s to its credit: stitchgraph attaches honest confidence to every answer. When it says “0.6 confidence, name-based resolution, verify before acting”, a competent agent goes and reads the code anyway. Calibrated honesty made the tool trustworthy by making it non-load-bearing.

One capability broke the pattern, and it’s the one that doesn’t compete with reading at all.

Two Places the Tie Breaks

The months since I ran that benchmark added two honest qualifications.

Scale. “The agent can just read” is true at ten thousand lines and false at Home Assistant’s size: 6,728 files, 59k definitions, 16 million resolved edges. Nothing reads that. Transitive questions (what ultimately depends on this function, what does this suite actually reach) stop being greppable long before that point, and the graph answers them in seconds; the 12 MB memory-mapped sidecar builds in 2.5 seconds, and a whole-graph traversal over all 59k nodes runs in about 2 seconds, where a pure-Python reference sweep took 46 seconds. The static side doesn’t beat reading; it outlives it.
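To make "transitive question" concrete, here is a minimal sketch of the "what ultimately depends on this function" query as a breadth-first walk over reversed call edges. It is an illustration only, not stitchgraph's implementation: the edge list, the function names, and `transitive_dependents` itself are hypothetical.

```python
from collections import defaultdict, deque


def transitive_dependents(call_edges, target):
    """Return every function that reaches `target` through one or more calls."""
    # Reverse the caller -> callee edges so the walk goes "upwards".
    callers_of = defaultdict(set)
    for caller, callee in call_edges:
        callers_of[callee].add(caller)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in callers_of[node]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical toy call graph; the names are illustrative only.
edges = [
    ("cli_main", "load_config"),
    ("run_server", "load_config"),
    ("load_config", "parse_yaml"),
    ("parse_yaml", "read_file"),
    ("unrelated", "helper"),
]
dependents = transitive_dependents(edges, "read_file")
print(sorted(dependents))
```

On five edges a grep does the same job; the point of the sketch is that the answer is a closure over the whole edge set, which is what stops being recoverable by reading once the edge count reaches the millions.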

Precision the reader can’t get. The graph now drives real language servers (typescript-language-server, rust-analyzer, gopls, clangd) over its own call sites and upgrades name-guesses to type-resolved edges: +497 confident edges on hono, +147 on fd, each hand-verified (research log). That’s not information an agent recovers with grep either; it’s information the compiler’s machinery has and a reader approximates.

The core lesson survives both updates: everything that competes with reading ties on codebases a model can hold, and the durable value sits where reading was never the competition.

The Part That Worked: Measuring What Code Does

The behavioural toolkit doesn’t analyze source. It analyzes an execution record: a matrix of which test executed which function, captured by running your own test suite under coverage. The tool never runs your code; it generates a sandboxed capture kit and reads the inert result.

Then it does something borrowed from fluid dynamics rather than software engineering: proper orthogonal decomposition (POD, a mean-centred SVD) of that matrix. In fluids, POD extracts the dominant modes of a turbulent flow. Here, the singular vectors are the behavioural modes of your test suite (sets of functions that fire together), and the spectrum tells you something no amount of reading can: how many independent behaviours your suite actually exercises.
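As a rough sketch of the idea, assuming NumPy and a 0/1 tests-by-functions matrix, the decomposition can be written in a few lines. The 99% energy cutoff used here to turn the spectrum into a single number is an assumption made for illustration; the post does not say which criterion stitchgraph applies, and the toy matrix is invented.

```python
import numpy as np


def behavioural_modes(coverage, energy=0.99):
    """POD of a tests-by-functions matrix via a mean-centred SVD.

    Returns (k, singular_values, modes), where k is the number of modes
    needed to capture `energy` of the variance (an assumed criterion).
    """
    X = np.asarray(coverage, dtype=float)
    X = X - X.mean(axis=0)  # centre each function column
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    variance = s ** 2
    total = variance.sum()
    if total == 0:
        return 0, s, Vt  # every test has the same profile: no modes
    cumulative = np.cumsum(variance) / total
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s)), s, Vt


# Toy suite: column 0 is setup code every test runs, columns 1-2 fire
# together (behaviour A), columns 3-4 fire together (behaviour B).
# Each profile appears twice, mimicking parametrized tests.
coverage = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
k, singular_values, modes = behavioural_modes(coverage)
print(k)  # 2: eight tests, five functions, two independent behaviours
```

The rows of `modes` are the right singular vectors, so the functions with large weights in a row are the ones that fire together. The always-executed setup column drops out after centring, which is why it contributes no mode of its own.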

On stitchgraph’s own suite of 2,350 tests, 940 logical test rows, and 754 executed functions, the answer:

27 independent behaviours. That’s the suite’s intrinsic dimensionality. Not 2,350. Twenty-seven.
64 tests cover every executed function. The other ~97% add redundant coverage, often legitimately (parametrized cases share coverage profiles while testing different data), which is why the tool flags them for review and never auto-deletes.
The modes are legible: the top one is the entry-point machinery, the second is polyglot extraction, and so on down. It’s the suite’s runtime architecture, recovered unsupervised.
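The covering-set number can be approximated with the classic greedy heuristic: keep picking the test that covers the most still-uncovered functions. This is a sketch under that assumption (greedy gives a small cover, not a provably minimal one), and the test and function names are made up.

```python
def greedy_cover(coverage):
    """Pick tests until every executed function is covered.

    coverage: dict mapping test name -> set of executed function names.
    """
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered:
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(coverage), key=lambda t: len(coverage[t] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


# Hypothetical coverage profiles.
coverage = {
    "test_parse_a": {"parse", "tokenize"},
    "test_parse_b": {"parse", "tokenize"},
    "test_emit": {"emit", "format"},
    "test_end_to_end": {"parse", "tokenize", "emit"},
}
cover = greedy_cover(coverage)
print(cover)  # ['test_end_to_end', 'test_emit']
```

The two parse tests are absent from the cover because they add no new functions, yet they may well check different inputs. That is the reason to flag such tests for review instead of deleting them.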

Around that core sit the practical operations:

select_tests: which tests should this PR run
find_gaps: which live functions does no test execute
find_coupling: what co-runs with no static connection between them (this found a real hidden config-to-envelope side-channel in my own code, blind)
coverage_drift: what gained or lost test exposure between two versions
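The first three operations above reduce to small set computations over the same coverage record. The sketch below is illustrative only: the function bodies, the Jaccard threshold, the restriction to direct static edges, and all names in the toy data are assumptions, not stitchgraph's actual logic.

```python
from itertools import combinations


def select_tests(coverage, changed_functions):
    """Tests whose execution touched any changed function."""
    changed = set(changed_functions)
    return sorted(t for t, funcs in coverage.items() if funcs & changed)


def find_gaps(coverage, live_functions):
    """Live functions that no test executed."""
    executed = set().union(*coverage.values())
    return sorted(set(live_functions) - executed)


def find_coupling(coverage, static_edges, min_jaccard=0.8):
    """Function pairs that co-run in tests but share no direct static edge."""
    tests_of = {}
    for test, funcs in coverage.items():
        for func in funcs:
            tests_of.setdefault(func, set()).add(test)
    linked = {frozenset(edge) for edge in static_edges}
    pairs = []
    for a, b in combinations(sorted(tests_of), 2):
        if frozenset((a, b)) in linked:
            continue
        shared = len(tests_of[a] & tests_of[b])
        either = len(tests_of[a] | tests_of[b])
        if shared / either >= min_jaccard:
            pairs.append((a, b))
    return pairs


# Hypothetical data: which test executed which function.
coverage = {
    "test_login": {"auth.check", "config.load", "envelope.wrap"},
    "test_logout": {"auth.clear", "config.load", "envelope.wrap"},
    "test_render": {"view.render"},
}
static_edges = [("auth.check", "config.load")]
live = ["auth.check", "auth.clear", "config.load",
        "envelope.wrap", "view.render", "view.legacy_render"]

to_run = select_tests(coverage, {"view.render"})
gaps = find_gaps(coverage, live)
coupled = find_coupling(coverage, static_edges)
print(to_run, gaps, coupled)
```

In the toy data `config.load` and `envelope.wrap` always run together although no call edge links them, which is the shape of finding the coupling operation is meant to surface.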

You cannot read your way to any of these numbers. That’s the line that mattered. Everything on the static side competed with an LLM’s ability to read, and tied. Everything on the runtime side is complementary to reading, for humans and agents alike.

The Evidence: Pointing It at Itself, Four Times

Claims about analysis tools are cheap. So here’s the strongest evidence I have: I ran the full battery on stitchgraph’s own source four times across its release history, and it caught real defects every time, including in code that had just been reviewed and gated.

Round one (research log), after v3.25.0, a release that had just absorbed a 24-finding external code review: the newest operation, runtime_risk, returned “no hotspots” on stitchgraph itself. Silently. The cause: coverage file-ids are relative to the indexed root, git churn paths to the repo root, and the join between them matched nothing on any src-layout project. Every gate had passed; the op just answered an empty question confidently. Dogfooding caught it in one run.
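The failure mode is easy to reproduce in miniature. In this sketch the paths, the numbers, and the helper names are hypothetical; it only shows how two path conventions make a join come back empty, and one possible guard that refuses to report an empty join over non-empty inputs as a clean result.

```python
from pathlib import PurePosixPath


def naive_join(coverage_by_file, churn_by_file):
    """Join on the raw path strings, as if both sides shared a root."""
    shared = coverage_by_file.keys() & churn_by_file.keys()
    return {p: (coverage_by_file[p], churn_by_file[p]) for p in shared}


def rebased_join(coverage_by_file, churn_by_file, indexed_root, repo_root):
    """Rebase coverage paths onto the repo root before joining."""
    prefix = PurePosixPath(indexed_root).relative_to(repo_root)
    rebased = {str(prefix / p): v for p, v in coverage_by_file.items()}
    joined = naive_join(rebased, churn_by_file)
    # Assumed guard: two non-empty inputs with zero overlap is a symptom,
    # not an answer.
    if coverage_by_file and churn_by_file and not joined:
        raise ValueError("join matched nothing: check path roots")
    return joined


# src-layout project: coverage is relative to <repo>/src, churn to <repo>.
coverage = {"pkg/core.py": 0.42, "pkg/io.py": 0.90}
churn = {"src/pkg/core.py": 17, "src/pkg/io.py": 3, "README.md": 8}

empty = naive_join(coverage, churn)
fixed = rebased_join(coverage, churn, "/work/proj/src", "/work/proj")
print(empty)  # {}
print(sorted(fixed))
```

Nothing about the empty dictionary looks like an error to a caller, which is why a downstream "no hotspots" can pass every gate.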

Round two (research log), after v3.27.0, which included a large deduplication refactor: find_stale flagged a function called parse_tree in the freshly refactored shared module. It was right. My refactor had added the shared helper but never wired the nine call sites to use it, and the linter then auto-removed the unused imports, hiding the slip.

Here’s the part worth dwelling on. The refactor was gated by a byte-identical output differential and a 1,618-test oracle battery, and both passed, because dead code has no outputs. Output-equivalence oracles prove a refactor changed nothing, including that a helper changed nothing because nothing called it. Only a liveness view sees that. find_gaps then corroborated from the runtime side: its untested-dead list was exactly that function plus the one known advisory.
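A toy reproduction shows why both gates passed. The module below is hypothetical apart from the name `parse_tree`: the helper exists, nothing calls it, the output comparison is green, and only a record of which functions actually ran exposes the slip. The liveness view here uses `sys.setprofile` for brevity, which is an assumption of this sketch and not how find_stale or find_gaps work.

```python
import sys


def parse_tree(source):
    """Shared helper added by the refactor, never wired to a call site."""
    return source.split()


def parse_python(source):
    # Still carries its own copy of the logic instead of calling parse_tree.
    return source.split()


def parse_go(source):
    return source.split()


def run_suite():
    return [parse_python("def f"), parse_go("func f")]


def executed_functions(fn):
    """Run fn and record the names of Python functions that were called."""
    called = set()

    def profiler(frame, event, arg):
        if event == "call":
            called.add(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        result = fn()
    finally:
        sys.setprofile(None)
    return result, called


golden = [["def", "f"], ["func", "f"]]  # pre-refactor output
output, called = executed_functions(run_suite)
output_oracle_passes = output == golden
helper_is_live = "parse_tree" in called
print(output_oracle_passes, helper_is_live)  # True False
```

The output oracle cannot fail here by construction: a function that is never called contributes nothing to the output being compared.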

The refactor itself gave the runtime toolkit a controlled experiment. Before and after a ~400-line, nine-file deduplication:

Intrinsic dimensionality: 27 before, 27 after. The strongest runtime statement of “behaviour-preserving” I know how to make.
coverage_drift narrated the refactor from coverage alone. Functions that lost coverage were exactly the nine deleted per-language copies; functions that gained it were exactly the new shared module. A behavioural changelog, derived without reading the diff.
The static graph watched too. The duplicated frontends had shown up as their own 329-node cluster in find_subsystems (the graph had, in effect, recommended the refactor); after it, the cluster shrank and 25 duplicate-driven noise findings evaporated from scan.
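The "behavioural changelog" in the second point is, at its simplest, a set difference between two coverage captures. A minimal sketch, with hypothetical function names standing in for the per-language copies and the shared module:

```python
def coverage_drift(before, after):
    """Compare two test -> executed-functions captures."""
    executed_before = set().union(*before.values())
    executed_after = set().union(*after.values())
    return {
        "lost": sorted(executed_before - executed_after),
        "gained": sorted(executed_after - executed_before),
    }


# Hypothetical captures on either side of a deduplication refactor.
before = {
    "test_python_frontend": {"py_frontend.parse_tree", "emit"},
    "test_go_frontend": {"go_frontend.parse_tree", "emit"},
}
after = {
    "test_python_frontend": {"shared.parse_tree", "emit"},
    "test_go_frontend": {"shared.parse_tree", "emit"},
}
drift = coverage_drift(before, after)
print(drift)
```

If the shared helper had never been wired in, "gained" would be empty while "lost" stayed empty too, which is a visibly different changelog from the one a completed refactor produces.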

Round two found strictly less than round one, and the findings converged on a fixed point of one known, documented advisory. A tool for finding problems, run on itself until it has nothing left to say: that’s the closest thing to a self-certification the genre allows. Then the code kept moving.

Round three came nineteen releases later (research log): v3.46, installed from the released PyPI wheel and pointed at its own repo again. It caught three more real defects: a dead public function nothing had ever called, and two parameters threaded through helpers since their first version that no body ever read. It also taught a different lesson. The raw scan produced 435 findings, and hand-verifying them showed most were correct arithmetic answering the wrong question; a heavily-tested helper isn’t a “god object”, and a pytest fixture isn’t a “read this first” hub. Calibrating those judgments against the dogfood evidence took the finding count from 435 to 45 without suppressing a single verified true positive. Dogfooding doesn’t just find bugs; it calibrates taste.

Round four dropped the pretense of routine and went adversarial (research log): for v3.50 I wrote the bug-hunt prompt I’d want pointed at someone else’s project (seven failure classes, a confirmed-versus-plausible evidence bar, mandatory write-ups for suspicions that dissolve) and fed it to myself plus two parallel hunting agents. Fourteen confirmed bugs, including an embarrassing one: the file-watch path silently downgraded the analysis quality of every file you edited. A cluster of them shared a common cause: features field-validated on real corpora, but always on the machine that built them. Every fix shipped pinned by a test. The prompt is in the repo (docs/BUG_HUNT_PROMPT.md) if you want to run it against your own project, or mine.

Taking It to Strangers’ Code

Self-analysis has an obvious weakness: I know my own codebase. So the same battery went to two codebases that owe me nothing, with a rule: no number leaves the run without hand-verification against the source.

Home Assistant (6,728 files, 59k nodes, 16M edges) was the scale trial (field log, POD validation). Indexing end-to-end held 158 MB peak under a 4 GB ulimit. The interesting part was what the runtime side did to the static side. I captured real per-test coverage from HA’s helper suite (2,056 tests by 3,274 executed functions) and asked a question most static tools never submit to: of the functions the tests actually executed, what fraction does static reachability find? Three rounds later the answer was 0.991, but the three rounds are the story. Round one scored 0.975 while the graph was silently missing 880 files: a great number hiding a hole. Round two fixed the files, exposed the honest denominator, and scored 0.299. Round three stitched the cross-parser edges and earned the 0.991. Four indexer bugs died on the way. A validation harness that can only confirm success is not a validation harness.
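The metric itself is simple; the denominator is where it can mislead. The sketch below is one reading of how a recall number can hide a hole, under the assumption that functions living in files the indexer skipped were left out of the count. The graph, the entry points, and the function names are invented.

```python
from collections import defaultdict, deque


def reachable_from(call_edges, entry_points):
    """All nodes reachable from the entry points along caller -> callee edges."""
    callees_of = defaultdict(set)
    for caller, callee in call_edges:
        callees_of[caller].add(callee)
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        for callee in callees_of[node]:
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def reachability_recall(call_edges, entry_points, executed, known_only=False):
    """Fraction of runtime-executed functions that static reachability finds.

    known_only=True drops executed functions the graph has never heard of,
    which is the flattering denominator.
    """
    executed = set(executed)
    if known_only:
        nodes = {n for edge in call_edges for n in edge}
        executed &= nodes
    if not executed:
        raise ValueError("no executed functions: recall is undefined")
    reachable = reachable_from(call_edges, entry_points)
    return len(executed & reachable) / len(executed)


# Hypothetical: g1 and g2 ran at test time but sit in unindexed files.
edges = [("test_a", "f1"), ("f1", "f2"), ("test_b", "f3")]
entries = ["test_a", "test_b"]
executed = {"f1", "f2", "f3", "g1", "g2"}

flattering = reachability_recall(edges, entries, executed, known_only=True)
honest = reachability_recall(edges, entries, executed)
print(flattering, honest)  # 1.0 0.6
```

Measured runtime truth is what makes the honest version possible: the executed set comes from the coverage capture, so the graph does not get to choose which functions it is graded on.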

The static battery also paid rent directly: a verified list of dead code in HA’s own utils. rgbww_to_color_temperature and its private helper, four of five deprecation.py helpers, and a legacy loader shim, each grep-verified to zero call sites in the shipped package.

Django 5.2.15 (2,873 files, 47k nodes) was the adversarial pick, one of the most-audited Python codebases alive (field log). The battery plus hand-verification produced one finding I’d take upstream: the 5.2 release notes say all SyndicationFeed classes support stylesheets, and Atom1Feed accepts the argument, then silently never writes it. find_stale flagged the base hook as uncalled; reading the flag’s reason (RSS calls its own override; Atom never calls the hook at all) turned a dead-code advisory into a behaviour bug with a three-line stdlib repro. That’s the workflow I now believe in: the tool proposes with calibrated confidence, the human disposes with the source open.

What I Accidentally Reinvented

I built stitchgraph largely without the academic literature, and when I finally mapped it against prior work, the honest answer is that much of the foundation replicates university research, independently. That’s worth stating plainly, for two reasons: you should know what’s new and what isn’t, and independent convergence is its own kind of evidence that these ideas are the natural ones.

The layered call, statement, and expression graph is the Code Property Graph (Yamaguchi et al., IEEE S&P 2014; the tool Joern), reinvented.
Name- and order-invariant clone detection over dependence graphs goes back to Krinke and Komondoor & Horwitz (2001); the Weisfeiler-Lehman kernel is Shervashidze et al. (2011).
The coverage matrix is what the literature calls program spectra; clustering execution profiles dates to Dickinson, Leon & Podgurski (2001). Greedy minimal test covers are Harrold, Gupta & Soffa (1993); test selection is a whole field (Ekstazi runs it in production). Feature location from test execution is Wilde & Scully (1995) and Eisenbarth et al. (2003).
Clustering call graphs for architecture recovery, PageRank for key classes, churn-based risk: all established.

Where I’d claim actual novelty, having looked:

The intrinsic-dimensionality framing. The field has pointed the coverage matrix at fault localization and suite reduction for twenty years. “Your suite exercises 27 independent behaviours; here they are, auto-labelled” as a first-class, developer-facing metric appears to be new. Standard math, standard matrix, new question.
Calibrated honesty as an API contract for agent consumers. Every answer carries confidence, provenance, and machine-actionable reasons to doubt it; provenance caps how loudly a finding may shout, and the tool refuses rather than guesses. Alarm-ranking research exists. Designing the interface around a consumer that will act on the answer without judgment is a problem the literature is only beginning to have.
The development-process record. stitchgraph was substantially built by LLMs under an adversarial multi-model review process: 280+ documented panel rounds with severity-weighted release gates, differential oracles pinning every risky path, an honesty ledger of negative results, four self-analysis rounds, an unprompted field review by a different frontier model whose findings shipped as fixes within two releases (docs/LLM_REVIEW.md), and an adversarial self-audit run with a published prompt. I don’t know of a comparably documented longitudinal record of LLM-driven development, including its failures, like the benchmark result this post opened with.
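To illustrate the calibrated-honesty contract from the second point above, here is a hypothetical answer envelope in which provenance caps confidence and an unknown provenance is refused. The field names, the tiers, and the cap values are all assumptions made for this sketch; they are not stitchgraph's actual schema.

```python
from dataclasses import dataclass

# Hypothetical provenance tiers and the loudest confidence each may claim.
PROVENANCE_CAP = {
    "type_resolved": 1.0,
    "name_based": 0.6,
    "heuristic": 0.4,
}


@dataclass(frozen=True)
class Answer:
    value: object
    confidence: float
    provenance: str
    doubts: tuple = ()


def make_answer(value, raw_confidence, provenance, doubts=()):
    """Build an envelope; refuse rather than guess on unknown provenance."""
    if provenance not in PROVENANCE_CAP:
        raise ValueError(f"unknown provenance: {provenance!r}")
    confidence = min(raw_confidence, PROVENANCE_CAP[provenance])
    return Answer(value, confidence, provenance, tuple(doubts))


answer = make_answer(
    ["handlers.on_save"],
    raw_confidence=0.95,
    provenance="name_based",
    doubts=["two definitions share this name"],
)
print(answer.confidence, answer.doubts)
```

The design choice worth noticing is that the cap is applied where the envelope is built, so a consumer that acts without judgment never sees a name-based guess dressed up as a type-resolved fact.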

What I’d Tell You to Actually Do with It

pip install 'stitchgraph[all]'
cd your-project
stitchgraph reindex . --db stitchgraph.db
stitchgraph report --db stitchgraph.db

The report command gives you orientation, issues, and risk on one page.

Then, for the part that will tell you something you don’t already know: generate the coverage kit with stitchgraph scaffold-coverage, run it in your own sandbox, and ask stitchgraph find-modes --coverage coverage_modes.json. The kit is turnkey for Python, Rust, Go, and JS/TS (details): a generated script runs your suite per-test and converts the result, no wiring. The Rust one ran fd’s 267 tests unedited (dimensionality 7, a covering set of 154). The first time you learn your five-thousand-test suite has a behavioural dimensionality of 30, and that 70 tests cover every function you actually execute, is the moment this project exists for.

Agents get the same operations over MCP (stitchgraph-mcp --db …), with the envelope telling them exactly how much to trust each answer. Everything is local, read-only on your source, and MIT-licensed.

Note: If you find something, or your dogfooding catches something mine didn’t, the issue tracker is open. Reviews, benchmarks, and bug hunts, including the unflattering ones, are how most of the releases above happened.

Summary

A pre-built static code graph did not improve LLM coding agents on real tasks; a capable model recovers the same answers by reading. Two field-tested exceptions: codebases too big to read (Home Assistant’s 16M edges, traversed in ~2 s), and type-resolved call edges from real language servers.
Runtime behavioural analysis is the part reading cannot replace: POD over the test-coverage matrix showed a 2,350-test suite exercising 27 independent behaviours, with 64 tests covering every executed function.
Four dogfood rounds caught real defects every time, from a silently empty join in freshly gated code to fourteen confirmed bugs in an adversarial self-audit whose prompt is published in the repo.
On strangers’ code the numbers held: 0.991 static-reachability recall against measured runtime truth on Home Assistant (after the harness exposed and killed four indexer bugs), verified dead code in HA’s utils, and a documented-but-unimplemented stylesheets behaviour in Django’s Atom1Feed.
Most of the static foundation independently replicates published research; the intrinsic-dimensionality metric, the calibrated-honesty API contract, and the documented LLM-driven development record are the parts that appear new.
Try it: pip install 'stitchgraph[all]', then stitchgraph find-modes on your own suite.
