I Built a Code-Analysis Tool to Make LLMs Better at Programming. It Didn’t Work. What I Found Instead Was Better.
TL;DR
I built stitchgraph, a local-first code-intelligence engine, on the thesis that a pre-built structural graph of a codebase would make LLM coding agents faster and better. Benchmarked honestly, it made no difference at all. The part that survived is the part that does not compete with reading code: runtime behavioural analysis, which told me my 2,350-test suite exercises only 27 independent behaviours, and which caught real defects in stitchgraph's own freshly reviewed source, twice.
Background
stitchgraph is a local-first code-intelligence engine. Point it at a codebase and it answers questions: what's dead, what breaks if I change this, which tests should I run. It works across 12 languages and attaches a confidence score to every answer. It's on PyPI and GitHub, MIT-licensed.
This post is the story of building it, benchmarking it honestly, and what happened when I pointed it at its own source code.
The Null Result
The thesis fits in one sentence: LLM coding agents waste effort rediscovering structure. Give an agent a pre-built graph of the codebase (every definition, every call, every route, every SQL table, queryable in milliseconds) and it should code faster and better.
So I benchmarked it. Agent with the tool versus agent without, on real tasks.
They performed exactly the same.
I want to lead with that, because most tool announcements don't, and because the reason it tied turned out to be the most useful thing the project taught me. A capable model can just read. Every answer the static graph gives (who calls this, where is that defined, what's reachable from here) is information the agent can recover with grep and a few file reads. The tool compresses seconds, not capability.
Worse, in a way that's to its credit: stitchgraph attaches honest confidence to every answer. When it says "0.6 confidence, name-based resolution, verify before acting", a competent agent goes and reads the code anyway. Calibrated honesty made the tool trustworthy by making it non-load-bearing.
One capability broke the pattern, and it's the one that doesn't compete with reading at all.
The Part That Worked: Measuring What Code Does
The behavioural toolkit doesn't analyze source. It analyzes an execution record: a matrix of which test executed which function, captured by running your own test suite under coverage. The tool never runs your code; it generates a sandboxed capture kit and reads the inert result.
Then it does something borrowed from fluid dynamics rather than software engineering: proper orthogonal decomposition (POD, a mean-centred SVD) of that matrix. In fluids, POD extracts the dominant modes of a turbulent flow. Here, the singular vectors are the behavioural modes of your test suite (sets of functions that fire together), and the spectrum tells you something no amount of reading can: how many independent behaviours your suite actually exercises.
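A minimal sketch of that decomposition, assuming NumPy and a small 0/1 coverage matrix with tests as rows and functions as columns. The function name is hypothetical, and the 95% energy threshold used to count modes is an assumption for illustration: the post does not say which cut-off stitchgraph applies.

```python
import numpy as np


def behavioural_modes(coverage: np.ndarray, energy: float = 0.95):
    """POD of a tests x functions coverage matrix.

    Returns (k, singular_values, modes), where k is the number of modes
    needed to capture `energy` of the total variance. The threshold is an
    illustrative assumption, not a documented stitchgraph rule.
    """
    # Mean-centre each function column so the modes describe how tests
    # differ from one another, not the average coverage profile.
    centred = coverage - coverage.mean(axis=0, keepdims=True)
    # Thin SVD: each row of vt is one mode (a weighting over functions).
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    total = float(np.sum(s**2))
    if total == 0.0:
        return 0, s, vt  # every test has the same coverage profile
    cumulative = np.cumsum(s**2) / total
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s)), s, vt


# Six tests in three groups, each group exercising its own pair of functions.
demo = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
], dtype=float)

k, singular_values, modes = behavioural_modes(demo)
print(k)  # 2: three groups leave two independent directions once centred
```

Six tests collapse to two modes here because duplicated coverage profiles add rows without adding directions, which is the same effect the real numbers below show at scale.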
On stitchgraph's own suite (2,350 tests, 940 logical test rows, 754 executed functions), the answer was:
- 27 independent behaviours. That's the suite's intrinsic dimensionality. Not 2,350. Twenty-seven.
- 64 tests cover every executed function. The other ~97% add redundant coverage, often legitimately (parametrized cases share coverage profiles while testing different data), which is why the tool flags them for review and never auto-deletes.
- The modes are legible: the top one is the entry-point machinery, the second is polyglot extraction, and so on down. It's the suite's runtime architecture, recovered unsupervised.
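The 64-test figure is a set-cover result. Here is a minimal sketch of a greedy cover, assuming coverage is available as a mapping from test name to the set of functions that test executed. The post does not say which algorithm stitchgraph uses, and every name below is hypothetical.

```python
def greedy_cover(coverage: dict[str, set[str]]) -> list[str]:
    """Pick tests greedily until every executed function is covered."""
    uncovered = set().union(*coverage.values())
    chosen: list[str] = []
    while uncovered:
        # Take the test covering the most still-uncovered functions.
        # Iterating in sorted order makes ties break on test name.
        best = max(sorted(coverage), key=lambda t: len(coverage[t] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


demo_coverage = {
    "test_parse[a]": {"parse", "lex"},
    "test_parse[b]": {"parse", "lex"},
    "test_emit": {"emit", "lex"},
    "test_cli": {"main", "parse", "emit", "lex"},
    "test_db": {"connect"},
}

print(greedy_cover(demo_coverage))  # ['test_cli', 'test_db']
```

The two parametrized `test_parse` cases fall outside the cover even though they may check different data, which is the reason to flag such tests for review instead of deleting them.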
Around that core sit the practical operations:
- select_tests: which tests should this PR run
- find_gaps: which live functions does no test execute
- find_coupling: what co-runs with no static connection between them (this found a real hidden config-to-envelope side-channel in my own code, blind)
- coverage_drift: what gained or lost test exposure between two versions
You cannot read your way to any of these numbers. That's the line that mattered. Everything on the static side competed with an LLM's ability to read, and tied. Everything on the runtime side is complementary to reading, for humans and agents alike.
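Two of those operations reduce to simple set queries once the execution record exists. This is a rough sketch, assuming the record is a mapping from test name to executed functions. The helper names are hypothetical and are not stitchgraph's API, which presumably handles far more (confidence, provenance, stale records).

```python
def select_tests_for(changed: set[str], coverage: dict[str, set[str]]) -> list[str]:
    """Tests whose recorded execution touched any changed function."""
    return sorted(t for t, funcs in coverage.items() if funcs & changed)


def untested(live: set[str], coverage: dict[str, set[str]]) -> set[str]:
    """Live functions that no test executed."""
    executed = set().union(*coverage.values())
    return live - executed


record = {
    "test_a": {"load", "parse"},
    "test_b": {"parse", "render"},
}

print(select_tests_for({"render"}, record))                    # ['test_b']
print(untested({"load", "parse", "render", "export"}, record))  # {'export'}
```

Neither answer can be derived from the source text alone: both depend on what actually ran.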
The Evidence: Pointing It at Itself, Twice
Claims about analysis tools are cheap. So here's the strongest evidence I have: I ran the full battery on stitchgraph's own source after each of two releases, and it caught real defects both times, including in code that had just been reviewed and gated.
Round one (research log), after v3.25.0, a release that had just absorbed a 24-finding external code review: the newest operation, runtime_risk, returned "no hotspots" on stitchgraph itself. Silently. The cause: coverage file-ids are relative to the indexed root, git churn paths to the repo root, and the join between them matched nothing on any src-layout project. Every gate had passed; the op just answered an empty question confidently. Dogfooding caught it in one run.
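The failure mode is easy to reproduce in miniature. This sketch assumes coverage keys are paths relative to an indexed root such as `src/`, and churn keys are paths relative to the repository root. The function names and data shapes are hypothetical, not stitchgraph's internals.

```python
from pathlib import PurePosixPath


def join_naive(coverage: dict[str, float], churn: dict[str, int]) -> dict:
    """Join on raw keys: silently empty when the two roots differ."""
    return {p: (coverage[p], churn[p]) for p in coverage.keys() & churn.keys()}


def join_rebased(coverage: dict[str, float], churn: dict[str, int], indexed_root: str) -> dict:
    """Rebase coverage paths onto the repo root before joining.

    indexed_root is the indexed directory relative to the repository
    root, e.g. "src" for a src-layout project.
    """
    rebased = {str(PurePosixPath(indexed_root) / p): v for p, v in coverage.items()}
    joined = join_naive(rebased, churn)
    if coverage and churn and not joined:
        # Two non-empty inputs with no overlap is a path mismatch,
        # not evidence that there are no hotspots.
        raise ValueError("coverage and churn share no paths; check the roots")
    return joined


coverage_by_file = {"pkg/core.py": 0.2}
churn_by_file = {"src/pkg/core.py": 14}

print(join_naive(coverage_by_file, churn_by_file))           # {}
print(join_rebased(coverage_by_file, churn_by_file, "src"))  # {'src/pkg/core.py': (0.2, 14)}
```

The guard is the part that matters: an empty join over two non-empty inputs is treated as an error to surface, never as a clean result.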
Round two (research log), after v3.27.0, which included a large deduplication refactor: find_stale flagged a function called parse_tree in the freshly refactored shared module. It was right. My refactor had added the shared helper but never wired the nine call sites to use it, and the linter then auto-removed the unused imports, hiding the slip.
Here's the part worth dwelling on. The refactor was gated by a byte-identical output differential and a 1,618-test oracle battery, and both passed, because dead code has no outputs. Output-equivalence oracles prove a refactor changed nothing, including that a helper changed nothing because nothing called it. Only a liveness view sees that. find_gaps then corroborated from the runtime side: its untested-dead list was exactly that function plus the one known advisory.
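A toy illustration of why the two gates see different things, using Python's `ast` module on a single source string. This is a name-based check invented for the example. It is far cruder than a real liveness analysis, it is not how find_stale is implemented, and the source snippet is a stand-in, not stitchgraph's code.

```python
import ast

SOURCE = '''
def parse_tree(text):
    return text.split()

def frontend_a(text):
    return text.split()   # still uses its own inline copy

def frontend_b(text):
    return text.split()   # still uses its own inline copy
'''


def unreferenced_functions(source: str, entry_points: set[str]) -> set[str]:
    """Functions that are defined but never mentioned anywhere else."""
    tree = ast.parse(source)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    # Any bare name or attribute access counts as a reference. This is
    # name-based, so it can miss dead code that shares a name with live code.
    referenced = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    referenced |= {n.attr for n in ast.walk(tree) if isinstance(n, ast.Attribute)}
    return defined - referenced - entry_points


print(unreferenced_functions(SOURCE, {"frontend_a", "frontend_b"}))  # {'parse_tree'}
```

Both frontends return exactly what they returned before the helper existed, so an output differential has nothing to report. Only the question "does anything reach this definition?" exposes it.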
The refactor itself gave the runtime toolkit a controlled experiment. Before and after a ~400-line, nine-file deduplication:
- Intrinsic dimensionality: 27 before, 27 after. The strongest runtime statement of "behaviour-preserving" I know how to make.
- coverage_drift narrated the refactor from coverage alone. Functions that lost coverage were exactly the nine deleted per-language copies; functions that gained it were exactly the new shared module. A behavioural changelog, derived without reading the diff.
- The static graph watched too. The duplicated frontends had shown up as their own 329-node cluster in find_subsystems (the graph had, in effect, recommended the refactor); after it, the cluster shrank and 25 duplicate-driven noise findings evaporated from scan.
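The drift comparison can be pictured as a set difference between two execution records. This is a sketch under the assumption that each record maps test names to executed functions. The function name mirrors the operation for readability only, and the module names are hypothetical.

```python
def coverage_drift(before: dict[str, set[str]], after: dict[str, set[str]]) -> dict[str, set[str]]:
    """Functions that gained or lost any test exposure between two runs."""
    executed_before = set().union(*before.values())
    executed_after = set().union(*after.values())
    return {
        "lost": executed_before - executed_after,
        "gained": executed_after - executed_before,
    }


run_before = {
    "test_py": {"py_frontend.parse_tree", "emit"},
    "test_go": {"go_frontend.parse_tree", "emit"},
}
run_after = {
    "test_py": {"shared.parse_tree", "emit"},
    "test_go": {"shared.parse_tree", "emit"},
}

drift = coverage_drift(run_before, run_after)
print(sorted(drift["lost"]))    # ['go_frontend.parse_tree', 'py_frontend.parse_tree']
print(sorted(drift["gained"]))  # ['shared.parse_tree']
```

Deleted copies show up as lost exposure and the shared helper as gained exposure, with no reference to the diff itself.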
Two rounds, each finding strictly less than the last, converging on a fixed point of one known, documented advisory. A tool for finding problems, run on itself until it has nothing left to say: that's the closest thing to a self-certification the genre allows, and every number above is reproducible from the repo.
What I Accidentally Reinvented
I built stitchgraph largely without the academic literature, and when I finally mapped it against prior work, the honest answer is that much of the foundation replicates university research, independently. That's worth stating plainly, for two reasons: you should know what's new and what isn't, and independent convergence is its own kind of evidence that these ideas are the natural ones.
- The layered call, statement, and expression graph is the Code Property Graph (Yamaguchi et al., IEEE S&P 2014; the tool Joern), reinvented.
- Name- and order-invariant clone detection over dependence graphs goes back to Krinke and Komondoor & Horwitz (2001); the Weisfeiler-Lehman kernel is Shervashidze et al. (2011).
- The coverage matrix is what the literature calls program spectra; clustering execution profiles dates to Dickinson, Leon & Podgurski (2001). Greedy minimal test covers are Harrold, Gupta & Soffa (1993); test selection is a whole field (Ekstazi runs it in production). Feature location from test execution is Wilde & Scully (1995) and Eisenbarth et al. (2003).
- Clustering call graphs for architecture recovery, PageRank for key classes, churn-based risk: all established.
Where I'd claim actual novelty, having looked:
- The intrinsic-dimensionality framing. The field has pointed the coverage matrix at fault localization and suite reduction for twenty years. "Your suite exercises 27 independent behaviours; here they are, auto-labelled" as a first-class, developer-facing metric appears to be new. Standard math, standard matrix, new question.
- Calibrated honesty as an API contract for agent consumers. Every answer carries confidence, provenance, and machine-actionable reasons to doubt it; provenance caps how loudly a finding may shout, and the tool refuses rather than guesses. Alarm-ranking research exists. Designing the interface around a consumer that will act on the answer without judgment is a problem the literature is only beginning to have.
- The development-process record. stitchgraph was substantially built by LLMs under an adversarial multi-model review process: 280+ documented panel rounds with severity-weighted release gates, differential oracles pinning every risky path, an honesty ledger of negative results, and now two self-analysis rounds converging to a fixed point. I don't know of a comparably documented longitudinal record of LLM-driven development, including its failures, like the benchmark result this post opened with.
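To make the calibrated-honesty contract in the list above concrete, here is one shape such an envelope could take. It is a sketch only: the field names, provenance labels, and cap values are assumptions made up for illustration, not stitchgraph's actual schema. The 0.6 cap for name-based resolution echoes the example quoted earlier in the post.

```python
from dataclasses import dataclass

# Hypothetical caps: the most confidence a finding may claim, by evidence type.
PROVENANCE_CAPS = {"resolved": 1.0, "inferred": 0.8, "name-based": 0.6}


@dataclass(frozen=True)
class Envelope:
    answer: object
    confidence: float
    provenance: str
    reasons_to_doubt: tuple[str, ...] = ()


def make_envelope(answer, raw_confidence: float, provenance: str, reasons_to_doubt=()) -> Envelope:
    if provenance not in PROVENANCE_CAPS:
        # Refuse, never guess, when the evidence type is unknown.
        raise ValueError(f"unknown provenance: {provenance!r}")
    # Provenance caps how loudly the finding may shout.
    capped = min(raw_confidence, PROVENANCE_CAPS[provenance])
    return Envelope(answer, capped, provenance, tuple(reasons_to_doubt))


finding = make_envelope(
    ["handlers.py:12"], 0.9, "name-based", ["verify before acting"]
)
print(finding.confidence)  # 0.6
```

An agent consuming this can branch on `confidence` and `reasons_to_doubt` mechanically, which matters when the consumer acts without human judgment in the loop.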
What I'd Tell You to Actually Do with It
pip install 'stitchgraph[all]'
cd your-project
stitchgraph reindex . --db stitchgraph.db
stitchgraph report --db stitchgraph.db
The report command gives you orientation, issues, and risk on one page.
Then, for the part that will tell you something you don't already know: generate the coverage kit with stitchgraph scaffold-coverage, run it in your own sandbox, and ask stitchgraph find-modes --coverage coverage_modes.json. The first time you learn your five-thousand-test suite has a behavioural dimensionality of 30, and that 70 tests cover every function you actually execute, is the moment this project exists for.
Agents get the same operations over MCP (stitchgraph-mcp --db …), with the envelope telling them exactly how much to trust each answer. Everything is local, read-only on your source, and MIT-licensed.
Note: If you find something, or your dogfooding catches something mine didn't, the issue tracker is open. That's how the last three releases happened.
Summary
- A pre-built static code graph did not improve LLM coding agents on real tasks; a capable model recovers the same answers by reading, so the tool compressed seconds, not capability.
- Runtime behavioural analysis is the part reading cannot replace: POD over the test-coverage matrix showed a 2,350-test suite exercising 27 independent behaviours, with 64 tests covering every executed function.
- Dogfooding caught real defects twice in freshly reviewed, gate-passing code, including a dead refactor helper that a byte-identical differential and a 1,618-test oracle battery could not see.
- Most of the static foundation independently replicates published research; the intrinsic-dimensionality metric, the calibrated-honesty API contract, and the documented LLM-driven development record are the parts that appear new.
- Try it: pip install 'stitchgraph[all]', then stitchgraph find-modes on your own suite.
