Grigoriy Dobryakov

Breakdown 17 Architecture Audit LLM + Metrics Hybrid

Git Log as a Dataset for Architectural Retro-Analysis

A hybrid: deterministic metrics (a small bash script) plus LLM interpretation on top. Every claim in the report ties back to a SHA or a number from the metric layer — no made-up patterns. Read-only by design. Proof: a real run on .claude/skills/ of this repo, with concrete architectural findings that came out of the numbers.

Architect · Tech Lead · Head of AI

The Problem

In a long-running project of dozens of skills, services, or modules, the context behind past architectural decisions gets lost. Approaches shifted, things got refactored, some pieces died silently, others fused without explicit intent. To see that objectively, you need a dataset that doesn't depend on human memory.

That dataset already exists — git history. The problem isn't its absence; it's that people rarely use it as a dataset. Usually they read it as a feed of commits ("what happened this month"), less often as a metric (churn, co-change). Almost never as material for interpretation.

Up front, what this is not. It's not "feed the git log to an LLM and get insights." That doesn't work: for an average repo, a month of history adds up to tens of megabytes of diff text, which won't fit in a single prompt. And even if it did, on a raw log an LLM gravitates toward narrative smoothness: it will find "evolution" and "turning points" where there are none.

It's also not a code-maat-style approach (pure quantitative metrics). Numbers state a fact, not an explanation. "Component X was rewritten 16 times" — fact. Whether that's searching for the right form, reacting to changing requirements, or evidence of bad design — numbers won't say.

The ask: an explanation grounded in numbers, neither a narrative without grounding nor a number without interpretation.

The Method

One principle: separate the metric layer from the interpretation layer. Metrics are computed deterministically; the LLM writes a narrative report on top. Every claim in the report has to reference either a concrete commit SHA or a number from the metrics. That removes the risk of invented patterns: if the LLM says "component A is coupled to B," there must be co_commits ≥ 2 and the SHAs where they appear together. Otherwise the claim doesn't make it into the report.
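The grounding rule above can be sketched as a small check. This is a minimal Python illustration, not the skill's own code: the Claim shape, the is_grounded function, and the metric key names are all hypothetical; only the co_commits ≥ 2 threshold comes from the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Claim:
    """One report statement plus the evidence it cites (hypothetical shape)."""
    text: str
    shas: list = field(default_factory=list)         # commit SHAs cited by the claim
    metric_keys: list = field(default_factory=list)  # names of metric values cited
    pair: Optional[tuple] = None                     # set only for coupling claims


def is_grounded(claim, known_shas, metrics, co_commits, min_co_commits=2):
    """Return True if the claim can be traced back to the metric layer."""
    # Every cited SHA and metric key must actually exist.
    if any(sha not in known_shas for sha in claim.shas):
        return False
    if any(key not in metrics for key in claim.metric_keys):
        return False
    if claim.pair is not None:
        # Coupling claims need enough shared commits and the SHAs themselves.
        shared = co_commits.get(frozenset(claim.pair), 0)
        return shared >= min_co_commits and bool(claim.shas)
    # Any other claim needs at least one SHA or one metric number.
    return bool(claim.shas or claim.metric_keys)
```

A claim that fails the check is dropped rather than softened, which matches the rule that an unsupported statement doesn't make it into the report.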

Layer 1 — metrics (bash + awk script)

One script takes --root <path>, --since <date>, --until <date>, --bulk-threshold K and emits JSON with three blocks:

Bulk filter: a commit that touches more than K components (default 5) is flagged is_bulk=true and excluded from the co-change matrix. Otherwise one or two cross-cutting refactors ("changed the format in all skills") skew the matrix enough that any pair of involved components looks coupled. Per-component churn still counts those commits — it's real work, just not a semantic coupling signal.
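A minimal sketch of the bulk filter, in Python for readability (the actual script is bash + awk and may be organized differently). It assumes commits have already been mapped to the set of components they touch; the function name cochange_metrics and the output keys are illustrative, not the script's real JSON schema.

```python
from collections import Counter
from itertools import combinations


def cochange_metrics(commits, bulk_threshold=5):
    """commits: iterable of (sha, components), components being a set of names."""
    churn = Counter()
    co_commits = Counter()
    bulk = []
    for sha, components in commits:
        # Churn counts every commit, bulk or not: it is real work.
        churn.update(components)
        if len(components) > bulk_threshold:
            # Cross-cutting commit: record it, but keep it out of the matrix.
            bulk.append(sha)
            continue
        # Each unordered pair touched by the same commit gets one co-change.
        for a, b in combinations(sorted(components), 2):
            co_commits[(a, b)] += 1
    return {"churn": dict(churn), "co_commits": dict(co_commits), "bulk": bulk}
```

Sorting the component names before pairing keeps each pair under a single key, so (a, b) and (b, a) never get counted separately.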

No dependencies: git + bash + awk. Read-only.

Layer 2 — deterministic commit selection for deep review

The LLM doesn't pick what to look at. Rules are fixed:

The union of these three sets is expanded into full diffs via git show -p. If the total diff payload exceeds the budget (~200K tokens), the skill trims from the end of the date-sorted list and says so in the report.
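The trimming step can be sketched like this. It is an illustration under assumptions, not the skill's implementation: the list is taken to be sorted oldest first, dates are ISO strings, and tokens are estimated at roughly four characters each; the function name select_within_budget is hypothetical.

```python
def select_within_budget(candidates, budget_tokens=200_000, chars_per_token=4):
    """candidates: list of (iso_date, sha, diff_text), duplicates allowed.

    Returns (kept_shas, dropped_shas). The token estimate is a rough
    assumption, not a real tokenizer.
    """
    # Union of the selection sets: one entry per SHA.
    unique = {sha: (date, sha, diff) for date, sha, diff in candidates}
    ordered = sorted(unique.values())  # by date, then SHA
    kept, dropped, used = [], [], 0
    for i, (_, sha, diff) in enumerate(ordered):
        cost = len(diff) // chars_per_token + 1
        if used + cost > budget_tokens:
            # Trim from the end: everything from here on is left out.
            dropped = [s for _, s, _ in ordered[i:]]
            break
        kept.append(sha)
        used += cost
    return kept, dropped
```

The dropped list is returned explicitly so that the report can name what was left out instead of truncating silently.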

Layer 3 — interpretation (LLM)

Given the metrics JSON and full diffs of the selected commits, the LLM writes a report following a strict structure: scope → metrics → co-change → turning points → paradoxes → evolution of approach → what suggests itself → limitations.
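Because the report structure is fixed, it can be checked mechanically. A small sketch, assuming the report's headings match the section names above exactly apart from case; the helper name missing_or_misordered is hypothetical and not part of the skill.

```python
REPORT_SECTIONS = [
    "scope", "metrics", "co-change", "turning points", "paradoxes",
    "evolution of approach", "what suggests itself", "limitations",
]


def missing_or_misordered(headings):
    """Return expected section names that are absent or out of order."""
    normalized = [h.strip().lower() for h in headings]
    problems = []
    position = -1  # index of the last section found in the right order
    for name in REPORT_SECTIONS:
        if name not in normalized:
            problems.append(name)
            continue
        idx = normalized.index(name)
        if idx < position:
            problems.append(name)
        else:
            position = idx
    return problems
```

An empty result means the report follows the expected order; anything else names the sections to look at.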

Interpretation rules:

What this gives you

  1. Reproducibility. On the same git state the script emits identical JSON. Metrics are fact. Interpretation on top is the LLM's opinion, and it's explicitly marked as opinion (linked to its source).
  2. Localizable failure. Numbers off: bug in the script, fixed by a test. Narrative off: the LLM hallucinated, fixed by editing the instructions.
  3. Transparent bulk filter. Excluded commits don't disappear; they get named in their own section of the report. You see what was dropped and why.
  4. Reusable metrics. The script works without the LLM layer: for CI, a dashboard, comparing two runs. A bonus, not a goal.
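One way to verify the reproducibility claim, or to compare two runs in CI, is to fingerprint the metrics JSON. This is a sketch of that idea, not something the skill ships: the function name metrics_fingerprint is hypothetical, and canonicalizing with sorted keys is an assumption about how one would want to compare outputs.

```python
import hashlib
import json


def metrics_fingerprint(metrics: dict) -> str:
    """Hash a canonical serialization so key order doesn't affect the result."""
    canonical = json.dumps(
        metrics, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs on the same git state should yield the same fingerprint; a different one points at the metric layer, before any interpretation is involved.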

The Artifact

The skill git-evolution-audit — public repository: github.com/dobryakov/git-evolution-audit.

One real run was executed on this repo's .claude/skills/ over 2026-04-27..2026-05-22 (62 commits, 19 components, 2 bulk-excluded). Concrete findings that came out of the numbers:

All of those came from the numbers, not from general reasoning. That's the method validating itself on its own repo.

Series signature

Where It Breaks

Who It's For and Why

Architects, tech leads, heads of AI — anyone building long-lived systems who wants a periodic objective mirror of what's actually happening inside.

Particularly useful for systems with skills / agents / LLM pipelines, where "how did this component evolve" is harder to answer than for classic code: skills get rewritten iteratively, the line between methodology and implementation is blurred, and without a dedicated tool you can't reconstruct state from a quarter ago.

The core thing this pattern addresses: subjective drift — when your opinion of your own architecture diverges from its actual state. Metrics won't let you forget what was rewritten five times; interpretation on top won't let you drown in numbers alone. The hybrid works better than either layer on its own.

This isn't a replacement for review or retrospective; it's a cheap intermediate artifact you can run quarterly against any critical directory and keep as a baseline. A quarter later — next run, with the delta visible without special tooling.
