Git Log as a Dataset for Architectural Retro-Analysis

The Problem

In a long-running project of dozens of skills, services, or modules, the context behind past architectural decisions gets lost. Approaches shift, things get refactored, some pieces die silently, others fuse without explicit intent. To see that objectively, you need a dataset that doesn't depend on human memory.

That dataset already exists — git history. The problem isn't its absence; it's that people rarely use it as a dataset. Usually they read it as a feed of commits ("what happened this month"), less often as a metric (churn, co-change). Almost never as material for interpretation.

Up front, what this is not. It's not "feed the git log to an LLM and get insights." That doesn't work: a month of history in an average repo can amount to tens of megabytes of diff text, which won't fit in a single prompt. And even if it did, on a raw log an LLM gravitates toward narrative smoothness: it will find "evolution" and "turning points" where there are none.

It's also not a code-maat-style approach (pure quantitative metrics). Numbers state a fact, not an explanation. "Component X was rewritten 16 times" — fact. Whether that's searching for the right form, reacting to changing requirements, or evidence of bad design — numbers won't say.

The ask: an explanation grounded in numbers, not a narrative without grounding and not a number without interpretation.

The Method

One principle: separate the metric layer from the interpretation layer. Metrics are computed deterministically; the LLM writes a narrative report on top. Every claim in the report has to reference either a concrete commit SHA or a number from the metrics. That sharply reduces the risk of invented patterns: if the LLM says "component A is coupled to B," there must be co_commits ≥ 2 and the SHAs where they appear together. Otherwise the claim doesn't make it into the report.

Layer 1 — metrics (bash + awk script)

One script takes --root <path>, --since <date>, --until <date>, --bulk-threshold K and emits JSON with three blocks:

• components[] — per-component churn, first/last commit, biggest commit. A "component" = an immediate subdirectory of --root.
• co_change_pairs[] — for each pair of components, the count of commits that touched both.
• commits[] — index by SHA: date, author, diff size, components touched, is_bulk flag, subject.
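The component and churn definitions above can be sketched in a few lines. This is a Python illustration, not the skill's bash + awk script. It assumes the log text was produced by something like `git log --numstat --format='@%H|%aI|%s'`, where the `@` header marker and the `|` separator are choices made for this sketch. The sample SHAs, dates and paths are made up, and renamed paths (`old => new`) are not handled.

```python
def component_of(path, root):
    """Map a file path to its component: the immediate subdirectory of root."""
    root = root.strip("/")
    prefix = root + "/" if root else ""
    if not path.startswith(prefix):
        return None  # outside --root
    head, sep, _ = path[len(prefix):].partition("/")
    return head if sep else None  # files directly in root belong to no component


def parse_log(text, root):
    """Parse header lines '@sha|date|subject' followed by numstat lines."""
    commits, current = [], None
    for line in text.splitlines():
        if line.startswith("@"):
            sha, date, subject = line[1:].split("|", 2)
            current = {"sha": sha, "date": date, "subject": subject, "churn": {}}
            commits.append(current)
        elif line.strip() and current is not None:
            added, deleted, path = line.split("\t", 2)
            comp = component_of(path, root)
            if comp is None:
                continue
            totals = current["churn"].setdefault(comp, [0, 0])
            # Binary files appear as "-" in both columns: a touch with 0/0 churn.
            totals[0] += 0 if added == "-" else int(added)
            totals[1] += 0 if deleted == "-" else int(deleted)
    return commits


def component_churn(commits):
    """Aggregate commit count and added/deleted lines per component."""
    stats = {}
    for commit in commits:
        for comp, (added, deleted) in commit["churn"].items():
            s = stats.setdefault(comp, {"commits": 0, "added": 0, "deleted": 0})
            s["commits"] += 1
            s["added"] += added
            s["deleted"] += deleted
    return stats


# Made-up sample log, in the assumed format.
SAMPLE = (
    "@aaa111|2026-05-01|first\n\n"
    "10\t4\tskills/alpha/SKILL.md\n"
    "-\t-\tskills/alpha/logo.png\n"
    "3\t0\tskills/beta/SKILL.md\n"
    "1\t1\tREADME.md\n\n"
    "@bbb222|2026-05-02|second\n\n"
    "2\t0\tskills/alpha/notes.md\n"
)
stats = component_churn(parse_log(SAMPLE, "skills"))
```

Files outside the root, or sitting directly in it, are ignored, which matches the "immediate subdirectory" definition.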

Bulk filter: a commit that touches more than K components (default 5) is flagged is_bulk=true and excluded from the co-change matrix. Otherwise one or two cross-cutting refactors ("changed the format in all skills") skew the matrix enough that any pair of involved components looks coupled. Per-component churn still counts those commits — it's real work, just not a semantic coupling signal.
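A minimal Python sketch of the co-change count with the bulk filter, assuming the commits are already reduced to a mapping from SHA to the components they touch. The function name, the input shape and the SHAs are hypothetical; the actual script is bash + awk.

```python
from collections import Counter
from itertools import combinations


def co_change_pairs(commits, bulk_threshold=5):
    """Count, per component pair, the non-bulk commits that touched both.

    commits: {sha: iterable of component names}
    Returns (pair counts, list of SHAs flagged as bulk).
    """
    pairs = Counter()
    bulk = []
    for sha, components in commits.items():
        comps = sorted(set(components))
        if len(comps) > bulk_threshold:
            bulk.append(sha)  # flagged, kept for the report, excluded from the matrix
            continue
        for a, b in combinations(comps, 2):
            pairs[(a, b)] += 1
    return pairs, bulk


# Made-up example: c3 touches six components and is treated as bulk.
example = {
    "c1": ["a", "b"],
    "c2": ["a", "b", "c"],
    "c3": ["a", "b", "c", "d", "e", "f"],
}
pairs, bulk = co_change_pairs(example)
```

Sorting the component names before pairing keeps each pair in one canonical order, so (a, b) and (b, a) never end up as separate entries.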

No dependencies: git + bash + awk. Read-only.

Layer 2 — deterministic commit selection for deep review

The LLM doesn't pick what to look at. Rules are fixed:

• per-component top-N (default 10) by diff size;
• multi-component non-bulk: commits touching 2-4 components — turning-point candidates (cross-cutting, but not "everything at once");
• top co-change drivers: for the top-5 pairs, find the commits where the pair appears.
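The three rules above could be sketched as one deterministic function. This is an illustration in Python under stated assumptions: the record fields (`sha`, `diff_size`, `components`) are hypothetical names, ties are broken by SHA so the output is stable, and bulk commits are assumed to stay eligible for the per-component top-N, since the rules do not say either way.

```python
from collections import Counter
from itertools import combinations


def select_for_review(commits, top_n=10, top_pairs=5, bulk_threshold=5):
    """Return the sorted SHAs picked by the three fixed rules.

    commits: list of dicts with "sha", "diff_size" and "components" (a set).
    """
    selected = set()
    non_bulk = [c for c in commits if len(c["components"]) <= bulk_threshold]

    # Rule 1: per-component top-N by diff size.
    # Assumption: bulk commits stay eligible here; the text does not say.
    by_component = {}
    for c in commits:
        for comp in c["components"]:
            by_component.setdefault(comp, []).append(c)
    for group in by_component.values():
        ranked = sorted(group, key=lambda c: (-c["diff_size"], c["sha"]))
        selected.update(c["sha"] for c in ranked[:top_n])

    # Rule 2: multi-component, non-bulk commits touching 2-4 components.
    selected.update(c["sha"] for c in commits if 2 <= len(c["components"]) <= 4)

    # Rule 3: commits behind the top co-change pairs (bulk commits excluded).
    pairs = Counter()
    for c in non_bulk:
        pairs.update(combinations(sorted(c["components"]), 2))
    top = sorted(pairs, key=lambda p: (-pairs[p], p))[:top_pairs]
    for a, b in top:
        selected.update(c["sha"] for c in non_bulk if {a, b} <= c["components"])

    return sorted(selected)


# Made-up commits; s4 touches six components and counts as bulk.
sample = [
    {"sha": "s1", "diff_size": 100, "components": {"a"}},
    {"sha": "s2", "diff_size": 50, "components": {"a", "b"}},
    {"sha": "s3", "diff_size": 5, "components": {"a"}},
    {"sha": "s4", "diff_size": 1, "components": {"a", "b", "c", "d", "e", "f"}},
    {"sha": "s5", "diff_size": 2, "components": {"c"}},
]
picked = select_for_review(sample, top_n=1, top_pairs=1)
```

Nothing here depends on model judgment: the same commit list always yields the same selection.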

For the union of these three sets, the skill pulls full diffs via git show -p. If the total diff payload exceeds the budget (~200K tokens), the skill trims from the end of the date-sorted list and says so in the report.
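A hedged sketch of the budget trim in Python. Two things are assumptions: the token estimate (about four characters per token) is a rough stand-in for a real tokenizer, and the list is sorted oldest-first, so trimming "from the end" drops the newest commits. The source does not state the sort direction.

```python
def estimate_tokens(text):
    # Assumption: roughly 4 characters per token; swap in a real tokenizer if needed.
    return (len(text) + 3) // 4


def trim_to_budget(commits, budget_tokens=200_000):
    """Keep commits in date order until the budget is hit; drop the rest.

    commits: list of dicts with "sha", "date" (ISO string) and "diff" (text).
    Returns (kept SHAs, dropped SHAs) so the report can name what was cut.
    """
    ordered = sorted(commits, key=lambda c: (c["date"], c["sha"]))
    kept, dropped, used = [], [], 0
    for c in ordered:
        cost = estimate_tokens(c["diff"])
        if not dropped and used + cost <= budget_tokens:
            kept.append(c["sha"])
            used += cost
        else:
            dropped.append(c["sha"])  # everything after the first overflow is trimmed
    return kept, dropped


# Made-up diffs of 40 characters each (about 10 tokens) against a tiny budget.
sample = [
    {"sha": "c", "date": "2026-05-03", "diff": "x" * 40},
    {"sha": "a", "date": "2026-05-01", "diff": "x" * 40},
    {"sha": "b", "date": "2026-05-02", "diff": "x" * 40},
]
kept, dropped = trim_to_budget(sample, budget_tokens=25)
```

Returning the dropped SHAs alongside the kept ones is what lets the report say explicitly what was trimmed.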

Layer 3 — interpretation (LLM)

Given the metrics JSON and full diffs of the selected commits, the LLM writes a report following a strict structure: scope → metrics → co-change → turning points → paradoxes → evolution of approach → what suggests itself → limitations.

Interpretation rules:

• A claim about a component → reference a number from metrics ("21 commits, +716/-166").
• A claim about a specific change → reference a SHA.
• A claim about co-change → reference a pair with its co_commits count.
• A pair with co_commits = 1 is noise, not a pattern. The report includes only co_commits ≥ 2.
• If something "seems true" but isn't supported by numbers, the place stays empty. A gap in the report beats an ungrounded narrative.
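In the skill these rules are instructions to the LLM. If the claims were emitted in a structured form, the same rules could also be checked mechanically. The sketch below assumes a hypothetical claim format (`kind`, `component`, `sha`, `pair`) and a hypothetical in-memory metrics shape; neither comes from the skill itself.

```python
def validate_claim(claim, metrics, min_co_commits=2):
    """Return True only if the claim is backed by the metrics."""
    kind = claim.get("kind")
    if kind == "component":
        return claim.get("component") in metrics["components"]
    if kind == "change":
        return claim.get("sha") in metrics["commits"]
    if kind == "co_change":
        pair = tuple(sorted(claim.get("pair", ())))
        return metrics["co_change_pairs"].get(pair, 0) >= min_co_commits
    return False  # unknown claim type: leave the gap


def filter_report(claims, metrics):
    """Drop every claim that cannot be tied to a number or a SHA."""
    return [c for c in claims if validate_claim(c, metrics)]


# Made-up metrics; pair keys are tuples here purely as an in-memory convenience.
metrics = {
    "components": {"alpha": {"commits": 21}, "beta": {"commits": 4}},
    "commits": {"abc1234": {"subject": "split alpha"}},
    "co_change_pairs": {("alpha", "beta"): 2, ("alpha", "gamma"): 1},
}
claims = [
    {"kind": "component", "component": "alpha"},
    {"kind": "change", "sha": "abc1234"},
    {"kind": "change", "sha": "fffffff"},              # SHA not in the index
    {"kind": "co_change", "pair": ["beta", "alpha"]},  # co_commits = 2, kept
    {"kind": "co_change", "pair": ["alpha", "gamma"]}, # co_commits = 1, noise
]
kept = filter_report(claims, metrics)
```

A check like this only confirms that a reference exists, not that the interpretation attached to it is right.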

What this gives you

1. Reproducibility. On the same git state the script emits identical JSON. Metrics are fact. Interpretation on top is the LLM's opinion — and it's explicitly marked as opinion (linked to its source).
2. Localizable failure. Numbers off — bug in the script, fixed by a test. Narrative off — LLM hallucinated, fixed by editing the instructions.
3. Transparent bulk filter. Excluded commits don't disappear — they get named in their own section of the report. You see what was dropped and why.
4. Reusable metrics. The script works without the LLM layer — for CI, a dashboard, comparing two runs. A bonus, not a goal.
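As one example of using the metrics without the LLM layer, two runs can be compared with a few lines of Python. The per-component shape used here (`commits`, `added`, `deleted`) is an assumed simplification of the JSON, and the numbers are made up.

```python
def churn_delta(previous, current):
    """Per-component difference between two metric runs.

    Each argument: {component: {"commits": int, "added": int, "deleted": int}}.
    Components present in only one run are compared against zeros.
    """
    keys = ("commits", "added", "deleted")
    delta = {}
    for comp in sorted(set(previous) | set(current)):
        old = previous.get(comp, {})
        new = current.get(comp, {})
        delta[comp] = {k: new.get(k, 0) - old.get(k, 0) for k in keys}
    return delta


# Made-up runs: beta disappears, gamma appears.
q1 = {
    "alpha": {"commits": 10, "added": 300, "deleted": 50},
    "beta": {"commits": 4, "added": 80, "deleted": 10},
}
q2 = {
    "alpha": {"commits": 21, "added": 716, "deleted": 166},
    "gamma": {"commits": 2, "added": 40, "deleted": 0},
}
delta = churn_delta(q1, q2)
```

A component that shows up with a purely negative delta may have died or been renamed; the numbers alone cannot tell which.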

The Artifact

The skill git-evolution-audit — public repository: github.com/dobryakov/git-evolution-audit.

• SKILL.md — orchestration, selection rules, report structure, failure modes.
• scripts/collect_metrics.sh — bash metric layer (~150 lines, no dependencies).

One real run was executed on this repo's .claude/skills/ over 2026-04-27..2026-05-22 (62 commits, 19 components, 2 bulk-excluded). Concrete findings that came out of the numbers:

• humanizer has the highest delete-to-insert ratio in the directory: 824/1625 ≈ 51% vs. ~3% for neighbors. Half of what was written got rewritten, which points to a search for form rather than accumulation. On May 22 (commit 27fa2bc) the monolithic SKILL.md was decomposed into a skill plus references/{calques,lexicon,rhetoric,typography}.md.
• action-planner is the next candidate for the same decomposition: 21 commits, +716 lines, but delete:insert = 23% (iterations, not rewrites). If the pace holds, the monolith becomes a bottleneck.
• Co-change paradox: the pair evidence-finder ↔ humanizer (co_commits=2, 50%) isn't structural. It surfaces from the publication pipeline declared in CLAUDE.md, a marker that the repo has implicit pipelines visible only through co-change. Candidate to promote pipelines into first-class objects.
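The ratios quoted above can be re-derived directly from the line counts. The snippet below is plain arithmetic in Python. The 166 deletions for action-planner are taken from the "+716/-166" figure quoted in the interpretation rules, on the assumption that it refers to the same component.

```python
def delete_to_insert(deleted, inserted):
    """Deleted lines as a share of inserted lines; None when nothing was inserted."""
    if inserted == 0:
        return None
    return deleted / inserted


humanizer_pct = round(delete_to_insert(824, 1625) * 100)      # 824 / 1625
action_planner_pct = round(delete_to_insert(166, 716) * 100)  # 166 / 716 (assumed pairing)
```

These come out at 51 and 23 percent respectively, matching the figures in the findings.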

All of those came from the numbers, not from general reasoning. That's the method validating itself on its own repo.

Where It Breaks

• Directory renames distort history. If a component was renamed or moved, metrics show a "short life" for the new one and a "death" for the old one. git log --follow works only for single files, not directories. Rename detection is not implemented by design (MVP). On rename-free repos it's a non-issue; on a refactored codebase a disclaimer is mandatory.
• Rebase/squash/force-push in history make numbers inexact where the team cleaned up history. On a linear-merge policy it's usually fine.
• Binary files show as -/- in --numstat. The script counts them as 0/0 for churn, but they're still counted as a touch. On directories with images the metric misleads.
• Content vs code. On outputs/, market-state/signals/ and similar, high churn = normal operation, not instability. The skill doesn't understand those semantics; the interpretation layer has to know what kind of directory it's looking at. On a code repo the distinction between "source" and "log-style" directories is simpler; on a content repo the boundary is fuzzy.
• Generator/generated pairs aren't detected. A run against website/ would show "the evolution of HTML," which is meaningless because the source is content.md. An auto-detection heuristic (1:1 co-change → mark as generated) is out of scope for the MVP. On a homogeneous directory like .claude/skills/ it's unnecessary; on a heterogeneous repo it's required.
• Context size. On large code repos the diff selection passed from Layer 2 to Layer 3 won't fit the budget even after the top-N filter. The trimming step cuts it with an honest warning, but the realistic MVP scale is directories with tens to hundreds of commits with meaningful diffs. "A whole monorepo over five years" requires budgeting (quarterly slices → meta-summary) and re-engineering not implemented here.
• Co-change at small N is unreliable. A pair with co_commits = 1 is noise. The skill filters that at the report level (≥ 2), but on a short history (1-2 weeks) even 2 is unreliable. Minimum 4 weeks or 50+ commits per directory for a meaningful matrix.
• Bulk threshold is a heuristic. The default K=5 catches cross-cutting "format change everywhere," but may erroneously drop a legitimate large refactor across six components. Choose K consciously; if in doubt, run twice with different values and compare the matrices.
• LLM narrative-smoothing doesn't disappear entirely. It's reduced by "every claim ties to a SHA or a number," but not to zero. The cure: read the report critically and verify any suspicious generalization via grep on the referenced SHAs.
• Single-author bias. On a single-author repo actor_breakdown isn't a signal. On a team repo it becomes critical; it's not separately surfaced in the MVP.
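The advice on the bulk threshold (run twice with different K and compare the matrices) can be sketched as a small sensitivity check. This is an illustrative Python version with a hypothetical input shape (SHA mapped to touched components) and made-up data. It lists the pairs that only reach co_commits ≥ 2 once larger commits are admitted.

```python
from collections import Counter
from itertools import combinations


def co_change(commits, k):
    """Pair counts over commits touching at most k components."""
    pairs = Counter()
    for components in commits.values():
        comps = sorted(set(components))
        if len(comps) > k:
            continue  # bulk under this threshold
        pairs.update(combinations(comps, 2))
    return pairs


def threshold_sensitive_pairs(commits, k_low, k_high, min_co_commits=2):
    """Pairs that pass the report filter at k_high but not at k_low."""
    low = {p for p, n in co_change(commits, k_low).items() if n >= min_co_commits}
    high = {p for p, n in co_change(commits, k_high).items() if n >= min_co_commits}
    return sorted(high - low)


# Made-up history: two small commits plus two six-component refactors.
history = {
    "c1": ["a", "b"],
    "c2": ["a", "b"],
    "c3": ["a", "b", "c", "d", "e", "f"],
    "c4": ["a", "b", "c", "d", "e", "f"],
}
fragile = threshold_sensitive_pairs(history, k_low=5, k_high=6)
```

Pairs in the `fragile` list owe their coupling entirely to the choice of K, so they deserve a look at the underlying commits before they go into a report.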

Who It's For and Why

Architects, tech leads, heads of AI — anyone building long-lived systems who wants a periodic objective mirror of what's actually happening inside.

Particularly useful for systems with skills / agents / LLM pipelines, where "how did this component evolve" is harder to answer than for classic code: skills get rewritten iteratively, the line between methodology and implementation is blurred, and without a dedicated tool you can't reconstruct state from a quarter ago.

The core thing this pattern addresses: subjective drift — when your opinion of your own architecture diverges from its actual state. Metrics won't let you forget what was rewritten five times; interpretation on top won't let you drown in numbers alone. The hybrid works better than either layer on its own.

This isn't a replacement for review or retrospective; it's a cheap intermediate artifact you can run quarterly against any critical directory and keep as a baseline. A quarter later — next run, with the delta visible without special tooling.
