Phase 1: Git Diff Parser
The first stage of the pipeline extracts complete source files from both sides of a git diff. It reads the full file content from both the base and head commits — not just the diff hunks.
Why full source extraction
Most diff tools show you changed hunks — three lines before, the change, three lines after. This is useful for human review but useless for structural analysis. A diff hunk like + userId: string is not valid syntax — it cannot be parsed into an AST.
Diff Guardian needs complete, parseable files. It extracts the entire source code of each changed file from both the base and head refs. This gives the AST Mapper (Phase 2) two valid source trees to compare.
Git commands used
Internally, the parser lists the changed files once, then runs git show twice for each file in the diff:
# Step 1: Get list of changed files between two refs
git diff --name-only --diff-filter=ACMR <base> <head>
# Step 2: For each changed file, extract full source
git show <base>:<filepath> # → old source (entire file)
git show <head>:<filepath> # → new source (entire file)
The --diff-filter=ACMR flag selects Added, Copied, Modified, and Renamed files. Deleted files (status D) fall outside this filter; they are also tracked, but with an empty new source.
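The two steps above can be sketched in TypeScript as follows. This is a minimal illustration rather than Diff Guardian's actual implementation: `GitRunner`, `listChangedFiles`, and `extractSources` are hypothetical names, and the runner is injected so the sketch does not depend on how git is spawned (in Node it could wrap `child_process.execFileSync("git", args, { encoding: "utf8" })`).

```typescript
// Hypothetical runner type: receives git arguments, returns stdout as text.
type GitRunner = (args: string[]) => string;

// Step 1: list changed files between two refs (one git call in total).
function listChangedFiles(run: GitRunner, base: string, head: string): string[] {
  const stdout = run(["diff", "--name-only", "--diff-filter=ACMR", base, head]);
  // git prints one path per line; drop the empty entry after the final newline.
  return stdout.split("\n").filter((line) => line.length > 0);
}

// Step 2: extract the full source of one file from both refs (two git calls).
function extractSources(
  run: GitRunner,
  base: string,
  head: string,
  filepath: string,
): { oldSource: string; newSource: string } {
  return {
    oldSource: run(["show", `${base}:${filepath}`]),
    newSource: run(["show", `${head}:${filepath}`]),
  };
}
```

Note that `git show <base>:<filepath>` exits with an error when the path does not exist at that ref, so a real runner would need to handle newly added files separately.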
Operating modes
The Git Diff Parser operates in one of four modes, depending on how Diff Guardian was invoked.
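As an overview, the four modes described in the subsections that follow can be modelled as a discriminated union that resolves to an old-source and a new-source reader. This is only a sketch: the `Mode` type, its field names, and `planSources` are hypothetical, and the returned strings simply mirror the commands listed under each mode.

```typescript
// Hypothetical model of the four operating modes.
type Mode =
  | { kind: "compare"; base: string; head: string }
  | { kind: "working-tree" }
  | { kind: "staged" }
  | { kind: "ci"; baseRef: string; headSha: string };

interface SourcePlan {
  oldSource: string; // how the old side is read
  newSource: string; // how the new side is read
}

// Map a mode and a file path to the read operations named in the text.
function planSources(mode: Mode, file: string): SourcePlan {
  switch (mode.kind) {
    case "compare":
      return {
        oldSource: `git show ${mode.base}:${file}`,
        newSource: `git show ${mode.head}:${file}`,
      };
    case "working-tree":
      // The new side comes from disk, not from git.
      return {
        oldSource: `git show HEAD:${file}`,
        newSource: `fs.readFile(${file})`,
      };
    case "staged":
      // ":0:" addresses stage 0 of the git index.
      return {
        oldSource: `git show HEAD:${file}`,
        newSource: `git show :0:${file}`,
      };
    case "ci":
      return {
        oldSource: `git show origin/${mode.baseRef}:${file}`,
        newSource: `git show ${mode.headSha}:${file}`,
      };
  }
}
```

In every mode except compare and CI the old side is HEAD; only the source of the new side changes.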
Compare mode
Used by dg compare. Compares two explicit git refs.
npx dg compare main feature/payments
# Internally:
# old source = git show main:<file>
# new source = git show feature/payments:<file>
Working tree mode
Used by dg check (default). Compares HEAD against the current working tree.
npx dg check
# Internally:
# old source = git show HEAD:<file>
# new source = fs.readFile(<file>) // reads from disk
Staged mode
Used by dg check --staged. Compares HEAD against the git staging area (index).
npx dg check --staged
# Internally:
# old source = git show HEAD:<file>
# new source = git show :0:<file> // reads from git index
CI mode
Used in GitHub Actions. Compares the PR base ref against the PR head SHA.
# GitHub Actions environment:
# GITHUB_BASE_REF=main
# GITHUB_HEAD_SHA=abc1234
# Internally:
# old source = git show origin/main:<file>
# new source = git show abc1234:<file>
Output: FileDiff[]
The parser produces a FileDiff array — one entry per changed file. This is the contract between Phase 1 and Phase 2.
interface FileDiff {
path: string; // Relative path, e.g., "src/api/payments.ts"
language: string; // File extension without dot, e.g., "ts", "py", "go"
oldSource: string; // Full source code from the base ref
newSource: string; // Full source code from the head ref
}
Special cases
| Case | oldSource | newSource |
|---|---|---|
| New file added | "" (empty string) | Complete file contents |
| File deleted | Complete file contents | "" (empty string) |
| File renamed | Source from old path | Source from new path |
| Binary file | Skipped entirely (binary files have no AST) | Skipped entirely (binary files have no AST) |
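The table can be read as a small set of rules for building one FileDiff entry. The sketch below is an assumption about how those rules might be encoded, not the tool's real code: the `Change` shape, `languageOf`, and `toFileDiff` are hypothetical, and the two reader callbacks stand in for whichever mode supplies the old and new sources.

```typescript
interface FileDiff {
  path: string;
  language: string;
  oldSource: string;
  newSource: string;
}

// Hypothetical description of one changed file.
interface Change {
  kind: "added" | "deleted" | "renamed" | "modified";
  path: string;      // current path (new path for renames)
  oldPath?: string;  // previous path, only set for renames
  isBinary: boolean;
}

// File extension without the dot; "" when the file name has none.
function languageOf(path: string): string {
  const name = path.slice(path.lastIndexOf("/") + 1);
  const dot = name.lastIndexOf(".");
  return dot > 0 ? name.slice(dot + 1) : "";
}

function toFileDiff(
  change: Change,
  readOld: (path: string) => string,
  readNew: (path: string) => string,
): FileDiff | null {
  // Binary files have no AST, so they produce no entry at all.
  if (change.isBinary) return null;
  const oldPath = change.oldPath ?? change.path;
  return {
    path: change.path,
    language: languageOf(change.path),
    // Added files have no old side; deleted files have no new side.
    oldSource: change.kind === "added" ? "" : readOld(oldPath),
    newSource: change.kind === "deleted" ? "" : readNew(change.path),
  };
}
```

Returning `null` for binary files keeps the FileDiff[] array free of entries that Phase 2 could not parse.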
Performance
For a typical PR with 10-20 changed files, Phase 1 completes in under 100ms. The bottleneck is git show — one subprocess per file per ref. For a diff touching 50 files, that is 100 subprocess calls, which takes approximately 200-500ms depending on disk speed.
Buffer limit: each git show call is capped at 10MB. Files larger than 10MB are skipped with a warning. This prevents out-of-memory crashes on generated files or vendor bundles.
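The buffer limit can be illustrated with a small wrapper that turns an oversized file into a skip with a warning instead of a crash. This is a hedged sketch: `BoundedGitRunner`, `showWithLimit`, and the warning text are hypothetical, 10MB is taken to mean 10 × 1024 × 1024 bytes, and the `ENOBUFS` check assumes a runner built on Node's synchronous `child_process` calls with the `maxBuffer` option.

```typescript
// Assumption: "10MB" in the text means 10 * 1024 * 1024 bytes.
const MAX_SOURCE_BYTES = 10 * 1024 * 1024;

// Hypothetical runner: runs git with the given arguments and output cap.
type BoundedGitRunner = (args: string[], maxBuffer: number) => string;

interface ShowResult {
  source: string | null; // null when the file was skipped
  warning?: string;
}

function showWithLimit(run: BoundedGitRunner, ref: string, filepath: string): ShowResult {
  try {
    return { source: run(["show", `${ref}:${filepath}`], MAX_SOURCE_BYTES) };
  } catch (err) {
    // Assumed error shape: Node reports an exceeded maxBuffer as code "ENOBUFS".
    const code = typeof err === "object" && err !== null
      ? (err as { code?: unknown }).code
      : undefined;
    if (code === "ENOBUFS") {
      return { source: null, warning: `Skipped ${filepath}: larger than 10MB` };
    }
    throw err; // any other failure is not a size problem, so surface it
  }
}
```

Only the size-related error is swallowed, so unrelated git failures still surface to the caller.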
Next phase
The FileDiff[] array is passed to Phase 2: AST Mapper, which parses the source strings into syntax trees and extracts structural signatures.