Phase 1: Git Diff Parser

The first stage of the pipeline extracts complete source files from both sides of a git diff: it reads the full file content at both the base and head commits, not just the diff hunks.


Why full source extraction

Most diff tools show changed hunks: three lines of context before, the change, and three lines after. This is useful for human review but not for structural analysis. A hunk line such as + userId: string is not valid syntax on its own, so it cannot be parsed into an AST.

Diff Guardian needs complete, parseable files. It extracts the entire source code of each changed file from both the base and head refs. This gives the AST Mapper (Phase 2) two valid source trees to compare.

Git commands used

Internally, the parser runs one git command to list the changed files, then two more for each file in that list:

# Step 1: Get list of changed files between two refs
git diff --name-only --diff-filter=ACMR <base> <head>

# Step 2: For each changed file, extract full source
git show <base>:<filepath>     # → old source (entire file)
git show <head>:<filepath>     # → new source (entire file)

The --diff-filter=ACMR flag selects Added, Copied, Modified, and Renamed files. The filter itself excludes deletions (status D); deleted files are nevertheless tracked, with an empty new source.
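
The two steps above can be sketched in TypeScript as follows. This is a minimal illustration, not Diff Guardian's actual implementation: GitRunner, listChangedFiles, and readBothSides are hypothetical names, and the git subprocess is injected as a function so the sketch does not assume how it is spawned (Node's child_process.execFileSync would be one option).

```typescript
// Hypothetical abstraction: runs `git <args>` and returns stdout as text.
type GitRunner = (args: string[]) => string;

// Step 1: list files that were Added, Copied, Modified, or Renamed.
function listChangedFiles(git: GitRunner, base: string, head: string): string[] {
  const stdout = git(["diff", "--name-only", "--diff-filter=ACMR", base, head]);
  return stdout
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0); // drop the trailing empty line
}

// Step 2: read the entire file at both refs via `git show <ref>:<path>`.
function readBothSides(
  git: GitRunner,
  base: string,
  head: string,
  filepath: string
): { oldSource: string; newSource: string } {
  return {
    oldSource: git(["show", `${base}:${filepath}`]),
    newSource: git(["show", `${head}:${filepath}`]),
  };
}
```

With a real runner, git show fails when the path does not exist at the given ref (for example, the base side of a newly added file), so a caller would need to map that failure to an empty string.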

Operating modes

The Git Diff Parser operates in four distinct modes depending on how Diff Guardian was invoked:

Compare mode

Used by dg compare. Compares two explicit git refs.

npx dg compare main feature/payments

# Internally:
#   old source = git show main:<file>
#   new source = git show feature/payments:<file>

Working tree mode

Used by dg check (default). Compares HEAD against the current working tree.

npx dg check

# Internally:
#   old source = git show HEAD:<file>
#   new source = fs.readFile(<file>)   # reads from disk

Staged mode

Used by dg check --staged. Compares HEAD against the git staging area (index).

npx dg check --staged

# Internally:
#   old source = git show HEAD:<file>
#   new source = git show :0:<file>   # reads from git index

CI mode

Used in GitHub Actions. Compares the PR base ref against the PR head SHA.

# GitHub Actions environment:
# GITHUB_BASE_REF=main
# GITHUB_HEAD_SHA=abc1234

# Internally:
#   old source = git show origin/main:<file>
#   new source = git show abc1234:<file>
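
The four modes differ only in where the old and new sources come from, which can be modelled as a small lookup. The sketch below is an assumption about how that mapping might be expressed: the Mode and SourceSpec types and the resolveSources function are hypothetical names, not part of Diff Guardian's API.

```typescript
// Hypothetical model of the four modes described above.
type Mode =
  | { kind: "compare"; base: string; head: string }
  | { kind: "working-tree" }
  | { kind: "staged" }
  | { kind: "ci"; baseRef: string; headSha: string };

// Where one side of the comparison is read from.
type SourceSpec =
  | { from: "git"; object: string } // argument passed to `git show`
  | { from: "disk"; path: string }; // read from the working tree

const gitObject = (object: string): SourceSpec => ({ from: "git", object });

function resolveSources(
  mode: Mode,
  file: string
): { oldSpec: SourceSpec; newSpec: SourceSpec } {
  switch (mode.kind) {
    case "compare":
      return {
        oldSpec: gitObject(`${mode.base}:${file}`),
        newSpec: gitObject(`${mode.head}:${file}`),
      };
    case "working-tree":
      return {
        oldSpec: gitObject(`HEAD:${file}`),
        newSpec: { from: "disk", path: file },
      };
    case "staged":
      // `:0:<file>` names the stage-0 (index) entry for the path.
      return {
        oldSpec: gitObject(`HEAD:${file}`),
        newSpec: gitObject(`:0:${file}`),
      };
    case "ci":
      return {
        oldSpec: gitObject(`origin/${mode.baseRef}:${file}`),
        newSpec: gitObject(`${mode.headSha}:${file}`),
      };
  }
}
```

Keeping the mode-specific logic in one function like this means the rest of the phase only ever deals with an old spec and a new spec, regardless of how Diff Guardian was invoked.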

Output: FileDiff[]

The parser produces a FileDiff array — one entry per changed file. This is the contract between Phase 1 and Phase 2.

core/types.ts
interface FileDiff {
  path: string;        // Relative path, e.g., "src/api/payments.ts"
  language: string;    // File extension without dot, e.g., "ts", "py", "go"
  oldSource: string;   // Full source code from the base ref
  newSource: string;   // Full source code from the head ref
}
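
As an illustration of how an entry might be assembled, the sketch below derives the language field from the path. The helper names languageFromPath and makeFileDiff are hypothetical, and returning an empty string for files without an extension is an assumption; the source does not say how such files are handled.

```typescript
interface FileDiff {
  path: string;
  language: string;
  oldSource: string;
  newSource: string;
}

// Hypothetical helper: file extension without the dot, e.g. "ts".
function languageFromPath(path: string): string {
  const fileName = path.slice(path.lastIndexOf("/") + 1);
  const dot = fileName.lastIndexOf(".");
  // Assumption: no dot, or only a leading dot (".gitignore"), means no extension.
  return dot <= 0 ? "" : fileName.slice(dot + 1);
}

// Hypothetical helper: builds one FileDiff entry from a path and both sources.
function makeFileDiff(path: string, oldSource: string, newSource: string): FileDiff {
  return { path, language: languageFromPath(path), oldSource, newSource };
}

const example = makeFileDiff(
  "src/api/payments.ts",
  "export const fee = 1;\n",
  "export const fee = 2;\n"
);
// example.language is "ts"
```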

Special cases

Case           | oldSource              | newSource
New file added | "" (empty string)      | Complete file contents
File deleted   | Complete file contents | "" (empty string)
File renamed   | Source from old path   | Source from new path
Binary file    | Skipped entirely; binary files have no AST
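
A consumer of FileDiff[] could recover the added and deleted cases from the empty-string convention above. The following is a sketch under that assumption; ChangeKind and classifyChange are hypothetical names, not part of the documented contract.

```typescript
interface FileDiff {
  path: string;
  language: string;
  oldSource: string;
  newSource: string;
}

type ChangeKind = "added" | "deleted" | "modified";

// Infers the kind of change purely from which side is the empty string.
function classifyChange(diff: FileDiff): ChangeKind {
  if (diff.oldSource === "" && diff.newSource !== "") return "added";
  if (diff.oldSource !== "" && diff.newSource === "") return "deleted";
  return "modified";
}
```

One consequence of this convention is that a file truncated to zero bytes is indistinguishable from a deleted file, and a rename looks like an ordinary modification, because FileDiff carries a single path and no explicit status field.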

Performance

For a typical PR with 10-20 changed files, Phase 1 completes in under 100ms. The bottleneck is git show, which runs as one subprocess per file per ref. A diff touching 50 files therefore needs 100 subprocess calls (50 files, 2 refs each), which takes approximately 200-500ms depending on disk speed.

Buffer limit: the output of each git show call is capped at 10MB. Files larger than 10MB are skipped with a warning, which prevents out-of-memory crashes on generated files or vendor bundles.
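
The subprocess count and the size cap can be written down as simple arithmetic. In the sketch below the names are hypothetical, and treating 10MB as 10 * 1024 * 1024 bytes is an assumption; the source does not say whether the limit is binary or decimal.

```typescript
// Assumption: "10MB" is interpreted as 10 MiB.
const MAX_BUFFER_BYTES = 10 * 1024 * 1024;

// True when a file is too large and would be skipped with a warning.
function exceedsBufferLimit(sizeBytes: number): boolean {
  return sizeBytes > MAX_BUFFER_BYTES;
}

// One `git show` subprocess per file per ref that is read through git.
function gitShowCalls(fileCount: number, refsReadViaGit: number): number {
  return fileCount * refsReadViaGit;
}

const compareModeCalls = gitShowCalls(50, 2); // 100, both sides come from git
const workingTreeCalls = gitShowCalls(50, 1); // 50, the new side is read from disk
```

If the subprocesses are spawned with Node's child_process functions, their maxBuffer option is one way to enforce such a cap; whether Diff Guardian does it this way is not stated here.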

Next phase

The FileDiff[] array is passed to Phase 2: AST Mapper, which parses the source strings into syntax trees and extracts structural signatures.