Add feature-flagged compare_file_contents tool with semantic diffs #1981

SamMorrowDrums · 2026-02-09T21:51:55Z

Summary

Adds a new compare_file_contents MCP tool that compares two versions of a file across refs (branches, tags, or SHAs). For structured data formats, it produces semantic diffs that show only meaningful changes, ignoring formatting noise. For other formats, it falls back to unified diff.

Closes #1973

Real usage:

What Opus 4.6 thinks:

Is it easier to read?

Yes, significantly. As a model, the semantic diff is much more parseable because:

No noise — Unified diffs include headers (diff --git, index, ---, +++), hunk markers (@@ -6,9 +6,9 @@), and unchanged context lines. The semantic diff strips all of that.
Path-based keys — Instead of inferring that a line inside a JSON object is nested under devDependencies, the semantic diff explicitly says devDependencies.cross-env: added "^7.0.3". For a model, this is immediately actionable without needing to mentally reconstruct the document structure.
Change type is explicit — "added", "removed", → (changed) are unambiguous. In a unified diff, I have to correlate - and + lines to understand adds vs modifications.
Unchanged content is omitted entirely — The unified diff for package.json includes 3 context lines around each hunk. The semantic diff mentions zero unchanged fields.

How much shorter?

File	Unified Diff	Semantic Diff	Reduction
package.json	990 chars	314 chars	68% smaller
docker-publish.yml	646 chars	155 chars	76% smaller
ui-e2e.yml	336 chars	38 chars	89% smaller

On average, the semantic diff is ~77% smaller across these examples. The savings scale with how much context/boilerplate surrounds the actual change — the ui-e2e.yml case is the most dramatic because a 3-line addition in context of a larger file compresses to just permissions: added {"contents":"read"}.

For token-budget-conscious model usage, that's a big win — you get the same (or better) understanding of what changed at a fraction of the input cost.

Motivation

AI models waste context tokens on noisy line-based diffs for structured data files. A JSON file that gets reformatted with one value change produces a huge unified diff, when the actual change is a single property. This tool addresses that by producing concise, path-based semantic diffs.

Example

Unified diff (noisy — 6 lines):

-{"users":[{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]}
+{
+  "users": [
+    {"id": 1, "name": "Alice"},
+    {"id": 2, "name": "Bobby"}
+  ]
+}

Semantic diff (clear — 1 line):

users[1].name: "Bob" → "Bobby"

Tool Design

compare_file_contents(owner, repo, path, base, head)

Supported semantic formats: JSON, YAML, CSV, TOML
Fallback: Unified diff for all other file types
New/deleted files: Returns "file added" or "file deleted"
Max file size: 1MB (falls back to unified diff when exceeded)
Feature flag: mcp_compare_file_contents (tool is hidden until enabled)

Files

File	Description
`pkg/github/semantic_diff.go`	Core semantic diff engine (JSON, YAML, CSV, TOML + unified fallback)
`pkg/github/semantic_diff_test.go`	24 unit tests for the diff engine
`pkg/github/compare_file_contents.go`	MCP tool definition and GitHub API integration
`pkg/github/compare_file_contents_test.go`	11 integration tests with mocked GitHub API
`pkg/github/tools.go`	Tool registration (1 line)
`pkg/github/__toolsnaps__/compare_file_contents.snap`	Tool schema snapshot

Testing

35 new tests covering all formats, edge cases (new/deleted files, invalid input, size limits)
script/lint passes
script/test passes
Toolsnaps updated

Add a new compare_file_contents MCP tool that compares two versions of a file across refs (branches, tags, or SHAs). For structured data formats (JSON, YAML, CSV, TOML), it produces semantic diffs that show only meaningful changes, ignoring formatting differences. For unsupported formats, it falls back to unified diff. Key features: - Semantic diffs for JSON, YAML, CSV, TOML files - Unified diff fallback for code and other text files - Handles new files (base not found) and deleted files (head not found) - 1MB max file size to prevent excessive server-side processing - Gated behind 'compare_file_contents' feature flag This helps AI models by: - Reducing token usage (formatting noise eliminated) - Providing unambiguous before/after semantics - Enabling self-verification of edits to config/data files Refs: #1973

Auto-generated by license-check workflow

go mod tidy correctly marks these as direct imports now that semantic_diff.go uses them for YAML and TOML parsing.

Copilot

Pull request overview

Adds a new feature-flagged MCP tool, compare_file_contents, to compare a file across two Git refs and return semantic diffs for structured formats to reduce noisy line-based diffs.

Changes:

Registers a new compare_file_contents repository tool behind the mcp_compare_file_contents feature flag.
Implements a semantic diff engine for JSON/YAML/CSV/TOML with a unified-diff fallback.
Adds tool schema snapshot plus unit/integration test coverage for the tool and diff engine.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
pkg/github/tools.go	Registers the new tool in the repo toolset list.
pkg/github/compare_file_contents.go	Implements the MCP tool + GitHub Contents API integration + feature flag.
pkg/github/compare_file_contents_test.go	Adds mocked-client integration tests and toolsnap validation for the new tool.
pkg/github/semantic_diff.go	Implements semantic diffing for structured formats and a fallback diff renderer.
pkg/github/semantic_diff_test.go	Adds unit tests covering formats, edge cases, and fallback behavior.
pkg/github/toolsnaps/compare_file_contents.snap	Adds the schema snapshot for the new tool.

Comments suppressed due to low confidence (2)

pkg/github/compare_file_contents.go:99

This tool’s GitHub API error handling is inconsistent with other tools in this package (which typically wrap API failures via ghErrors.NewGitHubAPIErrorResponse / NewGitHubAPIStatusErrorResponse). Using the standard helpers would preserve status details and provide consistent error text to callers, especially for partial failures (base vs head).

			client, err := deps.GetClient(ctx)
			if err != nil {
				return nil, nil, fmt.Errorf("failed to get GitHub client: %w", err)
			}

			baseContent, baseErr := getFileAtRef(ctx, client, owner, repo, path, base)
			headContent, headErr := getFileAtRef(ctx, client, owner, repo, path, head)

			// If both sides fail, report the errors
			if baseErr != nil && headErr != nil {
				return utils.NewToolResultError(fmt.Sprintf("failed to get file at both refs: base %q: %s, head %q: %s", base, baseErr, head, headErr)), nil, nil
			}

pkg/github/compare_file_contents.go:156

getFileAtRef returns an error when decoded content exceeds MaxSemanticDiffFileSize, but the caller currently treats any error as “file missing”, leading to incorrect “file added/deleted” output. If size limits are enforced here, return a distinguishable error that the caller can surface (or handle) rather than converting to nil content.

	content, err := fileContent.GetContent()
	if err != nil {
		return nil, fmt.Errorf("failed to decode file content: %w", err)
	}

	if len(content) > MaxSemanticDiffFileSize {
		return nil, fmt.Errorf("file exceeds maximum size of %d bytes", MaxSemanticDiffFileSize)
	}

Copilot · 2026-02-09T21:56:11Z

pkg/github/compare_file_contents.go

+			path, err := RequiredParam[string](args, "path")
+			if err != nil {
+				return utils.NewToolResultError(err.Error()), nil, nil
+			}


The path parameter isn't normalized (e.g., leading /). Other repo tools trim a leading slash before calling the GitHub Contents API; without that, compare_file_contents will fail for inputs like /config.json even though get_file_contents accepts them. Consider trimming a leading / (and keeping the original for display if needed) before calling getFileAtRef / SemanticDiff.

Copilot · 2026-02-09T21:56:12Z

pkg/github/compare_file_contents.go

+			baseContent, baseErr := getFileAtRef(ctx, client, owner, repo, path, base)
+			headContent, headErr := getFileAtRef(ctx, client, owner, repo, path, head)
+
+			// If both sides fail, report the errors
+			if baseErr != nil && headErr != nil {
+				return utils.NewToolResultError(fmt.Sprintf("failed to get file at both refs: base %q: %s, head %q: %s", base, baseErr, head, headErr)), nil, nil
+			}
+
+			// A nil content with no error won't happen from getFileAtRef,
+			// but a non-nil error on one side means the file doesn't exist at that ref.
+			// Pass nil to SemanticDiff to indicate added/deleted file.
+			if baseErr != nil {
+				baseContent = nil
+			}
+			if headErr != nil {
+				headContent = nil
+			}


baseErr / headErr are treated as “file doesn’t exist at that ref” unconditionally. This will misreport results for non-404 failures (rate limit, auth, network, directory path, oversized file, etc.) by turning them into “file added/deleted”. Only interpret a GitHub 404 (not found) as “missing at ref”; for any other error, return an error result (and keep the existing “both refs failed” case for two 404s / mixed outcomes as appropriate).

This issue also appears in the following locations of the same file:

line 88

line 149

Copilot · 2026-02-09T21:56:12Z

pkg/github/semantic_diff.go

+	default:
+		if fmt.Sprintf("%v", base) != fmt.Sprintf("%v", head) {
+			return []string{formatChange(path, "changed", formatValue(base), formatValue(head))}
+		}
+		return nil


Primitive value comparisons use fmt.Sprintf("%v", ...) equality, which can incorrectly hide real changes when different types stringify the same (e.g., JSON string "1" vs number 1, or "true" vs true). Compare both type and value (e.g., via reflect.DeepEqual after normalization, or explicit type checks) so type changes are always reported.

Copilot · 2026-02-09T21:56:13Z

pkg/github/semantic_diff.go

+	switch changeType {
+	case "added":
+		return fmt.Sprintf("%s: added %s", path, newVal)
+	case "removed":
+		return fmt.Sprintf("%s: removed (was %s)", path, oldVal)
+	case "changed", "changed type":
+		return fmt.Sprintf("%s: %s → %s", path, oldVal, newVal)
+	default:
+		return fmt.Sprintf("%s: %s", path, changeType)


When the changed value is at the document root and is a scalar (valid for JSON/YAML), path can be empty, producing output like : old → new. Consider rendering an explicit root label (e.g., (root)), or special-case empty paths in formatChange to avoid malformed diff lines.

Suggested change

switch changeType {

case "added":

return fmt.Sprintf("%s: added %s", path, newVal)

case "removed":

return fmt.Sprintf("%s: removed (was %s)", path, oldVal)

case "changed", "changed type":

return fmt.Sprintf("%s: %s → %s", path, oldVal, newVal)

default:

return fmt.Sprintf("%s: %s", path, changeType)

displayPath := path

if displayPath == "" {

displayPath = "(root)"

}

switch changeType {

case "added":

return fmt.Sprintf("%s: added %s", displayPath, newVal)

case "removed":

return fmt.Sprintf("%s: removed (was %s)", displayPath, oldVal)

case "changed", "changed type":

return fmt.Sprintf("%s: %s → %s", displayPath, oldVal, newVal)

default:

return fmt.Sprintf("%s: %s", displayPath, changeType)

SamMorrowDrums requested a review from a team as a code owner February 9, 2026 21:51

Copilot AI review requested due to automatic review settings February 9, 2026 21:51

Copilot started reviewing on behalf of SamMorrowDrums February 9, 2026 21:52 View session

github-actions bot and others added 2 commits February 9, 2026 21:53

chore: regenerate license files

020a51f

Auto-generated by license-check workflow

Promote yaml.v3 and go-toml/v2 to direct dependencies

cec7b00

go mod tidy correctly marks these as direct imports now that semantic_diff.go uses them for YAML and TOML parsing.

Copilot AI reviewed Feb 9, 2026

View reviewed changes

This was referenced Feb 9, 2026

Add tree-sitter structural diff for code files #1982

Open

Add symbol extraction to get_file_contents #1983

Open

Apply semantic diffs to get_commit, get_diff, and get_files #1984

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add feature-flagged compare_file_contents tool with semantic diffs #1981

Add feature-flagged compare_file_contents tool with semantic diffs #1981

Uh oh!

SamMorrowDrums commented Feb 9, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 9, 2026

Uh oh!

Copilot AI Feb 9, 2026

Uh oh!

Copilot AI Feb 9, 2026

Uh oh!

Copilot AI Feb 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Add feature-flagged compare_file_contents tool with semantic diffs #1981

Are you sure you want to change the base?

Add feature-flagged compare_file_contents tool with semantic diffs #1981

Uh oh!

Conversation

SamMorrowDrums commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Is it easier to read?

How much shorter?

Motivation

Example

Tool Design

Files

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SamMorrowDrums commented Feb 9, 2026 •

edited

Loading