@SamMorrowDrums
Collaborator

Summary

Extends the compare_file_contents tool (from #1981) with tree-sitter-based structural diffing for code files. Instead of line-based unified diffs, code files now show declaration-level changes (functions, classes, types, imports), which gives models more concise and semantically meaningful context.

Example output

For a Go file where a function body changed:

function_declaration main: modified
  --- base
  +++ head
  @@ -1,2 +1,3 @@
   func main() {
  +fmt.Println("hello")
   }

By comparison, a traditional unified diff includes file headers and carries less semantic context.

Supported languages

Go, Python, JavaScript, TypeScript, Ruby, Rust, Java, C/C++ (10 grammar variants)

Unsupported languages fall back to unified line-based diff.
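
The routing described above can be sketched as a lookup on the file extension. This is a minimal illustration, not the PR's code: the `languageByExt` table, the `chooseDiffFormat` function, and the format strings are all hypothetical names, and the real extension list may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// languageByExt is a hypothetical extension table used only for this sketch;
// the mapping shipped in the PR may contain different entries.
var languageByExt = map[string]string{
	".go":   "go",
	".py":   "python",
	".js":   "javascript",
	".ts":   "typescript",
	".tsx":  "tsx",
	".rb":   "ruby",
	".rs":   "rust",
	".java": "java",
	".c":    "c",
	".cpp":  "cpp",
}

// chooseDiffFormat returns "structural" when the extension maps to a known
// grammar and "unified" for everything else.
func chooseDiffFormat(path string) string {
	ext := strings.ToLower(filepath.Ext(path))
	if _, ok := languageByExt[ext]; ok {
		return "structural"
	}
	return "unified"
}

func main() {
	for _, p := range []string{"cmd/main.go", "README.md", "src/App.TSX"} {
		fmt.Printf("%s -> %s\n", p, chooseDiffFormat(p))
	}
}
```

Keeping the fallback as the default branch means an unknown extension can never produce an error, only a less structured diff.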

CGO requirement

Tree-sitter requires C bindings via CGO. This PR:

  • Sets CGO_ENABLED=1 in Dockerfile and goreleaser
  • Adds gcc and musl-dev to the Docker build stage
  • Removes Windows from goreleaser targets (CGO cross-compilation needs additional toolchain setup)

Testing

  • 30 unit tests covering all 8 supported languages
  • Tests for added/removed/modified/renamed declarations
  • Tests for new files, deleted files, and fallback behavior
  • All existing tests updated and passing

Dependency

Stacked on #1981. Review that PR first.


Closes part of #1973

Extends the compare_file_contents tool with AST-based structural
diffing for code files using tree-sitter. Instead of line-based diffs,
this shows declaration-level changes (functions, classes, types) which
gives models more concise and semantically meaningful context.

Supported languages: Go, Python, JavaScript, TypeScript, Ruby, Rust,
Java, C/C++.

Requires CGO_ENABLED=1 for the tree-sitter C bindings. Windows builds
are removed from goreleaser as CGO cross-compilation is not supported
without additional toolchain setup.

For unsupported languages, falls back to unified line-based diff.
@SamMorrowDrums SamMorrowDrums requested a review from a team as a code owner February 9, 2026 22:21
Copilot AI review requested due to automatic review settings February 9, 2026 22:21
The CGO-enabled binary was dynamically linked against musl libc from
Alpine, but the distroless runtime image uses glibc. This caused
'no such file or directory' at runtime because the dynamic linker
couldn't be found. Fix by passing -linkmode external -extldflags
'-static' to produce a fully static binary.
Contributor

Copilot AI left a comment


Pull request overview

Extends the existing compare_file_contents semantic diff engine to support tree-sitter-based structural diffs for code files, returning declaration-level change summaries for a set of common programming languages, while keeping unified diffs as a fallback for unsupported paths.

Changes:

  • Add tree-sitter structural diff implementation + language mapping for multiple code extensions.
  • Update semantic diff routing to use structural diffs for supported code files and adjust tests accordingly.
  • Enable CGO in build/release configs and add the go-tree-sitter dependency.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| pkg/github/structural_diff.go | Implements tree-sitter parsing + declaration extraction and change summarization for supported languages. |
| pkg/github/structural_diff_test.go | Adds structural-diff coverage across supported languages and fallback behavior. |
| pkg/github/semantic_diff.go | Routes code files to structural diff; updates DetectDiffFormat accordingly. |
| pkg/github/semantic_diff_test.go | Updates expectations for Go/code paths (structural) vs non-code (unified). |
| pkg/github/compare_file_contents_test.go | Updates tool-level expectations to match structural output for Go files. |
| go.mod / go.sum | Adds github.com/smacker/go-tree-sitter dependency. |
| Dockerfile | Enables CGO and installs build deps (gcc/musl-dev) for tree-sitter. |
| .goreleaser.yaml | Enables CGO for releases and removes Windows from release targets. |
Comments suppressed due to low confidence (1)

Dockerfile:20

  • The build stage is Alpine (musl) but the runtime stage is distroless/base-debian12 (glibc). With CGO_ENABLED=1 this will typically produce a dynamically musl-linked binary that won’t run in the Debian-based runtime image. To avoid runtime failures, build on a Debian-based golang image (glibc) to match the runtime, or switch the runtime stage to an Alpine/musl-compatible base, or explicitly build a fully static binary (e.g., via appropriate external linker flags) and use distroless/static.
# Install git and C compiler for CGO (tree-sitter)
RUN --mount=type=cache,target=/var/cache/apk \
    apk add git gcc musl-dev

# Build the server
# go build automatically download required module dependencies to /go/pkg/mod
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    --mount=type=bind,target=. \
    CGO_ENABLED=1 go build -ldflags="-s -w -linkmode external -extldflags '-static' -X main.version=${VERSION} -X main.commit=$(git rev-parse HEAD) -X main.date=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    -o /bin/github-mcp-server ./cmd/github-mcp-server

# Make a stage to run the app
FROM gcr.io/distroless/base-debian12

Comment on lines 8 to 16
  builds:
    - env:
-       - CGO_ENABLED=0
+       - CGO_ENABLED=1
      ldflags:
        - -s -w -X main.version={{.Version}} -X main.commit={{.Commit}} -X main.date={{.Date}}
      goos:
        - linux
-       - windows
        - darwin
      main: ./cmd/github-mcp-server

Copilot AI Feb 9, 2026


This PR enables CGO in release builds and removes Windows from GoReleaser targets, but the repo CI still runs script/test and go build on windows-latest. Since go-tree-sitter requires CGO (and Windows runners typically lack a GCC/Clang toolchain by default), the Windows CI job will likely fail. Consider either (1) adding build tags + a !cgo stub that falls back to unified diffs, so non-CGO builds still compile, or (2) updating CI to drop Windows or install a working CGO toolchain there.
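
Option (1) could look roughly like the sketch below. In a real layout the tree-sitter implementation would live in a file guarded by `//go:build cgo` and the stub in a file guarded by `//go:build !cgo`; to keep this example compilable as a single file without a C toolchain, a nil function variable stands in for the missing implementation. The names `structuralDiff`, `unifiedDiff`, and `diffFile` are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// structuralDiffFunc is the signature both build variants would share.
// The bool result reports whether a structural diff could be produced.
type structuralDiffFunc func(path, base, head string) (string, bool)

// structuralDiff would be assigned in a cgo-only file. It is left nil here,
// which mimics a build where tree-sitter is unavailable.
var structuralDiff structuralDiffFunc

// unifiedDiff is a placeholder for the existing line-based diff.
func unifiedDiff(path, base, head string) string {
	return fmt.Sprintf("--- a/%s\n+++ b/%s\n(line-based diff)", path, path)
}

// diffFile prefers the structural diff and falls back to the unified diff
// when the structural implementation is missing or declines the file.
func diffFile(path, base, head string) (format, diff string) {
	if structuralDiff != nil {
		if out, ok := structuralDiff(path, base, head); ok {
			return "structural", out
		}
	}
	return "unified", unifiedDiff(path, base, head)
}

func main() {
	format, diff := diffFile("main.go", "old", "new")
	fmt.Println(format)
	fmt.Println(diff)
}
```

With this shape, a non-CGO build still compiles and simply serves unified diffs for every file.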

Copilot uses AI. Check for mistakes.
Comment on lines 14 to +17
)

require github.com/smacker/go-tree-sitter v0.0.0-20240827094217-dd81d9e9be82


Copilot AI Feb 9, 2026


go.mod is not in the usual go mod tidy canonical form: the new direct dependency is in its own single-line require block. Since CI runs go mod tidy -diff, this is likely to fail. Please re-run go mod tidy so the dependency is merged into the main direct-requires block and any indirects are updated accordingly.

Suggested change
-)
-require github.com/smacker/go-tree-sitter v0.0.0-20240827094217-dd81d9e9be82
+	github.com/smacker/go-tree-sitter v0.0.0-20240827094217-dd81d9e9be82
+)

Comment on lines 440 to 462
		baseDecl, inBase := baseMap[key]
		headDecl, inHead := headMap[key]

		switch {
		case inBase && !inHead:
			changes = append(changes, fmt.Sprintf("%s %s: removed", baseDecl.Kind, baseDecl.Name))
		case !inBase && inHead:
			changes = append(changes, fmt.Sprintf("%s %s: added", headDecl.Kind, headDecl.Name))
		case baseDecl.Text != headDecl.Text:
			changes = append(changes, fmt.Sprintf("%s %s: modified", baseDecl.Kind, baseDecl.Name))
		}
	}

	return changes
}

// indexDeclarations creates a lookup map from declaration key to declaration.
// The key combines kind and name to handle same-name declarations of different kinds.
func indexDeclarations(decls []declaration) map[string]declaration {
	result := make(map[string]declaration, len(decls))
	for _, d := range decls {
		key := d.Kind + ":" + d.Name
		result[key] = d

Copilot AI Feb 9, 2026


indexDeclarations collapses declarations into a map keyed by kind:name, so any file that can contain multiple declarations sharing the same key will lose information and produce incorrect diffs. This can happen with common constructs like const a=1, b=2 (JS/TS lexical_declaration) or Go type (...) / var (...) blocks where multiple specs share one parent node and nameExtractor returns only one name. Consider emitting one declaration per spec/declarator (preferred) or incorporating node position (start byte/row) into the key and diffing as a multiset instead of a map.

Suggested change
-		baseDecl, inBase := baseMap[key]
-		headDecl, inHead := headMap[key]
-		switch {
-		case inBase && !inHead:
-			changes = append(changes, fmt.Sprintf("%s %s: removed", baseDecl.Kind, baseDecl.Name))
-		case !inBase && inHead:
-			changes = append(changes, fmt.Sprintf("%s %s: added", headDecl.Kind, headDecl.Name))
-		case baseDecl.Text != headDecl.Text:
-			changes = append(changes, fmt.Sprintf("%s %s: modified", baseDecl.Kind, baseDecl.Name))
-		}
-	}
-	return changes
-}
-// indexDeclarations creates a lookup map from declaration key to declaration.
-// The key combines kind and name to handle same-name declarations of different kinds.
-func indexDeclarations(decls []declaration) map[string]declaration {
-	result := make(map[string]declaration, len(decls))
-	for _, d := range decls {
-		key := d.Kind + ":" + d.Name
-		result[key] = d
+		baseDecls, inBase := baseMap[key]
+		headDecls, inHead := headMap[key]
+		switch {
+		case inBase && !inHead:
+			for _, d := range baseDecls {
+				changes = append(changes, fmt.Sprintf("%s %s: removed", d.Kind, d.Name))
+			}
+		case !inBase && inHead:
+			for _, d := range headDecls {
+				changes = append(changes, fmt.Sprintf("%s %s: added", d.Kind, d.Name))
+			}
+		default:
+			// Both base and head have declarations for this key. Treat them as multisets
+			// and compute added/removed/modified counts.
+			// First, match declarations with identical Text (unchanged).
+			headCounts := make(map[string]int, len(headDecls))
+			for _, d := range headDecls {
+				headCounts[d.Text]++
+			}
+			baseMatched := make([]bool, len(baseDecls))
+			headMatched := make([]bool, len(headDecls))
+			for i, bd := range baseDecls {
+				if count := headCounts[bd.Text]; count > 0 {
+					headCounts[bd.Text]--
+					baseMatched[i] = true
+					// mark one matching head declaration as used
+					for j, hd := range headDecls {
+						if !headMatched[j] && hd.Text == bd.Text {
+							headMatched[j] = true
+							break
+						}
+					}
+				}
+			}
+			// Collect unmatched declarations.
+			var unmatchedBase []declaration
+			for i, bd := range baseDecls {
+				if !baseMatched[i] {
+					unmatchedBase = append(unmatchedBase, bd)
+				}
+			}
+			var unmatchedHead []declaration
+			for i, hd := range headDecls {
+				if !headMatched[i] {
+					unmatchedHead = append(unmatchedHead, hd)
+				}
+			}
+			// Pair up unmatched base and head declarations as "modified".
+			modifiedCount := len(unmatchedBase)
+			if len(unmatchedHead) < modifiedCount {
+				modifiedCount = len(unmatchedHead)
+			}
+			for i := 0; i < modifiedCount; i++ {
+				d := unmatchedBase[i]
+				changes = append(changes, fmt.Sprintf("%s %s: modified", d.Kind, d.Name))
+			}
+			// Remaining unmatched base declarations are "removed".
+			for i := modifiedCount; i < len(unmatchedBase); i++ {
+				d := unmatchedBase[i]
+				changes = append(changes, fmt.Sprintf("%s %s: removed", d.Kind, d.Name))
+			}
+			// Remaining unmatched head declarations are "added".
+			for i := modifiedCount; i < len(unmatchedHead); i++ {
+				d := unmatchedHead[i]
+				changes = append(changes, fmt.Sprintf("%s %s: added", d.Kind, d.Name))
+			}
+		}
+	}
+	return changes
+}
+// indexDeclarations creates a lookup map from declaration key to declarations.
+// The key combines kind and name to handle same-name declarations of different kinds.
+func indexDeclarations(decls []declaration) map[string][]declaration {
+	result := make(map[string][]declaration, len(decls))
+	for _, d := range decls {
+		key := d.Kind + ":" + d.Name
+		result[key] = append(result[key], d)

Comment on lines 404 to 416
	changes := diffDeclarations(baseDecls, headDecls)
	if len(changes) == 0 {
		return SemanticDiffResult{
			Format: DiffFormatStructural,
			Diff:   "no structural changes detected",
		}
	}

	return SemanticDiffResult{
		Format: DiffFormatStructural,
		Diff:   strings.Join(changes, "\n"),
	}
}

Copilot AI Feb 9, 2026


The PR description/example shows structural diffs including a per-declaration unified diff hunk (base/head/@@) for modified declarations, but the current implementation only outputs summary lines like function_declaration X: modified without any intra-declaration diff. Either update the implementation to include the intended per-declaration diff output (or at least the changed declaration text), or adjust the PR description/docs/tests so the advertised output matches actual behavior.

Comment on lines 14 to 18
)

require github.com/smacker/go-tree-sitter v0.0.0-20240827094217-dd81d9e9be82

require (

Copilot AI Feb 9, 2026


A new third-party dependency is introduced here. This repo’s CI includes script/licenses-check, which regenerates and diffs third-party-licenses.*.md and third-party/; without committing the regenerated outputs, the license-check workflow will fail. Please run ./script/licenses and commit any updated license artifacts alongside the dependency change.

Modified declarations now show what changed inside them:
- Inline line-level diffs using LCS for precise change detection
- Whitespace-normalized comparison for brace languages (Go, JS, etc.)
  so indentation-only changes collapse to '(whitespace/formatting only)'
- Exact comparison for whitespace-significant languages (Python)
- Recursive nesting into classes/modules to pinpoint which method changed
  e.g. 'class Dog: modified > method bark: modified > line change'
- Max depth of 5 prevents unbounded recursion
- Added method_definition to JS/TS/TSX declaration kinds for class methods

var_declaration and const_declaration nodes contain spec children
(var_spec, const_spec) with name fields. The Go name extractor now
handles these the same way as type_declaration, so nested diffs show
'var_declaration storeLine' instead of '_var_declaration_25'.

@Mister-g666 Mister-g666 left a comment


sammorrowdrums/tree-sitter-semantic-diff
