Add tree-sitter structural diff for code files #1982
base: sammorrowdrums/compare-file-contents
Extends the compare_file_contents tool with AST-based structural diffing for code files using tree-sitter. Instead of line-based diffs, this shows declaration-level changes (functions, classes, types) which gives models more concise and semantically meaningful context. Supported languages: Go, Python, JavaScript, TypeScript, Ruby, Rust, Java, C/C++. Requires CGO_ENABLED=1 for the tree-sitter C bindings. Windows builds are removed from goreleaser as CGO cross-compilation is not supported without additional toolchain setup. For unsupported languages, falls back to unified line-based diff.
The CGO-enabled binary was dynamically linked against musl libc from Alpine, but the distroless runtime image uses glibc. This caused 'no such file or directory' at runtime because the dynamic linker couldn't be found. Fix by passing -linkmode external -extldflags '-static' to produce a fully static binary.
Pull request overview
Extends the existing compare_file_contents semantic diff engine to support tree-sitter-based structural diffs for code files, returning declaration-level change summaries for a set of common programming languages, while keeping unified diffs as a fallback for unsupported paths.
Changes:
- Add tree-sitter structural diff implementation + language mapping for multiple code extensions.
- Update semantic diff routing to use structural diffs for supported code files and adjust tests accordingly.
- Enable CGO in build/release configs and add the go-tree-sitter dependency.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/github/structural_diff.go | Implements tree-sitter parsing + declaration extraction and change summarization for supported languages. |
| pkg/github/structural_diff_test.go | Adds structural-diff coverage across supported languages and fallback behavior. |
| pkg/github/semantic_diff.go | Routes code files to structural diff; updates DetectDiffFormat accordingly. |
| pkg/github/semantic_diff_test.go | Updates expectations for Go/code paths (structural) vs non-code (unified). |
| pkg/github/compare_file_contents_test.go | Updates tool-level expectations to match structural output for Go files. |
| go.mod / go.sum | Adds github.com/smacker/go-tree-sitter dependency. |
| Dockerfile | Enables CGO and installs build deps (gcc/musl-dev) for tree-sitter. |
| .goreleaser.yaml | Enables CGO for releases and removes Windows from release targets. |
Comments suppressed due to low confidence (1)
Dockerfile:20
- The build stage is Alpine (musl) but the runtime stage is distroless/base-debian12 (glibc). With CGO_ENABLED=1 this will typically produce a dynamically musl-linked binary that won’t run in the Debian-based runtime image. To avoid runtime failures, build on a Debian-based golang image (glibc) to match the runtime, or switch the runtime stage to an Alpine/musl-compatible base, or explicitly build a fully static binary (e.g., via appropriate external linker flags) and use distroless/static.
# Install git and C compiler for CGO (tree-sitter)
RUN --mount=type=cache,target=/var/cache/apk \
    apk add git gcc musl-dev
# Build the server
# go build automatically downloads required module dependencies to /go/pkg/mod
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    --mount=type=bind,target=. \
    CGO_ENABLED=1 go build -ldflags="-s -w -linkmode external -extldflags '-static' -X main.version=${VERSION} -X main.commit=$(git rev-parse HEAD) -X main.date=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    -o /bin/github-mcp-server ./cmd/github-mcp-server
# Make a stage to run the app
FROM gcr.io/distroless/base-debian12
.goreleaser.yaml
 builds:
   - env:
-      - CGO_ENABLED=0
+      - CGO_ENABLED=1
     ldflags:
       - -s -w -X main.version={{.Version}} -X main.commit={{.Commit}} -X main.date={{.Date}}
     goos:
       - linux
-      - windows
       - darwin
     main: ./cmd/github-mcp-server
Copilot
AI
Feb 9, 2026
This PR enables CGO in release builds and removes Windows from GoReleaser targets, but the repo CI still runs script/test and go build on windows-latest. Since go-tree-sitter requires CGO (and Windows runners typically lack a GCC/Clang toolchain by default), the Windows CI job will likely fail. Consider either (1) adding build tags + a !cgo stub that falls back to unified diffs, so non-CGO builds still compile, or (2) updating CI to drop Windows or install a working CGO toolchain there.
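Option (1) can be sketched roughly as follows. A single runnable file cannot carry two opposing build constraints, so this sketch models the split with a package-level hook: in a real layout, a file guarded by `//go:build cgo` would assign the tree-sitter implementation in `init()`, and builds without CGO would leave the hook nil. The names `DiffResult`, `structuralDiff`, `unifiedDiff`, and `compare` are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// DiffResult is a hypothetical stand-in for the PR's SemanticDiffResult.
type DiffResult struct {
	Format string
	Diff   string
}

// structuralDiff is a hook. Assumption: a cgo-only file would set it in
// init(); in builds without cgo it stays nil.
var structuralDiff func(path string, base, head []byte) (DiffResult, bool)

// unifiedDiff is a placeholder for the existing line-based diff.
func unifiedDiff(path string, base, head []byte) DiffResult {
	return DiffResult{
		Format: "unified",
		Diff:   fmt.Sprintf("--- a/%s\n+++ b/%s", path, path),
	}
}

// compare prefers the structural differ when one is registered and it
// accepts the file; otherwise it falls back to the unified diff.
func compare(path string, base, head []byte) DiffResult {
	if structuralDiff != nil {
		if res, ok := structuralDiff(path, base, head); ok {
			return res
		}
	}
	return unifiedDiff(path, base, head)
}

func main() {
	// No cgo-backed differ is registered here, so the fallback is used.
	fmt.Println(compare("main.go", []byte("a"), []byte("b")).Format)
}
```

With this shape, a non-CGO build (such as the Windows CI job) would still compile and would simply return unified diffs.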
go.mod
)

require github.com/smacker/go-tree-sitter v0.0.0-20240827094217-dd81d9e9be82
Copilot
AI
Feb 9, 2026
go.mod is not in the usual go mod tidy canonical form: the new direct dependency is in its own single-line require block. Since CI runs go mod tidy -diff, this is likely to fail. Please re-run go mod tidy so the dependency is merged into the main direct-requires block and any indirects are updated accordingly.
Suggested change:
- )
- require github.com/smacker/go-tree-sitter v0.0.0-20240827094217-dd81d9e9be82
+ 	github.com/smacker/go-tree-sitter v0.0.0-20240827094217-dd81d9e9be82
+ )
		baseDecl, inBase := baseMap[key]
		headDecl, inHead := headMap[key]

		switch {
		case inBase && !inHead:
			changes = append(changes, fmt.Sprintf("%s %s: removed", baseDecl.Kind, baseDecl.Name))
		case !inBase && inHead:
			changes = append(changes, fmt.Sprintf("%s %s: added", headDecl.Kind, headDecl.Name))
		case baseDecl.Text != headDecl.Text:
			changes = append(changes, fmt.Sprintf("%s %s: modified", baseDecl.Kind, baseDecl.Name))
		}
	}

	return changes
}

// indexDeclarations creates a lookup map from declaration key to declaration.
// The key combines kind and name to handle same-name declarations of different kinds.
func indexDeclarations(decls []declaration) map[string]declaration {
	result := make(map[string]declaration, len(decls))
	for _, d := range decls {
		key := d.Kind + ":" + d.Name
		result[key] = d
Copilot
AI
Feb 9, 2026
indexDeclarations collapses declarations into a map keyed by kind:name, so any file that can contain multiple declarations sharing the same key will lose information and produce incorrect diffs. This can happen with common constructs like const a=1, b=2 (JS/TS lexical_declaration) or Go type (...) / var (...) blocks where multiple specs share one parent node and nameExtractor returns only one name. Consider emitting one declaration per spec/declarator (preferred) or incorporating node position (start byte/row) into the key and diffing as a multiset instead of a map.
Suggested change:
		baseDecls, inBase := baseMap[key]
		headDecls, inHead := headMap[key]
		switch {
		case inBase && !inHead:
			for _, d := range baseDecls {
				changes = append(changes, fmt.Sprintf("%s %s: removed", d.Kind, d.Name))
			}
		case !inBase && inHead:
			for _, d := range headDecls {
				changes = append(changes, fmt.Sprintf("%s %s: added", d.Kind, d.Name))
			}
		default:
			// Both base and head have declarations for this key. Treat them as multisets
			// and compute added/removed/modified counts.
			// First, match declarations with identical Text (unchanged).
			headCounts := make(map[string]int, len(headDecls))
			for _, d := range headDecls {
				headCounts[d.Text]++
			}
			baseMatched := make([]bool, len(baseDecls))
			headMatched := make([]bool, len(headDecls))
			for i, bd := range baseDecls {
				if count := headCounts[bd.Text]; count > 0 {
					headCounts[bd.Text]--
					baseMatched[i] = true
					// mark one matching head declaration as used
					for j, hd := range headDecls {
						if !headMatched[j] && hd.Text == bd.Text {
							headMatched[j] = true
							break
						}
					}
				}
			}
			// Collect unmatched declarations.
			var unmatchedBase []declaration
			for i, bd := range baseDecls {
				if !baseMatched[i] {
					unmatchedBase = append(unmatchedBase, bd)
				}
			}
			var unmatchedHead []declaration
			for i, hd := range headDecls {
				if !headMatched[i] {
					unmatchedHead = append(unmatchedHead, hd)
				}
			}
			// Pair up unmatched base and head declarations as "modified".
			modifiedCount := len(unmatchedBase)
			if len(unmatchedHead) < modifiedCount {
				modifiedCount = len(unmatchedHead)
			}
			for i := 0; i < modifiedCount; i++ {
				d := unmatchedBase[i]
				changes = append(changes, fmt.Sprintf("%s %s: modified", d.Kind, d.Name))
			}
			// Remaining unmatched base declarations are "removed".
			for i := modifiedCount; i < len(unmatchedBase); i++ {
				d := unmatchedBase[i]
				changes = append(changes, fmt.Sprintf("%s %s: removed", d.Kind, d.Name))
			}
			// Remaining unmatched head declarations are "added".
			for i := modifiedCount; i < len(unmatchedHead); i++ {
				d := unmatchedHead[i]
				changes = append(changes, fmt.Sprintf("%s %s: added", d.Kind, d.Name))
			}
		}
	}
	return changes
}
// indexDeclarations creates a lookup map from declaration key to declarations.
// The key combines kind and name to handle same-name declarations of different kinds.
func indexDeclarations(decls []declaration) map[string][]declaration {
	result := make(map[string][]declaration, len(decls))
	for _, d := range decls {
		key := d.Kind + ":" + d.Name
		result[key] = append(result[key], d)
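The information loss described in this comment can be reproduced in isolation. The sketch below is illustrative only: the `declaration` struct carries just the three fields used in the snippets above, `indexSingle` and `indexMulti` are hypothetical names for the two indexing strategies, and the input assumes an extractor that reports two `init` functions from one Go file (which Go permits) under the same kind and name.

```go
package main

import "fmt"

// declaration carries only the fields used in the reviewed snippet;
// the real type in the PR may have more.
type declaration struct {
	Kind, Name, Text string
}

// indexSingle keeps one declaration per kind:name key, as in the reviewed code.
// A later declaration with the same key silently overwrites the earlier one.
func indexSingle(decls []declaration) map[string]declaration {
	result := make(map[string]declaration, len(decls))
	for _, d := range decls {
		result[d.Kind+":"+d.Name] = d
	}
	return result
}

// indexMulti keeps every declaration that shares a key, as in the suggestion.
func indexMulti(decls []declaration) map[string][]declaration {
	result := make(map[string][]declaration, len(decls))
	for _, d := range decls {
		key := d.Kind + ":" + d.Name
		result[key] = append(result[key], d)
	}
	return result
}

func main() {
	// Assumed extractor output for a file with two init functions.
	decls := []declaration{
		{Kind: "function_declaration", Name: "init", Text: "func init() { a() }"},
		{Kind: "function_declaration", Name: "init", Text: "func init() { b() }"},
	}
	fmt.Println(len(indexSingle(decls)))                                // 1: the first init is lost
	fmt.Println(len(indexMulti(decls)["function_declaration:init"])) // 2: both are kept
}
```

The slice-valued index keeps both entries, which is what makes the multiset comparison in the suggested change possible.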
pkg/github/structural_diff.go
Outdated
	changes := diffDeclarations(baseDecls, headDecls)
	if len(changes) == 0 {
		return SemanticDiffResult{
			Format: DiffFormatStructural,
			Diff:   "no structural changes detected",
		}
	}

	return SemanticDiffResult{
		Format: DiffFormatStructural,
		Diff:   strings.Join(changes, "\n"),
	}
}
Copilot
AI
Feb 9, 2026
The PR description/example shows structural diffs including a per-declaration unified diff hunk (base/head/@@) for modified declarations, but the current implementation only outputs summary lines like function_declaration X: modified without any intra-declaration diff. Either update the implementation to include the intended per-declaration diff output (or at least the changed declaration text), or adjust the PR description/docs/tests so the advertised output matches actual behavior.
go.mod
)

require github.com/smacker/go-tree-sitter v0.0.0-20240827094217-dd81d9e9be82

require (
Copilot
AI
Feb 9, 2026
A new third-party dependency is introduced here. This repo’s CI includes script/licenses-check, which regenerates and diffs third-party-licenses.*.md and third-party/; without committing the regenerated outputs, the license-check workflow will fail. Please run ./script/licenses and commit any updated license artifacts alongside the dependency change.
Modified declarations now show what changed inside them:
- Inline line-level diffs using LCS for precise change detection
- Whitespace-normalized comparison for brace languages (Go, JS, etc.) so indentation-only changes collapse to '(whitespace/formatting only)'
- Exact comparison for whitespace-significant languages (Python)
- Recursive nesting into classes/modules to pinpoint which method changed, e.g. 'class Dog: modified > method bark: modified > line change'
- Max depth of 5 prevents unbounded recursion
- Added method_definition to JS/TS/TSX declaration kinds for class methods
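As a rough illustration of the LCS-based inline diff and the whitespace-normalized comparison described in this commit, here is a minimal sketch. It is not the PR's implementation: `normalize` and `lineDiff` are hypothetical names, and the sketch returns an empty result for indentation-only edits, which a caller could then report as '(whitespace/formatting only)'.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize collapses runs of whitespace so indentation-only edits compare equal.
func normalize(line string) string {
	return strings.Join(strings.Fields(line), " ")
}

// lineDiff returns "- "/"+ " prefixed lines for the lines that differ between
// base and head, using a longest-common-subsequence table. When ignoreSpace
// is true (brace languages), lines are compared after normalization; when
// false (whitespace-significant languages), they are compared exactly.
func lineDiff(base, head []string, ignoreSpace bool) []string {
	eq := func(a, b string) bool {
		if ignoreSpace {
			return normalize(a) == normalize(b)
		}
		return a == b
	}
	// lcs[i][j] holds the LCS length of base[i:] and head[j:].
	lcs := make([][]int, len(base)+1)
	for i := range lcs {
		lcs[i] = make([]int, len(head)+1)
	}
	for i := len(base) - 1; i >= 0; i-- {
		for j := len(head) - 1; j >= 0; j-- {
			switch {
			case eq(base[i], head[j]):
				lcs[i][j] = lcs[i+1][j+1] + 1
			case lcs[i+1][j] >= lcs[i][j+1]:
				lcs[i][j] = lcs[i+1][j]
			default:
				lcs[i][j] = lcs[i][j+1]
			}
		}
	}
	// Walk the table from the start, emitting removals and additions.
	var out []string
	i, j := 0, 0
	for i < len(base) && j < len(head) {
		switch {
		case eq(base[i], head[j]):
			i, j = i+1, j+1
		case lcs[i+1][j] >= lcs[i][j+1]:
			out = append(out, "- "+base[i])
			i++
		default:
			out = append(out, "+ "+head[j])
			j++
		}
	}
	for ; i < len(base); i++ {
		out = append(out, "- "+base[i])
	}
	for ; j < len(head); j++ {
		out = append(out, "+ "+head[j])
	}
	return out
}

func main() {
	base := []string{"func bark() {", "  return \"woof\"", "}"}
	head := []string{"func bark() {", "\treturn \"WOOF\"", "}"}
	fmt.Println(strings.Join(lineDiff(base, head, true), "\n"))
}
```

Passing `ignoreSpace` as a per-language flag is one way to model the brace-language versus Python distinction; the quadratic table is acceptable here because it is applied to a single declaration at a time rather than to whole files.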
var_declaration and const_declaration nodes contain spec children (var_spec, const_spec) with name fields. The Go name extractor now handles these the same way as type_declaration, so nested diffs show 'var_declaration storeLine' instead of '_var_declaration_25'.
Mister-g666
left a comment
sammorrowdrums/tree-sitter-semantic-diff
Summary
Extends the compare_file_contents tool (from #1981) with tree-sitter-based structural diffing for code files. Instead of line-based unified diffs, code files now show declaration-level changes (functions, classes, types, imports) which gives models more concise and semantically meaningful context.
Example output
For a Go file where a function body changed:
vs a traditional unified diff that includes file headers and less semantic context.
Supported languages
Go, Python, JavaScript, TypeScript, Ruby, Rust, Java, C/C++ (10 grammar variants)
Unsupported languages fall back to unified line-based diff.
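A minimal sketch of how extension-based routing with a unified-diff fallback might look. The extension table and the names `structuralExts` and `diffFormat` are assumptions for illustration; the PR's actual mapping (10 grammar variants) and its DetectDiffFormat function may differ, and the existing engine may route other file types to formats not shown here.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// structuralExts is an illustrative extension table, not the PR's mapping.
var structuralExts = map[string]string{
	".go": "go", ".py": "python", ".js": "javascript", ".ts": "typescript",
	".tsx": "tsx", ".rb": "ruby", ".rs": "rust", ".java": "java",
	".c": "c", ".cpp": "cpp",
}

// diffFormat reports "structural" for supported code files and
// "unified" for everything else.
func diffFormat(path string) string {
	ext := strings.ToLower(filepath.Ext(path))
	if _, ok := structuralExts[ext]; ok {
		return "structural"
	}
	return "unified"
}

func main() {
	for _, p := range []string{"cmd/main.go", "lib/app.RB", "README.md"} {
		fmt.Printf("%s -> %s\n", p, diffFormat(p))
	}
}
```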
CGO requirement
Tree-sitter requires C bindings via CGO. This PR:
- Sets CGO_ENABLED=1 in Dockerfile and goreleaser
- Adds gcc and musl-dev to the Docker build stage
Testing
Dependency
Stacked on #1981. Review that PR first.
Closes part of #1973