
feat: add reference graph for Python#1460

Merged
KRRT7 merged 57 commits into main from call-graphee on Feb 19, 2026

Conversation

KRRT7 (Collaborator) commented on Feb 12, 2026

Summary

  • Add a persistent SQLite-backed reference graph that indexes function call edges using Jedi, with file-hash-based caching and parallel indexing
  • Expose ReferenceGraph in codeflash/languages/python/ behind a DependencyResolver protocol, removing is_python() gating from the optimizer
  • Rich Live display for index building with project-relative paths and dependency summary
  • Two flat human-readable DB tables (indexed_files, call_edges) with full text keys
  • Skip reference graph in CI where the cache DB doesn't persist
  • Simplify compat.py by removing unnecessary class wrapper
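
The file-hash-based caching in the first bullet can be sketched roughly as follows. The `indexed_files` table name comes from the PR description, but the `file_path`/`file_hash` columns and the `file_needs_reindex` helper are illustrative assumptions, not the merged code:

```python
import hashlib
import sqlite3
from pathlib import Path

def file_needs_reindex(db: sqlite3.Connection, path: Path) -> bool:
    """Return True if the file's content hash differs from the cached one.

    Hypothetical sketch: the real ReferenceGraph schema and helper names
    in codeflash may differ.
    """
    content_hash = hashlib.sha256(path.read_bytes()).hexdigest()
    row = db.execute(
        "SELECT file_hash FROM indexed_files WHERE file_path = ?", (str(path),)
    ).fetchone()
    # Reindex when the file was never indexed or its content changed.
    return row is None or row[0] != content_hash
```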

Test plan

  • 16 unit tests in tests/test_reference_graph.py covering indexing, caching, cross-file edges, persistence
  • uv run prek run --from-ref origin/main passes

KRRT7 and others added 19 commits February 10, 2026 04:57
Store only the type string instead of the full Jedi Name object,
removing the need for arbitrary_types_allowed and the runtime
dependency on jedi in the model layer.
Introduces CallGraph that uses Jedi infer()+goto() to build call edges,
stores them in codeflash_cache.db with content-hash invalidation, and
serves as a drop-in replacement for get_function_sources_from_jedi().
Create CallGraph in Optimizer.run() for Python runs, pass it through
FunctionOptimizer to code_context_extractor where it replaces
get_function_sources_from_jedi() calls when available.
Covers same-file calls, cross-file calls, class instantiation,
nested function exclusion, module-level exclusion, site-packages
exclusion, empty/syntax-error files, and cache persistence.
Replace the simple progress bar with a Live + Tree + Panel display
that shows files being analyzed, call edges discovered, cache hits,
and summary stats during call graph indexing.
…cy summary

Add cross-file edge detection to IndexResult, replace tree sub-entries
with flat per-file dependency labels using plain language, and add a
post-indexing summary panel showing per-function dependency stats.
Use the call graph to sort functions by callee count (most dependencies
first) in --all mode without benchmarks, replacing arbitrary ordering.
Separate Jedi analysis (CPU-bound) from DB persistence so uncached files
can be analyzed across multiple worker processes. Files are dispatched to
a pool of up to 8 workers when >= 8 need indexing, with sequential
fallback for small batches or on pool failure.
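
A rough sketch of that dispatch policy, with a placeholder `analyze_file` standing in for the CPU-bound Jedi analysis step (both the function name and return shape are assumptions, not the PR's API):

```python
from concurrent.futures import ProcessPoolExecutor

def analyze_file(path: str) -> tuple[str, int]:
    # Placeholder for the real per-file Jedi analysis; returns a dummy result.
    return (path, len(path))

def index_files(paths: list[str], min_parallel: int = 8, max_workers: int = 8) -> list[tuple[str, int]]:
    # Small batches are analyzed sequentially, mirroring the commit message.
    if len(paths) < min_parallel:
        return [analyze_file(p) for p in paths]
    try:
        with ProcessPoolExecutor(max_workers=min(max_workers, len(paths))) as pool:
            return list(pool.map(analyze_file, paths))
    except Exception:
        # Sequential fallback on pool failure.
        return [analyze_file(p) for p in paths]
```

Keeping DB persistence out of `analyze_file` (as the commit describes) is what makes the workers safe: only the pure analysis runs in child processes, and results are written to SQLite from the parent.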
Use bounded deque for results, batch updates every 8 results with manual
refresh to reduce flicker, and filter source_files to Python-only before
passing to the call graph indexer.
Add DependencyResolver protocol and IndexResult to base.py, move
call_graph.py to languages/python/, and use factory method in optimizer
instead of is_python() gating.
…ve paths in call graph

Display file paths relative to project root in the call graph live
display for easier navigation. Filter indexed files by the language
support's file extensions to avoid processing irrelevant file types.
…h sections

Split the runtime estimate and PR message into separate log lines to
avoid awkward line wrapping. Add console rules between sections for
clearer visual separation.
…bles

Replace the normalized relational hierarchy (cg_projects → cg_languages →
cg_indexed_files/cg_call_edges) with two self-describing tables (indexed_files,
call_edges) where every row includes project_root and language as text columns.
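
As a sketch, the self-describing layout might look like the DDL below. Only the `indexed_files`/`call_edges` table names and the `project_root`/`language` text columns come from the commit message; every other column name is an assumption for illustration:

```python
import sqlite3

# Hypothetical DDL for the two flat tables; not the PR's exact schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS indexed_files (
    project_root TEXT NOT NULL,
    language     TEXT NOT NULL,
    file_path    TEXT NOT NULL,
    file_hash    TEXT NOT NULL,
    PRIMARY KEY (project_root, language, file_path)
);
CREATE TABLE IF NOT EXISTS call_edges (
    project_root    TEXT NOT NULL,
    language        TEXT NOT NULL,
    caller_file     TEXT NOT NULL,
    caller_qualname TEXT NOT NULL,
    callee_file     TEXT NOT NULL,
    callee_qualname TEXT NOT NULL
);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
```

Because every row carries `project_root` and `language`, each table can be read on its own without joining through project/language lookup tables, which is what makes the rows human-readable.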
Skip dependency resolver creation in CI environments where the cache DB
doesn't persist between runs. Also apply ruff formatting to call_graph.py.
@KRRT7 KRRT7 changed the title from "feat: add persistent call graph with language support layer" to "feat: add call graph for python" on Feb 12, 2026

claude bot (Contributor) commented on Feb 12, 2026

PR Review Summary

Prek Checks

✅ All checks pass (ruff check and ruff format both pass).

Mypy

✅ No new mypy errors introduced by this PR. The new reference_graph.py has zero mypy errors. All errors found in changed files are pre-existing.

Code Review

Still-open bug (from prior review):

  • codeflash/languages/python/support.py:37-38 — fs.jedi_definition was removed from FunctionSource (replaced with definition_type: str | None), but function_sources_to_helpers() still accesses fs.jedi_definition.line. This will raise AttributeError at runtime when the reference graph resolver is enabled. (See existing comment)

Resolved from prior reviews:

  • FunctionSource constructor calls in code_context_extractor.py — jedi_definition= keyword args removed
  • get_code_optimization_context() now accepts call_graph keyword argument
  • ✅ Token limits updated to 64K
  • call_graph_summary uses batch count_callees_per_function
  • count_callees_per_function uses (file_path, qualified_name) tuple keys

No new critical issues found in the latest changes. The reference graph feature is currently disabled in optimizer.py (commented out), so the support.py bug won't trigger at runtime until it's enabled.

Test Coverage

| File | Stmts | Miss | Cover | Status |
| --- | ---: | ---: | ---: | --- |
| cli_cmds/console.py | 179 | 133 | 26% | Modified |
| code_utils/code_replacer.py | 410 | 71 | 83% | Modified |
| code_utils/compat.py | 12 | 0 | 100% | Modified |
| code_utils/config_consts.py | 58 | 7 | 88% | Modified |
| discovery/functions_to_optimize.py | 549 | 165 | 70% | Modified |
| languages/__init__.py | 20 | 14 | 30% | Modified |
| languages/base.py | 127 | 2 | 98% | Modified |
| languages/javascript/support.py | 956 | 249 | 74% | Modified |
| languages/python/__init__.py | 3 | 0 | 100% | Modified |
| languages/python/context/code_context_extractor.py | 634 | 47 | 93% | Modified |
| languages/python/context/unused_definition_remover.py | 483 | 28 | 94% | Modified |
| languages/python/reference_graph.py | 274 | 76 | 72% | New file |
| languages/python/support.py | 286 | 140 | 51% | Modified |
| models/models.py | 626 | 138 | 78% | Modified |
| optimization/function_optimizer.py | 1169 | 950 | 19% | Modified |
| optimization/optimizer.py | 446 | 361 | 19% | Modified |

Overall project coverage: 79%

Coverage notes:

  • New file reference_graph.py has 72% coverage (close to the 75% threshold) — the main uncovered paths are the parallel indexing worker functions and some error handling branches
  • console.py (26%) has low coverage but most of the new code is Rich UI display logic (call_graph_live_display, call_graph_summary) which is difficult to unit test
  • optimizer.py and function_optimizer.py have low coverage (19%) but this is pre-existing — they are integration-heavy modules
  • 475 new test lines were added in tests/test_reference_graph.py with thorough unit and integration tests for the new ReferenceGraph class

Test results: 2411 passed, 8 failed (all in test_tracer.py — pre-existing, unrelated to this PR), 57 skipped


Last updated: 2026-02-19T07:50:00Z

KRRT7 and others added 4 commits February 18, 2026 21:51
Deeply nested expression trees (e.g. large dict/list literals) at module
or class level caused the recursive ast.NodeVisitor to exceed Python's
default recursion limit. Replace the FunctionWithReturnStatement visitor
class with an iterative stack-based traversal.
The optimized code achieves a **26% runtime improvement** by making the AST traversal in `function_has_return_statement` more targeted and efficient.

**Key Optimization:**

The critical change is in how `function_has_return_statement` traverses the AST when searching for `Return` nodes:

**Original approach:**
```python
stack.extend(ast.iter_child_nodes(node))
```
This visits *all* child nodes including expressions, names, constants, and other non-statement nodes.

**Optimized approach:**
```python
for child in ast.iter_child_nodes(node):
    if isinstance(child, ast.stmt):
        stack.append(child)
```
This only pushes statement nodes onto the stack, since `Return` is a statement type (`ast.stmt`).

**Why This Is Faster:**

1. **Reduced Node Traversal**: In typical Python functions, there are many more expression nodes (variable references, literals, operators, etc.) than statement nodes. For example, a simple `return x + y` has 1 Return statement but multiple Name and BinOp expression nodes underneath. The optimization skips all the expression-level nodes.

2. **Lower Python Overhead**: Fewer nodes in the stack means fewer loop iterations, fewer `isinstance` checks on non-Return nodes, and less list manipulation overhead.

3. **Preserved Correctness**: Since `Return` nodes are always statements in Python's AST (they inherit from `ast.stmt`), filtering to only statement nodes cannot miss any Return nodes.

**Performance Impact by Test Case:**

The optimization shows particularly strong gains for:
- **Functions without returns** (up to 91% faster): Early termination without traversing deep expression trees
- **Large codebases** (34-41% faster on tests with 1000+ functions): The cumulative effect across many function bodies
- **Functions with complex expressions but no returns** (82% faster): Avoiding expensive traversal of unused expression subtrees
- **Generator functions without explicit returns** (64% faster): Skipping yield expression internals

The optimization maintains correctness across all test cases including nested classes, async functions, properties, and various control structures, while delivering consistent runtime improvements.
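
Putting the pieces together, a self-contained version of the statement-filtered traversal might look like the sketch below (an illustration of the approach described above, not necessarily the exact merged code):

```python
import ast

def function_has_return_statement(function_node: ast.AST) -> bool:
    """Iteratively search a function body for a Return statement."""
    stack = [function_node]
    while stack:
        node = stack.pop()
        if isinstance(node, ast.Return):
            return True
        # Only statements can contain a Return, so skip expression subtrees.
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.stmt):
                stack.append(child)
    return False
```

Because the traversal is an explicit stack rather than a recursive `NodeVisitor`, it also sidesteps the recursion-limit failure on deeply nested literals described in the earlier commit.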
codeflash-ai bot (Contributor) commented on Feb 18, 2026

⚡️ Codeflash found optimizations for this PR

📄 26% (0.26x) speedup for find_functions_with_return_statement in codeflash/discovery/functions_to_optimize.py

⏱️ Runtime: 12.0 milliseconds → 9.48 milliseconds (best of 46 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch call-graphee).


KRRT7 and others added 2 commits February 18, 2026 17:24
Replace per-function SQL loops in get_callees() and count_callees_per_function()
with temp table JOINs, and thread resolved path strings through to avoid
redundant resolve() calls.
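
The temp-table JOIN pattern can be sketched as below; the `call_edges` column names and the `wanted` temp table are assumptions for illustration, not the PR's exact code:

```python
import sqlite3

def count_callees(
    db: sqlite3.Connection, functions: list[tuple[str, str]]
) -> dict[tuple[str, str], int]:
    """Batch-count callees for (file_path, qualified_name) pairs in one query."""
    db.execute("CREATE TEMP TABLE IF NOT EXISTS wanted (file_path TEXT, qualname TEXT)")
    db.execute("DELETE FROM wanted")
    db.executemany("INSERT INTO wanted VALUES (?, ?)", functions)
    # One JOIN replaces a per-function loop of SELECTs; LEFT JOIN keeps
    # functions with zero callees in the result.
    rows = db.execute(
        """
        SELECT w.file_path, w.qualname, COUNT(e.callee_qualname)
        FROM wanted w
        LEFT JOIN call_edges e
          ON e.caller_file = w.file_path AND e.caller_qualname = w.qualname
        GROUP BY w.file_path, w.qualname
        """
    ).fetchall()
    return {(f, q): n for f, q, n in rows}
```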
…2026-02-18T22.22.36

⚡️ Speed up function `find_functions_with_return_statement` by 26% in PR #1460 (`call-graphee`)
codeflash-ai bot (Contributor) commented on Feb 18, 2026

This PR is now faster! 🚀 @KRRT7 accepted my optimizations from:

The optimized code achieves a **146% speedup** (from 1.47ms to 595μs) by eliminating the overhead of `ast.iter_child_nodes()` and replacing it with direct field access on AST nodes.

**Key optimizations:**

1. **Direct stack initialization**: Instead of starting with `[function_node]` and then traversing into its body, the stack is initialized directly with `list(function_node.body)`. This skips one iteration and avoids processing the function definition wrapper itself.

2. **Manual field traversal**: Rather than calling `ast.iter_child_nodes(node)` which is a generator that yields all child nodes, the code directly accesses `node._fields` and uses `getattr()` to inspect each field. This eliminates the generator overhead and function call costs associated with `ast.iter_child_nodes()`.

3. **Targeted statement filtering**: By checking `isinstance(child, ast.stmt)` or `isinstance(item, ast.stmt)` only on relevant fields (handling both single statements and lists of statements), the traversal focuses on statement nodes where `ast.Return` can appear, avoiding unnecessary checks on expression nodes.

**Why this is faster:**

- **Reduced function call overhead**: `ast.iter_child_nodes()` is a generator function that incurs call/yield overhead on every iteration. Direct attribute access via `getattr()` is faster for small numbers of fields.
- **Fewer iterations**: The line profiler shows the original code's `ast.iter_child_nodes()` line hit 5,453 times (69% of runtime), while the optimized version's field iteration hits only 3,290 times (17.4% of runtime).
- **Better cache locality**: Direct field access patterns may benefit from better CPU cache utilization compared to generator state management.

**Test case performance:**

The optimization shows dramatic improvements particularly for:
- **Functions with many sequential statements** (2365% faster for 1000 statements, 1430% faster for 1000 nested functions)
- **Simple functions** (234-354% faster for basic return detection)
- **Moderately complex control flow** (80-125% faster for nested conditionals/loops)

The speedup is consistent across all test cases, with early-return scenarios benefiting the most as the optimization allows faster discovery of the return statement before processing unnecessary nodes.
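
A simplified stand-in for the field-level traversal described above (the real merged function may differ; the `ast.excepthandler` check is added here so the sketch doesn't miss returns inside `except` blocks):

```python
import ast

def function_has_return_statement(function_node: ast.FunctionDef) -> bool:
    """Search for a Return via direct _fields access instead of iter_child_nodes."""
    # Start from the body statements, skipping the def wrapper itself.
    stack = list(function_node.body)
    while stack:
        node = stack.pop()
        if isinstance(node, ast.Return):
            return True
        for field in node._fields:
            value = getattr(node, field, None)
            # Handle both single statements and lists of statements;
            # excepthandler is included because except bodies hold statements.
            if isinstance(value, (ast.stmt, ast.excepthandler)):
                stack.append(value)
            elif isinstance(value, list):
                for item in value:
                    if isinstance(item, (ast.stmt, ast.excepthandler)):
                        stack.append(item)
    return False
```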
codeflash-ai bot (Contributor) commented on Feb 18, 2026

⚡️ Codeflash found optimizations for this PR

📄 147% (1.47x) speedup for function_has_return_statement in codeflash/discovery/functions_to_optimize.py

⏱️ Runtime: 1.47 milliseconds → 595 microseconds (best of 58 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch call-graphee).


…2026-02-18T22.34.56

⚡️ Speed up function `function_has_return_statement` by 147% in PR #1460 (`call-graphee`)
codeflash-ai bot (Contributor) commented on Feb 18, 2026

This PR is now faster! 🚀 @KRRT7 accepted my optimizations from:

# Conflicts:
#	.codex/skills/.gitignore
#	.gemini/skills/.gitignore
#	codeflash/languages/python/context/code_context_extractor.py
Add DependencyResolver parameter back to get_code_optimization_context()
that was lost during file move from codeflash/context/ to
codeflash/languages/python/context/. When call_graph is available, use it
for helper discovery instead of Jedi-based fallback.
@KRRT7 KRRT7 changed the title from "feat: add call graph for python" to "feat: add reference graph for Python" on Feb 19, 2026
@KRRT7 KRRT7 merged commit 3dabd44 into main on Feb 19, 2026
26 of 28 checks passed
@KRRT7 KRRT7 deleted the call-graphee branch February 19, 2026 07:52
KRRT7 added a commit that referenced this pull request Feb 19, 2026
feat: add reference graph for Python
