Conversation
Pull request overview
Improves manifest-list caching to prevent quadratic memory growth by deduplicating cached ManifestFile objects by manifest_path, addressing the memory issue described in #2325.
Changes:
- Reworked manifest caching to store individual `ManifestFile` instances keyed by `manifest_path` (instead of caching whole manifest-list tuples); see the sketch below the list.
- Updated/added tests to validate `ManifestFile` identity reuse across repeated reads and across overlapping manifest lists.
- Added benchmark tests to measure cache memory growth and deduplication behavior.
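A minimal sketch of the caching strategy described above, assuming `cachetools.LRUCache`; `cached_manifest_file` and the `load` callable are illustrative stand-ins, not the PR's actual function names:

```python
from threading import Lock
from typing import Callable

from cachetools import LRUCache

from pyiceberg.manifest import ManifestFile

# Module-level cache keyed by manifest_path, shared across manifest-list reads.
_manifest_cache: LRUCache = LRUCache(maxsize=128)
_manifest_cache_lock = Lock()


def cached_manifest_file(manifest_path: str, load: Callable[[str], ManifestFile]) -> ManifestFile:
    """Return the cached ManifestFile for this path, calling `load` only on a cache miss."""
    with _manifest_cache_lock:
        cached = _manifest_cache.get(manifest_path)
    if cached is not None:
        return cached
    manifest = load(manifest_path)  # read and parse the manifest outside the lock
    with _manifest_cache_lock:
        _manifest_cache[manifest_path] = manifest
    return manifest
```

Because the cache key is the manifest's immutable `manifest_path`, two manifest lists that share a manifest resolve to the same cached object instead of two copies.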
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `pyiceberg/manifest.py` | Changes the cache strategy to dedupe `ManifestFile` objects by `manifest_path` and adds a lock for cache access. |
| `tests/utils/test_manifest.py` | Updates the existing cache test and adds new unit tests for cross-manifest-list deduplication (see the test sketch below the table). |
| `tests/benchmark/test_memory_benchmark.py` | Adds benchmark tests intended to reproduce and guard against the memory-growth behavior. |
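The identity-reuse checks mentioned for `tests/utils/test_manifest.py` boil down to asserting that re-reading a manifest list returns the same objects. A minimal sketch, assuming a `table` fixture and the `Snapshot.manifests(io)` accessor (not the PR's actual test code):

```python
def test_manifest_file_identity_reuse(table) -> None:
    # Re-reading the same snapshot's manifest list should return the same
    # cached ManifestFile instances rather than freshly parsed copies.
    first = table.current_snapshot().manifests(table.io)
    second = table.current_snapshot().manifests(table.io)
    assert len(first) == len(second)
    for a, b in zip(first, second):
        assert a is b
```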
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Context quoted from the diff:

> https://github.com/apache/iceberg-python/issues/2325
>
> The issue: When caching manifest lists as tuples, overlapping ManifestFile objects are duplicated across cache entries, causing O(N²) memory growth instead of O(N).
pyiceberg/manifest.py (outdated)

```python
# Global cache for ManifestFile objects, keyed by manifest_path.
# This deduplicates ManifestFile objects across manifest lists, which commonly
# share manifests after append operations.
_manifest_cache: LRUCache[str, ManifestFile] = LRUCache(maxsize=512)
```
Why bump this up from 128 -> 512? (It's okay to say it's arbitrary.)
Good catch. Now that we're only caching ManifestFile objects, they have a relatively small memory footprint. We were caching manifest lists before, each pointing to many, many ManifestFiles.
Also, #2952 should make this configurable.
The target size of a Manifest is 8MB. 512*8MB=4GB, which seems high. Should we keep this at 128 (1GB) until we make this configurable?
Sounds reasonable! Let's do it.
Fokko left a comment:
Thanks @kevinjqliu for the extensive explanation and testing. This looks good to me, but maybe we should reduce the cache size. Let me know what you think!
Thanks for reviewing, @Fokko @jayceslesar @rambleraptor @geruh 😄
Rationale for this change
Fixes part of #2325.
Context: #2325 (comment)
Cache ManifestFile objects instead of manifest lists (tuples of ManifestFiles).
This PR fixes the O(N²) cache growth, restoring the expected O(N) linear growth pattern.
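To make the growth pattern concrete, a back-of-the-envelope count (illustrative numbers only): after N appends, the i-th manifest list references roughly i manifests, so caching whole manifest-list tuples retains N(N+1)/2 ManifestFile references, while caching by manifest_path retains only N:

```python
def cached_entries(num_appends: int) -> tuple[int, int]:
    """Count retained ManifestFile references under both caching strategies."""
    # Old strategy: cache whole manifest lists; the i-th list holds i manifests.
    per_list = sum(range(1, num_appends + 1))  # N(N+1)/2 -> O(N^2)
    # New strategy: cache each ManifestFile once, keyed by its manifest_path.
    per_path = num_appends                     # N -> O(N)
    return per_list, per_path


print(cached_entries(50))  # (1275, 50)
```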
Are these changes tested?
Yes, with benchmark tests (`tests/benchmark/test_memory_benchmark.py`); a sketch of the measurement approach follows the result links.

- Result running from the main branch: https://gist.github.com/kevinjqliu/970f4b51a12aaa0318a2671173430736
- Result running from this branch: https://gist.github.com/kevinjqliu/24990d18d2cea2fa468597c16bfa27fd
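A minimal sketch of the kind of measurement such a benchmark performs, assuming a `run_iteration` callable that stands in for the tests' actual append-and-read loop:

```python
import gc
import tracemalloc
from typing import Callable


def measure_growth(run_iteration: Callable[[], None], iterations: int = 50) -> list[int]:
    """Run the workload repeatedly and record net allocated bytes after each iteration."""
    tracemalloc.start()
    samples: list[int] = []
    for _ in range(iterations):
        run_iteration()  # e.g. append rows, then re-read the table's manifest lists
        gc.collect()
        current, _peak = tracemalloc.get_traced_memory()
        samples.append(current)
    tracemalloc.stop()
    return samples
```

With the old tuple-level cache the per-iteration deltas keep growing; with per-ManifestFile caching they should stay roughly flat.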
Benchmark Comparison: main vs kevinjqliu/fix-manifest-cache

Benchmarks compared:
- test_manifest_cache_memory_growth
- test_memory_after_gc_with_cache_cleared
- test_manifest_cache_deduplication_efficiency

Memory Growth Benchmark (50 append operations)
Memory at Each Iteration
This fix reduces memory growth by ~67%, bringing per-iteration growth from ~27 KB down to ~9 KB.
The improvement comes from caching individual `ManifestFile` objects by their `manifest_path` instead of caching entire manifest-list tuples. This deduplicates `ManifestFile` objects that appear in multiple manifest lists (which is common after appends).

Are there any user-facing changes?