Conversation
Pull request overview
Improves manifest-list caching to prevent quadratic memory growth by deduplicating cached ManifestFile objects by manifest_path, addressing the memory issue described in #2325.
Changes:
- Reworked manifest caching to store individual `ManifestFile` instances keyed by `manifest_path` (instead of caching whole manifest-list tuples); see the sketch below the list.
- Updated/added tests to validate `ManifestFile` identity reuse across repeated reads and across overlapping manifest lists.
- Added benchmark tests to measure cache memory growth and deduplication behavior.
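A minimal sketch of the caching strategy described above, assuming `cachetools.LRUCache`; `cached_manifest_file` and the `load` callable are illustrative stand-ins, not the PR's actual function names:

```python
from threading import Lock
from typing import Callable

from cachetools import LRUCache

from pyiceberg.manifest import ManifestFile

# Module-level cache keyed by manifest_path, shared across manifest-list reads.
_manifest_cache: LRUCache = LRUCache(maxsize=128)
_manifest_cache_lock = Lock()


def cached_manifest_file(manifest_path: str, load: Callable[[str], ManifestFile]) -> ManifestFile:
    """Return the cached ManifestFile for this path, calling `load` only on a cache miss."""
    with _manifest_cache_lock:
        cached = _manifest_cache.get(manifest_path)
    if cached is not None:
        return cached
    manifest = load(manifest_path)  # read and parse the manifest outside the lock
    with _manifest_cache_lock:
        _manifest_cache[manifest_path] = manifest
    return manifest
```

Because the cache key is the manifest's immutable `manifest_path`, two manifest lists that share a manifest resolve to the same cached object instead of two copies.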
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `pyiceberg/manifest.py` | Changes the cache strategy to dedupe `ManifestFile` objects by `manifest_path` and adds a lock for cache access. |
| `tests/utils/test_manifest.py` | Updates the existing cache test and adds new unit tests for cross-manifest-list deduplication (see the test sketch below the table). |
| `tests/benchmark/test_memory_benchmark.py` | Adds benchmark tests intended to reproduce and guard against the memory-growth behavior. |
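The identity-reuse checks mentioned for `tests/utils/test_manifest.py` boil down to asserting that re-reading a manifest list returns the same objects. A minimal sketch, assuming a `table` fixture and the `Snapshot.manifests(io)` accessor (not the PR's actual test code):

```python
def test_manifest_file_identity_reuse(table) -> None:
    # Re-reading the same snapshot's manifest list should return the same
    # cached ManifestFile instances rather than freshly parsed copies.
    first = table.current_snapshot().manifests(table.io)
    second = table.current_snapshot().manifests(table.io)
    assert len(first) == len(second)
    for a, b in zip(first, second):
        assert a is b
```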
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Context quoted from the diff:

> https://github.com/apache/iceberg-python/issues/2325
>
> The issue: When caching manifest lists as tuples, overlapping ManifestFile objects are duplicated across cache entries, causing O(N²) memory growth instead of O(N).
pyiceberg/manifest.py (outdated)

```python
# Global cache for ManifestFile objects, keyed by manifest_path.
# This deduplicates ManifestFile objects across manifest lists, which commonly
# share manifests after append operations.
_manifest_cache: LRUCache[str, ManifestFile] = LRUCache(maxsize=512)
```
Why bump this up from 128 -> 512? (It's okay to say it's arbitrary.)
Good catch. Now that we're only caching ManifestFile objects, they have a relatively small memory footprint. We were caching manifest lists before, each pointing to many, many ManifestFiles.
Also, #2952 should make this configurable.
The target size of a Manifest is 8MB. 512*8MB=4GB, which seems high. Should we keep this at 128 (1GB) until we make this configurable?
Sounds reasonable! Let's do it.
Fokko left a comment:
Thanks @kevinjqliu for the extensive explanation and testing. This looks good to me, but maybe we should reduce the cache size. Let me know what you think!
Thanks for reviewing, @Fokko @jayceslesar @rambleraptor @geruh 😄
Rationale for this change
Fixes part of #2325.
Context: #2325 (comment)
Cache ManifestFile objects instead of manifest lists (tuples of ManifestFiles).
This PR fixes the O(N²) cache growth, restoring the expected O(N) linear growth pattern.
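To make the growth pattern concrete, a back-of-the-envelope count (illustrative numbers only): after N appends, the i-th manifest list references roughly i manifests, so caching whole manifest-list tuples retains N(N+1)/2 ManifestFile references, while caching by manifest_path retains only N:

```python
def cached_entries(num_appends: int) -> tuple[int, int]:
    """Count retained ManifestFile references under both caching strategies."""
    # Old strategy: cache whole manifest lists; the i-th list holds i manifests.
    per_list = sum(range(1, num_appends + 1))  # N(N+1)/2 -> O(N^2)
    # New strategy: cache each ManifestFile once, keyed by its manifest_path.
    per_path = num_appends                     # N -> O(N)
    return per_list, per_path


print(cached_entries(50))  # (1275, 50)
```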
Are these changes tested?
Yes, with benchmark tests (`tests/benchmark/test_memory_benchmark.py`); a sketch of the measurement approach follows the result links.

- Result running from the main branch: https://gist.github.com/kevinjqliu/970f4b51a12aaa0318a2671173430736
- Result running from this branch: https://gist.github.com/kevinjqliu/24990d18d2cea2fa468597c16bfa27fd
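A minimal sketch of the kind of measurement such a benchmark performs, assuming a `run_iteration` callable that stands in for the tests' actual append-and-read loop:

```python
import gc
import tracemalloc
from typing import Callable


def measure_growth(run_iteration: Callable[[], None], iterations: int = 50) -> list[int]:
    """Run the workload repeatedly and record net allocated bytes after each iteration."""
    tracemalloc.start()
    samples: list[int] = []
    for _ in range(iterations):
        run_iteration()  # e.g. append rows, then re-read the table's manifest lists
        gc.collect()
        current, _peak = tracemalloc.get_traced_memory()
        samples.append(current)
    tracemalloc.stop()
    return samples
```

With the old tuple-level cache the per-iteration deltas keep growing; with per-ManifestFile caching they should stay roughly flat.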
Benchmark Comparison: main vs kevinjqliu/fix-manifest-cache

Benchmarks compared:
- test_manifest_cache_memory_growth
- test_memory_after_gc_with_cache_cleared
- test_manifest_cache_deduplication_efficiency

Memory Growth Benchmark (50 append operations)
Memory at Each Iteration
This fix reduces memory growth by ~67%, bringing per-iteration growth from ~27 KB down to ~9 KB.
The improvement comes from caching individual `ManifestFile` objects by their `manifest_path` instead of caching entire manifest-list tuples. This deduplicates `ManifestFile` objects that appear in multiple manifest lists (which is common after appends).

Are there any user-facing changes?