Add a memory bound FileStatisticsCache for the Listing Table #20047

Open
mkleen wants to merge 31 commits into apache:main from mkleen:file-stats-cache

Conversation

mkleen (Contributor) commented Jan 28, 2026

Which issue does this PR close?

This change introduces a default FileStatisticsCache implementation for the ListingTable with a size limit, following the steps outlined in #19052 (comment).

Rationale for this change

See above.

What changes are included in this PR?

See above.

Are these changes tested?

Yes.

Are there any user-facing changes?

A new runtime setting: datafusion.runtime.file_statistics_cache_limit
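Setting it to "0K" disables the cache; for example, in sqllogictest syntax (as exercised in the tests discussed later in this thread):

statement ok
set datafusion.runtime.file_statistics_cache_limit = "0K";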

github-actions bot added the documentation, core, sqllogictest, catalog, common, and execution labels Jan 28, 2026
github-actions bot removed the documentation label Jan 28, 2026
mkleen force-pushed the file-stats-cache branch 2 times, most recently from e273afc to b297378 on January 28, 2026
github-actions bot added the documentation label Jan 28, 2026
mkleen marked this pull request as ready for review January 28, 2026
mkleen changed the title from "Add a default FileStatisticsCache implementation for the ListingTable" to "Add a default FileStatisticsCache with a size limit" Jan 28, 2026
mkleen changed the title to "Add a FileStatisticsCache with a size limit" Jan 28, 2026
mkleen changed the title to "Add FileStatisticsCache with a size limit" Jan 28, 2026
mkleen changed the title to "Add a memory bound FileStatisticsCache with a size limit" Jan 29, 2026
mkleen changed the title to "Add a memory bound FileStatisticsCache for the Listing Table" Jan 31, 2026
kosiew (Contributor) left a comment

@mkleen Thanks for working on this.

mkleen (Contributor, Author) commented Feb 4, 2026

@kosiew Thank you for the feedback!

@mkleen mkleen requested a review from kosiew February 4, 2026 12:10
kosiew (Contributor) left a comment

LGTM

mkleen (Contributor, Author) commented Feb 10, 2026

@kosiew Anything else needed to get this merged? Another approval maybe?

impl<T: DFHeapSize> DFHeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // Arc stores weak and strong counts on the heap alongside an instance of T
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}
martin-g (Member) left a comment

This won't be accurate.

let a1 = Arc::new(vec![1, 2, 3]);
let a2 = a1.clone();
let a3 = a1.clone();
let a4 = a3.clone();

// All four Arcs point to the same allocation, so the real heap usage is
// a1.heap_size() counted once, but the current implementation counts each
// clone separately and reports four times that amount:
assert_eq!(
    a1.heap_size() + a2.heap_size() + a3.heap_size() + a4.heap_size(),
    4 * a1.heap_size()
);

The only solution I can imagine is for the caller to keep track of the pointer addresses that have already been sized and to ignore any Arc pointing to an address that was sized earlier.
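
For illustration, a minimal sketch of that idea, assuming a hypothetical arc_heap_size helper built on the PR's DFHeapSize trait (this is not the PR's actual code):

use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

// Hypothetical helper: count an Arc's allocation only the first time its
// address is seen; clones of an already-counted Arc contribute nothing.
fn arc_heap_size<T: DFHeapSize>(arc: &Arc<T>, seen: &mut HashSet<usize>) -> usize {
    let addr = Arc::as_ptr(arc) as usize;
    if !seen.insert(addr) {
        return 0; // already counted via another clone
    }
    // Strong/weak counts live on the heap alongside the value itself.
    2 * size_of::<usize>() + size_of::<T>() + arc.as_ref().heap_size()
}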

mkleen (Contributor, Author) replied

Good catch! I took this implementation from https://github.com/apache/arrow-rs/blob/main/parquet/src/file/metadata/memory.rs#L97-L102. I would suggest doing a follow-up here as well. We are planning to restructure the whole heap size estimation anyway.

    builder.with_object_list_cache_ttl(Some(duration))
}
"file_statistics_cache_limit" => {
    let limit = Self::parse_memory_limit(value)?;
martin-g (Member) left a comment

Not caused by this PR, but parse_memory_limit() panics when the value is an empty string (attempt to subtract with overflow). This needs to be improved either in this PR or in a follow-up.
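
For illustration, the panic suggests the unit suffix is split off with something like value.len() - 1, which underflows on an empty string. A guarded variant could look like this sketch (the parsing details here are assumptions, not DataFusion's actual implementation):

fn parse_memory_limit(value: &str) -> Result<usize, String> {
    // Guard the empty string up front; `value.len() - 1` below would
    // otherwise underflow ("attempt to subtract with overflow").
    if value.is_empty() {
        return Err("memory limit cannot be empty".to_string());
    }
    let (digits, unit) = value.split_at(value.len() - 1);
    let n: usize = digits
        .trim()
        .parse()
        .map_err(|e| format!("invalid memory limit '{value}': {e}"))?;
    match unit {
        "K" => Ok(n * 1024),
        "M" => Ok(n * 1024 * 1024),
        "G" => Ok(n * 1024 * 1024 * 1024),
        other => Err(format!("unsupported unit '{other}' in '{value}'")),
    }
}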

mkleen (Contributor, Author) replied

I will do a follow-up on this.

mkleen (Contributor, Author) commented Feb 10, 2026

@martin-g Thanks for this great review! I am on it.

@mkleen mkleen requested a review from martin-g February 12, 2026 19:41
  num_columns: 1,
  table_size_bytes: Precision::Absent,
- statistics_size_bytes: 0,
+ statistics_size_bytes: 304,
mkleen (Contributor, Author) commented

This is because the statistics size estimate changed.

alamb (Contributor) commented Feb 13, 2026

@nuno-faria perhaps you have some time to review this PR as well?

nuno-faria (Contributor) left a comment

Thanks @mkleen. I do have some reservations related to the ordering information for Parquet files, but maybe I'm missing something.

Comment on lines 644 to 650
+-----------------------------------+-----------------+---------------------+------+------------------+
| filename | file_size_bytes | metadata_size_bytes | hits | extra |
+-----------------------------------+-----------------+---------------------+------+------------------+
| alltypes_plain.parquet | 1851 | 8882 | 5 | page_index=false |
| alltypes_plain.parquet | 1851 | 8882 | 8 | page_index=false |
| alltypes_tiny_pages.parquet | 454233 | 269266 | 2 | page_index=true |
| lz4_raw_compressed_larger.parquet | 380836 | 1347 | 3 | page_index=false |
| lz4_raw_compressed_larger.parquet | 380836 | 1347 | 4 | page_index=false |
+-----------------------------------+-----------------+---------------------+------+------------------+
nuno-faria (Contributor):

Isn't this a regression? Each scan now appears to require two reads of the metadata cache. I did a quick check and noticed that the extra read is caused by the list_files_for_scan function in datafusion_catalog_listing not having access to the ordering information, meaning it needs to call do_collect_statistics_and_ordering every time.

let (statistics, ordering) = if self.options.collect_stat {
    self.do_collect_statistics_and_ordering(ctx, &store, &part_file)

mkleen (Contributor, Author) replied

Thanks for pointing this out. Do you have a suggestion on how to improve this?

///
/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
#[derive(Default)]
pub struct DefaultFileStatisticsCache {
nuno-faria (Contributor):

Should this now be moved (or renamed) to its own file_statistics_cache.rs, similar to the other caches?

}
}

impl DefaultFileStatisticsCacheState {
nuno-faria (Contributor):

Also, there are now 3 similar "LRU + memory limit" cache implementations (metadata, list files, file statistics). Maybe one day they could be merged into a generic one.
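
For illustration, the unified version might have roughly this shape (LruQueue and DFHeapSize follow the PR's types; everything else here is an assumption, not existing DataFusion code):

use std::hash::Hash;

// Hypothetical generic "LRU + memory limit" cache that the metadata,
// list-files, and file-statistics caches could all share.
pub struct MemoryBoundedLruCache<K, V> {
    lru_queue: LruQueue<K, V>,
    memory_used: usize,
    memory_limit: usize,
}

impl<K: Eq + Hash + Clone + DFHeapSize, V: DFHeapSize> MemoryBoundedLruCache<K, V> {
    pub fn put(&mut self, key: K, value: V) -> Option<V> {
        // Size the entry, insert it, update memory_used, then evict
        // least-recently-used entries while memory_used > memory_limit.
        todo!()
    }

    pub fn get(&mut self, key: &K) -> Option<&V> {
        // Promote the entry to most-recently-used and return it.
        todo!()
    }
}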

mkleen (Contributor, Author) replied

Yes, I had the same thought. Shall I create a follow-up ticket for that?

Comment on lines +111 to +117
let old_value = self.lru_queue.put(key.clone(), value);
self.memory_used += entry_size;

if let Some(old_entry) = &old_value {
    self.memory_used -= old_entry.heap_size();
} else {
    self.memory_used += key.as_ref().heap_size();
nuno-faria (Contributor):

nit: I think the code would be easier to read if key.as_ref().heap_size() were added into the initial self.memory_used += entry_size; and then, if an older entry exists, subtracted together with old_entry.heap_size(), removing the else branch.
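
A sketch of that suggestion, using the same variables as the snippet above (an illustration, not the PR's final code):

// Count the key's size unconditionally up front...
self.memory_used += entry_size + key.as_ref().heap_size();
let old_value = self.lru_queue.put(key.clone(), value);
if let Some(old_entry) = &old_value {
    // ...and when the key was already present, undo the duplicate key
    // accounting together with the replaced value's size.
    self.memory_used -= old_entry.heap_size() + key.as_ref().heap_size();
}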

Comment on lines +223 to 233
  let cached = entry.1.clone();
  entries.insert(
-     path.clone(),
+     path,
      FileStatisticsCacheEntry {
          object_meta: cached.meta.clone(),
          num_rows: cached.statistics.num_rows,
          num_columns: cached.statistics.column_statistics.len(),
          table_size_bytes: cached.statistics.total_byte_size,
-         statistics_size_bytes: 0, // TODO: set to the real size in the future
+         statistics_size_bytes: cached.statistics.heap_size(),
          has_ordering: cached.ordering.is_some(),
      },
nuno-faria (Contributor):

I think the clone on cached is not necessary.

Comment on lines +180 to +183
# Disable file statistics cache because file statistics have been previously created
statement ok
set datafusion.runtime.file_statistics_cache_limit = "0K";

nuno-faria (Contributor):

I don't understand the need to disable the cache in this test, as well as in the other two (parquet_filter_pushdown.slt, array.slt).

I tried commenting this out and the plan changed in a query:

-02)--DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=A/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=C/2.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=B/1.parquet]]}, projection=[int_col, bigint_col, partition_col], output_ordering=[partition_col@2 ASC NULLS LAST, int_col@0 ASC NULLS LAST, bigint_col@1 ASC NULLS LAST], file_type=parquet
+02)--DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=A/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=B/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=C/2.parquet]]}, projection=[int_col, bigint_col, partition_col], file_type=parquet

The output_ordering information is missing, which might be related to this comment https://github.com/apache/datafusion/pull/20047/changes#r2807891883.

mkleen (Contributor, Author) replied

Thanks for the feedback. I will look into this.

mkleen (Contributor, Author) commented Feb 14, 2026

@nuno-faria Thank you for this review!


Labels

catalog, common, core, documentation, execution, sqllogictest

Development

Successfully merging this pull request may close these issues:

Add a default FileStatisticsCache implementation for the ListingTable
Add limit to DefaultFileStatisticsCache

5 participants