Add a memory bound FileStatisticsCache for the Listing Table #20047

Open
mkleen wants to merge 31 commits into apache:main from mkleen:file-stats-cache

Conversation

mkleen (Contributor) commented Jan 28, 2026

Which issue does this PR close?

This change introduces a default FileStatisticsCache implementation for the ListingTable with a size limit, following the steps outlined in #19052 (comment).

Rationale for this change

See above.

What changes are included in this PR?

See above.

Are these changes tested?

Yes.

Are there any user-facing changes?

A new runtime setting: datafusion.runtime.file_statistics_cache_limit
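Setting it to "0K" disables the cache; for example, in sqllogictest syntax (as exercised in the tests discussed later in this thread):

statement ok
set datafusion.runtime.file_statistics_cache_limit = "0K";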

github-actions bot added the documentation, core, sqllogictest, catalog, common, and execution labels Jan 28, 2026
github-actions bot removed the documentation label Jan 28, 2026
mkleen force-pushed the file-stats-cache branch 2 times, most recently from e273afc to b297378 on January 28, 2026
github-actions bot added the documentation label Jan 28, 2026
mkleen marked this pull request as ready for review January 28, 2026
mkleen changed the title from "Add a default FileStatisticsCache implementation for the ListingTable" to "Add a default FileStatisticsCache with a size limit" Jan 28, 2026
mkleen changed the title to "Add a FileStatisticsCache with a size limit" Jan 28, 2026
mkleen changed the title to "Add FileStatisticsCache with a size limit" Jan 28, 2026
mkleen changed the title to "Add a memory bound FileStatisticsCache with a size limit" Jan 29, 2026
mkleen changed the title to "Add a memory bound FileStatisticsCache for the Listing Table" Jan 31, 2026
kosiew (Contributor) left a comment

@mkleen Thanks for working on this.

mkleen (Contributor, Author) commented Feb 4, 2026

@kosiew Thank you for the feedback!

@mkleen mkleen requested a review from kosiew February 4, 2026 12:10
kosiew (Contributor) left a comment

LGTM

mkleen (Contributor, Author) commented Feb 10, 2026

@kosiew Anything else needed to get this merged? Another approval maybe?

impl<T: DFHeapSize> DFHeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // Arc stores weak and strong counts on the heap alongside an instance of T
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}
martin-g (Member) left a comment

This won't be accurate.

let a1 = Arc::new(vec![1, 2, 3]);
let a2 = a1.clone();
let a3 = a1.clone();
let a4 = a3.clone();

// All four Arcs point to the same allocation, so the real heap usage is
// a1.heap_size() counted once, but the current implementation counts each
// clone separately and reports four times that amount:
assert_eq!(
    a1.heap_size() + a2.heap_size() + a3.heap_size() + a4.heap_size(),
    4 * a1.heap_size()
);

The only solution I can imagine is for the caller to keep track of the pointer addresses that have already been sized and to ignore any Arc pointing to an address that was sized earlier.
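
For illustration, a minimal sketch of that idea, assuming a hypothetical arc_heap_size helper built on the PR's DFHeapSize trait (this is not the PR's actual code):

use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

// Hypothetical helper: count an Arc's allocation only the first time its
// address is seen; clones of an already-counted Arc contribute nothing.
fn arc_heap_size<T: DFHeapSize>(arc: &Arc<T>, seen: &mut HashSet<usize>) -> usize {
    let addr = Arc::as_ptr(arc) as usize;
    if !seen.insert(addr) {
        return 0; // already counted via another clone
    }
    // Strong/weak counts live on the heap alongside the value itself.
    2 * size_of::<usize>() + size_of::<T>() + arc.as_ref().heap_size()
}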

mkleen (Contributor, Author) replied

Good catch! I took this implementation from https://github.com/apache/arrow-rs/blob/main/parquet/src/file/metadata/memory.rs#L97-L102. I would suggest doing a follow-up here as well. We are planning to restructure the whole heap size estimation anyway.

    builder.with_object_list_cache_ttl(Some(duration))
}
"file_statistics_cache_limit" => {
    let limit = Self::parse_memory_limit(value)?;
martin-g (Member) left a comment

Not caused by this PR, but parse_memory_limit() panics when the value is an empty string (attempt to subtract with overflow). This needs to be improved either in this PR or in a follow-up.
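
For illustration, the panic suggests the unit suffix is split off with something like value.len() - 1, which underflows on an empty string. A guarded variant could look like this sketch (the parsing details here are assumptions, not DataFusion's actual implementation):

fn parse_memory_limit(value: &str) -> Result<usize, String> {
    // Guard the empty string up front; `value.len() - 1` below would
    // otherwise underflow ("attempt to subtract with overflow").
    if value.is_empty() {
        return Err("memory limit cannot be empty".to_string());
    }
    let (digits, unit) = value.split_at(value.len() - 1);
    let n: usize = digits
        .trim()
        .parse()
        .map_err(|e| format!("invalid memory limit '{value}': {e}"))?;
    match unit {
        "K" => Ok(n * 1024),
        "M" => Ok(n * 1024 * 1024),
        "G" => Ok(n * 1024 * 1024 * 1024),
        other => Err(format!("unsupported unit '{other}' in '{value}'")),
    }
}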

mkleen (Contributor, Author) replied

I will do a follow-up on this.

mkleen (Contributor, Author) commented Feb 10, 2026

@martin-g Thanks for this great review! I am on it.

@mkleen mkleen requested a review from martin-g February 12, 2026 19:41
  num_columns: 1,
  table_size_bytes: Precision::Absent,
- statistics_size_bytes: 0,
+ statistics_size_bytes: 304,
mkleen (Contributor, Author) commented

This is because the statistics size estimate changed.

alamb (Contributor) commented Feb 13, 2026

@nuno-faria perhaps you have some time to review this PR as well?

nuno-faria (Contributor) left a comment

Thanks @mkleen. I do have some reservations related to the ordering information for Parquet files, but maybe I'm missing something.

Comment on lines 644 to 650
+-----------------------------------+-----------------+---------------------+------+------------------+
| filename | file_size_bytes | metadata_size_bytes | hits | extra |
+-----------------------------------+-----------------+---------------------+------+------------------+
| alltypes_plain.parquet | 1851 | 8882 | 5 | page_index=false |
| alltypes_plain.parquet | 1851 | 8882 | 8 | page_index=false |
| alltypes_tiny_pages.parquet | 454233 | 269266 | 2 | page_index=true |
| lz4_raw_compressed_larger.parquet | 380836 | 1347 | 3 | page_index=false |
| lz4_raw_compressed_larger.parquet | 380836 | 1347 | 4 | page_index=false |
+-----------------------------------+-----------------+---------------------+------+------------------+
nuno-faria (Contributor):

Isn't this a regression? Each scan now appears to require two reads of the metadata cache. I did a quick check and noticed that the extra read is caused by the list_files_for_scan function in datafusion_catalog_listing not having access to the ordering information, meaning it needs to call do_collect_statistics_and_ordering every time.

let (statistics, ordering) = if self.options.collect_stat {
    self.do_collect_statistics_and_ordering(ctx, &store, &part_file)

mkleen (Contributor, Author) replied

Thanks for pointing this out. Do you have a suggestion on how to improve this?

///
/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
#[derive(Default)]
pub struct DefaultFileStatisticsCache {
nuno-faria (Contributor):

Should this now be moved (or renamed) to its own file_statistics_cache.rs, similar to the other caches?

}
}

impl DefaultFileStatisticsCacheState {
nuno-faria (Contributor):

Also, there are now 3 similar "LRU + memory limit" cache implementations (metadata, list files, file statistics). Maybe one day they could be merged into a generic one.
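
For illustration, the unified version might have roughly this shape (LruQueue and DFHeapSize follow the PR's types; everything else here is an assumption, not existing DataFusion code):

use std::hash::Hash;

// Hypothetical generic "LRU + memory limit" cache that the metadata,
// list-files, and file-statistics caches could all share.
pub struct MemoryBoundedLruCache<K, V> {
    lru_queue: LruQueue<K, V>,
    memory_used: usize,
    memory_limit: usize,
}

impl<K: Eq + Hash + Clone + DFHeapSize, V: DFHeapSize> MemoryBoundedLruCache<K, V> {
    pub fn put(&mut self, key: K, value: V) -> Option<V> {
        // Size the entry, insert it, update memory_used, then evict
        // least-recently-used entries while memory_used > memory_limit.
        todo!()
    }

    pub fn get(&mut self, key: &K) -> Option<&V> {
        // Promote the entry to most-recently-used and return it.
        todo!()
    }
}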

mkleen (Contributor, Author) replied

Yes, I had the same thought. Shall I create a follow-up ticket for that?

Comment on lines +111 to +117
let old_value = self.lru_queue.put(key.clone(), value);
self.memory_used += entry_size;

if let Some(old_entry) = &old_value {
    self.memory_used -= old_entry.heap_size();
} else {
    self.memory_used += key.as_ref().heap_size();
nuno-faria (Contributor):

nit: I think the code would be easier to read if key.as_ref().heap_size() were added into the initial self.memory_used += entry_size; and then, if an older entry exists, subtracted together with old_entry.heap_size(), removing the else branch.
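
A sketch of that suggestion, using the same variables as the snippet above (an illustration, not the PR's final code):

// Count the key's size unconditionally up front...
self.memory_used += entry_size + key.as_ref().heap_size();
let old_value = self.lru_queue.put(key.clone(), value);
if let Some(old_entry) = &old_value {
    // ...and when the key was already present, undo the duplicate key
    // accounting together with the replaced value's size.
    self.memory_used -= old_entry.heap_size() + key.as_ref().heap_size();
}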

Comment on lines +223 to 233
  let cached = entry.1.clone();
  entries.insert(
-     path.clone(),
+     path,
      FileStatisticsCacheEntry {
          object_meta: cached.meta.clone(),
          num_rows: cached.statistics.num_rows,
          num_columns: cached.statistics.column_statistics.len(),
          table_size_bytes: cached.statistics.total_byte_size,
-         statistics_size_bytes: 0, // TODO: set to the real size in the future
+         statistics_size_bytes: cached.statistics.heap_size(),
          has_ordering: cached.ordering.is_some(),
      },
nuno-faria (Contributor):

I think the clone on cached is not necessary.

Comment on lines +180 to +183
# Disable file statistics cache because file statistics have been previously created
statement ok
set datafusion.runtime.file_statistics_cache_limit = "0K";

nuno-faria (Contributor):

I don't understand the need to disable the cache in this test, as well as in the other two (parquet_filter_pushdown.slt, array.slt).

I tried commenting this out and the plan changed in a query:

-02)--DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=A/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=C/2.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=B/1.parquet]]}, projection=[int_col, bigint_col, partition_col], output_ordering=[partition_col@2 ASC NULLS LAST, int_col@0 ASC NULLS LAST, bigint_col@1 ASC NULLS LAST], file_type=parquet
+02)--DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=A/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=B/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_sorted_statistics/test_table/partition_col=C/2.parquet]]}, projection=[int_col, bigint_col, partition_col], file_type=parquet

The output_ordering information is missing, which might be related to this comment https://github.com/apache/datafusion/pull/20047/changes#r2807891883.

mkleen (Contributor, Author) replied

Thanks for the feedback. I will look into this.

mkleen (Contributor, Author) commented Feb 14, 2026

@nuno-faria Thank you for this review!


Labels

catalog, common, core, documentation, execution, sqllogictest

Development

Successfully merging this pull request may close these issues:

Add a default FileStatisticsCache implementation for the ListingTable
Add limit to DefaultFileStatisticsCache

5 participants