fix: handle bytes values in string column statistics from Parquet #2995
Rationale for this change
When using `add_files()` to import Parquet files written by DuckDB into PyIceberg tables, the operation fails with `AttributeError: 'bytes' object has no attribute 'encode'`. This occurs because the Parquet specification defines column statistics (`min_value`, `max_value`) as binary data.
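For context, a minimal sketch of how those statistics surface through PyArrow; the file path is a placeholder, and whether the values come back as `bytes` or `str` depends on the writer and the column's logical type annotation:

```python
import pyarrow.parquet as pq

# Inspect the min/max statistics of the first column in the first row group.
# The path is a placeholder for a DuckDB-written Parquet file.
metadata = pq.ParquetFile("/tmp/names.parquet").metadata
stats = metadata.row_group(0).column(0).statistics

# Depending on the writer, these values may be raw bytes rather than str.
if stats is not None and stats.has_min_max:
    print(type(stats.min), stats.min)
    print(type(stats.max), stats.max)
```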
This change is a follow-up improvement to #1354, which fixed handling of missing column statistics. This PR addresses the case where statistics are present but returned as `bytes` instead of `str`. It is also indirectly related to the broader DuckDB-PyIceberg interoperability challenges documented in duckdb/duckdb#12958.
When PyArrow reads these statistics from Parquet files, it may return them as Python `bytes` objects rather than decoded `str` values, which is valid per the Parquet spec. However, PyIceberg's `StatsAggregator` expected string column statistics to be `str` objects, causing failures when processing files from writers like DuckDB that expose this binary representation.

This PR fixes the issue by adding proper handling for `bytes` values in string column statistics (see the sketch after this list):

- `StatsAggregator.min_as_bytes()`: decode bytes to UTF-8 before truncation and serialization
- `StatsAggregator.max_as_bytes()`: decode bytes to UTF-8 before processing (previously raised `ValueError`)
- `to_bytes()` for `StringType`: add a defensive `isinstance` check as a safety fallback

This improves interoperability with DuckDB and other Parquet writers that expose statistics in their binary form.
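A minimal sketch of the decode-first pattern described above. The helper names and the simplified truncation are illustrative only; they are not PyIceberg's actual implementation (real upper-bound truncation is more involved than a plain slice):

```python
from __future__ import annotations


def _ensure_str(value: str | bytes) -> str:
    """Parquet stores min/max statistics as binary; some writers (e.g. DuckDB)
    surface them as raw bytes, so decode to UTF-8 before treating them as str."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def string_stat_as_bytes(stat: str | bytes, truncate_length: int | None = None) -> bytes:
    """Normalize a string column statistic, optionally truncate it, and
    re-encode it for serialization, mirroring the decode-before-encode flow
    the PR adds to StatsAggregator.min_as_bytes()/max_as_bytes()."""
    value = _ensure_str(stat)
    if truncate_length is not None:
        value = value[:truncate_length]  # simplified; shown for illustration only
    return value.encode("utf-8")
```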
Are these changes tested?
Yes, this PR includes unit tests that verify:
- `StatsAggregator.min_as_bytes()` correctly handles `bytes` values for string columns
- `StatsAggregator.max_as_bytes()` correctly handles `bytes` values for string columns
- `to_bytes()` for `StringType` correctly handles both `str` and `bytes` inputs

The tests ensure the fix works correctly while maintaining backward compatibility with existing `str`-based statistics.
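An illustrative pytest-style sketch of the behavior these tests cover; the PR's actual tests call `StatsAggregator.min_as_bytes()`, `StatsAggregator.max_as_bytes()`, and `to_bytes()` for `StringType` directly, whose exact signatures are not reproduced here:

```python
import pytest


def normalize_string_stat(value):
    """Stand-in for the fixed behavior: accept str or bytes, return UTF-8 bytes."""
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    return value.encode("utf-8")


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("duckdb", b"duckdb"),          # existing str path keeps working
        (b"duckdb", b"duckdb"),         # bytes path no longer raises
        ("caf\u00e9", b"caf\xc3\xa9"),  # non-ASCII values round-trip as UTF-8
    ],
)
def test_string_statistics_accept_str_and_bytes(raw, expected):
    assert normalize_string_stat(raw) == expected
```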
Are there any user-facing changes?
Yes. This is a bug fix that enables users to successfully use `add_files()` with Parquet files written by DuckDB and potentially other writers that return statistics as `bytes`.

Before: `add_files()` would fail with `AttributeError: 'bytes' object has no attribute 'encode'` when processing DuckDB-written Parquet files.

After: `add_files()` correctly processes Parquet files regardless of whether statistics are returned as `str` or `bytes`.

This change is backward compatible: existing workflows with Parquet files that return `str` statistics will continue to work as before.
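For reference, a minimal end-to-end example of the workflow this fix unblocks. The catalog name, table identifier, and file path are placeholders, and the target Iceberg table is assumed to already exist with a schema matching the Parquet file:

```python
import duckdb
from pyiceberg.catalog import load_catalog

# Write a Parquet file with DuckDB; its string column statistics may surface
# to PyArrow as raw bytes.
duckdb.sql("COPY (SELECT 'duckdb' AS name) TO '/tmp/names.parquet' (FORMAT PARQUET)")

# Register the file with an existing Iceberg table via PyIceberg.
catalog = load_catalog("default")
table = catalog.load_table("examples.names")
table.add_files(file_paths=["/tmp/names.parquet"])  # previously raised AttributeError
```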