@GabrielAmazonas GabrielAmazonas commented Feb 1, 2026

Rationale for this change

When using add_files() to import Parquet files written by DuckDB into PyIceberg tables, the operation fails with AttributeError: 'bytes' object has no attribute 'encode'.

This occurs because the Parquet specification defines column statistics (min_value, max_value) as binary data:

struct Statistics {
  5: optional binary max_value;
  6: optional binary min_value;
}

This change is a follow-up improvement to #1354, which fixed handling of missing column statistics. This PR addresses the case where statistics are present but returned as bytes instead of str.

It is also indirectly related to the broader DuckDB-PyIceberg interoperability challenges documented in duckdb/duckdb#12958.

When PyArrow reads these statistics from a Parquet file, it may return them as Python bytes objects rather than decoded str values, which is valid per the Parquet spec. However, PyIceberg's StatsAggregator assumed that statistics for string columns are always str objects, so it failed on files from writers like DuckDB that expose the binary representation.
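
The failure mode can be reproduced without Parquet or PyArrow. The following is a minimal sketch, assuming a serializer that only ever expects str; `encode_string_stat` and `try_encode` are hypothetical names for illustration and are not PyIceberg's actual implementation.

```python
def encode_string_stat(value):
    """Hypothetical str-only serializer: assumes value is always a str."""
    return value.encode("utf-8")


def try_encode(value):
    """Return the encoded value, or the error message if encoding fails."""
    try:
        return encode_string_stat(value)
    except AttributeError as exc:
        return str(exc)


str_result = try_encode("duck")     # str input encodes normally
bytes_result = try_encode(b"duck")  # bytes objects have no .encode() method
print(str_result)
print(bytes_result)
```

The second call yields the same AttributeError message reported for add_files(), because bytes objects provide decode() but not encode().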

This PR fixes the issue by adding proper handling for bytes values in string column statistics:

  1. StatsAggregator.min_as_bytes(): Decode bytes to UTF-8 before truncation and serialization
  2. StatsAggregator.max_as_bytes(): Decode bytes to UTF-8 before processing (previously raised ValueError)
  3. to_bytes() for StringType: Add defensive isinstance check as a safety fallback
  4. Add unit tests covering the bytes handling in both StatsAggregator and to_bytes()
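
Steps 1 to 3 share one idea: normalize the statistic to str before any string-specific processing. The sketch below illustrates that idea under stated assumptions. `normalize_string_stat` and `lower_bound_as_bytes` are hypothetical names, and the prefix truncation is a simplification, not the code in this PR.

```python
from typing import Optional, Union


def normalize_string_stat(value: Union[str, bytes]) -> str:
    """Return the statistic as str, decoding UTF-8 bytes when needed."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


def lower_bound_as_bytes(
    current_min: Union[str, bytes, None],
    trunc_length: Optional[int] = None,
) -> Optional[bytes]:
    """Serialize a string lower bound, truncating by code points if requested."""
    if current_min is None:
        return None
    # Decode first so truncation counts characters, not raw bytes.
    text = normalize_string_stat(current_min)
    if trunc_length is not None:
        # A prefix never sorts after the original value, so it is a valid lower bound.
        text = text[:trunc_length]
    return text.encode("utf-8")


print(lower_bound_as_bytes(b"hello", 3))
print(lower_bound_as_bytes("hello", 3))
```

Decoding before truncation matters for multi-byte characters: slicing the raw bytes could cut a UTF-8 sequence in half. The sketch covers only the lower bound, because an upper bound cannot be truncated by prefix alone (a prefix sorts before the original value).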

This improves interoperability with DuckDB and other Parquet writers that expose statistics in their binary form.

Are these changes tested?

Yes, this PR includes unit tests that verify:

  • StatsAggregator.min_as_bytes() correctly handles bytes values for string columns
  • StatsAggregator.max_as_bytes() correctly handles bytes values for string columns
  • to_bytes() for StringType correctly handles both str and bytes inputs

The tests ensure the fix works correctly while maintaining backward compatibility with existing str-based statistics.
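
The third check can be expressed as a small test. This is a sketch only: `string_to_bytes` is a hypothetical stand-in for the StringType branch of to_bytes(), written so the example runs without PyIceberg installed. Passing bytes through unchanged is an assumption about how the fallback behaves.

```python
def string_to_bytes(value):
    """Hypothetical stand-in for a StringType serializer with an isinstance fallback."""
    if isinstance(value, bytes):
        # Already binary, as exposed by some Parquet writers: pass through unchanged.
        return value
    return value.encode("utf-8")


def test_string_to_bytes_accepts_str_and_bytes():
    # Both representations of the same text must serialize identically.
    assert string_to_bytes("duck") == b"duck"
    assert string_to_bytes(b"duck") == b"duck"
    # Non-ASCII text is encoded as UTF-8.
    assert string_to_bytes("\u00e9") == "\u00e9".encode("utf-8")


test_string_to_bytes_accepts_str_and_bytes()
```

Asserting on both input types in the same test is what guards backward compatibility: the str path must keep producing the same bytes it did before the fallback was added.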

Are there any user-facing changes?

Yes. This is a bug fix that lets users run add_files() on Parquet files written by DuckDB, and potentially by other writers whose statistics are returned as bytes.

Before: add_files() would fail with AttributeError: 'bytes' object has no attribute 'encode' when processing DuckDB-written Parquet files.

After: add_files() correctly processes Parquet files regardless of whether statistics are returned as str or bytes.

This change is backward compatible: existing workflows with Parquet files that return str statistics continue to work as before.

@GabrielAmazonas GabrielAmazonas force-pushed the feat/add-files-duckdb branch 3 times, most recently from a572b1d to 7c1392d Compare February 1, 2026 12:04

Problem:
When using `add_files()` with Parquet files written by DuckDB, PyIceberg
fails with `AttributeError: 'bytes' object has no attribute 'encode'`.

Root Cause:
The Parquet format stores column statistics (min_value, max_value) as binary
data in the Statistics struct (see parquet.thrift). When PyArrow reads these
statistics from Parquet files, it may return them as Python `bytes` objects
rather than decoded `str` values. This is valid per the Parquet specification:

  struct Statistics {
    5: optional binary max_value;
    6: optional binary min_value;
  }

PyIceberg's StatsAggregator expected string statistics to always be `str`,
causing failures when processing Parquet files from writers like DuckDB that
expose this binary representation.

Fix:
1. In `StatsAggregator.min_as_bytes()`: Add handling for bytes values by
   decoding to UTF-8 string before truncation and serialization.

2. In `StatsAggregator.max_as_bytes()`: Update existing string handling to
   decode bytes values before processing (was raising ValueError).

3. In `to_bytes()` for StringType: Add defensive isinstance check to handle
   bytes values as a safety fallback.

4. Add unit tests for both StatsAggregator bytes handling and to_bytes.

@GabrielAmazonas GabrielAmazonas changed the title from "Fix: Handle bytes values in string column statistics from Parquet" to "fix: handle bytes values in string column statistics from Parquet" Feb 1, 2026