fix: handle bytes values in string column statistics from Parquet #2995
Rationale for this change
When using `add_files()` to import Parquet files written by DuckDB into PyIceberg tables, the operation fails with `AttributeError: 'bytes' object has no attribute 'encode'`. This occurs because the Parquet specification defines column statistics (`min_value`, `max_value`) as binary data.
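For context, a minimal sketch of how those statistics surface through PyArrow; the file path is a placeholder, and whether the values come back as `bytes` or `str` depends on the writer and the column's logical type annotation:

```python
import pyarrow.parquet as pq

# Inspect the min/max statistics of the first column in the first row group.
# The path is a placeholder for a DuckDB-written Parquet file.
metadata = pq.ParquetFile("/tmp/names.parquet").metadata
stats = metadata.row_group(0).column(0).statistics

# Depending on the writer, these values may be raw bytes rather than str.
if stats is not None and stats.has_min_max:
    print(type(stats.min), stats.min)
    print(type(stats.max), stats.max)
```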
This change is a follow-up improvement to #1354, which fixed handling of missing column statistics. This PR addresses the case where statistics are present but returned as `bytes` instead of `str`. It is also indirectly related to the broader DuckDB-PyIceberg interoperability challenges documented in duckdb/duckdb#12958.
When PyArrow reads these statistics from Parquet files, it may return them as Python `bytes` objects rather than decoded `str` values, which is valid per the Parquet spec. However, PyIceberg's `StatsAggregator` expected string column statistics to be `str` objects, causing failures when processing files from writers like DuckDB that expose this binary representation.

This PR fixes the issue by adding proper handling for `bytes` values in string column statistics (see the sketch after this list):

- `StatsAggregator.min_as_bytes()`: decode bytes to UTF-8 before truncation and serialization
- `StatsAggregator.max_as_bytes()`: decode bytes to UTF-8 before processing (previously raised `ValueError`)
- `to_bytes()` for `StringType`: add a defensive `isinstance` check as a safety fallback

This improves interoperability with DuckDB and other Parquet writers that expose statistics in their binary form.
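A minimal sketch of the decode-first pattern described above. The helper names and the simplified truncation are illustrative only; they are not PyIceberg's actual implementation (real upper-bound truncation is more involved than a plain slice):

```python
from __future__ import annotations


def _ensure_str(value: str | bytes) -> str:
    """Parquet stores min/max statistics as binary; some writers (e.g. DuckDB)
    surface them as raw bytes, so decode to UTF-8 before treating them as str."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def string_stat_as_bytes(stat: str | bytes, truncate_length: int | None = None) -> bytes:
    """Normalize a string column statistic, optionally truncate it, and
    re-encode it for serialization, mirroring the decode-before-encode flow
    the PR adds to StatsAggregator.min_as_bytes()/max_as_bytes()."""
    value = _ensure_str(stat)
    if truncate_length is not None:
        value = value[:truncate_length]  # simplified; shown for illustration only
    return value.encode("utf-8")
```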
Are these changes tested?
Yes, this PR includes unit tests that verify:
- `StatsAggregator.min_as_bytes()` correctly handles `bytes` values for string columns
- `StatsAggregator.max_as_bytes()` correctly handles `bytes` values for string columns
- `to_bytes()` for `StringType` correctly handles both `str` and `bytes` inputs

The tests ensure the fix works correctly while maintaining backward compatibility with existing `str`-based statistics.
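An illustrative pytest-style sketch of the behavior these tests cover; the PR's actual tests call `StatsAggregator.min_as_bytes()`, `StatsAggregator.max_as_bytes()`, and `to_bytes()` for `StringType` directly, whose exact signatures are not reproduced here:

```python
import pytest


def normalize_string_stat(value):
    """Stand-in for the fixed behavior: accept str or bytes, return UTF-8 bytes."""
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    return value.encode("utf-8")


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("duckdb", b"duckdb"),          # existing str path keeps working
        (b"duckdb", b"duckdb"),         # bytes path no longer raises
        ("caf\u00e9", b"caf\xc3\xa9"),  # non-ASCII values round-trip as UTF-8
    ],
)
def test_string_statistics_accept_str_and_bytes(raw, expected):
    assert normalize_string_stat(raw) == expected
```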
Are there any user-facing changes?
Yes. This is a bug fix that enables users to successfully use `add_files()` with Parquet files written by DuckDB and potentially other writers that return statistics as `bytes`.

Before: `add_files()` would fail with `AttributeError: 'bytes' object has no attribute 'encode'` when processing DuckDB-written Parquet files.

After: `add_files()` correctly processes Parquet files regardless of whether statistics are returned as `str` or `bytes`.

This change is backward compatible: existing workflows with Parquet files that return `str` statistics will continue to work as before.
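For reference, a minimal end-to-end example of the workflow this fix unblocks. The catalog name, table identifier, and file path are placeholders, and the target Iceberg table is assumed to already exist with a schema matching the Parquet file:

```python
import duckdb
from pyiceberg.catalog import load_catalog

# Write a Parquet file with DuckDB; its string column statistics may surface
# to PyArrow as raw bytes.
duckdb.sql("COPY (SELECT 'duckdb' AS name) TO '/tmp/names.parquet' (FORMAT PARQUET)")

# Register the file with an existing Iceberg table via PyIceberg.
catalog = load_catalog("default")
table = catalog.load_table("examples.names")
table.add_files(file_paths=["/tmp/names.parquet"])  # previously raised AttributeError
```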