# Optimize upsert performance for large datasets #2943
Conversation
🙏 🙏 Just to illustrate: we're trying to upsert 30k rows x 10 columns into another 100k rows. It should fly like nothing, but the interpreter gets killed after trying to upsert for a minute or two :/ I mean, we're not even talking big data here... We're probably going to patch up our scripts with PySpark (yuck).
I would seriously consider Rust bindings in the future. This won't fly, at least for merges :/
thanks for the pr! this is a rather large change so i'll have to take some time to understand it. i want to verify the correctness of the filtering. also, i feel like the upsert code is getting harder and harder to maintain. I want to propose an alternative solution to support upsert in pyiceberg. my general feeling is that pyiceberg is "not an engine" and will never perform upsert efficiently, but it might be able to delegate the upsert to an engine. i'll raise it as a separate discussion.
## Summary
This PR improves the performance of the `upsert()` operation, particularly for large upserts with 10,000+ rows. The changes address three main bottlenecks in the current implementation.

## Problem
The current upsert implementation has several performance issues that become significant with large datasets:
1. **Expensive match filter generation**: For composite keys, `create_match_filter()` generates `Or(And(EqualTo, EqualTo), ...)` expressions - one `And` clause per unique key combination (illustrated in the sketch after this list). With 10k+ rows, this creates big expression trees that are slow to evaluate, up to n×m leaves (n key tuples × m key columns).
2. **Per-batch insert filtering**: The insert logic filters rows using expression evaluation (`expression_to_pyarrow`) on each batch, which is inefficient and doesn't leverage PyArrow's join capabilities.
3. **Row-by-row comparison**: `get_rows_to_update()` uses Python loops to compare rows one at a time (`source_table.slice(source_idx, 1)`), missing the opportunity for vectorized operations.
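For illustration only, this is roughly the shape of the expression that exact key matching produces, built here by hand from `pyiceberg.expressions` primitives; it is not the actual `create_match_filter()` implementation, just the structure it generates for a composite key:

```python
# Illustrative only: the shape of the exact-match filter for a composite key.
from functools import reduce

from pyiceberg.expressions import And, EqualTo, Or

# Hypothetical source keys; with 10k+ rows this list becomes very large.
key_tuples = [(1, "eu"), (2, "eu"), (3, "us")]

# One And(EqualTo, EqualTo) per unique key tuple...
per_key_predicates = [
    And(EqualTo("id", id_value), EqualTo("region", region_value))
    for id_value, region_value in key_tuples
]

# ...chained into a single Or: the tree grows with every additional source row.
match_filter = reduce(Or, per_key_predicates)
print(match_filter)
```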
## Solution

### 1. Coarse Match Filter for Initial Scan (Biggest Performance Win)
Added `create_coarse_match_filter()`, which generates a less precise but much faster filter for the initial table scan. This is where the majority of the performance improvement comes from. Depending on the key data, it produces:

- `And(In(col1, values), In(col2, values))` instead of exact key-tuple matching
- range filters (`col >= min AND col <= max`)
- `AlwaysTrue()` to allow a full scan (exact matching happens downstream anyway)

This is safe because exact key matching occurs in `get_rows_to_update()` via the join operation. A sketch of the idea is shown below.
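A minimal sketch of the coarse-filter idea, with an assumed signature and threshold and the range-filter branch omitted; the PR's `create_coarse_match_filter()` may differ on all three points:

```python
# Sketch only: hypothetical signature and cutoff, not the PR's implementation.
import pyarrow as pa

from pyiceberg.expressions import AlwaysTrue, And, BooleanExpression, In


def coarse_match_filter_sketch(
    source: pa.Table,
    join_cols: list[str],
    max_values_per_col: int = 10_000,  # assumed cutoff, not taken from the PR
) -> BooleanExpression:
    """Build a cheap, over-inclusive scan filter from per-column value sets,
    falling back to AlwaysTrue() when the sets get too large."""
    per_column_filters = []
    for col in join_cols:
        values = set(source.column(col).to_pylist())
        if len(values) > max_values_per_col:
            # A huge In() set would defeat the purpose; a plain full scan is cheaper.
            return AlwaysTrue()
        per_column_filters.append(In(col, values))
    if len(per_column_filters) == 1:
        return per_column_filters[0]
    # Over-inclusive (matches the cross product of per-column values), which is
    # fine because exact key matching happens later in get_rows_to_update().
    return And(*per_column_filters)
```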
**Key insight - `AlwaysTrue()` is where the biggest win happens:**

The benchmark data was sparse, triggering the `AlwaysTrue()` path. Counter-intuitively, this is actually the best case for the performance improvement. The speedup doesn't come from reading fewer files - it comes from avoiding the massive expression tree construction and evaluation:

- Before: build `Or(And(...), And(...), ...)` with millions of nodes (8s), then evaluate it during the scan (382s)
- After: build `AlwaysTrue()` instantly (0.07s), scan without filter overhead (1.4s)

With sparse data, you'd read most/all files anyway, so the "full scan" isn't a penalty - but avoiding an expression tree with n×m nodes (n keys × m columns) and evaluating it across f files is a huge win.
When this optimization helps less:
### 2. Anti-Join for Insert Filtering
Replaced per-batch expression filtering with a single anti-join operation after processing all batches:
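A minimal sketch of the anti-join idea using PyArrow, with hypothetical names (`rows_to_insert`, `matched_key_batches`) standing in for the PR's actual code:

```python
import pyarrow as pa


def rows_to_insert(
    source_table: pa.Table,
    matched_key_batches: list[pa.Table],
    join_cols: list[str],
) -> pa.Table:
    """Return the source rows whose key does not appear among the matched keys."""
    # Key columns (only) of source rows that matched an existing row, collected
    # per batch during the update pass; deduplicate them once after the loop.
    matched_keys = pa.concat_tables(matched_key_batches).group_by(join_cols).aggregate([])
    # The anti-join keeps source rows with no match in matched_keys.
    return source_table.join(matched_keys, keys=join_cols, join_type="left anti")
```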
Note on memory usage: The new approach accumulates matched keys in memory during batch processing. We only store key columns (not full rows) to minimize memory footprint, and deduplicate after the loop. For tables with millions of matching rows, this could increase peak memory usage compared to the previous approach. A potential future improvement would be incremental deduplication during the loop.
### 3. Vectorized Row Comparison
Replaced row-by-row Python comparison with vectorized PyArrow operations:
The `_compare_columns_vectorized()` function handles:

- `pc.not_equal()` with proper null handling
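A minimal sketch of the vectorized comparison for flat (non-nested) columns, using hypothetical names (`rows_changed_mask`, `non_key_cols`); the PR's `_compare_columns_vectorized()` also covers structs, nested structs, and lists, which this sketch does not:

```python
import pyarrow as pa
import pyarrow.compute as pc


def rows_changed_mask(existing: pa.Table, source: pa.Table, non_key_cols: list[str]):
    """Boolean mask, one element per row, marking rows where any non-key column
    differs between the key-aligned existing and source tables (null-safe)."""
    changed = None
    for col in non_key_cols:
        left, right = existing.column(col), source.column(col)
        differs = pc.not_equal(left, right)            # null wherever either side is null
        null_mismatch = pc.xor(pc.is_null(left), pc.is_null(right))
        col_changed = pc.or_(pc.fill_null(differs, False), null_mismatch)
        changed = col_changed if changed is None else pc.or_(changed, col_changed)
    if changed is None:  # no non-key columns: nothing can differ
        changed = pa.array([False] * len(existing), type=pa.bool_())
    return changed
```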
## Benchmark Results

Ran benchmarks on a table with ~2M rows, doing incremental upserts:
**Why times increase with each run:** The table uses bucketing, and each upsert modifies files independently, causing the file count to increase over time. The original implementation's big filter expression (`Or(And(...), ...)`) had to be evaluated against every file, so more files = dramatically more time. The optimized version avoids this by using `AlwaysTrue()`, making the scan time grow linearly with data size instead of with the product of file count and filter size.

This file increase could be mitigated with table maintenance (compaction), which is not yet implemented in PyIceberg.
### Where the Time Went (Run 2: 2M rows, 32 batches)
The coarse filter approach shows the biggest improvement:

- Old filter build: `Or(And(...), And(...), ...)` with millions of nodes
- New filter build: `AlwaysTrue()` or a simple `And(In(), In())`

## Incremental Adoption
If the anti-join change is concerning due to memory implications, the coarse match filter optimization can be contributed separately as it provides the majority of the performance benefit and doesn't change the memory characteristics.
Suggested PR split:

1. The coarse match filter optimization on its own
2. The anti-join insert filtering and the vectorized `get_rows_to_update()` comparison

## Future Considerations
Why Rust bindings weren't explored for this PR:
In ticket #2159, a suggestion was made to side-step the performance issues by using the Python bindings of the Rust implementation. However, we would like to stick with a Python-centric implementation, because our use case requires mocking datetime using `time-machine` for our backfill and replay workflows.

This is why I kept the implementation in pure Python rather than exploring Rust bindings.
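For context, a small illustration (not from the PR) of the time mocking in question: `time-machine` patches the clock that pure-Python code sees, which keeps backfill/replay timestamps reproducible; code that reads the system clock inside a native extension would typically not see the patched time.

```python
import datetime as dt

import time_machine

# Freeze "now" at the replayed point in time; any pure-Python timestamping
# (e.g. snapshot commit times) performed inside this block sees the frozen clock.
with time_machine.travel(dt.datetime(2021, 6, 1, tzinfo=dt.timezone.utc), tick=False):
    print(dt.datetime.now(tz=dt.timezone.utc))  # 2021-06-01 00:00:00+00:00
```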
Potential hybrid approach:
The data processing (filtering, joins, comparisons) is where most of the time is spent and could benefit from Rust bindings. However - and I'll be selfish here - snapshot creation and metadata operations should remain in Python to preserve the ability to mock time. Without this, our backfill and replay workflows would break.
A future optimization could move the heavy data processing (filtering, joins, comparisons) to Rust bindings while keeping snapshot creation and metadata operations in pure Python.
I'd happily trade some performance for keeping our time-mocking capability intact.
## Testing
Added comprehensive tests for:
- `create_coarse_match_filter()` behavior across different dataset sizes and types
- `_compare_columns_vectorized()` with primitives, nulls, structs, nested structs, and lists

## Breaking Changes
None. The API remains unchanged; this is purely an internal optimization.
## Files Changed
- `pyiceberg/table/__init__.py` - upsert method optimizations
- `pyiceberg/table/upsert_util.py` - new `create_coarse_match_filter()` and vectorized comparison functions
- `tests/table/test_upsert.py` - new tests for the optimization functions

Note: This code was co-written with the help of an AI agent (Claude), primarily to speed up exploration and understanding of the PyIceberg codebase. All of the speed-up ideas are mine. The benchmark results are from our real-world production data that we actively use and store. I have reviewed all the generated code, and all related tests pass.