GH-48986: [Python][Dataset] Add filters parameter to orc.read_table() for predicate pushdown (15/15) by cbb330 · Pull Request #49181 · apache/arrow

cbb330 · 2026-02-08T06:37:16Z

Summary

Part 15/15 of ORC predicate pushdown implementation.

⚠️ Depends on PRs 1-14 being merged first

Adds Python API for ORC predicate pushdown by exposing a filters parameter on orc.read_table(). This provides API parity with Parquet's read_table() function.

This is the final PR in the stacked series.

GitHub Issue: ORC Predicate Pushdown #48986

Changes

Add filters parameter to orc.read_table() supporting both Expression and DNF tuple formats
Delegate to Dataset API when filters is specified
Add comprehensive documentation with examples in module docstring
Add 5 test functions covering smoke tests, integration, and correctness

Implementation

The implementation is pure Python with no Cython changes. It reuses existing Dataset API bindings and the filters_to_expression() utility from Parquet for DNF tuple conversion.

When filters is specified, the function delegates to:

dataset = ds.dataset(source, format='orc', filesystem=filesystem)
return dataset.to_table(columns=columns, filter=filter_expr)

This leverages the C++ predicate pushdown infrastructure added in PRs 1-5.

Test Coverage

Expression format: ds.field('id') > 100
DNF tuple format: [('id', '>', 100)]
Integration with column projection
Correctness validation against post-filtering
Edge case: filters=None

Examples

Expression format:

import pyarrow.orc as orc
import pyarrow.dataset as ds

table = orc.read_table('data.orc', filters=ds.field('id') > 1000)

DNF tuple format (Parquet-compatible):

# Single condition
table = orc.read_table('data.orc', filters=[('id', '>', 1000)])

# Multiple conditions (AND)
table = orc.read_table('data.orc', filters=[('id', '>', 100), ('id', '<', 200)])

# OR conditions
table = orc.read_table('data.orc', filters=[[('x', '==', 1)], [('x', '==', 2)]])

With column projection:

table = orc.read_table('data.orc',
                       columns=['id', 'value'],
                       filters=[('id', '>', 1000)])

Supported Operators

==, !=, <, >, <=, >=, in, not in

Currently optimized for INT32 and INT64 columns.

Rationale

This Python API makes ORC predicate pushdown accessible to Python users without requiring them to use the lower-level Dataset API directly. It mirrors Parquet's read_table(filters=...) API for consistency.

The implementation replaces the placeholder commit from the original plan with a full working implementation.

Add internal utilities for extracting min/max statistics from ORC stripe metadata. This establishes the foundation for statistics-based stripe filtering in predicate pushdown. Changes: - Add MinMaxStats struct to hold extracted statistics - Add ExtractStripeStatistics() function for INT64 columns - Statistics extraction returns std::nullopt for missing/invalid data - Validates statistics integrity (min <= max) This is an internal-only change with no public API modifications. Part of incremental ORC predicate pushdown implementation (PR1/15).

Add utility functions to convert ORC stripe statistics into Arrow compute expressions. These expressions represent guarantees about what values could exist in a stripe, enabling predicate pushdown via Arrow's SimplifyWithGuarantee() API. Changes: - Add BuildMinMaxExpression() for creating range expressions - Support null handling with OR is_null(field) when nulls present - Add convenience overload accepting MinMaxStats directly - Expression format: (field >= min AND field <= max) [OR is_null(field)] This is an internal-only utility with no public API changes. Part of incremental ORC predicate pushdown implementation (PR2/15).

Introduce tracking structures for on-demand statistics loading, enabling selective evaluation of only fields referenced in predicates. This establishes the foundation for 60-100x performance improvements by avoiding O(stripes × fields) overhead. Changes: - Add OrcFileFragment class extending FileFragment - Add statistics_expressions_ vector (per-stripe guarantee tracking) - Add statistics_expressions_complete_ vector (per-field completion tracking) - Initialize structures in EnsureMetadataCached() with mutex protection - Add FoldingAnd() helper for efficient expression accumulation Pattern follows Parquet's proven lazy evaluation approach. This is infrastructure-only with no public API exposure yet. Part of incremental ORC predicate pushdown implementation (PR3/15).

Implement first end-to-end working predicate pushdown for ORC files. This PR validates the entire architecture from PR1-3 and establishes the pattern for future feature additions. Scope limited to prove the concept: - INT64 columns only - Greater-than operator (>) only Changes: - Add FilterStripes() public API to OrcFileFragment - Add TestStripes() internal method for stripe evaluation - Implement lazy statistics evaluation (processes only referenced fields) - Integrate with Arrow's SimplifyWithGuarantee() for correctness - Add ARROW_ORC_DISABLE_PREDICATE_PUSHDOWN feature flag - Cache ORC reader to avoid repeated file opens - Conservative fallback: include all stripes if statistics unavailable The implementation achieves significant performance improvements by skipping stripes that provably cannot contain matching data. Part of incremental ORC predicate pushdown implementation (PR4/15).

Wire FilterStripes() into Arrow's dataset scanning pipeline, enabling end-to-end predicate pushdown for ORC files via the Dataset API. Changes: - Add MakeFragment() override to create OrcFileFragment instances - Modify OrcScanTask to call FilterStripes when filter present - Add stripe index determination in scan execution path - Log stripe skipping at DEBUG level for observability - Maintain backward compatibility (no filter = read all stripes) Integration points: - OrcFileFormat now creates OrcFileFragment (not generic FileFragment) - Scanner checks for OrcFileFragment and applies predicate pushdown - Filtered stripe indices ready for future ReadStripe optimizations This enables users to benefit from predicate pushdown via: dataset.to_table(filter=expr) Part of incremental ORC predicate pushdown implementation (PR5/15).

…servative guarantees

…tem, and repeated-filter stability

…e-like compatibility

… coverage

…S NOT NULL

…ters=...)

…ction

… coverage

…r column in output

…tput columns

cbb330 added 5 commits January 27, 2026 02:19

github-actions bot added Component: C++ Component: Python awaiting review Awaiting review labels Feb 8, 2026

cbb330 mentioned this pull request Feb 8, 2026

[Python] Add filters parameter to orc.read_table() for predicate pushdown cbb330/arrow#1

Closed

cbb330 force-pushed the orc-pr-15-python branch from 23df0d5 to ea68d12 Compare February 15, 2026 04:24

cbb330 added 11 commits February 14, 2026 21:26

[C++][Dataset] Apply stripe-selected ORC scanning with simplified con…

70019a9

…servative guarantees

[C++][Dataset] Add ORC predicate pushdown tests for all-null, filesys…

df15093

…tem, and repeated-filter stability

[Python][Dataset] Add filters support to orc.read_table with path/fil…

79cdd3e

…e-like compatibility

[Python][Tests] Add ORC read_table(filters=...) smoke and correctness…

07ac686

… coverage

[Python][Tests] Add all-null stripe semantics tests for IS NULL and I…

5953dda

…S NOT NULL

[Python][Tests] Add BufferReader fallback coverage for read_table(fil…

4c2afaf

…ters=...)

[Python][Tests] Add Parquet-vs-ORC predicate parity integration test

5924525

[Python][Dataset] Fix file-like ORC filter fallback with output proje…

deed1be

…ction

[Python][Tests] Add BufferReader projection and empty-column fallback…

11e120e

… coverage

[Python][Tests] Verify projected path-based ORC filters without filte…

32fbcda

…r column in output

[Python][Tests] Verify projected path-based ORC filters with empty ou…

745b051

…tput columns

cbb330 force-pushed the orc-pr-15-python branch from d155fe7 to 745b051 Compare February 15, 2026 05:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-48986: [Python][Dataset] Add filters parameter to orc.read_table() for predicate pushdown (15/15)#49181

GH-48986: [Python][Dataset] Add filters parameter to orc.read_table() for predicate pushdown (15/15)#49181
cbb330 wants to merge 16 commits intoapache:mainfrom
cbb330:orc-pr-15-python

cbb330 commented Feb 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cbb330 commented Feb 8, 2026

Summary

Changes

Implementation

Test Coverage

Examples

Supported Operators

Rationale

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant