feat: Add DeleteFileIndex to improve position delete lookup #2918
kevinjqliu merged 3 commits into apache:main
Conversation
jayceslesar
left a comment
I basically left 2 nits. The existing integration tests are passing, which gives confidence, and the unit tests also look good here.
```python
if lower and upper and lower == upper:
    try:
        return lower.decode("utf-8")
    except (UnicodeDecodeError, AttributeError):
        pass
```
consider using contextlib.suppress here instead of the except pass
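A minimal sketch of what that would look like (the wrapper name `_decode_single_bound` is invented here; the body mirrors the quoted hunk):

```python
from contextlib import suppress

def _decode_single_bound(lower: bytes | None, upper: bytes | None) -> str | None:
    # Hypothetical wrapper around the quoted hunk. suppress() swallows the
    # listed exceptions, so a failed decode falls through to `return None`,
    # just like the original except/pass.
    if lower and upper and lower == upper:
        with suppress(UnicodeDecodeError, AttributeError):
            return lower.decode("utf-8")
    return None
```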
```python
from pydantic import Field
from sortedcontainers import SortedList
```
Unrelated to this PR, but I noticed that there is just one more occurrence of sortedcontainers outside of the tests. Might be interesting to see if we can get rid of it.
last occurrence
```python
self._ensure_indexed()
if not self._files:
    return []
start_idx = bisect_left(self._seqs, seq)
```
note the old solution uses bisect_right, which might have been a bug
might be worth it to add an integration test scenario for this
Yes, the old bisect_right returned the insertion point to the right of entries in the sorted sequence numbers with the same key. This meant that any delete file with the same sequence number was excluded.
Per the Iceberg spec https://iceberg.apache.org/spec/#scan-planning: "The data file's data sequence number is less than or equal to the delete file's data sequence number"
This situation can occur when data files and position delete files are added in the same commit using the row-delta logic, i.e. a MERGE statement.
Looking at a test on the Java side (like this test), we can see that Java is inclusive, returning files with the same sequence number as part of the filter lookup.
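A standalone illustration (not code from the PR) of the off-by-one difference:

```python
from bisect import bisect_left, bisect_right

seqs = [1, 2, 2, 3]  # delete files' data sequence numbers, already sorted
data_seq = 2         # the data file's data sequence number

# Spec rule: a delete applies when data_seq <= delete_seq, so ties must match.
assert seqs[bisect_left(seqs, data_seq):] == [2, 2, 3]  # inclusive (the fix)
assert seqs[bisect_right(seqs, data_seq):] == [3]       # ties dropped (old bug)
```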
Pull request overview
This PR introduces a DeleteFileIndex class to significantly improve the performance of position delete file lookup during scan planning. The implementation replaces the previous linear scanning approach with an efficient indexing strategy that organizes delete files by exact file path and by partition key, using binary search for sequence number filtering.
Changes:
- Added new `DeleteFileIndex` and `PositionDeletes` classes for efficient delete file indexing and retrieval
- Replaced the `_match_deletes_to_data_file` function with the `DeleteFileIndex.for_data_file` method
- Removed obsolete tests for the old implementation
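To make the strategy concrete, here is a simplified, self-contained sketch (not the PR's actual classes; `SimpleDeleteIndex` and its method signatures are invented for illustration):

```python
from bisect import bisect_left
from collections import defaultdict

class SimpleDeleteIndex:
    """Toy two-level index: exact file path first, then partition key."""

    def __init__(self) -> None:
        # (sequence_number, delete_file_path) pairs, grouped by target.
        self._by_path: defaultdict[str, list[tuple[int, str]]] = defaultdict(list)
        self._by_partition: defaultdict[tuple, list[tuple[int, str]]] = defaultdict(list)

    def add(self, seq: int, delete_file: str, path: str | None, partition: tuple) -> None:
        # A path-scoped delete targets exactly one data file; otherwise the
        # delete is partition-scoped and may apply to any file in the partition.
        group = self._by_path[path] if path is not None else self._by_partition[partition]
        group.append((seq, delete_file))

    def for_data_file(self, seq: int, path: str, partition: tuple) -> list[str]:
        matches: list[str] = []
        for group in (self._by_path.get(path, []), self._by_partition.get(partition, [])):
            group.sort()  # lazy: only sorted when a lookup actually happens
            # bisect_left keeps deletes whose sequence number equals the data
            # file's, matching the spec's "less than or equal to" rule.
            start = bisect_left(group, (seq,))
            matches.extend(delete for _, delete in group[start:])
        return matches
```

Per the quoted hunks, the PR's `PositionDeletes` instead keeps the sorted sequence numbers in a parallel `self._seqs` list and calls `bisect_left(self._seqs, seq)` on that.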
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `pyiceberg/table/delete_file_index.py` | New module implementing `DeleteFileIndex` for indexing position deletes by path and partition, and `PositionDeletes` for lazy-sorted sequence number filtering using `bisect` |
| `pyiceberg/table/__init__.py` | Integrated `DeleteFileIndex` into `DataScan.plan_files()`, replacing the old `SortedList`-based approach; removed unused imports and the old `_match_deletes_to_data_file` function |
| `tests/table/test_delete_file_index.py` | Comprehensive test suite covering empty index, sequence number filtering, path-specific deletes, partitioned deletes, deletion vectors, lazy sorting, and immutability after indexing |
| `tests/table/test_init.py` | Removed tests for the old `_match_deletes_to_data_file` function and cleaned up unused imports |
kevinjqliu
left a comment
LGTM! This is awesome
I think this speeds up matching delete files from O(n) -> O(1), and also evaluates lazily
Great start to unlock eq deletes and DVs
```python
    return evaluator.eval(delete_file)


def _referenced_data_file_path(delete_file: DataFile) -> str | None:
```
a little awkward that delete_file is of type DataFile, we can refactor this later perhaps
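Piecing the two quoted hunks together, the helper plausibly reads like this sketch (the exact body is not shown in this thread; the real signature takes a `DataFile`, and `2147483546` is the Iceberg spec's reserved field ID for the position delete `file_path` column):

```python
from contextlib import suppress

# Reserved field ID for the `file_path` column of position delete files,
# per the Iceberg spec.
PATH_FIELD_ID = 2147483546

def _referenced_data_file_path(delete_file) -> str | None:
    # When the lower and upper bounds of the path column agree, the delete
    # file references exactly one data file and can be indexed by that path.
    lower = (delete_file.lower_bounds or {}).get(PATH_FIELD_ID)
    upper = (delete_file.upper_bounds or {}).get(PATH_FIELD_ID)
    if lower and upper and lower == upper:
        with suppress(UnicodeDecodeError, AttributeError):
            return lower.decode("utf-8")
    return None
```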
```python
def _create_deletion_vector(
    sequence_number: int = 1, file_path: str = "s3://bucket/data.parquet", spec_id: int = 0
) -> ManifestEntry:
    delete_file = DataFile.from_args(
        content=DataFileContent.POSITION_DELETES,
        file_path=f"s3://bucket/deletion-vector-{sequence_number}.puffin",
        file_format=FileFormat.PUFFIN,
        partition=Record(),
        record_count=10,
        file_size_in_bytes=100,
        lower_bounds={PATH_FIELD_ID: file_path.encode()},
        upper_bounds={PATH_FIELD_ID: file_path.encode()},
    )
    delete_file._spec_id = spec_id
    return ManifestEntry.from_args(status=ManifestEntryStatus.ADDED, sequence_number=sequence_number, data_file=delete_file)
```
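A hypothetical invocation of this helper, for context:

```python
# Hypothetical usage of the helper above: a deletion vector committed at
# sequence number 3 that targets one specific data file.
entry = _create_deletion_vector(sequence_number=3, file_path="s3://bucket/part-0.parquet")
assert entry.sequence_number == 3
assert entry.data_file.file_format == FileFormat.PUFFIN
```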
ha, might as well add it to the source code 😄 we can follow up with this
Thanks @geruh, and thank you @Fokko @jayceslesar @copilot for the review
Related to #2255.
Rationale for this change
This PR is a piece of the existing DFI PR in #2255. However, it rips out the existing delete-to-data matching behavior and instead indexes delete files for efficient lookup.

The previous implementation linearly scanned the position delete files, creating a new `_InclusiveMetricsEvaluator` instance for each data file.

Now we extend this workflow with a `DeleteFileIndex` that groups delete files by referenced file path and by partition key, and filters them by sequence number with a binary search.

This aligns with the Java implementation of the DeleteFileIndex, following the Python infra.
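As an illustrative sketch of the resulting flow in `DataScan.plan_files()` (only `DeleteFileIndex` and `for_data_file` are named in this PR; the constructor, argument shapes, and variable names here are assumptions):

```python
# Illustrative only: build the index once from the snapshot's delete entries,
# then do one lookup per data file instead of rescanning every delete file.
delete_file_index = DeleteFileIndex(positional_delete_entries)  # hypothetical constructor

for data_entry in data_entries:
    matched_deletes = delete_file_index.for_data_file(
        data_entry.sequence_number, data_entry.data_file
    )
```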
Are these changes tested?
New tests were added, and existing tests continue to pass.
Are there any user-facing changes?
No