Add Deltalake query support#1023

Merged
esoteric-ephemera merged 42 commits into main from deltalake
Feb 24, 2026
@tsmathis
Collaborator

should just work™️

@tsmathis
Collaborator Author

@esoteric-ephemera, @tschaume - this is ready for review again. It has been cleaned up and brought up to date.

Some refreshers:
perf (non-rigorous):

# w/ deltalake & pyarrow
>>> timeit.timeit(lambda: mpr.materials.tasks.search(), number=1)
Retrieving CoreTaskDoc documents: 100% | <progress_bar> | 1914019/1914019 [07:31<00:00, 4240.33it/s]

454.2317273330045 (seconds)

# w/ mp-api v0.46.0
>>> timeit.timeit(lambda: mpr.materials.tasks.search(), number=1)
Retrieving CoreTaskDoc documents:  36%| <progress_bar> | 513085/1435073 [09:16<20:53, 735.30it/s]
zsh: killed     python
/Users/tsmathis/miniconda3/envs/test_api_pypi/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
  
# dies @ ~35%, :shrug:
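
For a rough sense of scale, the two runs above can be compared by throughput. This is only back-of-the-envelope arithmetic on the numbers quoted above, not a benchmark: the v0.46.0 run was killed at about 36%, so its rate is the mid-run progress bar value rather than a full-run average, and the variable names are illustrative.

```python
# Numbers copied from the two runs quoted above.
new_docs = 1_914_019              # documents retrieved with deltalake + pyarrow
new_seconds = 454.2317273330045   # timeit result for the full retrieval
old_rate = 735.30                 # it/s shown by the v0.46.0 progress bar before the kill

# Average throughput of the completed deltalake run.
new_rate = new_docs / new_seconds

# Ratio against the (incomplete) v0.46.0 run; treat as indicative only.
speedup = new_rate / old_rate

print(f"deltalake path: {new_rate:.0f} docs/s")
print(f"rough speedup over v0.46.0: {speedup:.1f}x")
```

The more important difference is that the deltalake run finishes at all, while the v0.46.0 run is killed partway through.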

local caching (w/ warnings):

>>> tasks = mpr.materials.tasks.search()
mp_api.client.core.client - WARNING - Dataset for tasks already exists at /Users/tsmathis/mp_datasets/parsed/core/tasks, returning existing dataset.
mp_api.client.core.client - INFO - Delete or move existing dataset or re-run search query with MPRester(force_renew=True) to refresh local dataset.
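
The reuse-or-refresh behavior in the log messages above can be pictured roughly as follows. This is a minimal sketch, not the mp-api implementation: `resolve_dataset`, `download`, and the directory layout are hypothetical; only the `force_renew` flag and the "warn and return the existing dataset" behavior come from the output above.

```python
import logging
import shutil
from pathlib import Path
from typing import Callable

logger = logging.getLogger("dataset_cache_sketch")


def resolve_dataset(
    root: Path,
    name: str,
    download: Callable[[Path], None],
    force_renew: bool = False,
) -> Path:
    """Return the local directory for `name`, downloading only when needed.

    Hypothetical helper: `root`, `name`, and `download` are illustrative.
    """
    target = root / name
    if target.exists() and not force_renew:
        # Mirror the logged behavior: reuse the existing copy and say so.
        logger.warning(
            "Dataset for %s already exists at %s, returning existing dataset.",
            name,
            target,
        )
        return target
    if target.exists():
        # force_renew=True: discard the stale copy before downloading again.
        shutil.rmtree(target)
    target.mkdir(parents=True)
    download(target)
    return target
```

Under this sketch, a second call with the same `name` skips the download, and passing `force_renew=True` deletes the local copy and fetches it again.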

Access controlled dataset compatibility (+ accurate pbar):

commercial_user_key = os.environ.get("BY_C_KEY")
non_commercial_user_key = os.environ.get("BY_NC_KEY")

>>> with MPRester(commercial_user_key) as mpr:
...    by_c_tasks = mpr.materials.tasks.search()
Retrieving CoreTaskDoc documents: 100%| <progress_bar> | 1914019/1914019 [07:35<00:00, 4200.25it/s]
>>> len(by_c_tasks)
1914019
>>> with MPRester(non_commercial_user_key, force_renew=True) as mpr:  # clear local cache
...    by_nc_tasks = mpr.materials.tasks.search()
# note: the pbar for BY_NC currently shows the count of v7 tasks;
# once tasks are updated to v8, the pbar will be accurate for non-commercial users
>>> len(by_nc_tasks)
1797065

Warnings on sub-optimal usage w/ links to docs:

>>> _ = tasks[0]
<stdin>:1: MPDatasetIndexingWarning:
            Pythonic indexing into arrow-based MPDatasets is sub-optimal, consider using
            idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
            docs.materialsproject.org/downloading-data/arrow-datasets

>>> _ = tasks[0:10]
<stdin>:1: MPDatasetSlicingWarning:
                Pythonic slicing of arrow-based MPDatasets is sub-optimal, consider using
                idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
                docs.materialsproject.org/downloading-data/arrow-datasets

>>> for i in tasks:
...     _ = i
...
<stdin>:1: MPDatasetIterationWarning:
                Iterating through arrow-based MPDatasets is sub-optimal, consider using
                idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
                docs.materialsproject.org/downloading-data/arrow-datasets
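
The three warnings above could be wired up along these lines. This is a sketch under assumptions, not the code in this PR: the warning class names are taken from the output above, but their base class (`UserWarning`), the `WarningDataset` wrapper, and the list-of-dicts storage are hypothetical stand-ins for the arrow-backed `MPDataset`.

```python
import warnings
from typing import Iterator, Sequence

DOCS = "docs.materialsproject.org/downloading-data/arrow-datasets"


class MPDatasetIndexingWarning(UserWarning):
    """Raised as a warning on integer indexing (base class is an assumption)."""


class MPDatasetSlicingWarning(UserWarning):
    """Raised as a warning on slicing (base class is an assumption)."""


class MPDatasetIterationWarning(UserWarning):
    """Raised as a warning on iteration (base class is an assumption)."""


class WarningDataset:
    """Hypothetical wrapper that still supports Python access but warns on it."""

    def __init__(self, rows: Sequence[dict]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, key):
        # Slices and integer indices get distinct warning categories.
        if isinstance(key, slice):
            warnings.warn(
                f"Pythonic slicing is sub-optimal. See {DOCS}",
                MPDatasetSlicingWarning,
                stacklevel=2,
            )
        else:
            warnings.warn(
                f"Pythonic indexing is sub-optimal. See {DOCS}",
                MPDatasetIndexingWarning,
                stacklevel=2,
            )
        return self._rows[key]

    def __iter__(self) -> Iterator[dict]:
        # Warn once per iteration, not once per row.
        warnings.warn(
            f"Iterating is sub-optimal. See {DOCS}",
            MPDatasetIterationWarning,
            stacklevel=2,
        )
        return iter(self._rows)
```

Using separate categories lets users silence one access pattern with `warnings.filterwarnings` while keeping the others visible.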

@tsmathis
Collaborator Author

still need to write docs.materialsproject.org/downloading-data/arrow-datasets though...
next on the todos

@esoteric-ephemera
Collaborator

Overall looks good to me! One thing, which may just be an inconsistency in how the tasks on AWS and Mongo are currently formatted: the task_ids differ (AWS: prefix-less, Mongo: mp- prefix)

Should this get taken care of in tasks-v8?

@tsmathis
Collaborator Author

tsmathis commented Feb 19, 2026

> Overall looks good to me! One thing, which may just be an inconsistency in how the tasks on AWS and Mongo are currently formatted: the task_ids differ (AWS: prefix-less, Mongo: mp- prefix)
>
> Should this get taken care of in tasks-v8?

tasks-v8 is prefix-less!
so should be no mismatch there once tasks-v8 is live :)

@esoteric-ephemera
Collaborator

OK perfect then, I think this is ready. Might just add a warning if users query by a prefixed ID to let them know about the transition in 1053

I'll merge in 1046, then this, and then 1053
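
The prefixed-ID warning suggested above might look something like this. It is a hypothetical sketch only: `normalize_task_id`, the choice to strip the prefix rather than reject the ID, and the warning text are all assumptions, not what 1053 actually implements.

```python
import warnings


def normalize_task_id(task_id: str, prefix: str = "mp-") -> str:
    """Return a prefix-less task ID, warning if a prefixed one was passed.

    Hypothetical helper: the name and the strip-and-warn behavior are
    illustrative assumptions.
    """
    if task_id.startswith(prefix):
        warnings.warn(
            f"Task ID {task_id!r} uses the {prefix!r} prefix; "
            "task IDs are transitioning to a prefix-less format.",
            UserWarning,
            stacklevel=2,
        )
        return task_id[len(prefix):]
    return task_id
```

Prefix-less IDs pass through unchanged and silently, so only users still on the old format see the message.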

@esoteric-ephemera esoteric-ephemera merged commit 118d7db into main Feb 24, 2026
3 checks passed
@esoteric-ephemera esoteric-ephemera deleted the deltalake branch February 24, 2026 17:11