perf: Optimize Morton order with hypercube and vectorization #3708
mkitti wants to merge 16 commits into zarr-developers:main
Conversation
As of 865df2a, I see the speed-ups below. This optimization involves just eliding the bounds check when within the largest power-of-2 hypercube.

Benchmarking Script

Edit: this benchmark did not correctly eliminate caching effects.
Merging this PR will not alter performance.
The changes here don't measurably affect performance but do add a lot of code. Do you think you can put together a realistic benchmark that reveals an effect of these changes? E.g., reading exactly 1 sub-chunk from a high-dimensional sharded array? I think I want to see some measurable performance impact on a realistic workload before thinking about raising the code complexity in this way.
The most efficient changes are probably represented by 1b1076f. This is the diff: main...mkitti:zarr-python:1b1076f136645300d5132839403494ea2920ac13. For about 64 lines of code, you get a 4x speed-up for power-of-2 shapes.
The magic-number functions are not as effective as they could be because we do not have a compiler that can apply SIMD optimizations. That algorithm is what you would find in C++ libraries such as libmorton. The explanation for those is here:

That's 150 lines of code with a small effect measured in nanoseconds.
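For reference, here is a minimal pure-Python sketch of the magic-number bit-interleaving trick (the masks are the standard 32-bit constants; the function names are illustrative, not what this PR adds):

```python
def _part1by1(x: int) -> int:
    # Spread the low 16 bits of x into the even bit positions,
    # e.g. 0b1111 -> 0b01010101. Each mask-and-shift pass doubles the gaps.
    x &= 0x0000FFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def morton_encode_2d(x: int, y: int) -> int:
    # Interleave bits: x takes the even positions, y the odd ones.
    return _part1by1(x) | (_part1by1(y) << 1)

assert morton_encode_2d(3, 5) == 0b100111
```

A compiler can turn these mask-and-shift chains into a handful of vector instructions; in pure Python each step is an interpreted big-int operation, which is why the payoff is much smaller here.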
could you add something to our existing indexing benchmark routines that demonstrates the performance gains here?
for context, this library does not aspire to high-performance Morton code generation. Instead, we want array indexing to be fast. So for these changes to be attractive, they should make array indexing faster in at least some cases.
Add benchmarks that clear the _morton_order LRU cache before each iteration to measure the full Morton computation cost:

- test_sharded_morton_indexing: 512-4096 chunks per shard
- test_sharded_morton_indexing_large: 32768 chunks per shard

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
oh and while you're at it, could you ensure that the LRU store cache is bounded? I forgot to do this in #3705.
The default is maxsize=128. Do you want it smaller? Claude suggests 16.
16 MB I assume? That seems fine! We just want it to be explicit.
```python
def read_with_cache_clear() -> None:
    _morton_order.cache_clear()
    getitem(data, indexer)
```
Claude's test involves clearing the cache before each benchmark run. Let's pretend that this represents thrashing the now bounded cache.
No. It will memoize the last 16 calls. https://docs.python.org/3/library/functools.html#functools.lru_cache
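For anyone following along, a quick runnable illustration of how a bounded `functools.lru_cache` behaves (the function below is a stand-in, not the actual zarr helper):

```python
from functools import lru_cache

@lru_cache(maxsize=16)  # bounded: at most the 16 most recently used results are kept
def expensive(shape: tuple[int, ...]) -> int:
    print(f"computing for {shape}")
    return len(shape)

expensive((32, 32, 32))        # prints "computing for (32, 32, 32)"
expensive((32, 32, 32))        # cache hit: nothing printed
expensive.cache_clear()        # what the benchmark does before each run
expensive((32, 32, 32))        # recomputes
print(expensive.cache_info())  # CacheInfo(hits=0, misses=1, maxsize=16, currsize=1)
```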
The end-to-end array indexing benchmarks have been added to the pull request description; see the Summary below.
Summary
This PR optimizes the Morton order computation in the `_morton_order` and `decode_morton` functions with multiple techniques that together provide a 3-5x speedup in Morton computation, resulting in measurable end-to-end improvements for sharded array indexing.

Optimizations
1. Hypercube Optimization
Calculate the largest power-of-2 hypercube that fits within `chunk_shape`. Within this hypercube, Morton codes are guaranteed to be in bounds, eliminating the need for bounds checking.
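A minimal sketch of the idea, assuming the usual Morton bit ordering (the helper below is illustrative, not the PR's actual code):

```python
def largest_hypercube_side(chunk_shape: tuple[int, ...]) -> int:
    # Largest power of 2 that fits in every dimension. The first
    # side**ndim Morton codes enumerate exactly that hypercube, so
    # their decoded coordinates never need a bounds check.
    return 1 << (min(chunk_shape).bit_length() - 1)

print(largest_hypercube_side((32, 32, 32)))  # 32: the whole chunk is check-free
print(largest_hypercube_side((5, 7)))        # 4: the 4x4 corner block is check-free
```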
2. Vectorized Morton Decoding

Created `decode_morton_vectorized` to decode multiple Morton codes at once using NumPy array operations instead of scalar bit manipulation.
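The approach looks roughly like this (an illustrative sketch assuming equal bits per dimension, not the PR's exact code):

```python
import numpy as np

def decode_many(codes: np.ndarray, ndim: int, nbits: int) -> np.ndarray:
    # Decode all Morton codes at once: each shift/mask below operates
    # on the whole array instead of one scalar code at a time.
    coords = np.zeros((codes.size, ndim), dtype=np.int64)
    for bit in range(nbits):
        for dim in range(ndim):
            coords[:, dim] |= ((codes >> (bit * ndim + dim)) & 1) << bit
    return coords

codes = np.arange(8, dtype=np.int64)  # every code in a 2x2x2 cube
print(decode_many(codes, ndim=3, nbits=1))
# [[0 0 0] [1 0 0] [0 1 0] [1 1 0] [0 0 1] [1 0 1] [0 1 1] [1 1 1]]
```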
3. Efficient Bit Counting

Replaced `math.ceil(math.log2(c))` with `(c - 1).bit_length()` for a more efficient and accurate bit-width calculation.
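The two expressions agree mathematically, but the float-based version can misround for large values, e.g.:

```python
import math

c = 2**53 + 1
# float(c) rounds to exactly 2**53, so the log2-based count is one short.
print(math.ceil(math.log2(c)))  # 53 -- wrong
print((c - 1).bit_length())     # 54 -- correct, and pure integer arithmetic
```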
4. Singleton Dimension Removal

For shapes with singleton dimensions (size 1) like `(1, 1, 32, 32, 32)`, remove them before Morton computation, then expand the coordinates back. This enables better optimization for the non-singleton dimensions.
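A sketch of that step (helper names are illustrative):

```python
def strip_singletons(shape: tuple[int, ...]) -> tuple[tuple[int, ...], tuple[int, ...]]:
    # Record which axes survive, and return the reduced shape.
    kept = tuple(i for i, s in enumerate(shape) if s > 1)
    return kept, tuple(shape[i] for i in kept)

def expand_coord(coord: tuple[int, ...], kept: tuple[int, ...], ndim: int) -> tuple[int, ...]:
    # Re-insert zeros for the dropped size-1 axes.
    out = [0] * ndim
    for axis, value in zip(kept, coord):
        out[axis] = value
    return tuple(out)

kept, reduced = strip_singletons((1, 1, 32, 32, 32))
print(reduced)                                # (32, 32, 32): a power-of-2 cube
print(expand_coord((5, 6, 7), kept, ndim=5))  # (0, 0, 5, 6, 7)
```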
Benchmark Results

Morton Order Computation (micro-benchmark, no caching)

Shapes benchmarked: (8, 8), (32, 32), (8, 8, 8), (32, 32, 32), (16, 16, 16, 16), and (1, 1, 32, 32, 32). (Timing table omitted.)

End-to-End Sharded Array Indexing
Benchmark with 32³ = 32,768 chunks per shard (`test_sharded_morton_indexing_large`).

The 2.4% end-to-end improvement directly correlates with the Morton order speedup. While I/O dominates total time, the Morton optimization provides measurable benefits for workloads with large chunks-per-shard counts.
Benchmark Script
Checklist

- docs/user-guide/*.md
- changes/