
perf: Optimize Morton order with hypercube and vectorization#3708

Open
mkitti wants to merge 16 commits into zarr-developers:main from mkitti:mkitti-morton-decode-optimization

Conversation

mkitti commented Feb 13, 2026

Summary

This PR optimizes the Morton order computation in the `_morton_order` and `decode_morton` functions with several techniques that together yield a 3-5x speedup in Morton computation, resulting in measurable end-to-end improvements for sharded array indexing.

Optimizations

1. Hypercube Optimization

Calculate the largest power-of-2 hypercube that fits within chunk_shape. Within this hypercube, Morton codes are guaranteed to be in bounds, eliminating the need for bounds checking.
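
A minimal sketch of the idea (the helper name below is illustrative, not the PR's actual code):

```python
def largest_hypercube_side(chunk_shape: tuple[int, ...]) -> int:
    # Largest power-of-2 side length that fits in every dimension:
    # 2 ** floor(log2(min(chunk_shape))).
    return 1 << (min(chunk_shape).bit_length() - 1)

# Every Morton code below side ** ndim decodes to in-bounds coordinates,
# so the bounds check can be skipped for that whole prefix of codes.
side = largest_hypercube_side((5, 7, 11))  # -> 4
unchecked = side ** 3                      # the first 64 codes need no check
```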

2. Vectorized Morton Decoding

Created decode_morton_vectorized to decode multiple Morton codes at once using NumPy array operations instead of scalar bit manipulation.
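
Roughly, the approach looks like this (a simplified sketch, not the PR's exact `decode_morton_vectorized`; it assumes every dimension uses the same number of interleaved bits):

```python
import numpy as np

def decode_morton_batch(codes: np.ndarray, ndim: int, bits: int) -> np.ndarray:
    """Decode many Morton codes at once; bit i of dim d sits at i * ndim + d."""
    out = np.zeros((codes.size, ndim), dtype=np.int64)
    for d in range(ndim):
        for i in range(bits):
            # One vectorized shift/mask per (dim, bit) pair replaces a
            # Python-level bit loop over every individual code.
            out[:, d] |= ((codes >> (i * ndim + d)) & 1) << i
    return out

coords = decode_morton_batch(np.arange(16, dtype=np.int64), ndim=2, bits=2)
```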

3. Efficient Bit Counting

Replaced `math.ceil(math.log2(c))` with `(c - 1).bit_length()` for a faster and more accurate bit-width calculation.
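
For example:

```python
import math

# Integer-only bit width: the number of bits needed to index c values.
for c in (2, 7, 8, 9, 1000):
    assert (c - 1).bit_length() == math.ceil(math.log2(c))

# The float-based form can misround once c exceeds float precision:
c = 2**53 + 1
print((c - 1).bit_length())     # 54 (exact)
print(math.ceil(math.log2(c)))  # 53 (2**53 + 1 rounds to 2**53 as a float)
```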

4. Singleton Dimension Removal

For shapes with singleton dimensions (size 1) like (1,1,32,32,32), remove them before Morton computation, then expand coordinates back. This enables better optimization for the non-singleton dimensions.
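
A sketch of the squeeze/expand round trip (the helper and argument names here are illustrative):

```python
def morton_order_squeezed(shape, morton_order):
    """Compute Morton order over non-singleton dims, then re-insert zeros."""
    keep = [d for d, size in enumerate(shape) if size > 1]
    reduced = tuple(shape[d] for d in keep)
    expanded = []
    for coords in morton_order(reduced):
        full = [0] * len(shape)  # singleton dims always have coordinate 0
        for axis, c in zip(keep, coords):
            full[axis] = c
        expanded.append(tuple(full))
    return tuple(expanded)
```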

Benchmark Results

Morton Order Computation (micro-benchmark, no caching)

| Shape | Original | Optimized | Speedup |
|---|---|---|---|
| (8, 8) | 0.27ms | 0.13ms | 2.0x |
| (32, 32) | 5.21ms | 1.37ms | 3.8x |
| (8, 8, 8) | 2.57ms | 0.76ms | 3.4x |
| (32, 32, 32) | 220ms | 48ms | 4.6x |
| (16, 16, 16, 16) | 447ms | 107ms | 4.2x |
| (1, 1, 32, 32, 32) | 237ms | 71ms | 3.3x |

End-to-End Sharded Array Indexing

Benchmark with 32³ = 32,768 chunks per shard (test_sharded_morton_indexing_large):

| Branch | Mean Time | Morton Contribution |
|---|---|---|
| Main (original) | 7.07s | ~220ms |
| Optimized | 6.89s | ~50ms |
| Improvement | 171ms (2.4%) | ~4x faster |

The 2.4% end-to-end improvement directly correlates with the Morton order speedup. While I/O dominates total time, the Morton optimization provides measurable benefits for workloads with large chunks-per-shard counts.

Benchmark Script
```python
#!/usr/bin/env python
"""Detailed benchmark showing contribution of each Morton order optimization."""

import time
from math import prod

from zarr.core.indexing import _morton_order, decode_morton


def _morton_order_original(chunk_shape):
    """Original implementation - no optimizations."""
    n_total = prod(chunk_shape)
    order = []
    i = 0
    while len(order) < n_total:
        m = decode_morton(i, chunk_shape)
        if all(x < y for x, y in zip(m, chunk_shape, strict=False)):
            order.append(m)
        i += 1
    return tuple(order)


def benchmark(func, shape, iterations=10):
    """Benchmark a function and return average time in ms (no caching)."""
    if hasattr(func, 'cache_clear'):
        times = []
        for _ in range(iterations):
            func.cache_clear()
            start = time.perf_counter()
            func(shape)
            times.append(time.perf_counter() - start)
        return sum(times) / len(times) * 1000
    else:
        start = time.perf_counter()
        for _ in range(iterations):
            func(shape)
        return (time.perf_counter() - start) / iterations * 1000


def main():
    summary_shapes = [
        (8, 8),
        (32, 32),
        (8, 8, 8),
        (32, 32, 32),
        (16, 16, 16, 16),
        (1, 1, 32, 32, 32),
    ]

    print(f"| {'Shape':<20} | {'Original':>10} | {'Optimized':>10} | {'Speedup':>8} |")
    print(f"|{'-'*22}|{'-'*12}|{'-'*12}|{'-'*10}|")

    for shape in summary_shapes:
        t_orig = benchmark(_morton_order_original, shape)
        t_full = benchmark(_morton_order, shape)
        speedup = t_orig / t_full if t_full > 0 else float("inf")
        print(f"| \`{str(shape):<19}\` | {t_orig:>8.2f}ms | {t_full:>8.2f}ms | **{speedup:>5.1f}x** |")


if __name__ == "__main__":
    main()
```

Checklist

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 13, 2026
@mkitti mkitti changed the title perf: Skip bounds check for initial elements in 2^n hypercube perf: Optimize _morton_coding Feb 13, 2026
mkitti (Author) commented Feb 13, 2026

As of 865df2a, I see the speedups below. This optimization just elides the bounds check while within the largest power-of-2 hypercube.

     Shape                    Original    Optimized    Speedup
     --------------------------------------------------------
     (8, 8)                    0.304ms      0.020ms    14.98x
     (16, 16)                  1.322ms      0.086ms    15.41x
     (32, 32)                  5.706ms      0.383ms    14.91x
     (9, 9)                    0.943ms      0.084ms    11.20x
     (15, 15)                  1.252ms      0.115ms    10.91x
     (100, 100)              102.101ms     10.108ms    10.10x
     (8, 8, 8)                 2.733ms      0.185ms    14.76x
     (5, 7, 11)                4.446ms      0.445ms     9.99x
Benchmarking Script
```python
#!/usr/bin/env python
"""Benchmark script for Morton order optimization."""

import time
from math import prod

from zarr.core.indexing import _morton_order, decode_morton


def _morton_order_original(chunk_shape):
    """Original implementation without optimization."""
    n_total = prod(chunk_shape)
    order = []
    i = 0
    while len(order) < n_total:
        m = decode_morton(i, chunk_shape)
        if all(x < y for x, y in zip(m, chunk_shape, strict=False)):
            order.append(m)
        i += 1
    return tuple(order)


def benchmark(func, shape, iterations=10):
    """Benchmark a function and return average time in ms."""
    start = time.perf_counter()
    for _ in range(iterations):
        func(shape)
    return (time.perf_counter() - start) / iterations * 1000


def main():
    test_shapes = [
        (8, 8),       # Perfect power of 2
        (16, 16),     # Perfect power of 2
        (32, 32),     # Perfect power of 2
        (9, 9),       # Slightly larger than power of 2
        (15, 15),     # Just under power of 2
        (100, 100),   # Large non-power-of-2
        (8, 8, 8),    # 3D power of 2
        (5, 7, 11),   # 3D mixed
    ]

    print(f"{'Shape':<20} {'Original':>12} {'Optimized':>12} {'Speedup':>10}")
    print("-" * 56)

    for shape in test_shapes:
        # Benchmark original
        original_time = benchmark(_morton_order_original, shape)

        # Clear cache and benchmark optimized
        _morton_order.cache_clear()
        optimized_time = benchmark(_morton_order, shape)
        _morton_order.cache_clear()

        speedup = original_time / optimized_time if optimized_time > 0 else float("inf")
        print(f"{str(shape):<20} {original_time:>10.3f}ms {optimized_time:>10.3f}ms {speedup:>8.2f}x")


if __name__ == "__main__":
    main()
```

edit: This benchmark did not correctly eliminate caching effects.

@mkitti mkitti changed the title perf: Optimize _morton_coding perf: Optimize Morton order with hypercube, vectorization, and magic numbers Feb 13, 2026
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Feb 13, 2026
@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label Feb 13, 2026
codspeed-hq bot commented Feb 13, 2026

Merging this PR will not alter performance

✅ 48 untouched benchmarks
🆕 3 new benchmarks
⏩ 6 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| 🆕 WallTime | test_sharded_morton_indexing[(32, 32, 32)-memory] | N/A | 1.2 s | N/A |
| 🆕 WallTime | test_sharded_morton_indexing[(16, 16, 16)-memory] | N/A | 147.5 ms | N/A |
| 🆕 WallTime | test_sharded_morton_indexing_large[(32, 32, 32)-memory] | N/A | 9.3 s | N/A |

Comparing mkitti:mkitti-morton-decode-optimization (65205b3) with main (23596c1)


Footnotes

  [1] 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived in CodSpeed to remove them from the performance reports.

d-v-b (Contributor) commented Feb 13, 2026

The changes here don't measurably affect performance but do add a lot of code. Do you think you can put together a realistic benchmark that reveals an effect of these changes? E.g., reading exactly 1 sub-chunk from a high-dimensional sharded array? I think I want to see some measurable performance impact on a realistic workload before thinking about raising the code complexity in this way.

mkitti (Author) commented Feb 13, 2026

benchmark_morton_detailed.py

==========================================================================================
Morton Order Optimization Benchmarks
==========================================================================================

### 2D Power-of-2
Shape                  Original  Hypercube  Vec+Generic   Full Opt  Speedup
---------------------------------------------------------------------------
(16, 16)                 1.40ms     0.80ms       0.43ms     0.41ms     3.4x
(32, 32)                 5.42ms     3.72ms       1.47ms     1.42ms     3.8x
(64, 64)                24.03ms    18.35ms       5.83ms     5.53ms     4.3x

### 2D Non-power-of-2
Shape                  Original  Hypercube  Vec+Generic   Full Opt  Speedup
---------------------------------------------------------------------------
(15, 15)                 1.42ms     1.12ms       1.15ms     1.13ms     1.3x
(100, 100)             101.74ms    92.95ms      80.90ms    80.50ms     1.3x

### 3D Power-of-2
Shape                  Original  Hypercube  Vec+Generic   Full Opt  Speedup
---------------------------------------------------------------------------
(8, 8, 8)                2.86ms     1.72ms       0.79ms     0.91ms     3.1x
(16, 16, 16)            24.41ms    16.88ms       5.95ms     5.98ms     4.1x
(32, 32, 32)           215.02ms   161.18ms      50.25ms    49.65ms     4.3x

### 3D Mixed
Shape                  Original  Hypercube  Vec+Generic   Full Opt  Speedup
---------------------------------------------------------------------------
(5, 7, 11)               4.17ms     3.94ms       3.81ms     4.08ms     1.0x

### With Singletons
Shape                  Original  Hypercube  Vec+Generic   Full Opt  Speedup
---------------------------------------------------------------------------
(1, 32, 32)              5.91ms     5.96ms       6.19ms     1.90ms     3.1x
(1, 1, 32, 32, 32)     238.82ms   242.77ms     238.35ms    73.14ms     3.3x
(1, 1, 1, 16, 16)        1.60ms     1.55ms       2.17ms     0.76ms     2.1x

==========================================================================================
Summary Table for PR (representative shapes)
==========================================================================================

Shape                |   Original |  Hypercube | + Vectorized | + Magic Nums |  + Singleton
------------------------------------------------------------------------------------------
(32, 32)             |     6.03ms |     3.92ms |       1.54ms |       1.57ms |        N/A
(100, 100)           |    99.72ms |    90.92ms |      77.44ms |      78.32ms |        N/A
(32, 32, 32)         |   218.05ms |   159.02ms |      48.33ms |      48.14ms |        N/A
(1, 1, 32, 32, 32)   |   245.68ms |   245.70ms |     239.57ms |        N/A   |      73.12ms

mkitti (Author) commented Feb 13, 2026

The most efficient changes are probably represented by 1b1076f. This is the Vec+Generic column above.

Diff: main...mkitti:zarr-python:1b1076f136645300d5132839403494ea2920ac13

$ git diff --compact-summary main 1b1076f
 src/zarr/core/indexing.py | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 64 insertions(+), 3 deletions(-)

For about 64 lines of code, you get a 4x speedup for power-of-2 shapes such as (32, 32, 32), reducing the time from roughly 200ms to 50ms. That's a common case: (1024, 1024, 1024) shards with (32, 32, 32) inner chunks.
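
For concreteness, that workload looks something like this (a sketch; the keyword names follow zarr-python 3's `create_array`, and the store path is arbitrary):

```python
import zarr

# A (1024,)*3 shard holding (32,)*3 inner chunks: 32**3 = 32,768 chunks
# whose Morton traversal must be computed per shard shape.
arr = zarr.create_array(
    store="data.zarr",
    shape=(4096, 4096, 4096),
    shards=(1024, 1024, 1024),
    chunks=(32, 32, 32),
    dtype="uint8",
)
```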

mkitti (Author) commented Feb 13, 2026

The magic number functions are not as effective as they could be because we do not have a compiler that can apply SIMD optimizations. That algorithm is what you would find in C++ libraries such as libmorton. The explanation is here:
https://fgiesen.wordpress.com/2009/12/13/decoding-morton-codes/

That's 150 lines of code for a small effect measured in nanoseconds:
mkitti/zarr-python@1b1076f...mkitti:zarr-python:db31842b072c9d68038b2c37d1350fd0938e6310
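
For reference, the 2D version of that technique looks like this in Python (a sketch of the approach from the post above, not the code in this PR):

```python
def compact_bits_2d(x: int) -> int:
    """Gather the even-position bits of a 32-bit Morton code."""
    x &= 0x55555555                   # keep bits at even positions
    x = (x | (x >> 1)) & 0x33333333   # pack bit pairs
    x = (x | (x >> 2)) & 0x0F0F0F0F   # pack nibbles
    x = (x | (x >> 4)) & 0x00FF00FF   # pack bytes
    x = (x | (x >> 8)) & 0x0000FFFF   # pack 16-bit halves
    return x

def decode_morton_2d(code: int) -> tuple[int, int]:
    return compact_bits_2d(code), compact_bits_2d(code >> 1)

print(decode_morton_2d(0b1110))  # (2, 3)
```

Without SIMD, CPython executes each of those masked shifts as a separate arbitrary-precision integer operation, which is why the payoff is much smaller than in C++.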

d-v-b (Contributor) commented Feb 13, 2026

could you add something to our existing indexing benchmark routines that demonstrates the performance gains here?

d-v-b (Contributor) commented Feb 13, 2026

for context, this library does not aspire to high-performance morton code generation. Instead we want array indexing to be fast. So for these changes to be attractive, they should make array indexing faster in at least some cases.

@mkitti mkitti changed the title perf: Optimize Morton order with hypercube, vectorization, and magic numbers perf: Optimize Morton order with hypercube and vectorization Feb 13, 2026
mkitti and others added 3 commits February 13, 2026 06:32
Add benchmarks that clear the _morton_order LRU cache before each
iteration to measure the full Morton computation cost:

- test_sharded_morton_indexing: 512-4096 chunks per shard
- test_sharded_morton_indexing_large: 32768 chunks per shard

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
d-v-b (Contributor) commented Feb 13, 2026

oh and while you're at it, could you ensure that the lru store cache is bounded? I forgot to do this in #3705

mkitti (Author) commented Feb 13, 2026

The default is maxsize=128. Do you want it smaller? Claude suggests 16.

d-v-b (Contributor) commented Feb 13, 2026

16 MB I assume? That seems fine! We just want it to be explicit.

Comment on lines +146 to +148
```python
def read_with_cache_clear() -> None:
    _morton_order.cache_clear()
    getitem(data, indexer)
```
mkitti (Author) commented:

Claude's test involves clearing the cache before each benchmark run. Let's pretend that this represents thrashing the now bounded cache.

mkitti (Author) commented Feb 13, 2026

> 16 MB I assume? That seems fine! We just want it to be explicit.

No. It will memoize the last 16 calls.

https://docs.python.org/3/library/functools.html#functools.lru_cache

> Decorator to wrap a function with a memoizing callable that saves up to the maxsize most recent calls. It can save time when an expensive or I/O bound function is periodically called with the same arguments.
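
Concretely (a sketch; the decorated function stands in for `_morton_order`, body elided):

```python
from functools import lru_cache

@lru_cache(maxsize=16)  # bound: the 16 most recently used chunk shapes
def _morton_order(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
    ...  # full Morton traversal for this shape
```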

mkitti (Author) commented Feb 13, 2026

The end-to-end array indexing benchmarks have been added to the pull request description.

End-to-End Sharded Array Indexing

Benchmark with 32³ = 32,768 chunks per shard (test_sharded_morton_indexing_large):

| Branch | Mean Time | Morton Contribution |
|---|---|---|
| Main (original) | 7.07s | ~220ms |
| Optimized | 6.89s | ~50ms |
| Improvement | 171ms (2.4%) | ~4x faster |

The 2.4% end-to-end improvement directly correlates with the Morton order speedup. While I/O dominates total time, the Morton optimization provides measurable benefits for workloads with large chunks-per-shard counts.
