
perf: Optimize Morton order with hypercube and vectorization#3708

Open
mkitti wants to merge 16 commits into zarr-developers:main from mkitti:mkitti-morton-decode-optimization

Conversation

mkitti commented Feb 13, 2026

Summary

This PR optimizes the Morton order computation in the `_morton_order` and `decode_morton` functions with several techniques that together yield a 3-5x speedup in Morton computation, resulting in measurable end-to-end improvements for sharded array indexing.

Optimizations

1. Hypercube Optimization

Calculate the largest power-of-2 hypercube that fits within chunk_shape. Within this hypercube, Morton codes are guaranteed to be in bounds, eliminating the need for bounds checking.
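
A minimal sketch of the idea (the helper name below is illustrative, not the PR's actual code):

```python
def largest_hypercube_side(chunk_shape: tuple[int, ...]) -> int:
    # Largest power-of-2 side length that fits in every dimension:
    # 2 ** floor(log2(min(chunk_shape))).
    return 1 << (min(chunk_shape).bit_length() - 1)

# Every Morton code below side ** ndim decodes to in-bounds coordinates,
# so the bounds check can be skipped for that whole prefix of codes.
side = largest_hypercube_side((5, 7, 11))  # -> 4
unchecked = side ** 3                      # the first 64 codes need no check
```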

2. Vectorized Morton Decoding

Created decode_morton_vectorized to decode multiple Morton codes at once using NumPy array operations instead of scalar bit manipulation.
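
Roughly, the approach looks like this (a simplified sketch, not the PR's exact `decode_morton_vectorized`; it assumes every dimension uses the same number of interleaved bits):

```python
import numpy as np

def decode_morton_batch(codes: np.ndarray, ndim: int, bits: int) -> np.ndarray:
    """Decode many Morton codes at once; bit i of dim d sits at i * ndim + d."""
    out = np.zeros((codes.size, ndim), dtype=np.int64)
    for d in range(ndim):
        for i in range(bits):
            # One vectorized shift/mask per (dim, bit) pair replaces a
            # Python-level bit loop over every individual code.
            out[:, d] |= ((codes >> (i * ndim + d)) & 1) << i
    return out

coords = decode_morton_batch(np.arange(16, dtype=np.int64), ndim=2, bits=2)
```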

3. Efficient Bit Counting

Replaced `math.ceil(math.log2(c))` with `(c - 1).bit_length()` for a faster and more accurate bit-width calculation.
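
For example:

```python
import math

# Integer-only bit width: the number of bits needed to index c values.
for c in (2, 7, 8, 9, 1000):
    assert (c - 1).bit_length() == math.ceil(math.log2(c))

# The float-based form can misround once c exceeds float precision:
c = 2**53 + 1
print((c - 1).bit_length())     # 54 (exact)
print(math.ceil(math.log2(c)))  # 53 (2**53 + 1 rounds to 2**53 as a float)
```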

4. Singleton Dimension Removal

For shapes with singleton dimensions (size 1) like (1,1,32,32,32), remove them before Morton computation, then expand coordinates back. This enables better optimization for the non-singleton dimensions.
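
A sketch of the squeeze/expand round trip (the helper and argument names here are illustrative):

```python
def morton_order_squeezed(shape, morton_order):
    """Compute Morton order over non-singleton dims, then re-insert zeros."""
    keep = [d for d, size in enumerate(shape) if size > 1]
    reduced = tuple(shape[d] for d in keep)
    expanded = []
    for coords in morton_order(reduced):
        full = [0] * len(shape)  # singleton dims always have coordinate 0
        for axis, c in zip(keep, coords):
            full[axis] = c
        expanded.append(tuple(full))
    return tuple(expanded)
```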

Benchmark Results

Morton Order Computation (micro-benchmark, no caching)

| Shape | Original | Optimized | Speedup |
|---|---|---|---|
| (8, 8) | 0.27ms | 0.13ms | 2.0x |
| (32, 32) | 5.21ms | 1.37ms | 3.8x |
| (8, 8, 8) | 2.57ms | 0.76ms | 3.4x |
| (32, 32, 32) | 220ms | 48ms | 4.6x |
| (16, 16, 16, 16) | 447ms | 107ms | 4.2x |
| (1, 1, 32, 32, 32) | 237ms | 71ms | 3.3x |

End-to-End Sharded Array Indexing

Benchmark with 32³ = 32,768 chunks per shard (test_sharded_morton_indexing_large):

| Branch | Mean Time | Morton Contribution |
|---|---|---|
| Main (original) | 7.07s | ~220ms |
| Optimized | 6.89s | ~50ms |
| Improvement | 171ms (2.4%) | ~4x faster |

The 2.4% end-to-end improvement directly correlates with the Morton order speedup. While I/O dominates total time, the Morton optimization provides measurable benefits for workloads with large chunks-per-shard counts.

Benchmark Script
```python
#!/usr/bin/env python
"""Detailed benchmark showing contribution of each Morton order optimization."""

import time
from math import prod

from zarr.core.indexing import _morton_order, decode_morton


def _morton_order_original(chunk_shape):
    """Original implementation - no optimizations."""
    n_total = prod(chunk_shape)
    order = []
    i = 0
    while len(order) < n_total:
        m = decode_morton(i, chunk_shape)
        if all(x < y for x, y in zip(m, chunk_shape, strict=False)):
            order.append(m)
        i += 1
    return tuple(order)


def benchmark(func, shape, iterations=10):
    """Benchmark a function and return average time in ms (no caching)."""
    if hasattr(func, 'cache_clear'):
        times = []
        for _ in range(iterations):
            func.cache_clear()
            start = time.perf_counter()
            func(shape)
            times.append(time.perf_counter() - start)
        return sum(times) / len(times) * 1000
    else:
        start = time.perf_counter()
        for _ in range(iterations):
            func(shape)
        return (time.perf_counter() - start) / iterations * 1000


def main():
    summary_shapes = [
        (8, 8),
        (32, 32),
        (8, 8, 8),
        (32, 32, 32),
        (16, 16, 16, 16),
        (1, 1, 32, 32, 32),
    ]

    print(f"| {'Shape':<20} | {'Original':>10} | {'Optimized':>10} | {'Speedup':>8} |")
    print(f"|{'-'*22}|{'-'*12}|{'-'*12}|{'-'*10}|")

    for shape in summary_shapes:
        t_orig = benchmark(_morton_order_original, shape)
        t_full = benchmark(_morton_order, shape)
        speedup = t_orig / t_full if t_full > 0 else float("inf")
        print(f"| \`{str(shape):<19}\` | {t_orig:>8.2f}ms | {t_full:>8.2f}ms | **{speedup:>5.1f}x** |")


if __name__ == "__main__":
    main()
```

Checklist

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 13, 2026
@mkitti mkitti changed the title perf: Skip bounds check for initial elements in 2^n hypercube perf: Optimize _morton_coding Feb 13, 2026
mkitti (Author) commented Feb 13, 2026

As of 865df2a, I see the speedups below. This optimization just elides the bounds check while within the largest power-of-2 hypercube.

     Shape                    Original    Optimized    Speedup
     --------------------------------------------------------
     (8, 8)                    0.304ms      0.020ms    14.98x
     (16, 16)                  1.322ms      0.086ms    15.41x
     (32, 32)                  5.706ms      0.383ms    14.91x
     (9, 9)                    0.943ms      0.084ms    11.20x
     (15, 15)                  1.252ms      0.115ms    10.91x
     (100, 100)              102.101ms     10.108ms    10.10x
     (8, 8, 8)                 2.733ms      0.185ms    14.76x
     (5, 7, 11)                4.446ms      0.445ms     9.99x
Benchmarking Script
```python
#!/usr/bin/env python
"""Benchmark script for Morton order optimization."""

import time
from math import prod

from zarr.core.indexing import _morton_order, decode_morton


def _morton_order_original(chunk_shape):
    """Original implementation without optimization."""
    n_total = prod(chunk_shape)
    order = []
    i = 0
    while len(order) < n_total:
        m = decode_morton(i, chunk_shape)
        if all(x < y for x, y in zip(m, chunk_shape, strict=False)):
            order.append(m)
        i += 1
    return tuple(order)


def benchmark(func, shape, iterations=10):
    """Benchmark a function and return average time in ms."""
    start = time.perf_counter()
    for _ in range(iterations):
        func(shape)
    return (time.perf_counter() - start) / iterations * 1000


def main():
    test_shapes = [
        (8, 8),       # Perfect power of 2
        (16, 16),     # Perfect power of 2
        (32, 32),     # Perfect power of 2
        (9, 9),       # Slightly larger than power of 2
        (15, 15),     # Just under power of 2
        (100, 100),   # Large non-power-of-2
        (8, 8, 8),    # 3D power of 2
        (5, 7, 11),   # 3D mixed
    ]

    print(f"{'Shape':<20} {'Original':>12} {'Optimized':>12} {'Speedup':>10}")
    print("-" * 56)

    for shape in test_shapes:
        # Benchmark original
        original_time = benchmark(_morton_order_original, shape)

        # Clear cache and benchmark optimized
        _morton_order.cache_clear()
        optimized_time = benchmark(_morton_order, shape)
        _morton_order.cache_clear()

        speedup = original_time / optimized_time if optimized_time > 0 else float("inf")
        print(f"{str(shape):<20} {original_time:>10.3f}ms {optimized_time:>10.3f}ms {speedup:>8.2f}x")


if __name__ == "__main__":
    main()
```

edit: This benchmark did not correctly eliminate caching effects.

@mkitti mkitti changed the title perf: Optimize _morton_coding perf: Optimize Morton order with hypercube, vectorization, and magic numbers Feb 13, 2026
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Feb 13, 2026
@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label Feb 13, 2026
codspeed-hq bot commented Feb 13, 2026

Merging this PR will not alter performance

✅ 48 untouched benchmarks
🆕 3 new benchmarks
⏩ 6 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| 🆕 WallTime | test_sharded_morton_indexing[(32, 32, 32)-memory] | N/A | 1.2 s | N/A |
| 🆕 WallTime | test_sharded_morton_indexing[(16, 16, 16)-memory] | N/A | 147.5 ms | N/A |
| 🆕 WallTime | test_sharded_morton_indexing_large[(32, 32, 32)-memory] | N/A | 9.3 s | N/A |

Comparing mkitti:mkitti-morton-decode-optimization (65205b3) with main (23596c1)


Footnotes

  [1] 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived in CodSpeed to remove them from the performance reports.

d-v-b (Contributor) commented Feb 13, 2026

The changes here don't measurably affect performance but do add a lot of code. Do you think you can put together a realistic benchmark that reveals an effect of these changes? E.g., reading exactly 1 sub-chunk from a high-dimensional sharded array? I think I want to see some measurable performance impact on a realistic workload before thinking about raising the code complexity in this way.

mkitti (Author) commented Feb 13, 2026

benchmark_morton_detailed.py

==========================================================================================
Morton Order Optimization Benchmarks
==========================================================================================

### 2D Power-of-2
Shape                  Original  Hypercube  Vec+Generic   Full Opt  Speedup
---------------------------------------------------------------------------
(16, 16)                 1.40ms     0.80ms       0.43ms     0.41ms     3.4x
(32, 32)                 5.42ms     3.72ms       1.47ms     1.42ms     3.8x
(64, 64)                24.03ms    18.35ms       5.83ms     5.53ms     4.3x

### 2D Non-power-of-2
Shape                  Original  Hypercube  Vec+Generic   Full Opt  Speedup
---------------------------------------------------------------------------
(15, 15)                 1.42ms     1.12ms       1.15ms     1.13ms     1.3x
(100, 100)             101.74ms    92.95ms      80.90ms    80.50ms     1.3x

### 3D Power-of-2
Shape                  Original  Hypercube  Vec+Generic   Full Opt  Speedup
---------------------------------------------------------------------------
(8, 8, 8)                2.86ms     1.72ms       0.79ms     0.91ms     3.1x
(16, 16, 16)            24.41ms    16.88ms       5.95ms     5.98ms     4.1x
(32, 32, 32)           215.02ms   161.18ms      50.25ms    49.65ms     4.3x

### 3D Mixed
Shape                  Original  Hypercube  Vec+Generic   Full Opt  Speedup
---------------------------------------------------------------------------
(5, 7, 11)               4.17ms     3.94ms       3.81ms     4.08ms     1.0x

### With Singletons
Shape                  Original  Hypercube  Vec+Generic   Full Opt  Speedup
---------------------------------------------------------------------------
(1, 32, 32)              5.91ms     5.96ms       6.19ms     1.90ms     3.1x
(1, 1, 32, 32, 32)     238.82ms   242.77ms     238.35ms    73.14ms     3.3x
(1, 1, 1, 16, 16)        1.60ms     1.55ms       2.17ms     0.76ms     2.1x

==========================================================================================
Summary Table for PR (representative shapes)
==========================================================================================

Shape                |   Original |  Hypercube | + Vectorized | + Magic Nums |  + Singleton
------------------------------------------------------------------------------------------
(32, 32)             |     6.03ms |     3.92ms |       1.54ms |       1.57ms |        N/A
(100, 100)           |    99.72ms |    90.92ms |      77.44ms |      78.32ms |        N/A
(32, 32, 32)         |   218.05ms |   159.02ms |      48.33ms |      48.14ms |        N/A
(1, 1, 32, 32, 32)   |   245.68ms |   245.70ms |     239.57ms |        N/A   |      73.12ms

mkitti (Author) commented Feb 13, 2026

The most efficient changes are probably represented by 1b1076f. This is the Vec+Generic column above.

Diff: main...mkitti:zarr-python:1b1076f136645300d5132839403494ea2920ac13

$ git diff --compact-summary main 1b1076f
 src/zarr/core/indexing.py | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 64 insertions(+), 3 deletions(-)

For about 64 lines of code, you get a 4x speedup for power-of-2 shapes such as (32, 32, 32), reducing the time from roughly 200ms to 50ms. That's a common case: (1024, 1024, 1024) shards with (32, 32, 32) inner chunks.
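
For concreteness, that workload looks something like this (a sketch; the keyword names follow zarr-python 3's `create_array`, and the store path is arbitrary):

```python
import zarr

# A (1024,)*3 shard holding (32,)*3 inner chunks: 32**3 = 32,768 chunks
# whose Morton traversal must be computed per shard shape.
arr = zarr.create_array(
    store="data.zarr",
    shape=(4096, 4096, 4096),
    shards=(1024, 1024, 1024),
    chunks=(32, 32, 32),
    dtype="uint8",
)
```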

mkitti (Author) commented Feb 13, 2026

The magic number functions are not as effective as they could be because we do not have a compiler that can apply SIMD optimizations. That algorithm is what you would find in C++ libraries such as libmorton. The explanation is here:
https://fgiesen.wordpress.com/2009/12/13/decoding-morton-codes/

That's 150 lines of code for a small effect measured in nanoseconds:
mkitti/zarr-python@1b1076f...mkitti:zarr-python:db31842b072c9d68038b2c37d1350fd0938e6310
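
For reference, the 2D version of that technique looks like this in Python (a sketch of the approach from the post above, not the code in this PR):

```python
def compact_bits_2d(x: int) -> int:
    """Gather the even-position bits of a 32-bit Morton code."""
    x &= 0x55555555                   # keep bits at even positions
    x = (x | (x >> 1)) & 0x33333333   # pack bit pairs
    x = (x | (x >> 2)) & 0x0F0F0F0F   # pack nibbles
    x = (x | (x >> 4)) & 0x00FF00FF   # pack bytes
    x = (x | (x >> 8)) & 0x0000FFFF   # pack 16-bit halves
    return x

def decode_morton_2d(code: int) -> tuple[int, int]:
    return compact_bits_2d(code), compact_bits_2d(code >> 1)

print(decode_morton_2d(0b1110))  # (2, 3)
```

Without SIMD, CPython executes each of those masked shifts as a separate arbitrary-precision integer operation, which is why the payoff is much smaller than in C++.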

d-v-b (Contributor) commented Feb 13, 2026

could you add something to our existing indexing benchmark routines that demonstrates the performance gains here?

d-v-b (Contributor) commented Feb 13, 2026

for context, this library does not aspire to high-performance morton code generation. Instead we want array indexing to be fast. So for these changes to be attractive, they should make array indexing faster in at least some cases.

@mkitti mkitti changed the title perf: Optimize Morton order with hypercube, vectorization, and magic numbers perf: Optimize Morton order with hypercube and vectorization Feb 13, 2026
mkitti and others added 3 commits February 13, 2026 06:32
Add benchmarks that clear the _morton_order LRU cache before each
iteration to measure the full Morton computation cost:

- test_sharded_morton_indexing: 512-4096 chunks per shard
- test_sharded_morton_indexing_large: 32768 chunks per shard

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
d-v-b (Contributor) commented Feb 13, 2026

oh and while you're at it, could you ensure that the lru store cache is bounded? I forgot to do this in #3705

mkitti (Author) commented Feb 13, 2026

The default is maxsize=128. Do you want it smaller? Claude suggests 16.

d-v-b (Contributor) commented Feb 13, 2026

16 MB I assume? That seems fine! We just want it to be explicit.

Comment on lines +146 to +148
```python
def read_with_cache_clear() -> None:
    _morton_order.cache_clear()
    getitem(data, indexer)
```
mkitti (Author) commented:

Claude's test involves clearing the cache before each benchmark run. Let's pretend that this represents thrashing the now bounded cache.

mkitti (Author) commented Feb 13, 2026

> 16 MB I assume? That seems fine! We just want it to be explicit.

No. It will memoize the last 16 calls.

https://docs.python.org/3/library/functools.html#functools.lru_cache

> Decorator to wrap a function with a memoizing callable that saves up to the maxsize most recent calls. It can save time when an expensive or I/O bound function is periodically called with the same arguments.
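
Concretely (a sketch; the decorated function stands in for `_morton_order`, body elided):

```python
from functools import lru_cache

@lru_cache(maxsize=16)  # bound: the 16 most recently used chunk shapes
def _morton_order(chunk_shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
    ...  # full Morton traversal for this shape
```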

mkitti (Author) commented Feb 13, 2026

The end-to-end array indexing benchmarks have been added to the pull request description.

End-to-End Sharded Array Indexing

Benchmark with 32³ = 32,768 chunks per shard (test_sharded_morton_indexing_large):

| Branch | Mean Time | Morton Contribution |
|---|---|---|
| Main (original) | 7.07s | ~220ms |
| Optimized | 6.89s | ~50ms |
| Improvement | 171ms (2.4%) | ~4x faster |

The 2.4% end-to-end improvement directly correlates with the Morton order speedup. While I/O dominates total time, the Morton optimization provides measurable benefits for workloads with large chunks-per-shard counts.
