
Add heap-based BPE merge path for large inputs (>128 bytes) #7580

Open
Copilot wants to merge 16 commits into main from copilot/fix-byte-pair-encoding-performance

Conversation

Contributor

Copilot AI commented Feb 12, 2026

Heap-based BPE Optimization for Large Inputs

Changes Made

  • Added threshold check (128 bytes) in BytePairEncode method with explanatory comment
  • Implemented BytePairEncodeLarge with O(n log n) heap-based algorithm
  • Added comprehensive test suite (correctness + consistency, no timing-based tests)
  • Addressed code review feedback
    • Added comment explaining threshold choice
    • Removed timing-based tests
    • Removed redundant comment about captured variables
    • Changed PriorityQueue to use default capacity instead of pre-allocating to max
    • Fixed CompareTo to use standard ordering (lower rank = smaller CompareTo value)
    • Updated test comments to accurately describe what they verify
    • Reverted unrelated BOM/whitespace changes in other files
    • Added parameterless PriorityQueue constructor
    • Added comment noting CurRank assumes rank == token Id (Tiktoken-specific)
    • Removed stackalloc for State array; always use ArrayPool
    • Replaced List+ToArray with ArrayPool for result buffer
  • Code builds successfully for both netstandard2.0 and net8.0
Original prompt

Problem

The current BytePairEncoder.BytePairEncode in src/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs uses an O(n·m) algorithm, where n is the piece length in bytes and m is the number of merges: each merge iteration does a linear scan for the minimum rank, followed by an O(n) element removal via copy. This is fine for typical pre-tokenizer regex pieces (which are small), but because m can approach n, the cost degrades quadratically when a single piece is large (100+ bytes). This can happen with adversarial or unusual inputs where the pre-tokenizer regex produces no split points, for example long runs of repeated characters with no whitespace or punctuation.
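For illustration, the quadratic behaviour described above can be reproduced with a minimal sketch. This is not the code in BytePairEncoder.cs: it works on strings rather than byte spans, and `NaiveBpe`, `Merge`, and the toy rank table are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;

static class NaiveBpe
{
    // Merges adjacent pieces of `text` using a toy rank table (lower rank merges first).
    public static List<string> Merge(string text, IReadOnlyDictionary<string, int> ranks)
    {
        var parts = new List<string>(text.Length);
        foreach (char c in text) parts.Add(c.ToString());

        while (parts.Count > 1)
        {
            int bestIndex = -1;
            int bestRank = int.MaxValue;

            // Linear scan for the lowest-rank adjacent pair: O(n) per iteration.
            // The strict '<' keeps the leftmost pair when ranks tie.
            for (int i = 0; i < parts.Count - 1; i++)
            {
                if (ranks.TryGetValue(parts[i] + parts[i + 1], out int rank) && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }

            if (bestIndex < 0) break; // no mergeable pair left

            parts[bestIndex] += parts[bestIndex + 1];
            parts.RemoveAt(bestIndex + 1); // shifts the tail: another O(n) per merge
        }

        return parts;
    }
}
```

With the toy table `{ "aa": 0, "aaaa": 1 }`, `NaiveBpe.Merge("aaaaa", ranks)` returns `["aaaa", "a"]`. Every merge pays for one full scan plus one tail shift, which is where the O(n·m) cost comes from.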

This mirrors a known issue in the upstream OpenAI tiktoken implementation, addressed in openai/tiktoken#495 which added a heap-based _byte_pair_merge_large function. Relevant upstream issues:

Required changes

Add an alternative heap-based BPE merge path in src/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs that is used only for large pieces, keeping the common-case path completely untouched.

Design

  1. In the existing BytePairEncode method, add a length check after the existing mergingBytes.Length == 1 fast path. If mergingBytes.Length is above a threshold (use 128 as a reasonable starting point — the upstream Rust implementation uses 100, but the existing C# linear-scan is very cache-friendly on small spans so a slightly higher threshold is appropriate), call a new separate method BytePairEncodeLarge and return its result. The rest of the existing method body must remain exactly as-is — no changes to the common-case code path.

  2. Add a new private static method BytePairEncodeLarge with the same signature as BytePairEncode that implements the heap-based algorithm. The algorithm:

    • Uses a state array (linked list via prev/end/nextEnd indices) instead of shifting elements on removal — O(1) removal.
    • Uses a PriorityQueue<int, (int Rank, int Start)> (min-heap) to find the next best merge in O(log n) instead of linear scan.
    • Uses lazy invalidation: each state entry tracks its current nextRank; when a heap entry is popped, if its rank doesn't match the state's current nextRank, it was invalidated and is skipped.
    • This gives an O(n log n) merge phase instead of O(n·m).
  3. The result mapping (the indexMappingSpan logic at the end of the current method that maps byte-level indices back to UTF-16 char indices) must also be present in the new method, producing the same (int Id, int TokenIndex, int TokenLength)[] output format.

  4. Both paths must produce identical results for any input — the heap-based algorithm finds the same sequence of minimum-rank merges, just more efficiently.
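The design above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's BytePairEncodeLarge: it operates on strings with a toy rank table, omits the index mapping back to UTF-16, assumes each mergeable string has a unique rank, and uses the .NET 6+ `System.Collections.Generic.PriorityQueue` (the PR's actual queue type may differ). `HeapBpe`, `State`, and `Refresh` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class HeapBpe
{
    private struct State
    {
        public int Prev;     // start index of the previous live piece, or -1
        public int End;      // exclusive end of this piece (start of the next piece)
        public int NextRank; // rank of merging with the next piece, or int.MaxValue
    }

    public static List<string> Merge(string text, IReadOnlyDictionary<string, int> ranks)
    {
        int n = text.Length;
        var state = new State[n];
        var heap = new PriorityQueue<int, (int Rank, int Start)>();

        // Recomputes the rank of merging the piece at `start` with its right
        // neighbour and, if that merge exists, queues a fresh heap entry for it.
        void Refresh(int start)
        {
            int rank = int.MaxValue;
            int next = state[start].End;
            if (next < n)
            {
                string merged = text.Substring(start, state[next].End - start);
                if (!ranks.TryGetValue(merged, out rank)) rank = int.MaxValue;
            }
            state[start].NextRank = rank;
            if (rank != int.MaxValue) heap.Enqueue(start, (rank, start));
        }

        // Every character starts as its own piece; initialise all entries first
        // because Refresh reads the neighbour's End.
        for (int i = 0; i < n; i++)
            state[i] = new State { Prev = i - 1, End = i + 1, NextRank = int.MaxValue };
        for (int i = 0; i < n - 1; i++) Refresh(i);

        while (heap.TryDequeue(out int start, out var priority))
        {
            // Lazy invalidation: a stale entry no longer matches the live state.
            if (state[start].NextRank != priority.Rank) continue;

            int next = state[start].End;
            int nextEnd = state[next].End;

            state[start].End = nextEnd;          // absorb the right neighbour in O(1)
            state[next].NextRank = int.MaxValue; // invalidate the absorbed piece's entries
            if (nextEnd < n) state[nextEnd].Prev = start;

            Refresh(start);                      // the merged piece has a new right pair
            if (state[start].Prev >= 0) Refresh(state[start].Prev); // so does its left neighbour
        }

        // Walk the linked list of surviving pieces.
        var parts = new List<string>();
        for (int i = 0; i < n; i = state[i].End)
            parts.Add(text.Substring(i, state[i].End - i));
        return parts;
    }
}
```

With the toy table `{ "aa": 0, "aaaa": 1 }`, `HeapBpe.Merge("aaaaa", ranks)` returns `["aaaa", "a"]`. Because the priority is `(Rank, Start)`, ties on rank resolve to the leftmost pair, which is the same choice a left-to-right linear scan makes.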

Key constraints

  • Zero impact on the common case: The only change to the existing hot path is a single integer comparison that is always predicted-not-taken. The new method is completely separate — no new allocations, no JIT impact on the hot path.
  • Use ArrayPool for the state array if it exceeds a reasonable stackalloc threshold, matching the existing code's pattern.
  • Use PriorityQueue<int, (int Rank, int Start)> from System.Collections.Generic (available in .NET 6+). The int element is the start index. The priority tuple ensures min-rank ordering with start-position tiebreaking.
  • Add tests that verify large inputs (e.g., 1000+ repeated characters) produce correct results and complete in reasonable time, and that the results match the existing algorithm for smaller inputs.
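Two of the constraints above, the `(Rank, Start)` priority ordering and the ArrayPool rent/return pattern, can be illustrated in isolation. This is a hedged sketch: `QueueAndPoolDemo`, `DrainOrder`, and `SumOfIndices` are hypothetical names, and the example assumes .NET 6+ for `PriorityQueue<TElement, TPriority>`.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

static class QueueAndPoolDemo
{
    // Dequeues everything and returns the start indices in the order the heap yields them.
    public static List<int> DrainOrder()
    {
        var heap = new PriorityQueue<int, (int Rank, int Start)>();
        heap.Enqueue(7, (Rank: 5, Start: 7));
        heap.Enqueue(2, (Rank: 3, Start: 2));
        heap.Enqueue(0, (Rank: 3, Start: 0));

        var order = new List<int>();
        while (heap.TryDequeue(out int start, out _)) order.Add(start);
        return order; // 0, 2, 7: lowest rank first, ties broken by the lower start
    }

    // Rents a scratch buffer, uses only the first `length` slots, and always returns it.
    public static int SumOfIndices(int length)
    {
        int[] rented = ArrayPool<int>.Shared.Rent(length); // may be longer than requested
        try
        {
            Span<int> scratch = rented.AsSpan(0, length);
            for (int i = 0; i < scratch.Length; i++) scratch[i] = i;

            int sum = 0;
            foreach (int value in scratch) sum += value;
            return sum;
        }
        finally
        {
            ArrayPool<int>.Shared.Return(rented);
        }
    }
}
```

The tuple priority relies on the default comparer for value tuples, which compares element by element, so no custom comparer is needed for min-rank ordering with start-position tiebreaking.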

Reference implementation

The upstream Rust implementation is in _byte_pair_merge_large at https://github.com/openai/tiktoken/blob/6ec8149812275a88a99dfddb54a60b6d98d75bc2/src/lib.rs#L47-L138

The existing C# code to modify is at:
https://github.com/dotnet/machinelearning/blob/3604580949518d2173f10c67936a9e42fddc5701/src/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs

This pull request was created from Copilot chat.



Copilot AI and others added 3 commits February 12, 2026 22:49
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
…p capacity

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix performance issue in BytePairEncoder for large inputs" to "Add heap-based BPE merge path for large inputs (>128 bytes)" Feb 12, 2026
Copilot AI requested a review from stephentoub February 12, 2026 22:58

codecov bot commented Feb 13, 2026

Codecov Report

❌ Patch coverage is 91.15646% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.08%. Comparing base (3604580) to head (99c5107).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|--------:|-------|
| ...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs | 89.90% | 9 Missing and 2 partials ⚠️ |
| ...est/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs | 94.28% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7580      +/-   ##
==========================================
+ Coverage   69.05%   69.08%   +0.02%     
==========================================
  Files        1483     1483              
  Lines      274362   274513     +151     
  Branches    28270    28285      +15     
==========================================
+ Hits       189466   189648     +182     
+ Misses      77510    77482      -28     
+ Partials     7386     7383       -3     
| Flag       | Coverage Δ                   |
|------------|------------------------------|
| Debug      | 69.08% <91.15%> (+0.02%) ⬆️  |
| production | 63.35% <90.17%> (+0.03%) ⬆️  |
| test       | 89.52% <94.28%> (+<0.01%) ⬆️ |


| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| src/Microsoft.ML.Tokenizers/Utils/PriorityQueue.cs | 63.49% <100.00%> (+1.82%) ⬆️ |
| ...est/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs | 99.14% <94.28%> (+0.05%) ⬆️ |
| ...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs | 91.57% <89.90%> (+3.92%) ⬆️ |

... and 9 files with indirect coverage changes

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 13, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • dnceng.pkgs.visualstudio.com
    • Triggering command: /home/REDACTED/work/machinelearning/machinelearning/.dotnet/dotnet dotnet build test/Microsoft.ML.Tokenizers.Tests/Microsoft.ML.Tokenizers.Tests.csproj --no-restore (dns block)
    • Triggering command: /home/REDACTED/work/machinelearning/machinelearning/.dotnet/dotnet dotnet test test/Microsoft.ML.Tokenizers.Tests/Microsoft.ML.Tokenizers.Tests.csproj --no-build --filter FullyQualifiedName~TestLargeInput -v n (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@stephentoub
Member

@copilot, can you write and run a benchmark using benchmarkdotnet, comparing the performance before and after this change? Please run it with an input that would have a mergingBytes length of 127, 128, and 1000.

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 13, 2026

@copilot, can you write and run a benchmark using benchmarkdotnet, comparing the performance before and after this change? Please run it with an input that would have a mergingBytes le...

Here are the BenchmarkDotNet results comparing before (original O(n²) only) vs after (heap-based for >128 bytes):

Before (original algorithm for all sizes):

| Method     | Mean      | Error    | StdDev   | Allocated |
|----------- |----------:|---------:|---------:|----------:|
| Length127  |  28.73 us | 0.080 us | 0.066 us |     760 B |
| Length128  |  28.93 us | 0.044 us | 0.037 us |     592 B |
| Length1000 | 888.37 us | 2.554 us | 2.264 us |   3305 B |

After (heap-based for >128 bytes):

| Method     | Mean      | Error    | StdDev   | Allocated |
|----------- |----------:|---------:|---------:|----------:|
| Length127  |  28.91 us | 0.031 us | 0.024 us |     760 B |
| Length128  |  29.07 us | 0.039 us | 0.036 us |     592 B |
| Length1000 | 281.79 us | 0.473 us | 0.369 us |  15585 B |

Summary:

  • Length 127: ~28.7 us → ~28.9 us — no meaningful change (uses original path)
  • Length 128: ~28.9 us → ~29.1 us — no meaningful change (uses original path, threshold is > 128)
  • Length 1000: 888 us → 282 us — 3.15× faster, but allocates 15.6 KB vs 3.3 KB

The speed improvement at length 1000 is significant. The memory increase comes from the PriorityQueue's internal list (pre-allocated to mergingBytes.Length - 1 capacity) and accumulation of stale entries during merge operations. The gap widens further at larger sizes due to the O(n²) vs O(n log n) complexity difference.
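As a quick check of the figures quoted in the summary, using the Length1000 means and allocations from the two tables above:

```latex
\[
\text{speedup} = \frac{888.37\,\mu\text{s}}{281.79\,\mu\text{s}} \approx 3.15,
\qquad
\text{allocation ratio} = \frac{15585\,\text{B}}{3305\,\text{B}} \approx 4.72
\]
```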

Environment: BenchmarkDotNet v0.13.12, Ubuntu 24.04.3 LTS, Intel Xeon Platinum 8370C 2.80GHz, .NET 8.0.16.

…odeLarge

Co-authored-by: tarekgh <10833894+tarekgh@users.noreply.github.com>
Copilot AI requested a review from tarekgh February 13, 2026 17:20
…ecific)

Co-authored-by: tarekgh <10833894+tarekgh@users.noreply.github.com>
Copilot AI requested a review from tarekgh February 13, 2026 17:40
…is only called for >128 bytes

Co-authored-by: tarekgh <10833894+tarekgh@users.noreply.github.com>
…deLarge

Co-authored-by: tarekgh <10833894+tarekgh@users.noreply.github.com>
@tarekgh
Member

tarekgh commented Feb 14, 2026

/ba-g the failures are unrelated infrastructure timeouts.

3 participants