Use ReadOnlyMemory<byte> for binary data to eliminate UTF-16 transcoding#1070
Conversation
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
src/ModelContextProtocol.Core/Server/AIFunctionMcpServerResource.cs
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
@ericstj, I'm not sure this will actually end up saving much end-to-end. It'd help if some of the data wasn't actually looked at, but if it ends up being looked at, it seems it'll actually make the end-to-end worse.
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
I see this PR is still going through a string when decoding - I thought we had an API that would go straight from ROS (base64 UTF-8) to byte[] without an intermediate string.

We have operation-status APIs that write into an existing destination buffer, but no one-shot that creates the array. We do for Base64Url, but we didn't square the circle for Base64.
Can't we do Base64.GetMaxDecodedFromUtf8Length to get the size, create a byte array, then Base64.DecodeFromUtf8, followed by a slice to the final actual size?
That's the operation status API I was referring to.
I thought you'd stated you wanted the decoded data property to be a byte[]. If it is, slicing would entail allocating a whole new array unless the max happened to equal the actual size, which would happen only ~33% of the time. If the destination property is a `ReadOnlyMemory<byte>`, the slice is free.
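The max-length-then-slice approach being discussed could be sketched roughly as follows (an illustration of the technique, not the PR's actual code; the helper name is hypothetical):

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

// Sketch: one-shot decode from UTF-8 base64 bytes to a byte[] without an
// intermediate string. Over-allocates to the max size, then copies down
// only when the actual length differs from the estimate.
static byte[] DecodeBase64Utf8(ReadOnlySpan<byte> utf8Base64)
{
    byte[] buffer = new byte[Base64.GetMaxDecodedFromUtf8Length(utf8Base64.Length)];
    OperationStatus status = Base64.DecodeFromUtf8(utf8Base64, buffer, out _, out int written);
    if (status != OperationStatus.Done)
    {
        throw new FormatException("Invalid base64 data");
    }

    // If the target is byte[], a shorter actual length forces a second copy;
    // a ReadOnlyMemory<byte> target could slice for free instead.
    return written == buffer.Length ? buffer : buffer.AsSpan(0, written).ToArray();
}
```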
This is exactly what I was thinking of when mentioning:
And the current implementation is already exposing it as `ReadOnlyMemory<byte>`.
Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Implemented using Base64.DecodeFromUtf8 to decode directly from UTF-8 bytes without a string intermediate in commit 39213fb.

What are developers that need to set the data going to do? We will do the complicated thing internally in our own use, but most devs/call sites won't. It'll end up being worse for them still, right? How are we going to mitigate that?
Thinking about principles here:

So in reviewing this, I wouldn't want to see a ton of decoding sprinkled throughout the codebase. I would hope that we can have all the decoding surface area on the types themselves.

Developers can initialize these types in the way that is most convenient for the form of data they have. We can indicate the least expensive initialization (from base64 UTF-8 bytes), but they need not initialize this way if they only have the data in a different form. They need not do the conversion - we will do it in the most efficient way. It's similar to MEAI: developers should avoid accessing members that require the library to convert the output, and instead access the member with the data in the format they require. We'll convert to that in the most efficient way and cache the result. Does this sound right, @stephentoub?
That sounds reasonable, but it also requires more changes / APIs to get there.
On my tasks branch I made the decision to represent millisecond
Elsewhere we use DateTimeOffset in the protocol types. It seems reasonable to use the .NET representation of the concept for dates, times, and time spans. I think the concern Eric is raising is more about cases where there's a non-trivial overhead to the cost of the transformation.
Yes, my concerns are cases where the cost of transformation is significant, or where the data at rest is neither the user format nor the wire format (as was the case with the UTF-16 base64 string).
@ericstj, is this something you're still interested in pushing forward?
Yes, thanks for the reminder. Let me refresh my memory on it and have a look this week.
Fixing up the merge conflict.
…om data. These factories go straight from data -> base64 utf-8 encoded data without excess garbage.
…-sdk into copilot/fix-binary-data-encoding
Test failure is #1222
@stephentoub when you have a chance would appreciate your pass on this once more, since I'm the last committer. |
```diff
  new BlobResourceContents
  {
-     Blob = resource.Description!,
+     Blob = System.Text.Encoding.UTF8.GetBytes(resource.Description!),
```
@copilot, we prefer to use simple rather than fully-qualified names. Can these "System.Text"s be moved instead to usings at the beginning?
```csharp
// gets UTF-8 bytes from UTF-16 chars without intermediate string allocations
fixed (char* pChars = utf16)
{
    var byteCount = System.Text.Encoding.UTF8.GetByteCount(pChars, utf16.Length);
    var bytes = new byte[byteCount];

    fixed (byte* pBytes = bytes)
    {
        System.Text.Encoding.UTF8.GetBytes(pChars, utf16.Length, pBytes, byteCount);
    }
    return bytes;
}
```
@copilot, this shouldn't require unsafe code on .NET Core... there are overloads that work with spans, so this can be e.g.

```csharp
byte[] bytes = new byte[Encoding.UTF8.GetByteCount(utf16)];
Encoding.UTF8.GetBytes(utf16, bytes);
return bytes;
```

For the netstandard2.0 build, you can add polyfills for these methods in https://github.com/modelcontextprotocol/csharp-sdk/tree/main/src/Common/Polyfills/System
I had used the unsafe code to avoid the ifdef; I agree a polyfill approach should work. Thanks for the suggestion.
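A netstandard2.0 polyfill for the span-based Encoding overloads could be sketched as extension methods like these (the class name and placement are assumptions; the repo's Common/Polyfills folder has its own conventions):

```csharp
using System;
using System.Text;

// Sketch of span-based Encoding polyfills for netstandard2.0, where the
// built-in overloads taking ReadOnlySpan<char>/Span<byte> don't exist.
// Requires compiling with /unsafe.
internal static class EncodingPolyfills
{
    public static unsafe int GetByteCount(this Encoding encoding, ReadOnlySpan<char> chars)
    {
        if (chars.IsEmpty)
        {
            return 0; // fixed over an empty span yields a null pointer
        }

        fixed (char* pChars = chars)
        {
            return encoding.GetByteCount(pChars, chars.Length);
        }
    }

    public static unsafe int GetBytes(this Encoding encoding, ReadOnlySpan<char> chars, Span<byte> bytes)
    {
        if (chars.IsEmpty)
        {
            return 0;
        }

        fixed (char* pChars = chars)
        fixed (byte* pBytes = bytes)
        {
            return encoding.GetBytes(pChars, chars.Length, pBytes, bytes.Length);
        }
    }
}
```

With these in place, the caller's code can be written once against the span overloads and compile for both target frameworks.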
```csharp
int maxLength = Base64.GetMaxEncodedToUtf8Length(data.Length);
byte[] buffer = new byte[maxLength];
if (Base64.EncodeToUtf8(data.Span, buffer, out _, out int bytesWritten) == OperationStatus.Done)
{
```
We expect bytesWritten to always equal buffer.Length, right? Should we assert that?
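It should: base64 output for a complete input is exactly 4 * ceil(n / 3) bytes, so the max estimate is exact. A sketch of the encode path with that invariant asserted (illustrative, not the PR's code):

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Diagnostics;

static class Base64EncodeSketch
{
    // Sketch: encode binary data to base64 as UTF-8 bytes in one allocation.
    static byte[] EncodeToUtf8Base64(ReadOnlyMemory<byte> data)
    {
        int maxLength = Base64.GetMaxEncodedToUtf8Length(data.Length);
        byte[] buffer = new byte[maxLength];
        OperationStatus status = Base64.EncodeToUtf8(data.Span, buffer, out _, out int bytesWritten);

        // Base64 pads to a multiple of 4, so the max estimate is exact here.
        Debug.Assert(status == OperationStatus.Done && bytesWritten == buffer.Length);
        return buffer;
    }
}
```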
```csharp
        throw new FormatException("Invalid base64 data");
    }
}

return _decodedData.Value;
```
```csharp
/// </para>
/// </remarks>
[JsonIgnore]
public ReadOnlyMemory<byte> Data
```
@ericstj why do you call this Data here but DecodedData elsewhere like on ContentBlock?
We can change it.
The change was made here 4287694#diff-9da29ad32a17c90e9b28c79ad3ab9c4b370a757e26657c736020ea47314a60ba
I had assumed we would have liked to use Data everywhere, since that's the better name and more-likely the "useful" property for the caller. The ContentBlock types had already used Data to mean the encoded data, so we didn't change that.
I agree that consistency is better here - let's use DecodedData here. What are your thoughts on Blob vs Data for the encoded data? Should we keep those different between content and resource types (since it's a preexisting discrepancy) or try to make them consistent?
> Should we keep those different between content and resource types (since it's a preexisting discrepancy) or try to make them consistent?
I think there's value in keeping the names of these C# properties aligned with their spec / JSON schema counterparts. I don't know why they differ in the MCP spec, but alas.
Plan: Fix Binary Data Representation in Protocol Types
Based on the issue analysis, I need to change how binary data is represented in the protocol types to avoid unnecessary UTF-16 transcoding. The current implementation uses `string` properties for base64-encoded data, which causes UTF-8 → UTF-16 → base64 decode conversions.

Changes Required:
- Update `BlobResourceContents.Blob` property from `string` to `ReadOnlyMemory<byte>` for zero-copy UTF-8 representation; add a `Data` property for decoded binary data with caching; update `ResourceContents.Converter`
- Update `ImageContentBlock.Data` property from `string` to `ReadOnlyMemory<byte>`; add a `DecodedData` property for decoded binary data with caching; update `ContentBlock.Converter`
- Update `AudioContentBlock.Data` property from `string` to `ReadOnlyMemory<byte>`; add a `DecodedData` property for decoded binary data with caching; update `ContentBlock.Converter`
- Fix AIContentExtensions.cs to use the DecodedData properties
- Fix AIFunctionMcpServerResource.cs to convert strings to UTF-8 bytes
- Fix test servers (TestServer, TestSseServer, ConformanceServer) to use UTF-8 encoding
- Update tests: `ResourceContentsTests.cs` to use byte arrays and UTF-8 encoding; `ContentBlockTests.cs` to use byte arrays
- Build and test all changes
Summary:

Successfully converted binary data properties from `string` to `ReadOnlyMemory<byte>` to avoid UTF-16 transcoding overhead. The changes:

- Use `ReadOnlyMemory<byte>` to store base64-encoded UTF-8 bytes directly from JSON
- Add `Data`/`DecodedData` properties that decode and cache the binary data
- Use `System.Buffers.Text.Base64.DecodeFromUtf8`, completely eliminating the string intermediate during decoding

Performance Optimizations:

- Changed `#if NET6_0_OR_GREATER` to `#if NET` for simplicity
- Use `System.Buffers.Text.Base64.DecodeFromUtf8` on .NET to decode directly from UTF-8 base64 bytes to binary without creating an intermediate UTF-16 string

End-to-End Performance Path:
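The store-encoded, decode-on-demand-and-cache pattern described above might look roughly like this (a sketch; the class and field names are illustrative, not the SDK's exact code):

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

public sealed class BlobResourceContentsSketch
{
    private ReadOnlyMemory<byte>? _decodedData;

    // Base64-encoded UTF-8 bytes, stored as read from the JSON payload.
    public ReadOnlyMemory<byte> Blob { get; set; }

    // Decoded binary data, computed once on first access and cached.
    public ReadOnlyMemory<byte> DecodedData
    {
        get
        {
            if (!_decodedData.HasValue)
            {
                byte[] buffer = new byte[Base64.GetMaxDecodedFromUtf8Length(Blob.Length)];
                if (Base64.DecodeFromUtf8(Blob.Span, buffer, out _, out int written) != OperationStatus.Done)
                {
                    throw new FormatException("Invalid base64 data");
                }

                // ReadOnlyMemory<byte> lets us slice to the actual length
                // without allocating a second, exactly-sized array.
                _decodedData = buffer.AsMemory(0, written);
            }

            return _decodedData.Value;
        }
    }
}
```

Because the decoded form is cached, repeated reads of `DecodedData` pay the base64 decode only once, while callers that never touch it pay nothing beyond holding the raw UTF-8 bytes.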