-
Notifications
You must be signed in to change notification settings - Fork 2k
fix: regression of dict_id in physical plan proto
#20063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
af1a108 to
e0a0061
Compare
brancz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thank you for fixing this!
Thanks for the review! |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @kumarUjjawal
FYI @dispanser
|
thanks @kumarUjjawal , appreciated! |
| root_as_message(encoded_schema.ipc_message.as_slice()).map_err( | ||
| |e| { | ||
| Error::General(format!( | ||
| "Error IPC schema message while deserializing ScalarValue::List: {e}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the error messages say ScalarValue::List ? (List)
Isn't this used for any nested type ? List, Map, Struct, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I will update.
…f-45 Undoing [apache#14227](apache#14227) which introduces a bug in protobuf deserialization by passing a wrong schema to record batch construction of the IPC message containing the dictionary. The associated test and a fix will probably be merged into datafusion as apache#20063 but the test fails in main for completely different reason so the patch cannot be backported and doesn't make sense in the context of df46.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @kumarUjjawal and @brancz
I feel intuitively something could be made much simpler here, but given this fixes the bug and i don't have any specific ideas of how to improve the PR, let's go with this one
| )); | ||
| }; | ||
|
|
||
| // IPC dictionary batch IDs are assigned when encoding the schema, but our protobuf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like somehow I missing something key -- this seems like a pretty massive overhead to create / decode a single value here.
That being said, it seems like the existing code is also doing the overhead, so maybe it is fine for now 🤔
I wonder if we could pull this logic for creating the schema into its own function to try and reduce the size of the overall method / make it easier to understand
| Ok(()) | ||
| } | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the code change reverted, this test fails like
---- cases::roundtrip_physical_plan::roundtrip_call_null_scalar_struct_dict stdout ----
Error: Plan("General error: Error encoding ScalarValue::List as IPC: Ipc error: no dict id for field item")
|
Thank you @alamb. I will look into your suggestions. |
Thanks -- let's do it as a follow on PR |

Which issue does this PR close?
Struct(Dict)data type #20011.Rationale for this change
dict_idis intentionally not preserved protobuf (it’s deprecated in Arrow schema metadata), but Arrow IPC still requires dict IDs for dictionary encoding/decoding.What changes are included in this PR?
Are these changes tested?
Yes added a test for this
Are there any user-facing changes?
No