GH-49114: [C++][Parquet] Fix converting schema failure with deep nested two-level encoding list structure#49125
Conversation
|
@pitrou @emkornfield @wgtmac Please take a look. |
b494f79 to
7176212
Compare
There was a problem hiding this comment.
Pull request overview
Fixes Parquet-to-Arrow schema conversion for deep nested two-level encoding LIST structures (and related MAP-in-LIST cases) that previously errored as illegal.
Changes:
- Refactors list element resolution into a new
ResolveList()helper to better handle nested two-level LIST encodings. - Relaxes repetition constraints to allow LIST/MAP nodes to be
REPEATEDwhen they are elements of an enclosing LIST. - Adds/updates unit tests to cover deep nested two-level LISTs and two-level LIST-of-MAP schemas, and adjusts expected error text.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
cpp/src/parquet/arrow/schema.cc |
Adjusts LIST/MAP schema conversion logic and introduces ResolveList() for nested/two-level handling. |
cpp/src/parquet/arrow/arrow_schema_test.cc |
Adds coverage for deep nested two-level LIST and allows previously-failing LIST-of-MAP case; updates an error expectation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
d9eb0dc to
4f7558f
Compare
4f7558f to
145580f
Compare
wgtmac
left a comment
There was a problem hiding this comment.
Mostly look good to me. Thanks for fixing this!
cpp/src/parquet/arrow/schema.cc
Outdated
| } else if (list_group.field_count() == 1) { | ||
| const auto& repeated_field = list_group.field(0); | ||
| if (repeated_field->is_repeated()) { | ||
| RETURN_NOT_OK(check_two_level_list_repeated(group)); |
There was a problem hiding this comment.
Do we need to validate that list_group has no logical type annotation?
There was a problem hiding this comment.
I think this is not in backward-compatibility rules.
| // optional binary value (STRING); | ||
| // } | ||
| // } | ||
| // } |
There was a problem hiding this comment.
Do we want to add a case for the below which is still produced by parquet-java by default:
// optional group my_list (LIST) {
// repeated group array (MAP) {
// repeated group key_value (MAP_KEY_VALUE) {
// required binary key (STRING);
// optional binary value (STRING);
// }
// }
// }
There was a problem hiding this comment.
I think it's not in backward-compatibility rules too. We may add it into document first.
efddcaa to
404a85e
Compare
|
Do you want to take a look? @pitrou @emkornfield @mapleFU |
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 0cf32b2. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
Fix the failure when converting parquet schema with deep nested two-level encoding list structure to arrow schema.
What changes are included in this PR?
Modified the
ListToSchemaFieldandMapToSchemaFieldmethods.Are these changes tested?
Yes.
Are there any user-facing changes?
It allows some legal schemas that were once considered illegal.