Validate coerce int96 config 17498#20253
Conversation
| let time_units_and_expected = vec![ | ||
| ( | ||
| None, // Same as "ns" time_unit | ||
| None, // default: None = "ns" |
There was a problem hiding this comment.
If we treat None as defaulting to ns then perhaps we should remove the Option and just make it directly a DFTimeUnit?
There was a problem hiding this comment.
yes I will do that
There was a problem hiding this comment.
should I change CoerceInt96Opt too ?
There was a problem hiding this comment.
Hmm perhaps I was mislead by these test comments; I'm not certain that None does default to ns, especially considering theres some test failures now 🤔
Might need more investigation before proceeding with this change (of removing the Option)
There was a problem hiding this comment.
The reason I initially proceeded with removing Option was this part of the docstring in config.rs:
/// (reading) If set, parquet reader will read columns of
/// physical type int96 as originating from a different resolution
/// than nanosecond.
I interpreted this as implying that the default behavior is effectively nanoseconds anyway, so making the field non-optional with a default of DFTimeUnit::Nanosecond seemed safe.
After reviewing the failing tests more closely, it’s clear that None does not behave the same as explicitly setting ns, and that None really means “don’t apply coercion at all”.
I’ll revert this change and keep Option<DFTimeUnit> to preserve the existing semantics.
Does that sound like the right approach to you?
There was a problem hiding this comment.
Yes that sounds good. And perhaps could improve the docstring for the config too, since it seems we were both mislead by what being unset means 😅
|
@mbutrovich FYI |
- Introduce DFTimeUnit enum for INT96 timestamp coercion - Implement FromStr, Display, and ConfigField for validation and SET support - Add conversions to/from arrow::TimeUnit for engine integration - Replace old parse_coerce_int96_string usage with DFTimeUnit - Ensures invalid config values are rejected early
Renamin DFTimeUnit enum variants, and fix docstrings
c00bc6b to
deb1ee1
Compare
Which issue does this PR close?
Rationale for this change
Currently, invalid
coerce_int96values (e.g., "invalid") are accepted atSETtime and only fail later when reading Parquet files. This PR adds early validation so invalid values are rejected immediately with clear error messages, following the same pattern asDFWriterVersion.What changes are included in this PR?
DFTimeUnitenum withFromStr,Display,ConfigFieldimplementationsParquetOptions.coerce_int96fromString/Option<String>toOption<DFTimeUnit>parse_coerce_int96_stringas a separate runtime parser; validation now happens at config timeParquetFormat::with_coerce_int96and related usage to acceptOption<DFTimeUnit>Are these changes tested?
Yes. Added
test_parquet_coerce_int96_validationthat verifies:Are there any user-facing changes?
Not exactly. Invalid
coerce_int96values now error immediately when set viaSETcommand or proto deserialization, instead of failing later during Parquet file reading. This provides better error messages and earlier feedback.