Conversation
|
@UBarney something I think could definitely be improved is the handling of the source expressions along the transformation and failure paths (https://github.com/drin/datafusion/blob/8cba13ceafcf0df047e753f20bf54ad85a02f019/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L690-L720). I try to avoid moving until I know what to return (the transformed expression or the source expression), but I don't know Rust/DataFusion well enough to know the best practices for when to clone, when to move, and how to avoid either until necessary. |
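One common way to approach the clone-versus-move question is to borrow the source expression while deciding, and move it only once the outcome is known. The sketch below is a toy illustration under that assumption: the `Expr` enum, `try_simplify`, and `simplify` are hypothetical stand-ins and are not DataFusion's types or the code in this PR.

```rust
// Toy expression type (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Negate(Box<Expr>),
}

/// Inspects the expression by reference and builds a new expression only
/// when a rewrite applies. Returns `None` on the failure path.
fn try_simplify(expr: &Expr) -> Option<Expr> {
    match expr {
        // --x  ==>  x  (clones only the sub-expression that is kept)
        Expr::Negate(inner) => match &**inner {
            Expr::Negate(innermost) => Some((**innermost).clone()),
            _ => None,
        },
        _ => None,
    }
}

/// Takes ownership, borrows while deciding, and hands the untouched source
/// back on the failure path, so the whole tree is never cloned.
/// The `bool` reports whether a rewrite happened.
fn simplify(expr: Expr) -> (Expr, bool) {
    match try_simplify(&expr) {
        Some(rewritten) => (rewritten, true),
        None => (expr, false),
    }
}
```

Calling `simplify` on a doubly negated literal returns the literal with `true`, while any other input comes back unchanged with `false`. The borrow taken by `try_simplify` ends before the `match` arms run, which is what allows `expr` to be moved in the `None` arm.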
8cba13c to
e4b2cf5
Compare
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
|
I will try to push this forward this week |
|
In theory we should be able to use the API added in |
e4b2cf5 to
95cb436
Compare
|
I tried to reuse existing date_trunc functions where possible and match the structure in date_part::preimage. The calendar duration and sql logic tests were added with the help of an LLM. I reviewed the calendar duration so that should be cohesive with my overall design, but the sql logic tests I have no context in and I would appreciate some extra review (and advice) in that area. Thanks! |
preimage for date_trunc
|
Hey @drin, I'll take a look tomorrow. |
| # Test YEAR granularity - basic comparisons | ||
|
|
||
| query P | ||
| SELECT ts FROM t1 WHERE date_trunc('year', ts) = timestamp '2024-01-01T00:00:00' ORDER BY ts; |
There was a problem hiding this comment.
Does your comment about this still hold?
SELECT PULocationID
      ,pickup_datetime
FROM taxi_view_2025
WHERE date_trunc('month', pickup_datetime) = '2025-12-03'
Without preimage it would always return False, but with preimage we create an interval [2025-12-01, 2026-01-01) and the simplification rule returns col >= 2025-12-01 AND col < 2026-01-01, so we could get false positives, because 2025-12-03 falls into that interval.
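The false-positive concern can be reproduced with a small sketch. It assumes a simplified calendar date modelled as a `(year, month, day)` tuple; `month_preimage`, `trunc_month`, and the two predicate helpers are hypothetical names for illustration, not functions from DataFusion.

```rust
// Hypothetical, simplified date: (year, month, day). Tuples compare
// lexicographically, which matches chronological order here.
type Date = (i32, u32, u32);

/// Stand-in for date_trunc('month', d).
fn trunc_month(d: Date) -> Date {
    (d.0, d.1, 1)
}

/// Half-open interval [lower, upper) covering the month of `literal`.
fn month_preimage(literal: Date) -> (Date, Date) {
    let (year, month, _day) = literal;
    let lower = (year, month, 1);
    let upper = if month == 12 {
        (year + 1, 1, 1)
    } else {
        (year, month + 1, 1)
    };
    (lower, upper)
}

/// The original predicate: date_trunc('month', col) = literal.
fn original_matches(col: Date, literal: Date) -> bool {
    trunc_month(col) == literal
}

/// The naive rewrite: col >= lower AND col < upper.
fn naive_rewrite_matches(col: Date, literal: Date) -> bool {
    let (lower, upper) = month_preimage(literal);
    col >= lower && col < upper
}
```

For the literal `(2025, 12, 3)` and a column value of `(2025, 12, 15)`, `original_matches` is false while `naive_rewrite_matches` is true. With the aligned literal `(2025, 12, 1)` both agree.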
There was a problem hiding this comment.
We'd need to change the behavior to cover this.
- One way would be to have a guard that checks the `=` operator specifically for the `date_trunc` preimage and returns `PreimageResult::None` if `rhs != date_trunc(granularity, rhs)`. But `preimage` is operator agnostic.
- Another way is to have an optimization rule that checks `rhs != date_trunc(granularity, rhs)` and returns `False` for the whole column. But that's adding a rule just for one UDF.
- Another way is to only let the `date_trunc` preimage work with `rhs = date_trunc(granularity, rhs)`, but this requires the user to write the date in the right way if they want the query to run faster.
For example:
++WHERE date_trunc('month', pickup_datetime) = '2025-12-01'
--WHERE date_trunc('month', pickup_datetime) = '2025-12-03'
There was a problem hiding this comment.
It was actually covered for floor preimage impl by @devanshu0987 in #20059
Check here:
https://github.com/apache/datafusion/pull/20059/changes#diff-077176fcf22cb36a0a51631a43739f5f015f46305be4f49142a450e25b152b84R280-R303
Floor is very similar to date_trunc, so we could replicate the behavior.
There was a problem hiding this comment.
I don't understand why None is returned in the case that a clear value is known. For = '2025-12-03', the value should be False. I assumed that None basically means that the preimage could not be determined because something was invalid (an error). If you use None for valid cases, how do you distinguish invalid cases?
There was a problem hiding this comment.
I guess I have a clarifying question:
What should the interval be in non-obvious cases? What happens if the Interval is None (it seems rewrite_with_preimage is only called on an actual Interval)?
There was a problem hiding this comment.
If there is no valid interval, there is no preimage
There was a problem hiding this comment.
(but, from those notes I also realized I was handling the interval wrong in some cases)
There was a problem hiding this comment.
But these are the notes relevant to this specific case (predicate operator is =):
// Special condition:
// if date_trunc(part, const_rhs) != const_rhs, this is always false
// For this operator, truncation means that we check if the column is INSIDE of a range.
// lhs(=) --> column >= date_trunc(part, const_rhs) AND column < next_interval(part, const_rhs)
date_trunc('month', pickup_datetime) = '2025-12-03' is always false.
date_trunc('month', pickup_datetime) < '2025-12-03' requires an interval to be returned.
Without knowing if the predicate operator is = or <, preimage cannot know whether to return an Interval or None (if None is even the correct return in that case). So preimage must return the interval. But, in rewrite_with_preimage, you can do the appropriate check:
Operator::Eq && lower != <original> => False,
Operator::Eq => and(<check if within interval>),
I'm not sure if this makes sense for intervals from non-truncating functions. I'd have to simmer on that...
There was a problem hiding this comment.
I think it only makes sense for truncating functions. Right now, rewrite_with_preimage is function agnostic and creating the expression only based on the Operator and Interval.
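The `Operator::Eq` handling described in the notes above can be sketched as follows. This is only an illustration for a truncating function over integer timestamps: `trunc`, `rewrite_eq`, and the `Rewritten` enum are hypothetical, and `width` (the bucket size, assumed positive) stands in for the granularity.

```rust
/// Hypothetical outcome of rewriting `trunc(column, width) = literal`.
#[derive(Debug, PartialEq)]
enum Rewritten {
    /// The predicate can never be true.
    AlwaysFalse,
    /// Equivalent to `column >= lower AND column < upper`.
    InRange { lower: i64, upper: i64 },
}

/// Stand-in for a truncating UDF: rounds down to a multiple of `width`.
/// Assumes `width > 0`.
fn trunc(value: i64, width: i64) -> i64 {
    value - value.rem_euclid(width)
}

/// Rewrites `trunc(column, width) = literal`.
/// If the literal is not itself a truncated value, nothing can equal it.
fn rewrite_eq(literal: i64, width: i64) -> Rewritten {
    let lower = trunc(literal, width);
    if lower != literal {
        return Rewritten::AlwaysFalse;
    }
    // Overflow of `lower + width` is ignored in this sketch.
    Rewritten::InRange { lower, upper: lower + width }
}
```

With an hour-sized bucket in milliseconds, `rewrite_eq(7_200_000, 3_600_000)` yields the range `[7_200_000, 10_800_000)`, whereas `rewrite_eq(7_200_001, 3_600_000)` yields `AlwaysFalse`. A non-truncating function would not satisfy the `lower != literal` shortcut, which matches the concern raised above.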
| .as_literal() | ||
| .and_then(|sv| sv.try_as_str().flatten()) | ||
| .map(part_normalization); | ||
|
|
There was a problem hiding this comment.
We should add a guard for Type families. col_expr: TimeStamp needs a lit_expr: TimeStamp and the same for Time types.
There was a problem hiding this comment.
as in if col_expr is a TimeStamp type, then lit_expr must also be a TimeStamp type? Why is that the case?
If I have a nanosecond timestamp (time since epoch) and the comparison is a Time type, if I convert both to nanosecond timestamps aren't they still comparable?
Actually, shouldn't this type of validation be upstream of preimage in whatever function decomposes the predicate?
There was a problem hiding this comment.
They shouldn't be comparable per examples below, but you're right, it should be covered earlier: https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/analyzer/type_coercion.rs
Some examples on types:
`where date_trunc(month, timestamp_col) = 12:00::TIME`: a time doesn't have a month, so it wouldn't make sense to compare the two.
`where date_trunc(minute, time_col) = 2025-01-01::12:00::TIMESTAMP`: a timestamp does have minutes, but a time column still won't have calendar granularity to compare with it.
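A guard along these lines could look like the sketch below. Everything in it is an assumption made for illustration: the `TemporalFamily` and `Granularity` enums and both helper functions are hypothetical, and, as noted above, the real check may belong in type coercion rather than in `preimage`.

```rust
/// Hypothetical type families for the column and the literal.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TemporalFamily {
    Timestamp,
    Time,
}

/// Hypothetical subset of truncation granularities.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Granularity {
    Year,
    Month,
    Day,
    Hour,
    Minute,
    Second,
}

/// A time-of-day value has no calendar component, so only sub-day
/// granularities are meaningful for it.
fn granularity_valid_for_time(granularity: Granularity) -> bool {
    matches!(
        granularity,
        Granularity::Hour | Granularity::Minute | Granularity::Second
    )
}

/// Returns true when a preimage rewrite is worth attempting: both sides
/// share a type family, and the granularity makes sense for that family.
fn preimage_guard(
    column: TemporalFamily,
    literal: TemporalFamily,
    granularity: Granularity,
) -> bool {
    column == literal
        && (column != TemporalFamily::Time || granularity_valid_for_time(granularity))
}
```

Under this sketch, both examples above are rejected because the families differ, and a `Time` column truncated to `Month` is rejected even when compared with a `Time` literal.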
| DateTruncGranularity::Hour => value + MILLIS_PER_HOUR, | ||
| DateTruncGranularity::Minute => value + MILLIS_PER_MINUTE, | ||
| DateTruncGranularity::Second => value + MILLIS_PER_SECOND, | ||
| DateTruncGranularity::Millisecond => value + 1, |
There was a problem hiding this comment.
Keep match arms for granularities finer than rhs?
| DateTruncGranularity::Millisecond => value + 1, | |
| DateTruncGranularity::Millisecond => value + 1, | |
| DateTruncGranularity::Microsecond => value + 1, |
There was a problem hiding this comment.
this is for incrementing milliseconds. If you increment by 1 when the granularity is microseconds then you've incremented by too much. If you have a timestamp in milliseconds and you're truncating microseconds, you should have no change because your timestamp is too coarse.
There was a problem hiding this comment.
Would returning v not create an empty interval [v, v)?
There was a problem hiding this comment.
it does and I guess I should change it, but the interval semantics doesn't actually matter because most of the time we're only using 1 side of it. but i will change it anyways just for completeness.
There was a problem hiding this comment.
I don't think interval semantics should be handled here though (I'll fix it at the call site)
There was a problem hiding this comment.
What I do instead, is I return the same value (no increment) and then I check if the increment call was a no-op (lower == upper), and if so, I print an appropriate error (e.g. "Millisecond too granular for time in Seconds").
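A minimal sketch of that approach, for a timestamp stored in milliseconds, is shown below. The `Granularity` enum, `increment_millis`, and `interval_millis` are hypothetical names, and the error text only imitates the example message quoted above.

```rust
const MILLIS_PER_SECOND: i64 = 1_000;
const MILLIS_PER_MINUTE: i64 = 60 * MILLIS_PER_SECOND;
const MILLIS_PER_HOUR: i64 = 60 * MILLIS_PER_MINUTE;

/// Hypothetical subset of truncation granularities.
#[derive(Debug, Clone, Copy)]
enum Granularity {
    Hour,
    Minute,
    Second,
    Millisecond,
    Microsecond,
}

/// Advances a millisecond timestamp by one unit of `granularity`.
/// A granularity finer than the timestamp's unit is a no-op.
fn increment_millis(value: i64, granularity: Granularity) -> i64 {
    match granularity {
        Granularity::Hour => value + MILLIS_PER_HOUR,
        Granularity::Minute => value + MILLIS_PER_MINUTE,
        Granularity::Second => value + MILLIS_PER_SECOND,
        Granularity::Millisecond => value + 1,
        Granularity::Microsecond => value,
    }
}

/// Builds [lower, upper) at the call site and reports a no-op increment
/// (lower == upper) as an error instead of returning an empty interval.
fn interval_millis(lower: i64, granularity: Granularity) -> Result<(i64, i64), String> {
    let upper = increment_millis(lower, granularity);
    if lower == upper {
        return Err(format!(
            "{granularity:?} too granular for time in Milliseconds"
        ));
    }
    Ok((lower, upper))
}
```

Keeping the interval check in `interval_millis` rather than in `increment_millis` mirrors the split described above: the increment stays a plain arithmetic helper, and the caller decides what an unchanged value means.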
0b1bbf1 to
3948017
Compare
ec55abb to
d988b58
Compare
| # Test YEAR granularity with non-aligned literal (2024-06-20 instead of 2024-01-01) | ||
| # date_trunc('year', x) can never equal '2024-06-20' because date_trunc always sets month=01, day=01 | ||
| # So this should return no rows (optimized to EmptyRelation) | ||
|
|
||
| query P | ||
| SELECT ts FROM t1 WHERE date_trunc('year', ts) = timestamp '2024-06-20T14:25:30' ORDER BY ts; | ||
| ---- | ||
|
|
||
| query TT | ||
| EXPLAIN SELECT ts FROM t1 WHERE date_trunc('year', ts) = timestamp '2024-06-20T14:25:30'; | ||
| ---- | ||
| logical_plan EmptyRelation: rows=0 | ||
| physical_plan EmptyExec |
There was a problem hiding this comment.
The expansion of rewrite_with_preimage enables these tests. The preimage function doesn't have access to the predicate operator (e.g. Operator::Eq vs Operator::LtEq); instead, it returns whether udf(literal) == literal, where the udf in this case is date_trunc('year', x).
| /// The expression always evaluates to the specified constant | ||
| /// given that `expr` is within the interval |
There was a problem hiding this comment.
The previous documentation here describes the relationship between `expr` and `interval` as being direct, but the idea shouldn't be that `expr >= lower and expr < upper`. It should be:
`udf(expr) <op> literal` logically implies that `expr` can be directly compared with `interval`.
That is, if `udf(expr) <op> literal` is true, then there is a transformation involving `expr` and `interval` that is logically equivalent.
Specifically:
- if `floor(x) < 8`, then `preimage` should return `interval: [8, 9)` such that the expression can be rewritten to `x < 8`.
- if `floor(x) < 8.3`, then `preimage` should return `interval: [8, 9)` such that the expression can be rewritten to `x < 9`.
Notice that both expressions yield the same interval because both 8 and 8.3 are literals in that range, and have equivalent outputs (floor(8) == floor(8.3) == { for y in [8, 9): floor(y) }).
Then, rewrite_with_preimage must accommodate the predicate operator (< in this case) to correctly transform the expression using the preimage interval and a boundary condition (is_boundary = floor(y) == y).
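The two `floor` cases can be checked with a short sketch. It is an illustration only: `FloorPreimage`, `floor_preimage`, and `lt_threshold` are hypothetical names, and the struct is not the `PreimageResult` type from the PR.

```rust
/// Hypothetical preimage result for `floor`: the half-open interval
/// [lower, upper) plus a flag telling whether the literal sits on the boundary.
#[derive(Debug, PartialEq)]
struct FloorPreimage {
    lower: f64,
    upper: f64,
    is_boundary: bool,
}

/// Both 8.0 and 8.3 map to [8, 9); only 8.0 is on the boundary.
fn floor_preimage(literal: f64) -> FloorPreimage {
    let lower = literal.floor();
    FloorPreimage {
        lower,
        upper: lower + 1.0,
        is_boundary: lower == literal,
    }
}

/// Returns the threshold `t` such that `floor(x) < literal` is equivalent
/// to `x < t`: the lower bound on a boundary, the upper bound otherwise.
fn lt_threshold(literal: f64) -> f64 {
    let preimage = floor_preimage(literal);
    if preimage.is_boundary {
        preimage.lower
    } else {
        preimage.upper
    }
}
```

Here `lt_threshold(8.0)` gives `8.0` and `lt_threshold(8.3)` gives `9.0`, so the same interval leads to different rewrites depending on the boundary flag. Other operators would need their own choice of bound, which this sketch does not cover.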
There was a problem hiding this comment.
I don't fully understand this logic
For the example floor(x) < 8.3 I would expect there to be no preimage as defined here -- specifically there is no range of inputs for which floor(x) evaluates to 8.3
I agree that it is a valid simplification to rewrite floor(x) < 8.3 to x < 9.0, but it seems different than "preimage" 🤔
Maybe we just need to give it a different name (maybe that is what you have tried to do with is_boundary)
There was a problem hiding this comment.
BTW I checked with datafusion-cli and the floor(x) < 8.3 case is not optimized today (the preimage is not applied here)
> create OR REPLACE table foo(x float) as values (1.0), (8.0), (9.0);
0 row(s) fetched.
Elapsed 0.002 seconds.
> select * from foo where floor(x) < 8.3;
+-----+
| x |
+-----+
| 1.0 |
| 8.0 |
+-----+
2 row(s) fetched.
> explain select * from foo where floor(x) < 8.3;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ FilterExec │ |
| | │ -------------------- │ |
| | │ predicate: │ |
| | │ CAST(floor(x) AS Float64) │ |
| | │ < 8.3 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ bytes: 112 │ |
| | │ format: memory │ |
| | │ rows: 1 │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.008 seconds.
> explain format indent select * from foo where floor(x) < 8.3;
+---------------+------------------------------------------------------+
| plan_type | plan |
+---------------+------------------------------------------------------+
| logical_plan | Filter: CAST(floor(foo.x) AS Float64) < Float64(8.3) |
| | TableScan: foo projection=[x] |
| physical_plan | FilterExec: CAST(floor(x@0) AS Float64) < 8.3 |
| | DataSourceExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.|
@sdf-jkl I added a few comments to help you review the changes to I think the previous implementation could only support: but now it can support: |
|
Thanks @drin, I'll take a look tomorrow. |
This implementation leverages `general_date_trunc` to truncate the preimage input value and then uses the input granularity to increment the truncated datetime by 1 unit.
This adds sql logic test for date_trunc preimage for test coverage
This is to fix `DateTime<Tz>` being considered to include HTML rather than a templated type. Also improved the phrasing of the comment
This expands the predicate operator handling for rewrite_with_preimage to accommodate boundary cases
These updates are to accommodate the new `is_boundary` condition in `PreimageResult`
This accommodates some review feedback and also fixes some edge cases involving boundary conditions for preimage
This accommodates feedback about:
- Abstracting a closure from within `preimage`
- Reusing `valid_with_time`
- Possibly producing empty intervals (lower == upper)
This also adds a unit test to accommodate some cases that are actually impossible except from direct invocation.
938a21f to
860028c
Compare
|
I guess the first question is: do we really want to support edge cases like Why not fall back to not using preimage, like the One argument for making the change I can think of is:
On the other hand:
|
|
Well, The extra complexity is:
and the optimization potentially:
Considering the complexity of a time type vs timestamp type amplified by the date trunc granularity, I think the extra complexity cost is basically 0 (for Related:
|
|
I'm convinced. We could split the PR into two smaller ones to make review easier:
Let me know if you think that's a good idea. |
|
that's maybe a question for a committer? I don't mind it, but if we split it there will be a dependency. Fortunately, expanding the preimage framework is teeny tiny, but it does mean double the review plus an additional merge. I have no strong opinions either way, except that this one feels just about done already |
|
From my experience waiting for PR reviews, it’s often easier to review several smaller PRs than one large one. I ended up splitting my original You could keep this one as is and add a new small one with the |
|
I hope to find time today to give this a more careful review |
alamb
left a comment
There was a problem hiding this comment.
After reading this PR and the conversations on it, I think it might help to break this down into smaller parts.
- Cases where we can clearly apply the (existing) preimage rewrite, e.g. `date_trunc(date_col, 'month') = '2025-01-01'`, where there is a range of `date_col` that always evaluates to `2025-01-01`
- Other potential simplifications that don't clearly fall into the "preimage" category (e.g. `date_trunc(date_col, 'month') = '2025-01-02'`, which will never be true)
I actually think the first one will be fairly straightforward and a clear win, though it will miss some potential predicates like `date_trunc(date_col, 'month') < '2025-01-02'`
As you have explored in this PR, the existing API for preimage is not sufficient for the second type of rewrite, and I think adding a new API would be easier to reason about in a follow-on PR
| /// 2. `=` and `!=` operators: | ||
| /// if `Some(false)`, expression rewrite can use constant (false and true, respectively) | ||
| /// | ||
| /// if is_boundary is `None`, then the boundary condition never applies. |
There was a problem hiding this comment.
While trying to understand this, I wonder if it might be easier to express by instead of adding a field to Range we could instead add a new variant. Something like this:
enum PreimageResult {
    /// ...
    Range { expr: Expr, interval: Box<Interval> },
    /// The original expression UDF(lit) = lit
    /// 1. `<` and `>=` operators:
    ///    if `Some(false)`, expression rewrite should use `interval.upper`
    /// 2. `=` and `!=` operators:
    ///    if `Some(false)`, expression rewrite can use constant (false and true, respectively)
    Boundary { expr: Expr, interval: Box<Interval> },
}
okay, I can do that. just so we're clear, the first case can only capture when the right-hand constant is aligned. I agree it's a clear win, but continuing with the date_trunc('month', ...) example, it's applicable to 1 value for every 30 values, and that's not including time and timezone offsets. As far as I can tell, date_trunc truncates time stamps to UTC so any other timezone will only get the optimization if it's pre-normalized to UTC midnight. |
I am not sure why it is not applicable to timezone offsets.
In my mind, writing
date_trunc(col, 'month') = 2025-12-01
is the natural way to write queries that bucket dates by month (i.e. that compare the date_trunc result to a date/time at the granularity of the truncation).
The expression
date_trunc(col, 'month') = 2025-12-03 -- <-- A value that is not a natural boundary
doesn't make sense to me at all (as that will be FALSE for all values).
The expression
date_trunc(col, 'month') < 2025-12-03
makes slightly more sense as it will actually pass some rows, but I would still be surprised if this were a common expression, as the constant doesn't align to the truncation granularity. |
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
Originally, this attempted to implement a custom optimizer rule in the datafusion expression simplifier. Now, this has been updated to work within the new preimage framework rather than being implemented directly in the expression simplifier.
Which issue does this PR close?
Closes #18319.
Rationale for this change
To transform binary expressions that compare `date_trunc` with a constant value into a form that can be better utilized (improved performance).
For Bauplan, we can see the following (approximate average over a handful of runs):
Q1:
Q2:
What changes are included in this PR?
A few additional support functions and additional match arms in the simplifier match expression.
Are these changes tested?
Our custom rule has tests of the expression transformations and for correct evaluation results. These will be added to the PR after the implementation is in approximately good shape.
Are there any user-facing changes?
Better performance and an occasionally confusing explain plan. In short, a `date_trunc('month', col) = '2025-12-03'::DATE` will always be false (because the truncation result can never be a non-truncated value), which may produce an unexpected expression (`false`).
Explain plan details below (may be overkill but it was fun to figure out):
Initial query:
After simplify_expressions:
Before and after `date_trunc_optimizer` (our custom rule):