Skip to content

implement preimage for date_trunc#18648

Draft
drin wants to merge 8 commits intoapache:mainfrom
drin:octalene.feat-optimize-datetrunc
Draft

implement preimage for date_trunc#18648
drin wants to merge 8 commits intoapache:mainfrom
drin:octalene.feat-optimize-datetrunc

Conversation

@drin
Copy link
Contributor

@drin drin commented Nov 12, 2025

Originally, this attempted to implement a custom optimizer rule in the datafusion expression simplifier. Now, this has been updated to work within the new preimage framework rather than being implemented directly in the expression simplifier.

Which issue does this PR close?

Closes #18319.

Rationale for this change

To transform binary expressions that compare date_trunc with a constant value into a form that can be better utilized (improved performance).

For Bauplan, we can see the following (approximate average over a handful of runs):

Q1:

SELECT PULocationID, trip_miles, tips
  FROM taxi_fhvhv
 WHERE date_trunc('month', pickup_datetime) <= '2025-01-08'::DATE

Q2:

SELECT PULocationID, trip_miles, tips
  FROM taxi_fhvhv
 WHERE pickup_datetime < date_trunc('month', '2025-02-08'::DATE)
Query Time (s) Options
Q1 ~3 no cache, optimization enabled
Q1 ~35 no cache, optimization disabled
Q2 ~3 no cache, optimization enabled
Q2 ~3 no cache, optimization disabled

What changes are included in this PR?

A few additional support functions and additional match arms in the simplifier match expression.

Are these changes tested?

Our custom rule has tests of the expression transformations and for correct evaluation results. These will be added to the PR after the implementation is in approximately good shape.

Are there any user-facing changes?

Better performance and occasionally confusing explain plan. In short, a date_trunc('month', col) = '2025-12-03'::DATE will always be false (because the truncation result can never be a non-truncated value), which may produce an unexpected expression (false).

Explain plan details below (may be overkill but it was fun to figure out):

Initial query:

SELECT  PULocationID
           ,pickup_datetime
      FROM taxi_view_2025
     WHERE date_trunc('month', pickup_datetime) = '2025-12-03'

After simplify_expressions:

logical_plan after simplify_expressions                    | Projection: taxi_view_2025.PULocationID, taxi_view_2025.pickup_datetime                                                                                            |
|                                                            |   Filter: date_trunc(Utf8("month"), CAST(taxi_view_2025.pickup_datetime AS Timestamp(Nanosecond, None))) = TimestampNanosecond(1764720000000000000, None)          |
|                                                            |     TableScan: taxi_view_2025

Before and after date_trunc_optimizer (our custom rule):

logical_plan after optimize_projections                    | Filter: date_trunc(Utf8("month"), CAST(taxi_view_2025.pickup_datetime AS Timestamp(Nanosecond, None))) = TimestampNanosecond(1764720000000000000, None)            |
|                                                            |   TableScan: taxi_view_2025 projection=[PULocationID, pickup_datetime]                                                                                             |
| logical_plan after date_trunc_optimizer                    | Filter: Boolean(false)                                                                                                                                             |
|                                                            |   TableScan: taxi_view_2025 projection=[PULocationID, pickup_datetime]

@github-actions github-actions bot added the optimizer Optimizer rules label Nov 12, 2025
@drin drin marked this pull request as draft November 12, 2025 15:18
@drin
Copy link
Contributor Author

drin commented Nov 12, 2025

@UBarney something I think could use some definite improvement in handling of the source expressions along transformation and failure paths (https://github.com/drin/datafusion/blob/8cba13ceafcf0df047e753f20bf54ad85a02f019/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L690-L720).

I try to avoid moving until I know what to return (transformed expression or source expression), but I don't know rust/datafusion well enough to know best practices for when to clone and when to move and how to avoid either until necessary.

@drin drin force-pushed the octalene.feat-optimize-datetrunc branch from 8cba13c to e4b2cf5 Compare November 12, 2025 15:31
@github-actions
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jan 12, 2026
@drin
Copy link
Contributor Author

drin commented Jan 13, 2026

I will try to push this forward this week

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Jan 13, 2026
@alamb
Copy link
Contributor

alamb commented Jan 27, 2026

In theory we should be able to use the API added in

@drin drin force-pushed the octalene.feat-optimize-datetrunc branch from e4b2cf5 to 95cb436 Compare February 23, 2026 20:00
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation and removed optimizer Optimizer rules labels Feb 23, 2026
@drin drin marked this pull request as ready for review February 23, 2026 20:01
@drin
Copy link
Contributor Author

drin commented Feb 23, 2026

I tried to reuse existing date_trunc functions where possible and match the structure in date_part::preimage. The calendar duration and sql logic tests were added with the help of an LLM. I reviewed the calendar duration so that should be cohesive with my overall design, but the sql logic tests I have no context in and I would appreciate some extra review (and advice) in that area.

Thanks!

@drin drin changed the title decomposed date_trunc optimization into expr simplifier implement preimage for date_trunc Feb 23, 2026
@sdf-jkl
Copy link
Contributor

sdf-jkl commented Feb 23, 2026

Hey @drin, I'll take a look tomorrow.

Copy link
Contributor

@sdf-jkl sdf-jkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drin thanks for the wait.

I left some suggestions.

# Test YEAR granularity - basic comparisons

query P
SELECT ts FROM t1 WHERE date_trunc('year', ts) = timestamp '2024-01-01T00:00:00' ORDER BY ts;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does your comment about this still hold?

SELECT  PULocationID
           ,pickup_datetime
      FROM taxi_view_2025
     WHERE date_trunc('month', pickup_datetime) = '2025-12-03'

Without preimage it would always return False, but with preimage, we create an interval [2025-12-01, 2026-01-01) and simplification rule returns col >= 2025-12-01 and col < 2026-01-01 we could get false positives, because 2025-12-03 falls into that interval.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd need to change the behavior to cover this.

  • One way would be by having a guard checking the eq Operator specifically for date_trunc preimage and return PreimageResult::None if rhs != date_trunc(granularity, rhs). But preimage is Operator agnostic.
  • Another way is by having an optimization rule to do check this rhs != date_trunc(granularity, rhs) and return False for the whole column. But that's adding a rule just for one udf.
  • Another way is to only let date_trunc preimage work with with rhs = date_trunc(granularity, rhs), but this requires the user to write the date in the right way if they want the query to run faster.
    For example:
++WHERE date_trunc('month', pickup_datetime) = '2025-12-01'
--WHERE date_trunc('month', pickup_datetime) = '2025-12-03'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was actually covered for floor preimage impl by @devanshu0987 in #20059
Check here:
https://github.com/apache/datafusion/pull/20059/changes#diff-077176fcf22cb36a0a51631a43739f5f015f46305be4f49142a450e25b152b84R280-R303
Floor is very similar to date_trunc, so we could replicate the behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why None is returned in the case that a clear value is known. For = '2025-12-03', the value should be False. I assumed that None basically means that the preimage could not be determined because something was invalid (an error). If you use None for valid cases, how do you distinguish invalid cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I have a clarifying question:

What should the interval be in non-obvious cases? What happens if the Interval is None (it seems rewrite_with_preimage is only called on an actual Interval)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no valid interval, there is no preimage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but, from those notes I also realized I was handling the interval wrong in some cases)

Copy link
Contributor Author

@drin drin Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But these are the notes relevant to this specific case (predicate operator is =):

// Special condition:
// if date_trunc(part, const_rhs) != const_rhs, this is always false
// For this operator, truncation means that we check if the column is INSIDE of a range.
// lhs(=) --> column >= date_trunc(part, const_rhs) AND column < next_interval(part, const_rhs)

date_trunc('month', pickup_datetime) = '2025-12-03' is always false.
date_trunc('month', pickup_datetime) < '2025-12-03' requires an interval to be returned.

Without knowing if the predicate operator is = or <, preimage cannot know whether to return an Interval or None (if None is even the correct return in that case). So preimage must return the interval. But, in rewrite_with_preimage, you can do the appropriate check:

Operator::Eq && lower != <original> => False,
Operator::Eq => and(<check if within interval>),

I'm not sure if this makes sense for intervals from non-truncating functions. I'd have to simmer on that...

Copy link
Contributor

@sdf-jkl sdf-jkl Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it only makes sense for truncating functions. Right now, rewrite_with_preimage is function agnostic and creating the expression only based on the Operator and Interval.

Copy link
Contributor Author

@drin drin Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this more and I expanded the rewrite_with_preimage method to accommodate a boundary condition (ca79a3c) and then updated the date_part and floor implementations to accommodate (453ad0b)

.as_literal()
.and_then(|sv| sv.try_as_str().flatten())
.map(part_normalization);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a guard for Type families. col_expr: TimeStamp needs a lit_expr: TimeStamp and the same for Time types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in if col_expr is a TimeStamp type, then lit_expr must also be a TimeStamp type? Why is that the case?

If I have a nanosecond timestamp (time since epoch) and the comparison is a Time type, if I convert both to nanosecond timestamps aren't they still comparable?

Actually, shouldn't this type of validation be upstream of preimage in whatever function decomposes the predicate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They shouldn't be comparable per examples below, but you're right, it should be covered earlier: https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/analyzer/type_coercion.rs

Some examples on types:

where date_trunc(month, timestamp_col) = 12:00::TIME

time doesn't have a month so it wouldn't make sense to compare the two.

where date_trunc(minute, time_col) = 2025-01-01::12:00::TIMESTAMP

Timestamp does have minutes, but a time col still won't have calendar granularity to compare with it.

DateTruncGranularity::Hour => value + MILLIS_PER_HOUR,
DateTruncGranularity::Minute => value + MILLIS_PER_MINUTE,
DateTruncGranularity::Second => value + MILLIS_PER_SECOND,
DateTruncGranularity::Millisecond => value + 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep match arms for granularities finer than rhs?

Suggested change
DateTruncGranularity::Millisecond => value + 1,
DateTruncGranularity::Millisecond => value + 1,
DateTruncGranularity::Microsecond => value + 1,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for incrementing milliseconds. If you increment by 1 when the granularity is microseconds then you've incremented by too much. If you have a timestamp in milliseconds and you're truncating microseconds, you should have no change because your timestamp is too coarse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would returning v not create an empty interval [v, v)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does and I guess I should change it, but the interval semantics doesn't actually matter because most of the time we're only using 1 side of it. but i will change it anyways just for completeness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think interval semantics should be handled here though (I'll fix it at the call site)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I do instead, is I return the same value (no increment) and then I check if the increment call was a no-op (lower == upper), and if so, I print an appropriate error (e.g. "Millisecond too granular for time in Seconds").

@alamb
Copy link
Contributor

alamb commented Feb 26, 2026

I am back working through my review backlog and put this one on my list to check out -- thank you @sdf-jkl and @drin

@drin drin force-pushed the octalene.feat-optimize-datetrunc branch from 0b1bbf1 to 3948017 Compare February 27, 2026 17:32
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Feb 27, 2026
@drin drin force-pushed the octalene.feat-optimize-datetrunc branch 2 times, most recently from ec55abb to d988b58 Compare February 27, 2026 17:55
Comment on lines +5470 to +5482
# Test YEAR granularity with non-aligned literal (2024-06-20 instead of 2024-01-01)
# date_trunc('year', x) can never equal '2024-06-20' because date_trunc always sets month=01, day=01
# So this should return no rows (optimized to EmptyRelation)

query P
SELECT ts FROM t1 WHERE date_trunc('year', ts) = timestamp '2024-06-20T14:25:30' ORDER BY ts;
----

query TT
EXPLAIN SELECT ts FROM t1 WHERE date_trunc('year', ts) = timestamp '2024-06-20T14:25:30';
----
logical_plan EmptyRelation: rows=0
physical_plan EmptyExec
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expansion of rewrite_with_preimage enables these tests where the preimage function doesn't have access to the predicate operator (e.g. Operator::Eq vs Operator::LtEq) but instead returns if udf(literal) == literal where the udf in this case is date_trunc('year', x).

Comment on lines -26 to -27
/// The expression always evaluates to the specified constant
/// given that `expr` is within the interval
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous documentation here describes the relationship between expr and interval as being direct, but the idea shouldn't be that expr >= lower and expr < upper, It should be:
udf(expr) <op> literal logically implies that expr can be directly compared with interval.

That is, if udf(expr) <op> literal is true, then there is a transformation involving expr and interval that is logically equivalent.

Specifically:

  • if floor(x) < 8, then preimage should return interval: [8, 9) such that the expression can be rewritten to x < 8.
  • if floor(x) < 8.3, then preimage should return interval: [8, 9) such that the expression can be rewritten to x < 9.

Notice that both expressions yield the same interval because both 8 and 8.3 are literals in that range, and have equivalent outputs (floor(8) == floor(8.3) == { for y in [8, 9): floor(y) }).

Then, rewrite_with_preimage must accommodate the predicate operator (< in this case) to correctly transform the expression using the preimage interval and a boundary condition (is_boundary = floor(y) == y).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this logic

For the example floor(x) < 8.3 I would expect there to be no preimage as defined here -- specifically there is no range of inputs for which floor(x) evaluates to 8.3

I agree that it is valid simplificaition to rewrite floor(x) < 8.3 to x < 9.0, but it seems different than "preimage" 🤔

Maybe we just need to give it a different name (maybe that is what you have tried to do with is_boundary)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I checked with datafusion-cli and the floor(x) < 8.3 case is not optimized today (the preimage is not applied here)

> create OR REPLACE table foo(x float) as values (1.0), (8.0), (9.0);
0 row(s) fetched.
Elapsed 0.002 seconds.

> select * from foo where floor(x) < 8.3;
+-----+
| x   |
+-----+
| 1.0 |
| 8.0 |
+-----+
2 row(s) fetched.

> explain select * from foo where floor(x) < 8.3;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │ CAST(floor(x) AS Float64) │ |
|               | │           < 8.3           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 112        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.008 seconds.

> explain format indent select * from foo where floor(x) < 8.3;
+---------------+------------------------------------------------------+
| plan_type     | plan                                                 |
+---------------+------------------------------------------------------+
| logical_plan  | Filter: CAST(floor(foo.x) AS Float64) < Float64(8.3) |
|               |   TableScan: foo projection=[x]                      |
| physical_plan | FilterExec: CAST(floor(x@0) AS Float64) < 8.3        |
|               |   DataSourceExec: partitions=1, partition_sizes=[1]  |
|               |                                                      |
+---------------+------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

@drin
Copy link
Contributor Author

drin commented Feb 27, 2026

@sdf-jkl I added a few comments to help you review the changes to rewrite_with_preimage and to see a test case that can be accommodated with the change.

I think the previous implementation could only support:
date_trunc('month', '2026-02-27') < '2026-02-01'

but now it can support:
date_trunc('month', '2026-02-27') < '2026-02-23'

@sdf-jkl
Copy link
Contributor

sdf-jkl commented Mar 2, 2026

Thanks @drin, I'll take a look tomorrow.

@sdf-jkl
Copy link
Contributor

sdf-jkl commented Mar 2, 2026

Thanks @drin, I'll take a look tomorrow.

drin added 7 commits March 2, 2026 16:18
This implementation leverages `general_date_trunc` to truncate the
preimage input value and then uses the input granularity to increment
the truncated datetime by 1 unit.
This adds sql logic test for date_trunc preimage for test coverage
This is to fix `DateTime<Tz>` being considered to include HTML rather
than a templated type. Also improved the phrasing of the comment
This expands the predicate operator handling for rewrite_with_preimage
to accommodate boundary cases
These updates are to accommodate the new `is_boundary` condition in
`PreimageResult`
This accommodates some review feedback and also fixes some edge cases
involving boundary conditions for preimage
This accommodates feedback about:
- Abstracting a closure from within `preimage`
- Reusing `valid_with_time`
- Possibly producing empty intervals (lower == upper)

This also adds a unit test to accommodate some cases that are actually
impossible except from direct invocation.
@drin drin force-pushed the octalene.feat-optimize-datetrunc branch from 938a21f to 860028c Compare March 3, 2026 00:18
@sdf-jkl
Copy link
Contributor

sdf-jkl commented Mar 3, 2026

I guess the first question is: do we really want to support edge cases like date_trunc('month', '2026-02-27') < '2026-02-23'?

Why not fall back to not using preimage, like the floor function? What arguments in favor of changing the API?

One argument for making the change I can think of is:

  • Right now, it's up to the user to know not to use dates like 2026-02-XX (where XX > 01). Leaving that responsibility to the user is not good UX. I'd be confused if some queries where the RHS day is 01 were faster than otherwise identical queries with a different day value.

On the other hand:

  • API change and extra complexity (not that big of a change and doesn't add much complexity to the interval calculation)

@drin
Copy link
Contributor Author

drin commented Mar 3, 2026

Well, date_trunc('month', date_col) < '2026-02-23' can be optimized by this framework, and if you fallback on not using preimage then you can't optimize it at all. The literal, '2026-02-23', could come from a variable; then, it's not an edge case at all, it's just a typical engineering scenario: df = ctx.sql(f"SELECT date_trunc('month', date_col) < {date_literal}"). It seems like an oversight to me to require a user to normalize date_literal in this case as a prerequisite to getting faster query performance. It's also likely to lead to tribal knowledge: "if you are using x function, then you can get better optimization if you do y to align it to a boundary."

The extra complexity is:

  1. An extra comparison (is_boundary = value == lower)
  2. An extra bool attribute in the PreimageResult::Range enum
  3. A few extra match arms (Operator::Eq && is_boundary)

and the optimization potentially:

  1. removes the predicate altogether
  2. removes the operator (and potentially the plan sub-tree)

Considering the complexity of a time type vs timestamp type amplified by the date trunc granularity, I think the extra complexity cost is basically 0 (for date_part). So, to me it's a clear that it's worth the cost.

Related:

@sdf-jkl
Copy link
Contributor

sdf-jkl commented Mar 3, 2026

I'm convinced. We could split the PR into two smaller ones to make review easier:

  1. Apply changes to the preimage api and the functions that already implement it.
  2. Implement preimage for date_trunc

Let me know if you think that's a good idea.

@drin
Copy link
Contributor Author

drin commented Mar 4, 2026

that's maybe a question for a committer? I don't mind it, but if we split it there will be a dependency. Fortunately, expanding the preimage framework is teeny tiny, but it does mean double the review plus an additional merge.

I have no strong opinions either way, except that this one feels just about done already

@sdf-jkl
Copy link
Contributor

sdf-jkl commented Mar 4, 2026

From my experience waiting for PR reviews, it’s often easier to review several smaller PRs than one large one.

I ended up splitting my original date_part preimage PR into two.

You could keep this one as is and add a new small one with the preimage API update.

@alamb
Copy link
Contributor

alamb commented Mar 5, 2026

I hope to find time today to give this a more careful review

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading this PR and the conversations on it, I think it might help to break this down into smaller parts.

  1. Cases where we can clearly apply the (existing) preimage rewrite -- e.g. date_trunc(date_col, 'month') = '2025-01-01' where there is a range of date_col that always evaluates to 2025-01-01
  2. Other potential simplifications that don't clearly fall into the "preimage" category (e.g. date_trunc(date_col, 'month') = '2025-01-02' which will never be true)

I actually think the first one will be fairly straightforward and a clear win, though it will miss some potential predicates like date_trunc(date_col, 'month') < '2025-01-02'

As you have explored in this PR, the existing API for preimage is not sufficient for the second type of rewrite, and I think adding a new API would be eaiser to reason about in a follow on PR

Comment on lines -26 to -27
/// The expression always evaluates to the specified constant
/// given that `expr` is within the interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this logic

For the example floor(x) < 8.3 I would expect there to be no preimage as defined here -- specifically there is no range of inputs for which floor(x) evaluates to 8.3

I agree that it is valid simplificaition to rewrite floor(x) < 8.3 to x < 9.0, but it seems different than "preimage" 🤔

Maybe we just need to give it a different name (maybe that is what you have tried to do with is_boundary)

/// 2. `=` and `!=` operators:
/// if `Some(false)`, expression rewrite can use constant (false and true, respectively)
///
/// if is_boundary is `None`, then the boundary condition never applies.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While trying to understand this, I wonder if it might be easier to express by instead of adding a field to Range we could instead add a new variant. Something like this:

enum PreimageResult {
  /// ...
  Range { expr: Expr, interval: Box<Interval> }, 
  // The original expression UDF(lit) = lit
  // 1. `<` and `>=` operators:
  ///    if `Some(false)`, expression rewrite should use `interval.upper`
  /// 2. `=` and `!=` operators:
  ///    if `Some(false)`, expression rewrite can use constant (false and true, respectively)
  Boundary { expr: Expr, interval: Box<Interval> }, 
}

Comment on lines -26 to -27
/// The expression always evaluates to the specified constant
/// given that `expr` is within the interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I checked with datafusion-cli and the floor(x) < 8.3 case is not optimized today (the preimage is not applied here)

> create OR REPLACE table foo(x float) as values (1.0), (8.0), (9.0);
0 row(s) fetched.
Elapsed 0.002 seconds.

> select * from foo where floor(x) < 8.3;
+-----+
| x   |
+-----+
| 1.0 |
| 8.0 |
+-----+
2 row(s) fetched.

> explain select * from foo where floor(x) < 8.3;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │ CAST(floor(x) AS Float64) │ |
|               | │           < 8.3           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 112        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.008 seconds.

> explain format indent select * from foo where floor(x) < 8.3;
+---------------+------------------------------------------------------+
| plan_type     | plan                                                 |
+---------------+------------------------------------------------------+
| logical_plan  | Filter: CAST(floor(foo.x) AS Float64) < Float64(8.3) |
|               |   TableScan: foo projection=[x]                      |
| physical_plan | FilterExec: CAST(floor(x@0) AS Float64) < 8.3        |
|               |   DataSourceExec: partitions=1, partition_sizes=[1]  |
|               |                                                      |
+---------------+------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

@alamb
Copy link
Contributor

alamb commented Mar 5, 2026

Thank you so much for working on this @drin and @sdf-jkl

@drin
Copy link
Contributor Author

drin commented Mar 5, 2026

okay, I can do that.

just so we're clear, the first case can only capture when the right-hand constant is aligned. I agree it's a clear win, but continuing with the date_trunc('month', ...) example, it's applicable to 1 value for every 30 values, and that's not including time and timezone offsets. As far as I can tell, date_trunc truncates time stamps to UTC so any other timezone will only get the optimization if it's pre-normalized to UTC midnight.

@alamb
Copy link
Contributor

alamb commented Mar 5, 2026

just so we're clear, the first case can only capture when the right-hand constant is aligned. I agree it's a clear win, but continuing with the date_trunc('month', ...) example, it's applicable to 1 value for every 30 values, and that's not including time and timezone offsets.

I am not sure why it is not applicable to timezone offsets

In my mind writing

date_trunc(col, 'month') = 2025-12-01

is the natural way to write queries that look to bucket date by month (aka to compare date trunc to a date/time with the granulatity of the truncation).

the expression

date_trunc(col, 'month') = 2025-12-03 -- <-- A value that is not a natural boundary

Doesn't make sense to me at all (as that will be FALSE for all values)

The expression

date_trunc(col, 'month') < 2025-12-03    

Makes slightly more sense as it will actually pass some rows, but I still would be surprised if this was a common epression as the constant doesn't align to the truncation granularity

@alamb
Copy link
Contributor

alamb commented Mar 5, 2026

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@alamb alamb marked this pull request as draft March 5, 2026 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize the evaluation of DATE_TRUNC(<col>) == <constant>) when pushed down

3 participants