Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 19 additions & 3 deletions datafusion/expr/src/preimage.rs
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,23 @@ use crate::Expr;
pub enum PreimageResult {
/// No preimage exists for the specified value
None,
/// The expression always evaluates to the specified constant
/// given that `expr` is within the interval
Comment on lines -26 to -27
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous documentation here describes the relationship between expr and interval as being direct, but the idea shouldn't be that expr >= lower and expr < upper, It should be:
udf(expr) <op> literal logically implies that expr can be directly compared with interval.

That is, if udf(expr) <op> literal is true, then there is a transformation involving expr and interval that is logically equivalent.

Specifically:

  • if floor(x) < 8, then preimage should return interval: [8, 9) such that the expression can be rewritten to x < 8.
  • if floor(x) < 8.3, then preimage should return interval: [8, 9) such that the expression can be rewritten to x < 9.

Notice that both expressions yield the same interval because both 8 and 8.3 are literals in that range, and have equivalent outputs (floor(8) == floor(8.3) == { for y in [8, 9): floor(y) }).

Then, rewrite_with_preimage must accommodate the predicate operator (< in this case) to correctly transform the expression using the preimage interval and a boundary condition (is_boundary = floor(y) == y).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this logic

For the example floor(x) < 8.3 I would expect there to be no preimage as defined here -- specifically there is no range of inputs for which floor(x) evaluates to 8.3

I agree that it is valid simplificaition to rewrite floor(x) < 8.3 to x < 9.0, but it seems different than "preimage" 🤔

Maybe we just need to give it a different name (maybe that is what you have tried to do with is_boundary)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I checked with datafusion-cli and the floor(x) < 8.3 case is not optimized today (the preimage is not applied here)

> create OR REPLACE table foo(x float) as values (1.0), (8.0), (9.0);
0 row(s) fetched.
Elapsed 0.002 seconds.

> select * from foo where floor(x) < 8.3;
+-----+
| x   |
+-----+
| 1.0 |
| 8.0 |
+-----+
2 row(s) fetched.

> explain select * from foo where floor(x) < 8.3;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │ CAST(floor(x) AS Float64) │ |
|               | │           < 8.3           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 112        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.008 seconds.

> explain format indent select * from foo where floor(x) < 8.3;
+---------------+------------------------------------------------------+
| plan_type     | plan                                                 |
+---------------+------------------------------------------------------+
| logical_plan  | Filter: CAST(floor(foo.x) AS Float64) < Float64(8.3) |
|               |   TableScan: foo projection=[x]                      |
| physical_plan | FilterExec: CAST(floor(x@0) AS Float64) < 8.3        |
|               |   DataSourceExec: partitions=1, partition_sizes=[1]  |
|               |                                                      |
+---------------+------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

Range { expr: Expr, interval: Box<Interval> },
/// For some UDF, a `preimage` implementation determines that:
/// the result `udf_result` in `udf_result = UDF(expr)`
/// is equivalent to `udf_result = UDF(i)` for any `i` in `interval`.
///
/// Then, `is_boundary` indicates a boundary condition where:
/// the original expression `UDF(expr)` is compared to a value `lit` where:
/// `UDF(lit) == lit`
/// This condition is important for two scenarios:
/// 1. `<` and `>=` operators:
/// if `Some(false)`, expression rewrite should use `interval.upper`
/// 2. `=` and `!=` operators:
/// if `Some(false)`, expression rewrite can use constant (false and true, respectively)
///
/// if is_boundary is `None`, then the boundary condition never applies.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While trying to understand this, I wonder if it might be easier to express by instead of adding a field to Range we could instead add a new variant. Something like this:

enum PreimageResult {
  /// ...
  Range { expr: Expr, interval: Box<Interval> }, 
  // The original expression UDF(lit) = lit
  // 1. `<` and `>=` operators:
  ///    if `Some(false)`, expression rewrite should use `interval.upper`
  /// 2. `=` and `!=` operators:
  ///    if `Some(false)`, expression rewrite can use constant (false and true, respectively)
  Boundary { expr: Expr, interval: Box<Interval> }, 
}

Range {
expr: Expr,
interval: Box<Interval>,
is_boundary: Option<bool>,
},
}
1 change: 1 addition & 0 deletions datafusion/functions/src/datetime/date_part.rs
Original file line number Diff line number Diff line change
Expand Up @@ -304,6 +304,7 @@ impl ScalarUDFImpl for DatePartFunc {
Ok(PreimageResult::Range {
expr: col_expr.clone(),
interval,
is_boundary: None,
})
}

Expand Down
Loading
Loading