HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST #6293
thomasrebele wants to merge 14 commits into apache:master
Conversation
zabetak left a comment:
Thanks for the PR @thomasrebele , the proposal is very promising.
One general question that came to mind while reviewing the PR is whether the CAST removal is relevant only for range predicates and histograms, or whether it can have a positive impact on other expressions as well. For example, is there any benefit in attempting to remove a CAST from the following expressions?
    IS NOT NULL(CAST($1):BIGINT)
    =(CAST($1):DOUBLE, 1)
    IN(CAST($1):TINYINT, 10, 20, 30)
    double min;
    double max;
    switch (type.toLowerCase()) {
This class is mostly using Calcite APIs, so since we have the SqlTypeName readily available, wouldn't it be better to use that instead?
In addition there is org.apache.calcite.sql.type.SqlTypeName#getLimit which might be relevant and could potentially replace this switch statement.
We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.
Ok to use the switch then but let's base it on SqlTypeName.
If it makes sense to handle FLOAT/DOUBLE in SqlTypeName#getLimit then it would be a good idea to log a CALCITE JIRA ticket.
I've refactored the switch and verified that the getLimit call yields the same min/max values.
I don't know whether there's a limit for FLOAT/DOUBLE, so I've created CALCITE-7419 for the discussion.
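For illustration, here is a plain-Java sketch of what such a per-type bounds switch looks like. The class and method names are hypothetical; the actual patch bases the switch on Calcite's SqlTypeName and derives the integer bounds from SqlTypeName#getLimit, which is not reproduced here.

```java
// Hypothetical sketch of per-type min/max bounds as doubles.
// In the actual patch, integer bounds come from SqlTypeName#getLimit;
// FLOAT/DOUBLE are hand-coded because getLimit throws for them
// (see CALCITE-7419).
public class TypeRangeSketch {
  public static double[] rangeOf(String typeName) {
    switch (typeName.toUpperCase()) {
      case "TINYINT":  return new double[] {Byte.MIN_VALUE, Byte.MAX_VALUE};
      case "SMALLINT": return new double[] {Short.MIN_VALUE, Short.MAX_VALUE};
      case "INTEGER":  return new double[] {Integer.MIN_VALUE, Integer.MAX_VALUE};
      case "BIGINT":   return new double[] {Long.MIN_VALUE, Long.MAX_VALUE};
      case "FLOAT":    return new double[] {-Float.MAX_VALUE, Float.MAX_VALUE};
      case "DOUBLE":   return new double[] {-Double.MAX_VALUE, Double.MAX_VALUE};
      default:
        throw new IllegalArgumentException("Unsupported type: " + typeName);
    }
  }
}
```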
thomasrebele left a comment:
Thank you for your review, @zabetak! Removing the cast from other expressions might be beneficial for the selectivity estimation. I would consider these improvements as out-of-scope for this PR, though.
About the first example, IS NOT NULL(CAST($1):BIGINT): CALCITE-5769 improved RexSimplify to remove the cast from such expressions. I assume that the filters arriving at FilterSelectivityEstimator should already have superfluous casts removed. Otherwise, the expression could be converted to a range predicate for the purpose of selectivity estimation. I would leave this idea for other tickets.
Force-pushed from f80c231 to 1e9fd2b.
The CI fails because of …
The removeCastIfPossible method was doing three things: 1) checking if a cast can be removed based on column stats, 2) removing the cast if possible, 3) adjusting the boundaries in case of DECIMAL casts. After the refactoring, the three actions are decoupled and each is performed individually. This leads to smaller, more self-contained methods that are easier to follow.
No need to invent new APIs when an equivalent exists and is used in other places in Hive/Calcite.
Hey @thomasrebele, I was going over the PR and did some refactoring to help me understand some parts of the code better and hopefully improve readability a bit. My refactoring work can be found in the https://github.com/zabetak/hive/tree/HIVE-29424-r1 branch. However, after replacing the …
    checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.DATE));
    checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.TIMESTAMP));
Why is checkTimeFieldOnIntraDayTimestamps not relevant here?
Indeed, they should be here, but not on testComputeRangePredicateSelectivityDateWithCast. Fixed.
zabetak left a comment:
Overall, I like the new changes. I just have a question about a potentially missing check for DECIMAL types and we are good to go.
    typeBoundaries =
        getRangeOfDecimalType(expr.getType(), rangeBoundaries.lowerBoundType(), rangeBoundaries.upperBoundType());
    rangeBoundaries = adjustRangeToDecimalType(rangeBoundaries, expr.getType(), typeBoundaries);
Aren't we missing a conditional here so that it runs only for DECIMAL type? Do we have adequate test coverage?
I investigated this, and it turns out that I missed some related cases: CAST(decimal_field TO integer_type). These cases also need some adjustment. I've implemented these test cases and the code necessary to deal with them. The two places where isRemovableCast is called are now quite similar.
zabetak left a comment:
Mainly questions plus small suggestions and a few nits.
    case FLOAT, DOUBLE, DECIMAL, TIMESTAMP, DATE:
      return true;
I don't understand why we don't need additional checks to remove a cast when the source data type is one of these. Could you please add a comment explaining why it is ok to return true and exit early in this case?
The cast from these types does not introduce modulo-like behavior. I'll add a comment.
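To make the concern concrete, here is a small stand-alone Java demo (not from the patch) of the modulo-like wraparound that narrowing integer casts exhibit, which is exactly what the extra checks guard against:

```java
// Demo (not from the patch): narrowing integer casts wrap modulo 2^n,
// so a cast to a narrower integer type can map out-of-range values back
// into range and distort a range predicate. Casts from FLOAT, DOUBLE,
// DECIMAL, TIMESTAMP, and DATE do not wrap this way.
public class NarrowingCastDemo {
  public static byte castToTinyInt(int v) {
    return (byte) v; // Java narrowing conversion keeps only the low 8 bits
  }
}
```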
    }

    // If the source type is completely within the target type, the cast is lossless
    Range<Float> targetRange = getRangeOfType(cast.getType(), BoundType.CLOSED, BoundType.CLOSED);
Is there anything preventing the cast type to be a STRING, CHAR, VARCHAR, or other unsupported types? We want to avoid hitting an IllegalStateException in this case.
This is caught by the switch statement just above. I'll refactor it to make it clearer.
    if (sourceRange.equals(targetRange.intersection(sourceRange))) {
      return true;
    }
The intersection method throws an IllegalArgumentException if the ranges are disjoint. Is it guaranteed that the ranges are always connected at this stage?
Indeed, I did not expect that the intersection would throw an exception. I'll replace it with the encloses method, which does not throw.
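To illustrate the difference with a minimal stand-alone analogue of Guava's Range (a hypothetical class, not the patch code): intersection is a partial operation that throws on disjoint inputs, while an encloses-style containment check is total and simply returns false.

```java
// Minimal closed-interval analogue of Guava's Range (illustration only).
public class Interval {
  final double lo, hi;
  public Interval(double lo, double hi) { this.lo = lo; this.hi = hi; }

  // Total: never throws, simply answers containment.
  public boolean encloses(Interval o) { return lo <= o.lo && o.hi <= hi; }

  // Partial: undefined (throws) when the intervals are disjoint,
  // mirroring Guava's Range#intersection.
  public Interval intersection(Interval o) {
    if (o.hi < lo || hi < o.lo) {
      throw new IllegalArgumentException("disjoint intervals");
    }
    return new Interval(Math.max(lo, o.lo), Math.min(hi, o.hi));
  }
}
```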
    return false;
    }

    // If the source type is completely within the target type, the cast is lossless
Do we need stats to determine if a cast is lossless? If not then we could possibly move this logic before checking if column stats are empty.
This check only applies when casting an integer type to another integer type, so we cannot move it before the check whether the column statistics are empty.
    case BIGINT, DATE, TIMESTAMP:
      return Range.closed(-9.223372E18f, 9.223372E18f);
    case DECIMAL:
      return getRangeOfDecimalType(type, lowerBound, upperBound);
For every other type we return a closed Range no matter the input bound arguments. Why can't we do the same for DECIMAL?
The range depends on the precision and scale of the decimal type. Additionally, values are rounded when CASTing them to decimal, see comment at FilterSelectivityEstimator#getRangeOfDecimalType.
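As a sketch of how the range depends on precision and scale (plain Java with a hypothetical helper; the real logic lives in FilterSelectivityEstimator#getRangeOfDecimalType): a DECIMAL(p, s) can hold values up to (10^p - 1) / 10^s.

```java
import java.math.BigDecimal;

// Sketch (hypothetical helper): the largest value representable by
// DECIMAL(precision, scale) is (10^precision - 1) scaled down by 10^scale,
// e.g. DECIMAL(2, 1) covers [-9.9, 9.9]. The patch additionally accounts
// for rounding at the boundaries when values are cast to DECIMAL.
public class DecimalRangeSketch {
  public static BigDecimal maxOf(int precision, int scale) {
    return BigDecimal.TEN.pow(precision)
        .subtract(BigDecimal.ONE)
        .movePointLeft(scale);
  }
}
```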
    RexNode cast = cast("f_numeric", TINYINT);
    // check rounding of positive numbers
    checkBetweenSelectivity(3, universe, total, cast, 0, 10);
    checkBetweenSelectivity(3, universe, total, cast, 0, 10.9f);
This does not seem like a valid BETWEEN expression. I have the impression that we can't compare a float with an int type directly: some kind of type conversion/alignment should be performed, and since we are casting one side to TINYINT, the other side (the literal) should be cast as well, so we can never end up with 10.9f.
I think we should drop these invalid expressions from the tests, since we don't want unrelated failures in the future. It should also be checked whether the code can be simplified once we know that these expressions cannot appear.
Hive executes such queries without problems (e.g., here), so I would just leave those statements as they are.
    checkBetweenSelectivity(3, universe, total, cast, 0, 10.9f);
    checkBetweenSelectivity(4, universe, total, cast, 0, 11);
    checkBetweenSelectivity(4, universe, total, cast, 10, 20);
    checkBetweenSelectivity(1, universe, total, cast, 10.9999f, 20);

    // check rounding of negative numbers
    checkBetweenSelectivity(4, universe, total, cast, -20, -10);
    checkBetweenSelectivity(1, universe, total, cast, -20, -10.9f);
    checkBetweenSelectivity(1, universe, total, cast, -20, -11);
    checkBetweenSelectivity(3, universe, total, cast, -10, 0);
    checkBetweenSelectivity(3, universe, total, cast, -10.9999f, 0);

    @Test
    public void testBetweenWithCastToTinyIntCheckRounding() {
If the comment about the validity of some tests holds, then this test potentially becomes irrelevant or could be merged with testBetweenWithCastToTinyInt.
I specifically designed this test case to cover all the boundary cases, so I would keep them. See also the discussion.
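The boundary cases in these tests follow one idea that can be sketched in plain Java (hypothetical helpers, not the patch code): when the cast expression only takes integer values, a fractional lower bound can be rounded up and a fractional upper bound rounded down without changing the predicate's result.

```java
// Sketch (hypothetical helpers): tighten fractional bounds for an
// integer-valued expression. E.g. expr BETWEEN 0 AND 10.9 selects the
// same rows as expr BETWEEN 0 AND 10, and expr BETWEEN 10.9999 AND 20
// the same rows as expr BETWEEN 11 AND 20.
public class BoundAdjustSketch {
  public static long tightenLower(double lo) { return (long) Math.ceil(lo); }
  public static long tightenUpper(double hi) { return (long) Math.floor(hi); }
}
```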
See HIVE-29424.
What changes were proposed in this pull request?
This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to cover corner cases. To get the ground truth, I executed queries with the predicates; see the resulting q.out file.
Why are the changes needed?
This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests were added.