HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST #6293
thomasrebele wants to merge 14 commits into apache:master
Conversation
zabetak left a comment:
Thanks for the PR @thomasrebele , the proposal is very promising.
One general question that came to mind while reviewing the PR is whether the CAST removal is relevant only for range predicates and histograms, or whether it can have a positive impact on other expressions as well. For example, is there any benefit in attempting to remove a CAST from the following expressions?
    IS NOT NULL(CAST($1):BIGINT)
    =(CAST($1):DOUBLE, 1)
    IN(CAST($1):TINYINT, 10, 20, 30)
    double min;
    double max;
    switch (type.toLowerCase()) {
This class is mostly using Calcite APIs, so since we have the SqlTypeName readily available, wouldn't it be better to use that instead?
In addition there is org.apache.calcite.sql.type.SqlTypeName#getLimit which might be relevant and could potentially replace this switch statement.
We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.
Ok to use the switch then but let's base it on SqlTypeName.
If it makes sense to handle FLOAT/DOUBLE in SqlTypeName#getLimit then it would be a good idea to log a CALCITE JIRA ticket.
I've refactored the switch and verified that the getLimit call yields the same min/max values.
I don't know whether there's a limit for FLOAT/DOUBLE, so I've created CALCITE-7419 for the discussion.
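For illustration, here is a plain-Java sketch of what such a per-type bounds switch looks like. The class and method names are hypothetical; the actual patch bases the switch on Calcite's SqlTypeName and derives the integer bounds from SqlTypeName#getLimit, which is not reproduced here.

```java
// Hypothetical sketch of per-type min/max bounds as doubles.
// In the actual patch, integer bounds come from SqlTypeName#getLimit;
// FLOAT/DOUBLE are hand-coded because getLimit throws for them
// (see CALCITE-7419).
public class TypeRangeSketch {
  public static double[] rangeOf(String typeName) {
    switch (typeName.toUpperCase()) {
      case "TINYINT":  return new double[] {Byte.MIN_VALUE, Byte.MAX_VALUE};
      case "SMALLINT": return new double[] {Short.MIN_VALUE, Short.MAX_VALUE};
      case "INTEGER":  return new double[] {Integer.MIN_VALUE, Integer.MAX_VALUE};
      case "BIGINT":   return new double[] {Long.MIN_VALUE, Long.MAX_VALUE};
      case "FLOAT":    return new double[] {-Float.MAX_VALUE, Float.MAX_VALUE};
      case "DOUBLE":   return new double[] {-Double.MAX_VALUE, Double.MAX_VALUE};
      default:
        throw new IllegalArgumentException("Unsupported type: " + typeName);
    }
  }
}
```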
thomasrebele left a comment:
Thank you for your review, @zabetak! Removing the cast from other expressions might be beneficial for the selectivity estimation. I would consider these improvements as out-of-scope for this PR, though.
About the first example, IS NOT NULL(CAST($1):BIGINT): CALCITE-5769 improved RexSimplify to remove the cast from such expressions. I assume that the filters arriving at FilterSelectivityEstimator should already have superfluous casts removed. Otherwise, the expression could be converted to a range predicate for the purpose of selectivity estimation. I would leave this idea for other tickets.
Force-pushed from f80c231 to 1e9fd2b.
The CI fails because of …
The removeCastIfPossible method was doing three things: 1) checking if a cast can be removed based on column stats, 2) removing the cast if possible, 3) adjusting the boundaries in case of DECIMAL casts. After the refactoring, the three actions are decoupled and each is performed individually. This leads to smaller, more self-contained methods that are easier to follow.
No need to invent new APIs when an equivalent exists and is used in other places in Hive/Calcite.
Hey @thomasrebele, I was going over the PR and did some refactoring to help me understand some parts of the code better and hopefully improve readability a bit. My refactoring work can be found in the https://github.com/zabetak/hive/tree/HIVE-29424-r1 branch. However, after replacing the …
    checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.DATE));
    checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.TIMESTAMP));
Why is checkTimeFieldOnIntraDayTimestamps not relevant here?
Indeed, they should be here, but not on testComputeRangePredicateSelectivityDateWithCast. Fixed.
zabetak left a comment:
Overall, I like the new changes. I just have a question about a potentially missing check for DECIMAL types and we are good to go.
    typeBoundaries =
        getRangeOfDecimalType(expr.getType(), rangeBoundaries.lowerBoundType(), rangeBoundaries.upperBoundType());
    rangeBoundaries = adjustRangeToDecimalType(rangeBoundaries, expr.getType(), typeBoundaries);
Aren't we missing a conditional here so that it runs only for DECIMAL type? Do we have adequate test coverage?
I investigated this, and it turns out that I missed some related cases: CAST(decimal_field TO integer_type). These cases also need some adjustment. I've implemented these test cases and the code necessary to deal with them. The two places where isRemovableCast is called are now quite similar.
zabetak left a comment:
Mainly questions plus small suggestions and a few nits.
    case FLOAT, DOUBLE, DECIMAL, TIMESTAMP, DATE:
      return true;
I don't understand why we don't need additional checks to remove a cast when the source data type is one of these. Could you please add a comment explaining why it is ok to return true and exit early in this case?
The cast from these types does not introduce modulo-like behavior. I'll add a comment.
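To make the concern concrete, here is a small stand-alone Java demo (not from the patch) of the modulo-like wraparound that narrowing integer casts exhibit, which is exactly what the extra checks guard against:

```java
// Demo (not from the patch): narrowing integer casts wrap modulo 2^n,
// so a cast to a narrower integer type can map out-of-range values back
// into range and distort a range predicate. Casts from FLOAT, DOUBLE,
// DECIMAL, TIMESTAMP, and DATE do not wrap this way.
public class NarrowingCastDemo {
  public static byte castToTinyInt(int v) {
    return (byte) v; // Java narrowing conversion keeps only the low 8 bits
  }
}
```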
    }

    // If the source type is completely within the target type, the cast is lossless
    Range<Float> targetRange = getRangeOfType(cast.getType(), BoundType.CLOSED, BoundType.CLOSED);
Is there anything preventing the cast type to be a STRING, CHAR, VARCHAR, or other unsupported types? We want to avoid hitting an IllegalStateException in this case.
This is caught by the switch statement just above. I'll refactor it to make it clearer.
    if (sourceRange.equals(targetRange.intersection(sourceRange))) {
      return true;
    }
The intersection method throws an IllegalArgumentException if the ranges are disjoint. Is it guaranteed that the ranges are always connected at this stage?
Indeed, I did not expect that the intersection would throw an exception. I'll replace it with the encloses method, which does not throw.
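To illustrate the difference with a minimal stand-alone analogue of Guava's Range (a hypothetical class, not the patch code): intersection is a partial operation that throws on disjoint inputs, while an encloses-style containment check is total and simply returns false.

```java
// Minimal closed-interval analogue of Guava's Range (illustration only).
public class Interval {
  final double lo, hi;
  public Interval(double lo, double hi) { this.lo = lo; this.hi = hi; }

  // Total: never throws, simply answers containment.
  public boolean encloses(Interval o) { return lo <= o.lo && o.hi <= hi; }

  // Partial: undefined (throws) when the intervals are disjoint,
  // mirroring Guava's Range#intersection.
  public Interval intersection(Interval o) {
    if (o.hi < lo || hi < o.lo) {
      throw new IllegalArgumentException("disjoint intervals");
    }
    return new Interval(Math.max(lo, o.lo), Math.min(hi, o.hi));
  }
}
```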
    return false;
    }

    // If the source type is completely within the target type, the cast is lossless
Do we need stats to determine if a cast is lossless? If not then we could possibly move this logic before checking if column stats are empty.
This check only applies when casting an integer type to another integer type, so we cannot move it before the check whether the column statistics are empty.
    case BIGINT, DATE, TIMESTAMP:
      return Range.closed(-9.223372E18f, 9.223372E18f);
    case DECIMAL:
      return getRangeOfDecimalType(type, lowerBound, upperBound);
For every other type we return a closed Range no matter the input bound arguments. Why can't we do the same for DECIMAL?
The range depends on the precision and scale of the decimal type. Additionally, values are rounded when CASTing them to decimal, see comment at FilterSelectivityEstimator#getRangeOfDecimalType.
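As a sketch of how the range depends on precision and scale (plain Java with a hypothetical helper; the real logic lives in FilterSelectivityEstimator#getRangeOfDecimalType): a DECIMAL(p, s) can hold values up to (10^p - 1) / 10^s.

```java
import java.math.BigDecimal;

// Sketch (hypothetical helper): the largest value representable by
// DECIMAL(precision, scale) is (10^precision - 1) scaled down by 10^scale,
// e.g. DECIMAL(2, 1) covers [-9.9, 9.9]. The patch additionally accounts
// for rounding at the boundaries when values are cast to DECIMAL.
public class DecimalRangeSketch {
  public static BigDecimal maxOf(int precision, int scale) {
    return BigDecimal.TEN.pow(precision)
        .subtract(BigDecimal.ONE)
        .movePointLeft(scale);
  }
}
```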
    RexNode cast = cast("f_numeric", TINYINT);
    // check rounding of positive numbers
    checkBetweenSelectivity(3, universe, total, cast, 0, 10);
    checkBetweenSelectivity(3, universe, total, cast, 0, 10.9f);
This does not seem like a valid BETWEEN expression. I have the impression that we can't compare a float with an int type directly: some kind of type conversion/alignment should be performed, and since we are casting one side to TINYINT, the other side (the literal) should be cast as well, so we can never end up with 10.9f.
I think we should drop these invalid expressions from the tests, since we don't want unrelated failures in the future. It should also be checked whether the code can be simplified once we know that these expressions cannot appear.
Hive executes such queries without problems (e.g., here), so I would just leave those statements as they are.
    checkBetweenSelectivity(3, universe, total, cast, 0, 10.9f);
    checkBetweenSelectivity(4, universe, total, cast, 0, 11);
    checkBetweenSelectivity(4, universe, total, cast, 10, 20);
    checkBetweenSelectivity(1, universe, total, cast, 10.9999f, 20);

    // check rounding of negative numbers
    checkBetweenSelectivity(4, universe, total, cast, -20, -10);
    checkBetweenSelectivity(1, universe, total, cast, -20, -10.9f);
    checkBetweenSelectivity(1, universe, total, cast, -20, -11);
    checkBetweenSelectivity(3, universe, total, cast, -10, 0);
    checkBetweenSelectivity(3, universe, total, cast, -10.9999f, 0);

    @Test
    public void testBetweenWithCastToTinyIntCheckRounding() {
If the comment about the validity of some tests holds, then this test potentially becomes irrelevant or could be merged with testBetweenWithCastToTinyInt.
I specifically designed this test case to cover all the boundary cases, so I would keep them. See also the discussion.
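The boundary cases in these tests follow one idea that can be sketched in plain Java (hypothetical helpers, not the patch code): when the cast expression only takes integer values, a fractional lower bound can be rounded up and a fractional upper bound rounded down without changing the predicate's result.

```java
// Sketch (hypothetical helpers): tighten fractional bounds for an
// integer-valued expression. E.g. expr BETWEEN 0 AND 10.9 selects the
// same rows as expr BETWEEN 0 AND 10, and expr BETWEEN 10.9999 AND 20
// the same rows as expr BETWEEN 11 AND 20.
public class BoundAdjustSketch {
  public static long tightenLower(double lo) { return (long) Math.ceil(lo); }
  public static long tightenUpper(double hi) { return (long) Math.floor(hi); }
}
```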
See HIVE-29424.
What changes were proposed in this pull request?
This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to cover corner cases. To get the ground truth, I executed queries with the predicates; see the resulting q.out file.
Why are the changes needed?
This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests were added.