Skip to content

HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST#6293

Open
thomasrebele wants to merge 14 commits intoapache:masterfrom
thomasrebele:tr/HIVE-29424
Open

HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST#6293
thomasrebele wants to merge 14 commits intoapache:masterfrom
thomasrebele:tr/HIVE-29424

Conversation

@thomasrebele
Copy link
Contributor

See HIVE-29424.

What changes were proposed in this pull request?

This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to some cover corner cases. To get the ground truth, I executed queries with the predicates, see the resulting q.out file.

Why are the changes needed?

This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests were added.

@thomasrebele thomasrebele marked this pull request as ready for review February 4, 2026 08:53
Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @thomasrebele , the proposal is very promising.

One general question that came to mind while I was reviewing the PR is if the CAST removal is relevant only for range predicates and histograms or if it can have a positive impact on other expressions. For example, is there any benefit in attempting to remove a CAST from the following expressions:

  • IS NOT NULL(CAST($1):BIGINT)
  • =(CAST($1):DOUBLE, 1)
  • IN(CAST($1):TINYINT, 10, 20, 30)


double min;
double max;
switch (type.toLowerCase()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is mostly using Calcite APIs so since we have the SqlTypeName readily available wouldn't be better to use that instead?

In addition there is org.apache.calcite.sql.type.SqlTypeName#getLimit which might be relevant and could potentially replace this switch statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok to use the switch then but let's base it on SqlTypeName.

If it makes sense to handle FLOAT/DOUBLE in SqlTypeName#getLimit then it would be a good idea to log a CALCITE JIRA ticket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the switch and verified that the result of the getLimit call results in the same min/max values.

I don't know whether there's a limit for FLOAT/DOUBLE, so I've created CALCITE-7419 for the discussion.

Copy link
Contributor Author

@thomasrebele thomasrebele left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review, @zabetak! Removing the cast from other expressions might be beneficial for the selectivity estimation. I would consider these improvements as out-of-scope for this PR, though.

About the first example, IS NOT NULL(CAST($1):BIGINT), CALCITE-5769 improved RexSimplify to remove the cast from the expression. I assume that the filters that arrive at FilterSelectivityEstimator should remove the cast, if it is superfluous. Otherwise, it could converted to a range predicate for the purpose of selectivity estimation. I would leave this idea for other tickets.


double min;
double max;
switch (type.toLowerCase()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.

@thomasrebele
Copy link
Contributor Author

The CI fails because of No thread with name metastore_task_thread_test_impl_3 found. in TestMetastoreLeaseLeader.testHouseKeepingThreads. I do not think that the failure is related to this PR.

The removeCastIfPossible was doing three things:
1) Checking if a cast can be removed based using column stats
2) Removing the cast if possible
3) Adjusting the boundaries in case of DECIMAL casts

After the refactoring the three actions are decoupled and each is performed individually. This leads to smaller and more self-contained methods that are easier to follow.
No need to invent new APIs when equivalent exists and used in other places in Hive/Calcite.
@zabetak
Copy link
Member

zabetak commented Feb 23, 2026

Hey @thomasrebele , I was going over the PR and did some refactoring to help me understand better some parts of the code and hopefully and improve a bit readability. My refactoring work can be found in the https://github.com/zabetak/hive/tree/HIVE-29424-r1 branch.

However, after replacing the FloatInterval with Guava's Range API in commit ef8dc6c some tests in TestFilterSelectivityEstimator started failing cause it appears that some ranges are invalid. Specifically, the adjustTypeBoundaries creates a strange/invalid range (i.e., (100.049995..99.94999]) when rangeBoundaries is (100.0..Infinity] and type is DECIMAL(3, 1); it is strange to have a range/interval with a lower bound (100.049995) greater than the upper bound (99.94999) so wanted to check with you if that behavior is expected/intentional.

Comment on lines 730 to 731
checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.DATE));
checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.TIMESTAMP));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why checkTimeFieldOnIntraDayTimestamps are not relevant here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, they should be here, but not on testComputeRangePredicateSelectivityDateWithCast. Fixed.

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I like the new changes. I just have a question about a potentially missing check for DECIMAL types and we are good to go.

Comment on lines 453 to 455
typeBoundaries =
getRangeOfDecimalType(expr.getType(), rangeBoundaries.lowerBoundType(), rangeBoundaries.upperBoundType());
rangeBoundaries = adjustRangeToDecimalType(rangeBoundaries, expr.getType(), typeBoundaries);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we missing a conditional here so that it runs only for DECIMAL type? Do we have adequate test coverage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated this, and it turns out that I missed some related cases: CAST(decimal_field TO integer_type). These cases also need some adjustment. I've implemented these test cases and the code necessary to deal with them. The two places where isRemovableCast is called are now quite similar.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Mar 2, 2026

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly questions plus small suggestions & few nits.

Comment on lines +228 to +229
case FLOAT, DOUBLE, DECIMAL, TIMESTAMP, DATE:
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we don't need additional checks to remove a cast when the source data type is one of these. Could you please add a comment explaining why it is ok to return true and exit early in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cast from these types does not introduce modulo-like behavior. I'll add a comment.

}

// If the source type is completely within the target type, the cast is lossless
Range<Float> targetRange = getRangeOfType(cast.getType(), BoundType.CLOSED, BoundType.CLOSED);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything preventing the cast type to be a STRING, CHAR, VARCHAR, or other unsupported types? We want to avoid hitting an IllegalStateException in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is caught be the switch statement just above. I'll refactor it to make it clearer.

Comment on lines +238 to +240
if (sourceRange.equals(targetRange.intersection(sourceRange))) {
return true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intersection method throws an IllegalArgumentException if the ranges are disjointed. Is it guaranteed that the ranges are always connected at this stage?

Copy link
Contributor Author

@thomasrebele thomasrebele Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I did not expect that the intersection will throw an exception. I'll replace it with the encloses method which does not throw.

return false;
}

// If the source type is completely within the target type, the cast is lossless
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need stats to determine if a cast is lossless? If not then we could possibly move this logic before checking if column stats are empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check only applies when casting an integer type to another integer type, so we cannot move it before the check whether the column statistics are empty.

case BIGINT, DATE, TIMESTAMP:
return Range.closed(-9.223372E18f, 9.223372E18f);
case DECIMAL:
return getRangeOfDecimalType(type, lowerBound, upperBound);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For every other type we return a closed Range no matter the input bound arguments. Why we can't do the same for DECIMAL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The range depends on the precision and scale of the decimal type. Additionally, values are rounded when CASTing them to decimal, see comment at FilterSelectivityEstimator#getRangeOfDecimalType.

RexNode cast = cast("f_numeric", TINYINT);
// check rounding of positive numbers
checkBetweenSelectivity(3, universe, total, cast, 0, 10);
checkBetweenSelectivity(3, universe, total, cast, 0, 10.9f);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem like a valid BETWEEN expression. I have the impression that we can't compare a float with an int type directly. Some kind of type conversion/alignment should be performed and since we are casting one side to TINYINT then the other side (literal) should be casted as well so we can never end-up with 10.9f.

I think we should drop this kind of invalid expressions from the tests since we don't want to unrelated failures in the future. To be checked if the code can be simplified as well after knowing that these expressions cannot appear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hive executes such queries without problems (e.g., here), so I would just leave those statements as they are.

checkBetweenSelectivity(3, universe, total, cast, 0, 10.9f);
checkBetweenSelectivity(4, universe, total, cast, 0, 11);
checkBetweenSelectivity(4, universe, total, cast, 10, 20);
checkBetweenSelectivity(1, universe, total, cast, 10.9999f, 20);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the discussion.


// check rounding of negative numbers
checkBetweenSelectivity(4, universe, total, cast, -20, -10);
checkBetweenSelectivity(1, universe, total, cast, -20, -10.9f);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the discussion.

checkBetweenSelectivity(1, universe, total, cast, -20, -10.9f);
checkBetweenSelectivity(1, universe, total, cast, -20, -11);
checkBetweenSelectivity(3, universe, total, cast, -10, 0);
checkBetweenSelectivity(3, universe, total, cast, -10.9999f, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the discussion.


@Test
public void testBetweenWithCastDecimal2s1() {
public void testBetweenWithCastToTinyIntCheckRounding() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the comment about the validity of some tests holds then potentially this test becomes irrelevant/mergeable with testBetweenWithCastToTinyInt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I specifically designed this test case to cover all the boundary cases, so I would keep them. See also the discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants