feat: Implement basic version of RLIKE #734

Merged: 16 commits, Aug 2, 2024

@andygrove (Member) commented Jul 28, 2024

Which issue does this PR close?

N/A

Rationale for this change

RLIKE is pretty common in ETL jobs.

What changes are included in this PR?

This PR implements the RLIKE expression, but falls back to Spark for all regexp patterns that are not currently guaranteed to be compatible (which is all of them for now) unless a new config (spark.comet.regexp.allowIncompatible) is enabled.
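
The fallback rule described above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: `RegExpGate` and `canRunNatively` are hypothetical names, and `isSupportedPattern` simply mirrors the placeholder that currently returns `false` for every pattern.

```scala
object RegExpGate {
  // Mirrors the placeholder in this PR: no pattern is treated as
  // known-compatible with Spark yet.
  def isSupportedPattern(pattern: String): Boolean = false

  // Hypothetical helper: returns true when RLIKE may be evaluated natively,
  // and false when the query should fall back to Spark.
  // `allowIncompatible` stands in for the value of
  // spark.comet.regexp.allowIncompatible.
  def canRunNatively(pattern: String, allowIncompatible: Boolean): Boolean =
    isSupportedPattern(pattern) || allowIncompatible
}
```

With the placeholder in place, every pattern falls back unless the flag is set, for example via `spark.conf.set("spark.comet.regexp.allowIncompatible", "true")` on an existing session.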

How are these changes tested?

New tests

Comment on lines +25 to +30
def isSupportedPattern(pattern: String): Boolean = {
  // this is a placeholder for implementing logic to determine if the pattern
  // is known to be compatible with Spark, so that we can enable regexp automatically
  // for common cases and fallback to Spark for more complex cases
  false
}
@andygrove (Member, Author)

I plan on implementing this in a future PR

@andygrove marked this pull request as ready for review July 28, 2024 15:40
@kazuyukitanimura (Contributor)

Last time, we had a discussion about dictionaries in #469 (comment).
Was it addressed?

@andygrove (Member, Author)

> Last time, we had a discussion around dictionary #469 (comment) Was it addressed?

Yes, the test now covers the dictionary case (as far as I can tell). The expression is never invoked with a dictionary, though, so it seems to be converted before reaching the expression.

Comment on lines 625 to 626
val data = Seq("James Smith", "Michael Rose", "Rames Rose", "Rames rose") ++
  gen.generateStrings(100, "rames Rose", 12)
@kazuyukitanimura (Contributor), Jul 31, 2024

I think we need enough repeated strings to produce a dictionary. Right now, there is only a low chance of that because of the randomness...

@andygrove (Member, Author)

@kazuyukitanimura I updated the test and have now added dictionary support
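
For illustration, one way to get many repeated strings deterministically is to repeat a small set of distinct values instead of relying on random generation. This is only a sketch and not the updated test from this PR: `DictionaryFriendlyData` and `repeated` are hypothetical names, and whether a writer actually chooses dictionary encoding still depends on its settings.

```scala
object DictionaryFriendlyData {
  // A handful of distinct values, taken from the test data quoted above.
  val distinctValues: Seq[String] =
    Seq("James Smith", "Michael Rose", "Rames Rose", "Rames rose")

  // Hypothetical helper: repeat the whole list `copies` times so the column
  // has low cardinality and many duplicates.
  def repeated(values: Seq[String], copies: Int): Seq[String] =
    Seq.fill(copies)(values).flatten

  // 1000 rows drawn from only 4 distinct strings.
  val data: Seq[String] = repeated(distinctValues, 250)
}
```

Low cardinality relative to the row count is what makes dictionary encoding likely, which is the concern raised in the review comment.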

@andygrove (Member, Author)

microbenchmark results:

OpenJDK 64-Bit Server VM 11.0.24+8-post-Ubuntu-1ubuntu322.04 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
rlike                                               102            120          22          2.0         501.4       1.0X
rlike: Comet (Scan)                                 102            108          11          2.0         499.4       1.0X
rlike: Comet (Scan, Exec)                            67             72           5          3.1         326.1       1.5X

@kazuyukitanimura (Contributor) left a comment

LGTM

@andygrove merged commit e33d560 into apache:main Aug 2, 2024
74 checks passed
@andygrove deleted the rlike3 branch August 2, 2024 12:39
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024