-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Num Shuffle Files : Gets the count of shuffle files pulled into memory for a filter condition #76
Num Shuffle Files : Gets the count of shuffle files pulled into memory for a filter condition #76
Conversation
def getShuffleFileMetadata(path: String, condition: String) | ||
= { | ||
val (deltaLog, unresolvedColumns, targetOnlyPredicates, minMaxOnlyExpressions, equalOnlyExpressions,otherExpressions, removedPredicates) = getResolvedExpressions(path, condition) | ||
deltaLog.withNewTransaction { deltaTxn => | ||
(deltaTxn.filterFiles(targetOnlyPredicates), | ||
deltaTxn.filterFiles(minMaxOnlyExpressions), | ||
deltaTxn.filterFiles(equalOnlyExpressions), | ||
deltaTxn.filterFiles(otherExpressions), | ||
deltaTxn.filterFiles(removedPredicates), | ||
deltaLog.snapshot.filesWithStatsForScan(Nil), | ||
unresolvedColumns) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not add any test cases for this method as the file metadata received is directly from Delta Table and I am not making any mutations to it. Additionally, please review if this method would be useful. I can remove it as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any use case in mind but I think it may be of help for someone investigating the behavior of a query. My only advice here would be to explicitly add the return type in the method signature so it can serve as reference for someone that potentially would like to use this method.
README.md
Outdated
|
||
then the output might look like | ||
```scala | ||
(18, 100, 300, 600, 800, 800, List()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this output be a map that indicates what each value represents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed the output to a MAP and updated the method comments. Please check if it looks good, then I will update README.md @MrPowers
LGTM. The list with multiple outputs might be hard to parse, but the functionality looks cool. We should definitely write a blog post on this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, I just added a couple of comments.
def getShuffleFileMetadata(path: String, condition: String) | ||
= { | ||
val (deltaLog, unresolvedColumns, targetOnlyPredicates, minMaxOnlyExpressions, equalOnlyExpressions,otherExpressions, removedPredicates) = getResolvedExpressions(path, condition) | ||
deltaLog.withNewTransaction { deltaTxn => | ||
(deltaTxn.filterFiles(targetOnlyPredicates), | ||
deltaTxn.filterFiles(minMaxOnlyExpressions), | ||
deltaTxn.filterFiles(equalOnlyExpressions), | ||
deltaTxn.filterFiles(otherExpressions), | ||
deltaTxn.filterFiles(removedPredicates), | ||
deltaLog.snapshot.filesWithStatsForScan(Nil), | ||
unresolvedColumns) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any use case in mind but I think it may be of help for someone investigating the behavior of a query. My only advice here would be to explicitly add the return type in the method signature so it can serve as reference for someone that potentially would like to use this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
This PR has two methods:
getNumShuffleFiles
andgetShuffleFileMetadata
First, get the count of shuffle files pulled into memory when a particular condition is provided in merge for a Delta Table.
The second one gets all the file metadata for the parquet files.