Skip to content

Commit

Permalink
num shuffle files : updated documentation with min-max
Browse files Browse the repository at this point in the history
  • Loading branch information
joydeepbroy-zeotap committed Sep 12, 2023
1 parent 73cd514 commit c084aa4
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -441,7 +441,8 @@ DeltaHelpers.deltaNumRecordDistribution(path)
## Number of Shuffle Files in Merge & Other Filter Conditions

The function `getNumShuffleFiles` gets the number of shuffle files (think of part files in parquet) that will be pulled into memory for a given filter condition. This is particularly useful to estimate memory requirements in a Delta Merge operation where the number of shuffle files can be a bottleneck.
To better tune your jobs, you can use this function to get the number of shuffle files for different kinds of filter condition and then perform operations like merge, zorder, compaction etc. to see if you reach the desired no. of shuffle files. This method does not query data and is completely based on the Delta Log.
To better tune your jobs, you can use this function to get the number of shuffle files for different kinds of filter condition and then perform operations like merge, zorder, compaction etc. to see if you reach the desired no. of shuffle files.


For example, if the condition is "country = 'GBR' and age >= 30 and age <= 40 and firstname like '%Jo%' " and country is the partition column,
```scala
Expand Down Expand Up @@ -471,6 +472,8 @@ Map(
"UNRESOLVED_COLUMNS =>" -> List())
```

Another important use case this method can help with is to see the min-max range overlap. Adding a min max on a high cardinality column like id say `id >= 900 and id <= 5000` can actually help in reducing the no. of shuffle files delta lake pulls into memory. However, such a operation is not always guaranteed to work and the effect can be viewed when you run this method.

This function works only on the Delta Log and does not scan any data in the Delta Table.

## Change Data Feed Helpers
Expand Down

0 comments on commit c084aa4

Please sign in to comment.