Incorrect batching implementation #21
Comments
@kpto thanks. Non-deterministic partitioning does not mean that the output would be different, though, does it? As long as we iterate through the entire dataset, and manage to reduce the memory footprint through batching, this does not seem like a KO criterion for the implementation. I am fairly sure that the current implementation improves the memory footprint (in some way), because without batching I am not able to run it on my machine due to memory overflow. Always open for better solutions, of course.
@kpto as a side note, please make sure to label your issues and add them to the project board with the right annotations.
I think one of the important reasons for batching is to control memory usage. To achieve this we should have precise control over the size of a batch, but how the partition ID is determined is not documented in detail; it appears to be meant for debugging/monitoring, so I don't think it is a suitable parameter to rely on for the purpose of memory control. The script may work on your machine, but you don't know whether it works on others. Also, I think we can simply iterate the whole dataset batch by batch; a prior batch assignment does not seem necessary to me.
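For illustration, iterating batch by batch could look something like the sketch below. This is an assumption-laden sketch, not the project's code: `evidence_df`, `process`, and the batch size are placeholder names and values.

```python
# A sketch of batch-by-batch iteration; `evidence_df`, `process`,
# and the batch size are placeholders, not the project's actual code.
from itertools import islice

def iter_batches(df, batch_size=10_000):
    """Yield lists of at most `batch_size` Rows from `df`.

    toLocalIterator() streams rows to the driver one Spark
    partition at a time, so driver memory stays bounded by one
    partition plus one batch, independent of the dataset size.
    """
    rows = df.toLocalIterator()
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch

# Usage (hypothetical):
# for batch in iter_batches(evidence_df):
#     process(batch)
```

With this shape, the batch size is an explicit parameter, and the result no longer depends on how Spark happened to partition the data.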
The partition ID column persisted on the evidence dataframe for later use seems to consume over 8 GB of memory.
True, that seems like a lot |
The current implementation divides the dataset into partitions by assigning to each row a partition ID computed with the PySpark function `spark_partition_id`, then querying each partition. I think there are a few problems:

1. `spark_partition_id` is not deterministic. From the information gathered, it is mostly used for debugging, and it changes with the execution environment, which makes a script run non-reproducible.
2. The batching could instead be done during iteration, reading a fixed number of rows each round.
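For reference, the pattern described above might look roughly like the following (a sketch under stated assumptions; `evidence_df` and the input path are illustrative, not the repository's actual code):

```python
# A sketch of the pattern described above, not the repository's
# actual code; `evidence_df` and the input path are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.getOrCreate()
evidence_df = spark.read.parquet("evidence/")  # hypothetical input

# Tag each row with the ID of the Spark partition it happens to land
# in; this assignment depends on the execution environment.
tagged = evidence_df.withColumn("pid", spark_partition_id())

# Treat each partition as a "batch" and query them one by one.
# Batch sizes are whatever the partitioner produced, so memory use
# per batch is neither controlled nor reproducible.
partition_ids = [r["pid"] for r in tagged.select("pid").distinct().collect()]
for pid in partition_ids:
    batch = tagged.filter(tagged["pid"] == pid).collect()
    # process(batch)  # hypothetical per-batch work
```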