
Incorrect batching implementation #21

Open
kpto opened this issue Oct 30, 2024 · 5 comments
Labels: bug (Something isn't working) · enhancement (New feature or request)

Comments

@kpto
Collaborator

kpto commented Oct 30, 2024

The current implementation divides the dataset into partitions by assigning each row a partition ID computed with the PySpark function spark_partition_id, and then querying each partition (sketched after the list below). I think there are a few problems:

  1. According to the PySpark docs, spark_partition_id is not deterministic. From the information I gathered, it is mostly meant for debugging, and it changes with the execution environment, which makes a script run non-reproducible.
  2. Partition IDs are written as a new column, which I suspect causes PySpark to create a clone of the dataset in memory.
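
For illustration, a minimal sketch of that pattern (the input path and the `process` helper are placeholders, not the actual code in this repository):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, spark_partition_id

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("evidence/")  # placeholder input path

# Tag every row with the ID of the partition it currently lives in ...
df = df.withColumn("partition_id", spark_partition_id())

# ... then process one "batch" at a time by filtering on that column.
for pid in range(df.rdd.getNumPartitions()):
    batch = df.filter(col("partition_id") == pid)
    process(batch)  # placeholder for the per-batch work
```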

Batching could instead be done during iteration: on each round, a fixed number of rows is read (see the sketch below).
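
A minimal sketch of that alternative, assuming the work happens on the driver: `DataFrame.toLocalIterator()` streams rows back roughly one partition at a time, so only one batch needs to be held in memory at once (the `process` helper and input path are again placeholders):

```python
from itertools import islice

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("evidence/")  # placeholder input path

BATCH_SIZE = 10_000  # illustrative; choose to fit the available memory

rows = df.toLocalIterator()  # streams rows to the driver lazily
while True:
    batch = list(islice(rows, BATCH_SIZE))  # read up to BATCH_SIZE rows
    if not batch:
        break
    process(batch)  # placeholder for the per-batch work
```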

@slobentanzer
Collaborator

@kpto thanks. Non-deterministic partitioning does not mean that the output would be different, though, does it? As long as we iterate through the entire dataset, and manage to reduce memory footprint through batching, this does not seem like a KO criterion for implementation.

I am fairly sure that the current implementation reduces the memory footprint (in some way), because without batching I am not able to run it on my machine due to memory overflow. Always open to better solutions, of course.

@slobentanzer
Collaborator

@kpto as a side note, please make sure to label your issues and add them to the project board with the right annotations.

@kpto kpto added this to OTAR3088 Oct 30, 2024
@kpto kpto moved this to Todo in OTAR3088 Oct 30, 2024
@kpto kpto added the bug and enhancement labels Oct 30, 2024
@kpto
Collaborator Author

kpto commented Oct 30, 2024

> @kpto thanks. Non-deterministic partitioning does not mean that the output would be different, though, does it? As long as we iterate through the entire dataset, and manage to reduce memory footprint through batching, this does not seem like a KO criterion for implementation.

I think one of the important reasons for batching is to control memory usage. To achieve that, we need precise control over the size of a batch, but how the partition ID is determined is not documented in detail, since it seems to be intended for debugging/monitoring. So I don't think it is a suitable parameter to rely on for memory control. The script may work on your machine, but you don't know whether it works on others. The sketch below illustrates this.
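
As a quick illustration (the input path is a placeholder), grouping by spark_partition_id shows that per-partition row counts vary with how the data was read and shuffled, so partition size is not a precise batch-size knob:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("evidence/")  # placeholder input path

# Count rows per physical partition; the counts are typically uneven and
# depend on the input layout, not on any user-chosen batch size.
df.groupBy(spark_partition_id().alias("partition_id")).count().show()
```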

Also, I think we can simply iterate the whole dataset batch by batch; assigning rows to batches up front does not seem necessary to me.

@kpto kpto added this to the Version v0.4.0 milestone Nov 5, 2024
@kpto
Collaborator Author

kpto commented Nov 5, 2024

The partition ID column persisted on the evidence dataframe for later use seems to consume over 8 GB of memory.

@slobentanzer
Collaborator

True, that seems like a lot
