
Incorrect batching implementation #21

Open
kpto opened this issue Oct 30, 2024 · 5 comments
Labels: bug (Something isn't working) · enhancement (New feature or request)

Comments

@kpto
Collaborator

kpto commented Oct 30, 2024

The current implementation divides the dataset into partitions by assigning each row a partition ID computed with the PySpark function spark_partition_id, and then querying each partition (sketched after the list below). I think there are a few problems:

  1. According to the PySpark docs, spark_partition_id is not deterministic. From the information I gathered, it is mostly meant for debugging, and it changes with the execution environment, which makes a script run non-reproducible.
  2. Partition IDs are written as a new column, which I suspect causes PySpark to create a clone of the dataset in memory.
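
For illustration, a minimal sketch of that pattern (the input path and the `process` helper are placeholders, not the actual code in this repository):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, spark_partition_id

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("evidence/")  # placeholder input path

# Tag every row with the ID of the partition it currently lives in ...
df = df.withColumn("partition_id", spark_partition_id())

# ... then process one "batch" at a time by filtering on that column.
for pid in range(df.rdd.getNumPartitions()):
    batch = df.filter(col("partition_id") == pid)
    process(batch)  # placeholder for the per-batch work
```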

Batching could instead be done during iteration: on each round, a fixed number of rows is read (see the sketch below).
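
A minimal sketch of that alternative, assuming the work happens on the driver: `DataFrame.toLocalIterator()` streams rows back roughly one partition at a time, so only one batch needs to be held in memory at once (the `process` helper and input path are again placeholders):

```python
from itertools import islice

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("evidence/")  # placeholder input path

BATCH_SIZE = 10_000  # illustrative; choose to fit the available memory

rows = df.toLocalIterator()  # streams rows to the driver lazily
while True:
    batch = list(islice(rows, BATCH_SIZE))  # read up to BATCH_SIZE rows
    if not batch:
        break
    process(batch)  # placeholder for the per-batch work
```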

@slobentanzer
Collaborator

@kpto thanks. Non-deterministic partitioning does not mean that the output would be different, though, does it? As long as we iterate through the entire dataset, and manage to reduce memory footprint through batching, this does not seem like a KO criterion for implementation.

I am fairly sure that the current implementation reduces the memory footprint (in some way), because without batching I am not able to run it on my machine due to memory overflow. Always open to better solutions, of course.

@slobentanzer
Collaborator

@kpto as a side note, please make sure to label your issues and add them to the project board with the right annotations.

@kpto kpto added this to OTAR3088 Oct 30, 2024
@kpto kpto moved this to Todo in OTAR3088 Oct 30, 2024
@kpto kpto added the bug and enhancement labels Oct 30, 2024
@kpto
Collaborator Author

kpto commented Oct 30, 2024

> @kpto thanks. Non-deterministic partitioning does not mean that the output would be different, though, does it? As long as we iterate through the entire dataset, and manage to reduce memory footprint through batching, this does not seem like a KO criterion for implementation.

I think one of the important reasons for batching is to control memory usage. To achieve that, we need precise control over the size of a batch, but how the partition ID is determined is not documented in detail, since it seems to be intended for debugging/monitoring. So I don't think it is a suitable parameter to rely on for memory control. The script may work on your machine, but you don't know whether it works on others. The sketch below illustrates this.
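
As a quick illustration (the input path is a placeholder), grouping by spark_partition_id shows that per-partition row counts vary with how the data was read and shuffled, so partition size is not a precise batch-size knob:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("evidence/")  # placeholder input path

# Count rows per physical partition; the counts are typically uneven and
# depend on the input layout, not on any user-chosen batch size.
df.groupBy(spark_partition_id().alias("partition_id")).count().show()
```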

Also, I think we can simply iterate the whole dataset batch by batch; assigning rows to batches up front does not seem necessary to me.

@kpto kpto added this to the Version v0.4.0 milestone Nov 5, 2024
@kpto
Collaborator Author

kpto commented Nov 5, 2024

The partition ID column persisted on the evidence dataframe for later use seems to consume over 8 GB of memory.

@slobentanzer
Collaborator

True, that seems like a lot
