Dataframe with 64000 rows is being processed twice #265

reinzler · 2024-04-04T11:40:44Z

General

Operating system: Windows Server 2022
Python version: 3.10
Pandas version: 2.2.1
Pandarallel version: 1.6.6

Bug description

When using pandarallel's `parallel_apply` to apply a function to a large dataframe (64000 rows), the dataframe is processed twice instead of once. This results in redundant computation and potentially incorrect output.

Minimal but working code sample to ease bug fix for `pandarallel` team

import os
import pandas as pd
from pandarallel import pandarallel
from spatialmath import SE3
from Master10 import Master10DH
from tqdm import tqdm
from datetime import datetime
import pytz
from colorama import Fore
import gc


def ik(row):
    robot = Master10DH()
    Tep = (SE3([row['x'], row['y'], row['z']]) * SE3.RPY([row['r'], row['p'], row['w']], unit="deg"))
    sol1 = robot.ikine_NR(Tep)
    return (sol1.success, sol1.reason)


if __name__ == '__main__':
    folder_path = "30_deg_10cm_dataframe"
    files = [file for file in os.listdir(folder_path) if file.endswith(".csv")]
    files = sorted(files, key=lambda x: int(x[x.index('_') + 1:x.index('.')]))
    output_folder = "Newton_Raphson_processed_files"
    os.makedirs(output_folder, exist_ok=True)
    moscow_timezone = pytz.timezone('Europe/Moscow')
    pandarallel.initialize(progress_bar=True, nb_workers=8)
    for file in tqdm(files):
        current_time = datetime.now(moscow_timezone)
        print(f"{Fore.YELLOW}Processing file: {file}, {current_time.strftime('%H:%M:%S')}{Fore.RESET}")
        output_file = os.path.join(output_folder, f"{file.split('.')[0]}_NewtonRaphson.csv")
        if os.path.exists(output_file):
            print(f"File {output_file} already exists. Skipping...")
            continue

        df = pd.read_csv(os.path.join(folder_path, file))
        try:
            df[['solution', 'reason']] = df.parallel_apply(ik, axis=1)
        except ValueError:
            df['solution'] = df.parallel_apply(ik, axis=1)
        df.to_csv(output_file, index=False)
        print(f"{Fore.GREEN}File {output_file} saved{Fore.RESET}")
        gc.collect()

shermansiu · 2024-04-12T16:47:45Z

The example doesn't work if you don't have access to the Newton Raphson CSVs.

reinzler · 2024-04-13T08:49:54Z

> The example doesn't work if you don't have access to the Newton Raphson CSVs.

data_300.csv

shermansiu · 2024-04-27T12:14:54Z

I reduced your code sample to the following:

import pandas as pd
from pandarallel import pandarallel
from spatialmath import SE3
# from Master10 import Master10DH


def ik(row):
    # robot = Master10DH()
    Tep = (SE3([row['x'], row['y'], row['z']]) * SE3.RPY([row['r'], row['p'], row['w']], unit="deg"))
    return True, "Reason"
    # sol1 = robot.ikine_NR(Tep)
    # return (sol1.success, sol1.reason)

pandarallel.initialize(progress_bar=True, nb_workers=8)

df = pd.read_csv("data_300.csv")
try:
    df[['solution', 'reason']] = df.parallel_apply(ik, axis=1)
except ValueError:
    print("ValueError")
    df['solution'] = df.parallel_apply(ik, axis=1)

Master10 seems to be a custom package for operating a robot, so I've commented that code out.

Upon inspection, the code still runs twice even if we DON'T use pandarallel: the two-column assignment raises a `ValueError`, so the `except` branch calls apply a second time.
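A minimal sketch of the double pass, using plain `DataFrame.apply` so pandarallel is out of the picture. The counter `calls`, the stub `ik_stub`, and the four-row frame are made up for illustration; the stub only stands in for the real solver. Because the function returns a tuple, apply yields a single Series of tuples, which pandas refuses to spread across two columns, and the fallback then repeats the whole pass:

```python
import pandas as pd

calls = {"n": 0}  # hypothetical call counter, not part of the original report


def ik_stub(row):
    """Stand-in for the real solver: counts invocations and returns a tuple."""
    calls["n"] += 1
    return True, "Reason"


df = pd.DataFrame({"x": [0.1, 0.2, 0.3, 0.4], "y": [1.0, 2.0, 3.0, 4.0]})

first_pass_calls = None  # stays None if the first assignment succeeds
try:
    # apply returns one Series of tuples, not two columns
    df[["solution", "reason"]] = df.apply(ik_stub, axis=1)
except ValueError:
    first_pass_calls = calls["n"]
    # the fallback runs the function over every row a second time
    df["solution"] = df.apply(ik_stub, axis=1)

print("calls after first pass:", first_pass_calls)
print("calls in total:", calls["n"])
```

With the fallback in place, the total call count comes out at twice the count after the first pass, which matches the "processed twice" symptom.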

It seems like you can fix your code by replacing `df[['solution', 'reason']] = df.parallel_apply(ik, axis=1)` with `df[['solution', 'reason']] = df.parallel_apply(ik, axis=1).tolist()`.
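A small sketch of that `.tolist()` suggestion, again with plain `DataFrame.apply`. The counter `calls`, the stub `ik_stub`, and the toy frame are illustrative assumptions; the same assignment shape should carry over to `parallel_apply`, though that is not verified here. Turning the Series of tuples into a list of tuples gives pandas a two-dimensional value it can split across both columns, so no fallback is needed:

```python
import pandas as pd

calls = {"n": 0}  # hypothetical call counter, for illustration only


def ik_stub(row):
    """Stand-in for the real solver: counts invocations and returns a tuple."""
    calls["n"] += 1
    return True, "Reason"


df = pd.DataFrame({"x": [0.1, 0.2, 0.3, 0.4], "y": [1.0, 2.0, 3.0, 4.0]})

# A list of 2-tuples is treated as rows x 2 columns, matching the two keys
df[["solution", "reason"]] = df.apply(ik_stub, axis=1).tolist()

print(df)
print("calls in total:", calls["n"])
```

Here the function runs once per row and both columns are filled in a single pass.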

This is not a problem with pandarallel. Unless there is other information that suggests otherwise, this issue can be closed.
