Feature Request: Add Resume Functionality #275

Open
bruce2233 opened this issue Aug 29, 2024 · 0 comments
Here's a template for a GitHub issue requesting the addition of a resume functionality in pandarallel:


Feature Request: Add Resume Functionality

Feature Missing:
I would like to request resume functionality in pandarallel. Currently, if a pandarallel job is interrupted or fails, there is no built-in way to continue from where it left off; the whole computation has to be run again. For long-running computations, a resume feature would let users skip rows that have already been processed.

Use Case:
When processing large datasets, interruptions can occur due to various reasons, such as system crashes or timeout errors. If users could resume from the last completed state, it would save time and resources, making pandarallel even more efficient and user-friendly.

Proposed Solution:

  • Implement a mechanism to periodically save the state of the computation (e.g., the progress and results of processed rows).
  • Provide an optional argument in pandarallel functions that would allow users to specify whether they want to use the resume functionality.
  • Introduce functions to load the saved state and continue processing from the last checkpoint.
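The three points above can be sketched roughly as follows. This is only an illustration of the checkpoint-and-resume idea using the standard library, not an existing or proposed pandarallel API: the names `run_with_resume`, `load_checkpoint`, `save_checkpoint`, the `chunk_size` parameter, and the JSON checkpoint format are all assumptions made for the example. The inner loop is sequential and stands in for the place where a `parallel_apply` call over one chunk would go.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the dict of finished results, or {} if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {}


def save_checkpoint(path, done):
    """Write to a temp file, then rename, so a crash cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(done, fh)
    os.replace(tmp, path)


def run_with_resume(rows, func, path, chunk_size=1000):
    """rows: list of (row_id, value) pairs; func: the per-row computation."""
    done = load_checkpoint(path)
    # JSON object keys are strings, so row ids are compared as strings.
    todo = [(rid, val) for rid, val in rows if str(rid) not in done]
    for start in range(0, len(todo), chunk_size):
        chunk = todo[start:start + chunk_size]
        # Hypothetical hook: with pandarallel this would be one
        # parallel_apply call over the rows of the chunk.
        for rid, val in chunk:
            done[str(rid)] = func(val)
        save_checkpoint(path, done)  # one checkpoint per finished chunk
    return done
```

With this shape, an interruption loses at most one chunk of work, and calling `run_with_resume` again with the same checkpoint path only processes the rows that are not yet recorded. A smaller `chunk_size` means less lost work but more frequent writes.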

Example:
An example implementation could look like this:

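import os
import pandas as pd
from pandarallel import pandarallel

# Initialize pandarallel
pandarallel.initialize()
CHECKPOINT = "progress.csv"  # hypothetical checkpoint file

def process_function(row):
    return row["value"] * 2  # placeholder for the real per-row computation

def load_progress():
    # Previously saved results, or an empty frame on the first run
    if os.path.exists(CHECKPOINT):
        return pd.read_csv(CHECKPOINT)
    return pd.DataFrame(columns=["id", "result"])

def save_progress(results):
    # Append new results; write the header only when creating the file
    results.to_csv(CHECKPOINT, mode="a", header=not os.path.exists(CHECKPOINT), index=False)

# Function to process data (with state saving)
def process_data_with_resume(df):
    completed = load_progress()
    remaining = df[~df["id"].isin(completed["id"])]
    if not remaining.empty:
        new = remaining[["id"]].copy()  # keep ids so later runs can skip them
        new["result"] = remaining.parallel_apply(process_function, axis=1)
        save_progress(new)  # saved only once the whole batch finishes
    return load_progress()

# A later run skips ids already saved in the checkpoint
df = pd.DataFrame({"id": range(1000), "value": range(1000)})
results = process_data_with_resume(df)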

With this capability, an interrupted long-running pandarallel job could continue from its last checkpoint instead of starting over, which would make the library more robust for large workloads.

