OSError: [Errno 28] No space left on device #277

Open

DTDwind opened this issue Oct 23, 2024 · 6 comments


DTDwind commented Oct 23, 2024

General

  • Operating System: Docker(python:3.12-slim)
  • Python version: 3.12.5
  • Pandas version: 2.2.2
  • Pandarallel version: 1.6.5

Acknowledgement

  • My issue is NOT present when using pandas alone (without pandarallel)

Bug description

Observed behavior

When I execute the program, I get "OSError: [Errno 28] No space left on device".

This is my code. I referred to #127 and added MEMORY_FS_ROOT and JOBLIB_TEMP_FOLDER, but it doesn't help:

import pandas as pd
from pandarallel import pandarallel
import os

os.environ['MEMORY_FS_ROOT'] = "/app/tmp"
os.environ['JOBLIB_TEMP_FOLDER'] = '/app/tmp'

data = {'url': ['https://example.com/1', 'https://example.com/2'],
        'label': [0, 1]}
table = pd.DataFrame(data)

pandarallel.initialize(progress_bar=False, use_memory_fs=False)

table['count.'] = table['url'].parallel_apply(lambda x: x.count('.'))  # parallel_apply instead of plain apply
table

df -h inside my Docker container:

Filesystem      Size  Used Avail Use% Mounted on
overlay         1.8T  260G  1.5T  15% /
tmpfs            64M     0   64M   0% /dev
shm              64M   64M     0 100% /dev/shm
/dev/nvme1n1    1.8T  260G  1.5T  15% /app
tmpfs            63G     0   63G   0% /proc/asound
tmpfs            63G     0   63G   0% /proc/acpi
tmpfs            63G     0   63G   0% /proc/scsi
tmpfs            63G     0   63G   0% /sys/firmware
tmpfs            63G     0   63G   0% /sys/devices/virtual/powercap

I also tried os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'.

Can anyone help me?

@highvight

Did you try setting the MEMORY_FS_ROOT env variable before importing pandarallel?

You can check the current location with pandarallel.core.MEMORY_FS_ROOT


DTDwind commented Oct 24, 2024

Hi @highvight,

I tried setting the MEMORY_FS_ROOT env variable before importing pandarallel and checked the current location. The following is my code:

import pandas as pd
import os

os.environ['MEMORY_FS_ROOT'] = "/app/tmp"
os.environ['JOBLIB_TEMP_FOLDER'] = '/app/tmp'

import pandarallel
print(pandarallel.core.MEMORY_FS_ROOT) # /app/tmp
pandarallel.pandarallel.initialize(progress_bar=False)  # I have also tried (progress_bar=False, use_memory_fs=False)

data = {'url': ['https://example.com/1', 'https://example.com/2'], 'label': [0, 1]}
table = pd.DataFrame(data)
table['count.'] = table['url'].parallel_apply(lambda x: x.count('.'))

MEMORY_FS_ROOT is /app/tmp.

ls -al for /app/tmp:
drwxrwxrwx 2 root root 4096 Oct 23 17:32 tmp

ls -al for the contents of /app/tmp:
-rw------- 1 root root 637 Oct 24 09:42 pandarallel_input_g3g0gh6k.pickle
-rw------- 1 root root 637 Oct 24 09:42 pandarallel_input_k5bgg22r.pickle

I am still getting "No space left on device", and I don't know why.

Error message: (screenshot of the OSError: [Errno 28] No space left on device traceback)
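
A quick way to see which mount is actually filling up while the job runs is to compare free space on the candidate directories (an illustrative sketch; the paths are taken from the df output above):

import shutil

# Compare free space on the default memory FS and the configured override.
for path in ("/dev/shm", "/app/tmp"):
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 2**20:.0f} MiB free of {usage.total / 2**20:.0f} MiB")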


DTDwind commented Oct 24, 2024

I just confirmed that temporarily clearing /dev/shm lets small amounts of data get through the program, so it seems my modification has no effect?

I tried modifying core.py directly, but it still doesn't work:

core.py
# MEMORY_FS_ROOT = os.environ.get("MEMORY_FS_ROOT", "/dev/shm")
MEMORY_FS_ROOT = "/app/tmp"

@usama3162

@DTDwind Try setting use_memory_fs=False.

Note: MEMORY_FS_ROOT only applies when use_memory_fs is set to True.
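
In other words (a minimal sketch of the two modes; the env var path is just an example):

import os
os.environ["MEMORY_FS_ROOT"] = "/app/tmp"  # only honored in memory FS mode
from pandarallel import pandarallel

# Memory FS mode: worker data is pickled to files under MEMORY_FS_ROOT (default /dev/shm).
pandarallel.initialize(progress_bar=False, use_memory_fs=True)

# Pipe mode: data goes through standard multiprocessing transfer, no tmpfs space needed.
pandarallel.initialize(progress_bar=False, use_memory_fs=False)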


chris-aeviator commented Nov 9, 2024

I'm running into the same issue, and changing the settings or os.environ does not fix it. I'm also inside Docker; my dataset is ~3 GB and my RAM is 512 GB.

Looking at possible solutions:

  • Docker gives /dev/shm a laughably small 64 MB by default (df -h /dev/shm), but it can be raised with --shm-size=1gb when starting the container
  • looking at DTDwind's reimplementation, maybe SpooledTemporaryFile or a memory FS could help ease the pain (see the sketch after the quoted docs below)?
class tempfile.SpooledTemporaryFile(max_size=0, mode='w+b', buffering=-1, encoding=None, newline=None, suffix=None, prefix=None, dir=None, *, errors=None)

This class operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().
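
A minimal sketch of that spooling behavior (the 64 MiB threshold and the directory are placeholders, not values pandarallel actually uses):

import tempfile

# Data stays in RAM until it grows past max_size, then spills to disk under dir.
with tempfile.SpooledTemporaryFile(max_size=64 * 1024 * 1024, dir="/app/tmp") as f:
    f.write(b"some pickled payload")
    f.seek(0)
    assert f.read() == b"some pickled payload"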

Update:

I can confirm that running pandarallel outside of Docker on the same machine does not error, though it consumes huge amounts of RAM. With 36 workers (auto-selected; I also want maximum speed) and a 3 GB dataset, RAM consumption rises to more than 260 GB, which is over twice what it would take even if every worker held a full copy of the dataset (36 × 3 GB = 108 GB).
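
If RAM is the bottleneck, capping the worker count is one mitigation (a sketch; nb_workers is pandarallel's initializer parameter, and 8 is an arbitrary value):

from pandarallel import pandarallel

# Fewer workers means fewer concurrent copies of the data in memory.
pandarallel.initialize(nb_workers=8, use_memory_fs=False, progress_bar=False)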


DTDwind commented Nov 26, 2024

I switched to parallel_pandas, and it works well in the Docker environment. Here is a simple example:

pip install --upgrade parallel-pandas

import pandas as pd
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=16, split_factor=4, disable_pr_bar=True)

data = {'url': ['https://example.com/1', 'https://example.com/2'],
        'label': [0, 1]}
table = pd.DataFrame(data)
table['count.'] = table['url'].p_apply(lambda x: x.count('.'))
