BUG: df.to_csv() fails to a not-yet-created file when the path is fsspec-based #55828

krehm · 2023-11-04T17:00:34Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
out_path_spec = 'file:///workspace/dlio/dlio_benchmark/data/default/train/0/img_00_of_64.csv'
df = pd.DataFrame(range(5))
df.to_csv(out_path_spec)

Issue Description

I happen to be using a Python framework that provides multiple fsspec backends, including file://. Looking at the pandas code, however, I suspect the failure can also happen with backends other than file://. I am using pandas 2.0.3, but the code in main is the same.

I want to create a not-yet-existing file named /workspace/dlio/dlio_benchmark/data/default/train/0/img_00_of_64.csv. The directory which will contain the file already exists:

ls -l /workspace/dlio/dlio_benchmark/data/default/train/0/

total 0

Here is a sample backtrace, where 'out_path_spec' contains the CSV path mentioned above:

df.to_csv(out_path_spec, compression=compression)

File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 3772, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "/usr/local/lib/python3.8/dist-packages/pandas/io/formats/format.py", line 1188, in to_csv
csv_formatter.save()
File "/usr/local/lib/python3.8/dist-packages/pandas/io/formats/csvs.py", line 242, in save
with get_handle(
File "/usr/local/lib/python3.8/dist-packages/pandas/io/common.py", line 719, in get_handle
ioargs = _get_filepath_or_buffer(
File "/usr/local/lib/python3.8/dist-packages/pandas/io/common.py", line 371, in _get_filepath_or_buffer
with urlopen(req_info) as req:
File "/usr/local/lib/python3.8/dist-packages/pandas/io/common.py", line 271, in urlopen
return urllib.request.urlopen(*args, **kwargs)
File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.8/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/usr/lib/python3.8/urllib/request.py", line 1489, in file_open
return self.open_local_file(req)
File "/usr/lib/python3.8/urllib/request.py", line 1528, in open_local_file
raise URLError(exp)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/workspace/dlio/dlio_benchmark/data/default/train/0/img_00_of_64.csv'>

The problem occurs in io/common.py, in the function _get_filepath_or_buffer(). The code calls is_url() on the path. Since 'file' is a valid scheme in _VALID_URLS, is_url() returns True. That causes the code to call urlopen() on the path, which fails because the file doesn't exist yet: I am in the process of trying to create it.
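The scheme check described above can be reproduced with the standard library alone. This is a minimal sketch, assuming pandas builds _VALID_URLS from the scheme lists in urllib.parse; VALID_URL_SCHEMES and looks_like_url are hypothetical stand-ins for illustration, not pandas functions.

```python
from urllib.parse import urlparse, uses_netloc, uses_params, uses_relative

# Assumption: this mirrors how pandas is believed to build _VALID_URLS.
VALID_URL_SCHEMES = set(uses_relative + uses_netloc + uses_params)
VALID_URL_SCHEMES.discard("")


def looks_like_url(path) -> bool:
    """Hypothetical stand-in for is_url(): True if the scheme is a known URL scheme."""
    if not isinstance(path, str):
        return False
    return urlparse(path).scheme in VALID_URL_SCHEMES


local_path = "/workspace/dlio/dlio_benchmark/data/default/train/0/img_00_of_64.csv"
out_path_spec = "file://" + local_path

# The file:// form is classified as a URL; the bare path is not.
file_spec_is_url = looks_like_url(out_path_spec)
bare_path_is_url = looks_like_url(local_path)
print(file_spec_is_url, bare_path_is_url)
```

The check looks only at the scheme, so it cannot tell a read from a write. Any path whose scheme is in the set is treated as something to download, whether or not the target exists.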

This logic seems incorrect to me. When df.to_csv() is called, it is likely that the fsspec path doesn't exist yet, so trying to open the URL seems guaranteed to fail.
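The failing step can be shown in isolation, without pandas. The sketch below calls urllib.request.urlopen() on a file:// URL whose directory exists but whose file does not, which is the situation at the bottom of the traceback. The helper name try_urlopen and the file name not_yet_created.csv are hypothetical, chosen only for this illustration.

```python
import tempfile
import urllib.error
import urllib.request
from pathlib import Path


def try_urlopen(url: str) -> str:
    """Return 'opened' if the URL can be read, else a short error description."""
    try:
        with urllib.request.urlopen(url):
            return "opened"
    except urllib.error.URLError as exc:
        return f"URLError: {exc.reason}"


with tempfile.TemporaryDirectory() as tmp:
    # The directory exists, but the target file has not been created yet.
    target = Path(tmp) / "not_yet_created.csv"
    missing_result = try_urlopen(target.as_uri())

    # Once the file exists, the same URL opens without error.
    target.write_text("0\n")
    existing_result = try_urlopen(target.as_uri())

print(missing_result)
print(existing_result)
```

urlopen() is a read operation, so it can only succeed for a file that already exists, which is the opposite of the usual case when writing a new CSV.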

Expected Behavior

The df.to_csv() call should succeed, and the file (or the corresponding object in whatever fsspec backend is used) should be created.

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-425.3.1.el8.x86_64
Version : #1 SMP Wed Nov 9 20:13:27 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 2.0.3
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 45.2.0
pip : 23.3.1
Cython : None
pytest : 7.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.10.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None


krehm · 2023-11-04T19:12:21Z

As an experiment, I removed 'file' from the set of schemes in _VALID_URLS and the csv file was correctly created. So it seems to me that any URL for a scheme in _VALID_URLS is guaranteed to fail for df.to_csv() if the destination object does not already exist.
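Until the scheme check is changed, one possible workaround for the file:// case is to convert the URL to a plain local path before handing it to df.to_csv(). This is a sketch that was not proposed in the thread; file_url_to_local_path is a hypothetical helper, and it does nothing for non-file backends.

```python
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_url_to_local_path(url: str) -> str:
    """Hypothetical helper: turn a file:// URL into a local filesystem path."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"not a file:// URL: {url!r}")
    # url2pathname decodes percent-escapes and applies platform path rules.
    return url2pathname(parsed.path)


out_path_spec = "file:///workspace/dlio/dlio_benchmark/data/default/train/0/img_00_of_64.csv"
local_path = file_url_to_local_path(out_path_spec)
print(local_path)
```

The resulting string has no scheme, so passing it as df.to_csv(local_path) should bypass the URL branch and write the file directly.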

Flytre · 2023-11-07T20:25:56Z

take

krehm added the Bug and Needs Triage labels Nov 4, 2023

rhshadrach added the IO Network label Nov 5, 2023

github-actions bot assigned Flytre Nov 7, 2023

Flytre mentioned this issue Dec 3, 2023

BUG: df.to_csv() fails to a not-yet-created file when the path is fsspec-based (#55828) #56309

Closed

lithomas1 removed the Needs Triage label Dec 4, 2023
