BUG: df.to_csv() fails to a not-yet-created file when the path is fsspec-based (#55828) #56309

Flytre · 2023-12-03T21:59:39Z

When specifying local to_csv file paths with the file scheme, Pandas will now create the file instead of raising an exception

[✅] closes BUG: df.to_csv() fails to a not-yet-created file when the path is fsspec-based #55828
[✅] Tests added and passed if fixing a bug or adding a new feature
[✅] All code checks passed.
[✅] Added type annotations to new arguments/methods/functions.
[✅] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

mroeschke · 2023-12-04T18:54:24Z

pandas/io/common.py

+            file_path = urllib.request.url2pathname(parsed_url.path)
+            file_path = os.path.normpath(file_path)
+            return IOArgs(
+                filepath_or_buffer=open(file_path, "rb"),


Shouldn't this still respect mode?
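
A minimal sketch of what respecting the mode could look like, assuming a local file:// URL and using only the standard library. The helper name open_local_file_url is hypothetical and not part of pandas; it reuses the url2pathname and normpath steps from the diff above but passes the caller's mode through instead of hard-coding "rb".

```python
import os
import urllib.parse
import urllib.request


def open_local_file_url(url: str, mode: str = "r"):
    """Open a file:// URL as a local file, honoring the caller's mode.

    Hypothetical helper for illustration only; not part of pandas.
    """
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:// URL, got scheme {parsed.scheme!r}")
    # Same conversion as in the diff: URL path -> OS path, then normalize.
    file_path = os.path.normpath(urllib.request.url2pathname(parsed.path))
    # With a write mode such as "w", open() creates the file if it is missing;
    # a hard-coded "rb" would raise FileNotFoundError instead.
    return open(file_path, mode)
```

With a write mode the target file is created on demand, which is the behavior the PR description asks for; with "rb" the same helper still covers the read path.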

mroeschke

Could you also include unit tests (one for reading a file and one for writing a file)?

twoertwein · 2023-12-05T01:07:56Z

pandas/io/common.py

@@ -382,6 +382,19 @@ def _get_filepath_or_buffer(
        # urlopen function defined elsewhere in this module
        import urllib.request

+        # Fix for GH #55828
+        parsed_url = parse_url(filepath_or_buffer)


I believe is_url should not be true for fsspec URLs, so excluding them there might be a much nicer way of fixing this issue (I think @krehm was also hinting at that in the issue). I'm not familiar with the urllib regex; we might need to exclude more fsspec URLs from it.

krehm · 2023-12-05

@twoertwein It did seem to me that any fsspec URL that passes is_url is guaranteed to fail in this case, which looked like a logic flaw to me. But I'm not familiar with the urllib code either, so I was hesitant to propose a particular solution.

twoertwein · 2023-12-05

The overlap between urllib and fsspec is at the moment:

    >>> set(uses_relative + uses_netloc + uses_params).intersection(fsspec.available_protocols())
    {'sftp', 'ftp', 'file', 'git', 'http', 'https'}

We could have something like this:

    _VALID_URLS = set(uses_relative + uses_netloc + uses_params).difference(
        fsspec.available_protocols()
    )
    _VALID_URLS.update(("http", "https"))
    _VALID_URLS.discard("")

Technically this is a behavior change for sftp, git, ... (might be okay, probably not used frequently?). fsspec should have had available_protocols since early 2022 (fsspec/filesystem_spec#913); we might need to double-check whether we need to bump the minimum version of fsspec. This might make the regex in is_fsspec_url obsolete.
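
A runnable sketch of that idea, under stated assumptions: fsspec is not imported, and FSSPEC_PROTOCOLS is a stand-in that lists only the overlapping schemes quoted in this comment, not the full output of fsspec.available_protocols(). The names build_valid_urls and the two-argument is_url are illustrative, not pandas' actual functions.

```python
from urllib.parse import urlparse, uses_netloc, uses_params, uses_relative

# Stand-in for fsspec.available_protocols() (assumption): only the overlap
# reported in the discussion, so the sketch runs without fsspec installed.
FSSPEC_PROTOCOLS = {"sftp", "ftp", "file", "git", "http", "https"}


def build_valid_urls(fsspec_protocols):
    """Return urllib-known schemes minus fsspec ones, keeping http(s)."""
    valid = set(uses_relative + uses_netloc + uses_params).difference(fsspec_protocols)
    valid.update(("http", "https"))  # http(s) should still be treated as plain URLs
    valid.discard("")  # the empty scheme means "no scheme at all"
    return valid


def is_url(url, valid_urls):
    """True if url is a string whose scheme is in valid_urls."""
    if not isinstance(url, str):
        return False
    return urlparse(url).scheme in valid_urls
```

Under these assumptions, a file:// or sftp:// path no longer counts as a URL and would fall through to whatever fsspec handling comes next, while http:// and https:// are unaffected.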

@mroeschke

Flytre · 2023-12-08

@twoertwein I didn't want to touch is_url, as it's used in a few other places and I wasn't sure whether changing it would break anything. Is it okay to do so?

These are the places where is_url is used without a corresponding is_fsspec_url call:

    pandas.io.html._LxmlFrameParser._build_doc
    pandas.io.html._read
    pandas.io.formats.html.HTMLFormatter._write_cell

twoertwein · 2023-12-08

Thank you for checking that!

Do you think it is possible to replace the first condition below with the second?

    if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):

    if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer) and not is_fsspec_url(filepath_or_buffer):
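
To see what the narrower condition would change, here is a self-contained sketch. The is_url and is_fsspec_url functions below are simplified stand-ins written for this example (assumptions, not the pandas implementations), and takes_urlopen_branch is a hypothetical name for the proposed condition.

```python
import re
from urllib.parse import urlparse, uses_netloc, uses_params, uses_relative

# Stand-in scheme set (assumption): every scheme urllib knows, minus "".
_VALID_URLS = set(uses_relative + uses_netloc + uses_params)
_VALID_URLS.discard("")

# Stand-in pattern (assumption): anything shaped like "scheme://".
_SCHEME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+\-.]*://")


def is_url(url):
    """Simplified stand-in: the scheme is one urllib knows about."""
    return isinstance(url, str) and urlparse(url).scheme in _VALID_URLS


def is_fsspec_url(url):
    """Simplified stand-in: any scheme:// string except http(s)."""
    return (
        isinstance(url, str)
        and bool(_SCHEME_PATTERN.match(url))
        and not url.startswith(("http://", "https://"))
    )


def takes_urlopen_branch(filepath_or_buffer):
    """The proposed condition: a URL that is not also an fsspec URL."""
    return (
        isinstance(filepath_or_buffer, str)
        and is_url(filepath_or_buffer)
        and not is_fsspec_url(filepath_or_buffer)
    )
```

With these stand-ins, only http:// and https:// paths still reach the urlopen branch; file://, ftp://, and sftp:// paths skip it, which is the same behavior change for less common schemes that was flagged earlier in the thread.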

github-actions · 2024-01-08T00:06:12Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2024-01-31T18:59:05Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Flytre and others added 5 commits December 3, 2023 16:50

All five commits share one message:

Fixed GH #55828: When specifying local to_csv file paths with the file scheme, Pandas will now create the file instead of raising an exception

5c3449a
543594c
4da1380
329c17d
1d9c577

mroeschke reviewed Dec 4, 2023

mroeschke requested changes Dec 4, 2023

mroeschke added the "IO Data" label (IO issues that don't fit into a more specific label) Dec 4, 2023

mroeschke requested a review from twoertwein December 4, 2023 18:55

twoertwein requested changes Dec 5, 2023

github-actions bot added the Stale label Jan 8, 2024

mroeschke closed this Jan 31, 2024

Flytre commented Dec 3, 2023

mroeschke Dec 4, 2023

mroeschke left a comment

twoertwein Dec 5, 2023

krehm Dec 5, 2023

twoertwein Dec 5, 2023

Flytre Dec 8, 2023

twoertwein Dec 8, 2023

github-actions bot commented Jan 8, 2024

mroeschke commented Jan 31, 2024
