Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: df.to_csv() fails to a not-yet-created file when the path is fsspec-based (#55828) #56309

Closed
wants to merge 5 commits into from

Conversation

Flytre
Copy link
Contributor

@Flytre Flytre commented Dec 3, 2023

When specifying local to_csv file paths with the file scheme, Pandas will now create the file instead of raising an exception

Flytre and others added 5 commits December 3, 2023 16:50
When specifying local to_csv file paths with the file scheme, Pandas will now create the file instead of raising an exception
When specifying local to_csv file paths with the file scheme, Pandas will now create the file instead of raising an exception
When specifying local to_csv file paths with the file scheme, Pandas will now create the file instead of raising an exception
When specifying local to_csv file paths with the file scheme, Pandas will now create the file instead of raising an exception
When specifying local to_csv file paths with the file scheme, Pandas will now create the file instead of raising an exception
file_path = urllib.request.url2pathname(parsed_url.path)
file_path = os.path.normpath(file_path)
return IOArgs(
filepath_or_buffer=open(file_path, "rb"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this still respect mode?

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also include a unit test (for a file reading and file writing)

@mroeschke mroeschke added the IO Data IO issues that don't fit into a more specific label label Dec 4, 2023
@mroeschke mroeschke requested a review from twoertwein December 4, 2023 18:55
@@ -382,6 +382,19 @@ def _get_filepath_or_buffer(
# urlopen function defined elsewhere in this module
import urllib.request

# Fix for GH #55828
parsed_url = parse_url(filepath_or_buffer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe is_url should not be true for fsspec urls. So that might be a much nicer way of fixing this issue (I think @krehm was also hinting at that in the issue) - I'm not familiar with the urllib regex, we might need to exclude more fsspec URLs from it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twoertwein it did seem to me that any fsspec url in is_url is guaranteed to fail in this case, which seemed like a logic flaw to me. But I'm not familiar with the urllib code either, so was hesitant to specify a particular solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overlap between urllib/fsspec is at the moment:

>>> set(uses_relative + uses_netloc + uses_params).intersection(fsspec.available_protocols())
{'sftp', 'ftp', 'file', 'git', 'http', 'https'}

Could have something like this:

_VALID_URLS = set(uses_relative + uses_netloc + uses_params).difference( fsspec.available_protocols())
_VALID_URLS.update(("http", "https"))
_VALID_URLS.discard("")

Technically this is a behavior change for sftp, git, ... (might be okay, probably not used frequently?). fsspec should have available_protocols since early 2022 fsspec/filesystem_spec#913 might need to double check whether we need to bump the minimum version of fsspec. This might make the regex in is_fsspec_url obsolete.

@mroeschke

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twoertwein I didn't want to touch is_url as its used in a few other places and I wasn't sure if it would break anything. Is it okay to do so?

These are the places is_url is used without a corresponding is_fsspec_url call:

pandas.io.html._LxmlFrameParser._build_doc
pandas.io.html._read
pandas.io.formats.html.HTMLFormatter._write_cell

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for checking that!

Do you think it is possible to replace if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer): with if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer) and not is_fsspec_url(filepath_or_buffer):?

Copy link
Contributor

github-actions bot commented Jan 8, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 8, 2024
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: df.to_csv() fails to a not-yet-created file when the path is fsspec-based
4 participants