Copying files between filesystems via GenericFileSystem broken for filesystems requiring init parameters #1167
Comments
I am sorry that the parameters you were using disappeared. The functionality can still be achieved, though, with the following intended workflow:
# create the instances as you did before
fsspec.generic._generic_fs[protocol1] = fsspec.filesystem(...)
fsspec.generic._generic_fs[protocol2] = fsspec.filesystem(...)
fs = fsspec.filesystem("generic", default_method="generic")
fs.cp(...)
The "generic" method means look up instances in the in-module dict rather than call the typical fsspec machinery for creating a new instance. Indeed, this is very experimental, and your feedback is important to get the process smooth. Perhaps an official "set_generic_instance" function would be helpful. Also, the "generic" method has no fallback if the instance isn't in the dict. The other default_methods of interest are:
|
Honestly, I was not aware of anyone using genericFS yet, and I'm glad you are! Do you have an interesting use case you can share? |
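A minimal sketch of the lookup behaviour described in the comment above, written as a stand-alone toy model rather than fsspec's actual code. The names `FakeFS`, `instances` and `resolve` are hypothetical; `instances` plays the role that `fsspec.generic._generic_fs` plays in the thread. It is meant to show the two properties mentioned: instances are found by protocol key, and there is no fallback when a key is missing.

```python
# Toy model only: not fsspec's implementation.
class FakeFS:
    """Hypothetical stand-in for a configured filesystem instance."""

    def __init__(self, **storage_options):
        self.storage_options = storage_options


# Plays the role of the in-module dict (fsspec.generic._generic_fs in the thread).
instances = {}


def resolve(url):
    """Return the pre-registered instance for the protocol of `url`."""
    protocol = url.split("://", 1)[0]
    # Plain dict lookup: an unregistered protocol raises KeyError (no fallback).
    return instances[protocol]


# Register instances up front, including their init parameters.
instances["ssh"] = FakeFS(host="example.invalid", username="user")
instances["file"] = FakeFS()

source_fs = resolve("ssh:///remote/data.bin")
target_fs = resolve("file:///tmp/data.bin")

try:
    resolve("gs://bucket/key")
    missing_raises = False
except KeyError:
    missing_raises = True
```

Because the instance is created by the caller, any init parameters it needs (such as a host) are already baked in by the time the lookup happens.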
Thank you for the suggestion! As the dict key would only be the protocol used, that would mean you would not be able to copy files between two (for example) SSH filesystems, correct? Or (as is one of our use cases) copy from N>1 remote machines over SSH to the local filesystem (unless you alter the dict before every call, but that would really be a work-around). Our current use case is this: We currently use the GenericFileSystem exclusively to facilitate the downloading of specific batches of files (multiple async function calls, called in bulk using The 'source' ( I hope that helps, and of course thank you for all the work put into fsspec! |
I had not anticipated that this could be used for different instances of the same type of filesystem! I suppose you could still make it work by adding instances to the _generic_fs dict as I mention, but give the two FSs different protocols, like sftp1://, sftp2://. This dict is not used by the general fsspec lookup mechanism, so you wouldn't be breaking anything. I am surprised that you want to use this for the async functionality, though, since the SFTP implementation is not itself async, so I don't think you'll see any benefit. Are you using https://github.com/fsspec/sshfs , which is based on asyncssh rather than paramiko? |
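The workaround suggested above could look roughly like the following sketch. It relies only on what the comment states, namely that the dict is keyed by protocol and that the keys can be made up. The `registry` dict, the `Host` class, `plan_copy` and the host names are all hypothetical and stand in for `_generic_fs` and real SFTP instances.

```python
# Toy model only: `registry` stands in for fsspec.generic._generic_fs.
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical stand-in for one configured filesystem instance."""

    name: str


registry = {
    # Two instances of the same filesystem type under made-up protocol keys.
    "sftp1": Host("machine-a.invalid"),
    "sftp2": Host("machine-b.invalid"),
    "file": Host("localhost"),
}


def plan_copy(source_url, target_url):
    """Return (source instance, target instance) chosen by protocol key."""
    src = registry[source_url.split("://", 1)[0]]
    dst = registry[target_url.split("://", 1)[0]]
    return src, dst


# Copy from N > 1 remote machines to the local filesystem without
# rewriting the dict between calls.
plans = [
    plan_copy("sftp1:///data/a.bin", "file:///tmp/a.bin"),
    plan_copy("sftp2:///data/b.bin", "file:///tmp/b.bin"),
]
```

The cost of this approach is that the URLs passed around must carry the made-up protocol rather than the real one.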
i am using
Not certain if there is a better way, but i found this to work:
import fsspec
import fsspec.generic
fsspec.url_to_fs("sftp://host:12345")
fsspec.generic.rsync(
    "sftp:///path/to/fake_sftp/",
    "gs://le_bucket/fake_sftp/",
    inst_kwargs={"default_method": "current"},
)
note: this method does not work if i use
so, that's weird
generic
generic, with
fsspec.generic._generic_fs["sftp"] = fsspec.filesystem("sftp", host="host", port=12345, username="meh")
fsspec.generic._generic_fs["gcs"] = fsspec.generic._generic_fs["gs"] = fsspec.filesystem("gs")
generic_fs = fsspec.filesystem("generic", default_method="generic")
generic_fs.rsync("sftp:///path/to/fake_sftp", "gs://le_bucket/ake_sftp")
might get more cumbersome for other filesystems. |
ah, just saw #1398. slightly different issues, but very similar. |
Using "current" or setting keys in |
The largest problem i'm facing now is the |
Two issues i think as i dive in to this
1. isdir issue
ah, perhaps we are missing something like
2. copy fails: file not found
for the second problem, i think this might be a problem in
also, why isn't
probably best if i move my stuff to a new/different ticket as the original issue of being able to pass init parameters in is "working"... |
Please link the sshfs issue or PR if there is one, and I agree we can close this given the discussion in #1578 |
@martindurant , here is the SSHFS fix for And here is a proposed fix for issue 2 above. |
In our project, we use the GenericFileSystem's _cp_file function to (asynchronously) copy files from a remote SSH filesystem to our local filesystem (as suggested in this comment).

Before the release of 2023.1.0, this function accepted the parameters fs and fs2, which could be used to pass in existing filesystem instances between which the copy operation should take place. Since 2023.1.0 however, these parameters have been removed (silently and without reference in the changelog, though perhaps the 'experimental' note in the docs means changes won't get a notice?).

With the removal of these parameters, this function now only uses filesystem instances as resolved by the _resolve_fs function based on the url parameter. But, since the resolve function returns the result of a call to registry.filesystem() with only the inferred protocol as parameter (and no **storage_options), this fundamentally breaks for filesystems that require some parameter for their __init__ function, like the SFTPFileSystem (the default implementation for the SSH protocol, and which requires a host parameter).

This issue can easily be reproduced by calling the _cp_file function on a GenericFileSystem where one of the url parameters starts with "ssh://" as protocol, or any other protocol whose FileSystem implementation requires additional parameters.

Of course, if there is another workflow to get the _cp_file function working for pre-defined filesystems that is the 'intended' way, I would very much like to hear it and this issue should be closed.
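
The failure mode described above can be illustrated with a stand-alone sketch. It does not use fsspec; `NoArgs`, `NeedsHost`, `classes` and `resolve_by_protocol` are hypothetical names that mimic a resolver which instantiates a filesystem class from the inferred protocol alone, with no storage options passed through.

```python
# Toy model only: mimics protocol-only resolution, not fsspec's code.
class NoArgs:
    """Stand-in for a filesystem that needs no init parameters."""

    def __init__(self):
        pass


class NeedsHost:
    """Stand-in for a filesystem (such as SFTP) that requires a host."""

    def __init__(self, host):
        self.host = host


classes = {"file": NoArgs, "ssh": NeedsHost}


def resolve_by_protocol(url):
    """Instantiate the class for the URL's protocol with no extra arguments."""
    protocol = url.split("://", 1)[0]
    return classes[protocol]()  # no **storage_options are passed through


local_fs = resolve_by_protocol("file:///tmp/out.bin")  # works

try:
    resolve_by_protocol("ssh:///remote/in.bin")
    error = None
except TypeError as exc:
    # The required `host` argument was never supplied.
    error = exc
```

In this model the only remedies are to pass the options through to the constructor or to let the caller supply a ready-made instance, which is what the removed fs and fs2 parameters allowed.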